* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle [not found] ` <20190509061237.GA15229@lst.de> @ 2019-05-09 6:48 ` Kai-Heng Feng 2019-05-09 6:48 ` Kai-Heng Feng ` (2 more replies) 0 siblings, 3 replies; 60+ messages in thread From: Kai-Heng Feng @ 2019-05-09 6:48 UTC (permalink / raw) To: Christoph Hellwig, rafael.j.wysocki Cc: Mario.Limonciello, Keith Busch, Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, Linux PM, LKML Cc Rafael and linux-pm at 14:12, Christoph Hellwig <hch@lst.de> wrote: > On Wed, May 08, 2019 at 08:28:30PM +0000, Mario.Limonciello@dell.com wrote: >> You might think this would be adding runtime_suspend/runtime_resume >> callbacks, but those also get called actually at runtime which is not >> the goal here. At runtime, these types of disks should rely on APST which >> should calculate the appropriate latencies around the different power >> states. >> >> This code path is only applicable in the suspend to idle state, which >> /does/ >> call suspend/resume functions associated with dev_pm_ops. There isn't >> a dedicated function in there for use only in suspend to idle, which is >> why pm_suspend_via_s2idle() needs to get called. > > The problem is that it also gets called for others paths: > > #ifdef CONFIG_PM_SLEEP > #define SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \ > .suspend = suspend_fn, \ > .resume = resume_fn, \ > .freeze = suspend_fn, \ > .thaw = resume_fn, \ > .poweroff = suspend_fn, \ > .restore = resume_fn, > #else > else > #define SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) > #endif > > #define SIMPLE_DEV_PM_OPS(name, suspend_fn, resume_fn) \ > const struct dev_pm_ops name = { \ > SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \ > } > > And at least for poweroff this new code seems completely wrong, even > for freeze it looks rather borderline. Not really, for hibernation pm_suspend_via_s2idle() evaluates to false so the old code path will be taken. > > And more to the points - if these "modern MS standby" systems are > becoming common, which it looks they are, we need support in the PM core > for those instead of working around the decisions in low-level drivers. Rafael, what do you think about this? Including this patch, there are five drivers that use pm_suspend_via_{firmware,s2idle}() to differentiate between S2I and S3. So I think maybe it’s time to introduce a new suspend callback for S2I? > >> SIMPLE_DEV_PM_OPS normally sets the same function for suspend and >> freeze (hibernate), so to avoid any changes to the hibernate case it seems >> to me that there needs to be a new nvme_freeze() that calls into the >> existing >> nvme_dev_disable for the freeze pm op and nvme_thaw() that calls into the >> existing nvme_reset_ctrl for the thaw pm op. > > At least, yes. Hibernation should remain the same as stated above. > >>> enterprise class NVMe devices >>> that don't do APST and don't really do different power states at >>> all in many cases. >> >> Enterprise class NVMe devices that don't do APST - do they typically >> have a non-zero value for ndev->ctrl.npss? >> >> If not, they wouldn't enter this new codepath even if the server entered >> into S2I. > > No, devices that do set NPSS will have at least some power states > per definition, although they might not be too useful. I suspect checking > APSTA might be safer, but if we don't want to rely on APST we should > check for a power state supporting the condition that the MS document > quoted in the original document supports. If Modern Standby or Connected Standby is not supported by servers, I don’t think the design documents mean much here. We probably should check if the platform firmware really supports S2I instead. Kai-Heng ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 6:48 ` [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle Kai-Heng Feng @ 2019-05-09 6:48 ` Kai-Heng Feng 2019-05-09 6:52 ` Christoph Hellwig 2019-05-09 9:07 ` Rafael J. Wysocki 2 siblings, 0 replies; 60+ messages in thread From: Kai-Heng Feng @ 2019-05-09 6:48 UTC (permalink / raw) To: Christoph Hellwig, rafael.j.wysocki Cc: Mario.Limonciello, Keith Busch, Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, Linux PM, LKML Cc Rafael and linux-pm at 14:12, Christoph Hellwig <hch@lst.de> wrote: > On Wed, May 08, 2019 at 08:28:30PM +0000, Mario.Limonciello@dell.com wrote: >> You might think this would be adding runtime_suspend/runtime_resume >> callbacks, but those also get called actually at runtime which is not >> the goal here. At runtime, these types of disks should rely on APST which >> should calculate the appropriate latencies around the different power >> states. >> >> This code path is only applicable in the suspend to idle state, which >> /does/ >> call suspend/resume functions associated with dev_pm_ops. There isn't >> a dedicated function in there for use only in suspend to idle, which is >> why pm_suspend_via_s2idle() needs to get called. > > The problem is that it also gets called for others paths: > > #ifdef CONFIG_PM_SLEEP > #define SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \ > .suspend = suspend_fn, \ > .resume = resume_fn, \ > .freeze = suspend_fn, \ > .thaw = resume_fn, \ > .poweroff = suspend_fn, \ > .restore = resume_fn, > #else > else > #define SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) > #endif > > #define SIMPLE_DEV_PM_OPS(name, suspend_fn, resume_fn) \ > const struct dev_pm_ops name = { \ > SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \ > } > > And at least for poweroff this new code seems completely wrong, even > for freeze it looks rather borderline. Not really, for hibernation pm_suspend_via_s2idle() evaluates to false so the old code path will be taken. > > And more to the points - if these "modern MS standby" systems are > becoming common, which it looks they are, we need support in the PM core > for those instead of working around the decisions in low-level drivers. Rafael, what do you think about this? Including this patch, there are five drivers that use pm_suspend_via_{firmware,s2idle}() to differentiate between S2I and S3. So I think maybe it’s time to introduce a new suspend callback for S2I? > >> SIMPLE_DEV_PM_OPS normally sets the same function for suspend and >> freeze (hibernate), so to avoid any changes to the hibernate case it seems >> to me that there needs to be a new nvme_freeze() that calls into the >> existing >> nvme_dev_disable for the freeze pm op and nvme_thaw() that calls into the >> existing nvme_reset_ctrl for the thaw pm op. > > At least, yes. Hibernation should remain the same as stated above. > >>> enterprise class NVMe devices >>> that don't do APST and don't really do different power states at >>> all in many cases. >> >> Enterprise class NVMe devices that don't do APST - do they typically >> have a non-zero value for ndev->ctrl.npss? >> >> If not, they wouldn't enter this new codepath even if the server entered >> into S2I. > > No, devices that do set NPSS will have at least some power states > per definition, although they might not be too useful. I suspect checking > APSTA might be safer, but if we don't want to rely on APST we should > check for a power state supporting the condition that the MS document > quoted in the original document supports. If Modern Standby or Connected Standby is not supported by servers, I don’t think the design documents mean much here. We probably should check if the platform firmware really supports S2I instead. Kai-Heng ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 6:48 ` [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle Kai-Heng Feng 2019-05-09 6:48 ` Kai-Heng Feng @ 2019-05-09 6:52 ` Christoph Hellwig 2019-05-09 6:52 ` Christoph Hellwig 2019-05-09 9:19 ` Rafael J. Wysocki 2019-05-09 9:07 ` Rafael J. Wysocki 2 siblings, 2 replies; 60+ messages in thread From: Christoph Hellwig @ 2019-05-09 6:52 UTC (permalink / raw) To: Kai-Heng Feng Cc: Christoph Hellwig, rafael.j.wysocki, Mario.Limonciello, Keith Busch, Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, Linux PM, LKML On Thu, May 09, 2019 at 02:48:59PM +0800, Kai-Heng Feng wrote: > Not really, for hibernation pm_suspend_via_s2idle() evaluates to false so > the old code path will be taken. > >> >> And more to the points - if these "modern MS standby" systems are >> becoming common, which it looks they are, we need support in the PM core >> for those instead of working around the decisions in low-level drivers. > > Rafael, what do you think about this? > Including this patch, there are five drivers that use > pm_suspend_via_{firmware,s2idle}() to differentiate between S2I and S3. > So I think maybe it’s time to introduce a new suspend callback for S2I? We also really need something like that to avoid the PCI_DEV_FLAGS_NO_D3 abuse - that flag is a quirk statically set on a device at probe time to prevent any entering of D3 state. >> per definition, although they might not be too useful. I suspect checking >> APSTA might be safer, but if we don't want to rely on APST we should >> check for a power state supporting the condition that the MS document >> quoted in the original document supports. > > If Modern Standby or Connected Standby is not supported by servers, I > don’t think the design documents mean much here. > We probably should check if the platform firmware really supports S2I > instead. That too. As said this really is a platform decision, and needs to be managed by the platform code through the PM core. Individual drivers like nvme can just implement the behavior, but are the absolute wrong place to make decisions on what kinds of suspend to enter. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 6:52 ` Christoph Hellwig @ 2019-05-09 6:52 ` Christoph Hellwig 2019-05-09 9:19 ` Rafael J. Wysocki 1 sibling, 0 replies; 60+ messages in thread From: Christoph Hellwig @ 2019-05-09 6:52 UTC (permalink / raw) To: Kai-Heng Feng Cc: Christoph Hellwig, rafael.j.wysocki, Mario.Limonciello, Keith Busch, Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, Linux PM, LKML On Thu, May 09, 2019 at 02:48:59PM +0800, Kai-Heng Feng wrote: > Not really, for hibernation pm_suspend_via_s2idle() evaluates to false so > the old code path will be taken. > >> >> And more to the points - if these "modern MS standby" systems are >> becoming common, which it looks they are, we need support in the PM core >> for those instead of working around the decisions in low-level drivers. > > Rafael, what do you think about this? > Including this patch, there are five drivers that use > pm_suspend_via_{firmware,s2idle}() to differentiate between S2I and S3. > So I think maybe it’s time to introduce a new suspend callback for S2I? We also really need something like that to avoid the PCI_DEV_FLAGS_NO_D3 abuse - that flag is a quirk statically set on a device at probe time to prevent any entering of D3 state. >> per definition, although they might not be too useful. I suspect checking >> APSTA might be safer, but if we don't want to rely on APST we should >> check for a power state supporting the condition that the MS document >> quoted in the original document supports. > > If Modern Standby or Connected Standby is not supported by servers, I > don’t think the design documents mean much here. > We probably should check if the platform firmware really supports S2I > instead. That too. As said this really is a platform decision, and needs to be managed by the platform code through the PM core. Individual drivers like nvme can just implement the behavior, but are the absolute wrong place to make decisions on what kinds of suspend to enter. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 6:52 ` Christoph Hellwig 2019-05-09 6:52 ` Christoph Hellwig @ 2019-05-09 9:19 ` Rafael J. Wysocki 2019-05-09 9:19 ` Rafael J. Wysocki 2019-05-09 9:25 ` Christoph Hellwig 1 sibling, 2 replies; 60+ messages in thread From: Rafael J. Wysocki @ 2019-05-09 9:19 UTC (permalink / raw) To: Christoph Hellwig Cc: Kai-Heng Feng, Rafael Wysocki, Mario Limonciello, Keith Busch, Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, Linux PM, LKML On Thu, May 9, 2019 at 8:52 AM Christoph Hellwig <hch@lst.de> wrote: > > On Thu, May 09, 2019 at 02:48:59PM +0800, Kai-Heng Feng wrote: > > Not really, for hibernation pm_suspend_via_s2idle() evaluates to false so > > the old code path will be taken. > > > >> > >> And more to the points - if these "modern MS standby" systems are > >> becoming common, which it looks they are, we need support in the PM core > >> for those instead of working around the decisions in low-level drivers. > > > > Rafael, what do you think about this? > > Including this patch, there are five drivers that use > > pm_suspend_via_{firmware,s2idle}() to differentiate between S2I and S3. > > So I think maybe it’s time to introduce a new suspend callback for S2I? > > We also really need something like that to avoid the PCI_DEV_FLAGS_NO_D3 > abuse - that flag is a quirk statically set on a device at probe time > to prevent any entering of D3 state. I agree that PCI_DEV_FLAGS_NO_D3 has to be avoided. However, IMO introducing a new set of suspend (and resume) callbacks for S2I would not be practical, because (a) the only difference between S2I and S2R from a driver perspective is that it may be expected to do something "special" about setting the device power state in the S2I case (the rest of what needs to be done during system-wide suspend/resume remains the same in both cases), (b) the new callbacks would only be really useful for a handful of drivers. > >> per definition, although they might not be too useful. I suspect checking > >> APSTA might be safer, but if we don't want to rely on APST we should > >> check for a power state supporting the condition that the MS document > >> quoted in the original document supports. > > > > If Modern Standby or Connected Standby is not supported by servers, I > > don’t think the design documents mean much here. > > We probably should check if the platform firmware really supports S2I > > instead. > > That too. As said this really is a platform decision, and needs to > be managed by the platform code through the PM core. I'm not what you mean by "platform decision" here. > Individual drivers like nvme can just implement the behavior, but are the absolute wrong > place to make decisions on what kinds of suspend to enter. Right, the choice of the target system state has already been made when their callbacks get invoked (and it has been made by user space, not by the platform). ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 9:19 ` Rafael J. Wysocki @ 2019-05-09 9:19 ` Rafael J. Wysocki 2019-05-09 9:25 ` Christoph Hellwig 1 sibling, 0 replies; 60+ messages in thread From: Rafael J. Wysocki @ 2019-05-09 9:19 UTC (permalink / raw) To: Christoph Hellwig Cc: Kai-Heng Feng, Rafael Wysocki, Mario Limonciello, Keith Busch, Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, Linux PM, LKML On Thu, May 9, 2019 at 8:52 AM Christoph Hellwig <hch@lst.de> wrote: > > On Thu, May 09, 2019 at 02:48:59PM +0800, Kai-Heng Feng wrote: > > Not really, for hibernation pm_suspend_via_s2idle() evaluates to false so > > the old code path will be taken. > > > >> > >> And more to the points - if these "modern MS standby" systems are > >> becoming common, which it looks they are, we need support in the PM core > >> for those instead of working around the decisions in low-level drivers. > > > > Rafael, what do you think about this? > > Including this patch, there are five drivers that use > > pm_suspend_via_{firmware,s2idle}() to differentiate between S2I and S3. > > So I think maybe it’s time to introduce a new suspend callback for S2I? > > We also really need something like that to avoid the PCI_DEV_FLAGS_NO_D3 > abuse - that flag is a quirk statically set on a device at probe time > to prevent any entering of D3 state. I agree that PCI_DEV_FLAGS_NO_D3 has to be avoided. However, IMO introducing a new set of suspend (and resume) callbacks for S2I would not be practical, because (a) the only difference between S2I and S2R from a driver perspective is that it may be expected to do something "special" about setting the device power state in the S2I case (the rest of what needs to be done during system-wide suspend/resume remains the same in both cases), (b) the new callbacks would only be really useful for a handful of drivers. > >> per definition, although they might not be too useful. I suspect checking > >> APSTA might be safer, but if we don't want to rely on APST we should > >> check for a power state supporting the condition that the MS document > >> quoted in the original document supports. > > > > If Modern Standby or Connected Standby is not supported by servers, I > > don’t think the design documents mean much here. > > We probably should check if the platform firmware really supports S2I > > instead. > > That too. As said this really is a platform decision, and needs to > be managed by the platform code through the PM core. I'm not what you mean by "platform decision" here. > Individual drivers like nvme can just implement the behavior, but are the absolute wrong > place to make decisions on what kinds of suspend to enter. Right, the choice of the target system state has already been made when their callbacks get invoked (and it has been made by user space, not by the platform). ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 9:19 ` Rafael J. Wysocki 2019-05-09 9:19 ` Rafael J. Wysocki @ 2019-05-09 9:25 ` Christoph Hellwig 2019-05-09 9:25 ` Christoph Hellwig 2019-05-09 20:48 ` Rafael J. Wysocki 1 sibling, 2 replies; 60+ messages in thread From: Christoph Hellwig @ 2019-05-09 9:25 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Christoph Hellwig, Kai-Heng Feng, Rafael Wysocki, Mario Limonciello, Keith Busch, Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, Linux PM, LKML On Thu, May 09, 2019 at 11:19:37AM +0200, Rafael J. Wysocki wrote: > Right, the choice of the target system state has already been made > when their callbacks get invoked (and it has been made by user space, > not by the platform). >From a previous discussion I remember the main problem here is that a lot of consumer NVMe use more power when put into D3hot than just letting the device itself manage the power state transitions themselves. Based on this patch there also might be some other device that want an explicit power state transition from the host, but still not be put into D3hot. The avoid D3hot at all cost thing seems to be based on the Windows broken^H^H^H^H^H^Hmodern standby principles. So for platforms that follow the modern standby model we need to avoid putting NVMe devices that support power management into D3hot somehow. This patch doesa a few more things, but at least for the device where I was involved in the earlier discussion those are not needed, and from the Linux point of view many of them seem wrong too. How do you think we best make that distinction? Are the pm_ops enough if we don't use the simple version? ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 9:25 ` Christoph Hellwig @ 2019-05-09 9:25 ` Christoph Hellwig 2019-05-09 20:48 ` Rafael J. Wysocki 1 sibling, 0 replies; 60+ messages in thread From: Christoph Hellwig @ 2019-05-09 9:25 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Christoph Hellwig, Kai-Heng Feng, Rafael Wysocki, Mario Limonciello, Keith Busch, Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, Linux PM, LKML On Thu, May 09, 2019 at 11:19:37AM +0200, Rafael J. Wysocki wrote: > Right, the choice of the target system state has already been made > when their callbacks get invoked (and it has been made by user space, > not by the platform). From a previous discussion I remember the main problem here is that a lot of consumer NVMe use more power when put into D3hot than just letting the device itself manage the power state transitions themselves. Based on this patch there also might be some other device that want an explicit power state transition from the host, but still not be put into D3hot. The avoid D3hot at all cost thing seems to be based on the Windows broken^H^H^H^H^H^Hmodern standby principles. So for platforms that follow the modern standby model we need to avoid putting NVMe devices that support power management into D3hot somehow. This patch doesa a few more things, but at least for the device where I was involved in the earlier discussion those are not needed, and from the Linux point of view many of them seem wrong too. How do you think we best make that distinction? Are the pm_ops enough if we don't use the simple version? ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 9:25 ` Christoph Hellwig 2019-05-09 9:25 ` Christoph Hellwig @ 2019-05-09 20:48 ` Rafael J. Wysocki 2019-05-09 20:48 ` Rafael J. Wysocki 1 sibling, 1 reply; 60+ messages in thread From: Rafael J. Wysocki @ 2019-05-09 20:48 UTC (permalink / raw) To: Christoph Hellwig Cc: Rafael J. Wysocki, Kai-Heng Feng, Rafael Wysocki, Mario Limonciello, Keith Busch, Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, Linux PM, LKML On Thu, May 9, 2019 at 11:25 AM Christoph Hellwig <hch@lst.de> wrote: > > On Thu, May 09, 2019 at 11:19:37AM +0200, Rafael J. Wysocki wrote: > > Right, the choice of the target system state has already been made > > when their callbacks get invoked (and it has been made by user space, > > not by the platform). > > From a previous discussion I remember the main problem here is that > a lot of consumer NVMe use more power when put into D3hot than just > letting the device itself manage the power state transitions themselves. > Based on this patch there also might be some other device that want > an explicit power state transition from the host, but still not be > put into D3hot. > > The avoid D3hot at all cost thing seems to be based on the Windows > broken^H^H^H^H^H^Hmodern standby principles. So for platforms that > follow the modern standby model we need to avoid putting NVMe devices > that support power management into D3hot somehow. This patch doesa a > few more things, but at least for the device where I was involved in > the earlier discussion those are not needed, and from the Linux > point of view many of them seem wrong too. > > How do you think we best make that distinction? Are the pm_ops > enough if we don't use the simple version? First, I think that it is instructive to look at what happens without the patch: nvme_suspend() gets called by pci_pm_suspend() (which basically causes the device to be "stopped" IIUC) and then pci_pm_suspend_noirq() is expected to put the device into the right power state through pci_prepare_to_sleep(). In theory, this should work for both S2R and S2I as long as the standard PCIe PM plus possibly ACPI PM is sufficient for the device. [Of course, the platform firmware invoked at the last stage of S2R can "fix up" things to reduce power further, but that should not be necessary if all is handled properly up to this point.] The claim in the patch changelog is that one design choice in Windows related to "Modern Standby" has caused our default PCI PM to not apply to NVMe devices in general (or to apply to them, but without much effect, which is practically equivalent IMO). This is not about a "different paradigm" (as Mario put it) or a different type of system suspend, but about the default PCI PM being basically useless for those devices at least in some configurations. And BTW, the same problem would have affected PM-runtime, had it been supported by the nvme driver, because Linux uses the combination of the standard PCIe PM and ACPI PM for PM-runtime too, and the "paradigm" in there is pretty much the same as for S2I, so let's not confuse things, pretty please. All of this means that the driver needs to override the default PCI PM like in the patch that Keith has just posted. Unfortunately, it looks like the "suspend via firmware" check needs to be there, because the platform firmware doing S3 on some platforms may get confused by the custom PM in the driver. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 20:48 ` Rafael J. Wysocki @ 2019-05-09 20:48 ` Rafael J. Wysocki 0 siblings, 0 replies; 60+ messages in thread From: Rafael J. Wysocki @ 2019-05-09 20:48 UTC (permalink / raw) To: Christoph Hellwig Cc: Rafael J. Wysocki, Kai-Heng Feng, Rafael Wysocki, Mario Limonciello, Keith Busch, Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, Linux PM, LKML On Thu, May 9, 2019 at 11:25 AM Christoph Hellwig <hch@lst.de> wrote: > > On Thu, May 09, 2019 at 11:19:37AM +0200, Rafael J. Wysocki wrote: > > Right, the choice of the target system state has already been made > > when their callbacks get invoked (and it has been made by user space, > > not by the platform). > > From a previous discussion I remember the main problem here is that > a lot of consumer NVMe use more power when put into D3hot than just > letting the device itself manage the power state transitions themselves. > Based on this patch there also might be some other device that want > an explicit power state transition from the host, but still not be > put into D3hot. > > The avoid D3hot at all cost thing seems to be based on the Windows > broken^H^H^H^H^H^Hmodern standby principles. So for platforms that > follow the modern standby model we need to avoid putting NVMe devices > that support power management into D3hot somehow. This patch doesa a > few more things, but at least for the device where I was involved in > the earlier discussion those are not needed, and from the Linux > point of view many of them seem wrong too. > > How do you think we best make that distinction? Are the pm_ops > enough if we don't use the simple version? First, I think that it is instructive to look at what happens without the patch: nvme_suspend() gets called by pci_pm_suspend() (which basically causes the device to be "stopped" IIUC) and then pci_pm_suspend_noirq() is expected to put the device into the right power state through pci_prepare_to_sleep(). In theory, this should work for both S2R and S2I as long as the standard PCIe PM plus possibly ACPI PM is sufficient for the device. [Of course, the platform firmware invoked at the last stage of S2R can "fix up" things to reduce power further, but that should not be necessary if all is handled properly up to this point.] The claim in the patch changelog is that one design choice in Windows related to "Modern Standby" has caused our default PCI PM to not apply to NVMe devices in general (or to apply to them, but without much effect, which is practically equivalent IMO). This is not about a "different paradigm" (as Mario put it) or a different type of system suspend, but about the default PCI PM being basically useless for those devices at least in some configurations. And BTW, the same problem would have affected PM-runtime, had it been supported by the nvme driver, because Linux uses the combination of the standard PCIe PM and ACPI PM for PM-runtime too, and the "paradigm" in there is pretty much the same as for S2I, so let's not confuse things, pretty please. All of this means that the driver needs to override the default PCI PM like in the patch that Keith has just posted. Unfortunately, it looks like the "suspend via firmware" check needs to be there, because the platform firmware doing S3 on some platforms may get confused by the custom PM in the driver. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 6:48 ` [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle Kai-Heng Feng 2019-05-09 6:48 ` Kai-Heng Feng 2019-05-09 6:52 ` Christoph Hellwig @ 2019-05-09 9:07 ` Rafael J. Wysocki 2019-05-09 9:07 ` Rafael J. Wysocki 2019-05-09 9:42 ` Kai-Heng Feng 2 siblings, 2 replies; 60+ messages in thread From: Rafael J. Wysocki @ 2019-05-09 9:07 UTC (permalink / raw) To: Kai-Heng Feng Cc: Christoph Hellwig, Rafael Wysocki, Mario Limonciello, Keith Busch, Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, Linux PM, LKML On Thu, May 9, 2019 at 8:49 AM Kai-Heng Feng <kai.heng.feng@canonical.com> wrote: > > Cc Rafael and linux-pm I would have been much more useful to CC the patch to linux-pm at least from the outset. > at 14:12, Christoph Hellwig <hch@lst.de> wrote: > > > On Wed, May 08, 2019 at 08:28:30PM +0000, Mario.Limonciello@dell.com wrote: > >> You might think this would be adding runtime_suspend/runtime_resume > >> callbacks, but those also get called actually at runtime which is not > >> the goal here. At runtime, these types of disks should rely on APST which > >> should calculate the appropriate latencies around the different power > >> states. > >> > >> This code path is only applicable in the suspend to idle state, which > >> /does/ > >> call suspend/resume functions associated with dev_pm_ops. There isn't > >> a dedicated function in there for use only in suspend to idle, which is > >> why pm_suspend_via_s2idle() needs to get called. > > > > The problem is that it also gets called for others paths: > > > > #ifdef CONFIG_PM_SLEEP > > #define SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \ > > .suspend = suspend_fn, \ > > .resume = resume_fn, \ > > .freeze = suspend_fn, \ > > .thaw = resume_fn, \ > > .poweroff = suspend_fn, \ > > .restore = resume_fn, > > #else > > else > > #define SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) > > #endif > > > > #define SIMPLE_DEV_PM_OPS(name, suspend_fn, resume_fn) \ > > const struct dev_pm_ops name = { \ > > SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \ > > } > > > > And at least for poweroff this new code seems completely wrong, even > > for freeze it looks rather borderline. > > Not really, for hibernation pm_suspend_via_s2idle() evaluates to false so > the old code path will be taken. > > > > > And more to the points - if these "modern MS standby" systems are > > becoming common, which it looks they are, we need support in the PM core > > for those instead of working around the decisions in low-level drivers. > > Rafael, what do you think about this? The difference between suspend-to-idle and suspend-to-RAM (S3) boils down to the fact that at the end of S3 suspend all control of the system is passed to the platform firmware. Then, the firmware can take care of some things that may not be taken care of by drivers (it sometimes assumes that drivers will not take care of those things even which is generally problematic). For suspend-to-idle the final physical state of the system should (in theory) be the same as the deepest possible physical idle state of it achievable through working-state PM (combination of PM-runtime and cpuidle, roughly speaking). However, in practice the working-state PM may not even be able to get there, as it generally requires many things to happen exactly at the right time in a specific order and the probability of that in the working-state PM situation is practically 0. Suspend-to-idle helps here by quiescing everything in an ordered fashion which makes all of the requisite conditions more likely to be met together. So yes, from an individual driver perspective, the device handling for s2idle should be more like for PM-runtime than for S3 (s2R), but this really shouldn't matter (and it doesn't matter for the vast majority of drivers). Unfortunately, the "modern MS standby" concept makes it matter, because "modern MS standby" causes system-wide transitions to be "special" and it appears to expect drivers to take care of the "extra bit" that would have been taken care of by the platform firmware in the S3 case. [Note that in the Windows world the "modern MS standby" systems don't support S3 ("modern MS standby" and S3 support are mutually exclusive in Windows AFAICS) while Linux needs to support S3 and is expected to achieve the minimum power state through s2idle (generally, even on the same platform) at the same time.] > Including this patch, there are five drivers that use > pm_suspend_via_{firmware,s2idle}() to differentiate between S2I and S3. Well, that is not a large number relative to the total number of drivers in Linux. > So I think maybe it’s time to introduce a new suspend callback for S2I? That would be a set of 6 new suspend and resume callbacks, mind you, and there's quite a few of them already. And the majority of drivers would not need to use them anyway. Also, please note that, possibly apart from the device power state setting, the S2I and S2R handling really aren't that different at all. You basically need to carry out the same preparations during suspend and reverse them during resume in both cases. That said I admit that there are cases in which device drivers need to know that the system-wide transition under way is into s2idle and so they should do extra stuff. If pm_suspend_via_firmware() is not sufficient for that, then I'm open to other suggestions, but introducing a new set of callbacks for that alone would be rather excessive IMO. > >> SIMPLE_DEV_PM_OPS normally sets the same function for suspend and > >> freeze (hibernate), so to avoid any changes to the hibernate case it seems > >> to me that there needs to be a new nvme_freeze() that calls into the > >> existing > >> nvme_dev_disable for the freeze pm op and nvme_thaw() that calls into the > >> existing nvme_reset_ctrl for the thaw pm op. > > > > At least, yes. > > Hibernation should remain the same as stated above. Depending on what check is used in that code path. pm_suspend_via_s2idle() will return "true" in the hibernation path too, for one. > >>> enterprise class NVMe devices > >>> that don't do APST and don't really do different power states at > >>> all in many cases. > >> > >> Enterprise class NVMe devices that don't do APST - do they typically > >> have a non-zero value for ndev->ctrl.npss? > >> > >> If not, they wouldn't enter this new codepath even if the server entered > >> into S2I. > > > > No, devices that do set NPSS will have at least some power states > > per definition, although they might not be too useful. I suspect checking > > APSTA might be safer, but if we don't want to rely on APST we should > > check for a power state supporting the condition that the MS document > > quoted in the original document supports. > > If Modern Standby or Connected Standby is not supported by servers, I don’t > think the design documents mean much here. > We probably should check if the platform firmware really supports S2I > instead. S2I is expected to work regardless of the platform firmware and there is nothing like "platform firmware support for S2I". IOW, that check would always return "false". What you really need to know is if the given particular transition is S2I. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 9:07 ` Rafael J. Wysocki @ 2019-05-09 9:07 ` Rafael J. Wysocki 2019-05-09 9:42 ` Kai-Heng Feng 1 sibling, 0 replies; 60+ messages in thread From: Rafael J. Wysocki @ 2019-05-09 9:07 UTC (permalink / raw) To: Kai-Heng Feng Cc: Christoph Hellwig, Rafael Wysocki, Mario Limonciello, Keith Busch, Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, Linux PM, LKML On Thu, May 9, 2019 at 8:49 AM Kai-Heng Feng <kai.heng.feng@canonical.com> wrote: > > Cc Rafael and linux-pm I would have been much more useful to CC the patch to linux-pm at least from the outset. > at 14:12, Christoph Hellwig <hch@lst.de> wrote: > > > On Wed, May 08, 2019 at 08:28:30PM +0000, Mario.Limonciello@dell.com wrote: > >> You might think this would be adding runtime_suspend/runtime_resume > >> callbacks, but those also get called actually at runtime which is not > >> the goal here. At runtime, these types of disks should rely on APST which > >> should calculate the appropriate latencies around the different power > >> states. > >> > >> This code path is only applicable in the suspend to idle state, which > >> /does/ > >> call suspend/resume functions associated with dev_pm_ops. There isn't > >> a dedicated function in there for use only in suspend to idle, which is > >> why pm_suspend_via_s2idle() needs to get called. > > > > The problem is that it also gets called for others paths: > > > > #ifdef CONFIG_PM_SLEEP > > #define SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \ > > .suspend = suspend_fn, \ > > .resume = resume_fn, \ > > .freeze = suspend_fn, \ > > .thaw = resume_fn, \ > > .poweroff = suspend_fn, \ > > .restore = resume_fn, > > #else > > else > > #define SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) > > #endif > > > > #define SIMPLE_DEV_PM_OPS(name, suspend_fn, resume_fn) \ > > const struct dev_pm_ops name = { \ > > SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \ > > } > > > > And at least for poweroff this new code seems completely wrong, even > > for freeze it looks rather borderline. > > Not really, for hibernation pm_suspend_via_s2idle() evaluates to false so > the old code path will be taken. > > > > > And more to the points - if these "modern MS standby" systems are > > becoming common, which it looks they are, we need support in the PM core > > for those instead of working around the decisions in low-level drivers. > > Rafael, what do you think about this? The difference between suspend-to-idle and suspend-to-RAM (S3) boils down to the fact that at the end of S3 suspend all control of the system is passed to the platform firmware. Then, the firmware can take care of some things that may not be taken care of by drivers (it sometimes assumes that drivers will not take care of those things even which is generally problematic). For suspend-to-idle the final physical state of the system should (in theory) be the same as the deepest possible physical idle state of it achievable through working-state PM (combination of PM-runtime and cpuidle, roughly speaking). However, in practice the working-state PM may not even be able to get there, as it generally requires many things to happen exactly at the right time in a specific order and the probability of that in the working-state PM situation is practically 0. Suspend-to-idle helps here by quiescing everything in an ordered fashion which makes all of the requisite conditions more likely to be met together. So yes, from an individual driver perspective, the device handling for s2idle should be more like for PM-runtime than for S3 (s2R), but this really shouldn't matter (and it doesn't matter for the vast majority of drivers). Unfortunately, the "modern MS standby" concept makes it matter, because "modern MS standby" causes system-wide transitions to be "special" and it appears to expect drivers to take care of the "extra bit" that would have been taken care of by the platform firmware in the S3 case. [Note that in the Windows world the "modern MS standby" systems don't support S3 ("modern MS standby" and S3 support are mutually exclusive in Windows AFAICS) while Linux needs to support S3 and is expected to achieve the minimum power state through s2idle (generally, even on the same platform) at the same time.] > Including this patch, there are five drivers that use > pm_suspend_via_{firmware,s2idle}() to differentiate between S2I and S3. Well, that is not a large number relative to the total number of drivers in Linux. > So I think maybe it’s time to introduce a new suspend callback for S2I? That would be a set of 6 new suspend and resume callbacks, mind you, and there's quite a few of them already. And the majority of drivers would not need to use them anyway. Also, please note that, possibly apart from the device power state setting, the S2I and S2R handling really aren't that different at all. You basically need to carry out the same preparations during suspend and reverse them during resume in both cases. That said I admit that there are cases in which device drivers need to know that the system-wide transition under way is into s2idle and so they should do extra stuff. If pm_suspend_via_firmware() is not sufficient for that, then I'm open to other suggestions, but introducing a new set of callbacks for that alone would be rather excessive IMO. > >> SIMPLE_DEV_PM_OPS normally sets the same function for suspend and > >> freeze (hibernate), so to avoid any changes to the hibernate case it seems > >> to me that there needs to be a new nvme_freeze() that calls into the > >> existing > >> nvme_dev_disable for the freeze pm op and nvme_thaw() that calls into the > >> existing nvme_reset_ctrl for the thaw pm op. > > > > At least, yes. > > Hibernation should remain the same as stated above. Depending on what check is used in that code path. pm_suspend_via_s2idle() will return "true" in the hibernation path too, for one. > >>> enterprise class NVMe devices > >>> that don't do APST and don't really do different power states at > >>> all in many cases. > >> > >> Enterprise class NVMe devices that don't do APST - do they typically > >> have a non-zero value for ndev->ctrl.npss? > >> > >> If not, they wouldn't enter this new codepath even if the server entered > >> into S2I. > > > > No, devices that do set NPSS will have at least some power states > > per definition, although they might not be too useful. I suspect checking > > APSTA might be safer, but if we don't want to rely on APST we should > > check for a power state supporting the condition that the MS document > > quoted in the original document supports. > > If Modern Standby or Connected Standby is not supported by servers, I don’t > think the design documents mean much here. > We probably should check if the platform firmware really supports S2I > instead. S2I is expected to work regardless of the platform firmware and there is nothing like "platform firmware support for S2I". IOW, that check would always return "false". What you really need to know is if the given particular transition is S2I. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 9:07 ` Rafael J. Wysocki 2019-05-09 9:07 ` Rafael J. Wysocki @ 2019-05-09 9:42 ` Kai-Heng Feng 2019-05-09 9:42 ` Kai-Heng Feng 2019-05-09 9:56 ` Christoph Hellwig 1 sibling, 2 replies; 60+ messages in thread From: Kai-Heng Feng @ 2019-05-09 9:42 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Christoph Hellwig, Rafael Wysocki, Mario Limonciello, Keith Busch, Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, Linux PM, LKML at 17:07, Rafael J. Wysocki <rafael@kernel.org> wrote: > On Thu, May 9, 2019 at 8:49 AM Kai-Heng Feng > <kai.heng.feng@canonical.com> wrote: >> Cc Rafael and linux-pm > > I would have been much more useful to CC the patch to linux-pm at > least from the outset. > >> at 14:12, Christoph Hellwig <hch@lst.de> wrote: >> >>> On Wed, May 08, 2019 at 08:28:30PM +0000, Mario.Limonciello@dell.com >>> wrote: >>>> You might think this would be adding runtime_suspend/runtime_resume >>>> callbacks, but those also get called actually at runtime which is not >>>> the goal here. At runtime, these types of disks should rely on APST >>>> which >>>> should calculate the appropriate latencies around the different power >>>> states. >>>> >>>> This code path is only applicable in the suspend to idle state, which >>>> /does/ >>>> call suspend/resume functions associated with dev_pm_ops. There isn't >>>> a dedicated function in there for use only in suspend to idle, which is >>>> why pm_suspend_via_s2idle() needs to get called. >>> >>> The problem is that it also gets called for others paths: >>> >>> #ifdef CONFIG_PM_SLEEP >>> #define SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \ >>> .suspend = suspend_fn, \ >>> .resume = resume_fn, \ >>> .freeze = suspend_fn, \ >>> .thaw = resume_fn, \ >>> .poweroff = suspend_fn, \ >>> .restore = resume_fn, >>> #else >>> else >>> #define SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) >>> #endif >>> >>> #define SIMPLE_DEV_PM_OPS(name, suspend_fn, resume_fn) \ >>> const struct dev_pm_ops name = { \ >>> SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \ >>> } >>> >>> And at least for poweroff this new code seems completely wrong, even >>> for freeze it looks rather borderline. >> >> Not really, for hibernation pm_suspend_via_s2idle() evaluates to false so >> the old code path will be taken. >> >>> And more to the points - if these "modern MS standby" systems are >>> becoming common, which it looks they are, we need support in the PM core >>> for those instead of working around the decisions in low-level drivers. >> >> Rafael, what do you think about this? > > The difference between suspend-to-idle and suspend-to-RAM (S3) boils > down to the fact that at the end of S3 suspend all control of the > system is passed to the platform firmware. Then, the firmware can > take care of some things that may not be taken care of by drivers (it > sometimes assumes that drivers will not take care of those things even > which is generally problematic). > > For suspend-to-idle the final physical state of the system should (in > theory) be the same as the deepest possible physical idle state of it > achievable through working-state PM (combination of PM-runtime and > cpuidle, roughly speaking). However, in practice the working-state PM > may not even be able to get there, as it generally requires many > things to happen exactly at the right time in a specific order and the > probability of that in the working-state PM situation is practically > 0. Suspend-to-idle helps here by quiescing everything in an ordered > fashion which makes all of the requisite conditions more likely to be > met together. > > So yes, from an individual driver perspective, the device handling for > s2idle should be more like for PM-runtime than for S3 (s2R), but this > really shouldn't matter (and it doesn't matter for the vast majority > of drivers). > > Unfortunately, the "modern MS standby" concept makes it matter, > because "modern MS standby" causes system-wide transitions to be > "special" and it appears to expect drivers to take care of the "extra > bit" that would have been taken care of by the platform firmware in > the S3 case. [Note that in the Windows world the "modern MS standby" > systems don't support S3 ("modern MS standby" and S3 support are > mutually exclusive in Windows AFAICS) while Linux needs to support S3 > and is expected to achieve the minimum power state through s2idle > (generally, even on the same platform) at the same time.] > >> Including this patch, there are five drivers that use >> pm_suspend_via_{firmware,s2idle}() to differentiate between S2I and S3. > > Well, that is not a large number relative to the total number of > drivers in Linux. That’s right, but I think we are going to see more of similar cases. > >> So I think maybe it’s time to introduce a new suspend callback for S2I? > > That would be a set of 6 new suspend and resume callbacks, mind you, > and there's quite a few of them already. And the majority of drivers > would not need to use them anyway. I think suspend_to_idle() and resume_from_idle() should be enough? What are other 4 callbacks? > > Also, please note that, possibly apart from the device power state > setting, the S2I and S2R handling really aren't that different at all. > You basically need to carry out the same preparations during suspend > and reverse them during resume in both cases. But for this case, it’s quite different to the original suspend and resume callbacks. > > That said I admit that there are cases in which device drivers need to > know that the system-wide transition under way is into s2idle and so > they should do extra stuff. If pm_suspend_via_firmware() is not > sufficient for that, then I'm open to other suggestions, but > introducing a new set of callbacks for that alone would be rather > excessive IMO. From drivers’ perspective nothing changed, as PM core can prioritize suspend_to_idle() over suspend() when it’s actually S2I. > >>>> SIMPLE_DEV_PM_OPS normally sets the same function for suspend and >>>> freeze (hibernate), so to avoid any changes to the hibernate case it >>>> seems >>>> to me that there needs to be a new nvme_freeze() that calls into the >>>> existing >>>> nvme_dev_disable for the freeze pm op and nvme_thaw() that calls into >>>> the >>>> existing nvme_reset_ctrl for the thaw pm op. >>> >>> At least, yes. >> >> Hibernation should remain the same as stated above. > > Depending on what check is used in that code path. > pm_suspend_via_s2idle() will return "true" in the hibernation path > too, for one. You are right, I should use !pm_suspend_via_firmware() instead. > >>>>> enterprise class NVMe devices >>>>> that don't do APST and don't really do different power states at >>>>> all in many cases. >>>> >>>> Enterprise class NVMe devices that don't do APST - do they typically >>>> have a non-zero value for ndev->ctrl.npss? >>>> >>>> If not, they wouldn't enter this new codepath even if the server entered >>>> into S2I. >>> >>> No, devices that do set NPSS will have at least some power states >>> per definition, although they might not be too useful. I suspect >>> checking >>> APSTA might be safer, but if we don't want to rely on APST we should >>> check for a power state supporting the condition that the MS document >>> quoted in the original document supports. >> >> If Modern Standby or Connected Standby is not supported by servers, I >> don’t >> think the design documents mean much here. >> We probably should check if the platform firmware really supports S2I >> instead. > > S2I is expected to work regardless of the platform firmware and there > is nothing like "platform firmware support for S2I". IOW, that check > would always return "false". > > What you really need to know is if the given particular transition is S2I. Maybe a helper based on FADT flag and _DSM can do this thing? Kai-Heng ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 9:42 ` Kai-Heng Feng @ 2019-05-09 9:42 ` Kai-Heng Feng 2019-05-09 9:56 ` Christoph Hellwig 1 sibling, 0 replies; 60+ messages in thread From: Kai-Heng Feng @ 2019-05-09 9:42 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Christoph Hellwig, Rafael Wysocki, Mario Limonciello, Keith Busch, Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, Linux PM, LKML at 17:07, Rafael J. Wysocki <rafael@kernel.org> wrote: > On Thu, May 9, 2019 at 8:49 AM Kai-Heng Feng > <kai.heng.feng@canonical.com> wrote: >> Cc Rafael and linux-pm > > I would have been much more useful to CC the patch to linux-pm at > least from the outset. > >> at 14:12, Christoph Hellwig <hch@lst.de> wrote: >> >>> On Wed, May 08, 2019 at 08:28:30PM +0000, Mario.Limonciello@dell.com >>> wrote: >>>> You might think this would be adding runtime_suspend/runtime_resume >>>> callbacks, but those also get called actually at runtime which is not >>>> the goal here. At runtime, these types of disks should rely on APST >>>> which >>>> should calculate the appropriate latencies around the different power >>>> states. >>>> >>>> This code path is only applicable in the suspend to idle state, which >>>> /does/ >>>> call suspend/resume functions associated with dev_pm_ops. There isn't >>>> a dedicated function in there for use only in suspend to idle, which is >>>> why pm_suspend_via_s2idle() needs to get called. >>> >>> The problem is that it also gets called for others paths: >>> >>> #ifdef CONFIG_PM_SLEEP >>> #define SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \ >>> .suspend = suspend_fn, \ >>> .resume = resume_fn, \ >>> .freeze = suspend_fn, \ >>> .thaw = resume_fn, \ >>> .poweroff = suspend_fn, \ >>> .restore = resume_fn, >>> #else >>> else >>> #define SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) >>> #endif >>> >>> #define SIMPLE_DEV_PM_OPS(name, suspend_fn, resume_fn) \ >>> const struct dev_pm_ops name = { \ >>> SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \ >>> } >>> >>> And at least for poweroff this new code seems completely wrong, even >>> for freeze it looks rather borderline. >> >> Not really, for hibernation pm_suspend_via_s2idle() evaluates to false so >> the old code path will be taken. >> >>> And more to the points - if these "modern MS standby" systems are >>> becoming common, which it looks they are, we need support in the PM core >>> for those instead of working around the decisions in low-level drivers. >> >> Rafael, what do you think about this? > > The difference between suspend-to-idle and suspend-to-RAM (S3) boils > down to the fact that at the end of S3 suspend all control of the > system is passed to the platform firmware. Then, the firmware can > take care of some things that may not be taken care of by drivers (it > sometimes assumes that drivers will not take care of those things even > which is generally problematic). > > For suspend-to-idle the final physical state of the system should (in > theory) be the same as the deepest possible physical idle state of it > achievable through working-state PM (combination of PM-runtime and > cpuidle, roughly speaking). However, in practice the working-state PM > may not even be able to get there, as it generally requires many > things to happen exactly at the right time in a specific order and the > probability of that in the working-state PM situation is practically > 0. Suspend-to-idle helps here by quiescing everything in an ordered > fashion which makes all of the requisite conditions more likely to be > met together. > > So yes, from an individual driver perspective, the device handling for > s2idle should be more like for PM-runtime than for S3 (s2R), but this > really shouldn't matter (and it doesn't matter for the vast majority > of drivers). > > Unfortunately, the "modern MS standby" concept makes it matter, > because "modern MS standby" causes system-wide transitions to be > "special" and it appears to expect drivers to take care of the "extra > bit" that would have been taken care of by the platform firmware in > the S3 case. [Note that in the Windows world the "modern MS standby" > systems don't support S3 ("modern MS standby" and S3 support are > mutually exclusive in Windows AFAICS) while Linux needs to support S3 > and is expected to achieve the minimum power state through s2idle > (generally, even on the same platform) at the same time.] > >> Including this patch, there are five drivers that use >> pm_suspend_via_{firmware,s2idle}() to differentiate between S2I and S3. > > Well, that is not a large number relative to the total number of > drivers in Linux. That’s right, but I think we are going to see more of similar cases. > >> So I think maybe it’s time to introduce a new suspend callback for S2I? > > That would be a set of 6 new suspend and resume callbacks, mind you, > and there's quite a few of them already. And the majority of drivers > would not need to use them anyway. I think suspend_to_idle() and resume_from_idle() should be enough? What are other 4 callbacks? > > Also, please note that, possibly apart from the device power state > setting, the S2I and S2R handling really aren't that different at all. > You basically need to carry out the same preparations during suspend > and reverse them during resume in both cases. But for this case, it’s quite different to the original suspend and resume callbacks. > > That said I admit that there are cases in which device drivers need to > know that the system-wide transition under way is into s2idle and so > they should do extra stuff. If pm_suspend_via_firmware() is not > sufficient for that, then I'm open to other suggestions, but > introducing a new set of callbacks for that alone would be rather > excessive IMO. From drivers’ perspective nothing changed, as PM core can prioritize suspend_to_idle() over suspend() when it’s actually S2I. > >>>> SIMPLE_DEV_PM_OPS normally sets the same function for suspend and >>>> freeze (hibernate), so to avoid any changes to the hibernate case it >>>> seems >>>> to me that there needs to be a new nvme_freeze() that calls into the >>>> existing >>>> nvme_dev_disable for the freeze pm op and nvme_thaw() that calls into >>>> the >>>> existing nvme_reset_ctrl for the thaw pm op. >>> >>> At least, yes. >> >> Hibernation should remain the same as stated above. > > Depending on what check is used in that code path. > pm_suspend_via_s2idle() will return "true" in the hibernation path > too, for one. You are right, I should use !pm_suspend_via_firmware() instead. > >>>>> enterprise class NVMe devices >>>>> that don't do APST and don't really do different power states at >>>>> all in many cases. >>>> >>>> Enterprise class NVMe devices that don't do APST - do they typically >>>> have a non-zero value for ndev->ctrl.npss? >>>> >>>> If not, they wouldn't enter this new codepath even if the server entered >>>> into S2I. >>> >>> No, devices that do set NPSS will have at least some power states >>> per definition, although they might not be too useful. I suspect >>> checking >>> APSTA might be safer, but if we don't want to rely on APST we should >>> check for a power state supporting the condition that the MS document >>> quoted in the original document supports. >> >> If Modern Standby or Connected Standby is not supported by servers, I >> don’t >> think the design documents mean much here. >> We probably should check if the platform firmware really supports S2I >> instead. > > S2I is expected to work regardless of the platform firmware and there > is nothing like "platform firmware support for S2I". IOW, that check > would always return "false". > > What you really need to know is if the given particular transition is S2I. Maybe a helper based on FADT flag and _DSM can do this thing? Kai-Heng ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 9:42 ` Kai-Heng Feng 2019-05-09 9:42 ` Kai-Heng Feng @ 2019-05-09 9:56 ` Christoph Hellwig 2019-05-09 9:56 ` Christoph Hellwig 2019-05-09 10:28 ` Kai-Heng Feng 1 sibling, 2 replies; 60+ messages in thread From: Christoph Hellwig @ 2019-05-09 9:56 UTC (permalink / raw) To: Kai-Heng Feng Cc: Rafael J. Wysocki, Christoph Hellwig, Rafael Wysocki, Mario Limonciello, Keith Busch, Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, Linux PM, LKML On Thu, May 09, 2019 at 05:42:30PM +0800, Kai-Heng Feng wrote: >> That would be a set of 6 new suspend and resume callbacks, mind you, >> and there's quite a few of them already. And the majority of drivers >> would not need to use them anyway. > > I think suspend_to_idle() and resume_from_idle() should be enough? > What are other 4 callbacks? > >> >> Also, please note that, possibly apart from the device power state >> setting, the S2I and S2R handling really aren't that different at all. >> You basically need to carry out the same preparations during suspend >> and reverse them during resume in both cases. > > But for this case, it’s quite different to the original suspend and > resume callbacks. Let's think of what cases we needed. The "classic" suspend in the nvme driver basically shuts down the device entirely. This is useful for: a) device that have no power management b) System power states that eventually power off the entire PCIe bus. I think that would: - suspend to disk (hibernate) - classic suspend to ram The we have the sequence in your patch. This seems to be related to some of the MS wording, but I'm not sure what for example tearing down the queues buys us. Can you explain a bit more where those bits make a difference? Otherwise I think we should use a "no-op" suspend, just leaving the power management to the device, or a simple setting the device to the deepest power state for everything else, where everything else is suspend, or suspend to idle. And of course than we have windows modern standby actually mandating runtime D3 in some case, and vague handwaving mentions of this being forced on the platforms, which I'm not entirely sure how they fit into the above picture. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 9:56 ` Christoph Hellwig @ 2019-05-09 9:56 ` Christoph Hellwig 2019-05-09 10:28 ` Kai-Heng Feng 1 sibling, 0 replies; 60+ messages in thread From: Christoph Hellwig @ 2019-05-09 9:56 UTC (permalink / raw) To: Kai-Heng Feng Cc: Rafael J. Wysocki, Christoph Hellwig, Rafael Wysocki, Mario Limonciello, Keith Busch, Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, Linux PM, LKML On Thu, May 09, 2019 at 05:42:30PM +0800, Kai-Heng Feng wrote: >> That would be a set of 6 new suspend and resume callbacks, mind you, >> and there's quite a few of them already. And the majority of drivers >> would not need to use them anyway. > > I think suspend_to_idle() and resume_from_idle() should be enough? > What are other 4 callbacks? > >> >> Also, please note that, possibly apart from the device power state >> setting, the S2I and S2R handling really aren't that different at all. >> You basically need to carry out the same preparations during suspend >> and reverse them during resume in both cases. > > But for this case, it’s quite different to the original suspend and > resume callbacks. Let's think of what cases we needed. The "classic" suspend in the nvme driver basically shuts down the device entirely. This is useful for: a) device that have no power management b) System power states that eventually power off the entire PCIe bus. I think that would: - suspend to disk (hibernate) - classic suspend to ram The we have the sequence in your patch. This seems to be related to some of the MS wording, but I'm not sure what for example tearing down the queues buys us. Can you explain a bit more where those bits make a difference? Otherwise I think we should use a "no-op" suspend, just leaving the power management to the device, or a simple setting the device to the deepest power state for everything else, where everything else is suspend, or suspend to idle. And of course than we have windows modern standby actually mandating runtime D3 in some case, and vague handwaving mentions of this being forced on the platforms, which I'm not entirely sure how they fit into the above picture. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 9:56 ` Christoph Hellwig 2019-05-09 9:56 ` Christoph Hellwig @ 2019-05-09 10:28 ` Kai-Heng Feng 2019-05-09 10:28 ` Kai-Heng Feng ` (2 more replies) 1 sibling, 3 replies; 60+ messages in thread From: Kai-Heng Feng @ 2019-05-09 10:28 UTC (permalink / raw) To: Christoph Hellwig Cc: Rafael J. Wysocki, Rafael Wysocki, Mario Limonciello, Keith Busch, Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, Linux PM, LKML at 17:56, Christoph Hellwig <hch@lst.de> wrote: > On Thu, May 09, 2019 at 05:42:30PM +0800, Kai-Heng Feng wrote: >>> That would be a set of 6 new suspend and resume callbacks, mind you, >>> and there's quite a few of them already. And the majority of drivers >>> would not need to use them anyway. >> >> I think suspend_to_idle() and resume_from_idle() should be enough? >> What are other 4 callbacks? >> >>> Also, please note that, possibly apart from the device power state >>> setting, the S2I and S2R handling really aren't that different at all. >>> You basically need to carry out the same preparations during suspend >>> and reverse them during resume in both cases. >> >> But for this case, it’s quite different to the original suspend and >> resume callbacks. > > Let's think of what cases we needed. > > The "classic" suspend in the nvme driver basically shuts down the > device entirely. This is useful for: > > a) device that have no power management > b) System power states that eventually power off the entire PCIe bus. > I think that would: > > - suspend to disk (hibernate) > - classic suspend to ram > > The we have the sequence in your patch. This seems to be related to > some of the MS wording, but I'm not sure what for example tearing down > the queues buys us. Can you explain a bit more where those bits > make a difference? Based on my testing if queues (IRQ) are not disabled, NVMe controller won’t be quiesced. Symptoms can be high power drain or system freeze. I can check with vendors whether this also necessary under Windows. > > Otherwise I think we should use a "no-op" suspend, just leaving the > power management to the device, or a simple setting the device to the > deepest power state for everything else, where everything else is > suspend, or suspend to idle. I am not sure I get your idea. Does this “no-op” suspend happen in NVMe driver or PM core? > > And of course than we have windows modern standby actually mandating > runtime D3 in some case, and vague handwaving mentions of this being > forced on the platforms, which I'm not entirely sure how they fit > into the above picture. I was told that Windows doesn’t use runtime D3, APST is used exclusively. Kai-Heng ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 10:28 ` Kai-Heng Feng @ 2019-05-09 10:28 ` Kai-Heng Feng 2019-05-09 10:31 ` Christoph Hellwig 2019-05-09 16:20 ` Keith Busch 2 siblings, 0 replies; 60+ messages in thread From: Kai-Heng Feng @ 2019-05-09 10:28 UTC (permalink / raw) To: Christoph Hellwig Cc: Rafael J. Wysocki, Rafael Wysocki, Mario Limonciello, Keith Busch, Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, Linux PM, LKML at 17:56, Christoph Hellwig <hch@lst.de> wrote: > On Thu, May 09, 2019 at 05:42:30PM +0800, Kai-Heng Feng wrote: >>> That would be a set of 6 new suspend and resume callbacks, mind you, >>> and there's quite a few of them already. And the majority of drivers >>> would not need to use them anyway. >> >> I think suspend_to_idle() and resume_from_idle() should be enough? >> What are other 4 callbacks? >> >>> Also, please note that, possibly apart from the device power state >>> setting, the S2I and S2R handling really aren't that different at all. >>> You basically need to carry out the same preparations during suspend >>> and reverse them during resume in both cases. >> >> But for this case, it’s quite different to the original suspend and >> resume callbacks. > > Let's think of what cases we needed. > > The "classic" suspend in the nvme driver basically shuts down the > device entirely. This is useful for: > > a) device that have no power management > b) System power states that eventually power off the entire PCIe bus. > I think that would: > > - suspend to disk (hibernate) > - classic suspend to ram > > The we have the sequence in your patch. This seems to be related to > some of the MS wording, but I'm not sure what for example tearing down > the queues buys us. Can you explain a bit more where those bits > make a difference? Based on my testing if queues (IRQ) are not disabled, NVMe controller won’t be quiesced. Symptoms can be high power drain or system freeze. I can check with vendors whether this also necessary under Windows. > > Otherwise I think we should use a "no-op" suspend, just leaving the > power management to the device, or a simple setting the device to the > deepest power state for everything else, where everything else is > suspend, or suspend to idle. I am not sure I get your idea. Does this “no-op” suspend happen in NVMe driver or PM core? > > And of course than we have windows modern standby actually mandating > runtime D3 in some case, and vague handwaving mentions of this being > forced on the platforms, which I'm not entirely sure how they fit > into the above picture. I was told that Windows doesn’t use runtime D3, APST is used exclusively. Kai-Heng ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 10:28 ` Kai-Heng Feng 2019-05-09 10:28 ` Kai-Heng Feng @ 2019-05-09 10:31 ` Christoph Hellwig 2019-05-09 10:31 ` Christoph Hellwig 2019-05-09 11:59 ` Kai-Heng Feng 2019-05-09 16:20 ` Keith Busch 2 siblings, 2 replies; 60+ messages in thread From: Christoph Hellwig @ 2019-05-09 10:31 UTC (permalink / raw) To: Kai-Heng Feng Cc: Christoph Hellwig, Rafael J. Wysocki, Rafael Wysocki, Mario Limonciello, Keith Busch, Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, Linux PM, LKML On Thu, May 09, 2019 at 06:28:32PM +0800, Kai-Heng Feng wrote: > Based on my testing if queues (IRQ) are not disabled, NVMe controller > won’t be quiesced. > Symptoms can be high power drain or system freeze. > > I can check with vendors whether this also necessary under Windows. System freeze sounds odd. And we had a patch from a person on the Cc list here that was handed to me through a few indirections that just skipps the suspend entirely for some cases, which seemd to work fine with the controllers in question. >> Otherwise I think we should use a "no-op" suspend, just leaving the >> power management to the device, or a simple setting the device to the >> deepest power state for everything else, where everything else is >> suspend, or suspend to idle. > > I am not sure I get your idea. Does this “no-op” suspend happen in NVMe > driver or PM core? no-op means we don't want to do anything in nvme. If that happens by not calling nvme or stubbing out the method for that particular case does not matter. >> And of course than we have windows modern standby actually mandating >> runtime D3 in some case, and vague handwaving mentions of this being >> forced on the platforms, which I'm not entirely sure how they fit >> into the above picture. > > I was told that Windows doesn’t use runtime D3, APST is used exclusively. As far as I know the default power management modes in the Microsoft NVMe driver is explicit power management transitions, and in the Intel RST driver that is commonly used it is APST. But both could still be comined with runtime D3 in theory, I'm just not sure if they are. Microsoft has been pushing for aggressive runtime D3 for a while, but I don't know if that includes NVMe devices. > > Kai-Heng > ---end quoted text--- ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 10:31 ` Christoph Hellwig @ 2019-05-09 10:31 ` Christoph Hellwig 2019-05-09 11:59 ` Kai-Heng Feng 1 sibling, 0 replies; 60+ messages in thread From: Christoph Hellwig @ 2019-05-09 10:31 UTC (permalink / raw) To: Kai-Heng Feng Cc: Christoph Hellwig, Rafael J. Wysocki, Rafael Wysocki, Mario Limonciello, Keith Busch, Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, Linux PM, LKML On Thu, May 09, 2019 at 06:28:32PM +0800, Kai-Heng Feng wrote: > Based on my testing if queues (IRQ) are not disabled, NVMe controller > won’t be quiesced. > Symptoms can be high power drain or system freeze. > > I can check with vendors whether this also necessary under Windows. System freeze sounds odd. And we had a patch from a person on the Cc list here that was handed to me through a few indirections that just skipps the suspend entirely for some cases, which seemd to work fine with the controllers in question. >> Otherwise I think we should use a "no-op" suspend, just leaving the >> power management to the device, or a simple setting the device to the >> deepest power state for everything else, where everything else is >> suspend, or suspend to idle. > > I am not sure I get your idea. Does this “no-op” suspend happen in NVMe > driver or PM core? no-op means we don't want to do anything in nvme. If that happens by not calling nvme or stubbing out the method for that particular case does not matter. >> And of course than we have windows modern standby actually mandating >> runtime D3 in some case, and vague handwaving mentions of this being >> forced on the platforms, which I'm not entirely sure how they fit >> into the above picture. > > I was told that Windows doesn’t use runtime D3, APST is used exclusively. As far as I know the default power management modes in the Microsoft NVMe driver is explicit power management transitions, and in the Intel RST driver that is commonly used it is APST. But both could still be comined with runtime D3 in theory, I'm just not sure if they are. Microsoft has been pushing for aggressive runtime D3 for a while, but I don't know if that includes NVMe devices. > > Kai-Heng > ---end quoted text--- ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 10:31 ` Christoph Hellwig 2019-05-09 10:31 ` Christoph Hellwig @ 2019-05-09 11:59 ` Kai-Heng Feng 2019-05-09 11:59 ` Kai-Heng Feng 2019-05-09 18:57 ` Mario.Limonciello 1 sibling, 2 replies; 60+ messages in thread From: Kai-Heng Feng @ 2019-05-09 11:59 UTC (permalink / raw) To: Christoph Hellwig Cc: Rafael J. Wysocki, Rafael Wysocki, Mario Limonciello, Keith Busch, Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, Linux PM, LKML at 18:31, Christoph Hellwig <hch@lst.de> wrote: > On Thu, May 09, 2019 at 06:28:32PM +0800, Kai-Heng Feng wrote: >> Based on my testing if queues (IRQ) are not disabled, NVMe controller >> won’t be quiesced. >> Symptoms can be high power drain or system freeze. >> >> I can check with vendors whether this also necessary under Windows. > > System freeze sounds odd. And we had a patch from a person on the > Cc list here that was handed to me through a few indirections that > just skipps the suspend entirely for some cases, which seemd to > work fine with the controllers in question. That works fine for some devices, but for Toshiba NVMes this said scenario freezes the system, hence the new patch here. And for all NVMes I tested this new suspend routine saves even more power than simply skipping suspend. > >>> Otherwise I think we should use a "no-op" suspend, just leaving the >>> power management to the device, or a simple setting the device to the >>> deepest power state for everything else, where everything else is >>> suspend, or suspend to idle. >> >> I am not sure I get your idea. Does this “no-op” suspend happen in NVMe >> driver or PM core? > > no-op means we don't want to do anything in nvme. If that happens > by not calling nvme or stubbing out the method for that particular > case does not matter. Ok, but we still need to figure out how to prevent the device device from tradition to D3. > >>> And of course than we have windows modern standby actually mandating >>> runtime D3 in some case, and vague handwaving mentions of this being >>> forced on the platforms, which I'm not entirely sure how they fit >>> into the above picture. >> >> I was told that Windows doesn’t use runtime D3, APST is used exclusively. > > As far as I know the default power management modes in the Microsoft > NVMe driver is explicit power management transitions, and in the Intel > RST driver that is commonly used it is APST. But both could still > be comined with runtime D3 in theory, I'm just not sure if they are. > > Microsoft has been pushing for aggressive runtime D3 for a while, but > I don't know if that includes NVMe devices. Ok, I’ll check with vendors about this. Kai-Heng > >> Kai-Heng > ---end quoted text— ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 11:59 ` Kai-Heng Feng @ 2019-05-09 11:59 ` Kai-Heng Feng 2019-05-09 18:57 ` Mario.Limonciello 1 sibling, 0 replies; 60+ messages in thread From: Kai-Heng Feng @ 2019-05-09 11:59 UTC (permalink / raw) To: Christoph Hellwig Cc: Rafael J. Wysocki, Rafael Wysocki, Mario Limonciello, Keith Busch, Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, Linux PM, LKML at 18:31, Christoph Hellwig <hch@lst.de> wrote: > On Thu, May 09, 2019 at 06:28:32PM +0800, Kai-Heng Feng wrote: >> Based on my testing if queues (IRQ) are not disabled, NVMe controller >> won’t be quiesced. >> Symptoms can be high power drain or system freeze. >> >> I can check with vendors whether this also necessary under Windows. > > System freeze sounds odd. And we had a patch from a person on the > Cc list here that was handed to me through a few indirections that > just skipps the suspend entirely for some cases, which seemd to > work fine with the controllers in question. That works fine for some devices, but for Toshiba NVMes this said scenario freezes the system, hence the new patch here. And for all NVMes I tested this new suspend routine saves even more power than simply skipping suspend. > >>> Otherwise I think we should use a "no-op" suspend, just leaving the >>> power management to the device, or a simple setting the device to the >>> deepest power state for everything else, where everything else is >>> suspend, or suspend to idle. >> >> I am not sure I get your idea. Does this “no-op” suspend happen in NVMe >> driver or PM core? > > no-op means we don't want to do anything in nvme. If that happens > by not calling nvme or stubbing out the method for that particular > case does not matter. Ok, but we still need to figure out how to prevent the device device from tradition to D3. > >>> And of course than we have windows modern standby actually mandating >>> runtime D3 in some case, and vague handwaving mentions of this being >>> forced on the platforms, which I'm not entirely sure how they fit >>> into the above picture. >> >> I was told that Windows doesn’t use runtime D3, APST is used exclusively. > > As far as I know the default power management modes in the Microsoft > NVMe driver is explicit power management transitions, and in the Intel > RST driver that is commonly used it is APST. But both could still > be comined with runtime D3 in theory, I'm just not sure if they are. > > Microsoft has been pushing for aggressive runtime D3 for a while, but > I don't know if that includes NVMe devices. Ok, I’ll check with vendors about this. Kai-Heng > >> Kai-Heng > ---end quoted text— ^ permalink raw reply [flat|nested] 60+ messages in thread
* RE: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 11:59 ` Kai-Heng Feng 2019-05-09 11:59 ` Kai-Heng Feng @ 2019-05-09 18:57 ` Mario.Limonciello 2019-05-09 18:57 ` Mario.Limonciello 2019-05-09 19:28 ` Keith Busch 1 sibling, 2 replies; 60+ messages in thread From: Mario.Limonciello @ 2019-05-09 18:57 UTC (permalink / raw) To: kai.heng.feng, hch Cc: rafael, rafael.j.wysocki, kbusch, keith.busch, axboe, sagi, linux-nvme, linux-pm, linux-kernel > > > >>> Otherwise I think we should use a "no-op" suspend, just leaving the > >>> power management to the device, or a simple setting the device to the > >>> deepest power state for everything else, where everything else is > >>> suspend, or suspend to idle. > >> > >> I am not sure I get your idea. Does this “no-op” suspend happen in NVMe > >> driver or PM core? > > > > no-op means we don't want to do anything in nvme. If that happens > > by not calling nvme or stubbing out the method for that particular > > case does not matter. > > Ok, but we still need to figure out how to prevent the device device from > tradition to D3. This so-called no-op was something that we had experimented with while developing this patch, but found that it would not help power consumption on all drives. That's why we have explicit call to go into deepest power state in current patch. > > > > >>> And of course than we have windows modern standby actually mandating > >>> runtime D3 in some case, and vague handwaving mentions of this being > >>> forced on the platforms, which I'm not entirely sure how they fit > >>> into the above picture. > >> > >> I was told that Windows doesn’t use runtime D3, APST is used exclusively. > > > > As far as I know the default power management modes in the Microsoft > > NVMe driver is explicit power management transitions, and in the Intel > > RST driver that is commonly used it is APST. But both could still > > be comined with runtime D3 in theory, I'm just not sure if they are. > > > > Microsoft has been pushing for aggressive runtime D3 for a while, but > > I don't know if that includes NVMe devices. > > Ok, I’ll check with vendors about this. > No, current Windows versions don't transition to D3 with inbox NVME driver. You're correct, it's explicit state transitions even if APST was enabled (as this patch is currently doing as well). ^ permalink raw reply [flat|nested] 60+ messages in thread
* RE: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 18:57 ` Mario.Limonciello @ 2019-05-09 18:57 ` Mario.Limonciello 2019-05-09 19:28 ` Keith Busch 1 sibling, 0 replies; 60+ messages in thread From: Mario.Limonciello @ 2019-05-09 18:57 UTC (permalink / raw) To: kai.heng.feng, hch Cc: rafael, rafael.j.wysocki, kbusch, keith.busch, axboe, sagi, linux-nvme, linux-pm, linux-kernel > > > >>> Otherwise I think we should use a "no-op" suspend, just leaving the > >>> power management to the device, or a simple setting the device to the > >>> deepest power state for everything else, where everything else is > >>> suspend, or suspend to idle. > >> > >> I am not sure I get your idea. Does this “no-op” suspend happen in NVMe > >> driver or PM core? > > > > no-op means we don't want to do anything in nvme. If that happens > > by not calling nvme or stubbing out the method for that particular > > case does not matter. > > Ok, but we still need to figure out how to prevent the device device from > tradition to D3. This so-called no-op was something that we had experimented with while developing this patch, but found that it would not help power consumption on all drives. That's why we have explicit call to go into deepest power state in current patch. > > > > >>> And of course than we have windows modern standby actually mandating > >>> runtime D3 in some case, and vague handwaving mentions of this being > >>> forced on the platforms, which I'm not entirely sure how they fit > >>> into the above picture. > >> > >> I was told that Windows doesn’t use runtime D3, APST is used exclusively. > > > > As far as I know the default power management modes in the Microsoft > > NVMe driver is explicit power management transitions, and in the Intel > > RST driver that is commonly used it is APST. But both could still > > be comined with runtime D3 in theory, I'm just not sure if they are. > > > > Microsoft has been pushing for aggressive runtime D3 for a while, but > > I don't know if that includes NVMe devices. > > Ok, I’ll check with vendors about this. > No, current Windows versions don't transition to D3 with inbox NVME driver. You're correct, it's explicit state transitions even if APST was enabled (as this patch is currently doing as well). ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 18:57 ` Mario.Limonciello 2019-05-09 18:57 ` Mario.Limonciello @ 2019-05-09 19:28 ` Keith Busch 2019-05-09 19:28 ` Keith Busch ` (3 more replies) 1 sibling, 4 replies; 60+ messages in thread From: Keith Busch @ 2019-05-09 19:28 UTC (permalink / raw) To: Mario.Limonciello Cc: kai.heng.feng, hch, axboe, sagi, rafael, linux-pm, rafael.j.wysocki, linux-kernel, linux-nvme, keith.busch On Thu, May 09, 2019 at 06:57:34PM +0000, Mario.Limonciello@dell.com wrote: > No, current Windows versions don't transition to D3 with inbox NVME driver. > You're correct, it's explicit state transitions even if APST was enabled > (as this patch is currently doing as well). The proposed patch does too much, and your resume latency will be worse off for doing an unnecessary controller reset. The following should be all that's needed if the device is spec compliant. The resume part isn't necessary if npss is non-operational, but we're not saving that info, and it shouldn't hurt to be explicit anyway. I don't have any PS capable devices, so this is just compile tested. --- diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 6265d9225ec8..ce8b9bc949b9 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -1132,6 +1132,22 @@ static int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword return ret; } +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss) +{ + int ret; + + mutex_lock(&ctrl->scan_lock); + nvme_start_freeze(ctrl); + nvme_wait_freeze(ctrl); + ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, npss, NULL, 0, + NULL); + nvme_unfreeze(ctrl); + mutex_unlock(&ctrl->scan_lock); + + return ret; +} +EXPORT_SYMBOL_GPL(nvme_set_power); + int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count) { u32 q_count = (*count - 1) | ((*count - 1) << 16); diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 527d64545023..f2be6aad9804 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -459,6 +459,7 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd, unsigned timeout, int qid, int at_head, blk_mq_req_flags_t flags, bool poll); int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count); +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss); void nvme_stop_keep_alive(struct nvme_ctrl *ctrl); int nvme_reset_ctrl(struct nvme_ctrl *ctrl); int nvme_reset_ctrl_sync(struct nvme_ctrl *ctrl); diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index a90cf5d63aac..2c4154cb4e79 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -18,6 +18,7 @@ #include <linux/mutex.h> #include <linux/once.h> #include <linux/pci.h> +#include <linux/suspend.h> #include <linux/t10-pi.h> #include <linux/types.h> #include <linux/io-64-nonatomic-lo-hi.h> @@ -2851,6 +2852,8 @@ static int nvme_suspend(struct device *dev) struct pci_dev *pdev = to_pci_dev(dev); struct nvme_dev *ndev = pci_get_drvdata(pdev); + if (!pm_suspend_via_firmware()) + return nvme_set_power(&ndev->ctrl, ndev->ctrl.npss); nvme_dev_disable(ndev, true); return 0; } @@ -2860,6 +2863,8 @@ static int nvme_resume(struct device *dev) struct pci_dev *pdev = to_pci_dev(dev); struct nvme_dev *ndev = pci_get_drvdata(pdev); + if (!pm_suspend_via_firmware()) + return nvme_set_power(&ndev->ctrl, 0); nvme_reset_ctrl(&ndev->ctrl); return 0; } -- ^ permalink raw reply related [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 19:28 ` Keith Busch @ 2019-05-09 19:28 ` Keith Busch 2019-05-09 20:54 ` Rafael J. Wysocki ` (2 subsequent siblings) 3 siblings, 0 replies; 60+ messages in thread From: Keith Busch @ 2019-05-09 19:28 UTC (permalink / raw) To: Mario.Limonciello Cc: kai.heng.feng, hch, axboe, sagi, rafael, linux-pm, rafael.j.wysocki, linux-kernel, linux-nvme, keith.busch On Thu, May 09, 2019 at 06:57:34PM +0000, Mario.Limonciello@dell.com wrote: > No, current Windows versions don't transition to D3 with inbox NVME driver. > You're correct, it's explicit state transitions even if APST was enabled > (as this patch is currently doing as well). The proposed patch does too much, and your resume latency will be worse off for doing an unnecessary controller reset. The following should be all that's needed if the device is spec compliant. The resume part isn't necessary if npss is non-operational, but we're not saving that info, and it shouldn't hurt to be explicit anyway. I don't have any PS capable devices, so this is just compile tested. --- diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 6265d9225ec8..ce8b9bc949b9 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -1132,6 +1132,22 @@ static int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword return ret; } +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss) +{ + int ret; + + mutex_lock(&ctrl->scan_lock); + nvme_start_freeze(ctrl); + nvme_wait_freeze(ctrl); + ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, npss, NULL, 0, + NULL); + nvme_unfreeze(ctrl); + mutex_unlock(&ctrl->scan_lock); + + return ret; +} +EXPORT_SYMBOL_GPL(nvme_set_power); + int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count) { u32 q_count = (*count - 1) | ((*count - 1) << 16); diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 527d64545023..f2be6aad9804 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -459,6 +459,7 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd, unsigned timeout, int qid, int at_head, blk_mq_req_flags_t flags, bool poll); int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count); +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss); void nvme_stop_keep_alive(struct nvme_ctrl *ctrl); int nvme_reset_ctrl(struct nvme_ctrl *ctrl); int nvme_reset_ctrl_sync(struct nvme_ctrl *ctrl); diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index a90cf5d63aac..2c4154cb4e79 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -18,6 +18,7 @@ #include <linux/mutex.h> #include <linux/once.h> #include <linux/pci.h> +#include <linux/suspend.h> #include <linux/t10-pi.h> #include <linux/types.h> #include <linux/io-64-nonatomic-lo-hi.h> @@ -2851,6 +2852,8 @@ static int nvme_suspend(struct device *dev) struct pci_dev *pdev = to_pci_dev(dev); struct nvme_dev *ndev = pci_get_drvdata(pdev); + if (!pm_suspend_via_firmware()) + return nvme_set_power(&ndev->ctrl, ndev->ctrl.npss); nvme_dev_disable(ndev, true); return 0; } @@ -2860,6 +2863,8 @@ static int nvme_resume(struct device *dev) struct pci_dev *pdev = to_pci_dev(dev); struct nvme_dev *ndev = pci_get_drvdata(pdev); + if (!pm_suspend_via_firmware()) + return nvme_set_power(&ndev->ctrl, 0); nvme_reset_ctrl(&ndev->ctrl); return 0; } -- ^ permalink raw reply related [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 19:28 ` Keith Busch 2019-05-09 19:28 ` Keith Busch @ 2019-05-09 20:54 ` Rafael J. Wysocki 2019-05-09 20:54 ` Rafael J. Wysocki 2019-05-09 21:16 ` Keith Busch 2019-05-09 21:37 ` Mario.Limonciello 2019-05-10 5:30 ` Christoph Hellwig 3 siblings, 2 replies; 60+ messages in thread From: Rafael J. Wysocki @ 2019-05-09 20:54 UTC (permalink / raw) To: Keith Busch Cc: Mario Limonciello, Kai-Heng Feng, Christoph Hellwig, Jens Axboe, Sagi Grimberg, Rafael J. Wysocki, Linux PM, Rafael Wysocki, Linux Kernel Mailing List, linux-nvme, Keith Busch On Thu, May 9, 2019 at 9:33 PM Keith Busch <kbusch@kernel.org> wrote: > > On Thu, May 09, 2019 at 06:57:34PM +0000, Mario.Limonciello@dell.com wrote: > > No, current Windows versions don't transition to D3 with inbox NVME driver. > > You're correct, it's explicit state transitions even if APST was enabled > > (as this patch is currently doing as well). > > The proposed patch does too much, and your resume latency will be worse > off for doing an unnecessary controller reset. > > The following should be all that's needed if the device is spec > compliant. The resume part isn't necessary if npss is non-operational, but > we're not saving that info, and it shouldn't hurt to be explicit anyway. > > I don't have any PS capable devices, so this is just compile tested. > > --- > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > index 6265d9225ec8..ce8b9bc949b9 100644 > --- a/drivers/nvme/host/core.c > +++ b/drivers/nvme/host/core.c > @@ -1132,6 +1132,22 @@ static int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword > return ret; > } > > +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss) > +{ > + int ret; > + > + mutex_lock(&ctrl->scan_lock); > + nvme_start_freeze(ctrl); > + nvme_wait_freeze(ctrl); > + ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, npss, NULL, 0, > + NULL); > + nvme_unfreeze(ctrl); > + mutex_unlock(&ctrl->scan_lock); > + > + return ret; > +} > +EXPORT_SYMBOL_GPL(nvme_set_power); > + > int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count) > { > u32 q_count = (*count - 1) | ((*count - 1) << 16); > diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h > index 527d64545023..f2be6aad9804 100644 > --- a/drivers/nvme/host/nvme.h > +++ b/drivers/nvme/host/nvme.h > @@ -459,6 +459,7 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd, > unsigned timeout, int qid, int at_head, > blk_mq_req_flags_t flags, bool poll); > int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count); > +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss); > void nvme_stop_keep_alive(struct nvme_ctrl *ctrl); > int nvme_reset_ctrl(struct nvme_ctrl *ctrl); > int nvme_reset_ctrl_sync(struct nvme_ctrl *ctrl); > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c > index a90cf5d63aac..2c4154cb4e79 100644 > --- a/drivers/nvme/host/pci.c > +++ b/drivers/nvme/host/pci.c > @@ -18,6 +18,7 @@ > #include <linux/mutex.h> > #include <linux/once.h> > #include <linux/pci.h> > +#include <linux/suspend.h> > #include <linux/t10-pi.h> > #include <linux/types.h> > #include <linux/io-64-nonatomic-lo-hi.h> > @@ -2851,6 +2852,8 @@ static int nvme_suspend(struct device *dev) > struct pci_dev *pdev = to_pci_dev(dev); > struct nvme_dev *ndev = pci_get_drvdata(pdev); > > + if (!pm_suspend_via_firmware()) > + return nvme_set_power(&ndev->ctrl, ndev->ctrl.npss); You probably want to call pci_save_state(pdev) in the branch above to prevent pci_pm_suspend_noirq() from calling pci_prepare_to_sleep() going forward, so I would write this routine as if (pm_suspend_via_firmware()) { nvme_dev_disable(ndev, true); return 0; } pci_save_state(pdev) return nvme_set_power(&ndev->ctrl, ndev->ctrl.npss); > nvme_dev_disable(ndev, true); > return 0; > } > @@ -2860,6 +2863,8 @@ static int nvme_resume(struct device *dev) > struct pci_dev *pdev = to_pci_dev(dev); > struct nvme_dev *ndev = pci_get_drvdata(pdev); > > + if (!pm_suspend_via_firmware()) > + return nvme_set_power(&ndev->ctrl, 0); > nvme_reset_ctrl(&ndev->ctrl); > return 0; > } The rest of the patch LGTM. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 20:54 ` Rafael J. Wysocki @ 2019-05-09 20:54 ` Rafael J. Wysocki 2019-05-09 21:16 ` Keith Busch 1 sibling, 0 replies; 60+ messages in thread From: Rafael J. Wysocki @ 2019-05-09 20:54 UTC (permalink / raw) To: Keith Busch Cc: Mario Limonciello, Kai-Heng Feng, Christoph Hellwig, Jens Axboe, Sagi Grimberg, Rafael J. Wysocki, Linux PM, Rafael Wysocki, Linux Kernel Mailing List, linux-nvme, Keith Busch On Thu, May 9, 2019 at 9:33 PM Keith Busch <kbusch@kernel.org> wrote: > > On Thu, May 09, 2019 at 06:57:34PM +0000, Mario.Limonciello@dell.com wrote: > > No, current Windows versions don't transition to D3 with inbox NVME driver. > > You're correct, it's explicit state transitions even if APST was enabled > > (as this patch is currently doing as well). > > The proposed patch does too much, and your resume latency will be worse > off for doing an unnecessary controller reset. > > The following should be all that's needed if the device is spec > compliant. The resume part isn't necessary if npss is non-operational, but > we're not saving that info, and it shouldn't hurt to be explicit anyway. > > I don't have any PS capable devices, so this is just compile tested. > > --- > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > index 6265d9225ec8..ce8b9bc949b9 100644 > --- a/drivers/nvme/host/core.c > +++ b/drivers/nvme/host/core.c > @@ -1132,6 +1132,22 @@ static int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword > return ret; > } > > +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss) > +{ > + int ret; > + > + mutex_lock(&ctrl->scan_lock); > + nvme_start_freeze(ctrl); > + nvme_wait_freeze(ctrl); > + ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, npss, NULL, 0, > + NULL); > + nvme_unfreeze(ctrl); > + mutex_unlock(&ctrl->scan_lock); > + > + return ret; > +} > +EXPORT_SYMBOL_GPL(nvme_set_power); > + > int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count) > { > u32 q_count = (*count - 1) | ((*count - 1) << 16); > diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h > index 527d64545023..f2be6aad9804 100644 > --- a/drivers/nvme/host/nvme.h > +++ b/drivers/nvme/host/nvme.h > @@ -459,6 +459,7 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd, > unsigned timeout, int qid, int at_head, > blk_mq_req_flags_t flags, bool poll); > int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count); > +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss); > void nvme_stop_keep_alive(struct nvme_ctrl *ctrl); > int nvme_reset_ctrl(struct nvme_ctrl *ctrl); > int nvme_reset_ctrl_sync(struct nvme_ctrl *ctrl); > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c > index a90cf5d63aac..2c4154cb4e79 100644 > --- a/drivers/nvme/host/pci.c > +++ b/drivers/nvme/host/pci.c > @@ -18,6 +18,7 @@ > #include <linux/mutex.h> > #include <linux/once.h> > #include <linux/pci.h> > +#include <linux/suspend.h> > #include <linux/t10-pi.h> > #include <linux/types.h> > #include <linux/io-64-nonatomic-lo-hi.h> > @@ -2851,6 +2852,8 @@ static int nvme_suspend(struct device *dev) > struct pci_dev *pdev = to_pci_dev(dev); > struct nvme_dev *ndev = pci_get_drvdata(pdev); > > + if (!pm_suspend_via_firmware()) > + return nvme_set_power(&ndev->ctrl, ndev->ctrl.npss); You probably want to call pci_save_state(pdev) in the branch above to prevent pci_pm_suspend_noirq() from calling pci_prepare_to_sleep() going forward, so I would write this routine as if (pm_suspend_via_firmware()) { nvme_dev_disable(ndev, true); return 0; } pci_save_state(pdev) return nvme_set_power(&ndev->ctrl, ndev->ctrl.npss); > nvme_dev_disable(ndev, true); > return 0; > } > @@ -2860,6 +2863,8 @@ static int nvme_resume(struct device *dev) > struct pci_dev *pdev = to_pci_dev(dev); > struct nvme_dev *ndev = pci_get_drvdata(pdev); > > + if (!pm_suspend_via_firmware()) > + return nvme_set_power(&ndev->ctrl, 0); > nvme_reset_ctrl(&ndev->ctrl); > return 0; > } The rest of the patch LGTM. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 20:54 ` Rafael J. Wysocki 2019-05-09 20:54 ` Rafael J. Wysocki @ 2019-05-09 21:16 ` Keith Busch 2019-05-09 21:16 ` Keith Busch 2019-05-09 21:39 ` Rafael J. Wysocki 1 sibling, 2 replies; 60+ messages in thread From: Keith Busch @ 2019-05-09 21:16 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Mario Limonciello, Kai-Heng Feng, Christoph Hellwig, Jens Axboe, Sagi Grimberg, Linux PM, Rafael Wysocki, Linux Kernel Mailing List, linux-nvme, Keith Busch On Thu, May 09, 2019 at 10:54:04PM +0200, Rafael J. Wysocki wrote: > On Thu, May 9, 2019 at 9:33 PM Keith Busch <kbusch@kernel.org> wrote: > > #include <linux/io-64-nonatomic-lo-hi.h> > > @@ -2851,6 +2852,8 @@ static int nvme_suspend(struct device *dev) > > struct pci_dev *pdev = to_pci_dev(dev); > > struct nvme_dev *ndev = pci_get_drvdata(pdev); > > > > + if (!pm_suspend_via_firmware()) > > + return nvme_set_power(&ndev->ctrl, ndev->ctrl.npss); > > You probably want to call pci_save_state(pdev) in the branch above to > prevent pci_pm_suspend_noirq() from calling pci_prepare_to_sleep() > going forward, so I would write this routine as > > if (pm_suspend_via_firmware()) { > nvme_dev_disable(ndev, true); > return 0; > } > > pci_save_state(pdev) > return nvme_set_power(&ndev->ctrl, ndev->ctrl.npss); Ah, good point. I'll make sure that's added and will wait to see hear if there's any other feedback. I am trying to test the paths by faking out PS capabilities, and have a question on how to force each: Running "rtcwake -m freeze ...", that takes the !pm_suspend_via_firmware() path as I expected. But trying to test the original path, I thought using "-m mem" would have been a suspend via firmware, but that is still returning false. Is that expected? I've only tried this on one platform so far, so might just be this particular one is missing a firmware capability. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 21:16 ` Keith Busch @ 2019-05-09 21:16 ` Keith Busch 2019-05-09 21:39 ` Rafael J. Wysocki 1 sibling, 0 replies; 60+ messages in thread From: Keith Busch @ 2019-05-09 21:16 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Mario Limonciello, Kai-Heng Feng, Christoph Hellwig, Jens Axboe, Sagi Grimberg, Linux PM, Rafael Wysocki, Linux Kernel Mailing List, linux-nvme, Keith Busch On Thu, May 09, 2019 at 10:54:04PM +0200, Rafael J. Wysocki wrote: > On Thu, May 9, 2019 at 9:33 PM Keith Busch <kbusch@kernel.org> wrote: > > #include <linux/io-64-nonatomic-lo-hi.h> > > @@ -2851,6 +2852,8 @@ static int nvme_suspend(struct device *dev) > > struct pci_dev *pdev = to_pci_dev(dev); > > struct nvme_dev *ndev = pci_get_drvdata(pdev); > > > > + if (!pm_suspend_via_firmware()) > > + return nvme_set_power(&ndev->ctrl, ndev->ctrl.npss); > > You probably want to call pci_save_state(pdev) in the branch above to > prevent pci_pm_suspend_noirq() from calling pci_prepare_to_sleep() > going forward, so I would write this routine as > > if (pm_suspend_via_firmware()) { > nvme_dev_disable(ndev, true); > return 0; > } > > pci_save_state(pdev) > return nvme_set_power(&ndev->ctrl, ndev->ctrl.npss); Ah, good point. I'll make sure that's added and will wait to see hear if there's any other feedback. I am trying to test the paths by faking out PS capabilities, and have a question on how to force each: Running "rtcwake -m freeze ...", that takes the !pm_suspend_via_firmware() path as I expected. But trying to test the original path, I thought using "-m mem" would have been a suspend via firmware, but that is still returning false. Is that expected? I've only tried this on one platform so far, so might just be this particular one is missing a firmware capability. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 21:16 ` Keith Busch 2019-05-09 21:16 ` Keith Busch @ 2019-05-09 21:39 ` Rafael J. Wysocki 2019-05-09 21:39 ` Rafael J. Wysocki 1 sibling, 1 reply; 60+ messages in thread From: Rafael J. Wysocki @ 2019-05-09 21:39 UTC (permalink / raw) To: Keith Busch Cc: Rafael J. Wysocki, Mario Limonciello, Kai-Heng Feng, Christoph Hellwig, Jens Axboe, Sagi Grimberg, Linux PM, Rafael Wysocki, Linux Kernel Mailing List, linux-nvme, Keith Busch On Thu, May 9, 2019 at 11:21 PM Keith Busch <kbusch@kernel.org> wrote: > > On Thu, May 09, 2019 at 10:54:04PM +0200, Rafael J. Wysocki wrote: > > On Thu, May 9, 2019 at 9:33 PM Keith Busch <kbusch@kernel.org> wrote: > > > #include <linux/io-64-nonatomic-lo-hi.h> > > > @@ -2851,6 +2852,8 @@ static int nvme_suspend(struct device *dev) > > > struct pci_dev *pdev = to_pci_dev(dev); > > > struct nvme_dev *ndev = pci_get_drvdata(pdev); > > > > > > + if (!pm_suspend_via_firmware()) > > > + return nvme_set_power(&ndev->ctrl, ndev->ctrl.npss); > > > > You probably want to call pci_save_state(pdev) in the branch above to > > prevent pci_pm_suspend_noirq() from calling pci_prepare_to_sleep() > > going forward, so I would write this routine as > > > > if (pm_suspend_via_firmware()) { > > nvme_dev_disable(ndev, true); > > return 0; > > } > > > > pci_save_state(pdev) > > return nvme_set_power(&ndev->ctrl, ndev->ctrl.npss); > > Ah, good point. I'll make sure that's added and will wait to see hear if > there's any other feedback. > > I am trying to test the paths by faking out PS capabilities, and have > a question on how to force each: > > Running "rtcwake -m freeze ...", that takes the !pm_suspend_via_firmware() > path as I expected. > > But trying to test the original path, I thought using "-m mem" would > have been a suspend via firmware, but that is still returning false. > > Is that expected? Yes, if s2idle is the default on that platform. You should be able to switch over to S3 by writing "deep" into /sys/power/mem_sleep as long as it is supported on that platform at all. > I've only tried this on one platform so far, so might > just be this particular one is missing a firmware capability. You can check that by looking into /sys/power/mem_sleep (if there is only "[s2idle]" in there, S3 is not supported). ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 21:39 ` Rafael J. Wysocki @ 2019-05-09 21:39 ` Rafael J. Wysocki 0 siblings, 0 replies; 60+ messages in thread From: Rafael J. Wysocki @ 2019-05-09 21:39 UTC (permalink / raw) To: Keith Busch Cc: Rafael J. Wysocki, Mario Limonciello, Kai-Heng Feng, Christoph Hellwig, Jens Axboe, Sagi Grimberg, Linux PM, Rafael Wysocki, Linux Kernel Mailing List, linux-nvme, Keith Busch On Thu, May 9, 2019 at 11:21 PM Keith Busch <kbusch@kernel.org> wrote: > > On Thu, May 09, 2019 at 10:54:04PM +0200, Rafael J. Wysocki wrote: > > On Thu, May 9, 2019 at 9:33 PM Keith Busch <kbusch@kernel.org> wrote: > > > #include <linux/io-64-nonatomic-lo-hi.h> > > > @@ -2851,6 +2852,8 @@ static int nvme_suspend(struct device *dev) > > > struct pci_dev *pdev = to_pci_dev(dev); > > > struct nvme_dev *ndev = pci_get_drvdata(pdev); > > > > > > + if (!pm_suspend_via_firmware()) > > > + return nvme_set_power(&ndev->ctrl, ndev->ctrl.npss); > > > > You probably want to call pci_save_state(pdev) in the branch above to > > prevent pci_pm_suspend_noirq() from calling pci_prepare_to_sleep() > > going forward, so I would write this routine as > > > > if (pm_suspend_via_firmware()) { > > nvme_dev_disable(ndev, true); > > return 0; > > } > > > > pci_save_state(pdev) > > return nvme_set_power(&ndev->ctrl, ndev->ctrl.npss); > > Ah, good point. I'll make sure that's added and will wait to see hear if > there's any other feedback. > > I am trying to test the paths by faking out PS capabilities, and have > a question on how to force each: > > Running "rtcwake -m freeze ...", that takes the !pm_suspend_via_firmware() > path as I expected. > > But trying to test the original path, I thought using "-m mem" would > have been a suspend via firmware, but that is still returning false. > > Is that expected? Yes, if s2idle is the default on that platform. You should be able to switch over to S3 by writing "deep" into /sys/power/mem_sleep as long as it is supported on that platform at all. > I've only tried this on one platform so far, so might > just be this particular one is missing a firmware capability. You can check that by looking into /sys/power/mem_sleep (if there is only "[s2idle]" in there, S3 is not supported). ^ permalink raw reply [flat|nested] 60+ messages in thread
* RE: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 19:28 ` Keith Busch 2019-05-09 19:28 ` Keith Busch 2019-05-09 20:54 ` Rafael J. Wysocki @ 2019-05-09 21:37 ` Mario.Limonciello 2019-05-09 21:37 ` Mario.Limonciello 2019-05-09 21:54 ` Keith Busch 2019-05-10 5:30 ` Christoph Hellwig 3 siblings, 2 replies; 60+ messages in thread From: Mario.Limonciello @ 2019-05-09 21:37 UTC (permalink / raw) To: kbusch Cc: kai.heng.feng, hch, axboe, sagi, rafael, linux-pm, rafael.j.wysocki, linux-kernel, linux-nvme, keith.busch > On Thu, May 09, 2019 at 06:57:34PM +0000, Mario.Limonciello@dell.com wrote: > > No, current Windows versions don't transition to D3 with inbox NVME driver. > > You're correct, it's explicit state transitions even if APST was enabled > > (as this patch is currently doing as well). > > The proposed patch does too much, and your resume latency will be worse > off for doing an unnecessary controller reset. > > The following should be all that's needed if the device is spec > compliant. The resume part isn't necessary if npss is non-operational, but > we're not saving that info, and it shouldn't hurt to be explicit anyway. > > I don't have any PS capable devices, so this is just compile tested. > > --- > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > index 6265d9225ec8..ce8b9bc949b9 100644 > --- a/drivers/nvme/host/core.c > +++ b/drivers/nvme/host/core.c > @@ -1132,6 +1132,22 @@ static int nvme_set_features(struct nvme_ctrl *dev, > unsigned fid, unsigned dword > return ret; > } > > +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss) > +{ > + int ret; > + > + mutex_lock(&ctrl->scan_lock); > + nvme_start_freeze(ctrl); > + nvme_wait_freeze(ctrl); > + ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, npss, NULL, 0, > + NULL); > + nvme_unfreeze(ctrl); > + mutex_unlock(&ctrl->scan_lock); > + > + return ret; > +} > +EXPORT_SYMBOL_GPL(nvme_set_power); I believe without memory barriers at the end disks with HMB this will still kernel panic (Such as Toshiba BG3). This still allows D3 which we found at least failed to go into deepest state and blocked platform s0ix for the following SSDs (maybe others): Hynix PC601 LiteOn CL1 > + > int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count) > { > u32 q_count = (*count - 1) | ((*count - 1) << 16); > diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h > index 527d64545023..f2be6aad9804 100644 > --- a/drivers/nvme/host/nvme.h > +++ b/drivers/nvme/host/nvme.h > @@ -459,6 +459,7 @@ int __nvme_submit_sync_cmd(struct request_queue *q, > struct nvme_command *cmd, > unsigned timeout, int qid, int at_head, > blk_mq_req_flags_t flags, bool poll); > int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count); > +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss); > void nvme_stop_keep_alive(struct nvme_ctrl *ctrl); > int nvme_reset_ctrl(struct nvme_ctrl *ctrl); > int nvme_reset_ctrl_sync(struct nvme_ctrl *ctrl); > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c > index a90cf5d63aac..2c4154cb4e79 100644 > --- a/drivers/nvme/host/pci.c > +++ b/drivers/nvme/host/pci.c > @@ -18,6 +18,7 @@ > #include <linux/mutex.h> > #include <linux/once.h> > #include <linux/pci.h> > +#include <linux/suspend.h> > #include <linux/t10-pi.h> > #include <linux/types.h> > #include <linux/io-64-nonatomic-lo-hi.h> > @@ -2851,6 +2852,8 @@ static int nvme_suspend(struct device *dev) > struct pci_dev *pdev = to_pci_dev(dev); > struct nvme_dev *ndev = pci_get_drvdata(pdev); > > + if (!pm_suspend_via_firmware()) > + return nvme_set_power(&ndev->ctrl, ndev->ctrl.npss); > nvme_dev_disable(ndev, true); > return 0; > } > @@ -2860,6 +2863,8 @@ static int nvme_resume(struct device *dev) > struct pci_dev *pdev = to_pci_dev(dev); > struct nvme_dev *ndev = pci_get_drvdata(pdev); > > + if (!pm_suspend_via_firmware()) > + return nvme_set_power(&ndev->ctrl, 0); > nvme_reset_ctrl(&ndev->ctrl); > return 0; > } > -- ^ permalink raw reply [flat|nested] 60+ messages in thread
* RE: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 21:37 ` Mario.Limonciello @ 2019-05-09 21:37 ` Mario.Limonciello 2019-05-09 21:54 ` Keith Busch 1 sibling, 0 replies; 60+ messages in thread From: Mario.Limonciello @ 2019-05-09 21:37 UTC (permalink / raw) To: kbusch Cc: kai.heng.feng, hch, axboe, sagi, rafael, linux-pm, rafael.j.wysocki, linux-kernel, linux-nvme, keith.busch > On Thu, May 09, 2019 at 06:57:34PM +0000, Mario.Limonciello@dell.com wrote: > > No, current Windows versions don't transition to D3 with inbox NVME driver. > > You're correct, it's explicit state transitions even if APST was enabled > > (as this patch is currently doing as well). > > The proposed patch does too much, and your resume latency will be worse > off for doing an unnecessary controller reset. > > The following should be all that's needed if the device is spec > compliant. The resume part isn't necessary if npss is non-operational, but > we're not saving that info, and it shouldn't hurt to be explicit anyway. > > I don't have any PS capable devices, so this is just compile tested. > > --- > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > index 6265d9225ec8..ce8b9bc949b9 100644 > --- a/drivers/nvme/host/core.c > +++ b/drivers/nvme/host/core.c > @@ -1132,6 +1132,22 @@ static int nvme_set_features(struct nvme_ctrl *dev, > unsigned fid, unsigned dword > return ret; > } > > +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss) > +{ > + int ret; > + > + mutex_lock(&ctrl->scan_lock); > + nvme_start_freeze(ctrl); > + nvme_wait_freeze(ctrl); > + ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, npss, NULL, 0, > + NULL); > + nvme_unfreeze(ctrl); > + mutex_unlock(&ctrl->scan_lock); > + > + return ret; > +} > +EXPORT_SYMBOL_GPL(nvme_set_power); I believe without memory barriers at the end disks with HMB this will still kernel panic (Such as Toshiba BG3). This still allows D3 which we found at least failed to go into deepest state and blocked platform s0ix for the following SSDs (maybe others): Hynix PC601 LiteOn CL1 > + > int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count) > { > u32 q_count = (*count - 1) | ((*count - 1) << 16); > diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h > index 527d64545023..f2be6aad9804 100644 > --- a/drivers/nvme/host/nvme.h > +++ b/drivers/nvme/host/nvme.h > @@ -459,6 +459,7 @@ int __nvme_submit_sync_cmd(struct request_queue *q, > struct nvme_command *cmd, > unsigned timeout, int qid, int at_head, > blk_mq_req_flags_t flags, bool poll); > int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count); > +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss); > void nvme_stop_keep_alive(struct nvme_ctrl *ctrl); > int nvme_reset_ctrl(struct nvme_ctrl *ctrl); > int nvme_reset_ctrl_sync(struct nvme_ctrl *ctrl); > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c > index a90cf5d63aac..2c4154cb4e79 100644 > --- a/drivers/nvme/host/pci.c > +++ b/drivers/nvme/host/pci.c > @@ -18,6 +18,7 @@ > #include <linux/mutex.h> > #include <linux/once.h> > #include <linux/pci.h> > +#include <linux/suspend.h> > #include <linux/t10-pi.h> > #include <linux/types.h> > #include <linux/io-64-nonatomic-lo-hi.h> > @@ -2851,6 +2852,8 @@ static int nvme_suspend(struct device *dev) > struct pci_dev *pdev = to_pci_dev(dev); > struct nvme_dev *ndev = pci_get_drvdata(pdev); > > + if (!pm_suspend_via_firmware()) > + return nvme_set_power(&ndev->ctrl, ndev->ctrl.npss); > nvme_dev_disable(ndev, true); > return 0; > } > @@ -2860,6 +2863,8 @@ static int nvme_resume(struct device *dev) > struct pci_dev *pdev = to_pci_dev(dev); > struct nvme_dev *ndev = pci_get_drvdata(pdev); > > + if (!pm_suspend_via_firmware()) > + return nvme_set_power(&ndev->ctrl, 0); > nvme_reset_ctrl(&ndev->ctrl); > return 0; > } > -- ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 21:37 ` Mario.Limonciello 2019-05-09 21:37 ` Mario.Limonciello @ 2019-05-09 21:54 ` Keith Busch 2019-05-09 21:54 ` Keith Busch 2019-05-09 22:19 ` Mario.Limonciello 1 sibling, 2 replies; 60+ messages in thread From: Keith Busch @ 2019-05-09 21:54 UTC (permalink / raw) To: Mario.Limonciello Cc: kai.heng.feng, hch, axboe, sagi, rafael, linux-pm, rafael.j.wysocki, linux-kernel, linux-nvme, keith.busch On Thu, May 09, 2019 at 09:37:58PM +0000, Mario.Limonciello@dell.com wrote: > > +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss) > > +{ > > + int ret; > > + > > + mutex_lock(&ctrl->scan_lock); > > + nvme_start_freeze(ctrl); > > + nvme_wait_freeze(ctrl); > > + ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, npss, NULL, 0, > > + NULL); > > + nvme_unfreeze(ctrl); > > + mutex_unlock(&ctrl->scan_lock); > > + > > + return ret; > > +} > > +EXPORT_SYMBOL_GPL(nvme_set_power); > > I believe without memory barriers at the end disks with HMB this will > still kernel panic (Such as Toshiba BG3). Well, the mutex has an implied memory barrier, but your HMB explanation doesn't make much sense to me anyway. The "mb()" in this thread's original patch is a CPU memory barrier, and the CPU had better not be accessing HMB memory. Is there something else going on here? > This still allows D3 which we found at least failed to go into deepest state and blocked > platform s0ix for the following SSDs (maybe others): > Hynix PC601 > LiteOn CL1 We usually write features to spec first, then quirk non-compliant devices after. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 21:54 ` Keith Busch @ 2019-05-09 21:54 ` Keith Busch 2019-05-09 22:19 ` Mario.Limonciello 1 sibling, 0 replies; 60+ messages in thread From: Keith Busch @ 2019-05-09 21:54 UTC (permalink / raw) To: Mario.Limonciello Cc: kai.heng.feng, hch, axboe, sagi, rafael, linux-pm, rafael.j.wysocki, linux-kernel, linux-nvme, keith.busch On Thu, May 09, 2019 at 09:37:58PM +0000, Mario.Limonciello@dell.com wrote: > > +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss) > > +{ > > + int ret; > > + > > + mutex_lock(&ctrl->scan_lock); > > + nvme_start_freeze(ctrl); > > + nvme_wait_freeze(ctrl); > > + ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, npss, NULL, 0, > > + NULL); > > + nvme_unfreeze(ctrl); > > + mutex_unlock(&ctrl->scan_lock); > > + > > + return ret; > > +} > > +EXPORT_SYMBOL_GPL(nvme_set_power); > > I believe without memory barriers at the end disks with HMB this will > still kernel panic (Such as Toshiba BG3). Well, the mutex has an implied memory barrier, but your HMB explanation doesn't make much sense to me anyway. The "mb()" in this thread's original patch is a CPU memory barrier, and the CPU had better not be accessing HMB memory. Is there something else going on here? > This still allows D3 which we found at least failed to go into deepest state and blocked > platform s0ix for the following SSDs (maybe others): > Hynix PC601 > LiteOn CL1 We usually write features to spec first, then quirk non-compliant devices after. ^ permalink raw reply [flat|nested] 60+ messages in thread
* RE: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 21:54 ` Keith Busch 2019-05-09 21:54 ` Keith Busch @ 2019-05-09 22:19 ` Mario.Limonciello 2019-05-09 22:19 ` Mario.Limonciello 2019-05-10 6:05 ` Kai-Heng Feng 1 sibling, 2 replies; 60+ messages in thread From: Mario.Limonciello @ 2019-05-09 22:19 UTC (permalink / raw) To: kbusch Cc: kai.heng.feng, hch, axboe, sagi, rafael, linux-pm, rafael.j.wysocki, linux-kernel, linux-nvme, keith.busch > -----Original Message----- > From: Keith Busch <kbusch@kernel.org> > Sent: Thursday, May 9, 2019 4:54 PM > To: Limonciello, Mario > Cc: kai.heng.feng@canonical.com; hch@lst.de; axboe@fb.com; > sagi@grimberg.me; rafael@kernel.org; linux-pm@vger.kernel.org; > rafael.j.wysocki@intel.com; linux-kernel@vger.kernel.org; linux- > nvme@lists.infradead.org; keith.busch@intel.com > Subject: Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on > Suspend-to-Idle > > > [EXTERNAL EMAIL] > > On Thu, May 09, 2019 at 09:37:58PM +0000, Mario.Limonciello@dell.com wrote: > > > +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss) > > > +{ > > > + int ret; > > > + > > > + mutex_lock(&ctrl->scan_lock); > > > + nvme_start_freeze(ctrl); > > > + nvme_wait_freeze(ctrl); > > > + ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, npss, NULL, 0, > > > + NULL); > > > + nvme_unfreeze(ctrl); > > > + mutex_unlock(&ctrl->scan_lock); > > > + > > > + return ret; > > > +} > > > +EXPORT_SYMBOL_GPL(nvme_set_power); > > > > I believe without memory barriers at the end disks with HMB this will > > still kernel panic (Such as Toshiba BG3). > > Well, the mutex has an implied memory barrier, but your HMB explanation > doesn't make much sense to me anyway. The "mb()" in this thread's original > patch is a CPU memory barrier, and the CPU had better not be accessing > HMB memory. Is there something else going on here? Kai Heng will need to speak up a bit in his time zone as he has this disk on hand, but what I recall from our discussion was that DMA operation MemRd after resume was the source of the hang. > > > This still allows D3 which we found at least failed to go into deepest state and > blocked > > platform s0ix for the following SSDs (maybe others): > > Hynix PC601 > > LiteOn CL1 > > We usually write features to spec first, then quirk non-compliant > devices after. NVME spec doesn't talk about a relationship between SetFeatures w/ NVME_FEAT_POWER_MGMGT and D3 support, nor order of events. This is why we opened a dialog with storage vendors, including contrasting the behavior of Microsoft Windows inbox NVME driver and Intel's Windows RST driver. Those two I mention that come to mind immediately because they were most recently tested to fail. Our discussion with storage vendors overwhelmingly requested that we don't use D3 under S2I because their current firmware architecture won't support it. For example one vendor told us with current implementation that receiving D3hot after NVME shutdown will prevent being able to enter L1.2. D3hot entry was supported by an IRQ handler that isn't serviced in NVME shutdown state. Another vendor told us that with current implementation it's impossible to transition to PS4 (at least via APST) while L1.2 D3hot is active. ^ permalink raw reply [flat|nested] 60+ messages in thread
* RE: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 22:19 ` Mario.Limonciello @ 2019-05-09 22:19 ` Mario.Limonciello 2019-05-10 6:05 ` Kai-Heng Feng 1 sibling, 0 replies; 60+ messages in thread From: Mario.Limonciello @ 2019-05-09 22:19 UTC (permalink / raw) To: kbusch Cc: kai.heng.feng, hch, axboe, sagi, rafael, linux-pm, rafael.j.wysocki, linux-kernel, linux-nvme, keith.busch > -----Original Message----- > From: Keith Busch <kbusch@kernel.org> > Sent: Thursday, May 9, 2019 4:54 PM > To: Limonciello, Mario > Cc: kai.heng.feng@canonical.com; hch@lst.de; axboe@fb.com; > sagi@grimberg.me; rafael@kernel.org; linux-pm@vger.kernel.org; > rafael.j.wysocki@intel.com; linux-kernel@vger.kernel.org; linux- > nvme@lists.infradead.org; keith.busch@intel.com > Subject: Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on > Suspend-to-Idle > > > [EXTERNAL EMAIL] > > On Thu, May 09, 2019 at 09:37:58PM +0000, Mario.Limonciello@dell.com wrote: > > > +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss) > > > +{ > > > + int ret; > > > + > > > + mutex_lock(&ctrl->scan_lock); > > > + nvme_start_freeze(ctrl); > > > + nvme_wait_freeze(ctrl); > > > + ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, npss, NULL, 0, > > > + NULL); > > > + nvme_unfreeze(ctrl); > > > + mutex_unlock(&ctrl->scan_lock); > > > + > > > + return ret; > > > +} > > > +EXPORT_SYMBOL_GPL(nvme_set_power); > > > > I believe without memory barriers at the end disks with HMB this will > > still kernel panic (Such as Toshiba BG3). > > Well, the mutex has an implied memory barrier, but your HMB explanation > doesn't make much sense to me anyway. The "mb()" in this thread's original > patch is a CPU memory barrier, and the CPU had better not be accessing > HMB memory. Is there something else going on here? Kai Heng will need to speak up a bit in his time zone as he has this disk on hand, but what I recall from our discussion was that DMA operation MemRd after resume was the source of the hang. > > > This still allows D3 which we found at least failed to go into deepest state and > blocked > > platform s0ix for the following SSDs (maybe others): > > Hynix PC601 > > LiteOn CL1 > > We usually write features to spec first, then quirk non-compliant > devices after. NVME spec doesn't talk about a relationship between SetFeatures w/ NVME_FEAT_POWER_MGMGT and D3 support, nor order of events. This is why we opened a dialog with storage vendors, including contrasting the behavior of Microsoft Windows inbox NVME driver and Intel's Windows RST driver. Those two I mention that come to mind immediately because they were most recently tested to fail. Our discussion with storage vendors overwhelmingly requested that we don't use D3 under S2I because their current firmware architecture won't support it. For example one vendor told us with current implementation that receiving D3hot after NVME shutdown will prevent being able to enter L1.2. D3hot entry was supported by an IRQ handler that isn't serviced in NVME shutdown state. Another vendor told us that with current implementation it's impossible to transition to PS4 (at least via APST) while L1.2 D3hot is active. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 22:19 ` Mario.Limonciello 2019-05-09 22:19 ` Mario.Limonciello @ 2019-05-10 6:05 ` Kai-Heng Feng 2019-05-10 6:05 ` Kai-Heng Feng ` (2 more replies) 1 sibling, 3 replies; 60+ messages in thread From: Kai-Heng Feng @ 2019-05-10 6:05 UTC (permalink / raw) To: Mario.Limonciello Cc: Keith Busch, hch, axboe, sagi, rafael, linux-pm, rafael.j.wysocki, linux-kernel, linux-nvme, keith.busch at 06:19, <Mario.Limonciello@dell.com> <Mario.Limonciello@dell.com> wrote: >> -----Original Message----- >> From: Keith Busch <kbusch@kernel.org> >> Sent: Thursday, May 9, 2019 4:54 PM >> To: Limonciello, Mario >> Cc: kai.heng.feng@canonical.com; hch@lst.de; axboe@fb.com; >> sagi@grimberg.me; rafael@kernel.org; linux-pm@vger.kernel.org; >> rafael.j.wysocki@intel.com; linux-kernel@vger.kernel.org; linux- >> nvme@lists.infradead.org; keith.busch@intel.com >> Subject: Re: [PATCH] nvme-pci: Use non-operational power state instead >> of D3 on >> Suspend-to-Idle >> >> >> [EXTERNAL EMAIL] >> >> On Thu, May 09, 2019 at 09:37:58PM +0000, Mario.Limonciello@dell.com >> wrote: >>>> +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss) >>>> +{ >>>> + int ret; >>>> + >>>> + mutex_lock(&ctrl->scan_lock); >>>> + nvme_start_freeze(ctrl); >>>> + nvme_wait_freeze(ctrl); >>>> + ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, npss, NULL, 0, >>>> + NULL); >>>> + nvme_unfreeze(ctrl); >>>> + mutex_unlock(&ctrl->scan_lock); >>>> + >>>> + return ret; >>>> +} >>>> +EXPORT_SYMBOL_GPL(nvme_set_power); >>> >>> I believe without memory barriers at the end disks with HMB this will >>> still kernel panic (Such as Toshiba BG3). >> >> Well, the mutex has an implied memory barrier, but your HMB explanation >> doesn't make much sense to me anyway. The "mb()" in this thread's original >> patch is a CPU memory barrier, and the CPU had better not be accessing >> HMB memory. Is there something else going on here? > > Kai Heng will need to speak up a bit in his time zone as he has this disk > on hand, > but what I recall from our discussion was that DMA operation MemRd after > resume was the source of the hang. Yes, that’ what I was told by the NVMe vendor, so all I know is to impose a memory barrier. If mb() shouldn’t be used here, what’s the correct variant to use in this context? > >>> This still allows D3 which we found at least failed to go into deepest >>> state and >> blocked >>> platform s0ix for the following SSDs (maybe others): >>> Hynix PC601 >>> LiteOn CL1 >> >> We usually write features to spec first, then quirk non-compliant >> devices after. > > NVME spec doesn't talk about a relationship between SetFeatures w/ > NVME_FEAT_POWER_MGMGT and D3 support, nor order of events. > > This is why we opened a dialog with storage vendors, including > contrasting the behavior > of Microsoft Windows inbox NVME driver and Intel's Windows RST driver. > > Those two I mention that come to mind immediately because they were most > recently > tested to fail. Our discussion with storage vendors overwhelmingly > requested > that we don't use D3 under S2I because their current firmware > architecture won't > support it. > > For example one vendor told us with current implementation that receiving > D3hot > after NVME shutdown will prevent being able to enter L1.2. D3hot entry > was supported > by an IRQ handler that isn't serviced in NVME shutdown state. > > Another vendor told us that with current implementation it's impossible > to transition > to PS4 (at least via APST) while L1.2 D3hot is active. I tested the patch from Keith and it has two issues just as simply skipping nvme_dev_disable(): 1) It consumes more power in S2I 2) System freeze after resume Also I don’t think it’s a spec. It’s a guide to let OEM/ODM pick and assemble Modern Standby compliant machines, so what Windows NVMe driver really does is still opaque to us. Kai-Heng ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-10 6:05 ` Kai-Heng Feng @ 2019-05-10 6:05 ` Kai-Heng Feng 2019-05-10 8:23 ` Rafael J. Wysocki 2019-05-10 14:02 ` Keith Busch 2 siblings, 0 replies; 60+ messages in thread From: Kai-Heng Feng @ 2019-05-10 6:05 UTC (permalink / raw) To: Mario.Limonciello Cc: Keith Busch, hch, axboe, sagi, rafael, linux-pm, rafael.j.wysocki, linux-kernel, linux-nvme, keith.busch at 06:19, <Mario.Limonciello@dell.com> <Mario.Limonciello@dell.com> wrote: >> -----Original Message----- >> From: Keith Busch <kbusch@kernel.org> >> Sent: Thursday, May 9, 2019 4:54 PM >> To: Limonciello, Mario >> Cc: kai.heng.feng@canonical.com; hch@lst.de; axboe@fb.com; >> sagi@grimberg.me; rafael@kernel.org; linux-pm@vger.kernel.org; >> rafael.j.wysocki@intel.com; linux-kernel@vger.kernel.org; linux- >> nvme@lists.infradead.org; keith.busch@intel.com >> Subject: Re: [PATCH] nvme-pci: Use non-operational power state instead >> of D3 on >> Suspend-to-Idle >> >> >> [EXTERNAL EMAIL] >> >> On Thu, May 09, 2019 at 09:37:58PM +0000, Mario.Limonciello@dell.com >> wrote: >>>> +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss) >>>> +{ >>>> + int ret; >>>> + >>>> + mutex_lock(&ctrl->scan_lock); >>>> + nvme_start_freeze(ctrl); >>>> + nvme_wait_freeze(ctrl); >>>> + ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, npss, NULL, 0, >>>> + NULL); >>>> + nvme_unfreeze(ctrl); >>>> + mutex_unlock(&ctrl->scan_lock); >>>> + >>>> + return ret; >>>> +} >>>> +EXPORT_SYMBOL_GPL(nvme_set_power); >>> >>> I believe without memory barriers at the end disks with HMB this will >>> still kernel panic (Such as Toshiba BG3). >> >> Well, the mutex has an implied memory barrier, but your HMB explanation >> doesn't make much sense to me anyway. The "mb()" in this thread's original >> patch is a CPU memory barrier, and the CPU had better not be accessing >> HMB memory. Is there something else going on here? > > Kai Heng will need to speak up a bit in his time zone as he has this disk > on hand, > but what I recall from our discussion was that DMA operation MemRd after > resume was the source of the hang. Yes, that’ what I was told by the NVMe vendor, so all I know is to impose a memory barrier. If mb() shouldn’t be used here, what’s the correct variant to use in this context? > >>> This still allows D3 which we found at least failed to go into deepest >>> state and >> blocked >>> platform s0ix for the following SSDs (maybe others): >>> Hynix PC601 >>> LiteOn CL1 >> >> We usually write features to spec first, then quirk non-compliant >> devices after. > > NVME spec doesn't talk about a relationship between SetFeatures w/ > NVME_FEAT_POWER_MGMGT and D3 support, nor order of events. > > This is why we opened a dialog with storage vendors, including > contrasting the behavior > of Microsoft Windows inbox NVME driver and Intel's Windows RST driver. > > Those two I mention that come to mind immediately because they were most > recently > tested to fail. Our discussion with storage vendors overwhelmingly > requested > that we don't use D3 under S2I because their current firmware > architecture won't > support it. > > For example one vendor told us with current implementation that receiving > D3hot > after NVME shutdown will prevent being able to enter L1.2. D3hot entry > was supported > by an IRQ handler that isn't serviced in NVME shutdown state. > > Another vendor told us that with current implementation it's impossible > to transition > to PS4 (at least via APST) while L1.2 D3hot is active. I tested the patch from Keith and it has two issues just as simply skipping nvme_dev_disable(): 1) It consumes more power in S2I 2) System freeze after resume Also I don’t think it’s a spec. It’s a guide to let OEM/ODM pick and assemble Modern Standby compliant machines, so what Windows NVMe driver really does is still opaque to us. Kai-Heng ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-10 6:05 ` Kai-Heng Feng 2019-05-10 6:05 ` Kai-Heng Feng @ 2019-05-10 8:23 ` Rafael J. Wysocki 2019-05-10 8:23 ` Rafael J. Wysocki ` (2 more replies) 2019-05-10 14:02 ` Keith Busch 2 siblings, 3 replies; 60+ messages in thread From: Rafael J. Wysocki @ 2019-05-10 8:23 UTC (permalink / raw) To: Kai-Heng Feng Cc: Mario Limonciello, Keith Busch, Christoph Hellwig, Jens Axboe, Sagi Grimberg, Rafael J. Wysocki, Linux PM, Rafael Wysocki, Linux Kernel Mailing List, linux-nvme, Keith Busch On Fri, May 10, 2019 at 8:08 AM Kai-Heng Feng <kai.heng.feng@canonical.com> wrote: > > at 06:19, <Mario.Limonciello@dell.com> <Mario.Limonciello@dell.com> wrote: > > >> -----Original Message----- > >> From: Keith Busch <kbusch@kernel.org> > >> Sent: Thursday, May 9, 2019 4:54 PM > >> To: Limonciello, Mario > >> Cc: kai.heng.feng@canonical.com; hch@lst.de; axboe@fb.com; > >> sagi@grimberg.me; rafael@kernel.org; linux-pm@vger.kernel.org; > >> rafael.j.wysocki@intel.com; linux-kernel@vger.kernel.org; linux- > >> nvme@lists.infradead.org; keith.busch@intel.com > >> Subject: Re: [PATCH] nvme-pci: Use non-operational power state instead > >> of D3 on > >> Suspend-to-Idle > >> > >> > >> [EXTERNAL EMAIL] > >> > >> On Thu, May 09, 2019 at 09:37:58PM +0000, Mario.Limonciello@dell.com > >> wrote: > >>>> +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss) > >>>> +{ > >>>> + int ret; > >>>> + > >>>> + mutex_lock(&ctrl->scan_lock); > >>>> + nvme_start_freeze(ctrl); > >>>> + nvme_wait_freeze(ctrl); > >>>> + ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, npss, NULL, 0, > >>>> + NULL); > >>>> + nvme_unfreeze(ctrl); > >>>> + mutex_unlock(&ctrl->scan_lock); > >>>> + > >>>> + return ret; > >>>> +} > >>>> +EXPORT_SYMBOL_GPL(nvme_set_power); > >>> > >>> I believe without memory barriers at the end disks with HMB this will > >>> still kernel panic (Such as Toshiba BG3). > >> > >> Well, the mutex has an implied memory barrier, but your HMB explanation > >> doesn't make much sense to me anyway. The "mb()" in this thread's original > >> patch is a CPU memory barrier, and the CPU had better not be accessing > >> HMB memory. Is there something else going on here? > > > > Kai Heng will need to speak up a bit in his time zone as he has this disk > > on hand, > > but what I recall from our discussion was that DMA operation MemRd after > > resume was the source of the hang. > > Yes, that’ what I was told by the NVMe vendor, so all I know is to impose a > memory barrier. > If mb() shouldn’t be used here, what’s the correct variant to use in this > context? > > > > >>> This still allows D3 which we found at least failed to go into deepest > >>> state and > >> blocked > >>> platform s0ix for the following SSDs (maybe others): > >>> Hynix PC601 > >>> LiteOn CL1 > >> > >> We usually write features to spec first, then quirk non-compliant > >> devices after. > > > > NVME spec doesn't talk about a relationship between SetFeatures w/ > > NVME_FEAT_POWER_MGMGT and D3 support, nor order of events. > > > > This is why we opened a dialog with storage vendors, including > > contrasting the behavior > > of Microsoft Windows inbox NVME driver and Intel's Windows RST driver. > > > > Those two I mention that come to mind immediately because they were most > > recently > > tested to fail. Our discussion with storage vendors overwhelmingly > > requested > > that we don't use D3 under S2I because their current firmware > > architecture won't > > support it. > > > > For example one vendor told us with current implementation that receiving > > D3hot > > after NVME shutdown will prevent being able to enter L1.2. D3hot entry > > was supported > > by an IRQ handler that isn't serviced in NVME shutdown state. > > > > Another vendor told us that with current implementation it's impossible > > to transition > > to PS4 (at least via APST) while L1.2 D3hot is active. > > I tested the patch from Keith and it has two issues just as simply skipping > nvme_dev_disable(): > 1) It consumes more power in S2I > 2) System freeze after resume Well, the Keith's patch doesn't prevent pci_pm_suspend_noirq() from asking for D3 and both of the symptoms above may be consequences of that in principle. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-10 8:23 ` Rafael J. Wysocki @ 2019-05-10 8:23 ` Rafael J. Wysocki 2019-05-10 13:52 ` Keith Busch 2019-05-10 15:15 ` Kai Heng Feng 2 siblings, 0 replies; 60+ messages in thread From: Rafael J. Wysocki @ 2019-05-10 8:23 UTC (permalink / raw) To: Kai-Heng Feng Cc: Mario Limonciello, Keith Busch, Christoph Hellwig, Jens Axboe, Sagi Grimberg, Rafael J. Wysocki, Linux PM, Rafael Wysocki, Linux Kernel Mailing List, linux-nvme, Keith Busch On Fri, May 10, 2019 at 8:08 AM Kai-Heng Feng <kai.heng.feng@canonical.com> wrote: > > at 06:19, <Mario.Limonciello@dell.com> <Mario.Limonciello@dell.com> wrote: > > >> -----Original Message----- > >> From: Keith Busch <kbusch@kernel.org> > >> Sent: Thursday, May 9, 2019 4:54 PM > >> To: Limonciello, Mario > >> Cc: kai.heng.feng@canonical.com; hch@lst.de; axboe@fb.com; > >> sagi@grimberg.me; rafael@kernel.org; linux-pm@vger.kernel.org; > >> rafael.j.wysocki@intel.com; linux-kernel@vger.kernel.org; linux- > >> nvme@lists.infradead.org; keith.busch@intel.com > >> Subject: Re: [PATCH] nvme-pci: Use non-operational power state instead > >> of D3 on > >> Suspend-to-Idle > >> > >> > >> [EXTERNAL EMAIL] > >> > >> On Thu, May 09, 2019 at 09:37:58PM +0000, Mario.Limonciello@dell.com > >> wrote: > >>>> +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss) > >>>> +{ > >>>> + int ret; > >>>> + > >>>> + mutex_lock(&ctrl->scan_lock); > >>>> + nvme_start_freeze(ctrl); > >>>> + nvme_wait_freeze(ctrl); > >>>> + ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, npss, NULL, 0, > >>>> + NULL); > >>>> + nvme_unfreeze(ctrl); > >>>> + mutex_unlock(&ctrl->scan_lock); > >>>> + > >>>> + return ret; > >>>> +} > >>>> +EXPORT_SYMBOL_GPL(nvme_set_power); > >>> > >>> I believe without memory barriers at the end disks with HMB this will > >>> still kernel panic (Such as Toshiba BG3). > >> > >> Well, the mutex has an implied memory barrier, but your HMB explanation > >> doesn't make much sense to me anyway. The "mb()" in this thread's original > >> patch is a CPU memory barrier, and the CPU had better not be accessing > >> HMB memory. Is there something else going on here? > > > > Kai Heng will need to speak up a bit in his time zone as he has this disk > > on hand, > > but what I recall from our discussion was that DMA operation MemRd after > > resume was the source of the hang. > > Yes, that’ what I was told by the NVMe vendor, so all I know is to impose a > memory barrier. > If mb() shouldn’t be used here, what’s the correct variant to use in this > context? > > > > >>> This still allows D3 which we found at least failed to go into deepest > >>> state and > >> blocked > >>> platform s0ix for the following SSDs (maybe others): > >>> Hynix PC601 > >>> LiteOn CL1 > >> > >> We usually write features to spec first, then quirk non-compliant > >> devices after. > > > > NVME spec doesn't talk about a relationship between SetFeatures w/ > > NVME_FEAT_POWER_MGMGT and D3 support, nor order of events. > > > > This is why we opened a dialog with storage vendors, including > > contrasting the behavior > > of Microsoft Windows inbox NVME driver and Intel's Windows RST driver. > > > > Those two I mention that come to mind immediately because they were most > > recently > > tested to fail. Our discussion with storage vendors overwhelmingly > > requested > > that we don't use D3 under S2I because their current firmware > > architecture won't > > support it. > > > > For example one vendor told us with current implementation that receiving > > D3hot > > after NVME shutdown will prevent being able to enter L1.2. D3hot entry > > was supported > > by an IRQ handler that isn't serviced in NVME shutdown state. > > > > Another vendor told us that with current implementation it's impossible > > to transition > > to PS4 (at least via APST) while L1.2 D3hot is active. > > I tested the patch from Keith and it has two issues just as simply skipping > nvme_dev_disable(): > 1) It consumes more power in S2I > 2) System freeze after resume Well, the Keith's patch doesn't prevent pci_pm_suspend_noirq() from asking for D3 and both of the symptoms above may be consequences of that in principle. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-10 8:23 ` Rafael J. Wysocki 2019-05-10 8:23 ` Rafael J. Wysocki @ 2019-05-10 13:52 ` Keith Busch 2019-05-10 13:52 ` Keith Busch 2019-05-10 15:15 ` Kai Heng Feng 2 siblings, 1 reply; 60+ messages in thread From: Keith Busch @ 2019-05-10 13:52 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Kai-Heng Feng, Mario Limonciello, Christoph Hellwig, Jens Axboe, Sagi Grimberg, Linux PM, Wysocki, Rafael J, Linux Kernel Mailing List, linux-nvme, Busch, Keith On Fri, May 10, 2019 at 01:23:11AM -0700, Rafael J. Wysocki wrote: > On Fri, May 10, 2019 at 8:08 AM Kai-Heng Feng > > I tested the patch from Keith and it has two issues just as simply skipping > > nvme_dev_disable(): > > 1) It consumes more power in S2I > > 2) System freeze after resume > > Well, the Keith's patch doesn't prevent pci_pm_suspend_noirq() from > asking for D3 and both of the symptoms above may be consequences of > that in principle. Right, I'll fix up the kernel's PCI D3 control and resend for consideration. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-10 13:52 ` Keith Busch @ 2019-05-10 13:52 ` Keith Busch 0 siblings, 0 replies; 60+ messages in thread From: Keith Busch @ 2019-05-10 13:52 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Kai-Heng Feng, Mario Limonciello, Christoph Hellwig, Jens Axboe, Sagi Grimberg, Linux PM, Wysocki, Rafael J, Linux Kernel Mailing List, linux-nvme, Busch, Keith On Fri, May 10, 2019 at 01:23:11AM -0700, Rafael J. Wysocki wrote: > On Fri, May 10, 2019 at 8:08 AM Kai-Heng Feng > > I tested the patch from Keith and it has two issues just as simply skipping > > nvme_dev_disable(): > > 1) It consumes more power in S2I > > 2) System freeze after resume > > Well, the Keith's patch doesn't prevent pci_pm_suspend_noirq() from > asking for D3 and both of the symptoms above may be consequences of > that in principle. Right, I'll fix up the kernel's PCI D3 control and resend for consideration. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-10 8:23 ` Rafael J. Wysocki 2019-05-10 8:23 ` Rafael J. Wysocki 2019-05-10 13:52 ` Keith Busch @ 2019-05-10 15:15 ` Kai Heng Feng 2019-05-10 15:15 ` Kai Heng Feng 2019-05-10 15:36 ` Keith Busch 2 siblings, 2 replies; 60+ messages in thread From: Kai Heng Feng @ 2019-05-10 15:15 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Mario Limonciello, Keith Busch, Christoph Hellwig, Jens Axboe, Sagi Grimberg, Linux PM, Rafael Wysocki, Linux Kernel Mailing List, linux-nvme, Keith Busch > On May 10, 2019, at 4:23 PM, Rafael J. Wysocki <rafael@kernel.org> wrote: > > On Fri, May 10, 2019 at 8:08 AM Kai-Heng Feng > <kai.heng.feng@canonical.com> wrote: >> >> at 06:19, <Mario.Limonciello@dell.com> <Mario.Limonciello@dell.com> wrote: >> >>>> -----Original Message----- >>>> From: Keith Busch <kbusch@kernel.org> >>>> Sent: Thursday, May 9, 2019 4:54 PM >>>> To: Limonciello, Mario >>>> Cc: kai.heng.feng@canonical.com; hch@lst.de; axboe@fb.com; >>>> sagi@grimberg.me; rafael@kernel.org; linux-pm@vger.kernel.org; >>>> rafael.j.wysocki@intel.com; linux-kernel@vger.kernel.org; linux- >>>> nvme@lists.infradead.org; keith.busch@intel.com >>>> Subject: Re: [PATCH] nvme-pci: Use non-operational power state instead >>>> of D3 on >>>> Suspend-to-Idle >>>> >>>> >>>> [EXTERNAL EMAIL] >>>> >>>> On Thu, May 09, 2019 at 09:37:58PM +0000, Mario.Limonciello@dell.com >>>> wrote: >>>>>> +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss) >>>>>> +{ >>>>>> + int ret; >>>>>> + >>>>>> + mutex_lock(&ctrl->scan_lock); >>>>>> + nvme_start_freeze(ctrl); >>>>>> + nvme_wait_freeze(ctrl); >>>>>> + ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, npss, NULL, 0, >>>>>> + NULL); >>>>>> + nvme_unfreeze(ctrl); >>>>>> + mutex_unlock(&ctrl->scan_lock); >>>>>> + >>>>>> + return ret; >>>>>> +} >>>>>> +EXPORT_SYMBOL_GPL(nvme_set_power); >>>>> >>>>> I believe without memory barriers at the end disks with HMB this will >>>>> still kernel panic (Such as Toshiba BG3). >>>> >>>> Well, the mutex has an implied memory barrier, but your HMB explanation >>>> doesn't make much sense to me anyway. The "mb()" in this thread's original >>>> patch is a CPU memory barrier, and the CPU had better not be accessing >>>> HMB memory. Is there something else going on here? >>> >>> Kai Heng will need to speak up a bit in his time zone as he has this disk >>> on hand, >>> but what I recall from our discussion was that DMA operation MemRd after >>> resume was the source of the hang. >> >> Yes, that’ what I was told by the NVMe vendor, so all I know is to impose a >> memory barrier. >> If mb() shouldn’t be used here, what’s the correct variant to use in this >> context? >> >>> >>>>> This still allows D3 which we found at least failed to go into deepest >>>>> state and >>>> blocked >>>>> platform s0ix for the following SSDs (maybe others): >>>>> Hynix PC601 >>>>> LiteOn CL1 >>>> >>>> We usually write features to spec first, then quirk non-compliant >>>> devices after. >>> >>> NVME spec doesn't talk about a relationship between SetFeatures w/ >>> NVME_FEAT_POWER_MGMGT and D3 support, nor order of events. >>> >>> This is why we opened a dialog with storage vendors, including >>> contrasting the behavior >>> of Microsoft Windows inbox NVME driver and Intel's Windows RST driver. >>> >>> Those two I mention that come to mind immediately because they were most >>> recently >>> tested to fail. Our discussion with storage vendors overwhelmingly >>> requested >>> that we don't use D3 under S2I because their current firmware >>> architecture won't >>> support it. >>> >>> For example one vendor told us with current implementation that receiving >>> D3hot >>> after NVME shutdown will prevent being able to enter L1.2. D3hot entry >>> was supported >>> by an IRQ handler that isn't serviced in NVME shutdown state. >>> >>> Another vendor told us that with current implementation it's impossible >>> to transition >>> to PS4 (at least via APST) while L1.2 D3hot is active. >> >> I tested the patch from Keith and it has two issues just as simply skipping >> nvme_dev_disable(): >> 1) It consumes more power in S2I >> 2) System freeze after resume > > Well, the Keith's patch doesn't prevent pci_pm_suspend_noirq() from > asking for D3 and both of the symptoms above may be consequences of > that in principle. Sorry, I should mention that I use a slightly modified drivers/nvme/host/pci.c: diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 3e4fb891a95a..ece428ce6876 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -18,6 +18,7 @@ #include <linux/mutex.h> #include <linux/once.h> #include <linux/pci.h> +#include <linux/suspend.h> #include <linux/t10-pi.h> #include <linux/types.h> #include <linux/io-64-nonatomic-lo-hi.h> @@ -2833,6 +2834,11 @@ static int nvme_suspend(struct device *dev) struct pci_dev *pdev = to_pci_dev(dev); struct nvme_dev *ndev = pci_get_drvdata(pdev); + if (!pm_suspend_via_firmware()) { + nvme_set_power(&ndev->ctrl, ndev->ctrl.npss); + pci_save_state(pdev); + } + nvme_dev_disable(ndev, true); return 0; } @@ -2842,6 +2848,10 @@ static int nvme_resume(struct device *dev) struct pci_dev *pdev = to_pci_dev(dev); struct nvme_dev *ndev = pci_get_drvdata(pdev); + if (!pm_resume_via_firmware()) { + return nvme_set_power(&ndev->ctrl, 0); + } + nvme_reset_ctrl(&ndev->ctrl); return 0; } Does pci_save_state() here enough to prevent the device enter to D3? Kai-Heng ^ permalink raw reply related [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-10 15:15 ` Kai Heng Feng @ 2019-05-10 15:15 ` Kai Heng Feng 2019-05-10 15:36 ` Keith Busch 1 sibling, 0 replies; 60+ messages in thread From: Kai Heng Feng @ 2019-05-10 15:15 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Mario Limonciello, Keith Busch, Christoph Hellwig, Jens Axboe, Sagi Grimberg, Linux PM, Rafael Wysocki, Linux Kernel Mailing List, linux-nvme, Keith Busch > On May 10, 2019, at 4:23 PM, Rafael J. Wysocki <rafael@kernel.org> wrote: > > On Fri, May 10, 2019 at 8:08 AM Kai-Heng Feng > <kai.heng.feng@canonical.com> wrote: >> >> at 06:19, <Mario.Limonciello@dell.com> <Mario.Limonciello@dell.com> wrote: >> >>>> -----Original Message----- >>>> From: Keith Busch <kbusch@kernel.org> >>>> Sent: Thursday, May 9, 2019 4:54 PM >>>> To: Limonciello, Mario >>>> Cc: kai.heng.feng@canonical.com; hch@lst.de; axboe@fb.com; >>>> sagi@grimberg.me; rafael@kernel.org; linux-pm@vger.kernel.org; >>>> rafael.j.wysocki@intel.com; linux-kernel@vger.kernel.org; linux- >>>> nvme@lists.infradead.org; keith.busch@intel.com >>>> Subject: Re: [PATCH] nvme-pci: Use non-operational power state instead >>>> of D3 on >>>> Suspend-to-Idle >>>> >>>> >>>> [EXTERNAL EMAIL] >>>> >>>> On Thu, May 09, 2019 at 09:37:58PM +0000, Mario.Limonciello@dell.com >>>> wrote: >>>>>> +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss) >>>>>> +{ >>>>>> + int ret; >>>>>> + >>>>>> + mutex_lock(&ctrl->scan_lock); >>>>>> + nvme_start_freeze(ctrl); >>>>>> + nvme_wait_freeze(ctrl); >>>>>> + ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, npss, NULL, 0, >>>>>> + NULL); >>>>>> + nvme_unfreeze(ctrl); >>>>>> + mutex_unlock(&ctrl->scan_lock); >>>>>> + >>>>>> + return ret; >>>>>> +} >>>>>> +EXPORT_SYMBOL_GPL(nvme_set_power); >>>>> >>>>> I believe without memory barriers at the end disks with HMB this will >>>>> still kernel panic (Such as Toshiba BG3). >>>> >>>> Well, the mutex has an implied memory barrier, but your HMB explanation >>>> doesn't make much sense to me anyway. The "mb()" in this thread's original >>>> patch is a CPU memory barrier, and the CPU had better not be accessing >>>> HMB memory. Is there something else going on here? >>> >>> Kai Heng will need to speak up a bit in his time zone as he has this disk >>> on hand, >>> but what I recall from our discussion was that DMA operation MemRd after >>> resume was the source of the hang. >> >> Yes, that’ what I was told by the NVMe vendor, so all I know is to impose a >> memory barrier. >> If mb() shouldn’t be used here, what’s the correct variant to use in this >> context? >> >>> >>>>> This still allows D3 which we found at least failed to go into deepest >>>>> state and >>>> blocked >>>>> platform s0ix for the following SSDs (maybe others): >>>>> Hynix PC601 >>>>> LiteOn CL1 >>>> >>>> We usually write features to spec first, then quirk non-compliant >>>> devices after. >>> >>> NVME spec doesn't talk about a relationship between SetFeatures w/ >>> NVME_FEAT_POWER_MGMGT and D3 support, nor order of events. >>> >>> This is why we opened a dialog with storage vendors, including >>> contrasting the behavior >>> of Microsoft Windows inbox NVME driver and Intel's Windows RST driver. >>> >>> Those two I mention that come to mind immediately because they were most >>> recently >>> tested to fail. Our discussion with storage vendors overwhelmingly >>> requested >>> that we don't use D3 under S2I because their current firmware >>> architecture won't >>> support it. >>> >>> For example one vendor told us with current implementation that receiving >>> D3hot >>> after NVME shutdown will prevent being able to enter L1.2. D3hot entry >>> was supported >>> by an IRQ handler that isn't serviced in NVME shutdown state. >>> >>> Another vendor told us that with current implementation it's impossible >>> to transition >>> to PS4 (at least via APST) while L1.2 D3hot is active. >> >> I tested the patch from Keith and it has two issues just as simply skipping >> nvme_dev_disable(): >> 1) It consumes more power in S2I >> 2) System freeze after resume > > Well, the Keith's patch doesn't prevent pci_pm_suspend_noirq() from > asking for D3 and both of the symptoms above may be consequences of > that in principle. Sorry, I should mention that I use a slightly modified drivers/nvme/host/pci.c: diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 3e4fb891a95a..ece428ce6876 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -18,6 +18,7 @@ #include <linux/mutex.h> #include <linux/once.h> #include <linux/pci.h> +#include <linux/suspend.h> #include <linux/t10-pi.h> #include <linux/types.h> #include <linux/io-64-nonatomic-lo-hi.h> @@ -2833,6 +2834,11 @@ static int nvme_suspend(struct device *dev) struct pci_dev *pdev = to_pci_dev(dev); struct nvme_dev *ndev = pci_get_drvdata(pdev); + if (!pm_suspend_via_firmware()) { + nvme_set_power(&ndev->ctrl, ndev->ctrl.npss); + pci_save_state(pdev); + } + nvme_dev_disable(ndev, true); return 0; } @@ -2842,6 +2848,10 @@ static int nvme_resume(struct device *dev) struct pci_dev *pdev = to_pci_dev(dev); struct nvme_dev *ndev = pci_get_drvdata(pdev); + if (!pm_resume_via_firmware()) { + return nvme_set_power(&ndev->ctrl, 0); + } + nvme_reset_ctrl(&ndev->ctrl); return 0; } Does pci_save_state() here enough to prevent the device enter to D3? Kai-Heng ^ permalink raw reply related [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-10 15:15 ` Kai Heng Feng 2019-05-10 15:15 ` Kai Heng Feng @ 2019-05-10 15:36 ` Keith Busch 2019-05-10 15:36 ` Keith Busch 1 sibling, 1 reply; 60+ messages in thread From: Keith Busch @ 2019-05-10 15:36 UTC (permalink / raw) To: Kai Heng Feng Cc: Rafael J. Wysocki, Mario Limonciello, Christoph Hellwig, Jens Axboe, Sagi Grimberg, Linux PM, Rafael Wysocki, Linux Kernel Mailing List, linux-nvme, Keith Busch On Fri, May 10, 2019 at 11:15:05PM +0800, Kai Heng Feng wrote: > Sorry, I should mention that I use a slightly modified drivers/nvme/host/pci.c: > > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c > index 3e4fb891a95a..ece428ce6876 100644 > --- a/drivers/nvme/host/pci.c > +++ b/drivers/nvme/host/pci.c > @@ -18,6 +18,7 @@ > #include <linux/mutex.h> > #include <linux/once.h> > #include <linux/pci.h> > +#include <linux/suspend.h> > #include <linux/t10-pi.h> > #include <linux/types.h> > #include <linux/io-64-nonatomic-lo-hi.h> > @@ -2833,6 +2834,11 @@ static int nvme_suspend(struct device *dev) > struct pci_dev *pdev = to_pci_dev(dev); > struct nvme_dev *ndev = pci_get_drvdata(pdev); > > + if (!pm_suspend_via_firmware()) { > + nvme_set_power(&ndev->ctrl, ndev->ctrl.npss); > + pci_save_state(pdev); > + } > + > nvme_dev_disable(ndev, true); This won't work because you're falling through to nvme_dev_disable after setting the power, so the resume side would certainly fail. This is what you'd want: if (!pm_suspend_via_firmware()) { pci_save_state(pdev); return nvme_set_power(&ndev->ctrl, ndev->ctrl.npss); } > return 0; > } > @@ -2842,6 +2848,10 @@ static int nvme_resume(struct device *dev) > struct pci_dev *pdev = to_pci_dev(dev); > struct nvme_dev *ndev = pci_get_drvdata(pdev); > > + if (!pm_resume_via_firmware()) { > + return nvme_set_power(&ndev->ctrl, 0); > + } > + > nvme_reset_ctrl(&ndev->ctrl); > return 0; > } > > Does pci_save_state() here enough to prevent the device enter to D3? Yes, saving the state during suspend will prevent pci pm from assuming control over this device's power. It's a bit non-intuitive as Christoph mentioned, so we'll need to make that clear in the comments. For reference, here's the relevant part in pci-driver.c: --- static int pci_pm_suspend_noirq(struct device *dev) { ... if (!pci_dev->state_saved) { pci_save_state(pci_dev); if (pci_power_manageable(pci_dev)) pci_prepare_to_sleep(pci_dev); } ... } -- So by saving the state in our suspend callback, pci will skip pci_prepare_to_sleep(), which is what's setting your device most likely to a D3hot, undermining our nvme power state setting. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-10 15:36 ` Keith Busch @ 2019-05-10 15:36 ` Keith Busch 0 siblings, 0 replies; 60+ messages in thread From: Keith Busch @ 2019-05-10 15:36 UTC (permalink / raw) To: Kai Heng Feng Cc: Rafael J. Wysocki, Mario Limonciello, Christoph Hellwig, Jens Axboe, Sagi Grimberg, Linux PM, Rafael Wysocki, Linux Kernel Mailing List, linux-nvme, Keith Busch On Fri, May 10, 2019 at 11:15:05PM +0800, Kai Heng Feng wrote: > Sorry, I should mention that I use a slightly modified drivers/nvme/host/pci.c: > > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c > index 3e4fb891a95a..ece428ce6876 100644 > --- a/drivers/nvme/host/pci.c > +++ b/drivers/nvme/host/pci.c > @@ -18,6 +18,7 @@ > #include <linux/mutex.h> > #include <linux/once.h> > #include <linux/pci.h> > +#include <linux/suspend.h> > #include <linux/t10-pi.h> > #include <linux/types.h> > #include <linux/io-64-nonatomic-lo-hi.h> > @@ -2833,6 +2834,11 @@ static int nvme_suspend(struct device *dev) > struct pci_dev *pdev = to_pci_dev(dev); > struct nvme_dev *ndev = pci_get_drvdata(pdev); > > + if (!pm_suspend_via_firmware()) { > + nvme_set_power(&ndev->ctrl, ndev->ctrl.npss); > + pci_save_state(pdev); > + } > + > nvme_dev_disable(ndev, true); This won't work because you're falling through to nvme_dev_disable after setting the power, so the resume side would certainly fail. This is what you'd want: if (!pm_suspend_via_firmware()) { pci_save_state(pdev); return nvme_set_power(&ndev->ctrl, ndev->ctrl.npss); } > return 0; > } > @@ -2842,6 +2848,10 @@ static int nvme_resume(struct device *dev) > struct pci_dev *pdev = to_pci_dev(dev); > struct nvme_dev *ndev = pci_get_drvdata(pdev); > > + if (!pm_resume_via_firmware()) { > + return nvme_set_power(&ndev->ctrl, 0); > + } > + > nvme_reset_ctrl(&ndev->ctrl); > return 0; > } > > Does pci_save_state() here enough to prevent the device enter to D3? Yes, saving the state during suspend will prevent pci pm from assuming control over this device's power. It's a bit non-intuitive as Christoph mentioned, so we'll need to make that clear in the comments. For reference, here's the relevant part in pci-driver.c: --- static int pci_pm_suspend_noirq(struct device *dev) { ... if (!pci_dev->state_saved) { pci_save_state(pci_dev); if (pci_power_manageable(pci_dev)) pci_prepare_to_sleep(pci_dev); } ... } -- So by saving the state in our suspend callback, pci will skip pci_prepare_to_sleep(), which is what's setting your device most likely to a D3hot, undermining our nvme power state setting. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-10 6:05 ` Kai-Heng Feng 2019-05-10 6:05 ` Kai-Heng Feng 2019-05-10 8:23 ` Rafael J. Wysocki @ 2019-05-10 14:02 ` Keith Busch 2019-05-10 14:02 ` Keith Busch 2019-05-10 15:18 ` Kai Heng Feng 2 siblings, 2 replies; 60+ messages in thread From: Keith Busch @ 2019-05-10 14:02 UTC (permalink / raw) To: Kai-Heng Feng Cc: Mario.Limonciello@dell.com, hch@lst.de, axboe@fb.com, sagi@grimberg.me, rafael@kernel.org, linux-pm@vger.kernel.org, Wysocki, Rafael J, linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org, Busch, Keith On Thu, May 09, 2019 at 11:05:42PM -0700, Kai-Heng Feng wrote: > Yes, that’ what I was told by the NVMe vendor, so all I know is to impose a > memory barrier. > If mb() shouldn’t be used here, what’s the correct variant to use in this > context? I'm afraid the requirement is still not clear to me. AFAIK, all our barriers routines ensure data is visible either between CPUs, or between CPU and devices. The CPU never accesses HMB memory, so there must be some other reasoning if this barrier is a real requirement for this device. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-10 14:02 ` Keith Busch @ 2019-05-10 14:02 ` Keith Busch 2019-05-10 15:18 ` Kai Heng Feng 1 sibling, 0 replies; 60+ messages in thread From: Keith Busch @ 2019-05-10 14:02 UTC (permalink / raw) To: Kai-Heng Feng Cc: Mario.Limonciello@dell.com, hch@lst.de, axboe@fb.com, sagi@grimberg.me, rafael@kernel.org, linux-pm@vger.kernel.org, Wysocki, Rafael J, linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org, Busch, Keith On Thu, May 09, 2019 at 11:05:42PM -0700, Kai-Heng Feng wrote: > Yes, that’ what I was told by the NVMe vendor, so all I know is to impose a > memory barrier. > If mb() shouldn’t be used here, what’s the correct variant to use in this > context? I'm afraid the requirement is still not clear to me. AFAIK, all our barriers routines ensure data is visible either between CPUs, or between CPU and devices. The CPU never accesses HMB memory, so there must be some other reasoning if this barrier is a real requirement for this device. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-10 14:02 ` Keith Busch 2019-05-10 14:02 ` Keith Busch @ 2019-05-10 15:18 ` Kai Heng Feng 2019-05-10 15:18 ` Kai Heng Feng 2019-05-10 15:49 ` hch 1 sibling, 2 replies; 60+ messages in thread From: Kai Heng Feng @ 2019-05-10 15:18 UTC (permalink / raw) To: Keith Busch Cc: Mario.Limonciello@dell.com, hch@lst.de, axboe@fb.com, sagi@grimberg.me, rafael@kernel.org, linux-pm@vger.kernel.org, Wysocki, Rafael J, linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org, Busch, Keith > On May 10, 2019, at 10:02 PM, Keith Busch <kbusch@kernel.org> wrote: > > On Thu, May 09, 2019 at 11:05:42PM -0700, Kai-Heng Feng wrote: >> Yes, that’ what I was told by the NVMe vendor, so all I know is to impose a >> memory barrier. >> If mb() shouldn’t be used here, what’s the correct variant to use in this >> context? > > I'm afraid the requirement is still not clear to me. AFAIK, all our > barriers routines ensure data is visible either between CPUs, or between > CPU and devices. The CPU never accesses HMB memory, so there must be some > other reasoning if this barrier is a real requirement for this device. Sure, I’ll ask vendor what that MemRd is for. Kai-Heng ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-10 15:18 ` Kai Heng Feng @ 2019-05-10 15:18 ` Kai Heng Feng 2019-05-10 15:49 ` hch 1 sibling, 0 replies; 60+ messages in thread From: Kai Heng Feng @ 2019-05-10 15:18 UTC (permalink / raw) To: Keith Busch Cc: Mario.Limonciello@dell.com, hch@lst.de, axboe@fb.com, sagi@grimberg.me, rafael@kernel.org, linux-pm@vger.kernel.org, Wysocki, Rafael J, linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org, Busch, Keith > On May 10, 2019, at 10:02 PM, Keith Busch <kbusch@kernel.org> wrote: > > On Thu, May 09, 2019 at 11:05:42PM -0700, Kai-Heng Feng wrote: >> Yes, that’ what I was told by the NVMe vendor, so all I know is to impose a >> memory barrier. >> If mb() shouldn’t be used here, what’s the correct variant to use in this >> context? > > I'm afraid the requirement is still not clear to me. AFAIK, all our > barriers routines ensure data is visible either between CPUs, or between > CPU and devices. The CPU never accesses HMB memory, so there must be some > other reasoning if this barrier is a real requirement for this device. Sure, I’ll ask vendor what that MemRd is for. Kai-Heng ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-10 15:18 ` Kai Heng Feng 2019-05-10 15:18 ` Kai Heng Feng @ 2019-05-10 15:49 ` hch 2019-05-10 15:49 ` hch 1 sibling, 1 reply; 60+ messages in thread From: hch @ 2019-05-10 15:49 UTC (permalink / raw) To: Kai Heng Feng Cc: Keith Busch, Mario.Limonciello@dell.com, hch@lst.de, axboe@fb.com, sagi@grimberg.me, rafael@kernel.org, linux-pm@vger.kernel.org, Wysocki, Rafael J, linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org, Busch, Keith On Fri, May 10, 2019 at 11:18:52PM +0800, Kai Heng Feng wrote: > > I'm afraid the requirement is still not clear to me. AFAIK, all our > > barriers routines ensure data is visible either between CPUs, or between > > CPU and devices. The CPU never accesses HMB memory, so there must be some > > other reasoning if this barrier is a real requirement for this device. > > Sure, I’ll ask vendor what that MemRd is for. I'd like to understand this bug, but this thread leaves me a little confused. So we have a NVMe driver with HMB. Something crashes - the kernel or the firmware? When does it crash? suspend or resume? That crash seems to be related to a related to a PCIe TLP that reads memory from the host, probably due to the HMB. But a device with a HMB has been told that it can access that memory at any time. So if in any given suspend state TLP to access RAM are not allowed we'll have to tell the device to stop using the HMB. So: what power states do not allow the device to DMA to / from host memory? How do we find out we are about to enter those from the pm methods? We'll then need to disable the HMB, which might suck in terms of latency. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-10 15:49 ` hch @ 2019-05-10 15:49 ` hch 0 siblings, 0 replies; 60+ messages in thread From: hch @ 2019-05-10 15:49 UTC (permalink / raw) To: Kai Heng Feng Cc: Keith Busch, Mario.Limonciello@dell.com, hch@lst.de, axboe@fb.com, sagi@grimberg.me, rafael@kernel.org, linux-pm@vger.kernel.org, Wysocki, Rafael J, linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org, Busch, Keith On Fri, May 10, 2019 at 11:18:52PM +0800, Kai Heng Feng wrote: > > I'm afraid the requirement is still not clear to me. AFAIK, all our > > barriers routines ensure data is visible either between CPUs, or between > > CPU and devices. The CPU never accesses HMB memory, so there must be some > > other reasoning if this barrier is a real requirement for this device. > > Sure, I’ll ask vendor what that MemRd is for. I'd like to understand this bug, but this thread leaves me a little confused. So we have a NVMe driver with HMB. Something crashes - the kernel or the firmware? When does it crash? suspend or resume? That crash seems to be related to a related to a PCIe TLP that reads memory from the host, probably due to the HMB. But a device with a HMB has been told that it can access that memory at any time. So if in any given suspend state TLP to access RAM are not allowed we'll have to tell the device to stop using the HMB. So: what power states do not allow the device to DMA to / from host memory? How do we find out we are about to enter those from the pm methods? We'll then need to disable the HMB, which might suck in terms of latency. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 19:28 ` Keith Busch ` (2 preceding siblings ...) 2019-05-09 21:37 ` Mario.Limonciello @ 2019-05-10 5:30 ` Christoph Hellwig 2019-05-10 5:30 ` Christoph Hellwig 2019-05-10 13:51 ` Keith Busch 3 siblings, 2 replies; 60+ messages in thread From: Christoph Hellwig @ 2019-05-10 5:30 UTC (permalink / raw) To: Keith Busch Cc: Mario.Limonciello, kai.heng.feng, hch, axboe, sagi, rafael, linux-pm, rafael.j.wysocki, linux-kernel, linux-nvme, keith.busch > +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss) > +{ > + int ret; > + > + mutex_lock(&ctrl->scan_lock); > + nvme_start_freeze(ctrl); > + nvme_wait_freeze(ctrl); > + ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, npss, NULL, 0, > + NULL); > + nvme_unfreeze(ctrl); > + mutex_unlock(&ctrl->scan_lock); > + > + return ret; > +} > +EXPORT_SYMBOL_GPL(nvme_set_power); I think we should have this in the PCIe driver, givn that while in theory power states are generic in practice they are only applicable to PCIe. To be revisited if history proves me wrong. Also I don't see any reason why we'd need to do the freeze game on resume. Even on suspend it looks a little odd to me, as in theory the PM core should have already put the system into a quiescent state. But maybe we actually need it there, in which case a comment would be helpful. > + if (!pm_suspend_via_firmware()) pm_suspend_via_firmware is a weird name and has absolutely zero documentation. So I think we really need a big fat comment with the wisdom from this thread here. > + return nvme_set_power(&ndev->ctrl, ndev->ctrl.npss); I think we need to skip this code path is NPSS is zero, as that indicates that the device doesn't actually do power states and fall back to the full teardown. Also I can't find anything except for this odd sentences in the spec: "Hosts that do not dynamically manage power should set the power state to the lowest numbered state that satisfies the PCI Express slot power limit control value. that requires the power states to be ordered in any particular way. So we probably have to read through the table at probing time and find the lowest power state there. Rafael also brought up the issue of not entering D3, and the somewhat non-intuitive to me solution for it, so I'm not commenting on that except for asking on a comment on that save_state call. > + if (!pm_suspend_via_firmware()) > + return nvme_set_power(&ndev->ctrl, 0); Don't we need to save the previous power state here and restore that? For example if someone set a specific state through nvme-cli? ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-10 5:30 ` Christoph Hellwig @ 2019-05-10 5:30 ` Christoph Hellwig 2019-05-10 13:51 ` Keith Busch 1 sibling, 0 replies; 60+ messages in thread From: Christoph Hellwig @ 2019-05-10 5:30 UTC (permalink / raw) To: Keith Busch Cc: Mario.Limonciello, kai.heng.feng, hch, axboe, sagi, rafael, linux-pm, rafael.j.wysocki, linux-kernel, linux-nvme, keith.busch > +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss) > +{ > + int ret; > + > + mutex_lock(&ctrl->scan_lock); > + nvme_start_freeze(ctrl); > + nvme_wait_freeze(ctrl); > + ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, npss, NULL, 0, > + NULL); > + nvme_unfreeze(ctrl); > + mutex_unlock(&ctrl->scan_lock); > + > + return ret; > +} > +EXPORT_SYMBOL_GPL(nvme_set_power); I think we should have this in the PCIe driver, givn that while in theory power states are generic in practice they are only applicable to PCIe. To be revisited if history proves me wrong. Also I don't see any reason why we'd need to do the freeze game on resume. Even on suspend it looks a little odd to me, as in theory the PM core should have already put the system into a quiescent state. But maybe we actually need it there, in which case a comment would be helpful. > + if (!pm_suspend_via_firmware()) pm_suspend_via_firmware is a weird name and has absolutely zero documentation. So I think we really need a big fat comment with the wisdom from this thread here. > + return nvme_set_power(&ndev->ctrl, ndev->ctrl.npss); I think we need to skip this code path is NPSS is zero, as that indicates that the device doesn't actually do power states and fall back to the full teardown. Also I can't find anything except for this odd sentences in the spec: "Hosts that do not dynamically manage power should set the power state to the lowest numbered state that satisfies the PCI Express slot power limit control value. that requires the power states to be ordered in any particular way. So we probably have to read through the table at probing time and find the lowest power state there. Rafael also brought up the issue of not entering D3, and the somewhat non-intuitive to me solution for it, so I'm not commenting on that except for asking on a comment on that save_state call. > + if (!pm_suspend_via_firmware()) > + return nvme_set_power(&ndev->ctrl, 0); Don't we need to save the previous power state here and restore that? For example if someone set a specific state through nvme-cli? ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-10 5:30 ` Christoph Hellwig 2019-05-10 5:30 ` Christoph Hellwig @ 2019-05-10 13:51 ` Keith Busch 2019-05-10 13:51 ` Keith Busch 1 sibling, 1 reply; 60+ messages in thread From: Keith Busch @ 2019-05-10 13:51 UTC (permalink / raw) To: Christoph Hellwig Cc: Mario.Limonciello@dell.com, kai.heng.feng@canonical.com, axboe@fb.com, sagi@grimberg.me, rafael@kernel.org, linux-pm@vger.kernel.org, Wysocki, Rafael J, linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org, Busch, Keith On Thu, May 09, 2019 at 10:30:52PM -0700, Christoph Hellwig wrote: > Also I don't see any reason why we'd need to do the freeze game on > resume. Right, definitely no reason for resume. > Even on suspend it looks a little odd to me, as in theory > the PM core should have already put the system into a quiescent state. > But maybe we actually need it there, in which case a comment would > be helpful. I wasn't sure if suspend prevents a kthread or work queue from running. For example, if the device sends a namespace notify AEN during S2I, does the nvme scan_work run? Since I wasn't sure, I took a paranoid approach to ensure nothing was in flight, but I'd be happy if this is unnecessary. > > + if (!pm_suspend_via_firmware()) > > + return nvme_set_power(&ndev->ctrl, 0); > > Don't we need to save the previous power state here and restore that? > For example if someone set a specific state through nvme-cli? Sure, we can do that. It would have been super if the spec had this set feature command's CQE DW0 return the previous power state so we could set the new and save the old in a single command, but two commands is just a minor inconvenience. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-10 13:51 ` Keith Busch @ 2019-05-10 13:51 ` Keith Busch 0 siblings, 0 replies; 60+ messages in thread From: Keith Busch @ 2019-05-10 13:51 UTC (permalink / raw) To: Christoph Hellwig Cc: Mario.Limonciello@dell.com, kai.heng.feng@canonical.com, axboe@fb.com, sagi@grimberg.me, rafael@kernel.org, linux-pm@vger.kernel.org, Wysocki, Rafael J, linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org, Busch, Keith On Thu, May 09, 2019 at 10:30:52PM -0700, Christoph Hellwig wrote: > Also I don't see any reason why we'd need to do the freeze game on > resume. Right, definitely no reason for resume. > Even on suspend it looks a little odd to me, as in theory > the PM core should have already put the system into a quiescent state. > But maybe we actually need it there, in which case a comment would > be helpful. I wasn't sure if suspend prevents a kthread or work queue from running. For example, if the device sends a namespace notify AEN during S2I, does the nvme scan_work run? Since I wasn't sure, I took a paranoid approach to ensure nothing was in flight, but I'd be happy if this is unnecessary. > > + if (!pm_suspend_via_firmware()) > > + return nvme_set_power(&ndev->ctrl, 0); > > Don't we need to save the previous power state here and restore that? > For example if someone set a specific state through nvme-cli? Sure, we can do that. It would have been super if the spec had this set feature command's CQE DW0 return the previous power state so we could set the new and save the old in a single command, but two commands is just a minor inconvenience. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 10:28 ` Kai-Heng Feng 2019-05-09 10:28 ` Kai-Heng Feng 2019-05-09 10:31 ` Christoph Hellwig @ 2019-05-09 16:20 ` Keith Busch 2019-05-09 16:20 ` Keith Busch 2 siblings, 1 reply; 60+ messages in thread From: Keith Busch @ 2019-05-09 16:20 UTC (permalink / raw) To: Kai-Heng Feng Cc: Christoph Hellwig, Rafael J. Wysocki, Wysocki, Rafael J, Mario Limonciello, Busch, Keith, Jens Axboe, Sagi Grimberg, linux-nvme, Linux PM, LKML On Thu, May 09, 2019 at 03:28:32AM -0700, Kai-Heng Feng wrote: > at 17:56, Christoph Hellwig <hch@lst.de> wrote: > > The we have the sequence in your patch. This seems to be related to > > some of the MS wording, but I'm not sure what for example tearing down > > the queues buys us. Can you explain a bit more where those bits > > make a difference? > > Based on my testing if queues (IRQ) are not disabled, NVMe controller won’t > be quiesced. > Symptoms can be high power drain or system freeze. > > I can check with vendors whether this also necessary under Windows. Hm, that doesn't sound right based on the spec's description of how this feature works. We should not need to tear down IO queues for entering low power, nor reset the controller to exit it. The sequence for entering non-operational low power should just be freeze request queues, set the power feature, then unfreeze request queues. We shouldn't have to do anything to exit the state as the spec requires devices autonomously return to operational when an IO doorbell is written. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle 2019-05-09 16:20 ` Keith Busch @ 2019-05-09 16:20 ` Keith Busch 0 siblings, 0 replies; 60+ messages in thread From: Keith Busch @ 2019-05-09 16:20 UTC (permalink / raw) To: Kai-Heng Feng Cc: Christoph Hellwig, Rafael J. Wysocki, Wysocki, Rafael J, Mario Limonciello, Busch, Keith, Jens Axboe, Sagi Grimberg, linux-nvme, Linux PM, LKML On Thu, May 09, 2019 at 03:28:32AM -0700, Kai-Heng Feng wrote: > at 17:56, Christoph Hellwig <hch@lst.de> wrote: > > The we have the sequence in your patch. This seems to be related to > > some of the MS wording, but I'm not sure what for example tearing down > > the queues buys us. Can you explain a bit more where those bits > > make a difference? > > Based on my testing if queues (IRQ) are not disabled, NVMe controller won’t > be quiesced. > Symptoms can be high power drain or system freeze. > > I can check with vendors whether this also necessary under Windows. Hm, that doesn't sound right based on the spec's description of how this feature works. We should not need to tear down IO queues for entering low power, nor reset the controller to exit it. The sequence for entering non-operational low power should just be freeze request queues, set the power feature, then unfreeze request queues. We shouldn't have to do anything to exit the state as the spec requires devices autonomously return to operational when an IO doorbell is written. ^ permalink raw reply [flat|nested] 60+ messages in thread
end of thread, other threads:[~2019-05-10 15:49 UTC | newest]
Thread overview: 60+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20190508185955.11406-1-kai.heng.feng@canonical.com>
[not found] ` <20190508191624.GA8365@localhost.localdomain>
[not found] ` <3CDA9F13-B17C-456F-8CE1-3A63C6E0DC8F@canonical.com>
[not found] ` <f8a043b00909418bad6adcdb62d16e6e@AUSX13MPC105.AMER.DELL.COM>
[not found] ` <20190508195159.GA1530@lst.de>
[not found] ` <b43f2c0078f245398101fa9a40cfc2dc@AUSX13MPC105.AMER.DELL.COM>
[not found] ` <20190509061237.GA15229@lst.de>
2019-05-09 6:48 ` [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle Kai-Heng Feng
2019-05-09 6:48 ` Kai-Heng Feng
2019-05-09 6:52 ` Christoph Hellwig
2019-05-09 6:52 ` Christoph Hellwig
2019-05-09 9:19 ` Rafael J. Wysocki
2019-05-09 9:19 ` Rafael J. Wysocki
2019-05-09 9:25 ` Christoph Hellwig
2019-05-09 9:25 ` Christoph Hellwig
2019-05-09 20:48 ` Rafael J. Wysocki
2019-05-09 20:48 ` Rafael J. Wysocki
2019-05-09 9:07 ` Rafael J. Wysocki
2019-05-09 9:07 ` Rafael J. Wysocki
2019-05-09 9:42 ` Kai-Heng Feng
2019-05-09 9:42 ` Kai-Heng Feng
2019-05-09 9:56 ` Christoph Hellwig
2019-05-09 9:56 ` Christoph Hellwig
2019-05-09 10:28 ` Kai-Heng Feng
2019-05-09 10:28 ` Kai-Heng Feng
2019-05-09 10:31 ` Christoph Hellwig
2019-05-09 10:31 ` Christoph Hellwig
2019-05-09 11:59 ` Kai-Heng Feng
2019-05-09 11:59 ` Kai-Heng Feng
2019-05-09 18:57 ` Mario.Limonciello
2019-05-09 18:57 ` Mario.Limonciello
2019-05-09 19:28 ` Keith Busch
2019-05-09 19:28 ` Keith Busch
2019-05-09 20:54 ` Rafael J. Wysocki
2019-05-09 20:54 ` Rafael J. Wysocki
2019-05-09 21:16 ` Keith Busch
2019-05-09 21:16 ` Keith Busch
2019-05-09 21:39 ` Rafael J. Wysocki
2019-05-09 21:39 ` Rafael J. Wysocki
2019-05-09 21:37 ` Mario.Limonciello
2019-05-09 21:37 ` Mario.Limonciello
2019-05-09 21:54 ` Keith Busch
2019-05-09 21:54 ` Keith Busch
2019-05-09 22:19 ` Mario.Limonciello
2019-05-09 22:19 ` Mario.Limonciello
2019-05-10 6:05 ` Kai-Heng Feng
2019-05-10 6:05 ` Kai-Heng Feng
2019-05-10 8:23 ` Rafael J. Wysocki
2019-05-10 8:23 ` Rafael J. Wysocki
2019-05-10 13:52 ` Keith Busch
2019-05-10 13:52 ` Keith Busch
2019-05-10 15:15 ` Kai Heng Feng
2019-05-10 15:15 ` Kai Heng Feng
2019-05-10 15:36 ` Keith Busch
2019-05-10 15:36 ` Keith Busch
2019-05-10 14:02 ` Keith Busch
2019-05-10 14:02 ` Keith Busch
2019-05-10 15:18 ` Kai Heng Feng
2019-05-10 15:18 ` Kai Heng Feng
2019-05-10 15:49 ` hch
2019-05-10 15:49 ` hch
2019-05-10 5:30 ` Christoph Hellwig
2019-05-10 5:30 ` Christoph Hellwig
2019-05-10 13:51 ` Keith Busch
2019-05-10 13:51 ` Keith Busch
2019-05-09 16:20 ` Keith Busch
2019-05-09 16:20 ` Keith Busch
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox