* STANDBY IMMEDIATE failed on NVIDIA MCP5x controllers when system suspend @ 2013-03-11 3:42 Aaron Lu 2013-03-11 8:49 ` James Bottomley 2013-03-11 20:01 ` Peter Dons Tychsen 0 siblings, 2 replies; 22+ messages in thread From: Aaron Lu @ 2013-03-11 3:42 UTC (permalink / raw) To: bladud, Joe Sapp, Alberto Mattea, Peter Dons Tychsen, Robert Hancock Cc: linux-ide@vger.kernel.org, Tejun Heo, Jeff Garzik Hi all, I've seen some reports on STANDBY IMMEDIATE failed on NVIDIA MCP5x controllers when system goes to suspend(this command is sent by scsi sd driver on system suspend as a SCSI STOP command, which is translated to STANDBY IMMEDIATE ATA command). I've no idea of why this happened, so I wrote this email in hope of getting some new idea. The related bug report: https://bugzilla.kernel.org/show_bug.cgi?id=48951 And google search showed that Peter reported a similar problem here: http://marc.info/?l=linux-ide&m=133534061316338&w=2 And bladud has found that, disable asyn suspend for the scsi target device can work around this problem. Please feel free to suggest what can possibly be the cause, thanks. -Aaron ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: STANDBY IMMEDIATE failed on NVIDIA MCP5x controllers when system suspend 2013-03-11 3:42 STANDBY IMMEDIATE failed on NVIDIA MCP5x controllers when system suspend Aaron Lu @ 2013-03-11 8:49 ` James Bottomley 2013-03-11 13:51 ` Aaron Lu 2013-03-11 14:35 ` Robert Hancock 2013-03-11 20:01 ` Peter Dons Tychsen 1 sibling, 2 replies; 22+ messages in thread From: James Bottomley @ 2013-03-11 8:49 UTC (permalink / raw) To: Aaron Lu Cc: bladud, Joe Sapp, Alberto Mattea, Peter Dons Tychsen, Robert Hancock, linux-ide@vger.kernel.org, Tejun Heo, Jeff Garzik, linux-scsi, Alan Stern On Mon, 2013-03-11 at 11:42 +0800, Aaron Lu wrote: > Hi all, > > I've seen some reports on STANDBY IMMEDIATE failed on NVIDIA MCP5x > controllers when system goes to suspend(this command is sent by scsi sd > driver on system suspend as a SCSI STOP command, which is translated to > STANDBY IMMEDIATE ATA command). I've no idea of why this happened, so > I wrote this email in hope of getting some new idea. > > The related bug report: > https://bugzilla.kernel.org/show_bug.cgi?id=48951 > > And google search showed that Peter reported a similar problem here: > http://marc.info/?l=linux-ide&m=133534061316338&w=2 > > And bladud has found that, disable asyn suspend for the scsi target > device can work around this problem. > > Please feel free to suggest what can possibly be the cause, thanks. I sometimes despair of people getting PM stuff right. What on earth is the point of refusing to suspend if the disk refuses to stop? In theory it gives the device more time to park its head, but almost no modern drive requires this. The next action suspend will take (if allowed) is to power down peripherals which will forcibly stop the device. The stop request is purely informational for the device. If it ignores it, then the bigger hammer still works. You really need to discriminate better between conditions we should and shouldn't care about for suspend. so in sd_suspend, we definitely care if we can't flush the cache of a write back device because the power off could lose data. We don't really care if the disk says I can't stop, so this is probably the correct fix. James --- diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index 7992635..384b621d 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -3079,7 +3079,11 @@ static int sd_suspend(struct device *dev) if (sdkp->device->manage_start_stop) { sd_printk(KERN_NOTICE, sdkp, "Stopping disk\n"); - ret = sd_start_stop_device(sdkp, 0); + /* + * this is informational for the disk we're going to power it + * off anyway, so don't bother about the return status + */ + sd_start_stop_device(sdkp, 0); } done: ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: STANDBY IMMEDIATE failed on NVIDIA MCP5x controllers when system suspend 2013-03-11 8:49 ` James Bottomley @ 2013-03-11 13:51 ` Aaron Lu 2013-03-11 14:34 ` James Bottomley 2013-03-11 14:35 ` Robert Hancock 1 sibling, 1 reply; 22+ messages in thread From: Aaron Lu @ 2013-03-11 13:51 UTC (permalink / raw) To: James Bottomley Cc: Aaron Lu, bladud, Joe Sapp, Alberto Mattea, Peter Dons Tychsen, Robert Hancock, linux-ide@vger.kernel.org, Tejun Heo, Jeff Garzik, linux-scsi, Alan Stern On 2013-03-11 16:49, James Bottomley wrote: > On Mon, 2013-03-11 at 11:42 +0800, Aaron Lu wrote: >> Hi all, >> >> I've seen some reports on STANDBY IMMEDIATE failed on NVIDIA MCP5x >> controllers when system goes to suspend(this command is sent by scsi sd >> driver on system suspend as a SCSI STOP command, which is translated to >> STANDBY IMMEDIATE ATA command). I've no idea of why this happened, so >> I wrote this email in hope of getting some new idea. >> >> The related bug report: >> https://bugzilla.kernel.org/show_bug.cgi?id=48951 >> >> And google search showed that Peter reported a similar problem here: >> http://marc.info/?l=linux-ide&m=133534061316338&w=2 >> >> And bladud has found that, disable asyn suspend for the scsi target >> device can work around this problem. >> >> Please feel free to suggest what can possibly be the cause, thanks. > > You really need to discriminate better between conditions we should and > shouldn't care about for suspend. so in sd_suspend, we definitely care > if we can't flush the cache of a write back device because the power off > could lose data. We don't really care if the disk says I can't stop, so > this is probably the correct fix. Yes, this will make suspend succeed, but the problem is still there - the drive does not respond to this command, while it will if async suspend for the target device is disabled. I don't think scsi code has any fault here, it feels more like a chip bug to me. And if we solve it this way, affected user may complain about long delay when suspending or uncomfortable with the error messages in dmesg, since that command would eventually timeout. So I hope we can nderstand what is going wrong here and perhaps do something to avoid it. Thanks, Aaron > > --- > diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c > index 7992635..384b621d 100644 > --- a/drivers/scsi/sd.c > +++ b/drivers/scsi/sd.c > @@ -3079,7 +3079,11 @@ static int sd_suspend(struct device *dev) > > if (sdkp->device->manage_start_stop) { > sd_printk(KERN_NOTICE, sdkp, "Stopping disk\n"); > - ret = sd_start_stop_device(sdkp, 0); > + /* > + * this is informational for the disk we're going to power it > + * off anyway, so don't bother about the return status > + */ > + sd_start_stop_device(sdkp, 0); > } > > done: > > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: STANDBY IMMEDIATE failed on NVIDIA MCP5x controllers when system suspend 2013-03-11 13:51 ` Aaron Lu @ 2013-03-11 14:34 ` James Bottomley 2013-03-11 15:00 ` Alan Stern 0 siblings, 1 reply; 22+ messages in thread From: James Bottomley @ 2013-03-11 14:34 UTC (permalink / raw) To: Aaron Lu Cc: Aaron Lu, bladud, Joe Sapp, Alberto Mattea, Peter Dons Tychsen, Robert Hancock, linux-ide@vger.kernel.org, Tejun Heo, Jeff Garzik, linux-scsi, Alan Stern On Mon, 2013-03-11 at 21:51 +0800, Aaron Lu wrote: > On 2013-03-11 16:49, James Bottomley wrote: > > On Mon, 2013-03-11 at 11:42 +0800, Aaron Lu wrote: > >> Hi all, > >> > >> I've seen some reports on STANDBY IMMEDIATE failed on NVIDIA MCP5x > >> controllers when system goes to suspend(this command is sent by scsi sd > >> driver on system suspend as a SCSI STOP command, which is translated to > >> STANDBY IMMEDIATE ATA command). I've no idea of why this happened, so > >> I wrote this email in hope of getting some new idea. > >> > >> The related bug report: > >> https://bugzilla.kernel.org/show_bug.cgi?id=48951 > >> > >> And google search showed that Peter reported a similar problem here: > >> http://marc.info/?l=linux-ide&m=133534061316338&w=2 > >> > >> And bladud has found that, disable asyn suspend for the scsi target > >> device can work around this problem. > >> > >> Please feel free to suggest what can possibly be the cause, thanks. > > > > You really need to discriminate better between conditions we should and > > shouldn't care about for suspend. so in sd_suspend, we definitely care > > if we can't flush the cache of a write back device because the power off > > could lose data. We don't really care if the disk says I can't stop, so > > this is probably the correct fix. > > Yes, this will make suspend succeed, but the problem is still there - > the drive does not respond to this command, while it will if async > suspend for the target device is disabled. I don't think scsi code has > any fault here, it feels more like a chip bug to me. > > And if we solve it this way, affected user may complain about long > delay when suspending or uncomfortable with the error messages in dmesg, > since that command would eventually timeout. So I hope we can nderstand > what is going wrong here and perhaps do something to avoid it. Oh, that seems to be the suspend order isn't careful enough. __device_suspend() waits for its children, but the host disk are too far separated in the device tree. If the immediate children of the host are all sync, that wait never actually waits for anything. James ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: STANDBY IMMEDIATE failed on NVIDIA MCP5x controllers when system suspend 2013-03-11 14:34 ` James Bottomley @ 2013-03-11 15:00 ` Alan Stern 2013-03-12 2:53 ` Aaron Lu 0 siblings, 1 reply; 22+ messages in thread From: Alan Stern @ 2013-03-11 15:00 UTC (permalink / raw) To: James Bottomley Cc: Aaron Lu, Aaron Lu, bladud, Joe Sapp, Alberto Mattea, Peter Dons Tychsen, Robert Hancock, linux-ide@vger.kernel.org, Tejun Heo, Jeff Garzik, linux-scsi On Mon, 11 Mar 2013, James Bottomley wrote: > Oh, that seems to be the suspend order isn't careful enough. > __device_suspend() waits for its children, but the host disk are too far > separated in the device tree. If the immediate children of the host are > all sync, that wait never actually waits for anything. I was going to make exactly this same point. During async suspend, the PM core is careful to make sure that no device is suspended before its children. But there aren't any other checks, so if device A isn't an ancestor of device B then it's possible for async suspend to power down A before B. This can cause problems if B needs A to be active while B is suspending. Does the ATA system have any non-ancestor dependencies like this? If it does, the appropriate driver can be fixed to take them into account. Alan Stern ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: STANDBY IMMEDIATE failed on NVIDIA MCP5x controllers when system suspend 2013-03-11 15:00 ` Alan Stern @ 2013-03-12 2:53 ` Aaron Lu 2013-03-12 12:10 ` James Bottomley 2013-03-12 15:08 ` Alan Stern 0 siblings, 2 replies; 22+ messages in thread From: Aaron Lu @ 2013-03-12 2:53 UTC (permalink / raw) To: Alan Stern, James Bottomley Cc: Aaron Lu, bladud, Joe Sapp, Alberto Mattea, Peter Dons Tychsen, Robert Hancock, linux-ide@vger.kernel.org, Tejun Heo, Jeff Garzik, linux-scsi Hi James and Alan, On 03/11/2013 11:00 PM, Alan Stern wrote: > On Mon, 11 Mar 2013, James Bottomley wrote: > >> Oh, that seems to be the suspend order isn't careful enough. >> __device_suspend() waits for its children, but the host disk are too far >> separated in the device tree. If the immediate children of the host are >> all sync, that wait never actually waits for anything. > > I was going to make exactly this same point. During async suspend, the > PM core is careful to make sure that no device is suspended before its > children. But there aren't any other checks, so if device A isn't an > ancestor of device B then it's possible for async suspend to power down > A before B. This can cause problems if B needs A to be active while B > is suspending. Thanks for the suggestions. > > Does the ATA system have any non-ancestor dependencies like this? If > it does, the appropriate driver can be fixed to take them into account. I don't think there is, and the relationship is like this: ata_host_controller* (named sata_nv xxx) | ata_port* (named atax, while "ata_port atax" is another device) / \ scsi_host ata_link | | scsi_target ata_device | scsi_device* (named sd x:x:x:x) With the devices that have actual PM operation functions defined have the asterisk next to it. So ata_host_controller waits for all of the ata_ports, and the ata_port waits for both scsi_host and ata_link. scsi_host waits for scsi_target, and scsi_target waits for scsi_device. So if scsi_device is not done, ata_port will not start. Doesn't look like a problem to me. And from the log: https://bugzilla.kernel.org/attachment.cgi?id=95101 It also looks like the order is correct. Thanks, Aaron ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: STANDBY IMMEDIATE failed on NVIDIA MCP5x controllers when system suspend 2013-03-12 2:53 ` Aaron Lu @ 2013-03-12 12:10 ` James Bottomley 2013-03-12 13:46 ` Aaron Lu 2013-03-12 15:08 ` Alan Stern 1 sibling, 1 reply; 22+ messages in thread From: James Bottomley @ 2013-03-12 12:10 UTC (permalink / raw) To: Aaron Lu Cc: Alan Stern, Aaron Lu, bladud, Joe Sapp, Alberto Mattea, Peter Dons Tychsen, Robert Hancock, linux-ide@vger.kernel.org, Tejun Heo, Jeff Garzik, linux-scsi On Tue, 2013-03-12 at 10:53 +0800, Aaron Lu wrote: > Hi James and Alan, > > On 03/11/2013 11:00 PM, Alan Stern wrote: > > On Mon, 11 Mar 2013, James Bottomley wrote: > > > >> Oh, that seems to be the suspend order isn't careful enough. > >> __device_suspend() waits for its children, but the host disk are too far > >> separated in the device tree. If the immediate children of the host are > >> all sync, that wait never actually waits for anything. > > > > I was going to make exactly this same point. During async suspend, the > > PM core is careful to make sure that no device is suspended before its > > children. But there aren't any other checks, so if device A isn't an > > ancestor of device B then it's possible for async suspend to power down > > A before B. This can cause problems if B needs A to be active while B > > is suspending. > > Thanks for the suggestions. > > > > > Does the ATA system have any non-ancestor dependencies like this? If > > it does, the appropriate driver can be fixed to take them into account. > > I don't think there is, and the relationship is like this: > > ata_host_controller* (named sata_nv xxx) > | > ata_port* (named atax, while "ata_port atax" is another device) > / \ > scsi_host ata_link > | | > scsi_target ata_device > | > scsi_device* (named sd x:x:x:x) > > With the devices that have actual PM operation functions defined have > the asterisk next to it. > > So ata_host_controller waits for all of the ata_ports, and the ata_port > waits for both scsi_host and ata_link. scsi_host waits for scsi_target, > and scsi_target waits for scsi_device. So if scsi_device is not done, > ata_port will not start. Doesn't look like a problem to me. > > And from the log: > https://bugzilla.kernel.org/attachment.cgi?id=95101 > It also looks like the order is correct. That's not what that log appears to say. Here are the relevant bits [ 7377.813634] sd 2:0:0:0: async_suspend: scheduled [ 7377.813636] sd 2:0:0:0: __device_suspend: starts [ 7377.813639] sd 2:0:0:0: [sdb] Synchronizing SCSI cache ... so now we've begun suspend] [ 7377.813750] sd 2:0:0:0: [sdb] Stopping disk [... here we send STANDBY IMMEDIATE ] [ 7378.237627] sata_nv 0000:00:05.2: async_suspend: scheduled [ 7378.237631] sata_nv 0000:00:05.2: __device_suspend: starts [... we begin to shut down the host ] [ 7378.249333] sata_nv 0000:00:05.2: __device_suspend: done [... host shutdown complete ] [ 7408.372642] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen [ 7408.372647] ata3.00: failed command: STANDBY IMMEDIATE [ ... command times out ] [ 7408.870675] dpm_run_callback(): scsi_bus_suspend+0x0/0x20 [scsi_mod] returns 134217730 [ 7408.870681] sd 2:0:0:0: __device_suspend: done We shut down the host controller before the command completed. This appears to cause the timeout James ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: STANDBY IMMEDIATE failed on NVIDIA MCP5x controllers when system suspend 2013-03-12 12:10 ` James Bottomley @ 2013-03-12 13:46 ` Aaron Lu 0 siblings, 0 replies; 22+ messages in thread From: Aaron Lu @ 2013-03-12 13:46 UTC (permalink / raw) To: James Bottomley Cc: Alan Stern, Aaron Lu, bladud, Joe Sapp, Alberto Mattea, Peter Dons Tychsen, Robert Hancock, linux-ide@vger.kernel.org, Tejun Heo, Jeff Garzik, linux-scsi On 03/12/2013 08:10 PM, James Bottomley wrote: > On Tue, 2013-03-12 at 10:53 +0800, Aaron Lu wrote: >> Hi James and Alan, >> >> On 03/11/2013 11:00 PM, Alan Stern wrote: >>> On Mon, 11 Mar 2013, James Bottomley wrote: >>> >>>> Oh, that seems to be the suspend order isn't careful enough. >>>> __device_suspend() waits for its children, but the host disk are too far >>>> separated in the device tree. If the immediate children of the host are >>>> all sync, that wait never actually waits for anything. >>> >>> I was going to make exactly this same point. During async suspend, the >>> PM core is careful to make sure that no device is suspended before its >>> children. But there aren't any other checks, so if device A isn't an >>> ancestor of device B then it's possible for async suspend to power down >>> A before B. This can cause problems if B needs A to be active while B >>> is suspending. >> >> Thanks for the suggestions. >> >>> >>> Does the ATA system have any non-ancestor dependencies like this? If >>> it does, the appropriate driver can be fixed to take them into account. >> >> I don't think there is, and the relationship is like this: >> >> ata_host_controller* (named sata_nv xxx) >> | >> ata_port* (named atax, while "ata_port atax" is another device) >> / \ >> scsi_host ata_link >> | | >> scsi_target ata_device >> | >> scsi_device* (named sd x:x:x:x) >> >> With the devices that have actual PM operation functions defined have >> the asterisk next to it. >> >> So ata_host_controller waits for all of the ata_ports, and the ata_port >> waits for both scsi_host and ata_link. scsi_host waits for scsi_target, >> and scsi_target waits for scsi_device. So if scsi_device is not done, >> ata_port will not start. Doesn't look like a problem to me. >> >> And from the log: >> https://bugzilla.kernel.org/attachment.cgi?id=95101 >> It also looks like the order is correct. > > That's not what that log appears to say. Here are the relevant bits > > [ 7377.813634] sd 2:0:0:0: async_suspend: scheduled > [ 7377.813636] sd 2:0:0:0: __device_suspend: starts > [ 7377.813639] sd 2:0:0:0: [sdb] Synchronizing SCSI cache > ... so now we've begun suspend] > [ 7377.813750] sd 2:0:0:0: [sdb] Stopping disk > [... here we send STANDBY IMMEDIATE ] > [ 7378.237627] sata_nv 0000:00:05.2: async_suspend: scheduled > [ 7378.237631] sata_nv 0000:00:05.2: __device_suspend: starts > [... we begin to shut down the host ] > [ 7378.249333] sata_nv 0000:00:05.2: __device_suspend: done > [... host shutdown complete ] I think sata_nv 0000:00:05.0 is the host controller for sd 2:0:0:0, and sata_nv 0000:00:05.1 is the host controller for sd 4:0:0:0. I've asked bladud@gmail.com to attach the full dmesg, which can make it easier for us to decide which port belongs to which host controller. Note that this system has multiple ata host controllers. Thanks, Aaron > [ 7408.372642] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen > [ 7408.372647] ata3.00: failed command: STANDBY IMMEDIATE > [ ... command times out ] > [ 7408.870675] dpm_run_callback(): scsi_bus_suspend+0x0/0x20 [scsi_mod] returns 134217730 > [ 7408.870681] sd 2:0:0:0: __device_suspend: done > > We shut down the host controller before the command completed. This > appears to cause the timeout > > James > > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: STANDBY IMMEDIATE failed on NVIDIA MCP5x controllers when system suspend 2013-03-12 2:53 ` Aaron Lu 2013-03-12 12:10 ` James Bottomley @ 2013-03-12 15:08 ` Alan Stern 1 sibling, 0 replies; 22+ messages in thread From: Alan Stern @ 2013-03-12 15:08 UTC (permalink / raw) To: Aaron Lu Cc: James Bottomley, Aaron Lu, bladud, Joe Sapp, Alberto Mattea, Peter Dons Tychsen, Robert Hancock, linux-ide@vger.kernel.org, Tejun Heo, Jeff Garzik, linux-scsi On Tue, 12 Mar 2013, Aaron Lu wrote: > So ata_host_controller waits for all of the ata_ports, and the ata_port > waits for both scsi_host and ata_link. scsi_host waits for scsi_target, > and scsi_target waits for scsi_device. So if scsi_device is not done, > ata_port will not start. Doesn't look like a problem to me. > > And from the log: > https://bugzilla.kernel.org/attachment.cgi?id=95101 > It also looks like the order is correct. You could compare the PM debugging logs for an async suspend and a synchronous suspend. There might be an obvious difference in the ordering that would explain the problem. Alan Stern ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: STANDBY IMMEDIATE failed on NVIDIA MCP5x controllers when system suspend 2013-03-11 8:49 ` James Bottomley 2013-03-11 13:51 ` Aaron Lu @ 2013-03-11 14:35 ` Robert Hancock 2013-03-11 14:51 ` James Bottomley 1 sibling, 1 reply; 22+ messages in thread From: Robert Hancock @ 2013-03-11 14:35 UTC (permalink / raw) To: James Bottomley Cc: Aaron Lu, bladud, Joe Sapp, Alberto Mattea, Peter Dons Tychsen, linux-ide@vger.kernel.org, Tejun Heo, Jeff Garzik, linux-scsi, Alan Stern (resending due to GMail mess-up, sorry) On Mon, Mar 11, 2013 at 2:49 AM, James Bottomley <James.Bottomley@hansenpartnership.com> wrote: > On Mon, 2013-03-11 at 11:42 +0800, Aaron Lu wrote: >> Hi all, >> >> I've seen some reports on STANDBY IMMEDIATE failed on NVIDIA MCP5x >> controllers when system goes to suspend(this command is sent by scsi sd >> driver on system suspend as a SCSI STOP command, which is translated to >> STANDBY IMMEDIATE ATA command). I've no idea of why this happened, so >> I wrote this email in hope of getting some new idea. >> >> The related bug report: >> https://bugzilla.kernel.org/show_bug.cgi?id=48951 >> >> And google search showed that Peter reported a similar problem here: >> http://marc.info/?l=linux-ide&m=133534061316338&w=2 >> >> And bladud has found that, disable asyn suspend for the scsi target >> device can work around this problem. >> >> Please feel free to suggest what can possibly be the cause, thanks. > > I sometimes despair of people getting PM stuff right. What on earth is > the point of refusing to suspend if the disk refuses to stop? In theory > it gives the device more time to park its head, but almost no modern > drive requires this. The next action suspend will take (if allowed) is > to power down peripherals which will forcibly stop the device. The stop > request is purely informational for the device. If it ignores it, then > the bigger hammer still works. This really does matter. Especially on laptop hard drives (but on many desktop ones as well), we really want to stop the drive, and therefore unload the heads, before the power is shut off. Drives are often rated for a much lower number of emergency head unloads (caused by power loss) over their lifespan than normal software-commanded ones. So over a long period of time, repeatedly failing to stop the drive before power off will shorten its life. Maybe aborting suspend for this is a bit harsh but it's not something that should be ignored. Just to add to this point, it doesn't just matter for rotational drives, either. A lot of SSDs use the "standby now" command to do some kind of cleanup before power off. Some drives document that a subsequent power up may take longer for the drive to be ready if it wasn't "spun down" prior to the last power off. It's conceivable that in some crappy drives, lack of proper power-off sequence could even cause data loss. The question is, at the point where we are stopping the drive, has anything been done to the ATA controller to suspend it? If so, that sounds like a definite bug somewhere. It has to be something we are doing on the kernel side, otherwise why would this work if async suspend was disabled? > > You really need to discriminate better between conditions we should and > shouldn't care about for suspend. so in sd_suspend, we definitely care > if we can't flush the cache of a write back device because the power off > could lose data. We don't really care if the disk says I can't stop, so > this is probably the correct fix. > > James > > --- > diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c > index 7992635..384b621d 100644 > --- a/drivers/scsi/sd.c > +++ b/drivers/scsi/sd.c > @@ -3079,7 +3079,11 @@ static int sd_suspend(struct device *dev) > > if (sdkp->device->manage_start_stop) { > sd_printk(KERN_NOTICE, sdkp, "Stopping disk\n"); > - ret = sd_start_stop_device(sdkp, 0); > + /* > + * this is informational for the disk we're going to power it > + * off anyway, so don't bother about the return status > + */ > + sd_start_stop_device(sdkp, 0); > } > > done: > > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: STANDBY IMMEDIATE failed on NVIDIA MCP5x controllers when system suspend 2013-03-11 14:35 ` Robert Hancock @ 2013-03-11 14:51 ` James Bottomley 2013-03-11 19:30 ` Robert Hancock 0 siblings, 1 reply; 22+ messages in thread From: James Bottomley @ 2013-03-11 14:51 UTC (permalink / raw) To: Robert Hancock Cc: Aaron Lu, bladud, Joe Sapp, Alberto Mattea, Peter Dons Tychsen, linux-ide@vger.kernel.org, Tejun Heo, Jeff Garzik, linux-scsi, Alan Stern On Mon, 2013-03-11 at 08:35 -0600, Robert Hancock wrote: > (resending due to GMail mess-up, sorry) > > On Mon, Mar 11, 2013 at 2:49 AM, James Bottomley > <James.Bottomley@hansenpartnership.com> wrote: > > On Mon, 2013-03-11 at 11:42 +0800, Aaron Lu wrote: > >> Hi all, > >> > >> I've seen some reports on STANDBY IMMEDIATE failed on NVIDIA MCP5x > >> controllers when system goes to suspend(this command is sent by scsi sd > >> driver on system suspend as a SCSI STOP command, which is translated to > >> STANDBY IMMEDIATE ATA command). I've no idea of why this happened, so > >> I wrote this email in hope of getting some new idea. > >> > >> The related bug report: > >> https://bugzilla.kernel.org/show_bug.cgi?id=48951 > >> > >> And google search showed that Peter reported a similar problem here: > >> http://marc.info/?l=linux-ide&m=133534061316338&w=2 > >> > >> And bladud has found that, disable asyn suspend for the scsi target > >> device can work around this problem. > >> > >> Please feel free to suggest what can possibly be the cause, thanks. > > > > I sometimes despair of people getting PM stuff right. What on earth is > > the point of refusing to suspend if the disk refuses to stop? In theory > > it gives the device more time to park its head, but almost no modern > > drive requires this. The next action suspend will take (if allowed) is > > to power down peripherals which will forcibly stop the device. The stop > > request is purely informational for the device. If it ignores it, then > > the bigger hammer still works. > > This really does matter. Especially on laptop hard drives (but on many > desktop ones as well), we really want to stop the drive, and therefore > unload the heads, before the power is shut off. Drives are often rated > for a much lower number of emergency head unloads (caused by power > loss) over their lifespan than normal software-commanded ones. So over > a long period of time, repeatedly failing to stop the drive before > power off will shorten its life. Maybe aborting suspend for this is a > bit harsh but it's not something that should be ignored. Where do you get this information from? The only smart parameter tracking this is the power off retract count (it may originally have been the emergency head retract count, but it was renamed a while ago), and that happens for any head unload, however caused. I know SCSI devices long ago ceased caring about this, because the in-drive capacitance is sufficient to achieve a head unload before the device completely spins down in forced power off (I admit the really old IDE devices ... the ones that required the OS to do everything did have a nasty habit of crashing their heads onto the surface, but they stopped being manufactured years ago) ... I really don't see how any modern SATA device would fail to do this. > Just to add to this point, it doesn't just matter for rotational > drives, either. A lot of SSDs use the "standby now" command to do some > kind of cleanup before power off. Some drives document that a > subsequent power up may take longer for the drive to be ready if it > wasn't "spun down" prior to the last power off. It's conceivable that > in some crappy drives, lack of proper power-off sequence could even > cause data loss. Well, no, they shouldn't ... not unless the drive violates its data integrity commitment, which is going to cause a whole load of FS corruption. All our Jornalled FS guarantees rely on this data integrity commitment. If they're violated, we have a whole boat load of data safety issues. James ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: STANDBY IMMEDIATE failed on NVIDIA MCP5x controllers when system suspend 2013-03-11 14:51 ` James Bottomley @ 2013-03-11 19:30 ` Robert Hancock 2013-03-12 16:34 ` Mark Lord 0 siblings, 1 reply; 22+ messages in thread From: Robert Hancock @ 2013-03-11 19:30 UTC (permalink / raw) To: James Bottomley Cc: Aaron Lu, bladud, Joe Sapp, Alberto Mattea, Peter Dons Tychsen, linux-ide@vger.kernel.org, Tejun Heo, Jeff Garzik, linux-scsi, Alan Stern On Mon, Mar 11, 2013 at 8:51 AM, James Bottomley <James.Bottomley@hansenpartnership.com> wrote: > On Mon, 2013-03-11 at 08:35 -0600, Robert Hancock wrote: >> (resending due to GMail mess-up, sorry) >> >> On Mon, Mar 11, 2013 at 2:49 AM, James Bottomley >> <James.Bottomley@hansenpartnership.com> wrote: >> > On Mon, 2013-03-11 at 11:42 +0800, Aaron Lu wrote: >> >> Hi all, >> >> >> >> I've seen some reports on STANDBY IMMEDIATE failed on NVIDIA MCP5x >> >> controllers when system goes to suspend(this command is sent by scsi sd >> >> driver on system suspend as a SCSI STOP command, which is translated to >> >> STANDBY IMMEDIATE ATA command). I've no idea of why this happened, so >> >> I wrote this email in hope of getting some new idea. >> >> >> >> The related bug report: >> >> https://bugzilla.kernel.org/show_bug.cgi?id=48951 >> >> >> >> And google search showed that Peter reported a similar problem here: >> >> http://marc.info/?l=linux-ide&m=133534061316338&w=2 >> >> >> >> And bladud has found that, disable asyn suspend for the scsi target >> >> device can work around this problem. >> >> >> >> Please feel free to suggest what can possibly be the cause, thanks. >> > >> > I sometimes despair of people getting PM stuff right. What on earth is >> > the point of refusing to suspend if the disk refuses to stop? In theory >> > it gives the device more time to park its head, but almost no modern >> > drive requires this. The next action suspend will take (if allowed) is >> > to power down peripherals which will forcibly stop the device. The stop >> > request is purely informational for the device. If it ignores it, then >> > the bigger hammer still works. >> >> This really does matter. Especially on laptop hard drives (but on many >> desktop ones as well), we really want to stop the drive, and therefore >> unload the heads, before the power is shut off. Drives are often rated >> for a much lower number of emergency head unloads (caused by power >> loss) over their lifespan than normal software-commanded ones. So over >> a long period of time, repeatedly failing to stop the drive before >> power off will shorten its life. Maybe aborting suspend for this is a >> bit harsh but it's not something that should be ignored. > > Where do you get this information from? The only smart parameter > tracking this is the power off retract count (it may originally have > been the emergency head retract count, but it was renamed a while ago), > and that happens for any head unload, however caused. I know SCSI > devices long ago ceased caring about this, because the in-drive > capacitance is sufficient to achieve a head unload before the device > completely spins down in forced power off (I admit the really old IDE > devices ... the ones that required the OS to do everything did have a > nasty habit of crashing their heads onto the surface, but they stopped > being manufactured years ago) ... I really don't see how any modern SATA > device would fail to do this. They do still unload the heads if you cut power (I seem to recall hearing that some of them actually used the rotational energy in the spinning platters as a power source to do this). But it's not nearly as nice for the drive as the normal commanded unload. You can kind of tell just from the sound of it that it's not really the same at all. Many drives do track these separately in SMART attributes. The one for normal unloads is Load_Cycle_Count, the other is Power-Off_Retract_Count. The drive on the machine I'm on right now, for example, has a Load_Cycle_Count of 88 and Power-Off_Retract_Count of 27. I haven't seen what the actual cycle limits are for these but I'm willing to bet that the power-off retract count limit is a lot lower. > >> Just to add to this point, it doesn't just matter for rotational >> drives, either. A lot of SSDs use the "standby now" command to do some >> kind of cleanup before power off. Some drives document that a >> subsequent power up may take longer for the drive to be ready if it >> wasn't "spun down" prior to the last power off. It's conceivable that >> in some crappy drives, lack of proper power-off sequence could even >> cause data loss. > > Well, no, they shouldn't ... not unless the drive violates its data > integrity commitment, which is going to cause a whole load of FS > corruption. All our Jornalled FS guarantees rely on this data integrity > commitment. If they're violated, we have a whole boat load of data > safety issues. I'm sure there are some SSDs that do violate their data integrity commitments - a while ago some tests were done (don't have a link handy and I don't think they identified the actual vendor/model of the drives in any case) but there were definitely some SSDs that did do nasty things like trash unrelated data if power was lost while writing, etc. So we definitely don't want to risk this sort of thing occurring on a normal power-off or suspend. There's a reference to this in SMART here, for example: http://www.thomas-krenn.com/en/wiki/SMART_attributes_of_Intel_SSDs They refer to what's normally called "Power-off Retract Count" for HDDs as "Unsafe Shutdown Count ": "The raw value reports the cumulative number of unsafe (unclean) shutdown events over the life of the device. An unsafe shutdown occurs whenever the device is powered off without STANDBY IMMEDIATE being the last command." ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: STANDBY IMMEDIATE failed on NVIDIA MCP5x controllers when system suspend 2013-03-11 19:30 ` Robert Hancock @ 2013-03-12 16:34 ` Mark Lord 0 siblings, 0 replies; 22+ messages in thread From: Mark Lord @ 2013-03-12 16:34 UTC (permalink / raw) To: Robert Hancock Cc: James Bottomley, Aaron Lu, bladud, Joe Sapp, Alberto Mattea, Peter Dons Tychsen, linux-ide@vger.kernel.org, Tejun Heo, Jeff Garzik, linux-scsi, Alan Stern On 13-03-11 03:30 PM, Robert Hancock wrote: .. > I'm sure there are some SSDs that do violate their data integrity > commitments - a while ago some tests were done (don't have a link > handy and I don't think they identified the actual vendor/model of the > drives in any case) but there were definitely some SSDs that did do > nasty things like trash unrelated data if power was lost while > writing, etc. So we definitely don't want to risk this sort of thing > occurring on a normal power-off or suspend. Just about all current SATA SSDs advertise having some form(s) of "background garbage collection". That "feature" has always concerned me, because of the real possibility of power being removed (system shutdown) while the drive firmware is in the midst of re-shuffling my data around. One would hope that "STANDBY IMMEDIATE" is taken by the drive as a signal to stop that fussing about and prepare for full/sudden power-off. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: STANDBY IMMEDIATE failed on NVIDIA MCP5x controllers when system suspend 2013-03-11 3:42 STANDBY IMMEDIATE failed on NVIDIA MCP5x controllers when system suspend Aaron Lu 2013-03-11 8:49 ` James Bottomley @ 2013-03-11 20:01 ` Peter Dons Tychsen 2013-03-12 2:34 ` Aaron Lu 1 sibling, 1 reply; 22+ messages in thread From: Peter Dons Tychsen @ 2013-03-11 20:01 UTC (permalink / raw) To: Aaron Lu Cc: bladud, Joe Sapp, Alberto Mattea, Robert Hancock, linux-ide@vger.kernel.org, Tejun Heo, Jeff Garzik Hello, Sorry that i have not had time to investigate further. However, it would seem that the logs provided by bladud shows (as i see it) the problem (the logs Aaron requested). https://bugzilla.kernel.org/show_bug.cgi?id=48951 1) When running without async suspend, the disks are suspended *after* the sata_nv module. That makes sense, as we cannot expect the controller to handle commands after it has been suspended. 2) when running with async suspended, the disks are suspended *before* the sata_nv module. This does not make much sense. This would also explain certain strangeness (different outcome) in the bug, as effectively the disks are racing against their own controller! Hmmmm how is this supposed to work? Even though async suspend is enabled, the kernel must still have some dependencies that describe in which order things can be suspended. What is the correct way of telling the kernel (PM) that sata_nv cannot be suspended before all child disks are suspended? How do other drivers handle this? Any comments? /pedro On Mon, 2013-03-11 at 11:42 +0800, Aaron Lu wrote: > Hi all, > > I've seen some reports on STANDBY IMMEDIATE failed on NVIDIA MCP5x > controllers when system goes to suspend(this command is sent by scsi sd > driver on system suspend as a SCSI STOP command, which is translated to > STANDBY IMMEDIATE ATA command). I've no idea of why this happened, so > I wrote this email in hope of getting some new idea. > > The related bug report: > https://bugzilla.kernel.org/show_bug.cgi?id=48951 > > And google search showed that Peter reported a similar problem here: > http://marc.info/?l=linux-ide&m=133534061316338&w=2 > > And bladud has found that, disable asyn suspend for the scsi target > device can work around this problem. > > Please feel free to suggest what can possibly be the cause, thanks. > > -Aaron > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: STANDBY IMMEDIATE failed on NVIDIA MCP5x controllers when system suspend 2013-03-11 20:01 ` Peter Dons Tychsen @ 2013-03-12 2:34 ` Aaron Lu 2013-03-12 23:21 ` Peter Dons Tychsen 0 siblings, 1 reply; 22+ messages in thread From: Aaron Lu @ 2013-03-12 2:34 UTC (permalink / raw) To: Peter Dons Tychsen Cc: bladud, Joe Sapp, Alberto Mattea, Robert Hancock, linux-ide@vger.kernel.org, Tejun Heo, Jeff Garzik Hi Peter, On 03/12/2013 04:01 AM, Peter Dons Tychsen wrote: > Hello, > > Sorry that i have not had time to investigate further. Never mind, I just want to show the problem to all affected people and think perhaps we can get some ideas :-) > > However, it would seem that the logs provided by bladud shows (as i see > it) the problem (the logs Aaron requested). > > https://bugzilla.kernel.org/show_bug.cgi?id=48951 > > 1) When running without async suspend, the disks are suspended *after* > the sata_nv module. That makes sense, as we cannot expect the controller > to handle commands after it has been suspended. > > 2) when running with async suspended, the disks are suspended *before* > the sata_nv module. This does not make much sense. This would also > explain certain strangeness (different outcome) in the bug, as > effectively the disks are racing against their own controller! I don't see this is the case. The device that is responsible for PM operations for the disk is named something like sd X:0:0:0, where X is the host number, and is normally attached under the ata port number+1. So sd 2:0:0:0 is attached to ata port 3(if there are USB disks, things may change). And the ata port device is reponsible for ata port level PM operations like power off the port etc., and its device is named as atax, where x is its port number. So you can see from the log, only after sd 4:0:0:0 and sd 2:0:0:0 failed in their ->suspend function, ata5 and ata3's suspend functions start to run, and then sata_nv 0000:00:05.1 starts to run. And Bladud, Please attach the *full* dmesg into the bugzilla page, not just the PM part, that would make it easier for us to decide which port belongs to which controller, thanks. > > Hmmmm how is this supposed to work? Even though async suspend is > enabled, the kernel must still have some dependencies that describe in > which order things can be suspended. What is the correct way of telling > the kernel (PM) that sata_nv cannot be suspended before all child disks > are suspended? How do other drivers handle this? PM core ensure that when a device's suspend function is to run, all its children devices' suspend callback functions will have to be completed, or it will wait. And here is a simplified device relationship: ata_host_controller* (named sata_nv xxx) | ata_port* (named atax, while "ata_port atax" is another device) / \ scsi_host ata_link | | scsi_target ata_device | scsi_device* (named sd x:x:x:x) With the devices that have actual PM operation functions defined have the asterisk next to it. Thanks, Aaron > > Any comments? > > /pedro > > > On Mon, 2013-03-11 at 11:42 +0800, Aaron Lu wrote: >> Hi all, >> >> I've seen some reports on STANDBY IMMEDIATE failed on NVIDIA MCP5x >> controllers when system goes to suspend(this command is sent by scsi sd >> driver on system suspend as a SCSI STOP command, which is translated to >> STANDBY IMMEDIATE ATA command). I've no idea of why this happened, so >> I wrote this email in hope of getting some new idea. >> >> The related bug report: >> https://bugzilla.kernel.org/show_bug.cgi?id=48951 >> >> And google search showed that Peter reported a similar problem here: >> http://marc.info/?l=linux-ide&m=133534061316338&w=2 >> >> And bladud has found that, disable asyn suspend for the scsi target >> device can work around this problem. >> >> Please feel free to suggest what can possibly be the cause, thanks. >> >> -Aaron >> > > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: STANDBY IMMEDIATE failed on NVIDIA MCP5x controllers when system suspend 2013-03-12 2:34 ` Aaron Lu @ 2013-03-12 23:21 ` Peter Dons Tychsen 2013-03-13 2:10 ` Robert Hancock 0 siblings, 1 reply; 22+ messages in thread From: Peter Dons Tychsen @ 2013-03-12 23:21 UTC (permalink / raw) To: Aaron Lu Cc: bladud, Joe Sapp, Alberto Mattea, Robert Hancock, linux-ide@vger.kernel.org, Tejun Heo, Jeff Garzik Hello again, On Tue, 2013-03-12 at 10:34 +0800, Aaron Lu wrote: > The device that is responsible for PM operations for the disk is named > something like sd X:0:0:0, where X is the host number, and is normally > attached under the ata port number+1. So sd 2:0:0:0 is attached to ata > port 3(if there are USB disks, things may change). And the ata port > device is reponsible for ata port level PM operations like power off > the > port etc., and its device is named as atax, where x is its port > number. > So you can see from the log, only after sd 4:0:0:0 and sd 2:0:0:0 > failed > in their ->suspend function, ata5 and ata3's suspend functions start > to > run, and then sata_nv 0000:00:05.1 starts to run. Yes, but 1 thing completes before disks are suspended: [ 7378.237631] sata_nv 0000:00:05.2: __device_suspend: starts [ 7378.249333] sata_nv 0000:00:05.2: __device_suspend: done And then sd 2:0:0:0 starts to suspend. This might be correct seen from a dependency point of view. But what if the HW designer was a bit lazy (or saved a bit of logic), and performs some sort of shutdown when the first function (0000:00:05.2) is suspended? That would require all disks attached to be dependent on all functions. If it is the case (which would not surprise me), is it possible to describe such dependencies in the PM? Thanks, /pedro ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: STANDBY IMMEDIATE failed on NVIDIA MCP5x controllers when system suspend 2013-03-12 23:21 ` Peter Dons Tychsen @ 2013-03-13 2:10 ` Robert Hancock 2013-03-13 2:36 ` Aaron Lu 0 siblings, 1 reply; 22+ messages in thread From: Robert Hancock @ 2013-03-13 2:10 UTC (permalink / raw) To: donpedro Cc: Aaron Lu, bladud, Joe Sapp, Alberto Mattea, linux-ide@vger.kernel.org, Tejun Heo, Jeff Garzik On Tue, Mar 12, 2013 at 5:21 PM, Peter Dons Tychsen <donpedro@tdcadsl.dk> wrote: > Hello again, > > On Tue, 2013-03-12 at 10:34 +0800, Aaron Lu wrote: >> The device that is responsible for PM operations for the disk is named >> something like sd X:0:0:0, where X is the host number, and is normally >> attached under the ata port number+1. So sd 2:0:0:0 is attached to ata >> port 3(if there are USB disks, things may change). And the ata port >> device is reponsible for ata port level PM operations like power off >> the >> port etc., and its device is named as atax, where x is its port >> number. >> So you can see from the log, only after sd 4:0:0:0 and sd 2:0:0:0 >> failed >> in their ->suspend function, ata5 and ata3's suspend functions start >> to >> run, and then sata_nv 0000:00:05.1 starts to run. > > Yes, but 1 thing completes before disks are suspended: > > [ 7378.237631] sata_nv 0000:00:05.2: __device_suspend: starts > [ 7378.249333] sata_nv 0000:00:05.2: __device_suspend: done > > And then sd 2:0:0:0 starts to suspend. > > This might be correct seen from a dependency point of view. But what if > the HW designer was a bit lazy (or saved a bit of logic), and performs > some sort of shutdown when the first function (0000:00:05.2) is > suspended? That would require all disks attached to be dependent on all > functions. I am not sure that inter-function dependencies are the issue here (although given these NVIDIA controllers, anything is possible). Looking at the dmesg output posted on the Bugzilla report, it looks like although bladud's machine has multiple controller instances, Joe Sapp's only has one, so that can't be the problem there. I can't tell about bladud's machine but Joe Sapp's is using SWNCQ mode. Looking at the nv_swncq_port_suspend and nv_swncq_port_resume functions in sata_nv, they seem suspicious. They seem to act more globally than the port they are called for, doing things like disabling interrupts and SWNCQ mode for all ports on that host. It appears that if any of the ports on a host was suspended, all of them would be disabled. It looks like it would be fixable, though since there's no proper public docs on this chipset it would be done kind of blindly. > > If it is the case (which would not surprise me), is it possible to > describe such dependencies in the PM? ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: STANDBY IMMEDIATE failed on NVIDIA MCP5x controllers when system suspend 2013-03-13 2:10 ` Robert Hancock @ 2013-03-13 2:36 ` Aaron Lu 2013-03-13 4:50 ` Simeon Bird 0 siblings, 1 reply; 22+ messages in thread From: Aaron Lu @ 2013-03-13 2:36 UTC (permalink / raw) To: Robert Hancock, Kuan Luo, Peer Chen Cc: donpedro, bladud, Joe Sapp, Alberto Mattea, linux-ide@vger.kernel.org, Tejun Heo, Jeff Garzik On 03/13/2013 10:10 AM, Robert Hancock wrote: > On Tue, Mar 12, 2013 at 5:21 PM, Peter Dons Tychsen <donpedro@tdcadsl.dk> wrote: >> Hello again, >> >> On Tue, 2013-03-12 at 10:34 +0800, Aaron Lu wrote: >>> The device that is responsible for PM operations for the disk is named >>> something like sd X:0:0:0, where X is the host number, and is normally >>> attached under the ata port number+1. So sd 2:0:0:0 is attached to ata >>> port 3(if there are USB disks, things may change). And the ata port >>> device is reponsible for ata port level PM operations like power off >>> the >>> port etc., and its device is named as atax, where x is its port >>> number. >>> So you can see from the log, only after sd 4:0:0:0 and sd 2:0:0:0 >>> failed >>> in their ->suspend function, ata5 and ata3's suspend functions start >>> to >>> run, and then sata_nv 0000:00:05.1 starts to run. >> >> Yes, but 1 thing completes before disks are suspended: >> >> [ 7378.237631] sata_nv 0000:00:05.2: __device_suspend: starts >> [ 7378.249333] sata_nv 0000:00:05.2: __device_suspend: done >> >> And then sd 2:0:0:0 starts to suspend. >> >> This might be correct seen from a dependency point of view. But what if >> the HW designer was a bit lazy (or saved a bit of logic), and performs >> some sort of shutdown when the first function (0000:00:05.2) is >> suspended? That would require all disks attached to be dependent on all >> functions. > > I am not sure that inter-function dependencies are the issue here > (although given these NVIDIA controllers, anything is possible). > Looking at the dmesg output posted on the Bugzilla report, it looks > like although bladud's machine has multiple controller instances, Joe > Sapp's only has one, so that can't be the problem there. Agree. > > I can't tell about bladud's machine but Joe Sapp's is using SWNCQ Well, we definitely need bladud to give us the *full* dmesg :-) > mode. Looking at the nv_swncq_port_suspend and nv_swncq_port_resume > functions in sata_nv, they seem suspicious. They seem to act more > globally than the port they are called for, doing things like > disabling interrupts and SWNCQ mode for all ports on that host. It > appears that if any of the ports on a host was suspended, all of them > would be disabled. Glad to know this, thanks. I would like Joe to run the debug patch too, see if ata2 is suspended first, that may cause ata port 1, where the only disk is attached, stops working. > > It looks like it would be fixable, though since there's no proper > public docs on this chipset it would be done kind of blindly. +KuanLuo@nvidia and PeerChen@nvidia and pray their email addresses are still valid. Hi Kuan and Peer, We have seen bug report on suspend problem for NVIDIA MCP 5x controller here: https://bugzilla.kernel.org/show_bug.cgi?id=48951 And it looks like to us, if swncq is in use, the port suspend function will disable irq for all ports on this host. Is this the case? And if yes, can you please fix it? Thanks. Thanks, Aaron ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: STANDBY IMMEDIATE failed on NVIDIA MCP5x controllers when system suspend 2013-03-13 2:36 ` Aaron Lu @ 2013-03-13 4:50 ` Simeon Bird 2013-03-13 5:07 ` Aaron Lu 0 siblings, 1 reply; 22+ messages in thread From: Simeon Bird @ 2013-03-13 4:50 UTC (permalink / raw) To: Aaron Lu Cc: Robert Hancock, Kuan Luo, Peer Chen, donpedro, Joe Sapp, Alberto Mattea, linux-ide@vger.kernel.org, Tejun Heo, Jeff Garzik >> I can't tell about bladud's machine but Joe Sapp's is using SWNCQ > > Well, we definitely need bladud to give us the *full* dmesg :-) Ok, sorry. Full dmesg with debug info and successful suspend attached to bug report. Simeon ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: STANDBY IMMEDIATE failed on NVIDIA MCP5x controllers when system suspend 2013-03-13 4:50 ` Simeon Bird @ 2013-03-13 5:07 ` Aaron Lu 2013-03-13 5:16 ` Simeon Bird 0 siblings, 1 reply; 22+ messages in thread From: Aaron Lu @ 2013-03-13 5:07 UTC (permalink / raw) To: Simeon Bird Cc: Robert Hancock, Kuan Luo, Peer Chen, donpedro, Joe Sapp, Alberto Mattea, linux-ide@vger.kernel.org, Tejun Heo, Jeff Garzik On 03/13/2013 12:50 PM, Simeon Bird wrote: >>> I can't tell about bladud's machine but Joe Sapp's is using SWNCQ >> >> Well, we definitely need bladud to give us the *full* dmesg :-) > > Ok, sorry. Full dmesg with debug info and successful suspend attached > to bug report. Thanks. And also full dmesg for the failed case please, thanks. -Aaron > > Simeon > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: STANDBY IMMEDIATE failed on NVIDIA MCP5x controllers when system suspend 2013-03-13 5:07 ` Aaron Lu @ 2013-03-13 5:16 ` Simeon Bird 2013-03-13 5:41 ` Aaron Lu 0 siblings, 1 reply; 22+ messages in thread From: Simeon Bird @ 2013-03-13 5:16 UTC (permalink / raw) To: Aaron Lu Cc: Robert Hancock, Kuan Luo, Peer Chen, donpedro, Joe Sapp, Alberto Mattea, linux-ide@vger.kernel.org, Tejun Heo, Jeff Garzik > And also full dmesg for the failed case please, thanks. Done (but it is the same boot, so includes the successful case as a subset) ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: STANDBY IMMEDIATE failed on NVIDIA MCP5x controllers when system suspend 2013-03-13 5:16 ` Simeon Bird @ 2013-03-13 5:41 ` Aaron Lu 0 siblings, 0 replies; 22+ messages in thread From: Aaron Lu @ 2013-03-13 5:41 UTC (permalink / raw) To: Simeon Bird Cc: Robert Hancock, Kuan Luo, Peer Chen, donpedro, Joe Sapp, Alberto Mattea, linux-ide@vger.kernel.org, Tejun Heo, Jeff Garzik On 03/13/2013 01:16 PM, Simeon Bird wrote: >> And also full dmesg for the failed case please, thanks. > > Done (but it is the same boot, so includes the successful case as a subset) > OK, thanks. So here is the relationship: 0000:00:05.0 0000:00:05.1 0000:00:05.2 ata1 ata2 ata3 ata4 ata5 ata6 sd 0:0:0:0 sd 1:0:0:0 sd 2:0:0:0 sd 3:0:0:0 no disks attched and ata3's suspend callback is done before sd 3:0:0:0's suspend callback is finished, and as Robert analyzed, the port suspend callback masked all isr for the controller, which makes the STANDBY IMMEDIATE command for sd 3:0:0:0 timedout. The same is true for sd 0:0:0:0, where ata2 is done before its suspend callback. So I think Robert's analysis is right, there is a bug in nv_swncq_port_suspend. Thanks, Aaron ^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2013-03-13 5:39 UTC | newest] Thread overview: 22+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-03-11 3:42 STANDBY IMMEDIATE failed on NVIDIA MCP5x controllers when system suspend Aaron Lu 2013-03-11 8:49 ` James Bottomley 2013-03-11 13:51 ` Aaron Lu 2013-03-11 14:34 ` James Bottomley 2013-03-11 15:00 ` Alan Stern 2013-03-12 2:53 ` Aaron Lu 2013-03-12 12:10 ` James Bottomley 2013-03-12 13:46 ` Aaron Lu 2013-03-12 15:08 ` Alan Stern 2013-03-11 14:35 ` Robert Hancock 2013-03-11 14:51 ` James Bottomley 2013-03-11 19:30 ` Robert Hancock 2013-03-12 16:34 ` Mark Lord 2013-03-11 20:01 ` Peter Dons Tychsen 2013-03-12 2:34 ` Aaron Lu 2013-03-12 23:21 ` Peter Dons Tychsen 2013-03-13 2:10 ` Robert Hancock 2013-03-13 2:36 ` Aaron Lu 2013-03-13 4:50 ` Simeon Bird 2013-03-13 5:07 ` Aaron Lu 2013-03-13 5:16 ` Simeon Bird 2013-03-13 5:41 ` Aaron Lu
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).