* Analysis of EH on Andi's dying disk and stuff to discuss about
       [not found] <20080328093055.GA16736@basil.nowhere.org>
From: Tejun Heo @ 2008-03-29 7:16 UTC
To: Andi Kleen
Cc: Jeff Garzik, Alan Cox, Mark Lord, IDE/ATA development list, ric

Hello, all.

Andi Kleen wrote:
>
> I'm attaching them. They are huge, sorry.
>
> This was over multiple attempts with different kernels. Initially
> it failed just on mounting, then later also developed problems
> on scanning. I also tried to switch the port around so you see
> it moving. There were two identical disks on the box, only
> one failed.
>
> I think it started when I hard powered off the machine at some point,
> the result was a large corrupted chunk in the inode table on the
> disk (didn't Linus run into a similar problem recently?)

Heh.. that disk is completely toasted. Probing itself was okay.
Errors occur when someone tries to access the data on the platter -
reading the partition, udev trying to determine persistent names.
Several things to note.

(While writing, the message developed into discussion material, so I'm
cc'ing relevant people. The log is quite large and can be accessed from
http://htj.dyndns.org/export/libata-eh.log).

1. Currently the timeout for reads and writes is 30 secs, which is a bit
   too long. This long default timeout is one of the reasons why IO
   errors take so long to get detected and acted upon. I think it
   should be in the range of 10-15 seconds.

2. In the first error case in the log, the device goes offline after
   timing out. When the device keeps its link up but doesn't respond
   at all, libata takes slightly over 1 minute before it gives up.
   Combined with the initial 30 sec timeout, this can feel quite long.
   This timing is determined by the ata_eh_timeouts[] table in
   drivers/ata/libata-eh.c, and the current timeout table is the
   shortest it can get while allowing for the theoretical worst case
   with a bit of margin. There are several factors at play here.

   ATA resets are allowed to take up to 30 secs. Don't ask me why.
   That's the spec. This is to allow the device to postpone replying
   to reset while spinning up, which simply is a bad design.

   Waiting blindly for 30 + margin seconds on each try doesn't work
   too well because during hotplug or after PHY events, the reset
   protocol can get a bit unreliable and the response from the device
   can get lost. In addition, some devices might not respond to reset
   if it's issued before the device has indicated readiness (SRST), and
   some controllers can only wait for the initial readiness notification
   from the drive after SRST. The combined result is that even when
   everything is done right there are times when the driver misses
   reset completion.

   So, to handle the common cases better, libata EH times out resets
   quickly. The first two tries are 10 seconds each, and most devices
   get reset properly before the end of the second reset try even if
   they need to spin up. What takes the longest is the third try, for
   which the timeout is 35 secs. This is to allow for dumb devices which
   require a long silent period after reset is issued, and to have at
   least one reset try with the timeout suggested by the spec.
   I haven't actually seen such a device, and it could be that we are
   paying too much for a problem which doesn't exist.

   If we can lift the 35 sec reset try, we can give up resetting in
   slightly over 30 seconds. If we also reduce the command timeout, the
   whole thing from command issue to device disablement could be done
   in around 50 seconds.

3. Another possible source of delay is command retries after failure.
   sd currently sets the retry count to five, so every failed IO command
   is retried five times. I agree with Mark that there isn't much sense
   in retrying a command when the drive has already told us that it
   couldn't accomplish it due to a media problem. So, retrying commands
   that failed with a media error five times is probably not the best
   action to take.

What do you guys think?

Thanks.

--
tejun
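
The timing described in point 2 can be added up with a small sketch. The 30, 10, 10 and 35 second values are taken from the message above; the table and function names are hypothetical and are not the contents of ata_eh_timeouts[] in drivers/ata/libata-eh.c. The message says the reset phase takes slightly over a minute, so the three tries listed here should be read as a lower bound only.

```c
#include <stddef.h>

/* Reset try timeouts in seconds, as described in the message above.
 * Hypothetical table for illustration; not copied from libata-eh.c. */
static const unsigned int reset_tries_secs[] = { 10, 10, 35 };

/* Sum the first n entries of a timeout table. */
static unsigned int sum_secs(const unsigned int *tbl, size_t n)
{
	unsigned int total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += tbl[i];
	return total;
}

/* Lower bound on the time from command issue to giving up:
 * one command timeout followed by every reset try timing out. */
static unsigned int worst_case_secs(unsigned int cmd_timeout_secs,
				    const unsigned int *tbl, size_t n)
{
	return cmd_timeout_secs + sum_secs(tbl, n);
}
```

Under these assumptions, worst_case_secs(30, reset_tries_secs, 3) comes to 85 seconds: 30 for the command timeout plus 55 for the three reset tries.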
* Re: Analysis of EH on Andi's dying disk and stuff to discuss about
From: Ric Wheeler @ 2008-03-29 15:34 UTC
To: Tejun Heo
Cc: Andi Kleen, Jeff Garzik, Alan Cox, Mark Lord, IDE/ATA development list

Tejun Heo wrote:
> Hello, all.
>
> Andi Kleen wrote:
> >
> > I'm attaching them. They are huge, sorry.
> >
> > This was over multiple attempts with different kernels. Initially
> > it failed just on mounting, then later also developed problems
> > on scanning. I also tried to switch the port around so you see
> > it moving. There were two identical disks on the box, only
> > one failed.
> >
> > I think it started when I hard powered off the machine at some point,
> > the result was a large corrupted chunk in the inode table on the
> > disk (didn't Linus run into a similar problem recently?)
>
> Heh.. that disk is completely toasted. Probing itself was okay.
> Errors occur when someone tries to access the data on the platter -
> reading the partition, udev trying to determine persistent names.
> Several things to note.
>
> (While writing, the message developed into discussion material, so I'm
> cc'ing relevant people. The log is quite large and can be accessed from
> http://htj.dyndns.org/export/libata-eh.log).
>
> 1. Currently the timeout for reads and writes is 30 secs, which is a bit
>    too long. This long default timeout is one of the reasons why IO
>    errors take so long to get detected and acted upon. I think it
>    should be in the range of 10-15 seconds.

I agree that 10-15 seconds is a more reasonable default timeout.

For the extremely unusual case where the device does respond with
success after more than 15 seconds, what would it look like to us when
we have timed it out?

>
> 2. In the first error case in the log, the device goes offline after
>    timing out. When the device keeps its link up but doesn't respond
>    at all, libata takes slightly over 1 minute before it gives up.
>    Combined with the initial 30 sec timeout, this can feel quite long.
>    This timing is determined by the ata_eh_timeouts[] table in
>    drivers/ata/libata-eh.c, and the current timeout table is the
>    shortest it can get while allowing for the theoretical worst case
>    with a bit of margin. There are several factors at play here.
>
>    ATA resets are allowed to take up to 30 secs. Don't ask me why.
>    That's the spec. This is to allow the device to postpone replying
>    to reset while spinning up, which simply is a bad design.
>
>    Waiting blindly for 30 + margin seconds on each try doesn't work
>    too well because during hotplug or after PHY events, the reset
>    protocol can get a bit unreliable and the response from the device
>    can get lost. In addition, some devices might not respond to reset
>    if it's issued before the device has indicated readiness (SRST), and
>    some controllers can only wait for the initial readiness notification
>    from the drive after SRST. The combined result is that even when
>    everything is done right there are times when the driver misses
>    reset completion.
>
>    So, to handle the common cases better, libata EH times out resets
>    quickly. The first two tries are 10 seconds each, and most devices
>    get reset properly before the end of the second reset try even if
>    they need to spin up. What takes the longest is the third try, for
>    which the timeout is 35 secs. This is to allow for dumb devices which
>    require a long silent period after reset is issued, and to have at
>    least one reset try with the timeout suggested by the spec.
>    I haven't actually seen such a device, and it could be that we are
>    paying too much for a problem which doesn't exist.
>
>    If we can lift the 35 sec reset try, we can give up resetting in
>    slightly over 30 seconds. If we also reduce the command timeout, the
>    whole thing from command issue to device disablement could be done
>    in around 50 seconds.

I think that this is also reasonable. We should try to respond with a
failure in that 30 second window when we can.

>
> 3. Another possible source of delay is command retries after failure.
>    sd currently sets the retry count to five, so every failed IO command
>    is retried five times. I agree with Mark that there isn't much sense
>    in retrying a command when the drive has already told us that it
>    couldn't accomplish it due to a media problem. So, retrying commands
>    that failed with a media error five times is probably not the best
>    action to take.

I definitely agree with you and Mark on this - no reason to retry media
errors (or some other less popular errors). We run with the retry logic
neutered and have not seen an issue with a very large population of
S-ATA drives in the field...

>
> What do you guys think?
>
> Thanks.
>

One thought that is related to this is that we could really, really use
a target mode S-ATA (or ATA) device. I am pretty sure that some of the
Marvell parts support target mode. Their original (non-libata) driver
had target mode support coded in as well, if I remember correctly.

With that base, we could program the target driver to inject errors and
give us a much more complete testing of the error injection code. Maybe
even really test the debated error during CACHE_FLUSH sequence ;-)

It is really, really hard to find flaky drives that are not totally
dead, which means we are left using common sense and intuition around
this kind of thing...

ric
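
The retry policy discussed under point 3 could be sketched roughly as follows. This is a minimal illustration, not the sd or libata code: the error classes and should_retry() are hypothetical names, and the only value taken from the thread is the retry count of five attributed to sd.

```c
#include <stdbool.h>

/* Hypothetical error classes for illustration; these are not the
 * SCSI sense keys or libata error masks used by the kernel. */
enum io_error { IO_ERR_TIMEOUT, IO_ERR_BUS, IO_ERR_MEDIA };

/* Retry count the thread attributes to sd. */
#define MAX_RETRIES 5

/* Decide whether a failed command is worth issuing again.
 * A media error means the drive already reported that it cannot
 * complete the command, so it is never retried; other errors are
 * retried until the retry budget is used up. */
static bool should_retry(enum io_error err, unsigned int retries_done)
{
	if (err == IO_ERR_MEDIA)
		return false;
	return retries_done < MAX_RETRIES;
}
```

With this shape, a media error fails on the first attempt, while a timeout or bus error still gets up to five retries.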
* Re: Analysis of EH on Andi's dying disk and stuff to discuss about
From: Mark Lord @ 2008-03-29 20:49 UTC
To: Ric Wheeler
Cc: Tejun Heo, Andi Kleen, Jeff Garzik, Alan Cox, IDE/ATA development list

Ric Wheeler wrote:
>
> One thought that is related to this is that we could really, really use
> a target mode S-ATA (or ATA) device. I am pretty sure that some of the
> Marvell parts support target mode. Their original (non-libata) driver
> had target mode support coded in as well if I remember correctly.
..

Yeah. It's on my TO-DO list, funded by Marvell.
But currently at the *bottom* of that TO-DO list. :)

Cheers
* Re: Analysis of EH on Andi's dying disk and stuff to discuss about
From: Mark Lord @ 2008-03-29 20:53 UTC
To: Tejun Heo
Cc: Andi Kleen, Jeff Garzik, Alan Cox, IDE/ATA development list, ric

Tejun Heo wrote:
..
> So, to handle the common cases better, libata EH times out resets
> quickly. The first two tries are 10 seconds each and most devices
> get reset properly before it hits the end of the second reset try
> even if it needs to spin up. What takes the longest is the third
..

I think that a 10 second timeout is just *slightly* too short.
There are drives here somewhere that always fail the first attempt
because they take about 12 seconds to spin up and begin communicating.

Cheers
* Re: Analysis of EH on Andi's dying disk and stuff to discuss about
From: Jeff Garzik @ 2008-03-29 21:12 UTC
To: Mark Lord
Cc: Tejun Heo, Andi Kleen, Alan Cox, IDE/ATA development list, ric

Mark Lord wrote:
> Tejun Heo wrote:
> ..
>
>> So, to handle the common cases better, libata EH times out resets
>> quickly. The first two tries are 10 seconds each and most devices
>> get reset properly before it hits the end of the second reset try
>> even if it needs to spin up. What takes the longest is the third
> ..
>
> I think that a 10 second timeout is just *slightly* too short.
> There are drives here somewhere that always fail the first attempt
> because they take about 12 seconds to spin up and begin communicating.

Also, ATAPI sometimes takes quite a while to respond, I've seen, when
media is in the drive.

Jeff
* Re: Analysis of EH on Andi's dying disk and stuff to discuss about
From: Tejun Heo @ 2008-03-29 23:35 UTC
To: Jeff Garzik
Cc: Mark Lord, Andi Kleen, Alan Cox, IDE/ATA development list, ric

Jeff Garzik wrote:
> Mark Lord wrote:
>> Tejun Heo wrote:
>> ..
>>
>>> So, to handle the common cases better, libata EH times out resets
>>> quickly. The first two tries are 10 seconds each and most devices
>>> get reset properly before it hits the end of the second reset try
>>> even if it needs to spin up. What takes the longest is the third
>> ..
>>
>> I think that a 10 second timeout is just *slightly* too short.
>> There are drives here somewhere that always fail the first attempt
>> because they take about 12 seconds to spin up and begin communicating.
>
> Also, ATAPI sometimes takes quite a while to respond, I've seen, when
> media is in the drive.

The goal there was to get, say, 90% of devices in the first reset, then
the rest of the sane ones in the second reset, and the idiots in the
third reset. As long as resets don't interfere with the device preparing
for readiness, as is the case for a hard drive spinning up, this works
just fine. If there are devices which have to restart prepping for
readiness on each reset, this can be a problem (those fall into the
idiot category).

I personally have never seen such a device, but if there's an ATAPI
device which doesn't respond to reset until it has spun up the media and
recognized it, it could be a problem. I have to say that would be a
pretty stupid way to implement reset. Jeff, do you remember which drive
it was?

--
tejun
* Re: Analysis of EH on Andi's dying disk and stuff to discuss about
From: Andi Kleen @ 2008-03-30 7:03 UTC
To: Jeff Garzik
Cc: Mark Lord, Tejun Heo, Andi Kleen, Alan Cox, IDE/ATA development list, ric

> Also, ATAPI sometimes takes quite a while to respond, I've seen, when
> media is in the drive.

Surely ATAPI could get other defaults than disks?

-Andi
* Re: Analysis of EH on Andi's dying disk and stuff to discuss about
From: Jeff Garzik @ 2008-03-30 7:33 UTC
To: Andi Kleen
Cc: Mark Lord, Tejun Heo, Alan Cox, IDE/ATA development list, ric

Andi Kleen wrote:
>> Also, ATAPI sometimes takes quite a while to respond, I've seen, when
>> media is in the drive.
>
> Surely ATAPI could get other defaults than disks?

Absolutely. I was just posting a reminder, since there have been
mistakes in the past where ATA and ATAPI were given the same defaults,
only to find out later that was a mistake for ATAPI (since ATA is more
often tested, usually).

Jeff
* Re: Analysis of EH on Andi's dying disk and stuff to discuss about
From: Tejun Heo @ 2008-03-30 11:03 UTC
To: Andi Kleen
Cc: Jeff Garzik, Mark Lord, Alan Cox, IDE/ATA development list, ric

Andi Kleen wrote:
>> Also, ATAPI sometimes takes quite a while to respond, I've seen, when
>> media is in the drive.
>
> Surely ATAPI could get other defaults than disks?

The driver doesn't know whether it's ATA or ATAPI during the probing
reset, but after detection, yeah, we can use 10s, 10s, 15s timing for
ATAs.

--
tejun
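
One way to picture the per-class timing suggested in the last message is a table lookup keyed on what is known about the device. This is a sketch only: the enum, struct and function names are hypothetical and do not correspond to libata's data structures. The 10/10/35 and 10/10/15 second values come from the thread; keeping ATAPI on the longer table is an assumption based on Jeff's remark that ATAPI can be slow to respond.

```c
#include <stddef.h>

/* Hypothetical device classes; libata's real class constants are not used. */
enum dev_class { DEV_UNKNOWN, DEV_ATA, DEV_ATAPI };

struct reset_table {
	const unsigned int *secs;	/* per-try timeouts in seconds */
	size_t n;			/* number of tries */
};

/* Used while the class is unknown (probing reset) and, by assumption, for ATAPI. */
static const unsigned int long_secs[] = { 10, 10, 35 };
/* Timing suggested in the thread for ATA devices after detection. */
static const unsigned int ata_secs[] = { 10, 10, 15 };

/* Pick the reset timeout table for a device class. */
static struct reset_table pick_reset_table(enum dev_class cls)
{
	struct reset_table t;

	if (cls == DEV_ATA) {
		t.secs = ata_secs;
		t.n = sizeof(ata_secs) / sizeof(ata_secs[0]);
	} else {
		t.secs = long_secs;
		t.n = sizeof(long_secs) / sizeof(long_secs[0]);
	}
	return t;
}

/* Total time spent if every reset try for this class times out. */
static unsigned int total_reset_secs(enum dev_class cls)
{
	struct reset_table t = pick_reset_table(cls);
	unsigned int total = 0;
	size_t i;

	for (i = 0; i < t.n; i++)
		total += t.secs[i];
	return total;
}
```

For an ATA device this gives 35 seconds of reset tries. Added to a 15 second command timeout, the upper end of the range proposed at the start of the thread, that is 50 seconds from command issue to giving up, which matches the "around 50 seconds" estimate in the first message.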