sata media errors not propagated? (was Re: sata

linux-ide.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* sata media errors not propagated? (was Re: sata_via issue)
@ 2004-08-17 11:17 Kevin Shanahan
  2004-08-17 12:38 ` Brad Campbell
  2004-08-17 16:40 ` Brad Campbell
  0 siblings, 2 replies; 4+ messages in thread
From: Kevin Shanahan @ 2004-08-17 11:17 UTC (permalink / raw)
  To: linux-ide

Hi,

This is reply to this message sent just before I subscribed:
  http://marc.theaimsgroup.com/?l=linux-ide&m=109273319729824&w=2

I'm pretty sure this is a general sata issue as I'm in the same
situation where I'm trying to recover some data from a dying sata disk
using dd_rescue. I've tried on both promise and via based controllers
and the behaviour is the same on both.

strace ends like this:

pread(4, "\355\201\0\0tv\0\0$\232\21A4\232\21Ab^\20A\0\0\0\0\0\0"..., 16384, 7389184) = 16384
pwrite(5, "\355\201\0\0tv\0\0$\232\21A4\232\21Ab^\20A\0\0\0\0\0\0"..., 16384, 7389184) = 16384
pread(4, "\377C\0\0\0\20\0\0D\224\35A\204|\36A\204|\36A\0\0\0\0\0"..., 16384, 7405568) = 16384
pwrite(5, "\377C\0\0\0\20\0\0D\224\35A\204|\36A\204|\36A\0\0\0\0\0"..., 16384, 7405568) = 16384
pread(4, "\240\201\0\0\307\1\0\0\\gS@\372A\25A\323\5\\@\0\0\0\0\4"..., 16384, 7421952) = 16384
pwrite(5, "\240\201\0\0\307\1\0\0\\gS@\372A\25A\323\5\\@\0\0\0\0\4"..., 16384, 7421952) = 16384
pread(4, 

dmesg shows several (10) errors like this:

Buffer I/O error on device sdb2, logical block 7281
ata2: command 0x25 timeout, stat 0x59 host_stat 0x1
scsi1: ERROR on channel 0, id 0, lun 0, CDB: 0x28 00 00 03 29 f0 00 00 ec 00 
Current sdb: sense = 70  3
ASC=11 ASCQ= 4
Raw sense data:0x70 0x00 0x03 0x00 0x00 0x00 0x00 0x06 0x00 0x00 0x00 0x00 0x11 0x04 
end_request: I/O error, dev sdb, sector 207344

Then always ends with the following:

ATA: abnormal status 0x59 on port 0xB807
ATA: abnormal status 0x59 on port 0xB807
ATA: abnormal status 0x59 on port 0xB807

ps shows the process is in 'D' state and is unkillable:

milamber:~# ps x | grep dd_rescue
 1773 pts/1    S+     0:00 strace -o ddr-strace.txt dd_rescue -l ddr-errors.txt -A /dev/sdb2 /sda1
 1774 pts/1    D+     0:00 dd_rescue -l ddr-errors.txt -A /dev/sdb2 /sda1
 3101 pts/2    S+     0:00 grep dd_rescue

Perhaps this is already a known limitation of the SATA drivers (still
developing features I guess), but if I can help get this working I'm
willing to test patches or provide more info if needed. In terms of just
getting my data back, I guess another option is to buy a PATA->SATA
adapter and see if the "ide" drivers fare any better.

Thanks,
Kevin.



^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: sata media errors not propagated? (was Re: sata_via issue)
  2004-08-17 11:17 sata media errors not propagated? (was Re: sata_via issue) Kevin Shanahan
@ 2004-08-17 12:38 ` Brad Campbell
  2004-08-17 16:40 ` Brad Campbell
  1 sibling, 0 replies; 4+ messages in thread
From: Brad Campbell @ 2004-08-17 12:38 UTC (permalink / raw)
  To: linux-ide

Kevin Shanahan wrote:

> Perhaps this is already a known limitation of the SATA drivers (still
> developing features I guess), but if I can help get this working I'm
> willing to test patches or provide more info if needed. In terms of just
> getting my data back, I guess another option is to buy a PATA->SATA
> adapter and see if the "ide" drivers fare any better.

Well I'm in the opposite situation. I'm using an ATA drive via an SATA->PATA adaptor and the "ide" 
drivers do just fine. They just flag the error and keep on keeping on.
In fact, a HPT 1540 running with the kernel Highpoint drivers will behave in the same fashion.

I'm also willing to test or perhaps donate one of my toasted drives for testing.

Regards,
Brad

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: sata media errors not propagated? (was Re: sata_via issue)
  2004-08-17 11:17 sata media errors not propagated? (was Re: sata_via issue) Kevin Shanahan
  2004-08-17 12:38 ` Brad Campbell
@ 2004-08-17 16:40 ` Brad Campbell
  2004-08-17 18:57   ` Brad Campbell
  1 sibling, 1 reply; 4+ messages in thread
From: Brad Campbell @ 2004-08-17 16:40 UTC (permalink / raw)
  To: linux-ide

Kevin Shanahan wrote:

I have done about 15 re-compiles now with increasing numbers of printk's added and have got this far.

scsi -> libata to ata_queued_cmd

> 
> Buffer I/O error on device sdb2, logical block 7281
> ata2: command 0x25 timeout, stat 0x59 host_stat 0x1
> scsi1: ERROR on channel 0, id 0, lun 0, CDB: 0x28 00 00 03 29 f0 00 00 ec 00 
> Current sdb: sense = 70  3
> ASC=11 ASCQ= 4
> Raw sense data:0x70 0x00 0x03 0x00 0x00 0x00 0x00 0x06 0x00 0x00 0x00 0x00 0x11 0x04 
> end_request: I/O error, dev sdb, sector 207344

Here the libata layer passes the error back to scsi and everyone is happy. (uses ata_scsi_qc_complete)

> 
> Then always ends with the following:

About here, scsi passes a command down to libata, ata_queued_cmd which causes the next 3 messages.
libata never errors out here and completes the request so it all falls in a heap.
I'm going out to dinner, I'll see if I can trace it further later tonight.

> ATA: abnormal status 0x59 on port 0xB807
> ATA: abnormal status 0x59 on port 0xB807
> ATA: abnormal status 0x59 on port 0xB807

Not much help, but closing in.

Regards,
Brad

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: sata media errors not propagated? (was Re: sata_via issue)
  2004-08-17 16:40 ` Brad Campbell
@ 2004-08-17 18:57   ` Brad Campbell
  0 siblings, 0 replies; 4+ messages in thread
From: Brad Campbell @ 2004-08-17 18:57 UTC (permalink / raw)
  To: linux-ide

Brad Campbell wrote:
> About here, scsi passes a command down to libata, ata_queued_cmd which 
> causes the next 3 messages.
> libata never errors out here and completes the request so it all falls 
> in a heap.
> I'm going out to dinner, I'll see if I can trace it further later tonight.
> 
>> ATA: abnormal status 0x59 on port 0xB807
>> ATA: abnormal status 0x59 on port 0xB807
>> ATA: abnormal status 0x59 on port 0xB807

OOOooooo it's timing sensitive.
I sprinkled some printk's around and now scsi_request_fn is stuck in an infinite loop as this 
evaluates as true. In case you are wondering where all the 12345 and the like come from, I'm 
including my scsi_request_fn with it's added printk's

                 if (!scsi_host_queue_ready(q, shost, sdev))
                         goto not_ready;

Aug 17 22:32:54 srv kernel: 123456789ABata_queued_cmd
Aug 17 22:32:54 srv kernel: ata_qc_new_init
Aug 17 22:32:54 srv kernel: ata_qc_new
Aug 17 22:32:54 srv kernel: ata_qc_new out
Aug 17 22:32:54 srv kernel: ata_qc_new_init out
Aug 17 22:32:54 srv kernel:  ata_qc_issue
Aug 17 22:32:54 srv kernel: ata_qc_issue_out
Aug 17 22:32:54 srv kernel: ata_qc_issue_prot
Aug 17 22:32:54 srv kernel: ata_qc_issue_prot out
Aug 17 22:32:54 srv kernel: CD3EHI
Aug 17 22:33:24 srv kernel: ata13: command 0x25 timeout, stat 0xd0 host_stat 0x1
Aug 17 22:33:24 srv kernel: ata_qc_complete
Aug 17 22:33:24 srv kernel: ata_scsi_qc_complete
Aug 17 22:33:24 srv kernel: scsi12: ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 00 00 00 00 
00 00 80 00
Aug 17 22:33:24 srv kernel: Current sda: sense key Medium Error
Aug 17 22:33:24 srv kernel: Additional sense: Unrecovered read error - auto reallocate failed
Aug 17 22:33:24 srv kernel: end_request: I/O error, dev sda, sector 0
Aug 17 22:33:24 srv kernel: Buffer I/O error on device sda, logical block 0
Aug 17 22:33:24 srv kernel: 123456FGHI
Aug 17 22:33:24 srv kernel: 123456FGHI
Aug 17 22:33:24 srv kernel: ata_qc_complete_out
Aug 17 22:33:24 srv kernel: 123456FGHI
Aug 17 22:33:24 srv kernel: 123456FGHI
This last line loops forever and prevent me from doing anything at all on the console.
Notice no abnormal status messages anymore.

Progress! Without the extra printk's in libata-core and libata-scsi this was how it went

Aug 17 22:01:10 srv kernel: 123456789ABata_queued_cmd
Aug 17 22:01:10 srv kernel: CD3EHI
Aug 17 22:01:40 srv kernel: ata13: command 0x25 timeout, stat 0xd0 host_stat 0x1
Aug 17 22:01:40 srv kernel: ata_scsi_qc_complete
Aug 17 22:01:40 srv kernel: scsi12: ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 00 00 00 80 
00 00 80 00
Aug 17 22:01:40 srv kernel: Current sda: sense key Medium Error
Aug 17 22:01:41 srv kernel: Additional sense: Unrecovered read error - auto reallocate failed
Aug 17 22:01:41 srv kernel: end_request: I/O error, dev sda, sector 128
Aug 17 22:01:41 srv kernel: Buffer I/O error on device sda, logical block 128
Aug 17 22:01:41 srv kernel: 123456FGHI
Aug 17 22:01:41 srv kernel: 123456FGHI
Aug 17 22:01:41 srv kernel: 123456789ABata_queued_cmd
Aug 17 22:01:41 srv kernel: ATA: abnormal status 0xD0 on port 0x8807
Aug 17 22:01:41 srv last message repeated 2 times
Aug 17 22:01:41 srv kernel: CD3EHI
This just sits here idle with dd in the D state.

dd's call trace for the above log.

Aug 17 20:00:52 srv kernel: dd            D DFCA4C88     0  1681   1585                     (NOTLB)
Aug 17 20:00:52 srv kernel: dce31dfc 00000086 dce31e34 dfca4c88 c01048b4 78601dca 00000034 dfaa1790
Aug 17 20:00:52 srv kernel:        000d6653 78603c1f 00000034 0000031d 7860af72 00000034 dcccea68 
c130cd60
Aug 17 20:00:52 srv kernel:        c1400588 dce31e34 dce31e04 c02fd98e dfca4c88 c01310f5 c130cd60 
00000000
Aug 17 20:00:52 srv kernel: Call Trace:
Aug 17 20:00:52 srv kernel:  [<c01048b4>] common_interrupt+0x18/0x20
Aug 17 20:00:52 srv kernel:  [<c02fd98e>] io_schedule+0xe/0x20
Aug 17 20:00:52 srv kernel:  [<c01310f5>] __lock_page+0xc5/0xf0
Aug 17 20:00:52 srv kernel:  [<c0130e40>] page_wake_function+0x0/0x50
Aug 17 20:00:52 srv kernel:  [<c0130e40>] page_wake_function+0x0/0x50
Aug 17 20:00:52 srv kernel:  [<c01318da>] do_generic_mapping_read+0x3ca/0x480
Aug 17 20:00:52 srv kernel:  [<c0131c47>] __generic_file_aio_read+0x1d7/0x210
Aug 17 20:00:52 srv kernel:  [<c0131990>] file_read_actor+0x0/0xe0
Aug 17 20:00:52 srv kernel:  [<c0131d5f>] generic_file_read+0x6f/0x90
Aug 17 20:00:52 srv kernel:  [<c01154eb>] scheduler_tick+0x1b/0x400
Aug 17 20:00:52 srv kernel:  [<c012060f>] do_timer+0xcf/0xe0
Aug 17 20:00:52 srv kernel:  [<c014c5b0>] vfs_read+0xb0/0x100
Aug 17 20:00:52 srv kernel:  [<c014c7e7>] sys_read+0x47/0x80
Aug 17 20:00:52 srv kernel:  [<c0103f47>] syscall_call+0x7/0xb



static void scsi_request_fn(struct request_queue *q)
{
         struct scsi_device *sdev = q->queuedata;
         struct Scsi_Host *shost = sdev->host;
         struct scsi_cmnd *cmd;
         struct request *req;

         printk("1");
         if(!get_device(&sdev->sdev_gendev))
                 /* We must be tearing the block queue down already */
                 return;

         /*
          * To start with, we keep looping until the queue is empty, or until
          * the host is no longer able to accept any more requests.
          */
         printk("2");
         while (!blk_queue_plugged(q)) {
                 int rtn;
                 /*
                  * get next queueable request.  We do this early to make sure
                  * that the request is fully prepared even if we cannot
                  * accept it.
                  */
                 printk("3");
                 req = elv_next_request(q);
                 if (!req || !scsi_dev_queue_ready(q, sdev))
                         break;
                 printk("4");
                 if (unlikely(!scsi_device_online(sdev))) {
                         printk(KERN_ERR "scsi%d (%d:%d): rejecting I/O to offline device\n",
                                sdev->host->host_no, sdev->id, sdev->lun);
                         blkdev_dequeue_request(req);
                         req->flags |= REQ_QUIET;
                         while (end_that_request_first(req, 0, req->nr_sectors))
                                 ;
                         end_that_request_last(req);
                         continue;
                 }


                 /*
                  * Remove the request from the request list.
                  */
                 if (!(blk_queue_tagged(q) && !blk_queue_start_tag(q, req)))
                         blkdev_dequeue_request(req);
                 sdev->device_busy++;
                 printk("5");
                 spin_unlock(q->queue_lock);
                 spin_lock(shost->host_lock);
                 printk("6");
                 if (!scsi_host_queue_ready(q, shost, sdev))
                         goto not_ready;
                 printk("7");
                 if (sdev->single_lun) {
                         if (sdev->sdev_target->starget_sdev_user &&
                             sdev->sdev_target->starget_sdev_user != sdev)
                                 goto not_ready;
                         sdev->sdev_target->starget_sdev_user = sdev;
                 }
                 shost->host_busy++;
                 printk("8");

                 /*
                  * XXX(hch): This is rather suboptimal, scsi_dispatch_cmd will
                  *              take the lock again.
                  */
                 spin_unlock_irq(shost->host_lock);
                 printk("9");

                 cmd = req->special;
                 if (unlikely(cmd == NULL)) {
                         printk(KERN_CRIT "impossible request in %s.\n"
                                          "please mail a stack trace to "
                                          "linux-scsi@vger.kernel.org",
                                          __FUNCTION__);
                         BUG();
                 }
                 printk("A");

                 /*
                  * Finally, initialize any error handling parameters, and set up
                  * the timers for timeouts.
                  */
                 scsi_init_cmd_errh(cmd);
                 printk("B");

                 /*
                  * Dispatch the command to the low-level driver.
                  */
                 rtn = scsi_dispatch_cmd(cmd);
                 spin_lock_irq(q->queue_lock);
                 printk("C");
                 if(rtn) {
                         /* we're refusing the command; because of
                          * the way locks get dropped, we need to
                          * check here if plugging is required */
                         if(sdev->device_busy == 0)
                                 blk_plug_device(q);

                         break;
                 }
                 printk("D");
         }

                 printk("E");
         goto out;

  not_ready:
         spin_unlock_irq(shost->host_lock);
                 printk("F");

         /*
          * lock q, handle tag, requeue req, and decrement device_busy. We
          * must return with queue_lock held.
          *
          * Decrementing device_busy without checking it is OK, as all such
          * cases (host limits or settings) should run the queue at some
          * later time.
          */
         spin_lock_irq(q->queue_lock);
         blk_requeue_request(q, req);
         sdev->device_busy--;
         if(sdev->device_busy == 0)
                 blk_plug_device(q);
                 printk("G");
  out:
         /* must be careful here...if we trigger the ->remove() function
          * we cannot be holding the q lock */
                 printk("H");
         spin_unlock_irq(q->queue_lock);
         put_device(&sdev->sdev_gendev);
         spin_lock_irq(q->queue_lock);
                 printk("I\n");
}


Does anyone get the idea I'm stumbling around in the dark ? ;p)

Regards,
Brad

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2004-08-17 18:56 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-08-17 11:17 sata media errors not propagated? (was Re: sata_via issue) Kevin Shanahan
2004-08-17 12:38 ` Brad Campbell
2004-08-17 16:40 ` Brad Campbell
2004-08-17 18:57   ` Brad Campbell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).