* regarding bug #5914 - fs corruption on SATA
From: Tejun Heo @ 2006-01-26 5:50 UTC
To: Nicolas.Mailhot; +Cc: Jeff Garzik, Jens Axboe, Linux-ide
Hello, Nicolas. Hello, all.
Nicolas, I'm probably the guy who broke your filesystem. :-p This FUA
(Forced Unit Access) thing made it into mainline lately, and it seems
that your drive reports FUA support but doesn't really handle it
properly when asked to.
Can you try the following to verify the problem?
1. make a small partition on the affected drive and do mkfs.ext3 on it.
2. mount -o barrier new_partition /mnt/tmp
3. cd /mnt/tmp; touch asdf; sync
This should give something like the following.
======
ata2: port reset, p_is 40000001 is 2 pis 0 cmd 44017 tf 451 ss 123 se 0
ata2: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata2: status=0x51 { DriveReady SeekComplete Error }
ata2: error=0x04 { DriveStatusError }
ata2: port reset, p_is 40000001 is 2 pis 0 cmd 44017 tf 451 ss 123 se 0
ata2: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata2: status=0x51 { DriveReady SeekComplete Error }
ata2: error=0x04 { DriveStatusError }
ata2: port reset, p_is 40000001 is 2 pis 0 cmd 44017 tf 451 ss 123 se 0
ata2: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata2: status=0x51 { DriveReady SeekComplete Error }
ata2: error=0x04 { DriveStatusError }
ata2: port reset, p_is 40000001 is 2 pis 0 cmd 44017 tf 451 ss 123 se 0
ata2: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata2: status=0x51 { DriveReady SeekComplete Error }
ata2: error=0x04 { DriveStatusError }
ata2: port reset, p_is 40000001 is 2 pis 0 cmd 44017 tf 451 ss 123 se 0
ata2: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata2: status=0x51 { DriveReady SeekComplete Error }
ata2: error=0x04 { DriveStatusError }
sd 2:0:0:0: SCSI error: return code = 0x8000002
sdc: Current: sense key: Aborted Command
Additional sense: No additional sense information
end_request: I/O error, dev sdc, sector 4359
Buffer I/O error on device sdc1, logical block 537
lost page write due to I/O error on sdc1
Aborting journal on device sdc1.
journal commit I/O error
======
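The status/error pair in the log can be read bit by bit. Status 0x51 is DRDY | DSC | ERR. Error 0x04 is the ABRT bit, meaning the drive aborted the command. libata translates that into SCSI sense key 0xb, which sd then reports as "Aborted Command". Below is a minimal C decoder sketch that assumes the standard ATA Status/Error register layout. Only the names DriveReady, SeekComplete, Error and DriveStatusError come from the log; the other bit names are assumptions, and the function names are hypothetical:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* ATA Status register bits, most significant first (BSY handled separately). */
static const struct {
	unsigned char bit;
	const char *name;
} ata_status_bits[] = {
	{ 0x40, "DriveReady" },     /* DRDY (name as printed in the log) */
	{ 0x20, "DeviceFault" },    /* DF (assumed name) */
	{ 0x10, "SeekComplete" },   /* DSC (name as printed in the log) */
	{ 0x08, "DataRequest" },    /* DRQ (assumed name) */
	{ 0x04, "CorrectedError" }, /* CORR (assumed name) */
	{ 0x02, "Index" },          /* IDX (assumed name) */
	{ 0x01, "Error" },          /* ERR (name as printed in the log) */
};

/* Write the names of the bits set in stat into buf, space separated.
 * While BSY (0x80) is set the other bits are not valid, so only "Busy"
 * is reported in that case. */
static void decode_ata_status(unsigned char stat, char *buf, size_t len)
{
	size_t i;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';
	if (stat & 0x80) {
		snprintf(buf, len, "Busy");
		return;
	}
	for (i = 0; i < sizeof(ata_status_bits) / sizeof(ata_status_bits[0]); i++) {
		if (!(stat & ata_status_bits[i].bit))
			continue;
		if (buf[0] != '\0')
			strncat(buf, " ", len - strlen(buf) - 1);
		strncat(buf, ata_status_bits[i].name, len - strlen(buf) - 1);
	}
}

/* Error register bit 2 (0x04) is ABRT: the device aborted the command.
 * The log above prints this bit as "DriveStatusError". */
static int ata_error_is_abort(unsigned char err)
{
	return (err & 0x04) != 0;
}
```

For the values in the log, `decode_ata_status(0x51, ...)` yields "DriveReady SeekComplete Error" and `ata_error_is_abort(0x04)` is true.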
From this point on, ext3 will back off and stop using barriers.
If this is what you see, please apply the patch at the end of this
mail. It makes libata issue non-FUA commands even when FUA commands
are requested. After recompiling, repeat the steps above, create some files,
unmount, mount, verify the contents, unmount and fsck. All of this should succeed
without any complaint from the kernel.
If my guess turns out to be true, we'll need a blacklist for those
lying drives. Damn it.
diff --git a/drivers/scsi/libata-core.c b/drivers/scsi/libata-core.c
index 46c4cdb..6ba6ad2 100644
--- a/drivers/scsi/libata-core.c
+++ b/drivers/scsi/libata-core.c
@@ -565,7 +565,7 @@ static const u8 ata_rw_cmds[] = {
0,
0,
0,
- ATA_CMD_WRITE_MULTI_FUA_EXT,
+ ATA_CMD_WRITE_MULTI_EXT,
/* pio */
ATA_CMD_PIO_READ,
ATA_CMD_PIO_WRITE,
@@ -583,7 +583,7 @@ static const u8 ata_rw_cmds[] = {
0,
0,
0,
- ATA_CMD_WRITE_FUA_EXT
+ ATA_CMD_WRITE_EXT
};
/**
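The two hunks above each change one slot in a FUA-capable group of `ata_rw_cmds`. Below is a minimal C sketch of a table laid out the way the hunks suggest: three 8-entry groups, with the FUA write in the last slot of each group. It assumes the slot is chosen as group base + 4*fua + 2*lba48 + write. The identifiers are hypothetical stand-ins, not libata's, and the opcodes are the standard ATA values:

```c
#include <assert.h>
#include <stdint.h>

/* Standard ATA opcodes (hypothetical identifiers). */
enum {
	CMD_READ_MULTI = 0xC4, CMD_WRITE_MULTI = 0xC5,
	CMD_READ_MULTI_EXT = 0x29, CMD_WRITE_MULTI_EXT = 0x39,
	CMD_WRITE_MULTI_FUA_EXT = 0xCE,
	CMD_PIO_READ = 0x20, CMD_PIO_WRITE = 0x30,
	CMD_PIO_READ_EXT = 0x24, CMD_PIO_WRITE_EXT = 0x34,
	CMD_READ_DMA = 0xC8, CMD_WRITE_DMA = 0xCA,
	CMD_READ_DMA_EXT = 0x25, CMD_WRITE_DMA_EXT = 0x35,
	CMD_WRITE_DMA_FUA_EXT = 0x3D,
};

/* Three 8-entry groups; 0 means "no such command". */
static const uint8_t rw_cmds[24] = {
	/* pio multi */
	CMD_READ_MULTI, CMD_WRITE_MULTI, CMD_READ_MULTI_EXT, CMD_WRITE_MULTI_EXT,
	0, 0, 0, CMD_WRITE_MULTI_FUA_EXT,
	/* pio */
	CMD_PIO_READ, CMD_PIO_WRITE, CMD_PIO_READ_EXT, CMD_PIO_WRITE_EXT,
	0, 0, 0, 0,
	/* dma */
	CMD_READ_DMA, CMD_WRITE_DMA, CMD_READ_DMA_EXT, CMD_WRITE_DMA_EXT,
	0, 0, 0, CMD_WRITE_DMA_FUA_EXT,
};

/* Base index of each transfer-mode group. */
enum rw_group { GROUP_PIO_MULTI = 0, GROUP_PIO = 8, GROUP_DMA = 16 };

/* Slot for a request: group base + 4*fua + 2*lba48 + write. */
static unsigned rw_cmd_index(enum rw_group group, int fua, int lba48, int write)
{
	unsigned index = (unsigned)group + (fua ? 4u : 0u) +
			 (lba48 ? 2u : 0u) + (write ? 1u : 0u);

	assert(index < sizeof(rw_cmds));
	return index;
}

/* Opcode for a request, or 0 if the combination has no command. */
static uint8_t rw_cmd(enum rw_group group, int fua, int lba48, int write)
{
	return rw_cmds[rw_cmd_index(group, fua, lba48, write)];
}
```

In this model, the patch stores 0x39 (WRITE MULTIPLE EXT) in slot 7 and 0x35 (WRITE DMA EXT) in slot 23. A FUA request then goes out as an ordinary LBA48 write, so the drive never sees a FUA opcode.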
* Re: regarding bug #5914 - fs corruption on SATA
From: Tejun Heo @ 2006-01-26 5:51 UTC
To: Nicolas.Mailhot; +Cc: Jeff Garzik, Jens Axboe, Linux-ide
Tejun Heo wrote:
> [...]
> Can you try the followings to verify the problem?
>
> 1. make a small partition on the affected drive and do mkfs.ext3 on it.
> 2. mount -o barrier new_partition /mnt/tmp
This should be 'mount -o barrier=1 new_partition /mnt/tmp'
> 3. cd /mnt/tmp; touch asdf; sync
--
tejun
* Re: regarding bug #5914 - fs corruption on SATA
From: Nicolas Mailhot @ 2006-01-26 9:14 UTC
To: Tejun Heo; +Cc: Jeff Garzik, Jens Axboe, Linux-ide
On Thu 26 January 2006 06:51, Tejun Heo wrote:
> Tejun Heo wrote:
>> Hello, Nicolas. Hello, all.
Hi
>> [...]
>> 2. mount -o barrier new_partition /mnt/tmp
>
> This should be 'mount -o barrier=1 new_partition /mnt/tmp'
>
>> 3. cd /mnt/tmp; touch asdf; sync
Which parts can be done on a pre-breakage kernel and which on a
problem kernel? (I ask because a problem kernel will corrupt basically
any file it writes to; even in single-user mode the damage is significant,
so I need to keep the corruption window to a minimum.)
Also, I have plenty of space to create partitions, but it will be
LVM-on-MD-RAID1 space (I don't know if that matters; if it does, I need to
learn to shrink the LVM/MD volumes).
Regards,
BTW, what's FUA in semi-layman's terms?
--
Nicolas Mailhot
* Re: regarding bug #5914 - fs corruption on SATA
From: Jens Axboe @ 2006-01-26 9:21 UTC
To: Nicolas Mailhot; +Cc: Tejun Heo, Jeff Garzik, Linux-ide
On Thu, Jan 26 2006, Nicolas Mailhot wrote:
> On Thu 26 January 2006 06:51, Tejun Heo wrote:
> [...]
> Which parts can be done on a pre-breakage kernel and which on a
> problem kernel? (I ask because a problem kernel will corrupt basically
> any file it writes to; even in single-user mode the damage is significant,
> so I need to keep the corruption window to a minimum.)
You need a new kernel (after the barrier rework), so 2.6.16-rc1 for
instance.
> Also, I have plenty of space to create partitions, but it will be
> LVM-on-MD-RAID1 space (I don't know if that matters; if it does, I need to
> learn to shrink the LVM/MD volumes).
It would be best to exclude LVM/MD for now, but I can see it might not be
so easy for you...
> BTW, what's FUA in semi-layman's terms?
It stands for Forced Unit Access. It is basically a way to force the drive to
write through its cache directly to the platter, even when write-back
caching is enabled. It can also be used to bypass the cache on a read,
but we use it for writes with the barrier stuff.
--
Jens Axboe
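"We use it for writes with the barrier stuff" can be made concrete with a simplified C model. It assumes a barrier write is sequenced as cache flush, barrier write, cache flush, and that the trailing flush is replaced by a FUA write when the drive advertises FUA. The names are hypothetical; this sketches the idea, not the block layer's actual code. In the FUA variant there is no second flush, so the barrier block's durability rests entirely on the drive honouring FUA:

```c
#include <assert.h>
#include <stddef.h>

/* Commands a simplified barrier write can expand into (hypothetical names). */
enum step { STEP_FLUSH_CACHE, STEP_WRITE, STEP_WRITE_FUA };

/* Fill seq with the commands for one barrier write and return the count.
 * The leading flush makes earlier writes durable before the barrier block.
 * Without FUA the barrier block lands in the write-back cache, so a second
 * flush is needed; with FUA the write itself goes through to the media. */
static size_t barrier_sequence(int drive_has_fua, enum step seq[3])
{
	size_t n = 0;

	seq[n++] = STEP_FLUSH_CACHE;
	if (drive_has_fua) {
		seq[n++] = STEP_WRITE_FUA;
	} else {
		seq[n++] = STEP_WRITE;
		seq[n++] = STEP_FLUSH_CACHE;
	}
	assert(n <= 3);
	return n;
}
```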
* Re: regarding bug #5914 - fs corruption on SATA
From: Nicolas Mailhot @ 2006-01-26 10:01 UTC
To: Jens Axboe; +Cc: Tejun Heo, Jeff Garzik, Linux-ide
On Thu 26 January 2006 10:21, Jens Axboe wrote:
> On Thu, Jan 26 2006, Nicolas Mailhot wrote:
>> Which parts can be done on a pre-breakage kernel and which on a
>> problem kernel? (I ask because a problem kernel will corrupt basically
>> any file it writes to; even in single-user mode the damage is significant,
>> so I need to keep the corruption window to a minimum.)
>
> You need a new kernel (after the barrier rework), so 2.6.16-rc1 for
> instance.
OK, I'll do the test this evening (CET) with the rawhide/davej kernel of
the day.
Regards,
--
Nicolas Mailhot
* Re: regarding bug #5914 - fs corruption on SATA
From: Nicolas Mailhot @ 2006-01-26 21:04 UTC
To: Jens Axboe; +Cc: Tejun Heo, Jeff Garzik, Linux-ide
On Thursday 26 January 2006 at 11:01 +0100, Nicolas Mailhot wrote:
> On Thu 26 January 2006 10:21, Jens Axboe wrote:
> > On Thu, Jan 26 2006, Nicolas Mailhot wrote:
> > [...]
> > You need a new kernel (after the barrier rework), so 2.6.16-rc1 for
> > instance.
>
> OK, I'll do the test this evening (CET) with the rawhide/davej kernel of
> the day.
I applied the FUA backout patch and the kernel booted beautifully.
Now I guess I need to see whether Maxtor has released a fixed firmware, right?
(Is it possible to change the firmware on a running system?)
Regards,
--
Nicolas Mailhot
* Re: regarding bug #5914 - fs corruption on SATA
From: Jens Axboe @ 2006-01-27 8:13 UTC
To: Nicolas Mailhot; +Cc: Tejun Heo, Jeff Garzik, Linux-ide
On Thu, Jan 26 2006, Nicolas Mailhot wrote:
> On Thursday 26 January 2006 at 11:01 +0100, Nicolas Mailhot wrote:
> > [...]
> I applied the FUA backout patch and the kernel booted beautifully.
> Now I guess I need to see whether Maxtor has released a fixed firmware, right?
> (Is it possible to change the firmware on a running system?)
If you can get updated firmware, it is usually installed by booting from a
DOS floppy and running a special flash utility from there. Can you send
me the hdparm -I /dev/sdX output of the problem drive? I think we should
just blacklist it for FUA. This bug is so obscure that I think blacklisting is a
better solution than adding a module parameter to disable FUA at this point.
--
Jens Axboe
* Re: regarding bug #5914 - fs corruption on SATA
From: Nicolas Mailhot @ 2006-01-27 8:53 UTC
To: Jens Axboe; +Cc: Tejun Heo, Jeff Garzik, Linux-ide
On Fri 27 January 2006 09:13, Jens Axboe wrote:
> On Thu, Jan 26 2006, Nicolas Mailhot wrote:
>> [...]
>
> If you can get updated firmware, it is usually installed by booting from a
> DOS floppy and running a special flash utility from there. Can you send
> me the hdparm -I /dev/sdX output of the problem drive? I think we should
> just blacklist it for FUA. [...]
There is already fairly complete SMART info available in
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=177951
(https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=123604
https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=123605)
I'll add the hdparm info this evening if that's not sufficient.
Regards,
--
Nicolas Mailhot
* Re: regarding bug #5914 - fs corruption on SATA
From: Jens Axboe @ 2006-01-27 9:10 UTC
To: Nicolas Mailhot; +Cc: Tejun Heo, Jeff Garzik, Linux-ide
On Fri, Jan 27 2006, Nicolas Mailhot wrote:
> [...]
> There is already fairly complete SMART info available in
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=177951
> (https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=123604
> https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=123605)
I didn't notice the SMART info; yes, that holds enough information.
Thanks!
--
Jens Axboe
* Re: regarding bug #5914 - fs corruption on SATA
From: Jens Axboe @ 2006-01-27 9:20 UTC
To: Nicolas Mailhot; +Cc: Tejun Heo, Jeff Garzik, Linux-ide
On Fri, Jan 27 2006, Jens Axboe wrote:
> On Fri, Jan 27 2006, Nicolas Mailhot wrote:
> > [...]
> > There is already fairly complete SMART info available in
> > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=177951
> > [...]
>
> I didn't notice the SMART info; yes, that holds enough information.
> Thanks!
Can you try booting a kernel with this patch applied (it needs to be one
of the newer ones, of course) and see if you still see the "w/ FUA"
string next to your Maxtor drive(s)?
diff --git a/drivers/scsi/libata-scsi.c b/drivers/scsi/libata-scsi.c
index cfbceb5..3feda07 100644
--- a/drivers/scsi/libata-scsi.c
+++ b/drivers/scsi/libata-scsi.c
@@ -1700,6 +1700,28 @@ static unsigned int ata_msense_rw_recove
return sizeof(def_rw_recovery_mpage);
}

+/*
+ * We can turn this into a real blacklist if it's needed, for now just
+ * blacklist any Maxtor BANC1G10 revision firmware
+ */
+static int ata_dev_supports_fua(u16 *id)
+{
+ unsigned char model[41], fw[9];
+
+ if (!ata_id_has_fua(id))
+ return 0;
+
+ ata_dev_id_string(id, model, ATA_ID_PROD_OFS, sizeof(model));
+ ata_dev_id_string(id, fw, ATA_ID_FW_REV_OFS, sizeof(fw));
+
+ if (strncmp(model, "Maxtor", 6))
+ return 1;
+ if (strncmp(model, "BANC1G10", 8))
+ return 1;
+
+ return 0; /* blacklisted */
+}
+
/**
* ata_scsiop_mode_sense - Simulate MODE SENSE 6, 10 commands
* @args: device IDENTIFY data / SCSI command of interest.
@@ -1797,7 +1819,7 @@ unsigned int ata_scsiop_mode_sense(struc
return 0;

dpofua = 0;
- if (ata_id_has_fua(args->id) && dev->flags & ATA_DFLAG_LBA48 &&
+ if (ata_dev_supports_fua(args->id) && dev->flags & ATA_DFLAG_LBA48 &&
(!(dev->flags & ATA_DFLAG_PIO) || dev->multi_count))
dpofua = 1 << 4;
--
Jens Axboe
* Re: regarding bug #5914 - fs corruption on SATA
From: Nicolas Mailhot @ 2006-01-27 9:27 UTC
To: Jens Axboe; +Cc: Tejun Heo, Jeff Garzik, Linux-ide
On Fri 27 January 2006 10:20, Jens Axboe wrote:
> [...]
> Can you try booting a kernel with this patch applied (it needs to be one
> of the newer ones, of course) and see if you still see the "w/ FUA"
> string next to your Maxtor drive(s)?
>
> [...]
Will do this evening (CET).
--
Nicolas Mailhot
* Re: regarding bug #5914 - fs corruption on SATA
From: Bartlomiej Zolnierkiewicz @ 2006-01-27 9:46 UTC
To: Jens Axboe; +Cc: Nicolas Mailhot, Tejun Heo, Jeff Garzik, Linux-ide
On 1/27/06, Jens Axboe <axboe@suse.de> wrote:
> [...]
> +static int ata_dev_supports_fua(u16 *id)
> +{
> + unsigned char model[41], fw[9];
> +
> + if (!ata_id_has_fua(id))
> + return 0;
> +
> + ata_dev_id_string(id, model, ATA_ID_PROD_OFS, sizeof(model));
> + ata_dev_id_string(id, fw, ATA_ID_FW_REV_OFS, sizeof(fw));
> +
> + if (strncmp(model, "Maxtor", 6))
> + return 1;
> + if (strncmp(model, "BANC1G10", 8))
> + return 1;
s/model/fw/ ?
Bartlomiej
* Re: regarding bug #5914 - fs corruption on SATA
From: Jens Axboe @ 2006-01-27 9:50 UTC
To: Bartlomiej Zolnierkiewicz; +Cc: Nicolas Mailhot, Tejun Heo, Jeff Garzik, Linux-ide
On Fri, Jan 27 2006, Bartlomiej Zolnierkiewicz wrote:
> On 1/27/06, Jens Axboe <axboe@suse.de> wrote:
> > [...]
> > + if (strncmp(model, "Maxtor", 6))
> > + return 1;
> > + if (strncmp(model, "BANC1G10", 8))
> > + return 1;
>
> s/model/fw/
Of course, silly typo! Thanks for catching that. Updated patch below.
diff --git a/drivers/scsi/libata-scsi.c b/drivers/scsi/libata-scsi.c
index cfbceb5..3feda07 100644
--- a/drivers/scsi/libata-scsi.c
+++ b/drivers/scsi/libata-scsi.c
@@ -1700,6 +1700,28 @@ static unsigned int ata_msense_rw_recove
return sizeof(def_rw_recovery_mpage);
}

+/*
+ * We can turn this into a real blacklist if it's needed, for now just
+ * blacklist any Maxtor BANC1G10 revision firmware
+ */
+static int ata_dev_supports_fua(u16 *id)
+{
+ unsigned char model[41], fw[9];
+
+ if (!ata_id_has_fua(id))
+ return 0;
+
+ ata_dev_id_string(id, model, ATA_ID_PROD_OFS, sizeof(model));
+ ata_dev_id_string(id, fw, ATA_ID_FW_REV_OFS, sizeof(fw));
+
+ if (strncmp(model, "Maxtor", 6))
+ return 1;
+ if (strncmp(fw, "BANC1G10", 8))
+ return 1;
+
+ return 0; /* blacklisted */
+}
+
/**
* ata_scsiop_mode_sense - Simulate MODE SENSE 6, 10 commands
* @args: device IDENTIFY data / SCSI command of interest.
@@ -1797,7 +1819,7 @@ unsigned int ata_scsiop_mode_sense(struc
return 0;

dpofua = 0;
- if (ata_id_has_fua(args->id) && dev->flags & ATA_DFLAG_LBA48 &&
+ if (ata_dev_supports_fua(args->id) && dev->flags & ATA_DFLAG_LBA48 &&
(!(dev->flags & ATA_DFLAG_PIO) || dev->multi_count))
dpofua = 1 << 4;
--
Jens Axboe
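The comment in the patch leaves room for "a real blacklist if it's needed". Below is a standalone C sketch of a table-driven version. It assumes the ATA IDENTIFY DEVICE layout:

- firmware revision in words 23-26 and model number in words 27-46, both packed two characters per word with the high byte first;
- FUA support in word 84 bit 6, valid when bits 15:14 read 01.

The table, the helpers and the model string used in the test are hypothetical and illustrative, not libata code. The DPOFUA condition mirrors the one in the patch:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* IDENTIFY DEVICE layout (ATA spec); identifiers are hypothetical. */
#define ID_FW_REV_OFS   23   /* words 23-26: firmware revision, 8 chars */
#define ID_FW_REV_LEN   8
#define ID_PROD_OFS     27   /* words 27-46: model number, 40 chars */
#define ID_PROD_LEN     40
#define ID_CMDSET_EXT   84   /* word 84: command set/feature extension */

/* Hypothetical blacklist: model prefix plus exact firmware revision.
 * The single entry is the pair discussed in this thread. */
static const struct {
	const char *model_prefix;
	const char *fw;
} fua_blacklist[] = {
	{ "Maxtor", "BANC1G10" },
};

/* Copy an IDENTIFY string (two chars per word, high byte first) into
 * out, which must hold len + 1 bytes; trailing pad spaces are trimmed. */
static void id_string(const uint16_t *id, unsigned ofs, unsigned len, char *out)
{
	unsigned i;

	assert(len % 2 == 0);
	for (i = 0; i < len; i += 2, ofs++) {
		out[i] = (char)(id[ofs] >> 8);
		out[i + 1] = (char)(id[ofs] & 0xff);
	}
	out[len] = '\0';
	while (len > 0 && out[len - 1] == ' ')
		out[--len] = '\0';
}

/* Test helper: store s into IDENTIFY words, space-padded to len chars. */
static void put_id_string(uint16_t *id, unsigned ofs, unsigned len, const char *s)
{
	size_t n = strlen(s);
	unsigned i;

	assert(len % 2 == 0 && n <= len);
	for (i = 0; i < len; i += 2, ofs++) {
		unsigned hi = i < n ? (unsigned char)s[i] : ' ';
		unsigned lo = i + 1 < n ? (unsigned char)s[i + 1] : ' ';
		id[ofs] = (uint16_t)((hi << 8) | lo);
	}
}

/* FUA is advertised in word 84 bit 6; bits 15:14 == 01 mark the word valid. */
static int id_has_fua(const uint16_t *id)
{
	return (id[ID_CMDSET_EXT] & 0xC000) == 0x4000 &&
	       (id[ID_CMDSET_EXT] & (1u << 6)) != 0;
}

static int dev_supports_fua(const uint16_t *id)
{
	char model[ID_PROD_LEN + 1], fw[ID_FW_REV_LEN + 1];
	size_t i;

	if (!id_has_fua(id))
		return 0;
	id_string(id, ID_PROD_OFS, ID_PROD_LEN, model);
	id_string(id, ID_FW_REV_OFS, ID_FW_REV_LEN, fw);
	for (i = 0; i < sizeof(fua_blacklist) / sizeof(fua_blacklist[0]); i++) {
		const char *prefix = fua_blacklist[i].model_prefix;

		if (strncmp(model, prefix, strlen(prefix)) == 0 &&
		    strcmp(fw, fua_blacklist[i].fw) == 0)
			return 0; /* blacklisted */
	}
	return 1;
}

/* Same condition as the patch: report DPOFUA (bit 4 of the mode parameter
 * header's device-specific byte) only for LBA48 devices not in plain PIO. */
static uint8_t mode_sense_dpofua(const uint16_t *id, int lba48, int pio,
				 unsigned multi_count)
{
	if (dev_supports_fua(id) && lba48 && (!pio || multi_count))
		return 1 << 4;
	return 0;
}
```

Design note: the string helpers require even lengths and take buffers of length + 1. IDENTIFY strings are packed two bytes per word and are not NUL-terminated. Trailing padding spaces are trimmed before comparing.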
* Re: regarding bug #5914 - fs corruption on SATA
2006-01-27 9:50 ` Jens Axboe
@ 2006-01-27 19:37 ` Nicolas Mailhot
2006-01-27 23:54 ` Nicolas Mailhot
0 siblings, 1 reply; 33+ messages in thread
From: Nicolas Mailhot @ 2006-01-27 19:37 UTC (permalink / raw)
To: Jens Axboe; +Cc: Bartlomiej Zolnierkiewicz, Tejun Heo, Jeff Garzik, Linux-ide

[-- Attachment #1: Type: text/plain, Size: 527 bytes --]

On Friday 27 January 2006 at 10:50 +0100, Jens Axboe wrote:
> Of course, silly typo! Thanks for catching that. Updated patch below.

...

A patched kernel reboots before it finishes initializing (just before it
prints a line starting with SCSI; the rest is too fast for me to catch).

Now the kernel base is slightly different from yesterday's, so the bug may
be in the base, not the patch. I'll rebuild a new kernel with the same
base and yesterday's patch to check this now.

Regards,

-- 
Nicolas Mailhot

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: regarding bug #5914 - fs corruption on SATA
2006-01-27 19:37 ` Nicolas Mailhot
@ 2006-01-27 23:54 ` Nicolas Mailhot
2006-01-30 15:08 ` Jens Axboe
0 siblings, 1 reply; 33+ messages in thread
From: Nicolas Mailhot @ 2006-01-27 23:54 UTC (permalink / raw)
To: Jens Axboe; +Cc: Bartlomiej Zolnierkiewicz, Tejun Heo, Jeff Garzik, Linux-ide

[-- Attachment #1: Type: text/plain, Size: 706 bytes --]

On Friday 27 January 2006 at 20:37 +0100, Nicolas Mailhot wrote:
> On Friday 27 January 2006 at 10:50 +0100, Jens Axboe wrote:
> 
> > Of course, silly typo! Thanks for catching that. Updated patch below.
> 
> ...
> 
> A patched kernel reboots before it finishes initializing (just before it
> prints a line starting with SCSI; the rest is too fast for me to catch).
> 
> Now the kernel base is slightly different from yesterday's, so the bug may
> be in the base, not the patch. I'll rebuild a new kernel with the same
> base and yesterday's patch to check this now.

I can confirm today's patch is not OK. The same baseline with
yesterday's patch boots fine.

-- 
Nicolas Mailhot

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: regarding bug #5914 - fs corruption on SATA
2006-01-27 23:54 ` Nicolas Mailhot
@ 2006-01-30 15:08 ` Jens Axboe
2006-01-30 23:33 ` Nicolas Mailhot
0 siblings, 1 reply; 33+ messages in thread
From: Jens Axboe @ 2006-01-30 15:08 UTC (permalink / raw)
To: Nicolas Mailhot
Cc: Bartlomiej Zolnierkiewicz, Tejun Heo, Jeff Garzik, Linux-ide

On Sat, Jan 28 2006, Nicolas Mailhot wrote:
> On Friday 27 January 2006 at 20:37 +0100, Nicolas Mailhot wrote:
> > On Friday 27 January 2006 at 10:50 +0100, Jens Axboe wrote:
> > 
> > > Of course, silly typo! Thanks for catching that. Updated patch below.
> > 
> > ...
> > 
> > A patched kernel reboots before it finishes initializing (just before it
> > prints a line starting with SCSI; the rest is too fast for me to catch).
> > 
> > Now the kernel base is slightly different from yesterday's, so the bug may
> > be in the base, not the patch. I'll rebuild a new kernel with the same
> > base and yesterday's patch to check this now.
> 
> I can confirm today's patch is not OK. The same baseline with
> yesterday's patch boots fine.

Is this any better?
diff --git a/drivers/scsi/libata-scsi.c b/drivers/scsi/libata-scsi.c
index cfbceb5..07b1e7c 100644
--- a/drivers/scsi/libata-scsi.c
+++ b/drivers/scsi/libata-scsi.c
@@ -1700,6 +1700,31 @@ static unsigned int ata_msense_rw_recove
 	return sizeof(def_rw_recovery_mpage);
 }
 
+/*
+ * We can turn this into a real blacklist if it's needed, for now just
+ * blacklist any Maxtor BANC1G10 revision firmware
+ */
+static int ata_dev_supports_fua(u16 *id)
+{
+	unsigned char model[41], fw[9];
+
+	if (!ata_id_has_fua(id))
+		return 0;
+
+	model[40] = '\0';
+	fw[8] = '\0';
+
+	ata_dev_id_string(id, model, ATA_ID_PROD_OFS, sizeof(model) - 1);
+	ata_dev_id_string(id, fw, ATA_ID_FW_REV_OFS, sizeof(fw) - 1);
+
+	if (strncmp(model, "Maxtor", 6))
+		return 1;
+	if (strncmp(fw, "BANC1G10", 8))
+		return 1;
+
+	return 0; /* blacklisted */
+}
+
 /**
  * ata_scsiop_mode_sense - Simulate MODE SENSE 6, 10 commands
  * @args: device IDENTIFY data / SCSI command of interest.
@@ -1797,7 +1822,7 @@ unsigned int ata_scsiop_mode_sense(struc
 		return 0;
 
 	dpofua = 0;
-	if (ata_id_has_fua(args->id) && dev->flags & ATA_DFLAG_LBA48 &&
+	if (ata_dev_supports_fua(args->id) && dev->flags & ATA_DFLAG_LBA48 &&
 	    (!(dev->flags & ATA_DFLAG_PIO) || dev->multi_count))
 		dpofua = 1 << 4;
 
-- 
Jens Axboe

^ permalink raw reply related [flat|nested] 33+ messages in thread
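The thread does not say why this version boots while the previous one rebooted. One plausible explanation, which we have not checked against the kernel sources of the time: the libata string helper copied two bytes per IDENTIFY word and counted an unsigned length down by two, so an odd length such as `sizeof(model)` (41) never reaches zero and the copy runs past the end of the stack buffer. The user-space sketch below is built on that assumption. `id_string`, `supports_fua` and `put_id_string` are hypothetical names. Word offsets 23 (firmware revision) and 27 (model) follow the ATA IDENTIFY layout, and treating word 84 bit 6 as the FUA capability bit is our reading of what `ata_id_has_fua()` tests.

```c
#include <stdint.h>
#include <string.h>

#define ID_FW_REV_OFS 23  /* IDENTIFY words 23-26: firmware revision */
#define ID_PROD_OFS   27  /* IDENTIFY words 27-46: model number */

/* Copy an IDENTIFY string: each 16-bit word holds two characters,
 * high byte first. len is a byte count and must be even; with an odd
 * len, "len -= 2" on an unsigned counter would skip zero, wrap around
 * and keep writing past the end of s, so such calls are refused. */
static void id_string(const uint16_t *id, unsigned char *s,
		      unsigned int ofs, unsigned int len)
{
	if (len % 2 != 0)
		return;
	while (len > 0) {
		*s++ = (unsigned char)(id[ofs] >> 8);
		*s++ = (unsigned char)(id[ofs] & 0xff);
		ofs++;
		len -= 2;
	}
}

/* Returns 1 if FUA may be used, 0 if unsupported or blacklisted. */
static int supports_fua(const uint16_t *id)
{
	unsigned char model[41], fw[9];

	/* Assumption: word 84 bit 6 is the FUA capability bit. */
	if (!(id[84] & (1u << 6)))
		return 0;

	model[40] = '\0';
	fw[8] = '\0';
	id_string(id, model, ID_PROD_OFS, sizeof(model) - 1); /* 40, even */
	id_string(id, fw, ID_FW_REV_OFS, sizeof(fw) - 1);     /* 8, even */

	if (strncmp((const char *)model, "Maxtor", 6) != 0)
		return 1;
	if (strncmp((const char *)fw, "BANC1G10", 8) != 0)
		return 1;
	return 0; /* blacklisted */
}

/* Test helper: pack ASCII text into IDENTIFY words, space padded. */
static void put_id_string(uint16_t *id, unsigned int ofs,
			  unsigned int len, const char *str)
{
	size_t n = strlen(str);

	for (unsigned int i = 0; i < len; i += 2) {
		unsigned char hi = i < n ? (unsigned char)str[i] : ' ';
		unsigned char lo = i + 1 < n ? (unsigned char)str[i + 1] : ' ';
		id[ofs + i / 2] = (uint16_t)((hi << 8) | lo);
	}
}
```

Passing `sizeof(buf) - 1` keeps the length even and leaves the last byte for the terminator, which matches the shape of the fix in the patch.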
* Re: regarding bug #5914 - fs corruption on SATA
2006-01-30 15:08 ` Jens Axboe
@ 2006-01-30 23:33 ` Nicolas Mailhot
2006-01-31 7:26 ` Jens Axboe
0 siblings, 1 reply; 33+ messages in thread
From: Nicolas Mailhot @ 2006-01-30 23:33 UTC (permalink / raw)
To: Jens Axboe; +Cc: Bartlomiej Zolnierkiewicz, Tejun Heo, Jeff Garzik, Linux-ide

[-- Attachment #1: Type: text/plain, Size: 307 bytes --]

On Monday 30 January 2006 at 16:08 +0100, Jens Axboe wrote:
> On Sat, Jan 28 2006, Nicolas Mailhot wrote:
> > I can confirm today's patch is not OK. The same baseline with
> > yesterday's patch boots fine.
> 
> Is this any better?

This one seems to work fine.

Regards,

-- 
Nicolas Mailhot

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: regarding bug #5914 - fs corruption on SATA
2006-01-30 23:33 ` Nicolas Mailhot
@ 2006-01-31 7:26 ` Jens Axboe
2006-01-31 8:39 ` Nicolas Mailhot
0 siblings, 1 reply; 33+ messages in thread
From: Jens Axboe @ 2006-01-31 7:26 UTC (permalink / raw)
To: Nicolas Mailhot
Cc: Bartlomiej Zolnierkiewicz, Tejun Heo, Jeff Garzik, Linux-ide

On Tue, Jan 31 2006, Nicolas Mailhot wrote:
> On Monday 30 January 2006 at 16:08 +0100, Jens Axboe wrote:
> > On Sat, Jan 28 2006, Nicolas Mailhot wrote:
> > > I can confirm today's patch is not OK. The same baseline with
> > > yesterday's patch boots fine.
> > 
> > Is this any better?
> 
> This one seems to work fine.

And you don't get "w/ FUA" messages from the problematic drives - and
your data appears safe? Just checking, we cannot take these corruption
things lightly.

-- 
Jens Axboe

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: regarding bug #5914 - fs corruption on SATA
2006-01-31 7:26 ` Jens Axboe
@ 2006-01-31 8:39 ` Nicolas Mailhot
2006-01-31 8:47 ` Jens Axboe
0 siblings, 1 reply; 33+ messages in thread
From: Nicolas Mailhot @ 2006-01-31 8:39 UTC (permalink / raw)
To: Jens Axboe; +Cc: Bartlomiej Zolnierkiewicz, Tejun Heo, Jeff Garzik, Linux-ide

On Tue 31 January 2006 08:26, Jens Axboe wrote:
> On Tue, Jan 31 2006, Nicolas Mailhot wrote:
>> On Monday 30 January 2006 at 16:08 +0100, Jens Axboe wrote:
>> > On Sat, Jan 28 2006, Nicolas Mailhot wrote:
>> > > I can confirm today's patch is not OK. The same baseline with
>> > > yesterday's patch boots fine.
>> >
>> > Is this any better?
>>
>> This one seems to work fine.
>
> And you don't get "w/ FUA" messages from the problematic drives - and
> your data appears safe? Just checking, we cannot take these corruption
> things lightly.

I didn't spend a lot of time on this; the build finished rather late
in the evening/night. What I can say is the dramatic breakage I had
before is gone and I don't think there was any error in dmesg (will
post it this evening if you want). With dm+raid when FUA broke things
it was difficult to miss (screenfuls of ATA/raid errors, fs
corruption on reboot, etc.)

Now if you ask me whether I did some heavy I/O to stress the system: no,
I haven't yet. If problems still lurk, they are a lot less extensive than
they were before. I did trigger a full FS autorelabel, so at least the
read part was tested a bit.

Regards,

-- 
Nicolas Mailhot

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: regarding bug #5914 - fs corruption on SATA
2006-01-31 8:39 ` Nicolas Mailhot
@ 2006-01-31 8:47 ` Jens Axboe
2006-01-31 22:54 ` Nicolas Mailhot
0 siblings, 1 reply; 33+ messages in thread
From: Jens Axboe @ 2006-01-31 8:47 UTC (permalink / raw)
To: Nicolas Mailhot
Cc: Bartlomiej Zolnierkiewicz, Tejun Heo, Jeff Garzik, Linux-ide

On Tue, Jan 31 2006, Nicolas Mailhot wrote:
> 
> On Tue 31 January 2006 08:26, Jens Axboe wrote:
> > On Tue, Jan 31 2006, Nicolas Mailhot wrote:
> >> On Monday 30 January 2006 at 16:08 +0100, Jens Axboe wrote:
> >> > On Sat, Jan 28 2006, Nicolas Mailhot wrote:
> >> > > I can confirm today's patch is not OK. The same baseline with
> >> > > yesterday's patch boots fine.
> >> >
> >> > Is this any better?
> >>
> >> This one seems to work fine.
> >
> > And you don't get "w/ FUA" messages from the problematic drives - and
> > your data appears safe? Just checking, we cannot take these corruption
> > things lightly.
> 
> I didn't spend a lot of time on this; the build finished rather late
> in the evening/night. What I can say is the dramatic breakage I had
> before is gone and I don't think there was any error in dmesg (will
> post it this evening if you want). With dm+raid when FUA broke things
> it was difficult to miss (screenfuls of ATA/raid errors, fs
> corruption on reboot, etc.)
> 
> Now if you ask me whether I did some heavy I/O to stress the system: no,
> I haven't yet. If problems still lurk, they are a lot less extensive than
> they were before. I did trigger a full FS autorelabel, so at least the
> read part was tested a bit.

Sounds like it works, if you saw the errors so quickly. Just trying to
be absolutely sure, if you could check for the "w/ FUA" prints not being
there now it would confirm that the blacklist does its job.

-- 
Jens Axboe

^ permalink raw reply [flat|nested] 33+ messages in thread
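Checking for the "w/ FUA" prints can be done by eye, but a small filter makes the check repeatable. The sketch below assumes the sd driver reports FUA at the end of its "drive cache:" line, as in `SCSI device sda: drive cache: write back w/ FUA`. The helpers `line_reports_fua` and `report_fua_lines` are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if a kernel log line is an sd "drive cache:" report that
 * still advertises FUA, 0 otherwise. */
static int line_reports_fua(const char *line)
{
	return strstr(line, "drive cache:") != NULL &&
	       strstr(line, "w/ FUA") != NULL;
}

/* Copies every FUA-advertising line from `in` to `out` and returns
 * how many were found. Lines longer than the buffer are examined in
 * pieces, which is good enough for dmesg output. */
static int report_fua_lines(FILE *in, FILE *out)
{
	char buf[512];
	int hits = 0;

	while (fgets(buf, sizeof(buf), in) != NULL) {
		if (line_reports_fua(buf)) {
			fputs(buf, out);
			hits++;
		}
	}
	return hits;
}
```

If you wrap it in a `main` that calls `report_fua_lines(stdin, stdout)`, you can feed it with `dmesg | ./fua_check`. Getting no output line for the blacklisted Maxtor drives is the confirmation Jens is asking for.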
* Re: regarding bug #5914 - fs corruption on SATA
2006-01-31 8:47 ` Jens Axboe
@ 2006-01-31 22:54 ` Nicolas Mailhot
0 siblings, 0 replies; 33+ messages in thread
From: Nicolas Mailhot @ 2006-01-31 22:54 UTC (permalink / raw)
To: Jens Axboe; +Cc: Bartlomiej Zolnierkiewicz, Tejun Heo, Jeff Garzik, Linux-ide

Jens Axboe wrote:
> Sounds like it works, if you saw the errors so quickly.

Well, you know, while I was testing broken kernels the problem was more
keeping my finger on the reset button at boot and acting before too much
damage was done, rather than waiting for hard-to-spot symptoms. It seems
root on md1+lvm is very good at flushing out FUA problems.

> Just trying to
> be absolutely sure, if you could check for the "w/ FUA" prints not being
> there now it would confirm that the blacklist does its job.

The patched kernel ran for a day without hiccups. I've attached its
dmesg to the Red Hat bug, so you can check if it's OK for you (yes, I've
rebooted since, and no, it was not an ATA problem, just your
run-of-the-mill rawhide xorg freeze).

https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=123941

Regards,

-- 
Nicolas Mailhot

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: regarding bug #5914 - fs corruption on SATA
2006-01-27 8:13 ` Jens Axboe
2006-01-27 8:53 ` Nicolas Mailhot
@ 2006-01-27 12:12 ` Ric Wheeler
2006-01-27 12:23 ` Jens Axboe
1 sibling, 1 reply; 33+ messages in thread
From: Ric Wheeler @ 2006-01-27 12:12 UTC (permalink / raw)
To: Jens Axboe; +Cc: Nicolas Mailhot, Tejun Heo, Jeff Garzik, Linux-ide

Jens Axboe wrote:

>On Thu, Jan 26 2006, Nicolas Mailhot wrote:
>
>>
>>I applied the fua backout patch and the kernel booted beautifully.
>>Now I guess I need to see if Maxtor released a fixed firmware right ?
>>(is it possible to change the firmware on a running system ?)
>>
>
>If you can get an update firmware, it is usually done by booting from
>DOS floppy and running a special flash utility from there. Can you send
>me the hdparm -I /dev/sdX output of the problem drive? I think we should
>just blacklist it for FUA. This bug is so obscure I think it's a better
>solution than adding a FUA disable module parameter at this point.
>
I am not sure that drive vendors support firmware upgrades - the
downside is that you can produce a nice paperweight if the firmware
upgrade fails ;-)

Also, there are specific versions where I know that you cannot jump from
firmware version X to version X + 1.

ric

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: regarding bug #5914 - fs corruption on SATA
2006-01-27 12:12 ` Ric Wheeler
@ 2006-01-27 12:23 ` Jens Axboe
0 siblings, 0 replies; 33+ messages in thread
From: Jens Axboe @ 2006-01-27 12:23 UTC (permalink / raw)
To: Ric Wheeler; +Cc: Nicolas Mailhot, Tejun Heo, Jeff Garzik, Linux-ide

On Fri, Jan 27 2006, Ric Wheeler wrote:
> Jens Axboe wrote:
> 
> >On Thu, Jan 26 2006, Nicolas Mailhot wrote:
> >
> >>
> >>I applied the fua backout patch and the kernel booted beautifully.
> >>Now I guess I need to see if Maxtor released a fixed firmware right ?
> >>(is it possible to change the firmware on a running system ?)
> >>
> >
> >If you can get an update firmware, it is usually done by booting from
> >DOS floppy and running a special flash utility from there. Can you send
> >me the hdparm -I /dev/sdX output of the problem drive? I think we should
> >just blacklist it for FUA. This bug is so obscure I think it's a better
> >solution than adding a FUA disable module parameter at this point.
> >
> I am not sure that drive vendors support firmware upgrades - the
> downside is that you can produce a nice paperweight if the firmware
> upgrade fails ;-)

Yeah, hence most of them don't put it online. It would be nice to have,
though...

> Also, there are specific versions where I know that you cannot jump from
> firmware version X to version X + 1.

That's unfortunate, too.

-- 
Jens Axboe

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: regarding bug #5914 - fs corruption on SATA
2006-01-26 5:50 regarding bug #5914 - fs corruption on SATA Tejun Heo
2006-01-26 5:51 ` Tejun Heo
@ 2006-01-26 9:18 ` Jens Axboe
2006-01-26 14:11 ` Bartlomiej Zolnierkiewicz
2006-01-26 16:41 ` David Greaves
2 siblings, 1 reply; 33+ messages in thread
From: Jens Axboe @ 2006-01-26 9:18 UTC (permalink / raw)
To: Tejun Heo; +Cc: Nicolas.Mailhot, Jeff Garzik, Linux-ide

On Thu, Jan 26 2006, Tejun Heo wrote:
> Hello, Nicolas. Hello, all.
> 
> Nicolas, I'm probably the guy who broke your filesystem. :-p This FUA
> (forced-unit-access)thing made into the mainline lately, and it seems
> that your drive is reporting FUA support but doesn't really do it
> properly when it's asked to.

It's strange. I have 3 out of 4 drives in a box here reporting FUA
capability, and I have now tested all three of them both with plain FUA
writes and NCQ FUA tagged writes. I used data integrity verifying
writes, and the data is sound as well. fs likewise, I used ext3 mounted
with barriers enabled.

What exact model drive is this? It could also be a raid funny. Tejun's
proposal of testing the drive alone with ext3+barriers is a good one.

-- 
Jens Axboe

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: regarding bug #5914 - fs corruption on SATA
2006-01-26 9:18 ` Jens Axboe
@ 2006-01-26 14:11 ` Bartlomiej Zolnierkiewicz
2006-01-26 14:27 ` Jens Axboe
0 siblings, 1 reply; 33+ messages in thread
From: Bartlomiej Zolnierkiewicz @ 2006-01-26 14:11 UTC (permalink / raw)
To: Jens Axboe; +Cc: Tejun Heo, Nicolas.Mailhot, Jeff Garzik, Linux-ide

On 1/26/06, Jens Axboe <axboe@suse.de> wrote:
> On Thu, Jan 26 2006, Tejun Heo wrote:
> > Hello, Nicolas. Hello, all.
> >
> > Nicolas, I'm probably the guy who broke your filesystem. :-p This FUA
> > (forced-unit-access)thing made into the mainline lately, and it seems
> > that your drive is reporting FUA support but doesn't really do it
> > properly when it's asked to.
>
> It's strange. I have 3 out of 4 drives in a box here reporting FUA
> capability, and I have now tested all three of them both with plain FUA
> writes and NCQ FUA tagged writes. I used data integrity verifying
> writes, and the data is sound as well. fs likewise, I used ext3 mounted
> with barriers enabled.

You are just lucky ;-). There are drives out there having buggy NCQ
support. The Windows driver for Sil has them listed in its .inf file; you can also
find discussions on various internet forums about problems (including
data corruption) with these drives and controllers supporting NCQ.
However, there is an official firmware update, so hopefully it can be fixed
(unfortunately we still need to blacklist buggy firmware revisions).

I wouldn't be surprised if there is a similar situation with FUA support.
Does anybody know if Windows uses FUA?

Bartlomiej

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: regarding bug #5914 - fs corruption on SATA
2006-01-26 14:11 ` Bartlomiej Zolnierkiewicz
@ 2006-01-26 14:27 ` Jens Axboe
0 siblings, 0 replies; 33+ messages in thread
From: Jens Axboe @ 2006-01-26 14:27 UTC (permalink / raw)
To: Bartlomiej Zolnierkiewicz
Cc: Tejun Heo, Nicolas.Mailhot, Jeff Garzik, Linux-ide

On Thu, Jan 26 2006, Bartlomiej Zolnierkiewicz wrote:
> On 1/26/06, Jens Axboe <axboe@suse.de> wrote:
> > On Thu, Jan 26 2006, Tejun Heo wrote:
> > > Hello, Nicolas. Hello, all.
> > >
> > > Nicolas, I'm probably the guy who broke your filesystem. :-p This FUA
> > > (forced-unit-access)thing made into the mainline lately, and it seems
> > > that your drive is reporting FUA support but doesn't really do it
> > > properly when it's asked to.
> >
> > It's strange. I have 3 out of 4 drives in a box here reporting FUA
> > capability, and I have now tested all three of them both with plain FUA
> > writes and NCQ FUA tagged writes. I used data integrity verifying
> > writes, and the data is sound as well. fs likewise, I used ext3 mounted
> > with barriers enabled.
>
> You are just lucky ;-). There are drives out there having buggy NCQ
> support. The Windows driver for Sil has them listed in its .inf file; you can also

Oh yeah, I'm very well aware of NCQ firmware issues. Interesting about
the Sil inf file having them listed; I started a blacklist myself for
NCQ and I'll be sure to look at theirs!

> find discussions on various internet forums about problems (including
> data corruption) with these drives and controllers supporting NCQ.

But that's a separate story; it's just plain FUA that is the case here.

> However, there is an official firmware update, so hopefully it can be fixed
> (unfortunately we still need to blacklist buggy firmware revisions).
>
> I wouldn't be surprised if there is a similar situation with FUA support.
> Does anybody know if Windows uses FUA?

Actually I'm a little surprised, honestly. It's a pretty simple feature.
It's not like NCQ, where I can understand that bugs can creep into the
firmware (although some are so buggy it's unbelievable).

-- 
Jens Axboe

^ permalink raw reply [flat|nested] 33+ messages in thread
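Jens and Bartlomiej both end up with per-firmware blacklists: one for FUA here, one for NCQ elsewhere. The patch comment also anticipates turning the single Maxtor check into "a real blacklist". The sketch below shows a table-driven version of that idea. The flag names, the struct and the second table entry are placeholders of ours; only the Maxtor/BANC1G10 pair comes from this thread. Matching is by prefix, like the `strncmp` calls in the patch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical quirk flags; not libata's names. */
enum {
	QUIRK_NO_FUA = 1 << 0,
	QUIRK_NO_NCQ = 1 << 1,
};

struct drive_quirk {
	const char *model_prefix; /* compared against the start of the model */
	const char *fw_prefix;    /* firmware revision prefix, NULL = any */
	unsigned int flags;
};

/* Only the first entry comes from this thread; the second is a
 * placeholder showing a whole-model entry. */
static const struct drive_quirk quirk_table[] = {
	{ "Maxtor", "BANC1G10", QUIRK_NO_FUA },
	{ "EXAMPLE-DRIVE", NULL, QUIRK_NO_NCQ },
};

static int has_prefix(const char *s, const char *prefix)
{
	return strncmp(s, prefix, strlen(prefix)) == 0;
}

/* Returns the OR of the flags of every entry matching model and fw. */
static unsigned int drive_quirks(const char *model, const char *fw)
{
	unsigned int flags = 0;

	for (size_t i = 0; i < sizeof(quirk_table) / sizeof(quirk_table[0]); i++) {
		const struct drive_quirk *q = &quirk_table[i];

		if (!has_prefix(model, q->model_prefix))
			continue;
		if (q->fw_prefix != NULL && !has_prefix(fw, q->fw_prefix))
			continue;
		flags |= q->flags;
	}
	return flags;
}
```

Keeping the firmware field optional lets one table hold both "this firmware revision only" entries (the FUA case) and "every revision of this model" entries.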
* Re: regarding bug #5914 - fs corruption on SATA
2006-01-26 5:50 regarding bug #5914 - fs corruption on SATA Tejun Heo
2006-01-26 5:51 ` Tejun Heo
2006-01-26 9:18 ` Jens Axboe
@ 2006-01-26 16:41 ` David Greaves
2006-01-26 16:58 ` Jeff Garzik
2 siblings, 1 reply; 33+ messages in thread
From: David Greaves @ 2006-01-26 16:41 UTC (permalink / raw)
To: Tejun Heo
Cc: Nicolas.Mailhot, Jeff Garzik, Jens Axboe, Linux-ide,
Christopher Smith, Erik Slagter, hahn, mlaks, Soeren Sonnenburg,
mlaks

Have you guys seen the parallel threads (in linux-ide and linux-raid)
that have been reporting very similar problems for a few days now?

Have a look for subjects such as
Problems with multiple Promise SATA150 TX4 cards
Possible libata/sata/Asus problem (was Re: Need to upgrade to latest
stable mdadm version?)

For me, please see:
http://marc.theaimsgroup.com/?l=linux-kernel&m=113769509617034&w=2

David
PS I'm using XFS on md5 and md1
PPS Buying a new £60+ PSU didn't fix my problem - <sigh>

Tejun Heo wrote:

>Hello, Nicolas. Hello, all.
>
>Nicolas, I'm probably the guy who broke your filesystem. :-p This FUA
>(forced-unit-access)thing made into the mainline lately, and it seems
>that your drive is reporting FUA support but doesn't really do it
>properly when it's asked to.
>
>Can you try the followings to verify the problem?
>
>1. make a small partition on the affected drive and do mkfs.ext3 on it.
>2. mount -o barrier new_partition /mnt/tmp
>3. cd /mnt/tmp; touch asdf; sync
>
>This should give something like the following.
>
>======
>ata2: port reset, p_is 40000001 is 2 pis 0 cmd 44017 tf 451 ss 123 se 0
>ata2: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
>ata2: status=0x51 { DriveReady SeekComplete Error }
>ata2: error=0x04 { DriveStatusError }
>ata2: port reset, p_is 40000001 is 2 pis 0 cmd 44017 tf 451 ss 123 se 0
>ata2: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
>ata2: status=0x51 { DriveReady SeekComplete Error }
>ata2: error=0x04 { DriveStatusError }
>ata2: port reset, p_is 40000001 is 2 pis 0 cmd 44017 tf 451 ss 123 se 0
>ata2: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
>ata2: status=0x51 { DriveReady SeekComplete Error }
>ata2: error=0x04 { DriveStatusError }
>ata2: port reset, p_is 40000001 is 2 pis 0 cmd 44017 tf 451 ss 123 se 0
>ata2: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
>ata2: status=0x51 { DriveReady SeekComplete Error }
>ata2: error=0x04 { DriveStatusError }
>ata2: port reset, p_is 40000001 is 2 pis 0 cmd 44017 tf 451 ss 123 se 0
>ata2: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
>ata2: status=0x51 { DriveReady SeekComplete Error }
>ata2: error=0x04 { DriveStatusError }
>sd 2:0:0:0: SCSI error: return code = 0x8000002
>sdc: Current: sense key: Aborted Command
>    Additional sense: No additional sense information
>end_request: I/O error, dev sdc, sector 4359
>Buffer I/O error on device sdc1, logical block 537
>lost page write due to I/O error on sdc1
>Aborting journal on device sdc1.
>journal commit I/O error
>======
>
>The ext3 fs will back off and won't use any barrier from this point.
>
>If this is what you see, please apply the patch at the end of this
>mail, which makes libata issue non-FUA commmands even if FUA commands
>are asked for. After recompiling repeat above, create some files,
>unmount, mount, verify stuff, unmount and fsck... All should succeed
>without any complaint from the kernel.
>
>If my guess turns out to be true, we'll need a blacklist for those
>lying drives. Damn it.
>
>diff --git a/drivers/scsi/libata-core.c b/drivers/scsi/libata-core.c
>index 46c4cdb..6ba6ad2 100644
>--- a/drivers/scsi/libata-core.c
>+++ b/drivers/scsi/libata-core.c
>@@ -565,7 +565,7 @@ static const u8 ata_rw_cmds[] = {
> 	0,
> 	0,
> 	0,
>-	ATA_CMD_WRITE_MULTI_FUA_EXT,
>+	ATA_CMD_WRITE_MULTI_EXT,
> 	/* pio */
> 	ATA_CMD_PIO_READ,
> 	ATA_CMD_PIO_WRITE,
>@@ -583,7 +583,7 @@ static const u8 ata_rw_cmds[] = {
> 	0,
> 	0,
> 	0,
>-	ATA_CMD_WRITE_FUA_EXT
>+	ATA_CMD_WRITE_EXT
> };
> 
> /**
>
-- 

^ permalink raw reply [flat|nested] 33+ messages in thread
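Tejun's diagnostic patch works by editing a lookup table rather than the code that consults it. The sketch below shows that kind of table. It assumes, based on the shape of the diff, three 8-entry groups (PIO-multi, PIO, DMA) indexed within each group by `fua * 4 + lba48 * 2 + is_write`. The opcodes are the standard ATA values; the identifiers and the `disable_fua` switch are hypothetical. For the two LBA48 FUA slots the patch edits, setting `disable_fua` yields the same plain write opcode the patched table would.

```c
#include <stdbool.h>
#include <stdint.h>

/* Standard ATA opcodes for the commands in the table. */
enum {
	CMD_READ_MULTI          = 0xC4,
	CMD_WRITE_MULTI         = 0xC5,
	CMD_READ_MULTI_EXT      = 0x29,
	CMD_WRITE_MULTI_EXT     = 0x39,
	CMD_WRITE_MULTI_FUA_EXT = 0xCE,
	CMD_PIO_READ            = 0x20,
	CMD_PIO_WRITE           = 0x30,
	CMD_PIO_READ_EXT        = 0x24,
	CMD_PIO_WRITE_EXT       = 0x34,
	CMD_READ_DMA            = 0xC8,
	CMD_WRITE_DMA           = 0xCA,
	CMD_READ_DMA_EXT        = 0x25,
	CMD_WRITE_DMA_EXT       = 0x35,
	CMD_WRITE_DMA_FUA_EXT   = 0x3D,
};

/* Base index of each 8-entry group. */
enum xfer_mode { XFER_PIO_MULTI = 0, XFER_PIO = 8, XFER_DMA = 16 };

/* Within a group the index is fua * 4 + lba48 * 2 + is_write; 0 marks
 * a combination this table has no command for. */
static const uint8_t rw_cmds[24] = {
	/* pio multi */
	CMD_READ_MULTI, CMD_WRITE_MULTI, CMD_READ_MULTI_EXT, CMD_WRITE_MULTI_EXT,
	0, 0, 0, CMD_WRITE_MULTI_FUA_EXT,
	/* pio */
	CMD_PIO_READ, CMD_PIO_WRITE, CMD_PIO_READ_EXT, CMD_PIO_WRITE_EXT,
	0, 0, 0, 0,
	/* dma */
	CMD_READ_DMA, CMD_WRITE_DMA, CMD_READ_DMA_EXT, CMD_WRITE_DMA_EXT,
	0, 0, 0, CMD_WRITE_DMA_FUA_EXT,
};

/* Returns the opcode to issue, or 0 if the combination is invalid.
 * disable_fua drops the FUA request before the lookup, so an LBA48 FUA
 * write becomes the plain EXT write, which is what the diagnostic patch
 * achieves by rewriting those two table slots. */
static uint8_t pick_rw_cmd(enum xfer_mode mode, bool fua, bool lba48,
			   bool is_write, bool disable_fua)
{
	if (disable_fua)
		fua = false;
	return rw_cmds[mode + (fua ? 4 : 0) + (lba48 ? 2 : 0) + (is_write ? 1 : 0)];
}
```

The empty single-sector PIO FUA slot is also why the MODE SENSE code only advertises FUA in DMA or PIO-multi mode.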
* Re: regarding bug #5914 - fs corruption on SATA
2006-01-26 16:41 ` David Greaves
@ 2006-01-26 16:58 ` Jeff Garzik
2006-01-26 17:15 ` David Greaves
2006-01-26 17:20 ` regarding bug #5914 - fs corruption on SATA Soeren Sonnenburg
0 siblings, 2 replies; 33+ messages in thread
From: Jeff Garzik @ 2006-01-26 16:58 UTC (permalink / raw)
To: David Greaves
Cc: Tejun Heo, Nicolas.Mailhot, Jens Axboe, Linux-ide,
Christopher Smith, Erik Slagter, hahn, mlaks, Soeren Sonnenburg,
mlaks

David Greaves wrote:
> Have a look for subjects such as
> Problems with multiple Promise SATA150 TX4 cards

This is almost certainly either a power or PCI bus/slot issue.

> Possible libata/sata/Asus problem (was Re: Need to upgrade to latest
> stable mdadm version?)

Highly likely to be a motherboard/BIOS issue related to properly
tuning and timing the hardware.

HOWEVER, libata can help (via Tejun's recent patches) by properly
handling the error when it is thrown to us by the hardware.

	Jeff

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: regarding bug #5914 - fs corruption on SATA
2006-01-26 16:58 ` Jeff Garzik
@ 2006-01-26 17:15 ` David Greaves
2006-02-07 18:35 ` SMART on SATA reporting errors? (was Re: regarding bug #5914 - fs corruption on SATA) David Greaves
2006-01-26 17:20 ` regarding bug #5914 - fs corruption on SATA Soeren Sonnenburg
1 sibling, 1 reply; 33+ messages in thread
From: David Greaves @ 2006-01-26 17:15 UTC (permalink / raw)
To: Jeff Garzik
Cc: Tejun Heo, Nicolas.Mailhot, Jens Axboe, Linux-ide,
Christopher Smith, Erik Slagter, hahn, mlaks, Soeren Sonnenburg,
mlaks

Jeff Garzik wrote:

> David Greaves wrote:
>
>> Possible libata/sata/Asus problem (was Re: Need to upgrade to latest
>> stable mdadm version?)
>
> Highly likely to be a motherboard/BIOS issue related to properly
> tuning and timing the hardware.
>
> HOWEVER, libata can help (via Tejun's recent patches) by properly
> handling the error when it is thrown to us by the hardware.

OK - I thought my messages:

Jan 20 06:25:04 haze kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
Jan 20 06:25:04 haze kernel: ata2: error=0x04 { DriveStatusError }
Jan 20 06:25:10 haze kernel: ata2: no sense translation for status: 0x51
Jan 20 06:25:10 haze kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
Jan 20 06:25:18 haze kernel: ata2: no sense translation for status: 0x51
Jan 20 06:25:18 haze kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
Jan 20 06:25:18 haze kernel: ata2: no sense translation for status: 0x51
Jan 20 06:25:18 haze kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
Jan 20 06:25:20 haze kernel: ata2: no sense translation for status: 0x51
Jan 20 06:25:20 haze kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
Jan 20 06:25:22 haze kernel: ata2: no sense translation for status: 0x51
Jan 20 06:25:22 haze kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
Jan 20 06:25:52 haze kernel: ata2: no sense translation for status: 0x51
Jan 20 06:25:52 haze kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
Jan 20 06:25:52 haze kernel: sd 1:0:0:0: SCSI error: return code = 0x8000002
Jan 20 06:25:52 haze kernel: sdb: Current: sense key: Medium Error
Jan 20 06:25:52 haze kernel:     Additional sense: Unrecovered read error - auto reallocate failed
Jan 20 06:25:52 haze kernel: end_request: I/O error, dev sdb, sector 390787713

bore a certain similarity to those in Tejun/Nicolas' mail.

Different problem?
As irq might ask: "does anybody care?" :)

(and yes, badblocks and SMART report that all is well)

David

-- 

^ permalink raw reply [flat|nested] 33+ messages in thread
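The repeated `status=0x51` / `error=0x04` pairs are easier to compare once decoded. 0x51 is DRDY | DSC | ERR, and error bit 0x04 is ABRT, meaning the drive aborted the command. In Tejun's log, libata translated this to sense key 0xB, Aborted Command. David's final read failure is reported differently, as a Medium Error. The sketch below does this decoding. Its labels mirror the ones in the logs, but `describe_status`, `command_aborted` and `append` are hypothetical helpers, not libata functions.

```c
#include <stdint.h>
#include <string.h>

/* ATA Status register bits */
#define ATA_ST_BSY  0x80
#define ATA_ST_DRDY 0x40
#define ATA_ST_DF   0x20
#define ATA_ST_DSC  0x10
#define ATA_ST_DRQ  0x08
#define ATA_ST_CORR 0x04
#define ATA_ST_IDX  0x02
#define ATA_ST_ERR  0x01

/* ATA Error register bit: command aborted */
#define ATA_ER_ABRT 0x04

/* Appends src to the NUL-terminated buf without overflowing len bytes. */
static void append(char *buf, size_t len, const char *src)
{
	size_t used = strlen(buf);

	if (used + 1 < len)
		strncat(buf, src, len - used - 1);
}

/* Formats a status byte as "{ Name Name ... }". While BSY is set the
 * other bits are not meaningful, so only "Busy" is shown. */
static void describe_status(uint8_t st, char *buf, size_t len)
{
	static const struct { uint8_t bit; const char *name; } bits[] = {
		{ ATA_ST_DRDY, "DriveReady" },
		{ ATA_ST_DF, "DeviceFault" },
		{ ATA_ST_DSC, "SeekComplete" },
		{ ATA_ST_DRQ, "DataRequest" },
		{ ATA_ST_CORR, "CorrectedError" },
		{ ATA_ST_IDX, "Index" },
		{ ATA_ST_ERR, "Error" },
	};

	if (len == 0)
		return;
	buf[0] = '\0';
	append(buf, len, "{");
	if (st & ATA_ST_BSY) {
		append(buf, len, " Busy");
	} else {
		for (size_t i = 0; i < sizeof(bits) / sizeof(bits[0]); i++) {
			if (st & bits[i].bit) {
				append(buf, len, " ");
				append(buf, len, bits[i].name);
			}
		}
	}
	append(buf, len, " }");
}

/* True when the drive reported an error and flagged the command as aborted. */
static int command_aborted(uint8_t st, uint8_t err)
{
	return (st & ATA_ST_ERR) && (err & ATA_ER_ABRT);
}
```

An abort with DRDY still set suggests the drive is alive but rejected the specific command. That matches Tejun's reading of the FUA case, and it is a different failure from a medium error on a read.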
* SMART on SATA reporting errors? (was Re: regarding bug #5914 - fs corruption on SATA)
2006-01-26 17:15 ` David Greaves
@ 2006-02-07 18:35 ` David Greaves
2006-02-07 19:30 ` Jeff Garzik
0 siblings, 1 reply; 33+ messages in thread
From: David Greaves @ 2006-02-07 18:35 UTC (permalink / raw)
To: Linux-ide
Cc: Jeff Garzik, Tejun Heo, Nicolas.Mailhot, Jens Axboe,
Christopher Smith, Erik Slagter, hahn, mlaks, Soeren Sonnenburg,
mlaks, smartmontools-support

This is a follow-on to the email below.

Basically, it seems some SMART commands produce unexpected errors.

My Debian smartd config has "-o on" and "-S on" for every drive, so it
puts out lots of errors every time I boot.

I did a little investigation, and I see that when I do:

# smartctl -o on -data /dev/sdb
smartctl version 5.34 [i686-pc-linux-gnu] Copyright (C) 2002-5 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF ENABLE/DISABLE COMMANDS SECTION ===
Error SMART Disable Automatic Offline failed: Input/output error
Smartctl: SMART Disable Automatic Offline Failed.

(Which is fine if the drive doesn't support it.)
I unexpectedly get this in dmesg:

ata2: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata2: status=0x51 { DriveReady SeekComplete Error }
ata2: error=0x04 { DriveStatusError }
ata2: no sense translation for status: 0x51
ata2: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
ata2: status=0x51 { DriveReady SeekComplete Error }
ata2: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata2: status=0x51 { DriveReady SeekComplete Error }
ata2: error=0x04 { DriveStatusError }
ata2: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata2: status=0x51 { DriveReady SeekComplete Error }
ata2: error=0x04 { DriveStatusError }
ata2: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata2: status=0x51 { DriveReady SeekComplete Error }
ata2: error=0x04 { DriveStatusError }

If I try with sda, the first time it fails:

# smartctl -o off -data /dev/sda
smartctl version 5.34 [i686-pc-linux-gnu] Copyright (C) 2002-5 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF ENABLE/DISABLE COMMANDS SECTION ===
Error SMART Disable Automatic Offline failed: Input/output error
Smartctl: SMART Disable Automatic Offline Failed.

and I get:

ata1: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata1: status=0x51 { DriveReady SeekComplete Error }
ata1: error=0x04 { DriveStatusError }
ata1: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata1: status=0x51 { DriveReady SeekComplete Error }
ata1: error=0x04 { DriveStatusError }
ata1: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata1: status=0x51 { DriveReady SeekComplete Error }
ata1: error=0x04 { DriveStatusError }

Thereafter it works:

# smartctl -s on -data /dev/sda
smartctl version 5.34 [i686-pc-linux-gnu] Copyright (C) 2002-5 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Enabled.
# smartctl -s off -data /dev/sda
smartctl version 5.34 [i686-pc-linux-gnu] Copyright (C) 2002-5 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Disabled. Use option -s with argument 'on' to enable it.

(no dmesg output this time)

If I try this on sdc, it succeeds *and* I get error messages:

# smartctl -S off -data /dev/sdc
smartctl version 5.34 [i686-pc-linux-gnu] Copyright (C) 2002-5 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Disabled. Use option -s with argument 'on' to enable it.

I still get this in dmesg:

ata3: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata3: status=0x51 { DriveReady SeekComplete Error }
ata3: error=0x04 { DriveStatusError }
ata3: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata3: status=0x51 { DriveReady SeekComplete Error }
ata3: error=0x04 { DriveStatusError }
ata3: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata3: status=0x51 { DriveReady SeekComplete Error }
ata3: error=0x04 { DriveStatusError }

Some more boot time dmesg info:

Linux version 2.6.15 (root@haze) (gcc version 4.0.3 20051201 (prerelease) (Debian 4.0.2-5)) #4 PREEMPT Tue Jan 24 08:30:31 UTC 2006
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009d800 (usable)
 BIOS-e820: 000000000009d800 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000003fffb000 (usable)
 BIOS-e820: 000000003fffb000 - 000000003ffff000 (ACPI data)
 BIOS-e820: 000000003ffff000 - 0000000040000000 (ACPI NVS)
 BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
 BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
 BIOS-e820: 00000000ffff0000 - 0000000100000000 (reserved)
127MB HIGHMEM available.
896MB LOWMEM available.
On node 0 totalpages: 262139
DMA zone: 4096 pages, LIFO batch:0
DMA32 zone: 0 pages, LIFO batch:0
Normal zone: 225280 pages, LIFO batch:31
HighMem zone: 32763 pages, LIFO batch:7
DMI 2.3 present.
ACPI: RSDP (v000 ASUS ) @ 0x000f5e30
ACPI: RSDT (v001 ASUS A7V600-X 0x42302e31 MSFT 0x31313031) @ 0x3fffb000
ACPI: FADT (v001 ASUS A7V600-X 0x42302e31 MSFT 0x31313031) @ 0x3fffb0b2
ACPI: BOOT (v001 ASUS A7V600-X 0x42302e31 MSFT 0x31313031) @ 0x3fffb030
ACPI: MADT (v001 ASUS A7V600-X 0x42302e31 MSFT 0x31313031) @ 0x3fffb058
ACPI: DSDT (v001 ASUS A7V600-X 0x00001000 MSFT 0x0100000b) @ 0x00000000
ACPI: Local APIC address 0xfee00000
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 6:10 APIC version 16
ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 2, version 3, address 0xfec00000, GSI 0-23
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl edge)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 low level)
ACPI: IRQ0 used by override.
ACPI: IRQ2 used by override.
ACPI: IRQ9 used by override.
Enabling APIC mode: Flat. Using 1 I/O APICs
Using ACPI (MADT) for SMP configuration information
Allocating PCI resources starting at 50000000 (gap: 40000000:bec00000)
Built 1 zonelists
Kernel command line: root=/dev/md0 ro
mapped APIC to ffffd000 (fee00000)
mapped IOAPIC to ffffc000 (fec00000)
Initializing CPU#0
PID hash table entries: 4096 (order: 12, 65536 bytes)
Detected 2125.801 MHz processor.
Using tsc for high-res timesource
Console: colour VGA+ 80x25
Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
Memory: 1035636k/1048556k available (2326k kernel code, 12328k reserved, 576k data, 176k init, 131052k highmem)
Checking if this processor honours the WP bit even in supervisor mode... Ok.
Calibrating delay using timer specific routine.. 4258.01 BogoMIPS (lpj=8516036)
Mount-cache hash table entries: 512
CPU: After generic identify, caps: 0383fbff c1c3fbff 00000000 00000000 00000000 00000000 00000000
CPU: After vendor identify, caps: 0383fbff c1c3fbff 00000000 00000000 00000000 00000000 00000000
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 512K (64 bytes/line)
CPU: After all inits, caps: 0383fbff c1c3fbff 00000000 00000020 00000000 00000000 00000000
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
mtrr: v2.0 (20020519)
CPU: AMD Athlon(TM) XP 3000+ stepping 00
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Checking 'hlt' instruction... OK.
ENABLING IO-APIC IRQs
..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
NET: Registered protocol family 16
ACPI: bus type pci registered
PCI: PCI BIOS revision 2.10 entry at 0xf1970, last bus=1
PCI: Using configuration type 1
ACPI: Subsystem revision 20050902
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 6 7 9 10 *11 12)
ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 6 7 9 10 11 12) *0, disabled.
ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 5 6 7 9 10 11 12) *0, disabled.
ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 6 7 9 10 11 12) *0, disabled.
ACPI: PCI Interrupt Link [LNKE] (IRQs *3 4 5 6 7 9 10 11 12)
ACPI: PCI Interrupt Link [LNKF] (IRQs *3 4 5 6 7 9 10 11 12)
ACPI: PCI Interrupt Link [LNKG] (IRQs 3 4 5 6 *7 9 10 11 12)
ACPI: PCI Interrupt Link [LNKH] (IRQs 3 4 5 6 7 9 10 11 12) *15, disabled.
ACPI: PCI Root Bridge [PCI0] (0000:00)
PCI: Probing PCI hardware (bus 00)
ACPI: Assume root bridge [\_SB_.PCI0] bus is 0
Boot video device is 0000:01:00.0
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PCI1._PRT]
SCSI subsystem initialized
PCI: Using ACPI for IRQ routing
PCI: If a device doesn't work, try "pci=routeirq". If it helps, post a report
PCI: Bridge: 0000:00:01.0
IO window: d000-dfff
MEM window: be800000-bfefffff
PREFETCH window: c0000000-f7ffffff
PCI: Setting latency timer of device 0000:00:01.0 to 64
Simple Boot Flag at 0x3a set to 0x80
Machine check exception polling timer started.
highmem bounce pool size: 64 pages
SGI XFS with no debug enabled
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
io scheduler cfq registered
PCI: Bypassing VIA 8237 APIC De-Assert Message
serio: i8042 AUX port at 0x60,0x64 irq 12
serio: i8042 KBD port at 0x60,0x64 irq 1
RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
VP_IDE: IDE controller at PCI slot 0000:00:0f.1
ACPI: PCI Interrupt 0000:00:0f.1[A] -> GSI 20 (level, low) -> IRQ 16
PCI: Via IRQ fixup for 0000:00:0f.1, from 14 to 0
VP_IDE: chipset revision 6
VP_IDE: not 100% native mode: will probe irqs later
VP_IDE: VIA vt8237 (rev 00) IDE UDMA133 controller on pci0000:00:0f.1
ide0: BM-DMA at 0x7800-0x7807, BIOS settings: hda:DMA, hdb:DMA
ide1: BM-DMA at 0x7808-0x780f, BIOS settings: hdc:pio, hdd:pio
Probing IDE interface ide0...
hda: PLEXTOR DVDR PX-708A, ATAPI CD/DVD-ROM drive
hdb: TSSTcorpCD/DVDW SH-W162C, ATAPI CD/DVD-ROM drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
Probing IDE interface ide1...
hda: ATAPI 40X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache, UDMA(33)
Uniform CD-ROM driver Revision: 3.20
hdb: ATAPI 48X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache, UDMA(33)
libata version 1.20 loaded.
sata_sil 0000:00:0a.0: version 0.9
ACPI: PCI Interrupt 0000:00:0a.0[A] -> GSI 16 (level, low) -> IRQ 17
ata1: SATA max UDMA/100 cmd 0xF8804080 ctl 0xF880408A bmdma 0xF8804000 irq 17
ata2: SATA max UDMA/100 cmd 0xF88040C0 ctl 0xF88040CA bmdma 0xF8804008 irq 17
ata1: dev 0 cfg 49:2f00 82:7869 83:7d09 84:4043 85:7869 86:3c01 87:4043 88:203f
ata1: dev 0 ATA-7, max UDMA/100, 390721968 sectors: LBA48
ata1: dev 0 configured for UDMA/100
scsi0 : sata_sil
ata2: dev 0 cfg 49:2f00 82:7c6b 83:7f09 84:4063 85:7c69 86:3e01 87:4063 88:007f
ata2: dev 0 ATA-7, max UDMA/133, 398297088 sectors: LBA48
ata2: dev 0 configured for UDMA/100
scsi1 : sata_sil
Vendor: ATA Model: Maxtor 6B200M0 Rev: BANC
Type: Direct-Access ANSI SCSI revision: 05
Vendor: ATA Model: Maxtor 6B200M0 Rev: BANC
Type: Direct-Access ANSI SCSI revision: 05
sata_via 0000:00:0f.0: version 1.1
ACPI: PCI Interrupt 0000:00:0f.0[B] -> GSI 20 (level, low) -> IRQ 16
sata_via 0000:00:0f.0: routed to hard irq line 0
ata3: SATA max UDMA/133 cmd 0x9800 ctl 0x9402 bmdma 0x8400 irq 16
ata4: SATA max UDMA/133 cmd 0x9000 ctl 0x8802 bmdma 0x8408 irq 16
ata3: dev 0 cfg 49:2f00 82:346b 83:7d01 84:4003 85:3469 86:3c01 87:4003 88:407f
ata3: dev 0 ATA-6, max UDMA/133, 312581808 sectors: LBA48
ata3: dev 0 configured for UDMA/133
scsi2 : sata_via
ata4: dev 0 cfg 49:2f00 82:7c6b 83:7f09 84:4063 85:7c69 86:3e01 87:4063 88:407f
ata4: dev 0 ATA-7, max UDMA/133, 398297088 sectors: LBA48
ata4: dev 0 configured for UDMA/133
scsi3 : sata_via
Vendor: ATA Model: ST3160023AS Rev: 3.18
Type: Direct-Access ANSI SCSI revision: 05
Vendor: ATA Model: Maxtor 6B200M0 Rev: BANC
Type: Direct-Access ANSI SCSI revision: 05
SCSI device sda: 390721968 512-byte hdwr sectors (200050 MB)
SCSI device sda: drive cache: write back
SCSI device sda: 390721968 512-byte hdwr sectors (200050 MB)
SCSI device sda: drive cache: write back
sda: sda1
sd 0:0:0:0: Attached scsi disk sda
SCSI device sdb: 398297088 512-byte hdwr sectors (203928 MB)
SCSI device sdb: drive cache: write back
SCSI device sdb: 398297088 512-byte hdwr sectors (203928 MB)
SCSI device sdb: drive cache: write back
sdb: sdb1 sdb2
sd 1:0:0:0: Attached scsi disk sdb
SCSI device sdc: 312581808 512-byte hdwr sectors (160042 MB)
SCSI device sdc: drive cache: write back
SCSI device sdc: 312581808 512-byte hdwr sectors (160042 MB)
SCSI device sdc: drive cache: write back
sdc: sdc1 sdc2 sdc3 sdc4
sd 2:0:0:0: Attached scsi disk sdc
SCSI device sdd: 398297088 512-byte hdwr sectors (203928 MB)
SCSI device sdd: drive cache: write back
SCSI device sdd: 398297088 512-byte hdwr sectors (203928 MB)
SCSI device sdd: drive cache: write back
sdd: sdd1 sdd2
sd 3:0:0:0: Attached scsi disk sdd
sd 0:0:0:0: Attached scsi generic sg0 type 0
sd 1:0:0:0: Attached scsi generic sg1 type 0
sd 2:0:0:0: Attached scsi generic sg2 type 0
sd 3:0:0:0: Attached scsi generic sg3 type 0

David

David Greaves wrote:
>Jeff Garzik wrote:
>
>>David Greaves wrote:
>>
>>> Possible libata/sata/Asus problem (was Re: Need to upgrade to latest
>>>stable mdadm version?)
>>>
>>Highly likely to be a motherboard/BIOS issue related to properly
>>tuning and timing the hardware.
>>
>>HOWEVER, libata can help (via Tejun's recent patches) by properly
>>handling the error when thrown to us by hardware.
>>
>OK - I thought my messages:
>
>Jan 20 06:25:04 haze kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
>Jan 20 06:25:04 haze kernel: ata2: error=0x04 { DriveStatusError }
>Jan 20 06:25:10 haze kernel: ata2: no sense translation for status: 0x51
>Jan 20 06:25:10 haze kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
>Jan 20 06:25:18 haze kernel: ata2: no sense translation for status: 0x51
>Jan 20 06:25:18 haze kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
>Jan 20 06:25:18 haze kernel: ata2: no sense translation for status: 0x51
>Jan 20 06:25:18 haze kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
>Jan 20 06:25:20 haze kernel: ata2: no sense translation for status: 0x51
>Jan 20 06:25:20 haze kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
>Jan 20 06:25:22 haze kernel: ata2: no sense translation for status: 0x51
>Jan 20 06:25:22 haze kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
>Jan 20 06:25:52 haze kernel: ata2: no sense translation for status: 0x51
>Jan 20 06:25:52 haze kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
>Jan 20 06:25:52 haze kernel: sd 1:0:0:0: SCSI error: return code = 0x8000002
>Jan 20 06:25:52 haze kernel: sdb: Current: sense key: Medium Error
>Jan 20 06:25:52 haze kernel: Additional sense: Unrecovered read error - auto reallocate failed
>Jan 20 06:25:52 haze kernel: end_request: I/O error, dev sdb, sector 390787713
>
>bore a certain similarity to those in Tejun/Nicolas' mail:
>
>Different problem? as irq might ask: "does anybody care?" :)
>
>(and yes, badblocks and SMART report all is well)
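The status and error values in these logs are bit fields: 0x51 is 0x40 | 0x10 | 0x01, which libata prints as DriveReady, SeekComplete and Error. A minimal POSIX sh sketch of that decoding follows. The bit positions are the ATA status register layout; only the names DriveReady, SeekComplete, Error and DriveStatusError are taken from the logs, and the other names are assumptions rather than libata's exact strings. The helpers decode_ata_status and aborted_bit_set are hypothetical.

```shell
#!/bin/sh
# decode_ata_status BYTE: print the set bits of an ATA status register value
# as "{ Name ... }". Hypothetical helper; names for bits that never appear
# in this thread (Busy, DeviceFault, ...) are assumed, not copied from libata.
decode_ata_status() {
    status=$(( $1 ))
    out="{"
    for pair in 0x80:Busy 0x40:DriveReady 0x20:DeviceFault 0x10:SeekComplete \
                0x08:DataRequest 0x04:CorrectedError 0x02:Index 0x01:Error; do
        mask=${pair%%:*}    # text before the colon, e.g. 0x40
        name=${pair#*:}     # text after the colon, e.g. DriveReady
        if [ $(( $status & $mask )) -ne 0 ]; then
            out="$out $name"
        fi
    done
    printf '%s }\n' "$out"
}

# aborted_bit_set BYTE: succeed if bit 2 (0x04, the ATA ABRT bit, which the
# logs above print as DriveStatusError) is set in an error register value.
aborted_bit_set() {
    [ $(( $1 & 0x04 )) -ne 0 ]
}

decode_ata_status 0x51    # prints: { DriveReady SeekComplete Error }
aborted_bit_set 0x04 && echo "error=0x04: command aborted"
aborted_bit_set 0x0c && echo "error=0x0c: command aborted"
```

Read this way, every "error=0x04" line above (and "error=0x0c" later in the thread) has the command-aborted bit set, which is consistent with the "Aborted Command" sense key (0xb) in the translated lines.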
* Re: SMART on SATA reporting errors? (was Re: regarding bug #5914 - fs corruption on SATA)
From: Jeff Garzik @ 2006-02-07 19:30 UTC
To: David Greaves
Cc: Linux-ide, Tejun Heo, Nicolas.Mailhot, Jens Axboe, Christopher Smith, Erik Slagter, hahn, mlaks, Soeren Sonnenburg, mlaks, smartmontools-support

David Greaves wrote:
> This is a follow-on to the email below.
>
> Basically, it seems some SMART commands produce unexpected errors.
>
> My Debian smartd config has "-o on" and "-S on" for every drive so it
> puts out lots of errors every time I boot.
>
> I did a little investigation and I see that when I do:
> # smartctl -o on -data /dev/sdb
> smartctl version 5.34 [i686-pc-linux-gnu] Copyright (C) 2002-5 Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
>
> === START OF ENABLE/DISABLE COMMANDS SECTION ===
> Error SMART Disable Automatic Offline failed: Input/output error
> Smartctl: SMART Disable Automatic Offline Failed.
>
> (Which is fine if the drive doesn't support it.)
>
> I unexpectedly get this in dmesg:
>
> ata2: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
> ata2: status=0x51 { DriveReady SeekComplete Error }
> ata2: error=0x04 { DriveStatusError }
> ata2: no sense translation for status: 0x51
> ata2: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
> ata2: status=0x51 { DriveReady SeekComplete Error }
> ata2: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
> ata2: status=0x51 { DriveReady SeekComplete Error }
> ata2: error=0x04 { DriveStatusError }
> ata2: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
> ata2: status=0x51 { DriveReady SeekComplete Error }
> ata2: error=0x04 { DriveStatusError }
> ata2: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
> ata2: status=0x51 { DriveReady SeekComplete Error }
> ata2: error=0x04 { DriveStatusError }

All of your commands are missing "-d ata"

Jeff
* Re: SMART on SATA reporting errors? (was Re: regarding bug #5914 - fs corruption on SATA)
From: David Greaves @ 2006-02-08 7:21 UTC
To: Jeff Garzik
Cc: Linux-ide, Tejun Heo, Nicolas.Mailhot, Jens Axboe, Christopher Smith, Erik Slagter, hahn, mlaks, Soeren Sonnenburg, mlaks, smartmontools-support

Jeff Garzik wrote:
> David Greaves wrote:
>
>> I did a little investigation and I see that when I do:
>> # smartctl -o on -data /dev/sdb
> <snip>
> All of your commands are missing "-d ata"

Well, technically, yes: I used "-data" in all of them. Is the space or the option order important?

David Greaves wrote:
# smartctl -o on -data /dev/sdb
# smartctl -o off -data /dev/sda
# smartctl -s on -data /dev/sda
# smartctl -s off -data /dev/sda
# smartctl -S off -data /dev/sdc

David
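Under standard getopt conventions, an option that takes an argument may have that argument attached, so "-data" parses as "-d" followed by the argument "ata". A minimal sh sketch using the POSIX getopts builtin shows the two spellings producing the same result. The option string "d:o:s:S:" is an assumed subset, not smartctl's real option table, and whether smartctl follows exactly these rules is an assumption here; show_opts is a hypothetical helper.

```shell
#!/bin/sh
# show_opts ARGS...: parse ARGS with an assumed smartctl-like option string
# (-d, -o, -s and -S each take an argument) and print "letter=argument"
# pairs. Hypothetical helper, not part of smartmontools.
show_opts() {
    OPTIND=1            # reset getopts state before each parse
    parsed=""
    while getopts "d:o:s:S:" opt "$@"; do
        # ${parsed:+ } adds a separating space only after the first pair.
        parsed="$parsed${parsed:+ }$opt=$OPTARG"
    done
    printf '%s\n' "$parsed"
}

show_opts -o on -data /dev/sdb     # prints: o=on d=ata
show_opts -o on -d ata /dev/sdb    # prints: o=on d=ata
show_opts -S off -data /dev/sdc    # prints: S=off d=ata
```

With getopt-style parsing, neither the space nor the position of -d among the other options changes the result, as long as all options come before the device name.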
* Re: regarding bug #5914 - fs corruption on SATA
From: Soeren Sonnenburg @ 2006-01-26 17:20 UTC
To: Jeff Garzik
Cc: David Greaves, Tejun Heo, Nicolas.Mailhot, Jens Axboe, Linux-ide, Christopher Smith, Erik Slagter, hahn, mlaks, mlaks

On Thu, 2006-01-26 at 11:58 -0500, Jeff Garzik wrote:
> David Greaves wrote:
> > Have a look for subjects such as
> > Problems with multiple Promise SATA150 TX4 cards
>
> This is almost certainly either a power or PCI bus/slot issue.

So you mean the freeze I am observing when I copy files from the SATA disk to some ieee1394 device (they are sharing interrupt 16)

16: 430796 IO-APIC-level ide2, ide3, libata, ohci1394

leading to lots of output such as...

ata2: translated ATA stat/err 0x51/0c to SCSI SK/ASC/ASCQ 0xb/00/00
ata2: status=0x51 { DriveReady SeekComplete Error }
ata2: error=0x0c { DriveStatusError }

could be caused by just that?

> > Possible libata/sata/Asus problem (was Re: Need to upgrade to latest
> > stable mdadm version?)
>
> Highly likely to be a motherboard/BIOS issue related to properly tuning
> and timing the hardware.
>
> HOWEVER, libata can help (via Tejun's recent patches) by properly
> handling the error when thrown to us by hardware.

So it could help in the first case, but also with this:

I can freeze the system by

hdparm -y /dev/sda

... will this patch also help in that case?

Soeren.
--
Sometimes, there's a moment as you're waking, when you become aware of
the real world around you, but you're still dreaming.
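To see which drivers share an interrupt line, as in the "16:" line above, the handler list can be pulled out of /proc/interrupts. A small sh sketch, assuming the 2.6.15-era layout quoted in the mail (IRQ number, one counter column per CPU, a single-word controller type, then a comma-separated handler list); newer kernels print the type field differently, so the field parsing is an assumption. shared_irq_handlers is a hypothetical helper that reads from stdin.

```shell
#!/bin/sh
# shared_irq_handlers IRQ: read /proc/interrupts-style lines on stdin and
# print the handler names registered on IRQ, space-separated. Hypothetical
# helper; assumes the single-word controller type seen in the mail above.
shared_irq_handlers() {
    awk -v irq="$1:" '
        $1 == irq {
            i = 2
            while (i <= NF && $i ~ /^[0-9]+$/) i++   # skip per-CPU counters
            i++                                      # skip controller type
            list = ""
            for (; i <= NF; i++) {
                sep = (list == "") ? "" : " "
                list = list sep $i
            }
            gsub(/,/, "", list)                      # drop separating commas
            print list
        }'
}

# On a live system: shared_irq_handlers 16 < /proc/interrupts
printf '16: 430796 IO-APIC-level ide2, ide3, libata, ohci1394\n' |
    shared_irq_handlers 16    # prints: ide2 ide3 libata ohci1394
```

More than one name in the output means the drivers share that line, so a misbehaving device on it can affect the others.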
Thread overview: 33+ messages
2006-01-26 5:50 regarding bug #5914 - fs corruption on SATA Tejun Heo
2006-01-26 5:51 ` Tejun Heo
2006-01-26 9:14 ` Nicolas Mailhot
2006-01-26 9:21 ` Jens Axboe
2006-01-26 10:01 ` Nicolas Mailhot
[not found] ` <5840.192.54.193.25.1138269692.squirrel@rousalka.dyndns.org>
2006-01-26 21:04 ` Nicolas Mailhot
2006-01-27 8:13 ` Jens Axboe
2006-01-27 8:53 ` Nicolas Mailhot
2006-01-27 9:10 ` Jens Axboe
2006-01-27 9:20 ` Jens Axboe
2006-01-27 9:27 ` Nicolas Mailhot
2006-01-27 9:46 ` Bartlomiej Zolnierkiewicz
2006-01-27 9:50 ` Jens Axboe
2006-01-27 19:37 ` Nicolas Mailhot
2006-01-27 23:54 ` Nicolas Mailhot
2006-01-30 15:08 ` Jens Axboe
2006-01-30 23:33 ` Nicolas Mailhot
2006-01-31 7:26 ` Jens Axboe
2006-01-31 8:39 ` Nicolas Mailhot
2006-01-31 8:47 ` Jens Axboe
2006-01-31 22:54 ` Nicolas Mailhot
2006-01-27 12:12 ` Ric Wheeler
2006-01-27 12:23 ` Jens Axboe
2006-01-26 9:18 ` Jens Axboe
2006-01-26 14:11 ` Bartlomiej Zolnierkiewicz
2006-01-26 14:27 ` Jens Axboe
2006-01-26 16:41 ` David Greaves
2006-01-26 16:58 ` Jeff Garzik
2006-01-26 17:15 ` David Greaves
2006-02-07 18:35 ` SMART on SATA reporting errors? (was Re: regarding bug #5914 - fs corruption on SATA) David Greaves
2006-02-07 19:30 ` Jeff Garzik
2006-02-08 7:21 ` David Greaves
2006-01-26 17:20 ` regarding bug #5914 - fs corruption on SATA Soeren Sonnenburg