Re: MD/RAID: what's wrong with sector 1953519935?


From: Andrei Tanas <andrei@tanas.ca>
To: Ric Wheeler <rwheeler@redhat.com>
Cc: NeilBrown <neilb@suse.de>,
	linux-kernel@vger.kernel.org, linux-ide@vger.kernel.org
Subject: Re: MD/RAID: what's wrong with sector 1953519935?
Date: Wed, 26 Aug 2009 14:12:42 -0400
Message-ID: <1571f45804875514762f60c0097171e6@localhost>
In-Reply-To: <4A95573A.6090404@redhat.com>

On Wed, 26 Aug 2009 11:39:38 -0400, Ric Wheeler <rwheeler@redhat.com> wrote:
> On 08/26/2009 10:46 AM, Andrei Tanas wrote:
>> On Wed, 26 Aug 2009 06:34:14 -0400, Ric Wheeler <rwheeler@redhat.com> wrote:
>>> On 08/25/2009 11:45 PM, Andrei Tanas wrote:
>>>>>>> I would suggest that Andrei might try to write and clear the IO error
>>>>>>> at that offset. You can use Mark Lord's hdparm to clear a specific
>>>>>>> sector or just do the math (carefully!) and dd over it. If the write
>>>>>>> succeeds (without bumping your remapped sectors count) this is a
>>>>>>> likely match to this problem,
>>>>>>>
>>>>>> I've tried dd multiple times, it always succeeds, and the relocated
>>>>>> sector count is currently 1 on this drive, even though this particular
>>>>>> fault happened at least 3 times so far.
>>>>>>
>>>>>>

>>>  you need to set the tunable:
>>>
>>> /sys/block/mdX/md/safe_mode_delay
>>>
>>> to something like "2" to prevent that sector from being a hotspot...
>>
>> I did that as soon as you suggested that it's possible to tune it. The
>> array is still being rebuilt (it's a fairly busy machine, so rebuilding
>> is slow). I'll monitor it, but I don't expect to see the results soon as
>> even with the default value of 0.2 it used to happen once in several
>> weeks.
>>
>> On the other note: is it possible that the drive was actually working
>> properly but was not given enough time to complete the write request?
>> These newer drives have 32MB cache but the same rotational speed and seek
>> times as the older ones so they must need more time to flush their cache?
>>
> 
> Timeouts on IO requests are pretty large, usually drives won't fail an IO
> unless there is a real problem but I will add the linux-ide list to this
> response so they can weigh in.
>
> I suspect that the error was real, but might be this "repairable" type of
> adjacent track issue I mentioned before. Interesting to note that just
> following the error, you see that it was indeed the super block that did
> not get updated...
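
For anyone on linux-ide who has not followed the earlier part of the thread:
the tunable quoted above is a plain sysfs file. A minimal sketch of reading
the old value and writing a new one, assuming root privileges and an array
named md0 (a placeholder, the thread only says mdX). The helper name and the
sysfs_root parameter are made up for illustration, the latter only so the
function can be tried against a scratch directory:

```python
import os

def set_safe_mode_delay(md_name, seconds, sysfs_root="/sys"):
    """Write a new safe_mode_delay (in seconds) and return the old value."""
    path = os.path.join(sysfs_root, "block", md_name, "md", "safe_mode_delay")
    # Read the current setting first so it can be restored later.
    with open(path) as f:
        old = f.read().strip()
    with open(path, "w") as f:
        f.write(str(seconds))
    return old

# Example (needs root and a real array; "md0" is a placeholder):
#   previous = set_safe_mode_delay("md0", 2)
```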

The relevant portions of the log file are below (two independent events,
there is nothing related to ata before the "exception" message):

[901292.247428] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[901292.247492] ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[901292.247494]          res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[901292.247500] ata2.00: status: { DRDY }
[901292.247512] ata2: hard resetting link
[901294.090746] ata2: SRST failed (errno=-19)
[901294.101922] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[901294.101938] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x40)
[901294.101943] ata2.00: revalidation failed (errno=-5)
[901299.100347] ata2: hard resetting link
[901299.974103] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[901300.105734] ata2.00: configured for UDMA/133
[901300.105776] ata2: EH complete
[901300.137059] end_request: I/O error, dev sdb, sector 1953519935
[901300.137069] md: super_written gets error=-5, uptodate=0
[901300.137077] raid1: Disk failure on sdb1, disabling device.
[901300.137079] raid1: Operation continuing on 1 devices.

[90307.328266] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[90307.328275] ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[90307.328277]          res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[90307.328280] ata2.00: status: { DRDY }
[90307.328288] ata2: hard resetting link
[90313.218511] ata2: link is slow to respond, please be patient (ready=0)
[90317.377711] ata2: SRST failed (errno=-16)
[90317.377720] ata2: hard resetting link
[90318.251720] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[90318.338026] ata2.00: configured for UDMA/133
[90318.338062] ata2: EH complete
[90318.370625] end_request: I/O error, dev sdb, sector 1953519935
[90318.370632] md: super_written gets error=-5, uptodate=0
[90318.370636] raid1: Disk failure on sdb1, disabling device.
[90318.370637] raid1: Operation continuing on 1 devices.
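
One detail in both traces may matter for the timeout question: the first byte
of the "cmd" field is the ATA command opcode, and 0xea is FLUSH CACHE EXT, so
the command that timed out looks like a cache flush rather than a write
addressed to one particular sector. A small sketch for decoding that field;
the opcode table is a hand-picked subset of the ATA command set, and the
helper name decode_cmd is made up for illustration:

```python
import re

# Small, non-exhaustive table of ATA command opcodes.
ATA_OPCODES = {
    0x25: "READ DMA EXT",
    0x35: "WRITE DMA EXT",
    0x60: "READ FPDMA QUEUED",
    0x61: "WRITE FPDMA QUEUED",
    0xE7: "FLUSH CACHE",
    0xEA: "FLUSH CACHE EXT",
    0xEC: "IDENTIFY DEVICE",
}

# The opcode is the first hex byte after "cmd " in a libata error line.
CMD_RE = re.compile(r"ata\d+\.\d+: cmd ([0-9a-f]{2})/")

def decode_cmd(line):
    """Return (opcode, name) for a libata 'cmd' log line, or None."""
    m = CMD_RE.search(line)
    if m is None:
        return None
    opcode = int(m.group(1), 16)
    return opcode, ATA_OPCODES.get(opcode, "unknown")

line = "[901292.247492] ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0"
opcode, name = decode_cmd(line)
print("0x%02x %s" % (opcode, name))
```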

And here's the story for linux-ide from the earlier messages:
> I'm using two ST31000528AS drives in a RAID1 array using MD. I've had
> several failures occur over a period of a few months (see logs below).
> I've RMA'd the drive, but then got curious why an otherwise normal drive
> locks up while trying to write the same sector once a month or so, but
> does not report having bad sectors, doesn't fail any tests, and does just
> fine if I do
> dd if=/dev/urandom of=/dev/sdb bs=512 seek=1953519935 count=1
> however many times I try.
> I then tried Googling for this number (1953519935) and found that it comes
> up quite a few times and most of the time (or always) in the context of
> md/raid.
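
A possible explanation for why this number turns up so often: it may simply
be where the md superblock lands on a 1 TB drive with a common partition
layout. The sketch below works through the arithmetic under assumptions that
are NOT confirmed anywhere in this thread: version 0.90 metadata (superblock
at the last 64 KiB-aligned 64 KiB block of the member device), a partition
starting at sector 63, and a partition end on a cylinder boundary of a
255-head, 63-sector geometry with 121601 cylinders. Under those assumptions
the result is exactly 1953519935.

```python
SECTOR_SIZE = 512
MD_RESERVED_SECTORS = 64 * 1024 // SECTOR_SIZE  # 64 KiB = 128 sectors

def md090_sb_offset(dev_sectors):
    """Superblock offset (in sectors) from the start of an md 0.90 member."""
    # Round the member size down to a 64 KiB boundary, then step back 64 KiB.
    return (dev_sectors & ~(MD_RESERVED_SECTORS - 1)) - MD_RESERVED_SECTORS

# ASSUMPTIONS (not taken from the logs): partition start and geometry.
PART_START = 63
PART_END = 121601 * 255 * 63      # first sector past the end of sdb1
part_sectors = PART_END - PART_START

sb_on_disk = PART_START + md090_sb_offset(part_sectors)
print(sb_on_disk)                 # absolute sector number on sdb
print(sb_on_disk * SECTOR_SIZE)   # byte offset hit by dd bs=512 seek=<sector>
```

If the real partition table differs (fdisk -lu /dev/sdb would show it), only
PART_START and PART_END need changing.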

Regards,
Andrei.


