From: TomK <tk@mdevsys.com>
To: linux-raid@vger.kernel.org
Subject: Re: [ LR] Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds.
Date: Sun, 30 Oct 2016 15:16:13 -0400
Message-ID: <cf4dedf2-93d1-2cf9-80e4-85c4caaf167e@mdevsys.com>
In-Reply-To: <73e35e17-80aa-c7e6-535c-3665d9789e16@mdevsys.com>

On 10/30/2016 2:56 PM, TomK wrote:
> Hey Guys,
>
> We recently saw a situation where smartctl -A errored out eventually in
> a short time of a few days the disk cascaded into bad blocks eventually
> becoming a completely unrecognizable SATA disk.  It apparently was
> limping along for 6 months, causing random timeouts and slowdowns when
> accessing the array, but the RAID array did not pull it out and did not
> mark it as bad.  The RAID 6 we have has been running for 6 years; we
> did have a lot of disk replacements in it, yet it was always very
> reliable.  The disks started as all 1TB Seagates but are now two 2TB
> WDs, one 2TB Seagate, two remaining 1TB Seagates, and one 1.5TB drive.
> It has a mix of green, red, blue, etc., yet is rock solid.
>
> We did not do a thorough R/W test to see how the error and bad disk
> affected the data stored on the array, but we did notice pauses,
> slowdowns, and general difficulty reading data on the CIFS share
> presented from it; however, there were no data errors that we could
> see.  Since then we replaced the 2TB Seagate with a new 2TB WD and
> everything is fine even though the array is degraded.  But as soon as
> we put this bad disk back in, it reverted to its previous behaviour.
> Yet the array didn't catch it as a failed disk until the disk was
> nearly completely inaccessible.
>
> So the question is: how come the mdadm RAID did not catch this disk as
> a failed disk and pull it out of the array?  Seems this disk was going
> bad for a while now, but as long as the array reported all 6 healthy,
> there was no cause for alarm.  Also, how does the array not detect the
> disk failure while applications using the array clearly show problems?
> Removing the disk and leaving the array in a degraded state also solved
> the accessibility issue on the array.  So it appears the disk was
> generating some sort of errors (possibly a bad PCB) that were not
> caught before.
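> 
> For now I can at least kick a suspect member out by hand and then run a
> scrub to shake out any latent bad sectors on the rest.  Roughly what I
> have in mind (untested as written here; /dev/sdf is just our layout):
> 
> [root@mbpc-pc ~]# mdadm /dev/md0 --fail /dev/sdf --remove /dev/sdf
> [root@mbpc-pc ~]# echo check > /sys/block/md0/md/sync_action   # scrub; progress shows in /proc/mdstat
> [root@mbpc-pc ~]# cat /sys/block/md0/md/mismatch_cnt           # should stay at 0 on a healthy array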
>
> Looking at the changelogs, has a similar case been addressed?
>
> On a separate topic: if I eventually expand the array to six 2TB disks,
> will the array be smart enough to allow me to expand it to the new
> size?  We have not tried that yet and wanted to ask first.
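> 
> My reading of the mdadm man page is that once the last small member has
> been replaced and resynced, something like the following should grow the
> array and then whatever sits on top (untested here; the VG/LV names
> below are made up, and the pvresize/lvextend steps assume md0 is an LVM
> PV, which is how I read the dm-* devices in the iostat output):
> 
> [root@mbpc-pc ~]# mdadm --grow /dev/md0 --size=max         # use the full capacity of every member
> [root@mbpc-pc ~]# mdadm --detail /dev/md0 | grep "Array Size"
> [root@mbpc-pc ~]# pvresize /dev/md0                        # grow the PV, if md0 is one
> [root@mbpc-pc ~]# lvextend -l +100%FREE /dev/VolGroup00/LogVol00   # placeholder VG/LV names
> [root@mbpc-pc ~]# resize2fs /dev/VolGroup00/LogVol00               # for an ext3/ext4 filesystem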
>
> Cheers,
> Tom
>
>
> [root@mbpc-pc modprobe.d]# rpm -qf /sbin/mdadm
> mdadm-3.3.2-5.el6.x86_64
> [root@mbpc-pc modprobe.d]#
>
>
> (The 100% util lasts roughly 30 seconds)
> 10/23/2016 10:18:20 PM
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            0.00    0.00    0.25   25.19    0.00   74.56
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
> sdb               0.00     0.00    0.00    1.00     0.00     2.50     5.00     0.03   27.00  27.00   2.70
> sdc               0.00     0.00    0.00    1.00     0.00     2.50     5.00     0.01   15.00  15.00   1.50
> sdd               0.00     0.00    0.00    1.00     0.00     2.50     5.00     0.02   18.00  18.00   1.80
> sde               0.00     0.00    0.00    1.00     0.00     2.50     5.00     0.02   23.00  23.00   2.30
> sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     1.15    0.00   0.00 100.00
> sdg               0.00     2.00    1.00    4.00     4.00   172.00    70.40     0.04    8.40   2.80   1.40
> sda               0.00     0.00    0.00    1.00     0.00     2.50     5.00     0.04   37.00  37.00   3.70
> sdh               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sdj               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sdk               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sdi               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> fd0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> dm-0              0.00     0.00    1.00    6.00     4.00   172.00    50.29     0.05    7.29   2.00   1.40
> dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> dm-5              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> dm-6              0.00     0.00    0.00    0.00     0.00     0.00     0.00     1.00    0.00   0.00 100.00
>
> 10/23/2016 10:18:21 PM
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            0.00    0.00    0.25   24.81    0.00   74.94
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
> sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     2.00    0.00   0.00 100.00
> sdg               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sdh               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sdj               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sdk               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sdi               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> fd0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> dm-5              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> dm-6              0.00     0.00    0.00    0.00     0.00     0.00     0.00     1.00    0.00   0.00 100.00
>
>
> We can see that /dev/sdf ramps up to 100% starting at around 10/23/2016
> 10:18:18 PM and stays that way until about the 10/23/2016 10:18:42 PM
> mark, when something occurs and it drops back below 100%.
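> 
> (For reference, the stats above were gathered with sysstat's iostat in
> extended mode at one-second intervals; roughly this invocation, though
> the exact flags are from memory:)
> 
> [root@mbpc-pc ~]# iostat -xtk 1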
>
> So I checked the array, which shows all clean, even across reboots:
>
> [root@mbpc-pc ~]# cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid6 sdb[7] sdf[6] sdd[3] sda[5] sdc[1] sde[8]
>       3907045632 blocks super 1.2 level 6, 64k chunk, algorithm 2 [6/6] [UUUUUU]
>       bitmap: 1/8 pages [4KB], 65536KB chunk
>
> unused devices: <none>
> [root@mbpc-pc ~]#
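> 
> /proc/mdstat only shows the [UUUUUU] summary, so to dig a bit deeper I
> believe the per-member error counters in sysfs can be checked as well
> (paths as I understand them from the md documentation, not verified on
> this kernel):
> 
> [root@mbpc-pc ~]# mdadm --detail /dev/md0
> [root@mbpc-pc ~]# cat /sys/block/md0/md/dev-sdf/errors   # approximate count of read errors md has seen on this member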
>
>
> Then I ran smartctl across all disks and sure enough /dev/sdf prints this:
>
> [root@mbpc-pc ~]# smartctl -A /dev/sdf
> smartctl 5.43 2012-06-30 r3573 [x86_64-linux-4.8.4] (local build)
> Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
>
> Error SMART Values Read failed: scsi error badly formed scsi parameters
> Smartctl: SMART Read Values failed.
>
> === START OF READ SMART DATA SECTION ===
> [root@mbpc-pc ~]#
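> 
> To rule out the remaining members, a quick health pass over each of
> them; roughly (the device list just matches our six array members):
> 
> [root@mbpc-pc ~]# for d in /dev/sd{a,b,c,d,e,f}; do echo "== $d"; smartctl -H $d; done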
>

Bit trigger happy.  Here's a better version of the first sentence.  :)

We recently saw a situation where smartctl -A errored out but mdadm
didn't pick this up.  Eventually, within a few days, the disk cascaded
into bad blocks and then became a completely unrecognizable SATA disk.
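
In the meantime I'm thinking of having smartd nag us about this kind of
failure instead of waiting for md to notice.  Roughly the smartd.conf
entry I have in mind (straight from the smartd.conf man page as I read
it; the daily 02:00 short self-test schedule is just an example):

# /etc/smartd.conf
# Monitor all attributes on every disk, enable offline testing and
# attribute autosave, run a short self-test daily at 02:00, mail root on trouble.
DEVICESCAN -a -o on -S on -s (S/../.././02) -m root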

-- 
Cheers,
Tom K.
-------------------------------------------------------------------------------------

Living on earth is expensive, but it includes a free trip around the sun.

