From: TomK <tk@mdevsys.com>
To: linux-raid@vger.kernel.org
Subject: Re: [ LR] Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds.
Date: Sun, 30 Oct 2016 15:16:13 -0400
Message-ID: <cf4dedf2-93d1-2cf9-80e4-85c4caaf167e@mdevsys.com>
In-Reply-To: <73e35e17-80aa-c7e6-535c-3665d9789e16@mdevsys.com>
On 10/30/2016 2:56 PM, TomK wrote:
> Hey Guys,
>
> We recently saw a situation where smartctl -A errored out; within a few
> days the disk cascaded into bad blocks and eventually became a
> completely unrecognizable SATA disk. It had apparently been limping
> along for 6 months, causing random timeouts and slowdowns when
> accessing the array, but the RAID array did not pull it out and did not
> mark it as bad. The RAID 6 we have has been running for 6 years; we
> have had a lot of disk replacements in it, yet it has always been very
> reliable. The disks started out as all 1TB Seagates, but the set is now
> two 2TB WDs, one 2TB Seagate, two 1TB Seagates, and one 1.5TB drive,
> with a mix of green, red, blue, etc. Yet it has been rock solid.
>
> We did not do a thorough R/W test to see how the error and the bad disk
> affected the data stored on the array, but we did notice pauses,
> slowdowns, and general difficulty reading data on the CIFS share
> presented from it; however, there were no data errors that we could
> see. Since then we have replaced the 2TB Seagate with a new 2TB WD and
> everything is fine, even though the array is degraded. But as soon as
> we put the bad disk back in, the array reverted to its previous
> behaviour. Yet the array didn't catch it as a failed disk until the
> disk was nearly completely inaccessible.
>
> So the question is: how come the mdadm RAID did not catch this as a
> failed disk and pull it out of the array? It seems this disk had been
> going bad for a while, but as long as the array reported all 6 members
> healthy, there was no cause for alarm. Also, how can the array fail to
> detect the disk failure while applications using the array are clearly
> having problems? Removing the disk and leaving the array in a degraded
> state also solved the accessibility issues on the array. So it appears
> the disk was generating some sort of errors (possibly a bad PCB) that
> were not caught before.
>
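For what it's worth, next time I will probably just force the suspect
member out by hand rather than wait for md to notice. A rough sketch,
using the /dev/md0 and /dev/sdf names from above (mail address is just
an example):

  # Mark the suspect member as faulty, then pull it from the array
  mdadm /dev/md0 --fail /dev/sdf
  mdadm /dev/md0 --remove /dev/sdf

  # Keep mdadm's monitor running so Fail/Degraded events get mailed out
  mdadm --monitor --scan --daemonise --mail=root
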
> Looking at the changelogs, has a similar case been addressed?
>
> On a separate topic: if I eventually upgrade the array to six 2TB
> disks, will the array be smart enough to let me grow it to the new
> size? I have not tried that yet and wanted to ask first.
>
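(Partly answering my own question: my understanding is that once every
member has been replaced with a 2TB disk and resynced, the array can be
grown to the new capacity. Roughly, and untested on my side:)

  # Let md use the full size of the (now larger) smallest member
  mdadm --grow /dev/md0 --size=max

  # Then grow whatever sits on top of md0, e.g. if it is an LVM PV:
  pvresize /dev/md0
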
> Cheers,
> Tom
>
>
> [root@mbpc-pc modprobe.d]# rpm -qf /sbin/mdadm
> mdadm-3.3.2-5.el6.x86_64
> [root@mbpc-pc modprobe.d]#
>
>
> (The 100% util lasts roughly 30 seconds)
> 10/23/2016 10:18:20 PM
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            0.00    0.00    0.25   25.19    0.00   74.56
>
> Device:  rrqm/s  wrqm/s   r/s   w/s  rkB/s   wkB/s avgrq-sz avgqu-sz  await  svctm  %util
> sdb        0.00    0.00  0.00  1.00   0.00    2.50     5.00     0.03  27.00  27.00   2.70
> sdc        0.00    0.00  0.00  1.00   0.00    2.50     5.00     0.01  15.00  15.00   1.50
> sdd        0.00    0.00  0.00  1.00   0.00    2.50     5.00     0.02  18.00  18.00   1.80
> sde        0.00    0.00  0.00  1.00   0.00    2.50     5.00     0.02  23.00  23.00   2.30
> sdf        0.00    0.00  0.00  0.00   0.00    0.00     0.00     1.15   0.00   0.00 100.00
> sdg        0.00    2.00  1.00  4.00   4.00  172.00    70.40     0.04   8.40   2.80   1.40
> sda        0.00    0.00  0.00  1.00   0.00    2.50     5.00     0.04  37.00  37.00   3.70
> sdh        0.00    0.00  0.00  0.00   0.00    0.00     0.00     0.00   0.00   0.00   0.00
> sdj        0.00    0.00  0.00  0.00   0.00    0.00     0.00     0.00   0.00   0.00   0.00
> sdk        0.00    0.00  0.00  0.00   0.00    0.00     0.00     0.00   0.00   0.00   0.00
> sdi        0.00    0.00  0.00  0.00   0.00    0.00     0.00     0.00   0.00   0.00   0.00
> fd0        0.00    0.00  0.00  0.00   0.00    0.00     0.00     0.00   0.00   0.00   0.00
> dm-0       0.00    0.00  1.00  6.00   4.00  172.00    50.29     0.05   7.29   2.00   1.40
> dm-1       0.00    0.00  0.00  0.00   0.00    0.00     0.00     0.00   0.00   0.00   0.00
> dm-2       0.00    0.00  0.00  0.00   0.00    0.00     0.00     0.00   0.00   0.00   0.00
> md0        0.00    0.00  0.00  0.00   0.00    0.00     0.00     0.00   0.00   0.00   0.00
> dm-3       0.00    0.00  0.00  0.00   0.00    0.00     0.00     0.00   0.00   0.00   0.00
> dm-4       0.00    0.00  0.00  0.00   0.00    0.00     0.00     0.00   0.00   0.00   0.00
> dm-5       0.00    0.00  0.00  0.00   0.00    0.00     0.00     0.00   0.00   0.00   0.00
> dm-6       0.00    0.00  0.00  0.00   0.00    0.00     0.00     1.00   0.00   0.00 100.00
>
> 10/23/2016 10:18:21 PM
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            0.00    0.00    0.25   24.81    0.00   74.94
>
> Device:  rrqm/s  wrqm/s   r/s   w/s  rkB/s   wkB/s avgrq-sz avgqu-sz  await  svctm  %util
> sdb        0.00    0.00  0.00  0.00   0.00    0.00     0.00     0.00   0.00   0.00   0.00
> sdc        0.00    0.00  0.00  0.00   0.00    0.00     0.00     0.00   0.00   0.00   0.00
> sdd        0.00    0.00  0.00  0.00   0.00    0.00     0.00     0.00   0.00   0.00   0.00
> sde        0.00    0.00  0.00  0.00   0.00    0.00     0.00     0.00   0.00   0.00   0.00
> sdf        0.00    0.00  0.00  0.00   0.00    0.00     0.00     2.00   0.00   0.00 100.00
> sdg        0.00    0.00  0.00  0.00   0.00    0.00     0.00     0.00   0.00   0.00   0.00
> sda        0.00    0.00  0.00  0.00   0.00    0.00     0.00     0.00   0.00   0.00   0.00
> sdh        0.00    0.00  0.00  0.00   0.00    0.00     0.00     0.00   0.00   0.00   0.00
> sdj        0.00    0.00  0.00  0.00   0.00    0.00     0.00     0.00   0.00   0.00   0.00
> sdk        0.00    0.00  0.00  0.00   0.00    0.00     0.00     0.00   0.00   0.00   0.00
> sdi        0.00    0.00  0.00  0.00   0.00    0.00     0.00     0.00   0.00   0.00   0.00
> fd0        0.00    0.00  0.00  0.00   0.00    0.00     0.00     0.00   0.00   0.00   0.00
> dm-0       0.00    0.00  0.00  0.00   0.00    0.00     0.00     0.00   0.00   0.00   0.00
> dm-1       0.00    0.00  0.00  0.00   0.00    0.00     0.00     0.00   0.00   0.00   0.00
> dm-2       0.00    0.00  0.00  0.00   0.00    0.00     0.00     0.00   0.00   0.00   0.00
> md0        0.00    0.00  0.00  0.00   0.00    0.00     0.00     0.00   0.00   0.00   0.00
> dm-3       0.00    0.00  0.00  0.00   0.00    0.00     0.00     0.00   0.00   0.00   0.00
> dm-4       0.00    0.00  0.00  0.00   0.00    0.00     0.00     0.00   0.00   0.00   0.00
> dm-5       0.00    0.00  0.00  0.00   0.00    0.00     0.00     0.00   0.00   0.00   0.00
> dm-6       0.00    0.00  0.00  0.00   0.00    0.00     0.00     1.00   0.00   0.00 100.00
>
>
> We can see that /dev/sdf ramps up to 100% utilization starting at
> around 10/23/2016 10:18:18 PM and stays that way until about the
> 10/23/2016 10:18:42 PM mark, when something happens and it drops back
> below 100%.
>
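(For reference, snapshots like the above can be reproduced with
sysstat's extended, timestamped per-device statistics; something along
the lines of the following, though the exact flags I used may differ:)

  # 1-second extended per-device stats, in kB, with timestamps
  iostat -xtk 1
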
> So I checked the array, which shows all clean even across reboots:
>
> [root@mbpc-pc ~]# cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid6 sdb[7] sdf[6] sdd[3] sda[5] sdc[1] sde[8]
>       3907045632 blocks super 1.2 level 6, 64k chunk, algorithm 2 [6/6] [UUUUUU]
>       bitmap: 1/8 pages [4KB], 65536KB chunk
>
> unused devices: <none>
> [root@mbpc-pc ~]#
>
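Even with mdstat showing all members up, I can at least ask md to
re-read everything and count mismatches; a quick sketch using the md0
name from above:

  # Kick off a full read/compare pass over the array
  echo check > /sys/block/md0/md/sync_action

  # Watch progress, then see how many mismatches were found
  cat /proc/mdstat
  cat /sys/block/md0/md/mismatch_cnt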
>
> Then I ran smartctl across all the disks and, sure enough, /dev/sdf
> prints this:
>
> [root@mbpc-pc ~]# smartctl -A /dev/sdf
> smartctl 5.43 2012-06-30 r3573 [x86_64-linux-4.8.4] (local build)
> Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
>
> Error SMART Values Read failed: scsi error badly formed scsi parameters
> Smartctl: SMART Read Values failed.
>
> === START OF READ SMART DATA SECTION ===
> [root@mbpc-pc ~]#
>
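When -A fails like that, the next things I would reach for are the
overall health verdict, a long self-test, and the kernel's view of the
link; roughly, with the same device name as above and assuming the disk
still answers at all:

  # Overall SMART verdict and full device report
  smartctl -H /dev/sdf
  smartctl -x /dev/sdf

  # Kick off a long self-test; results land in the self-test log
  smartctl -t long /dev/sdf

  # Any ATA/link errors the kernel logged for this disk
  dmesg | grep -i sdf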
>
>
A bit trigger happy. Here's a better version of the first sentence. :)

We recently saw a situation where smartctl -A errored out but mdadm
didn't pick it up. Eventually, within a few days, the disk cascaded into
bad blocks and then became a completely unrecognizable SATA disk.
--
Cheers,
Tom K.
-------------------------------------------------------------------------------------
Living on earth is expensive, but it includes a free trip around the sun.