* Occasional read error on idle USB attached disk
From: Sandy McArthur Jr @ 2015-03-05 22:04 UTC (permalink / raw)
To: linux-btrfs@vger.kernel.org
I have a btrfs filesystem that gets read errors that appear to only
happen after a disk has been idle a while. I don't know if the error
output below is BTRFS, USB, both or other related. I suspect it's
timing related. If I should take this error report somewhere else,
please point me in the right direction.
I have a large RAID1 btrfs filesystem (>13TB) that provides
archive/backup space. It is housed in multi-drive USB enclosures
populated with WD Red drives. I noticed errors in dmesg, so I'd run a
`btrfs scrub`, and for two days it'd report zero errors. Within hours
of the scrub completing I'd start seeing "csum failed ino" or other
errors again. Not wanting to run a btrfs scrub 24/7, as it impacts load
and available I/O, I thought of a crude workaround...
My workaround is to have cron run the unfortunate script below every
minute. It is a hack to create some minimal random read activity, and
it has eliminated the btrfs errors in dmesg since I installed it ~5
days ago.
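#!/bin/bash
# Read 1 KiB from a random offset near the start of every USB-attached
# block device, so that none of them sits idle for long.
for dev in /dev/disk/by-path/*-usb-*
do
    dd "if=$dev" skip=$RANDOM of=/dev/null bs=1k count=1 conv=noerror
    sleep 1
done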
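One caveat with the script as written: dd without direct I/O goes
through the page cache, and skip=$RANDOM with bs=1k only ever covers
the first 32 MiB of each device, so over time some of these reads may
be served from RAM without touching the bus. Below is a minimal sketch
of a variant that avoids both. It assumes GNU dd (for iflag=direct and
status=none) and util-linux blockdev; the function names are
hypothetical, and it has not been tried against this enclosure.

```shell
#!/bin/bash
# Pick a random block index in [0, size_bytes / block_size).
# $RANDOM is only 15 bits wide, so three draws are combined into 45 bits.
random_block() {
    local size_bytes=$1 block_size=$2
    local nblocks=$(( size_bytes / block_size ))
    (( nblocks > 0 )) || return 1
    echo $(( ((RANDOM << 30) | (RANDOM << 15) | RANDOM) % nblocks ))
}

# Read one 4 KiB block from a random offset, bypassing the page cache.
keepalive_read() {
    local dev=$1 bs=4096 size blk
    size=$(blockdev --getsize64 "$dev") || return 1
    blk=$(random_block "$size" "$bs") || return 1
    dd if="$dev" of=/dev/null bs="$bs" skip="$blk" count=1 \
       iflag=direct status=none
}
```

It would slot into the same loop in place of the dd line, i.e.
keepalive_read "$dev" followed by the sleep.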
Also: I have used `idle3ctl -d` on every WD drive to configure them
not to idle spin down.
I'd like to eliminate the need for the script above but I don't know
what to look into for more insight.
Below are examples of errors in dmesg output that I believe to be from
after an idle time.
[Thu Feb 26 21:14:53 2015] sd 6:0:0:6: [sdh]
[Thu Feb 26 21:14:53 2015] Result: hostbyte=0x00 driverbyte=0x08
[Thu Feb 26 21:14:53 2015] sd 6:0:0:6: [sdh]
[Thu Feb 26 21:14:53 2015] Sense Key : 0x5 [current]
[Thu Feb 26 21:14:53 2015] sd 6:0:0:6: [sdh]
[Thu Feb 26 21:14:53 2015] ASC=0x21 ASCQ=0x0
[Thu Feb 26 21:14:53 2015] sd 6:0:0:6: [sdh] CDB:
[Thu Feb 26 21:14:53 2015] cdb[0]=0x88: 88 00 00 00 00 01 62 a6 b2 10 00 00 00 08 00 00
[Thu Feb 26 21:14:53 2015] blk_update_request: critical target error, dev sdh, sector 5950059024
[Thu Feb 26 21:14:53 2015] BTRFS: bdev /dev/sdh1 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
[Thu Feb 26 21:14:53 2015] BTRFS: read error corrected: ino 1 off 44992282501120 (dev /dev/sdh1 sector 5950056976)
[Thu Feb 26 22:45:22 2015] __readpage_endio_check: 18 callbacks suppressed
[Thu Feb 26 22:45:22 2015] BTRFS info (device sdj1): csum failed ino 6367 off 950272 csum 1607841533 expected csum 1974928297
[Thu Feb 26 22:45:55 2015] BTRFS info (device sdj1): csum failed ino 6367 off 950272 csum 3541269260 expected csum 1974928297
[Thu Feb 26 22:45:55 2015] BTRFS: read error corrected: ino 6367 off 950272 (dev /dev/sdn1 sector 5100129584)
[Thu Feb 26 23:03:27 2015] __readpage_endio_check: 19 callbacks suppressed
[Thu Feb 26 23:03:27 2015] BTRFS info (device sdj1): csum failed ino 59095 off 262144 csum 343380379 expected csum 1424044590
[Thu Feb 26 23:03:27 2015] BTRFS info (device sdj1): csum failed ino 59095 off 315392 csum 1424678262 expected csum 2679854845
[Thu Feb 26 23:03:27 2015] BTRFS info (device sdj1): csum failed ino 59095 off 266240 csum 2640353180 expected csum 3029459156
[Thu Feb 26 23:03:27 2015] BTRFS info (device sdj1): csum failed ino 59095 off 319488 csum 2451256998 expected csum 2468347880
[Thu Feb 26 23:03:27 2015] BTRFS info (device sdj1): csum failed ino 59095 off 270336 csum 345966290 expected csum 2742069942
[Thu Feb 26 23:03:27 2015] BTRFS info (device sdj1): csum failed ino 59095 off 323584 csum 3197427733 expected csum 85045692
[Thu Feb 26 23:03:27 2015] BTRFS info (device sdj1): csum failed ino 59095 off 274432 csum 471582907 expected csum 2556165357
[Thu Feb 26 23:03:27 2015] BTRFS info (device sdj1): csum failed ino 59095 off 327680 csum 2709949441 expected csum 2084817183
[Thu Feb 26 23:03:27 2015] BTRFS info (device sdj1): csum failed ino 59095 off 278528 csum 1074665437 expected csum 3742172546
[Thu Feb 26 23:03:27 2015] BTRFS info (device sdj1): csum failed ino 59095 off 331776 csum 960121098 expected csum 1047166743
[Sat Feb 28 22:37:27 2015] BTRFS (device sdj1): bad tree block start 16702985684700295141 44148565934080
[Sat Feb 28 22:38:28 2015] BTRFS (device sdj1): bad tree block start 18405645867681351400 44148565934080
[Sat Feb 28 22:47:37 2015] BTRFS (device sdj1): bad tree block start 9667861177667953406 37828540731392
[Sat Feb 28 22:48:29 2015] BTRFS (device sdj1): bad tree block start 13145213771949975882 37828540731392
[Sat Feb 28 23:07:42 2015] BTRFS (device sdj1): bad tree block start 3229042711727727555 37828408807424
[Sat Feb 28 23:08:43 2015] BTRFS (device sdj1): bad tree block start 10868841450782383314 37828408807424
[Sat Feb 28 23:17:45 2015] BTRFS (device sdj1): bad tree block start 7898245970193992494 38304904036352
[Sat Feb 28 23:18:46 2015] BTRFS (device sdj1): bad tree block start 11326950486664265401 38304904036352
[Sat Feb 28 23:47:49 2015] BTRFS (device sdj1): bad tree block start 8268695429503799068 44148995371008
[Sat Feb 28 23:47:49 2015] BTRFS: read error corrected: ino 1 off 44148995371008 (dev /dev/sdl1 sector 1641660376)
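
For what it's worth, the first error can be decoded by hand. Opcode
0x88 is READ(16), whose bytes 2-9 hold the big-endian LBA and bytes
10-13 the transfer length in blocks. Sense key 0x5 with ASC/ASCQ
0x21/0x00 is ILLEGAL REQUEST, "logical block address out of range".
A small sketch (the function name is hypothetical) that does the
arithmetic on the CDB from the log:

```shell
#!/bin/bash
# Decode a READ(16) CDB given as 16 separate hex bytes.
# Prints "<lba> <blocks>" in decimal.
decode_read16() {
    local -a b=("$@")
    (( ${#b[@]} == 16 )) || { echo "need 16 bytes" >&2; return 1; }
    [[ ${b[0]} == 88 ]] || { echo "not READ(16)" >&2; return 1; }
    local lba_hex="" len_hex="" i
    for i in 2 3 4 5 6 7 8 9; do lba_hex+=${b[i]}; done   # bytes 2-9: LBA
    for i in 10 11 12 13; do len_hex+=${b[i]}; done       # bytes 10-13: length
    echo "$(( 16#$lba_hex )) $(( 16#$len_hex ))"
}

decode_read16 88 00 00 00 00 01 62 a6 b2 10 00 00 00 08 00 00
# prints: 5950059024 8
```

The decoded LBA matches the sector in the blk_update_request line, so
the kernel asked for exactly what it logged. If that address is within
the drive's capacity, as it presumably is, then the drive or the USB
bridge in front of it rejected a valid request, which would at least be
consistent with something on the device side not being fully awake.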
# btrfs --version
Btrfs v3.18.2
# uname -a
Linux mcplex 3.18.4-gentoo #1 SMP Wed Jan 28 22:25:43 EST 2015 x86_64 Intel(R) Core(TM) i7-2600S CPU @ 2.80GHz GenuineIntel GNU/Linux
--
Sandy McArthur, Jr.
"No nation could preserve its freedom in the midst of continual warfare."
- Letters and Other Writings of James Madison (1865), Vol. IV, p. 491
* Re: Occasional read error on idle USB attached disk
From: Duncan @ 2015-03-05 23:48 UTC (permalink / raw)
To: linux-btrfs
Sandy McArthur Jr posted on Thu, 05 Mar 2015 17:04:18 -0500 as excerpted:
> I have a btrfs filesystem that gets read errors that appear to only
> happen after a disk has been idle a while. I don't know if the error
> output below is BTRFS, USB, both or other related. I suspect it's timing
> related. If I should take this error report somewhere else,
> please point me in the right direction.
>
> I have a large RAID1 btrfs filesystem (>13TB) that provides
> archive/backup space that is housed in a multi-drive USB enclosures
> comprising of WD Red drives. I noticed errors in dmesg, so I'd run a
> `btrfs scrub` and for two days it'd report zero errors. Within hours of
> the scrub completing I'd start seeing "csum failed ino" or other errors
> again. Not wanting to run a btrfs scrub 24/7 as it impacts load and
> available I/O I thought of a crude workaround...
>
> My workaround is every minute cron runs the unfortunate script below
> which is my hack to create some minimal random activity and this has had
> the effect of eliminating btrfs errors in dmesg since I installed it ~5
> days ago.
>
> #!/bin/bash
> for dev in /dev/disk/by-path/*-usb-*
> do
> dd "if=$dev" skip=$RANDOM of=/dev/null bs=1k count=1 conv=noerror
> sleep 1
> done
>
> Also: I have used `idle3ctl -d` on every WD drive to configure them not
> to idle spin down.
>
> I'd like to eliminate the need for script above but I don't know what to
> look into for more insight.
>
>
> Below are examples of errors in dmesg output that I believe to be from
> after an idle time.
[snip, but thanks for providing]
> # btrfs --version
> Btrfs v3.18.2
>
> # uname -a
> Linux mcplex 3.18.4-gentoo #1 SMP Wed Jan 28 22:25:43 EST
> 2015 x86_64 Intel(R) Core(TM) i7-2600S CPU @ 2.80GHz GenuineIntel
> GNU/Linux
First, I'm not a dev, let alone a kernel/btrfs dev, just an admin and
regular on the list. And also a fellow gentooer. =:^)
Given the symptoms and the relatively modern and thus power-saving
platform, I too suspect it's idle-timeout related. Specifically, given
that the drives are attached via USB, I strongly suspect that it's
automatic USB device-idle power-down, as enabled by the kernel. On old
enough equipment (or for that matter kernels, but of course btrfs wasn't
around or at least not reasonably usable then) you'd likely not see it,
as power-saving to that degree is a relatively recent innovation.
I strongly suspect there are power-related sysfs files you can poke to
disable the power-saving for the USB, which should solve the problem.
Tho not being a dev I'd have to do the same sysfs browsing you'll need to
do to find them (unless someone else points you to them), so I might as
well leave that for you (or them).
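
For reference, the USB runtime power-management knobs live in each
device's power/ directory in sysfs: power/control accepts "auto" (allow
autosuspend) or "on" (keep the device powered). Below is a minimal
sketch, assuming host-side autosuspend really is what's biting here.
The function name and the optional root argument (there so it can be
tried against a scratch directory first) are made up for illustration.

```shell
#!/bin/bash
# Set power/control to "on" for every USB device that has the knob.
# $1: sysfs root (defaults to /sys); writing the real one needs root.
usb_disable_autosuspend() {
    local root=${1:-/sys} ctl
    for ctl in "$root"/bus/usb/devices/*/power/control; do
        [[ -w $ctl ]] || continue      # unmatched glob or not writable
        if [[ $(<"$ctl") != on ]]; then
            echo on > "$ctl" && echo "set $ctl to on"
        fi
    done
}
```

Run as root with no argument to act on /sys. The setting does not
survive a reboot or a replug; booting with usbcore.autosuspend=-1 on
the kernel command line makes "no autosuspend" the default. Note this
only covers the host side; an enclosure's own bridge firmware may still
spin drives down by itself.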
Alternatively, of course, since you mentioned an apparent timeout of
hours and are obviously triggering it, you could configure your archiving/
backup scripts to reactivate the USB before access, and even power-down
after access completes, until the next time.
Meanwhile, if you're going to run a script, might as well have it do
something (semi-)useful. =:^)
I suspect that you don't actually have to do drive I/O to keep the USB
from timing out; pretty much any activity should do. And as it happens,
while it's direct SATA connections not USB here and that might add a kink
to things, I monitor and graph device temps, with the temp queries
showing up as traffic on the device-activity LEDs as if it were I/O.
Which it is, over the (SATA in my case) bus to the device logic, just not
to the physical media.
What I'm suggesting, assuming it works over USB which I'd hope it can, is
that you do device temperature monitoring. If I'm not mistaken, that'll
effectively kill two birds with one stone, giving you better device
information and possibly warning if a device starts to overheat before it
goes bad, /and/ providing bus activity, thus avoiding the idle-timeout
power-downs.
As you're a gentooer as well, merge app-admin/hddtemp and read the
manpage. There's a daemon mode, which I believe would do the polling and
avoid the idle timeouts by itself, and you can either have it log to
syslog every N seconds, or listen on a tcp port (7634 by default), and
run a script to query that port with net-analyzer/netcat or the like.
Alternatively, simply run hddtemp and let it output to STDOUT, scripting
that and logging/grabbing its output using whatever (superkaramba on my
desktop, here), which is what I do.
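
A sketch of the daemon approach, assuming hddtemp's daemon replies in
its usual |device|model|temp|unit| form with records back to back. The
function name is made up, and MODEL in the comment is a placeholder
rather than real output from these drives:

```shell
#!/bin/bash
# Parse one hddtemp daemon reply from stdin, e.g.
#   |/dev/sdh|MODEL|34|C||/dev/sdj|MODEL|35|C|
# and print one "device temp unit" line per drive.
parse_hddtemp() {
    local reply dev model temp unit
    # The reply normally has no trailing newline, so accept a partial line.
    IFS= read -r reply || [[ -n $reply ]] || return 1
    # Split back-to-back records: "||" becomes "|<newline>|".
    reply=${reply//||/$'|\n|'}
    while IFS='|' read -r _ dev model temp unit _; do
        [[ -n $dev ]] || continue
        echo "$dev $temp $unit"
    done <<< "$reply"
}
```

With the daemon started as `hddtemp -d /dev/sdh /dev/sdj` (as root),
`nc localhost 7634 | parse_hddtemp` run from cron could stand in for
the dd loop, provided the temperature queries turn out to be enough to
keep the link awake.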
Two minor doubts of the "I've not actually tried it here using USB as
you're using" level, tho I suspect it'll work fine once set up:
1) Will hddtemp work over USB? It can handle SCSI, which is what USB
storage connects with, so in theory it should work fine, tho you might
have to specify type SCSI or whatever, if hddtemp's autodetect doesn't
work.
2) Will it actually stop the idle-timeouts? I strongly suspect it will
as I see no reason why it should have to be physical media I/O, but since
I've not actually tried it, that's still theory, which unfortunately
doesn't always match reality.
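
One way to check doubt 2 without waiting for the next csum error: USB
devices expose power/runtime_status in sysfs, which reads "active" or
"suspended" (among a few other states). Below is a sketch that lists
any USB device currently suspended. The function name is hypothetical,
and the optional sysfs root argument is only there for dry runs.

```shell
#!/bin/bash
# Print "<sysfs-name> suspended" for every USB device whose runtime PM
# state is "suspended". $1: sysfs root (defaults to /sys).
usb_suspended() {
    local root=${1:-/sys} f state name
    for f in "$root"/bus/usb/devices/*/power/runtime_status; do
        [[ -r $f ]] || continue
        state=$(<"$f")
        [[ $state == suspended ]] || continue
        name=${f%/power/runtime_status}    # strip the trailing path
        echo "${name##*/} $state"          # keep only the device name
    done
}
```

If the enclosure never shows up in that list while idle and the errors
still come back, host-side USB autosuspend is probably not the culprit,
and the enclosure's bridge or the drives' own firmware would be the
next things to look at.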
[sig:]
> "No nation could preserve its freedom in the midst of continual
> warfare." - Letters and Other Writings of James Madison (1865), Vol. IV,
> p. 491
=:^)
You see my list sig below. My general mail sig is Ben Franklin:
"They that can give up essential liberty to obtain a little temporary
safety, deserve neither liberty nor safety."
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman