From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Thommandra Gowtham <trgowtham123@gmail.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: BTRFS suddenly moving to read-only
Date: Wed, 12 Aug 2020 06:58:22 +0800
Message-ID: <bbaffb9c-8aaf-1f21-d2a3-2b89bf37c248@gmx.com>
In-Reply-To: <CA+XNQ=g1WzZ6h+MGETbK34iUyHno_vUcufXiaJ3dKfVva+b=cQ@mail.gmail.com>

On 2020/8/11 11:12 PM, Thommandra Gowtham wrote:
> Thank you for the response.
>
>>
>>> - How do we determine the Disk health apart from SMART attributes? Can
>>> we do a Disk write/read test to figure it out?
>>
>> AFAIK SMART is the only thing we can rely on now.
>
> Thank you. The reason I asked the question is sometimes, though SMART
> reports the Percent Life remaining as > 80, we see issues with the
> disk.
> So I was looking if we can use dd or other tools to determine disk
> write speed and compare with the new SSD's. Like below.
>
> # dd if=/dev/zero of=/var/tmp/test1.img bs=1G count=1 oflag=dsync
> 1+0 records in
> 1+0 records out
> 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.90537 s, 564 MB/s
>
>>
>>>
>>> mount options used:
>>> rw,noatime,compress=lzo,ssd,space_cache,commit=60,subvolid=263
>>>
>>> # btrfs --version
>>> btrfs-progs v4.4
>>>
>>> Ubuntu 16.04: 4.15.0-36-generic #1 SMP Mon Oct 22 21:20:30 PDT 2018
>>> x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> mkstemp: Read-only file system
>>> [35816007.175210] print_req_error: I/O error, dev sda, sector 4472632
>>> [35816007.182192] BTRFS error (device sda4): bdev /dev/sda4 errs: wr
>>> 66, rd 725, flush 0, corrupt 0, gen 0
>>
>> This means some read error happened.
>
> Yes. The errors started to occur when we were upgrading the packages.
> Eventually the upgrade failed with read-only filesystem errors.
>
>>
>> Do you have extra log context?
>
> Not much on this system as we couldn't get anything from syslog after
> power-cycle.
>
> But on other instances, we see errors like below
>
> # cat syslog | grep error
> Jun 25 13:12:13 kernel: [154559.788764] res
> 41/04:00:80:08:00/00:00:00:00:00/60 Emask 0x1 (device error)
> Jun 25 13:12:13 kernel: [154559.821041] res
> 41/04:00:80:08:00/00:00:00:00:00/60 Emask 0x1 (device error)
> Jun 25 13:12:13 kernel: [154559.900810] res
> 41/04:00:00:08:02/00:00:00:00:00/60 Emask 0x1 (device error)
> Jun 25 13:12:13 kernel: [154559.933070] res
> 41/04:00:00:08:02/00:00:00:00:00/60 Emask 0x1 (device error)
> Jun 25 13:12:13 kernel: [154560.016591] res
> 41/04:00:80:08:00/00:00:00:00:00/60 Emask 0x1 (device error)
> Jun 25 13:12:13 kernel: [154560.048882] res
> 41/04:00:80:08:00/00:00:00:00:00/60 Emask 0x1 (device error)
> Jun 25 13:12:13 kernel: [154560.114361] ata2.00: NCQ disabled due to
> excessive errors
> Jun 25 13:12:13 kernel: [154560.132361] res
> 41/04:00:00:08:02/00:00:00:00:00/60 Emask 0x1 (device error)
> Jun 25 13:12:13 kernel: [154560.154507] ata2.00: error: { ABRT }
This means the disk failed to execute a command.
The full log context would help to locate the problem.
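A minimal sketch of one way to pull that context out of a saved syslog: print the lines around each "ataN.MM: error" marker instead of only the lines that contain "error". The function name ata_error_context and the default window of 20 lines are illustrative assumptions, not something from this thread.

```shell
# Hypothetical helper: print the log lines surrounding each libata
# "ataN.MM: error" marker, so the failed command and the lines leading
# up to it stay together.
# Usage: ata_error_context LOGFILE [LINES_OF_CONTEXT]
ata_error_context() {
    logfile=$1
    context=${2:-20}   # assumed default window; adjust as needed
    # -E: extended regex; -B/-A: lines of context before/after each match
    grep -E -B "$context" -A "$context" 'ata[0-9]+\.[0-9]+: error' "$logfile"
}
```

For example, `ata_error_context /var/log/syslog 50` would keep 50 lines on either side of every match, which is usually easier to read than the output of a plain `grep error`.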
> Jun 25 13:12:13 kernel: [154560.164580] res
> 41/04:00:00:08:02/00:00:00:00:00/60 Emask 0x1 (device error)
> Jun 25 13:12:14 kernel: [154560.339129] ata2.00: error: { ABRT }
> Jun 25 13:12:14 kernel: [154560.346548] print_req_error: I/O error,
> dev sdb, sector 67111040
> Jun 25 13:12:14 kernel: [154560.360192] BTRFS error (device sdb3):
> bdev /dev/sdb3 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
> Jun 25 13:12:14 kernel: [154560.417322] res
> 51/04:00:00:08:02/00:00:04:00:00/60 Emask 0x1 (device error)
> Jun 25 13:12:14 kernel: [154560.511036] ata2.00: error: { ABRT }
> Jun 25 13:12:14 kernel: [154560.518434] print_req_error: I/O error,
> dev sdb, sector 67241984
> Jun 25 13:12:14 kernel: [154560.525291] BTRFS error (device sdb3):
> bdev /dev/sdb3 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
>
>>
>>> [35816007.192913] print_req_error: I/O error, dev sda, sector 4472632
>>> [35816007.199855] BTRFS error (device sda4): bdev /dev/sda4 errs: wr
>>> 66, rd 726, flush 0, corrupt 0, gen 0
>>> [35816007.210675] print_req_error: I/O error, dev sda, sector 10180680
>>> [35816007.217748] BTRFS error (device sda4): bdev /dev/sda4 errs: wr
>>> 66, rd 727, flush 0, corrupt 0, gen 0
>>> [35816007.461941] print_req_error: I/O error, dev sda, sector 4472048
>>> [35816007.468903] BTRFS error (device sda4): bdev /dev/sda4 errs: wr
>>> 66, rd 728, flush 0, corrupt 0, gen 0
>>> [35816007.479611] systemd[7035]: serial-getty@ttyS0.service: Failed at
>>> step EXEC spawning /sbin/agetty: Input/output error
>>> [35816007.712006] print_req_error: I/O error, dev sda, sector 4472048
>>
>> This means, we failed to read some data from sda.
>>
>> It's not the error from btrfs checksum verification, but directly read
>> error from the device driver.
>>
>> So the command, agetty can't be executed due to we failed to read the
>> content of that executable file.
>>
>>>
>>> # dmesg | tail
>>> bash: /bin/dmesg: Input/output error
>>>
>>> Doesn't Input/output error mean the disk is inaccessible?
>>
>> This means, we can't even access /bin/dmesg the file itself.
>
> Yes. That would technically mean that the Disk is not accessible
> though it is being reported as read-only by 'mount -t btrfs'.
>
> If a disk is missing or offline, is it done by kernel (bug) or
> something related to hardware. This is being seen on multiple systems.
> So there has to be some commonality among them and as the disk moves
> to sudden read-only, we are unable to get much logs on all cases.
>
> How can we debug these instances? Can you please give some pointers?
I would recommend setting up netconsole to capture all kernel logs over the network.
Then next time you can provide the full context of the problem.
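As a rough sketch of what such a setup involves: the netconsole module takes a parameter of the form [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]. The helper below only assembles that string. The function name netconsole_param, the addresses, the ports and the interface name are all placeholders I made up, so substitute values that match your network.

```shell
# Hypothetical helper: build the value for the netconsole module's
# "netconsole=" parameter, which has the form
#   [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]
# Arguments: SRC_PORT SRC_IP DEV TGT_PORT TGT_IP TGT_MAC
# An empty TGT_MAC leaves the field blank (the kernel then broadcasts).
netconsole_param() {
    src_port=$1 src_ip=$2 dev=$3 tgt_port=$4 tgt_ip=$5 tgt_mac=$6
    printf '%s@%s/%s,%s@%s/%s\n' \
        "$src_port" "$src_ip" "$dev" "$tgt_port" "$tgt_ip" "$tgt_mac"
}

# On the affected machine (as root), something like the following.
# All values here are placeholders:
#   modprobe netconsole \
#       "netconsole=$(netconsole_param 6665 192.0.2.10 eth0 6666 192.0.2.20 '')"
#
# On the receiving machine, listen for UDP on the target port, for
# example with netcat (flags differ between netcat implementations):
#   nc -u -l 6666 | tee netconsole.log
```

Because the messages leave the machine over UDP, they should still reach the receiver even when the local disk can no longer be written or read.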
>
>>
>>>
>>> # btrfs fi show
>>> Label: 'rpool' uuid: 42d39990-e4eb-414b-8b17-0c4a2f76cc76
>>> Total devices 1 FS bytes used 11.80GiB
>>> devid 1 size 27.20GiB used 19.01GiB path /dev/sda4
>>>
>>> # smartctl -a /dev/sda
>>> smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.15.0-36-generic] (local build)
>>> Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
>>>
>>> Short INQUIRY response, skip product id
>>> A mandatory SMART command failed: exiting. To continue, add one or
>>> more '-T permissive' options.
>>>
>>>
>>> We were able to get smartctl o/p after a power-cycle
>>
>> Did you get dmesg/agetty run after a power-cycle?
>>
>> Or it still triggers the same -EIO error?
>
> No. After power-cycle everything is back to normal(rw mounted) with
> logs not showing any abnormalities.
> Subsequent IO activity(upgrading the packages) was successful as well.
That's tricky; it now looks more like a bug in the block layer.
So setting up netconsole is strongly recommended.
>
>>
>> BTW, if the smartctl doesn't record above read error as error, maybe
>> it's some unstable cables causing temporary errors?
>
>>> 169 Unknown_Attribute 0x0000 100 100 000 Old_age
>>> Offline - 66
>
> The Disk Percent life remaining is at '66' for this system which is
> low in my opinion. Can a disk go offline suddenly when the health
> drops low?
I'm not sure about the hardware side; it needs better context to determine.
Thanks,
Qu
>
> Regards,
> Gowtham
>
>>
>> Thanks,
>> Qu
>>