* open_ctree failed after ATA errors
@ 2014-11-11 15:51 Florian Bruhin
2014-11-11 18:07 ` Robert White
2014-11-11 20:08 ` Chris Murphy
0 siblings, 2 replies; 4+ messages in thread
From: Florian Bruhin @ 2014-11-11 15:51 UTC (permalink / raw)
To: linux-btrfs
I have the following setup:
- Two harddisks
- Both individually encrypted using LUKS
- Both combined into a btrfs using the btrfs raid1 feature
- The above setup exists twice:
- /dev/mapper/data1 and /dev/mapper/data2 -> /mnt/data
- /dev/mapper/secdata1 and /dev/mapper/secdata2 -> /mnt/secdata
Recently, I saw the following messages in my kernel logs every few days:
ata6.00: exception Emask 0x10 SAct 0x40000 SErr 0x400000 action 0x6 frozen
ata6.00: irq_stat 0x08000000, interface fatal error
ata6: SError: { Handshk }
ata6.00: failed command: WRITE FPDMA QUEUED
ata6.00: cmd 61/08:90:e8:29:85/01:00:03:00:00/40 tag 18 ncq 135168 out
res 40/00:94:e8:29:85/00:00:03:00:00/40 Emask 0x10 (ATA bus error)
ata6.00: status: { DRDY }
ata6: hard resetting link
ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata6.00: configured for UDMA/133
ata6: EH complete
ata6.00: exception Emask 0x10 SAct 0x800000 SErr 0x400000 action 0x6 frozen
ata6.00: irq_stat 0x08000000, interface fatal error
ata6: SError: { Handshk }
ata6.00: failed command: WRITE FPDMA QUEUED
ata6.00: cmd 61/00:b8:f0:2a:85/02:00:03:00:00/40 tag 23 ncq 262144 out
res 40/00:bc:f0:2a:85/00:00:03:00:00/40 Emask 0x10 (ATA bus error)
ata6.00: status: { DRDY }
ata6: hard resetting link
ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata6.00: configured for UDMA/133
ata6: EH complete
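The fields in these log lines can be cross-checked against each other. The following is only a sketch: it assumes the usual libata error-handler layout (cmd/feature:nsect:lbal:lbam:lbah/hob fields/device), 512-byte logical sectors, and that bit 22 of SError is the one printed as "Handshk". The function names are made up for illustration.

```python
# Sketch: decode a few fields from the libata error lines quoted above.
# Assumptions (not stated in the mail): libata's "cmd" field layout,
# 512-byte logical sectors, hypothetical helper names.

SERR_HANDSHAKE = 1 << 22  # SError bit that libata prints as "Handshk"


def active_tags(sact):
    """Return the NCQ tags whose bits are set in the SAct bitmask."""
    return [bit for bit in range(32) if sact & (1 << bit)]


def decode_ncq_cmd(cmd):
    """Decode 'op/feat:nsect:lbal:lbam:lbah/hob_feat:...:hob_lbah/dev'."""
    opcode, low, high, _device = cmd.split("/")
    feat, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    hfeat, _hnsect, hlbal, hlbam, hlbah = (int(x, 16) for x in high.split(":"))
    # For NCQ commands the sector count travels in the feature fields
    # (a count of 0 means 65536) and the tag sits in bits 7:3 of nsect.
    sectors = ((hfeat << 8) | feat) or 65536
    lba = (hlbah << 40) | (hlbam << 32) | (hlbal << 24) \
        | (lbah << 16) | (lbam << 8) | lbal
    return {
        "opcode": int(opcode, 16),
        "tag": nsect >> 3,
        "sectors": sectors,
        "bytes": sectors * 512,
        "lba": lba,
    }


for sact, cmd in [
    (0x40000, "61/08:90:e8:29:85/01:00:03:00:00/40"),
    (0x800000, "61/00:b8:f0:2a:85/02:00:03:00:00/40"),
]:
    print(active_tags(sact), decode_ncq_cmd(cmd))
```

Under these assumptions the decoded tag and byte count agree with the "tag 18 ncq 135168" and "tag 23 ncq 262144" values the kernel printed, and SErr 0x400000 has only the handshake bit set.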
I thought maybe it was just a temporary problem or related to
upgrading the kernel recently (3.17.1 -> 3.17.2) and not rebooting
yet, so I rebooted.
Since then, I could run cryptsetup luksOpen without problems, but
mounting the devices hung for ~15 seconds and then returned without
an error, yet nothing was mounted.
When strace'ing mount, it hung here:
mount("/dev/mapper/data1", "/mnt/data", "btrfs", MS_MGC_VAL, NULL)
(which then returned 0). I didn't see anything in the kernel logs.
I then tried the following:
# cryptsetup luksClose ... # for all 4 disks
# cryptsetup luksOpen ... # for all 4 disks
# btrfs device scan --all-devices
# mount /dev/mapper/data1 /mnt/data
# mount /dev/mapper/secdata1 /mnt/data
The same thing happened, and I then saw this in the kernel logs:
[Nov11 15:33] BTRFS info (device dm-3): disk space caching is enabled
[Nov11 15:34] BTRFS info (device dm-3): disk space caching is enabled
[Nov11 15:35] BTRFS info (device dm-3): disk space caching is enabled
[Nov11 15:36] BTRFS info (device dm-3): disk space caching is enabled
[Nov11 15:37] BTRFS info (device dm-3): disk space caching is enabled
[ +16.054127] BTRFS: open_ctree failed
[Nov11 15:38] BTRFS info (device dm-3): disk space caching is enabled
[Nov11 16:02] BTRFS info (device dm-2): disk space caching is enabled
How could I mount these volumes again? Is it a good idea to use
btrfs-zero-log as described in [1]?
Some other information:
- Distribution: Archlinux
- uname -a: Linux moody 3.17.2-1-ARCH #1 SMP PREEMPT Thu Oct 30 20:49:39 CET 2014 x86_64 GNU/Linux
- btrfs --version: Btrfs v3.17
- btrfs fi show:
Label: 'secdata2' uuid: 38267260-b656-4c66-a123-5f9214066ae1
Total devices 2 FS bytes used 2.06TiB
devid 1 size 3.64TiB used 2.06TiB path /dev/mapper/secdata2
devid 2 size 3.64TiB used 2.06TiB path /dev/mapper/secdata1
Label: 'data2' uuid: b67ca50d-dbde-445d-922a-3479849b5499
Total devices 2 FS bytes used 2.35TiB
devid 1 size 2.73TiB used 2.69TiB path /dev/mapper/data1
devid 3 size 2.73TiB used 2.69TiB path /dev/mapper/data2
- btrfs fi df /mnt/data:
Data, single: total=58.42GiB, used=14.48GiB
System, single: total=4.00MiB, used=12.00KiB
Metadata, single: total=1.01GiB, used=471.80MiB
GlobalReserve, single: total=160.00MiB, used=0.00B
- btrfs fi df /mnt/secdata
Data, single: total=58.42GiB, used=14.48GiB
System, single: total=4.00MiB, used=12.00KiB
Metadata, single: total=1.01GiB, used=471.80MiB
GlobalReserve, single: total=160.00MiB, used=0.00B
If there's anything else I can provide please let me know. Please Cc
me on replies, as I'm not on the list. Thanks in advance!
Florian
[1] https://btrfs.wiki.kernel.org/index.php/Btrfs-zero-log
--
http://www.the-compiler.org | me@the-compiler.org (Mail/XMPP)
GPG 0xFD55A072 | http://the-compiler.org/pubkey.asc
I love long mails! | http://email.is-not-s.ms/
* Re: open_ctree failed after ATA errors
2014-11-11 15:51 open_ctree failed after ATA errors Florian Bruhin
@ 2014-11-11 18:07 ` Robert White
2014-11-12 5:42 ` Florian Bruhin
2014-11-11 20:08 ` Chris Murphy
1 sibling, 1 reply; 4+ messages in thread
From: Robert White @ 2014-11-11 18:07 UTC (permalink / raw)
To: Florian Bruhin, linux-btrfs
The errors below point to a hard disk going bad or some other systematic
problem at the hardware level (controller card, interrupt conflict, etc).
In fact, "ata6.00: irq_stat 0x08000000, interface fatal error" is
pretty much a smoking gun pointing at your controller.
Since you just upgraded your kernel I'd check to make sure you have the
correct chipset and controller card selected. Look at /proc/interrupts
and see if the controller is sharing an interrupt with some other device
that could be crossing it up. Play with your MSI/MSI-X settings (if they
are in use try disabling them).
I'd also activate SMART and get the SMART tools (e.g. "smartmontools" in
gentoo, so probably something similar for your distro) and check the
drive health.
So the stack is
Application ->
File System ->
Device Mapper ->
Encryption ->
Controller ->
Wiring ->
Drive
You are seeing write failures in the controller->wiring->drive section
somewhere.
Cryptsetup is succeeding because the open operation is read-only. That
is, cryptsetup reads the LUKS block (first 4k of the partition) and does
the key work and device mapper setup completely in memory without
writing to the physical media at all.
Another possible area is if you ever resized the physical partitions but
didn't properly resize the cryptsetup layer with "cryptsetup resize",
but that would be unlikely to affect multiple drives (unless the mistake
was symmetric, e.g. you did it to both drives).
Basically your problem is _way_ below the BTRFS level, but BTRFS is the
first layer that's actually trying to write to the drives, so it's the
first level client to fail.
On 11/11/2014 07:51 AM, Florian Bruhin wrote:
> ata6.00: exception Emask 0x10 SAct 0x40000 SErr 0x400000 action 0x6 frozen
> ata6.00: irq_stat 0x08000000, interface fatal error
> ata6: SError: { Handshk }
> ata6.00: failed command: WRITE FPDMA QUEUED
> ata6.00: cmd 61/08:90:e8:29:85/01:00:03:00:00/40 tag 18 ncq 135168 out
> res 40/00:94:e8:29:85/00:00:03:00:00/40 Emask 0x10 (ATA bus error)
> ata6.00: status: { DRDY }
> ata6: hard resetting link
> ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> ata6.00: configured for UDMA/133
> ata6: EH complete
> ata6.00: exception Emask 0x10 SAct 0x800000 SErr 0x400000 action 0x6 frozen
> ata6.00: irq_stat 0x08000000, interface fatal error
> ata6: SError: { Handshk }
> ata6.00: failed command: WRITE FPDMA QUEUED
> ata6.00: cmd 61/00:b8:f0:2a:85/02:00:03:00:00/40 tag 23 ncq 262144 out
> res 40/00:bc:f0:2a:85/00:00:03:00:00/40 Emask 0x10 (ATA bus error)
> ata6.00: status: { DRDY }
> ata6: hard resetting link
> ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> ata6.00: configured for UDMA/133
> ata6: EH complete
* Re: open_ctree failed after ATA errors
2014-11-11 18:07 ` Robert White
@ 2014-11-12 5:42 ` Florian Bruhin
0 siblings, 0 replies; 4+ messages in thread
From: Florian Bruhin @ 2014-11-12 5:42 UTC (permalink / raw)
To: Robert White; +Cc: linux-btrfs
Hi,
First of all: I noticed I was able to mount my partitions when using
a different path, which made me investigate my /etc/fstab.
It contained this:
LABEL=data1 /mnt/data btrfs defaults,noatime,nofail,device=/dev/disk/by-label/data1,device=/dev/disk/by-label/data2 0 0
LABEL=secdata1 /mnt/secdata btrfs defaults,noatime,nofail,device=/dev/disk/by-label/secdata1,device=/dev/disk/by-label/secdata2 0 0
I now changed it to:
/dev/mapper/data1 /mnt/data btrfs defaults,noatime,nofail 0 0
/dev/mapper/secdata1 /mnt/secdata btrfs defaults,noatime,nofail 0 0
since my initramfs scans for btrfs devices anyway. Looking at
/dev/disk/by-label, only the second disk of each pair shows up:
lrwxrwxrwx 1 root root 10 Nov 11 19:32 bootfs -> ../../sde1
lrwxrwxrwx 1 root root 10 Nov 11 19:32 data2 -> ../../dm-3
lrwxrwxrwx 1 root root 10 Nov 11 19:32 secdata2 -> ../../dm-4
However in /dev/mapper, all of them are listed:
lrwxrwxrwx 1 root root 7 Nov 11 19:32 data1 -> ../dm-3
lrwxrwxrwx 1 root root 7 Nov 11 19:32 data2 -> ../dm-1
lrwxrwxrwx 1 root root 7 Nov 11 19:32 rootfs -> ../dm-0
lrwxrwxrwx 1 root root 7 Nov 11 19:32 secdata1 -> ../dm-2
lrwxrwxrwx 1 root root 7 Nov 11 19:32 secdata2 -> ../dm-4
I don't know what's going on there exactly (pointers welcome!) but it
seems the inability to mount is a different issue than the error
messages.
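One possible reading of the two listings: a multi-device btrfs carries a single filesystem label, so /dev/disk/by-label can hold only one link per filesystem, and "btrfs fi show" reports the labels as 'data2' and 'secdata2'. The sketch below just cross-references the two listings as quoted in this mail; the dictionaries are copied by hand and the helper name is hypothetical.

```python
# Sketch: cross-reference the by-label and mapper listings from this mail.
# The data is copied from the "ls -l" output above; nothing is read from
# a live system, and label_to_mapper_names is a made-up helper.

BY_LABEL = {"bootfs": "sde1", "data2": "dm-3", "secdata2": "dm-4"}
MAPPER = {
    "data1": "dm-3",
    "data2": "dm-1",
    "rootfs": "dm-0",
    "secdata1": "dm-2",
    "secdata2": "dm-4",
}


def label_to_mapper_names(by_label, mapper):
    """Map each label to the /dev/mapper name(s) sitting on the same node."""
    node_to_names = {}
    for name, node in mapper.items():
        node_to_names.setdefault(node, []).append(name)
    return {label: node_to_names.get(node, [])
            for label, node in by_label.items()}


result = label_to_mapper_names(BY_LABEL, MAPPER)
for label, names in result.items():
    print(label, "->", names)
```

In this listing the label data2 resolves to the node that /dev/mapper calls data1, and no label named data1 or secdata1 exists, which would fit the old LABEL=data1 and LABEL=secdata1 fstab entries not matching anything. On a live system the same comparison could be made with os.path.realpath on each link.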
* Robert White <rwhite@pobox.com> [2014-11-11 10:07:25 -0800]:
> Since you just upgraded your kernel I'd check to make sure you have the
> correct chipset and controller card selected. Look at /proc/interrupts and
> see if the controller is sharing an interrupt with some other device that
> could be crossing it up.
I don't really get how to interpret that file I'm afraid. These are
the contents:
CPU0 CPU1
0: 754372 0 IO-APIC-edge timer
8: 0 1 IO-APIC-edge rtc0
9: 0 0 IO-APIC-fasteoi acpi
17: 600 114573 IO-APIC 17-fasteoi ehci_hcd:usb1, ehci_hcd:usb2, ehci_hcd:usb3
18: 521 1240697 IO-APIC 18-fasteoi ohci_hcd:usb4, ohci_hcd:usb5, ohci_hcd:usb6, radeon
24: 0 0 PCI-MSI-edge PCIe PME
25: 667 373771 PCI-MSI-edge ahci
26: 408 179898 PCI-MSI-edge eth0
NMI: 31 39 Non-maskable interrupts
LOC: 76628 498768 Local timer interrupts
SPU: 0 0 Spurious interrupts
PMI: 31 39 Performance monitoring interrupts
IWI: 0 2 IRQ work interrupts
RTR: 0 0 APIC ICR read retries
RES: 826089 252701 Rescheduling interrupts
CAL: 270 504 Function call interrupts
TLB: 5704 5023 TLB shootdowns
TRM: 0 0 Thermal event interrupts
THR: 0 0 Threshold APIC interrupts
MCE: 0 0 Machine check exceptions
MCP: 129 129 Machine check polls
THR: 0 0 Hypervisor callback interrupts
ERR: 0
MIS: 0
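As a rough aid for reading this file, the sketch below lists the numbered IRQ lines that have more than one handler attached. It is only a heuristic under the assumption that handlers are comma-separated at the end of each line; the sample text is a subset of the output above, and shared_irqs is a hypothetical helper.

```python
import re

# Sketch: find IRQ lines shared by several handlers in /proc/interrupts.
# Heuristic only: assumes handlers are the comma-separated tail of a line.
SAMPLE = """\
 17:        600     114573  IO-APIC  17-fasteoi   ehci_hcd:usb1, ehci_hcd:usb2, ehci_hcd:usb3
 18:        521    1240697  IO-APIC  18-fasteoi   ohci_hcd:usb4, ohci_hcd:usb5, ohci_hcd:usb6, radeon
 25:        667     373771  PCI-MSI-edge   ahci
 26:        408     179898  PCI-MSI-edge   eth0
"""


def shared_irqs(text):
    """Return {irq: [handlers]} for numbered IRQs with more than one handler."""
    shared = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\d+):\s+(.*)", line)
        if not match:
            continue  # header line or named rows such as NMI/LOC
        irq, rest = int(match.group(1)), match.group(2)
        parts = [part.strip() for part in rest.split(",")]
        if len(parts) < 2:
            continue  # a single handler, nothing shared
        # The first chunk still carries the counters and chip name;
        # keep only its last whitespace-separated token.
        parts[0] = parts[0].split()[-1]
        shared[irq] = parts
    return shared


print(shared_irqs(SAMPLE))
```

For the output above, only IRQ 17 and 18 (USB controllers and radeon) are shared; ahci sits alone on IRQ 25 and is listed as PCI-MSI, so MSI appears to be in use for the controller.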
> Play with your MSI/MSI-X settings (if they are in use try disabling them).
I'll try that if the errors show up again in the next few days - maybe
the reboot actually fixed it after all.
> I'd also activate SMART and get the SMART tools (e.g. "smartmontools" in
> gentoo, so probably something similar for your distro) and check the drive
> health.
I already have monitoring running that also checks SMART, and I've never
had any problems there. But I'll re-check by hand to be sure.
> So the stack is
> Application ->
> File System ->
> Device Mapper ->
> Encryption ->
> Controller ->
> Wiring ->
> Drive
>
> You are seeing write failures in the controller->wiring->drive section
> somewhere.
Since it started happening after the upgrade, I can still hope it was
just some temporary issue if it doesn't show up again, right? ;)
> Another possible area is if you ever resized the physical partitions but
> didn't properly resize the cryptsetup layer with "cryptsetup resize", but
> that would be unlikely to affect multiple drives (unless the mistake was
> symmetric, e.g. you did it to both drives).
This isn't the case.
Thanks!
Florian
--
http://www.the-compiler.org | me@the-compiler.org (Mail/XMPP)
GPG 0xFD55A072 | http://the-compiler.org/pubkey.asc
I love long mails! | http://email.is-not-s.ms/
* Re: open_ctree failed after ATA errors
2014-11-11 15:51 open_ctree failed after ATA errors Florian Bruhin
2014-11-11 18:07 ` Robert White
@ 2014-11-11 20:08 ` Chris Murphy
1 sibling, 0 replies; 4+ messages in thread
From: Chris Murphy @ 2014-11-11 20:08 UTC (permalink / raw)
To: Btrfs BTRFS
On Nov 11, 2014, at 8:51 AM, Florian Bruhin <me@the-compiler.org> wrote:
> I have the following setup:
>
> - Two harddisks
> - Both individually encrypted using LUKS
> - Both combined into a btrfs using the btrfs raid1 feature
>
> - The above duplicated twice:
> - /dev/mapper/data1 and /dev/mapper/data2 -> /mnt/data
> - /dev/mapper/secdata1 and /dev/mapper/secdata2 -> /mnt/secdata
>
> Recently, I saw the following messages in my kernel logs every few days:
>
> ata6.00: exception Emask 0x10 SAct 0x40000 SErr 0x400000 action 0x6 frozen
> ata6.00: irq_stat 0x08000000, interface fatal error
> ata6: SError: { Handshk }
> ata6.00: failed command: WRITE FPDMA QUEUED
> ata6.00: cmd 61/08:90:e8:29:85/01:00:03:00:00/40 tag 18 ncq 135168 out
> res 40/00:94:e8:29:85/00:00:03:00:00/40 Emask 0x10 (ATA bus error)
> ata6.00: status: { DRDY }
> ata6: hard resetting link
> ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> ata6.00: configured for UDMA/133
> ata6: EH complete
> ata6.00: exception Emask 0x10 SAct 0x800000 SErr 0x400000 action 0x6 frozen
> ata6.00: irq_stat 0x08000000, interface fatal error
> ata6: SError: { Handshk }
> ata6.00: failed command: WRITE FPDMA QUEUED
> ata6.00: cmd 61/00:b8:f0:2a:85/02:00:03:00:00/40 tag 23 ncq 262144 out
> res 40/00:bc:f0:2a:85/00:00:03:00:00/40 Emask 0x10 (ATA bus error)
> ata6.00: status: { DRDY }
> ata6: hard resetting link
> ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> ata6.00: configured for UDMA/133
> ata6: EH complete
>
> I thought maybe it was just a temporary problem or related to
> upgrading the kernel recently (3.17.1 -> 3.17.2) and not rebooting
> yet, so I rebooted.
>
> Since then, I could run cryptsetup luksOpen without problems, but
> mounting the devices hung for ~15 seconds and then returned without
> an error, yet nothing was mounted.
>
> When strace'ing mount, it hung here:
>
> mount("/dev/mapper/data1", "/mnt/data", "btrfs", MS_MGC_VAL, NULL)
>
> (which then returned 0). I didn't see anything in the kernel logs.
>
> I then tried the following:
>
> # cryptsetup luksClose ... # for all 4 disks
> # cryptsetup luksOpen ... # for all 4 disks
> # btrfs device scan --all-devices
> # mount /dev/mapper/data1 /mnt/data
> # mount /dev/mapper/secdata1 /mnt/data
>
> The same thing happened, and I then saw this in the kernel logs:
>
> [Nov11 15:33] BTRFS info (device dm-3): disk space caching is enabled
> [Nov11 15:34] BTRFS info (device dm-3): disk space caching is enabled
> [Nov11 15:35] BTRFS info (device dm-3): disk space caching is enabled
> [Nov11 15:36] BTRFS info (device dm-3): disk space caching is enabled
> [Nov11 15:37] BTRFS info (device dm-3): disk space caching is enabled
> [ +16.054127] BTRFS: open_ctree failed
> [Nov11 15:38] BTRFS info (device dm-3): disk space caching is enabled
> [Nov11 16:02] BTRFS info (device dm-2): disk space caching is enabled
>
> How could I mount these volumes again? Is it a good idea to use
> btrfs-zero-log as described in [1]?
First, sort out the cause of the reported hardware problems; persistent errors are going to make things worse. Then try mounting with -o ro,recovery and see whether that works; it is also unlikely to alter anything on the drive. If it works, take the opportunity to update your backups. After that, you can see whether -o recovery works and fixes the problem permanently.
Chris Murphy
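
The order of attempts described above can be written down as a dry run. This is only a sketch: it prints the mount commands rather than executing them, the device and mount point are the ones from the original report, and recovery_mount_plan is a hypothetical helper.

```python
import shlex

# Sketch: the two mount attempts in the suggested order, as a dry run.
# Nothing is executed; the commands are only printed for review.


def recovery_mount_plan(device, mountpoint):
    """Return the mount attempts as argv lists, least invasive first."""
    return [
        # Read-only attempt first, so backups can be refreshed if it works.
        ["mount", "-o", "ro,recovery", device, mountpoint],
        # Read-write recovery attempt only after backups are up to date.
        ["mount", "-o", "recovery", device, mountpoint],
    ]


plan = recovery_mount_plan("/dev/mapper/data1", "/mnt/data")
for argv in plan:
    print(shlex.join(argv))
```

Keeping the read-only step first means a failed or partial recovery does not get a chance to write to drives that may still be throwing bus errors.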