* 4.11.0: kernel BUG at fs/btrfs/ctree.h:1779!
@ 2017-05-19 4:16 Marc MERLIN
2017-05-19 19:03 ` Liu Bo
0 siblings, 1 reply; 8+ messages in thread
From: Marc MERLIN @ 2017-05-19 4:16 UTC (permalink / raw)
To: linux-btrfs
Looks like all the unhelpful BUG() aren't gone yet :-/
This one is really not helpful, I don't even know which one of my filesystems caused the crash :(
Why is this not remounting the filesystem read only?
Really, from a user and admin perspective, this is really not helpful.
Could someone who knows more than me do a pass and eradicate those?
Btrfs cannot be a production filesystem as long as those are still around IMO.
Thanks,
Marc
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.h:1779!
invalid opcode: 0000 [#1] PREEMPT SMP
Modules linked in: veth ip6table_filter ip6_tables ebtable_nat ebtables ppdev lp xt_addrtype br_netfilter bridge stp llc tun autofs4 softdog binfmt_misc ftdi_sio nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc ipt_REJECT nf_reject_ipv4 xt_conntrack xt_mark xt_nat xt_tcpudp nf_log_ipv4 nf_log_common xt_LOG iptable_mangle iptable_filter lm85 hwmon_vid pl2303 dm_snapshot dm_bufio iptable_nat ip_tables nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_conntrack_ftp ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_nat nf_conntrack x_tables sg st snd_pcm_oss snd_mixer_oss bcache kvm_intel kvm irqbypass snd_hda_codec_realtek snd_hda_codec_generic snd_cmipci snd_hda_intel snd_mpu401_uart snd_hda_codec snd_opl3_lib snd_hda_core snd_rawmidi eeepc_wmi snd_hwdep snd_seq_device asus_wmi snd_pcm sparse_keymap
rfkill snd_timer hwmon snd i915 lpc_ich tpm_infineon rc_ati_x10 asix mei_me usbnet ati_remote pcspkr libphy tpm_tis rc_core usbserial tpm_tis_core wmi tpm parport_pc parport input_leds battery i2c_i801 soundcore evdev e1000e ptp pps_core fuse raid456 multipath mmc_block mmc_core lrw ablk_helper dm_crypt dm_mod async_raid6_recov async_pq async_xor async_memcpy async_tx crc32c_intel blowfish_x86_64 blowfish_common pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd xhci_pci ehci_pci xhci_hcd ehci_hcd mvsas r8169 sata_sil24 mii libsas usbcore scsi_transport_sas thermal fan [last unloaded: ftdi_sio]
CPU: 2 PID: 22204 Comm: kworker/u16:20 Tainted: G U 4.11.0-amd64-preempt-sysrq-20170406 #2
Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
Workqueue: btrfs-extent-refs btrfs_extent_refs_helper
task: ffff9417d6de2240 task.stack: ffffa1314e7e0000
RIP: 0010:btrfs_extent_inline_ref_size+0x29/0x39
RSP: 0018:ffffa1314e7e3b10 EFLAGS: 00010297
RAX: 000000000000001d RBX: ffff941849fd3700 RCX: ffff941aaa669000
RDX: 0000000000002000 RSI: 000000000000245a RDI: 0000000000000000
RBP: ffffa1314e7e3b10 R08: 0000000000004000 R09: ffffa1314e7e3ad8
R10: 0000000000000000 R11: 0000000000002000 R12: 000000000000245a
R13: 0000000000000000 R14: ffff94183c20b5b8 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff941d9e280000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000f7557d76 CR3: 00000003f8c09000 CR4: 00000000001406e0
Call Trace:
lookup_inline_extent_backref+0x302/0x436
? ___cache_free+0x200/0x25c
__btrfs_free_extent+0xf1/0xb18
__btrfs_run_delayed_refs+0xb2f/0xd15
? __wake_up_common+0x4d/0x81
btrfs_run_delayed_refs+0x7a/0x1cc
delayed_ref_async_start+0x5e/0x9b
btrfs_scrubparity_helper+0x111/0x271
? pwq_activate_delayed_work+0x4d/0x5b
btrfs_extent_refs_helper+0xe/0x10
process_one_work+0x193/0x2b0
? rescuer_thread+0x2b1/0x2b1
worker_thread+0x1e9/0x2c1
? rescuer_thread+0x2b1/0x2b1
kthread+0xfb/0x100
? init_completion+0x24/0x24
? do_fast_syscall_32+0xb7/0xfe
ret_from_fork+0x2c/0x40
Code: 5d c3 55 81 ff b0 00 00 00 48 89 e5 74 1f 81 ff b6 00 00 00 74 17 81 ff b8 00 00 00 74 16 81 ff b2 00 00 00 b8 1d 00 00 00 74 0e <0f> 0b b8 09 00 00 00 eb 05 b8 0d 00 00 00 5d c3 55 48 89 f0 48
RIP: btrfs_extent_inline_ref_size+0x29/0x39 RSP: ffffa1314e7e3b10
---[ end trace 8bd2bf161055b042 ]---
static inline u32 btrfs_extent_inline_ref_size(int type)
{
if (type == BTRFS_TREE_BLOCK_REF_KEY ||
type == BTRFS_SHARED_BLOCK_REF_KEY)
return sizeof(struct btrfs_extent_inline_ref);
if (type == BTRFS_SHARED_DATA_REF_KEY)
return sizeof(struct btrfs_shared_data_ref) +
sizeof(struct btrfs_extent_inline_ref);
if (type == BTRFS_EXTENT_DATA_REF_KEY)
return sizeof(struct btrfs_extent_data_ref) +
offsetof(struct btrfs_extent_inline_ref, offset);
BUG(); <<<<<<<<<<<<<<<<<
return 0;
}
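For anyone decoding the oops, the immediates in the Code: line appear to line up with this function: the compares against 0xb0, 0xb6, 0xb8 and 0xb2 are the four key types, and 0x09, 0x0d and 0x1d are the three possible return values. RDI (first argument on x86-64) is 0 in the register dump, which suggests the type read was 0. A minimal userspace sketch, assuming the packed on-disk layouts and key values below match the kernel's btrfs headers (worth checking against your tree); inline_ref_size is a hypothetical stand-in that returns 0 where the kernel calls BUG():

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Key type values, assumed to match the btrfs on-disk format. */
#define BTRFS_TREE_BLOCK_REF_KEY   176 /* 0xb0 */
#define BTRFS_EXTENT_DATA_REF_KEY  178 /* 0xb2 */
#define BTRFS_SHARED_BLOCK_REF_KEY 182 /* 0xb6 */
#define BTRFS_SHARED_DATA_REF_KEY  184 /* 0xb8 */

/* Packed on-disk layouts (GCC/Clang attribute syntax). */
struct btrfs_extent_inline_ref {
    uint8_t type;
    uint64_t offset;
} __attribute__((packed));

struct btrfs_shared_data_ref {
    uint32_t count;
} __attribute__((packed));

struct btrfs_extent_data_ref {
    uint64_t root;
    uint64_t objectid;
    uint64_t offset;
    uint32_t count;
} __attribute__((packed));

/* Compile-time checks of the sizes the sketch relies on. */
static_assert(sizeof(struct btrfs_extent_inline_ref) == 9, "inline ref is 9 bytes");
static_assert(sizeof(struct btrfs_extent_data_ref) == 28, "data ref is 28 bytes");

/* Same branches as the quoted function; unknown types yield 0, not BUG(). */
uint32_t inline_ref_size(int type)
{
    if (type == BTRFS_TREE_BLOCK_REF_KEY ||
        type == BTRFS_SHARED_BLOCK_REF_KEY)
        return sizeof(struct btrfs_extent_inline_ref);
    if (type == BTRFS_SHARED_DATA_REF_KEY)
        return sizeof(struct btrfs_shared_data_ref) +
               sizeof(struct btrfs_extent_inline_ref);
    if (type == BTRFS_EXTENT_DATA_REF_KEY)
        return sizeof(struct btrfs_extent_data_ref) +
               offsetof(struct btrfs_extent_inline_ref, offset);
    return 0; /* the kernel version hits BUG() here */
}
```

With these layouts, inline_ref_size() gives 9 for the two block-ref types, 13 for a shared data ref and 29 for an extent data ref, and falls through for type 0.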
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
* Re: 4.11.0: kernel BUG at fs/btrfs/ctree.h:1779!
2017-05-19 4:16 4.11.0: kernel BUG at fs/btrfs/ctree.h:1779! Marc MERLIN
@ 2017-05-19 19:03 ` Liu Bo
2017-05-20 0:11 ` Marc MERLIN
0 siblings, 1 reply; 8+ messages in thread
From: Liu Bo @ 2017-05-19 19:03 UTC (permalink / raw)
To: Marc MERLIN; +Cc: linux-btrfs
Hi Marc,
On Thu, May 18, 2017 at 09:16:38PM -0700, Marc MERLIN wrote:
> Looks like all the unhelpful BUG() aren't gone yet :-/
> This one is really not helpful, I don't even know which one of my filesystems caused the crash :(
>
> Why is this not remounting the filesystem read only?
> Really, from a user and admin perspective, this is really not helpful.
>
> Could someone who knows more than me do a pass and eradicate those?
> Btrfs cannot be a production filesystem as long as those are still around IMO.
Looks like there's a security hole hidden in the code. I don't think it's
a bug in the code itself; it looks more like it was caused by corrupted
metadata read from disk rather than by memory corruption.
A quick glance at the stack shows in extent-tree.c:lookup_inline_extent_backref()
type = btrfs_extent_inline_ref_type(leaf, iref);
then...
ptr += btrfs_extent_inline_ref_size(type);
I agree that a corrupted image should not corrupt the kernel, so we
can fix it by forcing it to readonly.
-liubo
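One way to read the suggestion above is to check the type read from disk before using it to advance the pointer, and to hand an error back to the caller instead of calling BUG(). A minimal userspace sketch, not the kernel's code: checked_inline_ref_size and walk_inline_refs are hypothetical names, the sizes 9/13/29 are the values that show up as 0x09/0x0d/0x1d in the Code: line of the oops, and using -EUCLEAN as the error code is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#ifndef EUCLEAN
#define EUCLEAN 117 /* Linux value; fallback for platforms without it */
#endif

enum {
    TREE_BLOCK_REF   = 176,
    EXTENT_DATA_REF  = 178,
    SHARED_BLOCK_REF = 182,
    SHARED_DATA_REF  = 184,
};

/* Size of one inline ref of the given type, or -EUCLEAN if unknown. */
int checked_inline_ref_size(uint8_t type)
{
    switch (type) {
    case TREE_BLOCK_REF:
    case SHARED_BLOCK_REF:
        return 9;
    case SHARED_DATA_REF:
        return 13;
    case EXTENT_DATA_REF:
        return 29;
    default:
        return -EUCLEAN; /* corrupted metadata: report it, do not crash */
    }
}

/*
 * Walk inline refs packed back to back in buf[0..len); the first byte of
 * each ref is its type. Returns the number of refs walked, or -EUCLEAN on
 * an unknown type or a ref that would run past the end of the buffer.
 */
int walk_inline_refs(const uint8_t *buf, size_t len)
{
    size_t pos = 0;
    int count = 0;

    /* Caller bug rather than an on-disk condition. */
    assert(buf != NULL || len == 0);

    while (pos < len) {
        int size = checked_inline_ref_size(buf[pos]);

        if (size < 0)
            return size;
        if ((size_t)size > len - pos)
            return -EUCLEAN;
        pos += (size_t)size;
        count++;
    }
    return count;
}
```

The caller then decides what to do with the error, for example aborting the transaction and forcing the filesystem read-only.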
* Re: 4.11.0: kernel BUG at fs/btrfs/ctree.h:1779!
2017-05-19 19:03 ` Liu Bo
@ 2017-05-20 0:11 ` Marc MERLIN
2017-05-20 0:37 ` Hugo Mills
0 siblings, 1 reply; 8+ messages in thread
From: Marc MERLIN @ 2017-05-20 0:11 UTC (permalink / raw)
To: Liu Bo, Chris Mason; +Cc: linux-btrfs
On Fri, May 19, 2017 at 12:03:58PM -0700, Liu Bo wrote:
> Hi Marc,
>
> On Thu, May 18, 2017 at 09:16:38PM -0700, Marc MERLIN wrote:
> > Looks like all the unhelpful BUG() aren't gone yet :-/
> > This one is really not helpful, I don't even know which one of my filesystems caused the crash :(
> >
> > Why is this not remounting the filesystem read only?
> > Really, from a user and admin perspective, this is really not helpful.
> >
> > Could someone who knows more than me do a pass and eradicate those?
> > Btrfs cannot be a production filesystem as long as those are still around IMO.
>
> Looks like there's a security hole hidden in the code. I don't think it's
> a bug in the code itself; it looks more like it was caused by corrupted
> metadata read from disk rather than by memory corruption.
>
> A quick glance at the stack shows in extent-tree.c:lookup_inline_extent_backref()
>
> type = btrfs_extent_inline_ref_type(leaf, iref);
> then...
> ptr += btrfs_extent_inline_ref_size(type);
>
> I agree that a corrupted image should not corrupt the kernel, so we
> can fix it by forcing it to readonly.
Thanks.
Can I make another plea for just removing all those BUG/BUG_ON?
They really have no place in production code, there is no excuse for a
filesystem to bring down the entire system and in the process not even tell you
which of your filesystems had the issue to start with.
Could this be made part of a cleanup for this build to remove them all?
Pretty please with cherry on top? :)
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
* Re: 4.11.0: kernel BUG at fs/btrfs/ctree.h:1779!
2017-05-20 0:11 ` Marc MERLIN
@ 2017-05-20 0:37 ` Hugo Mills
2017-05-20 0:47 ` Marc MERLIN
0 siblings, 1 reply; 8+ messages in thread
From: Hugo Mills @ 2017-05-20 0:37 UTC (permalink / raw)
To: Marc MERLIN; +Cc: Liu Bo, Chris Mason, linux-btrfs
On Fri, May 19, 2017 at 05:11:34PM -0700, Marc MERLIN wrote:
> On Fri, May 19, 2017 at 12:03:58PM -0700, Liu Bo wrote:
> > Hi Marc,
> >
> > On Thu, May 18, 2017 at 09:16:38PM -0700, Marc MERLIN wrote:
> > > Looks like all the unhelpful BUG() aren't gone yet :-/
> > > This one is really not helpful, I don't even know which one of my filesystems caused the crash :(
> > >
> > > Why is this not remounting the filesystem read only?
> > > Really, from a user and admin perspective, this is really not helpful.
> > >
> > > Could someone who knows more than me do a pass and eradicate those?
> > > Btrfs cannot be a production filesystem as long as those are still around IMO.
> >
> > Looks like there's a security hole hidden in the code. I don't think it's
> > a bug in the code itself; it looks more like it was caused by corrupted
> > metadata read from disk rather than by memory corruption.
> >
> > A quick glance at the stack shows in extent-tree.c:lookup_inline_extent_backref()
> >
> > type = btrfs_extent_inline_ref_type(leaf, iref);
> > then...
> > ptr += btrfs_extent_inline_ref_size(type);
> >
> > I agree that a corrupted image should not corrupt the kernel, so we
> > can fix it by forcing it to readonly.
>
> Thanks.
> Can I make another plea for just removing all those BUG/BUG_ON?
> They really have no place in production code, there is no excuse for a
> filesystem to bring down the entire system and in the process not even tell you
> which of your filesystems had the issue to start with.
>
> Could this be made part of a cleanup for this build to remove them all?
The removal of these has been an ongoing process for at least the
last 5 years.
I don't understand the specifics of the kernel code in question(*),
but over the last 5 years btrfs has got rid of most of the
BUG_ONs(**). The remaining ones are probably
complicated to deal with in any way more elegant than just stopping.
I recall seeing someone's stats on BUG_ON locations a couple of
years ago, and btrfs had managed to get the number of locations down
below XFS (but no other FS). It's a kind of success, at least...
Hugo.
(*) I don't have the spare time to fully comprehend the process. Sorry.
(**) A good fraction went away in the first year after the decision to
get rid of them.
> Pretty please with cherry on top? :)
>
> Marc
--
Hugo Mills | IMPROVE YOUR ORGANISMS!!
hugo@... carfax.org.uk |
http://carfax.org.uk/ |
PGP: E2AB1DE4 | Subject line of spam email
* Re: 4.11.0: kernel BUG at fs/btrfs/ctree.h:1779!
2017-05-20 0:37 ` Hugo Mills
@ 2017-05-20 0:47 ` Marc MERLIN
2017-05-20 0:57 ` Hugo Mills
0 siblings, 1 reply; 8+ messages in thread
From: Marc MERLIN @ 2017-05-20 0:47 UTC (permalink / raw)
To: Hugo Mills, Liu Bo, Chris Mason, linux-btrfs
On Sat, May 20, 2017 at 12:37:47AM +0000, Hugo Mills wrote:
> > Can I make another plea for just removing all those BUG/BUG_ON?
> > They really have no place in production code, there is no excuse for a
> > filesystem to bring down the entire system and in the process not even tell you
> > which of your filesystems had the issue to start with.
> >
> > Could this be made part of a cleanup for this build to remove them all?
>
> The removal of these has been an ongoing process for at least the
> last 5 years.
That's great news, thanks. I guess I'm a bit edgy because I've hit too many
of them already :) but glad to hear that there are a lot fewer now.
> I don't understand the specifics of the kernel code in question(*),
> but over the last 5 years btrfs has got rid of most of the
> BUG_ONs(**). The remaining ones are probably
> complicated to deal with in any way more elegant than just stopping.
The biggest problem is that those BUG* do not even tell you where the
problem is.
The assumption that you'd only ever have a single btrfs filesystem mounted,
is flawed to say the least :)
(I have 5 different ones on my server)
> I recall seeing someone's stats on BUG_ON locations a couple of
> years ago, and btrfs had managed to get the number of locations down
> below XFS (but no other FS). It's a kind of success, at least...
Good to know, thanks, and thanks to anyone who has worked on removing those.
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
* Re: 4.11.0: kernel BUG at fs/btrfs/ctree.h:1779!
2017-05-20 0:47 ` Marc MERLIN
@ 2017-05-20 0:57 ` Hugo Mills
2017-05-20 1:25 ` Marc MERLIN
0 siblings, 1 reply; 8+ messages in thread
From: Hugo Mills @ 2017-05-20 0:57 UTC (permalink / raw)
To: Marc MERLIN; +Cc: Liu Bo, Chris Mason, linux-btrfs
On Fri, May 19, 2017 at 05:47:48PM -0700, Marc MERLIN wrote:
> On Sat, May 20, 2017 at 12:37:47AM +0000, Hugo Mills wrote:
> > > Can I make another plea for just removing all those BUG/BUG_ON?
> > > They really have no place in production code, there is no excuse for a
> > > filesystem to bring down the entire system and in the process not even tell you
> > > which of your filesystems had the issue to start with.
> > >
> > > Could this be made part of a cleanup for this build to remove them all?
> >
> > The removal of these has been an ongoing process for at least the
> > last 5 years.
>
> That's great news, thanks. I guess I'm a bit edgy because I've hit too many
> of them already :) but glad to hear that there are a lot fewer now.
>
> > I don't understand the specifics of the kernel code in question(*),
> > but over the last 5 years btrfs has got rid of most of the
> > BUG_ONs(**). The remaining ones are probably
> > complicated to deal with in any way more elegant than just stopping.
>
> The biggest problem is that those BUG* do not even tell you where the
> problem is.
> The assumption that you'd only ever have a single btrfs filesystem mounted,
> is flawed to say the least :)
> (I have 5 different ones on my server)
I think from the POV of removing these BUG_ONs, it doesn't matter
which FS causes them. "All" you need to know is where the error
happened. From there, you can (in theory) work out what was wrong and
handle it more elegantly than simply stopping.
Obviously it would be nice, from the POV of the sysadmin, to know
which FS was complaining, but as an FS developer it's secondary to
identifying a BUG_ON which happens in real life, which offers an
opportunity to make the error path more elegant.
> > I recall seeing someone's stats on BUG_ON locations a couple of
> > years ago, and btrfs had managed to get the number of locations down
> > below XFS (but no other FS). It's a kind of success, at least...
>
> Good to know, thanks, and thanks to anyone who has worked on removing those.
I don't know what the current state is. Maybe someone on IRC will
be able to do the greps/stats to give proper numbers to it.
Hugo.
--
Hugo Mills | IMPROVE YOUR ORGANISMS!!
hugo@... carfax.org.uk |
http://carfax.org.uk/ |
PGP: E2AB1DE4 | Subject line of spam email
* Re: 4.11.0: kernel BUG at fs/btrfs/ctree.h:1779!
2017-05-20 0:57 ` Hugo Mills
@ 2017-05-20 1:25 ` Marc MERLIN
2017-05-20 1:48 ` Hugo Mills
0 siblings, 1 reply; 8+ messages in thread
From: Marc MERLIN @ 2017-05-20 1:25 UTC (permalink / raw)
To: Hugo Mills, Liu Bo, Chris Mason, linux-btrfs
On Sat, May 20, 2017 at 12:57:09AM +0000, Hugo Mills wrote:
> I think from the POV of removing these BUG_ONs, it doesn't matter
> which FS causes them. "All" you need to know is where the error
> happened. From there, you can (in theory) work out what was wrong and
> handle it more elegantly than simply stopping.
Sorry, "you" being the code author, or the user?
If code author, I'd rather this be worked out without the extra steps of
having to guess or spend more time to see which FS.
My FS takes up to a day to scrub and btrfs check, clearly making me do this
over 3 of them is not a good use of time and a loss of up to 3 days of wall
clock time.
Not counting that during that time, I have loss of service on all my
filesystems because I don't want to mount them read-write.
> Obviously it would be nice, from the POV of the sysadmin, to know
> which FS was complaining, but as an FS developer it's secondary to
> identifying a BUG_ON which happens in real life, which offers an
> opportunity to make the error path more elegant.
If the FS is remounted R/O, further damage is averted and it's obvious to
the admin which FS has a problem.
Is there a reason why all errors that are serious enough, do not cause the
FS to remount R/O instead of having any BUG/BUG_ON at all?
WARN_ON is also fine obviously if the error is not serious enough, or doing
a WARN_ON + a remount R/O
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
* Re: 4.11.0: kernel BUG at fs/btrfs/ctree.h:1779!
2017-05-20 1:25 ` Marc MERLIN
@ 2017-05-20 1:48 ` Hugo Mills
0 siblings, 0 replies; 8+ messages in thread
From: Hugo Mills @ 2017-05-20 1:48 UTC (permalink / raw)
To: Marc MERLIN; +Cc: Liu Bo, Chris Mason, linux-btrfs
On Fri, May 19, 2017 at 06:25:22PM -0700, Marc MERLIN wrote:
> On Sat, May 20, 2017 at 12:57:09AM +0000, Hugo Mills wrote:
> > I think from the POV of removing these BUG_ONs, it doesn't matter
> > which FS causes them. "All" you need to know is where the error
> > happened. From there, you can (in theory) work out what was wrong and
> > handle it more elegantly than simply stopping.
>
> Sorry, "you" being the code author, or the user?
Author.
> If code author, I'd rather this be worked out without the extra steps of
> having to guess or spend more time to see which FS.
As I understand it, it doesn't really matter which FS it comes
from. The question is: The kernel has hit this BUG_ON. What do you
actually want to do when this happens? You can't bring the kernel to a
grinding halt (BUG_ON), so how do you handle this more elegantly?
It actually doesn't matter about the state of any specific FS that
caused this particular problem. The fact is, someone decided to check
on the FS's state, and punted the problem of handling the check's
failure to someone later (the BUG_ON). You(*)'ve got to pick up that punt
and deal with it more cleanly.
(*) You == some kernel developer.
> My FS takes up to a day to scrub and btrfs check, clearly making me do this
> over 3 of them is not a good use of time and a loss of up to 3 days of wall
> clock time.
> Not counting that during that time, I have loss of service on all my
> filesystems because I don't want to mount them read-write.
>
> > Obviously it would be nice, from the POV of the sysadmin, to know
> > which FS was complaining, but as an FS developer it's secondary to
> > identifying a BUG_ON which happens in real life, which offers an
> > opportunity to make the error path more elegant.
>
> If the FS is remounted R/O, further damage is averted and it's obvious to
> the admin which FS has a problem.
>
> Is there a reason why all errors that are serious enough, do not cause the
> FS to remount R/O instead of having any BUG/BUG_ON at all?
Simply that it's easier to write a BUG_ON than to write the code to
bubble up a failure to the point that the FS can be made RO. This is a
clean-up kind of process: BUG_ONs should mostly be changed into a
proper error-handling path leading to remount-RO (in the worst
cases). As I understand it, it's not massively difficult, but it's
probably non-trivial effort to get right in each case.
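The "bubble up" process described above can be sketched in userspace C: the lowest level returns an error instead of stopping, each caller passes it on, and the top level marks the affected filesystem read-only and names it in the log. Everything here is hypothetical and only illustrates the shape of such a path: struct demo_fs, the demo_* function names and the use of -EUCLEAN are assumptions, not btrfs code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#ifndef EUCLEAN
#define EUCLEAN 117 /* Linux value; fallback for platforms without it */
#endif

/* Hypothetical per-filesystem state. */
struct demo_fs {
    const char *label;
    bool read_only;
};

/* Lowest level: detects bad metadata and reports it to the caller. */
int demo_lookup_backref(int ref_type)
{
    if (ref_type != 176 && ref_type != 178 &&
        ref_type != 182 && ref_type != 184)
        return -EUCLEAN;
    return 0;
}

/* Middle level: no handling of its own, just propagates the error. */
int demo_free_extent(int ref_type)
{
    int ret = demo_lookup_backref(ref_type);

    if (ret)
        return ret;
    /* ... the actual work would happen here ... */
    return 0;
}

/* Top-level error handler: name the filesystem and force it read-only. */
void demo_abort(struct demo_fs *fs, int err)
{
    /* A NULL fs is a programming error, not an on-disk condition. */
    assert(fs != NULL);
    fprintf(stderr, "demo-fs (%s): error %d, forcing read-only\n",
            fs->label, err);
    fs->read_only = true;
}

/* Top level: the point where the failure turns into remount-RO. */
int demo_run_delayed_refs(struct demo_fs *fs, int ref_type)
{
    int ret = demo_free_extent(ref_type);

    if (ret)
        demo_abort(fs, ret);
    return ret;
}
```

The non-trivial part is that every function between the check and the top level has to be able to return and unwind cleanly, which is the per-case effort mentioned above.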
> WARN_ON is also fine obviously if the error is not serious enough, or doing
> a WARN_ON + a remount R/O
Sure, but everything should really be turned into either a proper
error-handling path (most likely remount RO), or explicitly defined as
BUG_ON (i.e. "this must never happen -- if it does, then the hardware
is fucked up, and we're not responsible for the consequences"). It's
that latter definition that's part of the hard decision-making process
for the kernel dev.
Hugo.
--
Hugo Mills | Great oxymorons of the world, no. 7:
hugo@... carfax.org.uk | The Simple Truth
http://carfax.org.uk/ |
PGP: E2AB1DE4 |