Re: Global reserve and ENOSPC while deleting snapshots on 5.0.9 - still happens on 5.1.11

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Martin Raiber <martin@urbackup.org>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: Global reserve and ENOSPC while deleting snapshots on 5.0.9 - still happens on 5.1.11
Date: Mon, 24 Jun 2019 07:36:07 -0400	[thread overview]
Message-ID: <20190624113607.GC11820@hungrycats.org> (raw)
In-Reply-To: <0102016b88eef8ed-42aa5e72-7483-4c58-9555-5b1991a0da5e-000000@eu-west-1.amazonses.com>

[-- Attachment #1: Type: text/plain, Size: 17097 bytes --]

On Mon, Jun 24, 2019 at 10:01:31AM +0000, Martin Raiber wrote:
> I've fixed the same problem(s) by increasing the global metadata size as
> well. Though I haven't encountered them since Josef Bacik's block rsv
> rework in 5.0.

Now _that_ is interesting--for me, this issue never happened in five
years _before_ 5.0.

> Another problem with increasing the global metadata size is, that I
> think it is the only way dirty metadata is throttled. If increased too
> much (as a percentage of RAM) the system goes OOM depending on work
> load. As far as I can see dirty metdata isn't included into the
> dirty_ratio calculation as well, causing issues on that front as well.
> Another thing that I think helps is to run with "nodatasum" -- probably
> because then less metadata needs to be changed when deleting.
> 
> On 23.06.2019 16:14 Zygo Blaxell wrote:
> > On Tue, Apr 23, 2019 at 07:06:51PM -0400, Zygo Blaxell wrote:
> >> I had a test filesystem that ran out of unallocated space, then ran
> >> out of metadata space during a snapshot delete, and forced readonly.
> >> The workload before the failure was a lot of rsync and bees dedupe
> >> combined with random snapshot creates and deletes.
> > Had this happen again on a production filesystem, this time on 5.1.11,
> > and it happened during orphan inode cleanup instead of snapshot delete:
> >
> > 	[14303.076134][T20882] BTRFS: error (device dm-21) in add_to_free_space_tree:1037: errno=-28 No space left
> > 	[14303.076144][T20882] BTRFS: error (device dm-21) in __btrfs_free_extent:7196: errno=-28 No space left
> > 	[14303.076157][T20882] BTRFS: error (device dm-21) in btrfs_run_delayed_refs:3008: errno=-28 No space left
> > 	[14303.076203][T20882] BTRFS error (device dm-21): Error removing orphan entry, stopping orphan cleanup
> > 	[14303.076210][T20882] BTRFS error (device dm-21): could not do orphan cleanup -22
> > 	[14303.076281][T20882] BTRFS error (device dm-21): commit super ret -30
> > 	[14303.357337][T20882] BTRFS error (device dm-21): open_ctree failed
> >
> > Same fix:  I bumped the reserved size limit from 512M to 2G and mounted
> > normally.  (OK, technically, I booted my old 5.0.21 kernel--but my 5.0.21
> > kernel has the 2G reserved space patch below in it.)
> >
> > I've not been able to repeat this ENOSPC behavior under test conditions
> > in the last two months of trying, but it's now happened twice in different
> > places, so it has non-zero repeatability.
> >
> >> I tried the usual fix strategies:
> >>
> >> 	1.  Immediately after mount, try to balance to free space for
> >> 	metadata
> >>
> >> 	2.  Immediately after mount, add additional disks to provide
> >> 	unallocated space for metadata
> >>
> >> 	3.  Mount -o nossd to increase metadata density
> >>
> >> #3 had no effect.  #1 failed consistently.
> >>
> >> #2 was successful, but the additional space was not used because
> >> btrfs couldn't allocate chunks for metadata because it ran out of
> >> metadata space for new metadata chunks.
> >>
> >> When btrfs-cleaner tried to remove the first pending deleted snapshot,
> >> it started a transaction that failed due to lack of metadata space.
> >> Since the transaction failed, the filesystem reverts to its earlier state,
> >> and exactly the same thing happens on the next mount.  The 'btrfs dev
> >> add' in #2 is successful only if it is executed immediately after mount,
> >> before the btrfs-cleaner thread wakes up.
> >>
> >> Here's what the kernel said during one of the attempts:
> >>
> >> 	[41263.822252] BTRFS info (device dm-3): use zstd compression, level 0
> >> 	[41263.825135] BTRFS info (device dm-3): using free space tree
> >> 	[41263.827319] BTRFS info (device dm-3): has skinny extents
> >> 	[42046.463356] ------------[ cut here ]------------
> >> 	[42046.463387] BTRFS: error (device dm-3) in __btrfs_free_extent:7056: errno=-28 No space left
> >> 	[42046.463404] BTRFS: error (device dm-3) in __btrfs_free_extent:7056: errno=-28 No space left
> >> 	[42046.463407] BTRFS info (device dm-3): forced readonly
> >> 	[42046.463414] BTRFS: error (device dm-3) in btrfs_run_delayed_refs:3011: errno=-28 No space left
> >> 	[42046.463429] BTRFS: error (device dm-3) in btrfs_create_pending_block_groups:10517: errno=-28 No space left
> >> 	[42046.463548] BTRFS: error (device dm-3) in btrfs_create_pending_block_groups:10520: errno=-28 No space left
> >> 	[42046.471363] BTRFS: error (device dm-3) in btrfs_run_delayed_refs:3011: errno=-28 No space left
> >> 	[42046.471475] BTRFS: error (device dm-3) in btrfs_create_pending_block_groups:10517: errno=-28 No space left
> >> 	[42046.471506] BTRFS: error (device dm-3) in btrfs_create_pending_block_groups:10520: errno=-28 No space left
> >> 	[42046.473672] BTRFS: error (device dm-3) in btrfs_drop_snapshot:9489: errno=-28 No space left
> >> 	[42046.475643] WARNING: CPU: 0 PID: 10187 at fs/btrfs/extent-tree.c:7056 __btrfs_free_extent+0x364/0xf60
> >> 	[42046.475645] Modules linked in: mq_deadline bfq dm_cache_smq dm_cache dm_persistent_data dm_bio_prison dm_bufio joydev ppdev crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel dm_mod snd_pcm aesni_intel aes_x86_64 snd_timer crypto_simd cryptd glue_helper sr_mod snd cdrom psmouse sg soundcore input_leds pcspkr serio_raw ide_pci_generic i2c_piix4 bochs_drm parport_pc piix rtc_cmos floppy parport pcc_cpufreq ide_core qemu_fw_cfg evbug evdev ip_tables x_tables ipv6 crc_ccitt autofs4
> >> 	[42046.475677] CPU: 0 PID: 10187 Comm: btrfs-transacti Tainted: G    B   W         5.0.8-zb64-10a85e8a1569+ #1
> >> 	[42046.475678] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> >> 	[42046.475681] RIP: 0010:__btrfs_free_extent+0x364/0xf60
> >> 	[42046.475684] Code: 50 f0 48 0f ba a8 90 22 00 00 02 72 1f 8b 85 88 fe ff ff 83 f8 fb 0f 84 59 04 00 00 89 c6 48 c7 c7 00 85 be b8 e8 3c 1b 9b ff <0f> 0b 8b 8d 88 fe ff ff 48 8b bd 90 fe ff ff ba 90 1b 00 00 48 c7
> >> 	[42046.475685] RSP: 0018:ffff888103d8f8e0 EFLAGS: 00010282
> >> 	[42046.475688] RAX: 0000000000000000 RBX: ffff88802d9ce370 RCX: ffffffffb7224ca2
> >> 	[42046.475689] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff88811960dfac
> >> 	[42046.475691] RBP: ffff888103d8fa98 R08: ffffed10232c24c1 R09: ffffed10232c24c0
> >> 	[42046.475693] R10: ffff888025c94aeb R11: ffffed10232c24c1 R12: 0000000000a5054e
> >> 	[42046.475694] R13: 0000000000003000 R14: 000000000000214a R15: ffff888103d8fa70
> >> 	[42046.475696] FS:  0000000000000000(0000) GS:ffff888119400000(0000) knlGS:0000000000000000
> >> 	[42046.475698] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> 	[42046.475700] CR2: 00007f1b84271718 CR3: 00000000a3636001 CR4: 00000000001606f0
> >> 	[42046.475702] Call Trace:
> >> 	[42046.475707]  ? trace_hardirqs_off+0x24/0x120
> >> 	[42046.475714]  ? update_block_group+0x6d0/0x6d0
> >> 	[42046.475718]  ? __lock_acquire+0xf8/0x26e0
> >> 	[42046.475721]  ? __kasan_slab_free+0x14d/0x230
> >> 	[42046.475724]  ? btrfs_delayed_ref_lock+0x3e/0x100
> >> 	[42046.475727]  ? __lock_acquire+0xf8/0x26e0
> >> 	[42046.475730]  ? kthread+0x1a9/0x200
> >> 	[42046.475733]  ? debug_show_all_locks+0x210/0x210
> >> 	[42046.475737]  ? __btrfs_run_delayed_refs+0xce/0x210
> >> 	[42046.475742]  ? debug_show_all_locks+0x210/0x210
> >> 	[42046.475747]  run_delayed_data_ref+0x15a/0x400
> >> 	[42046.475752]  ? alloc_reserved_file_extent+0x520/0x520
> >> 	[42046.475756]  ? rb_next+0x25/0x90
> >> 	[42046.475761]  run_one_delayed_ref+0x71/0xe0
> >> 	[42046.475765]  btrfs_run_delayed_refs_for_head+0x284/0x580
> >> 	[42046.475772]  __btrfs_run_delayed_refs+0xeb/0x210
> >> 	[42046.475777]  ? btrfs_run_delayed_refs_for_head+0x580/0x580
> >> 	[42046.475781]  ? debug_show_all_locks+0x210/0x210
> >> 	[42046.475786]  btrfs_run_delayed_refs+0xc0/0x260
> >> 	[42046.475792]  btrfs_commit_transaction+0xf0/0x10f0
> >> 	[42046.475795]  ? do_raw_spin_unlock+0xa8/0x140
> >> 	[42046.475799]  ? _raw_spin_unlock+0x27/0x40
> >> 	[42046.475803]  ? btrfs_record_root_in_trans+0x24/0xb0
> >> 	[42046.475807]  ? btrfs_apply_pending_changes+0xb0/0xb0
> >> 	[42046.475809]  ? start_transaction+0x14a/0x760
> >> 	[42046.475816]  transaction_kthread+0x218/0x250
> >> 	[42046.475822]  kthread+0x1a9/0x200
> >> 	[42046.475824]  ? btrfs_cleanup_transaction+0xb80/0xb80
> >> 	[42046.475827]  ? kthread_park+0xb0/0xb0
> >> 	[42046.475831]  ret_from_fork+0x3a/0x50
> >> 	[42046.475837] irq event stamp: 0
> >> 	[42046.475839] hardirqs last  enabled at (0): [<0000000000000000>]           (null)
> >> 	[42046.475843] hardirqs last disabled at (0): [<ffffffffb70efe95>] copy_process.part.7+0xad5/0x3b80
> >> 	[42046.475846] softirqs last  enabled at (0): [<ffffffffb70efe95>] copy_process.part.7+0xad5/0x3b80
> >> 	[42046.475847] softirqs last disabled at (0): [<0000000000000000>]           (null)
> >> 	[42046.475849] ---[ end trace b02d556137ea688c ]---
> >> 	[42046.475852] BTRFS: error (device dm-3) in __btrfs_free_extent:7056: errno=-28 No space left
> >> 	[42046.475859] BTRFS: error (device dm-3) in btrfs_run_delayed_refs:3011: errno=-28 No space left
> >> 	[42046.475878] BTRFS: error (device dm-3) in btrfs_create_pending_block_groups:10517: errno=-28 No space left
> >> 	[42046.475905] BTRFS: error (device dm-3) in btrfs_create_pending_block_groups:10520: errno=-28 No space left
> >>
> >> I noticed that during the few minutes when the filesystem was running,
> >> the "global reserve" number in "btrfs fi usage" output kept going up to
> >> 503M or so, then the transaction failed.
> >>
> >> I then applied this patch to the kernel (5.0.8):
> >>
> >> 	diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> >> 	index 83791d13c204..59ba456e2314 100644
> >> 	--- a/fs/btrfs/extent-tree.c
> >> 	+++ b/fs/btrfs/extent-tree.c
> >> 	@@ -5790,7 +5790,7 @@ static void update_global_block_rsv(struct btrfs_fs_info *fs_info)
> >> 		spin_lock(&sinfo->lock);
> >> 		spin_lock(&block_rsv->lock);
> >>
> >> 	-       block_rsv->size = min_t(u64, num_bytes, SZ_512M);
> >> 	+       block_rsv->size = min_t(u64, num_bytes, SZ_1G);
> >>
> >> 		if (block_rsv->reserved < block_rsv->size) {
> >> 			num_bytes = btrfs_space_info_used(sinfo, true);
> >>
> >> Then mounted the filesystem.
> >>
> >> With the larger reserved block size, btrfs-cleaner was able to get
> >> past its earlier failures, allocating 3GB of new metadata chunks in the
> >> first transaction after it started deleting snapshots.  After this, the
> >> filesystem deleted some more snapshots (there are currently 2470 deleted
> >> snapshots remaining on this filesystem).
> >>
> >> This isn't really a solution, though, for two reasons:
> >>
> >> 1.  The commit that adds the limit on block_rsv->size (fdf30d1c1b3
> >> "Btrfs: limit the global reserve to 512mb") points out:
> >>
> >>     A user reported a problem where he was getting early ENOSPC with
> >>     hundreds of gigs of free data space and 6 gigs of free metadata space.
> >>     This is because the global block reserve was taking up the entire
> >>     free metadata space.  This is ridiculous, we have infrastructure in
> >>     place to throttle if we start using too much of the global reserve,
> >>     so instead of letting it get this huge just limit it to 512mb so that
> >>     users can still get work done.  This allowed the user to complete
> >>     his rsync without issues. 
> >>
> >> In other words, something should be handling the case when reserved
> >> space runs out, and that something is failing.
> >>
> >> 2.  The 1G reserved limit still isn't sufficient.  The filesystem deletes
> >> a few dozen snapshots, then fails again later:
> >>
> >> 	[ 3705.477036] CPU: 2 PID: 8388 Comm: btrfs-transacti Tainted: G    B             5.0.9-zb64-b6eed6abc880+ #3
> >> 	[ 3705.481510] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> >> 	[ 3705.484038] RIP: 0010:__btrfs_free_extent+0x364/0xf60
> >> 	[ 3705.485061] Code: 50 f0 48 0f ba a8 90 22 00 00 02 72 1f 8b 85 88 fe ff ff 83 f8 fb 0f 84 59 04 00 00 89 c6 48 c7 c7 80 85 be 8b e8 fc 1b 9b ff <0f> 0b 8b 8d 88 fe ff ff 48 8b bd 90 fe ff ff ba 90 1b 00 00 48 c7
> >> 	[ 3705.488046] RSP: 0018:ffff8881c29cf8e0 EFLAGS: 00010282
> >> 	[ 3705.488868] RAX: 0000000000000000 RBX: ffff888121dbdf20 RCX: ffffffff8a224a92
> >> 	[ 3705.490129] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff8881f5e0dfac
> >> 	[ 3705.491246] RBP: ffff8881c29cfa98 R08: ffffed103ebc24c1 R09: ffffed103ebc24c0
> >> 	[ 3705.492376] R10: ffff88819e6d430b R11: ffffed103ebc24c1 R12: 00000000009ecb21
> >> 	[ 3705.493501] R13: 0000000000040000 R14: 00000000000021b3 R15: ffff8881c29cfa70
> >> 	[ 3705.494619] FS:  0000000000000000(0000) GS:ffff8881f5c00000(0000) knlGS:0000000000000000
> >> 	[ 3705.495883] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> 	[ 3705.496781] CR2: 0000556bc833ec70 CR3: 00000001ef9e0001 CR4: 00000000001606e0
> >> 	[ 3705.497872] Call Trace:
> >> 	[ 3705.498279]  ? update_block_group+0x6d0/0x6d0
> >> 	[ 3705.498950]  ? btrfs_run_delayed_refs_for_head+0x352/0x580
> >> 	[ 3705.499816]  ? kthread+0x1a9/0x200
> >> 	[ 3705.500379]  ? debug_show_all_locks+0x210/0x210
> >> 	[ 3705.501078]  ? __btrfs_run_delayed_refs+0xce/0x210
> >> 	[ 3705.501813]  ? debug_show_all_locks+0x210/0x210
> >> 	[ 3705.502516]  run_delayed_data_ref+0x15a/0x400
> >> 	[ 3705.503193]  ? alloc_reserved_file_extent+0x520/0x520
> >> 	[ 3705.503968]  ? rb_next+0x3c/0x90
> >> 	[ 3705.504479]  run_one_delayed_ref+0x71/0xe0
> >> 	[ 3705.505116]  btrfs_run_delayed_refs_for_head+0x284/0x580
> >> 	[ 3705.505935]  __btrfs_run_delayed_refs+0xeb/0x210
> >> 	[ 3705.506649]  ? btrfs_run_delayed_refs_for_head+0x580/0x580
> >> 	[ 3705.507495]  ? debug_show_all_locks+0x210/0x210
> >> 	[ 3705.508201]  btrfs_run_delayed_refs+0xc0/0x260
> >> 	[ 3705.508912]  btrfs_commit_transaction+0xf0/0x10f0
> >> 	[ 3705.509654]  ? do_raw_spin_unlock+0xa8/0x140
> >> 	[ 3705.510343]  ? _raw_spin_unlock+0x27/0x40
> >> 	[ 3705.510979]  ? btrfs_record_root_in_trans+0x24/0xb0
> >> 	[ 3705.511748]  ? btrfs_apply_pending_changes+0xb0/0xb0
> >> 	[ 3705.512533]  ? start_transaction+0x14a/0x760
> >> 	[ 3705.513216]  transaction_kthread+0x218/0x250
> >> 	[ 3705.513900]  kthread+0x1a9/0x200
> >> 	[ 3705.514425]  ? btrfs_cleanup_transaction+0xb80/0xb80
> >> 	[ 3705.515233]  ? kthread_park+0xb0/0xb0
> >> 	[ 3705.515829]  ret_from_fork+0x3a/0x50
> >> 	[ 3705.516408] irq event stamp: 0
> >> 	[ 3705.516898] hardirqs last  enabled at (0): [<0000000000000000>]           (null)
> >> 	[ 3705.518054] hardirqs last disabled at (0): [<ffffffff8a0efe65>] copy_process.part.7+0xad5/0x3b80
> >> 	[ 3705.519419] softirqs last  enabled at (0): [<ffffffff8a0efe65>] copy_process.part.7+0xad5/0x3b80
> >> 	[ 3705.520779] softirqs last disabled at (0): [<0000000000000000>]           (null)
> >> 	[ 3705.521944] ---[ end trace 3a5f720e889bc0d3 ]---
> >> 	[ 3705.522713] BTRFS: error (device dm-3) in __btrfs_free_extent:7056: errno=-28 No space left
> >> 	[ 3705.524069] BTRFS: error (device dm-3) in btrfs_run_delayed_refs:3011: errno=-28 No space left
> >>
> >> If I increase the reserved limit again to 2G, btrfs-cleaner gets past
> >> this shapshot (so far).  It takes more than 10 minutes to delete some
> >> of the snapshots, so this will take a while to finish all 2470.
> >>
> >> The current btrfs fi usage output worries me:
> >>
> >> 	Overall:
> >> 	    Device size:                   2.02TiB
> >> 	    Device allocated:              2.01TiB
> >> 	    Device unallocated:           10.03GiB
> >> 	    Device missing:                  0.00B
> >> 	    Used:                          1.70TiB
> >> 	    Free (estimated):            296.02GiB      (min: 291.01GiB)
> >> 	    Data ratio:                       1.00
> >> 	    Metadata ratio:                   2.00
> >> 	    Global reserve:                2.00GiB      (used: 864.00KiB)
> >>
> >> 	Data,single: Size:1.81TiB, Used:1.53TiB
> >> 	   /dev/mapper/vgnebtest-tvdb    928.97GiB
> >> 	   /dev/mapper/vgnebtest-tvdc    929.00GiB
> >>
> >> 	Metadata,RAID1: Size:99.99GiB, Used:82.29GiB
> >> 	   /dev/mapper/vgnebtest-lvol1     5.00GiB
> >> 	   /dev/mapper/vgnebtest-lvol2     5.00GiB
> >> 	   /dev/mapper/vgnebtest-tvdb     94.99GiB
> >> 	   /dev/mapper/vgnebtest-tvdc     94.99GiB
> >>
> >> 	System,RAID1: Size:8.00MiB, Used:240.00KiB
> >> 	   /dev/mapper/vgnebtest-tvdb      8.00MiB
> >> 	   /dev/mapper/vgnebtest-tvdc      8.00MiB
> >>
> >> 	Unallocated:
> >> 	   /dev/mapper/vgnebtest-lvol1     5.00GiB
> >> 	   /dev/mapper/vgnebtest-lvol2     5.00GiB
> >> 	   /dev/mapper/vgnebtest-tvdb     30.00MiB
> >> 	   /dev/mapper/vgnebtest-tvdc      1.00MiB
> >>
> >> Note "lvol1" and "lvol2" were added after the initial failure.  
> >>
> >> There is clearly a lot of metadata space available, but the filesystem still
> >> intermittently hits ENOSPC while trying to free space.
> >
> 
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

next prev parent reply	other threads:[~2019-06-24 11:36 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-04-23 23:06 Global reserve and ENOSPC while deleting snapshots on 5.0.9 Zygo Blaxell
2019-04-24  2:57 ` Paul Jones
2019-04-24  3:58   ` Zygo Blaxell
2019-06-23 14:14 ` Global reserve and ENOSPC while deleting snapshots on 5.0.9 - still happens on 5.1.11 Zygo Blaxell
2019-06-24 10:01   ` Martin Raiber
2019-06-24 11:36     ` Zygo Blaxell [this message]
2019-07-01  3:07   ` Global reserve and ENOSPC while deleting orphan inodes on 5.1.14 Zygo Blaxell

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190624113607.GC11820@hungrycats.org \
    --to=ce3g8jdj@umail.furryterror.org \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=martin@urbackup.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).