From: Josef Bacik <josef@toxicpanda.com>
To: fdmanana@kernel.org, linux-btrfs@vger.kernel.org
Cc: Filipe Manana <fdmanana@suse.com>
Subject: Re: [PATCH 1/5] btrfs: fix metadata reservation for fallocate that leads to transaction aborts
Date: Thu, 10 Sep 2020 10:48:11 -0400
Message-ID: <632a347e-fe9d-88e9-9129-002605de2e96@toxicpanda.com>
In-Reply-To: <31a60143d7f01172265ed3120f2133a84722422e.1599560101.git.fdmanana@suse.com>
On 9/8/20 6:27 AM, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> When doing a fallocate(), especially a zero range operation, we assume
> that reserving 3 units of metadata space is enough, that at most we touch
> one leaf in subvolume/fs tree for removing existing file extent items and
> inserting a new file extent item. This assumption is generally true for
> most common use cases. However when we end up needing to remove file extent
> items from multiple leaves, we can end up failing with -ENOSPC and abort
> the current transaction, turning the filesystem to RO mode. When this
> happens a stack trace like the following is dumped in dmesg/syslog:
>
> [ 1500.620934] ------------[ cut here ]------------
> [ 1500.620938] BTRFS: Transaction aborted (error -28)
> [ 1500.620973] WARNING: CPU: 2 PID: 30807 at fs/btrfs/inode.c:9724 __btrfs_prealloc_file_range+0x512/0x570 [btrfs]
> [ 1500.620974] Modules linked in: btrfs intel_rapl_msr intel_rapl_common kvm_intel (...)
> [ 1500.621010] CPU: 2 PID: 30807 Comm: xfs_io Tainted: G W 5.9.0-rc3-btrfs-next-67 #1
> [ 1500.621012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> [ 1500.621023] RIP: 0010:__btrfs_prealloc_file_range+0x512/0x570 [btrfs]
> [ 1500.621026] Code: 8b 40 50 f0 48 (...)
> [ 1500.621028] RSP: 0018:ffffb05fc8803ca0 EFLAGS: 00010286
> [ 1500.621030] RAX: 0000000000000000 RBX: ffff9608af276488 RCX: 0000000000000000
> [ 1500.621032] RDX: 0000000000000001 RSI: 0000000000000027 RDI: 00000000ffffffff
> [ 1500.621033] RBP: ffffb05fc8803d90 R08: 0000000000000001 R09: 0000000000000001
> [ 1500.621035] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000003200000
> [ 1500.621037] R13: 00000000ffffffe4 R14: ffff9608af275fe8 R15: ffff9608af275f60
> [ 1500.621039] FS: 00007fb5b2368ec0(0000) GS:ffff9608b6600000(0000) knlGS:0000000000000000
> [ 1500.621041] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1500.621043] CR2: 00007fb5b2366fb8 CR3: 0000000202d38005 CR4: 00000000003706e0
> [ 1500.621046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1500.621047] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 1500.621049] Call Trace:
> [ 1500.621076] btrfs_prealloc_file_range+0x10/0x20 [btrfs]
> [ 1500.621087] btrfs_fallocate+0xccd/0x1280 [btrfs]
> [ 1500.621108] vfs_fallocate+0x14d/0x290
> [ 1500.621112] ksys_fallocate+0x3a/0x70
> [ 1500.621117] __x64_sys_fallocate+0x1a/0x20
> [ 1500.621120] do_syscall_64+0x33/0x80
> [ 1500.621123] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 1500.621126] RIP: 0033:0x7fb5b248c477
> [ 1500.621128] Code: 89 7c 24 08 (...)
> [ 1500.621130] RSP: 002b:00007ffc7bee9060 EFLAGS: 00000293 ORIG_RAX: 000000000000011d
> [ 1500.621132] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fb5b248c477
> [ 1500.621134] RDX: 0000000000000000 RSI: 0000000000000010 RDI: 0000000000000003
> [ 1500.621136] RBP: 0000557718faafd0 R08: 0000000000000000 R09: 0000000000000000
> [ 1500.621137] R10: 0000000003200000 R11: 0000000000000293 R12: 0000000000000010
> [ 1500.621139] R13: 0000557718faafb0 R14: 0000557718faa480 R15: 0000000000000003
> [ 1500.621151] irq event stamp: 1026217
> [ 1500.621154] hardirqs last enabled at (1026223): [<ffffffffba965570>] console_unlock+0x500/0x5c0
> [ 1500.621156] hardirqs last disabled at (1026228): [<ffffffffba9654c7>] console_unlock+0x457/0x5c0
> [ 1500.621159] softirqs last enabled at (1022486): [<ffffffffbb6003dc>] __do_softirq+0x3dc/0x606
> [ 1500.621161] softirqs last disabled at (1022477): [<ffffffffbb4010b2>] asm_call_on_stack+0x12/0x20
> [ 1500.621162] ---[ end trace 2955b08408d8b9d4 ]---
> [ 1500.621167] BTRFS: error (device sdj) in __btrfs_prealloc_file_range:9724: errno=-28 No space left
>
> When we use fallocate() internally, for reserving an extent for a space
> cache, inode cache or relocation, we can't hit this problem since either
> there aren't any file extent items to remove from the subvolume tree or
> there is at most one.
>
> When using plain fallocate() it's very unlikely, since that would require
> having many file extent items representing holes for the target range and
> crossing multiple leaves - we attempt to increase the range (merge) of such
> file extent items when punching holes, so at most we end up with 2 file
> extent items for holes at leaf boundaries.
>
> However when using the zero range operation of fallocate() for a large
> range (100+ MiB for example) that's fairly easy to trigger. The following
> example reproducer triggers the issue:
>
> $ cat reproducer.sh
> #!/bin/bash
>
> umount /dev/sdj &> /dev/null
> mkfs.btrfs -f -n 16384 -O ^no-holes /dev/sdj > /dev/null
> mount /dev/sdj /mnt/sdj
>
> # Create a 100M file with many file extent items. Punch a hole every 8K
> # just to speed up the file creation - we could do 4K sequential writes
> # followed by fsync (or O_SYNC) as well, but that takes a lot of time.
> file_size=$((100 * 1024 * 1024))
> xfs_io -f -c "pwrite -S 0xab -b 10M 0 $file_size" /mnt/sdj/foobar
> for ((i = 0; i < $file_size; i += 8192)); do
>     xfs_io -c "fpunch $i 4096" /mnt/sdj/foobar
> done
>
> # Force a transaction commit, so the zero range operation will be forced
> # to COW all metadata extents it needs to touch.
> sync
>
> xfs_io -c "fzero 0 $file_size" /mnt/sdj/foobar
>
> umount /mnt/sdj
>
> $ ./reproducer.sh
> wrote 104857600/104857600 bytes at offset 0
> 100 MiB, 10 ops; 0.0669 sec (1.458 GiB/sec and 149.3117 ops/sec)
> fallocate: No space left on device
>
> $ dmesg
> <shows the same stack trace pasted before>
>
> To fix this use the existing infrastructure that hole punching and
> extent cloning use for replacing a file range with another extent. This
> deals with doing the removal of file extent items and inserting the new
> one using an incremental approach, reserving more space when needed and
> always ensuring we don't leave an implicit hole in the range in case
> we need to do multiple iterations and a crash happens between iterations.
>
> A test case for fstests will follow up soon.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Thanks,
Josef
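
For a rough sense of why the quoted reproducer defeats the "at most one
leaf" assumption, the following back-of-the-envelope sketch estimates how
many leaves the file extent items of that 100M file might span. The
file size, 8K stride and 16K node size come from the reproducer; the
on-disk sizes (101-byte leaf header, 25-byte item header, 53-byte regular
file extent item) are assumptions about the btrfs on-disk format, and the
variable names are purely illustrative. It ignores the other items
(inode item, etc.) sharing the same leaves, so treat the result as an
order-of-magnitude estimate only.

```shell
#!/bin/bash
# Estimate how many fs-tree leaves hold the file extent items created by
# the reproducer. All structure sizes below are assumptions, not values
# taken from the patch.

file_size=$((100 * 1024 * 1024))   # 100M file, as in the reproducer
stride=8192                        # one 4K hole punched every 8K
nodesize=16384                     # mkfs.btrfs -n 16384

leaf_header=101        # assumed size of the leaf header
item_header=25         # assumed size of an item header (key + offset + size)
file_extent_item=53    # assumed size of a regular file extent item

# With no-holes disabled, every 8K stride should leave one hole item and
# one data item behind.
holes=$((file_size / stride))
items=$((2 * holes))

# How many such items fit in one leaf, and how many leaves that implies
# (rounded up).
items_per_leaf=$(( (nodesize - leaf_header) / (item_header + file_extent_item) ))
leaves=$(( (items + items_per_leaf - 1) / items_per_leaf ))

echo "file extent items: $items"
echo "items per leaf:    $items_per_leaf"
echo "leaves spanned:    $leaves"
```

Under these assumptions the zero range operation has to remove items from
on the order of a hundred leaves rather than one, which is consistent with
the fixed 3-unit reservation running out as the commit message describes.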
Thread overview: 12+ messages
2020-09-08 10:27 [PATCH 0/5] btrfs: fix enospc and transaction aborts during fallocate fdmanana
2020-09-08 10:27 ` [PATCH 1/5] btrfs: fix metadata reservation for fallocate that leads to transaction aborts fdmanana
2020-09-10 14:48   ` Josef Bacik [this message]
2020-09-08 10:27 ` [PATCH 2/5] btrfs: remove item_size member of struct btrfs_clone_extent_info fdmanana
2020-09-10 14:48   ` Josef Bacik
2020-09-08 10:27 ` [PATCH 3/5] btrfs: rename struct btrfs_clone_extent_info to a more generic name fdmanana
2020-09-10 14:48   ` Josef Bacik
2020-09-08 10:27 ` [PATCH 4/5] btrfs: rename btrfs_punch_hole_range() " fdmanana
2020-09-10 14:49   ` Josef Bacik
2020-09-08 10:27 ` [PATCH 5/5] btrfs: rename btrfs_insert_clone_extent() " fdmanana
2020-09-10 14:49   ` Josef Bacik
2020-09-11 14:02 ` [PATCH 0/5] btrfs: fix enospc and transaction aborts during fallocate David Sterba