From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
stable@vger.kernel.org, Filipe Manana <fdmanana@suse.com>,
Chris Mason <clm@fb.com>
Subject: [PATCH 3.17 73/73] Btrfs: fix fs corruption on transaction abort if device supports discard
Date: Tue, 6 Jan 2015 18:16:34 -0800 [thread overview]
Message-ID: <20150107021559.755034943@linuxfoundation.org> (raw)
In-Reply-To: <20150107021557.521276020@linuxfoundation.org>
3.17-stable review patch. If anyone has any objections, please let me know.
------------------
From: Filipe Manana <fdmanana@suse.com>
commit 678886bdc6378c1cbd5072da2c5a3035000214e3 upstream.
When we abort a transaction we iterate over all the ranges marked as dirty
in fs_info->freed_extents[0] and fs_info->freed_extents[1], clear them
from those trees, add them back (unpin) to the free space caches and, if
the fs was mounted with "-o discard", perform a discard on those regions.
Also, after adding the regions to the free space caches, a fitrim ioctl call
can see those ranges in a block group's free space cache and perform a discard
on the ranges, so the same issue can happen without "-o discard" as well.
This causes corruption, affecting one or multiple btree nodes (in the worst
case leaving the fs unmountable) because some of those ranges (the ones in
the fs_info->pinned_extents tree) correspond to btree nodes/leafs that are
referred by the last committed super block - breaking the rule that anything
that was committed by a transaction is untouched until the next transaction
commits successfully.
I ran into this while running in a loop (for several hours) the fstest that
I recently submitted:
[PATCH] fstests: add btrfs test to stress chunk allocation/removal and fstrim
The corruption always happened when a transaction aborted and then fsck complained
like this:
_check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
*** fsck.btrfs output ***
Check tree block failed, want=94945280, have=0
Check tree block failed, want=94945280, have=0
Check tree block failed, want=94945280, have=0
Check tree block failed, want=94945280, have=0
Check tree block failed, want=94945280, have=0
read block failed check_tree_block
Couldn't open file system
In this case 94945280 corresponded to the root of a tree.
Using frace what I observed was the following sequence of steps happened:
1) transaction N started, fs_info->pinned_extents pointed to
fs_info->freed_extents[0];
2) node/eb 94945280 is created;
3) eb is persisted to disk;
4) transaction N commit starts, fs_info->pinned_extents now points to
fs_info->freed_extents[1], and transaction N completes;
5) transaction N + 1 starts;
6) eb is COWed, and btrfs_free_tree_block() called for this eb;
7) eb range (94945280 to 94945280 + 16Kb) is added to
fs_info->pinned_extents (fs_info->freed_extents[1]);
8) Something goes wrong in transaction N + 1, like hitting ENOSPC
for example, and the transaction is aborted, turning the fs into
readonly mode. The stack trace I got for example:
[112065.253935] [<ffffffff8140c7b6>] dump_stack+0x4d/0x66
[112065.254271] [<ffffffff81042984>] warn_slowpath_common+0x7f/0x98
[112065.254567] [<ffffffffa0325990>] ? __btrfs_abort_transaction+0x50/0x10b [btrfs]
[112065.261674] [<ffffffff810429e5>] warn_slowpath_fmt+0x48/0x50
[112065.261922] [<ffffffffa032949e>] ? btrfs_free_path+0x26/0x29 [btrfs]
[112065.262211] [<ffffffffa0325990>] __btrfs_abort_transaction+0x50/0x10b [btrfs]
[112065.262545] [<ffffffffa036b1d6>] btrfs_remove_chunk+0x537/0x58b [btrfs]
[112065.262771] [<ffffffffa033840f>] btrfs_delete_unused_bgs+0x1de/0x21b [btrfs]
[112065.263105] [<ffffffffa0343106>] cleaner_kthread+0x100/0x12f [btrfs]
(...)
[112065.264493] ---[ end trace dd7903a975a31a08 ]---
[112065.264673] BTRFS: error (device sdc) in btrfs_remove_chunk:2625: errno=-28 No space left
[112065.264997] BTRFS info (device sdc): forced readonly
9) The clear kthread sees that the BTRFS_FS_STATE_ERROR bit is set in
fs_info->fs_state and calls btrfs_cleanup_transaction(), which in
turn calls btrfs_destroy_pinned_extent();
10) Then btrfs_destroy_pinned_extent() iterates over all the ranges
marked as dirty in fs_info->freed_extents[], and for each one
it calls discard, if the fs was mounted with "-o discard", and
adds the range to the free space cache of the respective block
group;
11) btrfs_trim_block_group(), invoked from the fitrim ioctl code path,
sees the free space entries and performs a discard;
12) After an umount and mount (or fsck), our eb's location on disk was full
of zeroes, and it should have been untouched, because it was marked as
dirty in the fs_info->pinned_extents tree, and therefore used by the
trees that the last committed superblock points to.
Fix this by not performing a discard and not adding the ranges to the free space
caches - it's useless from this point since the fs is now in readonly mode and
we won't write free space caches to disk anymore (otherwise we would leak space)
nor any new superblock. By not adding the ranges to the free space caches, it
prevents other code paths from allocating that space and write to it as well,
therefore being safer and simpler.
This isn't a new problem, as it's been present since 2011 (git commit
acce952b0263825da32cf10489413dec78053347).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
fs/btrfs/disk-io.c | 6 ------
fs/btrfs/extent-tree.c | 10 ++++++----
2 files changed, 6 insertions(+), 10 deletions(-)
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4043,12 +4043,6 @@ again:
if (ret)
break;
- /* opt_discard */
- if (btrfs_test_opt(root, DISCARD))
- ret = btrfs_error_discard_extent(root, start,
- end + 1 - start,
- NULL);
-
clear_extent_dirty(unpin, start, end, GFP_NOFS);
btrfs_error_unpin_extent_range(root, start, end);
cond_resched();
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5715,7 +5715,8 @@ void btrfs_prepare_extent_commit(struct
update_global_block_rsv(fs_info);
}
-static int unpin_extent_range(struct btrfs_root *root, u64 start, u64 end)
+static int unpin_extent_range(struct btrfs_root *root, u64 start, u64 end,
+ const bool return_free_space)
{
struct btrfs_fs_info *fs_info = root->fs_info;
struct btrfs_block_group_cache *cache = NULL;
@@ -5739,7 +5740,8 @@ static int unpin_extent_range(struct btr
if (start < cache->last_byte_to_unpin) {
len = min(len, cache->last_byte_to_unpin - start);
- btrfs_add_free_space(cache, start, len);
+ if (return_free_space)
+ btrfs_add_free_space(cache, start, len);
}
start += len;
@@ -5803,7 +5805,7 @@ int btrfs_finish_extent_commit(struct bt
end + 1 - start, NULL);
clear_extent_dirty(unpin, start, end, GFP_NOFS);
- unpin_extent_range(root, start, end);
+ unpin_extent_range(root, start, end, true);
cond_resched();
}
@@ -9487,7 +9489,7 @@ out:
int btrfs_error_unpin_extent_range(struct btrfs_root *root, u64 start, u64 end)
{
- return unpin_extent_range(root, start, end);
+ return unpin_extent_range(root, start, end, false);
}
int btrfs_error_discard_extent(struct btrfs_root *root, u64 bytenr,
next prev parent reply other threads:[~2015-01-07 2:23 UTC|newest]
Thread overview: 75+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-01-07 2:15 [PATCH 3.17 00/73] 3.17.8-stable review Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 01/73] isofs: Fix infinite looping over CE entries Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 02/73] x86/tls: Validate TLS entries to protect espfix Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 03/73] x86/tls: Disallow unusual TLS segments Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 04/73] x86_64, switch_to(): Load TLS descriptors before switching DS and ES Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 05/73] x86, kvm: Clear paravirt_enabled on KVM guests for espfix32s benefit Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 06/73] md/bitmap: always wait for writes on unplug Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 07/73] mfd: twl4030-power: Fix regression with missing compatible flag Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 08/73] mfd: tc6393xb: Fail ohci suspend if full state restore is required Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 09/73] mmc: dw_mmc: avoid write to CDTHRCTL on older versions Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 10/73] mmc: omap_hsmmc: Fix UHS card with DDR50 support Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 11/73] mmc: block: add newline to sysfs display of force_ro Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 12/73] mmc: sdhci-pci-o2micro: Fix Dell E5440 issue Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 13/73] megaraid_sas: corrected return of wait_event from abort frame path Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 14/73] regulator: anatop: Set default voltage selector for vddpu Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 15/73] scsi: correct return values for .eh_abort_handler implementations Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 16/73] f2fs: fix possible data corruption in f2fs_write_begin() Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 17/73] nfs41: fix nfs4_proc_layoutget error handling Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 18/73] dcache: fix kmemcheck warning in switch_names Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 19/73] dm bufio: fix memleak when using a dm_buffers inline bio Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 20/73] dm crypt: use memzero_explicit for on-stack buffer Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 21/73] dm cache: only use overwrite optimisation for promotion when in writeback mode Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 22/73] dm cache: dirty flag was mistakenly being cleared when promoting via overwrite Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 23/73] dm cache: fix spurious cell_defer when dealing with partial block at end of device Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 24/73] dm space map metadata: fix sm_bootstrap_get_nr_blocks() Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 25/73] dm thin: fix inability to discard blocks when in out-of-data-space mode Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 26/73] dm thin: fix missing out-of-data-space to write mode transition if blocks are released Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 27/73] dm thin: fix a race in thin_dtr Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 28/73] brcmfmac: Fix bitmap malloc bug in msgbuf Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 29/73] arm64: Add COMPAT_HWCAP_LPAE Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 30/73] drm/tegra: gem: dumb: pitch and size are outputs Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 31/73] ARM: tegra: Re-add removed SoC id macro to tegra_resume() Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 32/73] ARM: mvebu: make the coherency_ll.S functions work with no coherency fabric Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 33/73] ARM: mvebu: disable I/O coherency on non-SMP situations on Armada 370/375/38x/XP Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 35/73] x86/asm/traps: Disable tracing and kprobes in fixup_bad_iret and sync_regs Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 36/73] x86/tls: Dont validate lm in set_thread_area() after all Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 37/73] isofs: Fix unchecked printing of ER records Greg Kroah-Hartman
2015-01-07 2:15 ` [PATCH 3.17 38/73] KEYS: Fix stale key registration at error path Greg Kroah-Hartman
2015-01-07 2:16 ` [PATCH 3.17 40/73] mac80211: free management frame keys when removing station Greg Kroah-Hartman
2015-01-07 2:16 ` [PATCH 3.17 41/73] mnt: Fix a memory stomp in umount Greg Kroah-Hartman
2015-01-07 2:16 ` [PATCH 3.17 42/73] thermal: Fix error path in thermal_init() Greg Kroah-Hartman
2015-01-07 2:16 ` [PATCH 3.17 43/73] mnt: Implicitly add MNT_NODEV on remount when it was implicitly added by mount Greg Kroah-Hartman
2015-01-07 2:16 ` [PATCH 3.17 44/73] mnt: Update unprivileged remount test Greg Kroah-Hartman
2015-01-07 2:16 ` [PATCH 3.17 45/73] umount: Disallow unprivileged mount force Greg Kroah-Hartman
2015-01-07 2:16 ` [PATCH 3.17 46/73] groups: Consolidate the setgroups permission checks Greg Kroah-Hartman
2015-01-07 2:16 ` [PATCH 3.17 47/73] userns: Document what the invariant required for safe unprivileged mappings Greg Kroah-Hartman
2015-01-07 2:16 ` [PATCH 3.17 48/73] userns: Dont allow setgroups until a gid mapping has been setablished Greg Kroah-Hartman
2015-01-07 2:16 ` [PATCH 3.17 49/73] userns: Dont allow unprivileged creation of gid mappings Greg Kroah-Hartman
2015-01-07 2:16 ` [PATCH 3.17 50/73] userns: Check euid no fsuid when establishing an unprivileged uid mapping Greg Kroah-Hartman
2015-01-07 2:16 ` [PATCH 3.17 51/73] userns: Only allow the creator of the userns unprivileged mappings Greg Kroah-Hartman
2015-01-07 2:16 ` [PATCH 3.17 52/73] userns: Rename id_map_mutex to userns_state_mutex Greg Kroah-Hartman
2015-01-07 2:16 ` [PATCH 3.17 53/73] userns: Add a knob to disable setgroups on a per user namespace basis Greg Kroah-Hartman
2015-01-07 2:16 ` [PATCH 3.17 54/73] userns: Allow setting gid_maps without privilege when setgroups is disabled Greg Kroah-Hartman
2015-01-07 2:16 ` [PATCH 3.17 55/73] userns: Unbreak the unprivileged remount tests Greg Kroah-Hartman
2015-01-07 2:16 ` [PATCH 3.17 56/73] audit: use supplied gfp_mask from audit_buffer in kauditd_send_multicast_skb Greg Kroah-Hartman
2015-01-07 2:16 ` [PATCH 3.17 57/73] audit: dont attempt to lookup PIDs when changing PID filtering audit rules Greg Kroah-Hartman
2015-01-07 2:16 ` [PATCH 3.17 58/73] audit: restore AUDIT_LOGINUID unset ABI Greg Kroah-Hartman
2015-01-07 2:16 ` [PATCH 3.17 59/73] crypto: af_alg - fix backlog handling Greg Kroah-Hartman
2015-01-07 2:16 ` [PATCH 3.17 60/73] ncpfs: return proper error from NCP_IOC_SETROOT ioctl Greg Kroah-Hartman
2015-01-07 2:16 ` [PATCH 3.17 61/73] exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting Greg Kroah-Hartman
2015-01-07 2:16 ` [PATCH 3.17 62/73] udf: Check path length when reading symlink Greg Kroah-Hartman
2015-01-07 2:16 ` [PATCH 3.17 63/73] udf: Verify i_size when loading inode Greg Kroah-Hartman
2015-01-07 2:16 ` [PATCH 3.17 64/73] udf: Verify symlink size before loading it Greg Kroah-Hartman
2015-01-07 2:16 ` [PATCH 3.17 65/73] udf: Check component length before reading it Greg Kroah-Hartman
2015-01-07 2:16 ` [PATCH 3.17 66/73] eCryptfs: Force RO mount when encrypted view is enabled Greg Kroah-Hartman
2015-01-07 2:16 ` [PATCH 3.17 67/73] eCryptfs: Remove buggy and unnecessary write in file name decode routine Greg Kroah-Hartman
2015-01-07 2:16 ` [PATCH 3.17 68/73] Btrfs: make sure we wait on logged extents when fsycning two subvols Greg Kroah-Hartman
2015-01-07 2:16 ` [PATCH 3.17 69/73] btrfs: fix wrong accounting of raid1 data profile in statfs Greg Kroah-Hartman
2015-01-07 2:16 ` [PATCH 3.17 70/73] Btrfs: do not move em to modified list when unpinning Greg Kroah-Hartman
2015-01-07 2:16 ` [PATCH 3.17 71/73] Btrfs: make sure logged extents complete in the current transaction V3 Greg Kroah-Hartman
2015-01-07 2:16 ` [PATCH 3.17 72/73] Btrfs: fix loop writing of async reclaim Greg Kroah-Hartman
2015-01-07 2:16 ` Greg Kroah-Hartman [this message]
2015-01-07 13:46 ` [PATCH 3.17 00/73] 3.17.8-stable review Guenter Roeck
2015-01-07 23:46 ` Greg Kroah-Hartman
2015-01-07 23:34 ` Shuah Khan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20150107021559.755034943@linuxfoundation.org \
--to=gregkh@linuxfoundation.org \
--cc=clm@fb.com \
--cc=fdmanana@suse.com \
--cc=linux-kernel@vger.kernel.org \
--cc=stable@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).