From: Liu Bo <bo.li.liu@oracle.com>
To: linux-btrfs@vger.kernel.org
Cc: Marcel Ritter <ritter.marcel@gmail.com>,
Christian Robert <christian.robert@polymtl.ca>,
<alanqk@gmail.com>,
Konstantinos Skarlatos <k.skarlatos@gmail.com>,
David Sterba <dsterba@suse.cz>,
Martin Steigerwald <Martin@lichtvoll.de>,
Josef Bacik <jbacik@fb.com>, Chris Mason <clm@fb.com>
Subject: [PATCH v10 16/16] Btrfs: fix dedup enospc problem
Date: Thu, 10 Apr 2014 11:48:46 +0800 [thread overview]
Message-ID: <1397101727-20806-17-git-send-email-bo.li.liu@oracle.com> (raw)
In-Reply-To: <1397101727-20806-1-git-send-email-bo.li.liu@oracle.com>
In the case of dedupe, btrfs will produce large number of delayed refs, and
processing them can very likely eat all of the space reserved in
global_block_rsv, and we'll end up with transaction abortion due to ENOSPC.
I tried several different ways to reserve more space for global_block_rsv to
hope it's enough for flushing delayed refs, but I failed and code could
become very messy.
I found that with high delayed refs pressure, the throttle work in the
end_transaction had little use since it didn't block new delayed refs'
insertion, so I put throttle stuff into the very start stage,
i.e. start_transaction.
We take the worst case into account in the throttle code, that is,
every delayed_refs would update btree, so when we reach the limit that
it may use up all the reserved space of global_block_rsv, we kick
transaction_kthread to commit transaction to process these delayed refs,
refresh global_block_rsv's space, and get pinned space back as well.
That way we get rid of annoy ENOSPC problem.
However, this leads to a new problem that it cannot use along with option
"flushoncommit", otherwise it can cause ABBA deadlock between
commit_transaction between ordered extents flush.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
fs/btrfs/extent-tree.c | 50 ++++++++++++++++++++++++++++++++++++++-----------
fs/btrfs/ordered-data.c | 6 ++++++
fs/btrfs/transaction.c | 41 ++++++++++++++++++++++++++++++++++++++++
fs/btrfs/transaction.h | 1 +
4 files changed, 87 insertions(+), 11 deletions(-)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 6f8b012..ec6f42d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2695,24 +2695,52 @@ static inline u64 heads_to_leaves(struct btrfs_root *root, u64 heads)
int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans,
struct btrfs_root *root)
{
+ struct btrfs_delayed_ref_root *delayed_refs;
struct btrfs_block_rsv *global_rsv;
- u64 num_heads = trans->transaction->delayed_refs.num_heads_ready;
+ u64 num_heads;
+ u64 num_entries;
u64 num_bytes;
int ret = 0;
- num_bytes = btrfs_calc_trans_metadata_size(root, 1);
- num_heads = heads_to_leaves(root, num_heads);
- if (num_heads > 1)
- num_bytes += (num_heads - 1) * root->leafsize;
- num_bytes <<= 1;
global_rsv = &root->fs_info->global_block_rsv;
- /*
- * If we can't allocate any more chunks lets make sure we have _lots_ of
- * wiggle room since running delayed refs can create more delayed refs.
- */
- if (global_rsv->space_info->full)
+ if (trans) {
+ num_heads = trans->transaction->delayed_refs.num_heads_ready;
+ num_bytes = btrfs_calc_trans_metadata_size(root, 1);
+ num_heads = heads_to_leaves(root, num_heads);
+ if (num_heads > 1)
+ num_bytes += (num_heads - 1) * root->leafsize;
num_bytes <<= 1;
+ /*
+ * If we can't allocate any more chunks lets make sure we have
+ * _lots_ of wiggle room since running delayed refs can create
+ * more delayed refs.
+ */
+ if (global_rsv->space_info->full)
+ num_bytes <<= 1;
+ } else {
+ if (root->fs_info->dedup_bs == 0)
+ return 0;
+
+ /* dedup enabled */
+ spin_lock(&root->fs_info->trans_lock);
+ if (!root->fs_info->running_transaction) {
+ spin_unlock(&root->fs_info->trans_lock);
+ return 0;
+ }
+
+ delayed_refs =
+ &root->fs_info->running_transaction->delayed_refs;
+
+ num_entries = atomic_read(&delayed_refs->num_entries);
+ num_heads = delayed_refs->num_heads;
+
+ spin_unlock(&root->fs_info->trans_lock);
+
+ /* The worst case */
+ num_bytes = (num_entries - num_heads) *
+ btrfs_calc_trans_metadata_size(root, 1);
+ }
spin_lock(&global_rsv->lock);
if (global_rsv->reserved <= num_bytes)
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index c520e13..72c0caa 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -747,6 +747,12 @@ int btrfs_run_ordered_operations(struct btrfs_trans_handle *trans,
&cur_trans->ordered_operations);
spin_unlock(&root->fs_info->ordered_root_lock);
+ if (cur_trans->blocked) {
+ cur_trans->blocked = 0;
+ if (waitqueue_active(&cur_trans->commit_wait))
+ wake_up(&cur_trans->commit_wait);
+ }
+
work = btrfs_alloc_delalloc_work(inode, wait, 1);
if (!work) {
spin_lock(&root->fs_info->ordered_root_lock);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index a04707f..9937eb2 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -215,6 +215,7 @@ loop:
cur_trans->transid = fs_info->generation;
fs_info->running_transaction = cur_trans;
cur_trans->aborted = 0;
+ cur_trans->blocked = 1;
spin_unlock(&fs_info->trans_lock);
return 0;
@@ -329,6 +330,27 @@ static void wait_current_trans(struct btrfs_root *root)
wait_event(root->fs_info->transaction_wait,
cur_trans->state >= TRANS_STATE_UNBLOCKED ||
cur_trans->aborted);
+
+ btrfs_put_transaction(cur_trans);
+ } else {
+ spin_unlock(&root->fs_info->trans_lock);
+ }
+}
+
+static noinline void wait_current_trans_for_commit(struct btrfs_root *root)
+{
+ struct btrfs_transaction *cur_trans;
+
+ spin_lock(&root->fs_info->trans_lock);
+ cur_trans = root->fs_info->running_transaction;
+ if (cur_trans && is_transaction_blocked(cur_trans)) {
+ atomic_inc(&cur_trans->use_count);
+ spin_unlock(&root->fs_info->trans_lock);
+
+ wait_event(cur_trans->commit_wait,
+ cur_trans->state >= TRANS_STATE_COMPLETED ||
+ cur_trans->aborted || cur_trans->blocked == 0);
+
btrfs_put_transaction(cur_trans);
} else {
spin_unlock(&root->fs_info->trans_lock);
@@ -436,6 +458,25 @@ again:
if (may_wait_transaction(root, type))
wait_current_trans(root);
+ /*
+ * In the case of dedupe, we need to throttle delayed refs at the
+ * very start stage, otherwise we'd run into ENOSPC because more
+ * delayed refs are added while processing delayed refs.
+ */
+ if (root->fs_info->dedup_bs > 0 && type == TRANS_JOIN &&
+ btrfs_check_space_for_delayed_refs(NULL, root)) {
+ struct btrfs_transaction *cur_trans;
+
+ spin_lock(&root->fs_info->trans_lock);
+ cur_trans = root->fs_info->running_transaction;
+ if (cur_trans && cur_trans->state == TRANS_STATE_RUNNING)
+ cur_trans->state = TRANS_STATE_BLOCKED;
+ spin_unlock(&root->fs_info->trans_lock);
+
+ wake_up_process(root->fs_info->transaction_kthread);
+ wait_current_trans_for_commit(root);
+ }
+
do {
ret = join_transaction(root, type);
if (ret == -EBUSY) {
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 6ac037e..ac58d43 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -59,6 +59,7 @@ struct btrfs_transaction {
struct list_head pending_chunks;
struct btrfs_delayed_ref_root delayed_refs;
int aborted;
+ int blocked;
};
#define __TRANS_FREEZABLE (1U << 0)
--
1.8.1.4
next prev parent reply other threads:[~2014-04-10 3:50 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-04-10 3:48 [RFC PATCH v10 00/16] Online(inband) data deduplication Liu Bo
2014-04-10 3:48 ` [PATCH v10 01/16] Btrfs: disable qgroups accounting when quota_enable is 0 Liu Bo
2014-04-10 3:48 ` [PATCH v10 02/16] Btrfs: introduce dedup tree and relatives Liu Bo
2014-04-10 3:48 ` [PATCH v10 03/16] Btrfs: introduce dedup tree operations Liu Bo
2014-04-10 3:48 ` [PATCH v10 04/16] Btrfs: introduce dedup state Liu Bo
2014-04-10 3:48 ` [PATCH v10 05/16] Btrfs: make ordered extent aware of dedup Liu Bo
2014-04-10 3:48 ` [PATCH v10 06/16] Btrfs: online(inband) data dedup Liu Bo
2014-04-10 3:48 ` [PATCH v10 07/16] Btrfs: skip dedup reference during backref walking Liu Bo
2014-04-10 3:48 ` [PATCH v10 08/16] Btrfs: don't return space for dedup extent Liu Bo
2014-04-10 3:48 ` [PATCH v10 09/16] Btrfs: add ioctl of dedup control Liu Bo
2014-04-10 3:48 ` [PATCH v10 10/16] Btrfs: improve the delayed refs process in rm case Liu Bo
2014-04-10 3:48 ` [PATCH v10 11/16] Btrfs: fix a crash of dedup ref Liu Bo
2014-04-10 3:48 ` [PATCH v10 12/16] Btrfs: fix deadlock of dedup work Liu Bo
2014-04-10 3:48 ` [PATCH v10 13/16] Btrfs: fix transactin abortion in __btrfs_free_extent Liu Bo
2014-04-10 3:48 ` [PATCH v10 14/16] Btrfs: fix wrong pinned bytes " Liu Bo
2014-04-10 3:48 ` [PATCH v10 15/16] Btrfs: use total_bytes instead of bytes_used for global_rsv Liu Bo
2014-04-10 3:48 ` Liu Bo [this message]
2014-04-10 3:48 ` [PATCH v5] Btrfs-progs: add dedup subcommand Liu Bo
2014-04-10 9:08 ` [RFC PATCH v10 00/16] Online(inband) data deduplication Konstantinos Skarlatos
2014-04-10 15:44 ` Liu Bo
2014-04-10 15:55 ` Liu Bo
2014-04-11 9:28 ` Martin Steigerwald
2014-04-11 9:51 ` Liu Bo
2014-04-14 8:41 ` Test results for " Konstantinos Skarlatos
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1397101727-20806-17-git-send-email-bo.li.liu@oracle.com \
--to=bo.li.liu@oracle.com \
--cc=Martin@lichtvoll.de \
--cc=alanqk@gmail.com \
--cc=christian.robert@polymtl.ca \
--cc=clm@fb.com \
--cc=dsterba@suse.cz \
--cc=jbacik@fb.com \
--cc=k.skarlatos@gmail.com \
--cc=linux-btrfs@vger.kernel.org \
--cc=ritter.marcel@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).