linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/5] apply rwlock for extent state
@ 2012-03-15  9:12 Liu Bo
  2012-03-15  9:12 ` [PATCH 1/5] Btrfs: use radix tree for checksum Liu Bo
                   ` (5 more replies)
  0 siblings, 6 replies; 11+ messages in thread
From: Liu Bo @ 2012-03-15  9:12 UTC (permalink / raw)
  To: linux-btrfs

This patchset is against one of project ideas, RBtree lock contention:
"Btrfs uses a number of rbtrees to index in-memory data structures.
Some of these are dominated by reads, and the lock contention from searching
them is showing up in profiles.  We need to look into an RCU and sequence
counter combination to allow lockless reads."

The goal is to use RCU, but we take it as a long term one, and instead we use
rwlock until we find a mature rcu structure for lockless read.

So what we need to do is to make the code RCU friendly, and the idea mainly
comes from Chris Mason:
Quoted:
"I think the extent_state code can be much more RCU friendly if we separate
the operations on the tree from operations on the individual state.
In general, we can gain a lot of performance if we are able to reduce
the write locks taken at endio time.  Especially for reads, these are
critical."

If we read a 10M file with pagecache,
$ dd if=/mnt/btrfs/foobar of=/dev/null bs=1M

without this patchset,
        lock for read : lock for write ~= 2:1
with this patchset,
        lock for read : lock for write ~= 200:1

I've run through xfstests, and no bugs jump out by then.
MORE TESTS ARE WELCOME!


Liu Bo (5):
  Btrfs: use radix tree for checksum
  Btrfs: merge adjacent states as much as possible
  Btrfs: use large extent range for read and its endio
  Btrfs: apply rwlock for extent state
  Btrfs: avoid splitting state when truncating pagecache

 fs/btrfs/extent_io.c        |  768 +++++++++++++++++++++++++++++++++----------
 fs/btrfs/extent_io.h        |    5 +-
 fs/btrfs/free-space-cache.c |    5 +
 fs/btrfs/inode.c            |   22 +-
 4 files changed, 614 insertions(+), 186 deletions(-)


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/5] Btrfs: use radix tree for checksum
  2012-03-15  9:12 [RFC PATCH 0/5] apply rwlock for extent state Liu Bo
@ 2012-03-15  9:12 ` Liu Bo
  2012-03-15  9:12 ` [PATCH 2/5] Btrfs: merge adjacent states as much as possible Liu Bo
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Liu Bo @ 2012-03-15  9:12 UTC (permalink / raw)
  To: linux-btrfs

We used to issue a checksum to an extent state of 4K range for read endio,
but we want to use larger range for performance optimization, so instead we
create a radix tree for checksum, where an item stands for checksum of 4K data.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
---
 fs/btrfs/extent_io.c |   86 ++++++++++++-------------------------------------
 fs/btrfs/extent_io.h |    2 +
 fs/btrfs/inode.c     |    7 +---
 3 files changed, 24 insertions(+), 71 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index a55fbe6..e6433d4 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -109,10 +109,12 @@ void extent_io_tree_init(struct extent_io_tree *tree,
 {
 	tree->state = RB_ROOT;
 	INIT_RADIX_TREE(&tree->buffer, GFP_ATOMIC);
+	INIT_RADIX_TREE(&tree->csum, GFP_ATOMIC);
 	tree->ops = NULL;
 	tree->dirty_bytes = 0;
 	spin_lock_init(&tree->lock);
 	spin_lock_init(&tree->buffer_lock);
+	spin_lock_init(&tree->csum_lock);
 	tree->mapping = mapping;
 }
 
@@ -686,15 +688,6 @@ static void cache_state(struct extent_state *state,
 	}
 }
 
-static void uncache_state(struct extent_state **cached_ptr)
-{
-	if (cached_ptr && (*cached_ptr)) {
-		struct extent_state *state = *cached_ptr;
-		*cached_ptr = NULL;
-		free_extent_state(state);
-	}
-}
-
 /*
  * set some bits on a range in the tree.  This may require allocations or
  * sleeping, so the gfp mask is used to indicate what is allowed.
@@ -1649,56 +1642,32 @@ out:
  */
 int set_state_private(struct extent_io_tree *tree, u64 start, u64 private)
 {
-	struct rb_node *node;
-	struct extent_state *state;
 	int ret = 0;
 
-	spin_lock(&tree->lock);
-	/*
-	 * this search will find all the extents that end after
-	 * our range starts.
-	 */
-	node = tree_search(tree, start);
-	if (!node) {
-		ret = -ENOENT;
-		goto out;
-	}
-	state = rb_entry(node, struct extent_state, rb_node);
-	if (state->start != start) {
-		ret = -ENOENT;
-		goto out;
-	}
-	state->private = private;
-out:
-	spin_unlock(&tree->lock);
+	spin_lock(&tree->csum_lock);
+	ret = radix_tree_insert(&tree->csum, (unsigned long)start,
+			       (void *)((unsigned long)private << 1));
+	BUG_ON(ret);
+	spin_unlock(&tree->csum_lock);
 	return ret;
 }
 
 int get_state_private(struct extent_io_tree *tree, u64 start, u64 *private)
 {
-	struct rb_node *node;
-	struct extent_state *state;
-	int ret = 0;
+	void **slot = NULL;
 
-	spin_lock(&tree->lock);
-	/*
-	 * this search will find all the extents that end after
-	 * our range starts.
-	 */
-	node = tree_search(tree, start);
-	if (!node) {
-		ret = -ENOENT;
-		goto out;
-	}
-	state = rb_entry(node, struct extent_state, rb_node);
-	if (state->start != start) {
-		ret = -ENOENT;
-		goto out;
+	spin_lock(&tree->csum_lock);
+	slot = radix_tree_lookup_slot(&tree->csum, (unsigned long)start);
+	if (!slot) {
+		spin_unlock(&tree->csum_lock);
+		return -ENOENT;
 	}
-	*private = state->private;
-out:
-	spin_unlock(&tree->lock);
-	return ret;
+	*private = (u64)(*slot) >> 1;
+
+	radix_tree_delete(&tree->csum, (unsigned long)start);
+	spin_unlock(&tree->csum_lock);
+
+	return 0;
 }
 
 /*
@@ -2266,7 +2235,6 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
 	do {
 		struct page *page = bvec->bv_page;
 		struct extent_state *cached = NULL;
-		struct extent_state *state;
 
 		pr_debug("end_bio_extent_readpage: bi_vcnt=%d, idx=%d, err=%d, "
 			 "mirror=%ld\n", bio->bi_vcnt, bio->bi_idx, err,
@@ -2285,20 +2253,9 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
 		if (++bvec <= bvec_end)
 			prefetchw(&bvec->bv_page->flags);
 
-		spin_lock(&tree->lock);
-		state = find_first_extent_bit_state(tree, start, EXTENT_LOCKED);
-		if (state && state->start == start) {
-			/*
-			 * take a reference on the state, unlock will drop
-			 * the ref
-			 */
-			cache_state(state, &cached);
-		}
-		spin_unlock(&tree->lock);
-
 		if (uptodate && tree->ops && tree->ops->readpage_end_io_hook) {
 			ret = tree->ops->readpage_end_io_hook(page, start, end,
-							      state);
+							      NULL);
 			if (ret)
 				uptodate = 0;
 			else
@@ -2325,13 +2282,12 @@ error_handled:
 					test_bit(BIO_UPTODATE, &bio->bi_flags);
 				if (err)
 					uptodate = 0;
-				uncache_state(&cached);
 				continue;
 			}
 			if (tree->ops && tree->ops->readpage_io_failed_hook) {
 				ret = tree->ops->readpage_io_failed_hook(
 							bio, page, start, end,
-							failed_mirror, state);
+							failed_mirror, NULL);
 				if (ret == 0)
 					goto error_handled;
 			}
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index cecc351..d85e361 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -95,10 +95,12 @@ struct extent_io_ops {
 struct extent_io_tree {
 	struct rb_root state;
 	struct radix_tree_root buffer;
+	struct radix_tree_root csum;
 	struct address_space *mapping;
 	u64 dirty_bytes;
 	spinlock_t lock;
 	spinlock_t buffer_lock;
+	spinlock_t csum_lock;
 	struct extent_io_ops *ops;
 };
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index cbeb2e3..e9c4d6c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1867,12 +1867,7 @@ static int btrfs_readpage_end_io_hook(struct page *page, u64 start, u64 end,
 		return 0;
 	}
 
-	if (state && state->start == start) {
-		private = state->private;
-		ret = 0;
-	} else {
-		ret = get_state_private(io_tree, start, &private);
-	}
+	ret = get_state_private(io_tree, start, &private);
 	kaddr = kmap_atomic(page, KM_USER0);
 	if (ret)
 		goto zeroit;
-- 
1.6.5.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 2/5] Btrfs: merge adjacent states as much as possible
  2012-03-15  9:12 [RFC PATCH 0/5] apply rwlock for extent state Liu Bo
  2012-03-15  9:12 ` [PATCH 1/5] Btrfs: use radix tree for checksum Liu Bo
@ 2012-03-15  9:12 ` Liu Bo
  2012-03-15  9:12 ` [PATCH 3/5] Btrfs: use large extent range for read and its endio Liu Bo
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Liu Bo @ 2012-03-15  9:12 UTC (permalink / raw)
  To: linux-btrfs

In order to reduce write locks, we do merge_state as much as much as possible.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
---
 fs/btrfs/extent_io.c |   47 +++++++++++++++++++++++++++--------------------
 1 files changed, 27 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index e6433d4..7e76403 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -267,29 +267,36 @@ static void merge_state(struct extent_io_tree *tree,
 	if (state->state & (EXTENT_IOBITS | EXTENT_BOUNDARY))
 		return;
 
-	other_node = rb_prev(&state->rb_node);
-	if (other_node) {
+	while (1) {
+		other_node = rb_prev(&state->rb_node);
+		if (!other_node)
+			break;
 		other = rb_entry(other_node, struct extent_state, rb_node);
-		if (other->end == state->start - 1 &&
-		    other->state == state->state) {
-			merge_cb(tree, state, other);
-			state->start = other->start;
-			other->tree = NULL;
-			rb_erase(&other->rb_node, &tree->state);
-			free_extent_state(other);
-		}
+		if (other->end != state->start - 1 ||
+		    other->state != state->state)
+			break;
+
+		merge_cb(tree, state, other);
+		state->start = other->start;
+		other->tree = NULL;
+		rb_erase(&other->rb_node, &tree->state);
+		free_extent_state(other);
 	}
-	other_node = rb_next(&state->rb_node);
-	if (other_node) {
+
+	while (1) {
+		other_node = rb_next(&state->rb_node);
+		if (!other_node)
+			break;
 		other = rb_entry(other_node, struct extent_state, rb_node);
-		if (other->start == state->end + 1 &&
-		    other->state == state->state) {
-			merge_cb(tree, state, other);
-			state->end = other->end;
-			other->tree = NULL;
-			rb_erase(&other->rb_node, &tree->state);
-			free_extent_state(other);
-		}
+		if (other->start != state->end + 1 ||
+		    other->state != state->state)
+			break;
+
+		merge_cb(tree, state, other);
+		state->end = other->end;
+		other->tree = NULL;
+		rb_erase(&other->rb_node, &tree->state);
+		free_extent_state(other);
 	}
 }
 
-- 
1.6.5.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 3/5] Btrfs: use large extent range for read and its endio
  2012-03-15  9:12 [RFC PATCH 0/5] apply rwlock for extent state Liu Bo
  2012-03-15  9:12 ` [PATCH 1/5] Btrfs: use radix tree for checksum Liu Bo
  2012-03-15  9:12 ` [PATCH 2/5] Btrfs: merge adjacent states as much as possible Liu Bo
@ 2012-03-15  9:12 ` Liu Bo
  2012-03-15  9:12 ` [PATCH 4/5] Btrfs: apply rwlock for extent state Liu Bo
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Liu Bo @ 2012-03-15  9:12 UTC (permalink / raw)
  To: linux-btrfs

we use larger extent state range for both readpages and read endio, so that
we can lock or unlock less and avoid most of split ops, then we'll reduce write
locks taken at endio time.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
---
 fs/btrfs/extent_io.c |  201 +++++++++++++++++++++++++++++++++++++++++++++-----
 1 files changed, 182 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 7e76403..c3b2a2e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2231,17 +2231,25 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
 	struct bio_vec *bvec_end = bio->bi_io_vec + bio->bi_vcnt - 1;
 	struct bio_vec *bvec = bio->bi_io_vec;
 	struct extent_io_tree *tree;
+	struct extent_state *cached = NULL;
 	u64 start;
 	u64 end;
 	int whole_page;
 	int ret;
+	u64 up_start, up_end, un_start, un_end;
+	int up_first, un_first;
+	int for_uptodate[bio->bi_vcnt];
+	int i = 0;
+
+	up_start = un_start = (u64)-1;
+	up_end = un_end = 0;
+	up_first = un_first = 1;
 
 	if (err)
 		uptodate = 0;
 
 	do {
 		struct page *page = bvec->bv_page;
-		struct extent_state *cached = NULL;
 
 		pr_debug("end_bio_extent_readpage: bi_vcnt=%d, idx=%d, err=%d, "
 			 "mirror=%ld\n", bio->bi_vcnt, bio->bi_idx, err,
@@ -2252,11 +2260,6 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
 			bvec->bv_offset;
 		end = start + bvec->bv_len - 1;
 
-		if (bvec->bv_offset == 0 && bvec->bv_len == PAGE_CACHE_SIZE)
-			whole_page = 1;
-		else
-			whole_page = 0;
-
 		if (++bvec <= bvec_end)
 			prefetchw(&bvec->bv_page->flags);
 
@@ -2300,14 +2303,71 @@ error_handled:
 			}
 		}
 
+		if (uptodate)
+			for_uptodate[i++] = 1;
+		else
+			for_uptodate[i++] = 0;
+
 		if (uptodate) {
-			set_extent_uptodate(tree, start, end, &cached,
-					    GFP_ATOMIC);
+			if (up_first) {
+				up_start = start;
+				up_end = end;
+				up_first = 0;
+			} else {
+				if (up_start == end + 1) {
+					up_start = start;
+				} else if (up_end == start - 1) {
+					up_end = end;
+				} else {
+					set_extent_uptodate(
+							tree, up_start, up_end,
+							&cached, GFP_ATOMIC);
+					up_start = start;
+					up_end = end;
+				}
+			}
 		}
-		unlock_extent_cached(tree, start, end, &cached, GFP_ATOMIC);
+
+		if (un_first) {
+			un_start = start;
+			un_end = end;
+			un_first = 0;
+		} else {
+			if (un_start == end + 1) {
+				un_start = start;
+			} else if (un_end == start - 1) {
+				un_end = end;
+			} else {
+				unlock_extent_cached(tree, un_start, un_end,
+						     &cached, GFP_ATOMIC);
+				un_start = start;
+				un_end = end;
+			}
+		}
+	} while (bvec <= bvec_end);
+
+	cached = NULL;
+	if (up_start < up_end)
+		set_extent_uptodate(tree, up_start, up_end, &cached,
+				    GFP_ATOMIC);
+	if (un_start < un_end)
+		unlock_extent_cached(tree, un_start, un_end, &cached,
+				     GFP_ATOMIC);
+
+	i = 0;
+	bvec = bio->bi_io_vec;
+	do {
+		struct page *page = bvec->bv_page;
+
+		tree = &BTRFS_I(page->mapping->host)->io_tree;
+
+		if (bvec->bv_offset == 0 && bvec->bv_len == PAGE_CACHE_SIZE)
+			whole_page = 1;
+		else
+			whole_page = 0;
 
 		if (whole_page) {
-			if (uptodate) {
+			if (for_uptodate[i++]) {
 				SetPageUptodate(page);
 			} else {
 				ClearPageUptodate(page);
@@ -2315,7 +2375,7 @@ error_handled:
 			}
 			unlock_page(page);
 		} else {
-			if (uptodate) {
+			if (for_uptodate[i++]) {
 				check_page_uptodate(tree, page);
 			} else {
 				ClearPageUptodate(page);
@@ -2323,6 +2383,7 @@ error_handled:
 			}
 			check_page_locked(tree, page);
 		}
+		++bvec;
 	} while (bvec <= bvec_end);
 
 	bio_put(bio);
@@ -2460,7 +2521,7 @@ static int __extent_read_full_page(struct extent_io_tree *tree,
 				   struct page *page,
 				   get_extent_t *get_extent,
 				   struct bio **bio, int mirror_num,
-				   unsigned long *bio_flags)
+				   unsigned long *bio_flags, int range_lock)
 {
 	struct inode *inode = page->mapping->host;
 	u64 start = (u64)page->index << PAGE_CACHE_SHIFT;
@@ -2494,6 +2555,8 @@ static int __extent_read_full_page(struct extent_io_tree *tree,
 
 	end = page_end;
 	while (1) {
+		if (range_lock)
+			break;
 		lock_extent(tree, start, end, GFP_NOFS);
 		ordered = btrfs_lookup_ordered_extent(inode, start);
 		if (!ordered)
@@ -2642,7 +2705,7 @@ int extent_read_full_page(struct extent_io_tree *tree, struct page *page,
 	int ret;
 
 	ret = __extent_read_full_page(tree, page, get_extent, &bio, mirror_num,
-				      &bio_flags);
+				      &bio_flags, 0);
 	if (bio)
 		ret = submit_one_bio(READ, bio, mirror_num, bio_flags);
 	return ret;
@@ -3159,6 +3222,59 @@ int extent_writepages(struct extent_io_tree *tree,
 	return ret;
 }
 
+struct page_list {
+	struct page *page;
+	struct list_head list;
+};
+
+static int process_batch_pages(struct extent_io_tree *tree,
+			       struct address_space *mapping,
+			       struct list_head *lock_pages, int *page_cnt,
+			       u64 lock_start, u64 lock_end,
+				get_extent_t get_extent, struct bio **bio,
+				unsigned long *bio_flags)
+{
+	u64 page_start;
+	struct page_list *plist;
+
+	while (1) {
+		struct btrfs_ordered_extent *ordered = NULL;
+
+		lock_extent(tree, lock_start, lock_end, GFP_NOFS);
+		page_start = lock_start;
+		while (page_start < lock_end) {
+			ordered = btrfs_lookup_ordered_extent(mapping->host,
+							      page_start);
+			if (ordered) {
+				page_start = ordered->file_offset;
+				break;
+			}
+			page_start += PAGE_CACHE_SIZE;
+		}
+		if (!ordered)
+			break;
+		unlock_extent(tree, lock_start, lock_end, GFP_NOFS);
+		btrfs_start_ordered_extent(mapping->host, ordered, 1);
+		btrfs_put_ordered_extent(ordered);
+	}
+
+	plist = NULL;
+	while (!list_empty(lock_pages)) {
+		plist = list_entry(lock_pages->prev, struct page_list, list);
+
+		__extent_read_full_page(tree, plist->page, get_extent,
+					bio, 0, bio_flags, 1);
+		page_cache_release(plist->page);
+		list_del(&plist->list);
+		plist->page = NULL;
+		kfree(plist);
+		(*page_cnt)--;
+	}
+
+	WARN_ON((*page_cnt));
+	return 0;
+}
+
 int extent_readpages(struct extent_io_tree *tree,
 		     struct address_space *mapping,
 		     struct list_head *pages, unsigned nr_pages,
@@ -3167,7 +3283,17 @@ int extent_readpages(struct extent_io_tree *tree,
 	struct bio *bio = NULL;
 	unsigned page_idx;
 	unsigned long bio_flags = 0;
-
+	u64 page_start;
+	u64 page_end;
+	u64 lock_start = (u64)-1;
+	u64 lock_end = 0;
+	struct page_list *plist;
+	int page_cnt = 0;
+	LIST_HEAD(lock_pages);
+	int first = 1;
+
+	lock_start = (u64)-1;
+	lock_end = 0;
 	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
 		struct page *page = list_entry(pages->prev, struct page, lru);
 
@@ -3175,12 +3301,49 @@ int extent_readpages(struct extent_io_tree *tree,
 		list_del(&page->lru);
 		if (!add_to_page_cache_lru(page, mapping,
 					page->index, GFP_NOFS)) {
-			__extent_read_full_page(tree, page, get_extent,
-						&bio, 0, &bio_flags);
+			page_start = (u64)page_offset(page);
+			page_end = page_start + PAGE_CACHE_SIZE - 1;
+
+			if (first) {
+				lock_start = page_start;
+				lock_end = page_end;
+				first = 0;
+			} else {
+				/*
+				 * |--lock range--||--page range--|
+				 * or
+				 * |--page range--||--lock range--|
+				 */
+				if (lock_start != page_end - 1 &&
+				    lock_end != page_start - 1) {
+					process_batch_pages(tree, mapping,
+						&lock_pages, &page_cnt,
+						lock_start, lock_end,
+						get_extent, &bio, &bio_flags);
+
+					lock_start = page_start;
+					lock_end = page_end;
+				} else {
+					lock_start =
+						 min(lock_start, page_start);
+					lock_end = max(lock_end, page_end);
+				}
+			}
+			plist = kmalloc(sizeof(*plist), GFP_NOFS);
+			BUG_ON(!plist);
+			plist->page = page;
+			list_add(&plist->list, &lock_pages);
+			page_cache_get(page);
+			page_cnt++;
 		}
 		page_cache_release(page);
 	}
+
+	if (!list_empty(&lock_pages))
+		process_batch_pages(tree, mapping, &lock_pages, &page_cnt,
+			lock_start, lock_end, get_extent, &bio, &bio_flags);
 	BUG_ON(!list_empty(pages));
+
 	if (bio)
 		submit_one_bio(READ, bio, 0, bio_flags);
 	return 0;
@@ -4001,9 +4164,9 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
 			if (start_i == 0)
 				inc_all_pages = 1;
 			ClearPageError(page);
-			err = __extent_read_full_page(tree, page,
-						      get_extent, &bio,
-						      mirror_num, &bio_flags);
+			err = __extent_read_full_page(
+						tree, page, get_extent, &bio,
+						mirror_num, &bio_flags, 0);
 			if (err)
 				ret = err;
 		} else {
-- 
1.6.5.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 4/5] Btrfs: apply rwlock for extent state
  2012-03-15  9:12 [RFC PATCH 0/5] apply rwlock for extent state Liu Bo
                   ` (2 preceding siblings ...)
  2012-03-15  9:12 ` [PATCH 3/5] Btrfs: use large extent range for read and its endio Liu Bo
@ 2012-03-15  9:12 ` Liu Bo
  2012-03-15  9:12 ` [PATCH 5/5] Btrfs: avoid splitting state when truncating pagecache Liu Bo
  2012-03-21  2:52 ` [RFC PATCH 0/5] apply rwlock for extent state Liu Bo
  5 siblings, 0 replies; 11+ messages in thread
From: Liu Bo @ 2012-03-15  9:12 UTC (permalink / raw)
  To: linux-btrfs

We used to protect both extent state tree and an individual state's state
by tree->lock, but this can be an obstacle of lockless read.

So we seperate them: tree->lock protects the tree while state->lock protects
its state.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
---
 fs/btrfs/extent_io.c |  434 ++++++++++++++++++++++++++++++++++++++++++--------
 fs/btrfs/extent_io.h |    3 +-
 2 files changed, 369 insertions(+), 68 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c3b2a2e..db2f20e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -26,7 +26,7 @@ static struct kmem_cache *extent_buffer_cache;
 static LIST_HEAD(buffers);
 static LIST_HEAD(states);
 
-#define LEAK_DEBUG 0
+#define LEAK_DEBUG 1
 #if LEAK_DEBUG
 static DEFINE_SPINLOCK(leak_lock);
 #endif
@@ -112,7 +112,7 @@ void extent_io_tree_init(struct extent_io_tree *tree,
 	INIT_RADIX_TREE(&tree->csum, GFP_ATOMIC);
 	tree->ops = NULL;
 	tree->dirty_bytes = 0;
-	spin_lock_init(&tree->lock);
+	rwlock_init(&tree->lock);
 	spin_lock_init(&tree->buffer_lock);
 	spin_lock_init(&tree->csum_lock);
 	tree->mapping = mapping;
@@ -138,6 +138,7 @@ static struct extent_state *alloc_extent_state(gfp_t mask)
 #endif
 	atomic_set(&state->refs, 1);
 	init_waitqueue_head(&state->wq);
+	spin_lock_init(&state->lock);
 	return state;
 }
 
@@ -272,6 +273,7 @@ static void merge_state(struct extent_io_tree *tree,
 		if (!other_node)
 			break;
 		other = rb_entry(other_node, struct extent_state, rb_node);
+		/* FIXME: need other->lock? */
 		if (other->end != state->start - 1 ||
 		    other->state != state->state)
 			break;
@@ -288,6 +290,7 @@ static void merge_state(struct extent_io_tree *tree,
 		if (!other_node)
 			break;
 		other = rb_entry(other_node, struct extent_state, rb_node);
+		/* FIXME: need other->lock? */
 		if (other->start != state->end + 1 ||
 		    other->state != state->state)
 			break;
@@ -355,7 +358,10 @@ static int insert_state(struct extent_io_tree *tree,
 		return -EEXIST;
 	}
 	state->tree = tree;
+
+	spin_lock(&state->lock);
 	merge_state(tree, state);
+	spin_unlock(&state->lock);
 	return 0;
 }
 
@@ -401,20 +407,42 @@ static int split_state(struct extent_io_tree *tree, struct extent_state *orig,
 	return 0;
 }
 
-/*
- * utility function to clear some bits in an extent state struct.
- * it will optionally wake up any one waiting on this state (wake == 1), or
- * forcibly remove the state from the tree (delete == 1).
- *
- * If no bits are set on the state struct after clearing things, the
- * struct is freed and removed from the tree
- */
-static int clear_state_bit(struct extent_io_tree *tree,
-			    struct extent_state *state,
-			    int *bits, int wake)
+static struct extent_state *
+alloc_extent_state_atomic(struct extent_state *prealloc)
+{
+	if (!prealloc)
+		prealloc = alloc_extent_state(GFP_ATOMIC);
+
+	return prealloc;
+}
+
+enum extent_lock_type {
+	EXTENT_READ    = 0,
+	EXTENT_WRITE   = 1,
+	EXTENT_RLOCKED = 2,
+	EXTENT_WLOCKED = 3,
+	EXTENT_LAST    = 4,
+};
+
+static struct extent_state *next_state(struct extent_state *state)
+{
+	struct rb_node *next = rb_next(&state->rb_node);
+	if (next)
+		return rb_entry(next, struct extent_state, rb_node);
+	else
+		return NULL;
+}
+
+static int __clear_state_bit(struct extent_io_tree *tree,
+			     struct extent_state *state, int *bits, int wake,
+			     int check)
 {
 	int bits_to_clear = *bits & ~EXTENT_CTLBITS;
-	int ret = state->state & bits_to_clear;
+
+	if (check) {
+		if ((state->state & ~bits_to_clear) == 0)
+			return 1;
+	}
 
 	if ((bits_to_clear & EXTENT_DIRTY) && (state->state & EXTENT_DIRTY)) {
 		u64 range = state->end - state->start + 1;
@@ -425,7 +453,18 @@ static int clear_state_bit(struct extent_io_tree *tree,
 	state->state &= ~bits_to_clear;
 	if (wake)
 		wake_up(&state->wq);
+	return 0;
+}
+
+static struct extent_state *
+try_free_or_merge_state(struct extent_io_tree *tree, struct extent_state *state)
+{
+	struct extent_state *next = NULL;
+
+	BUG_ON(!spin_is_locked(&state->lock));
 	if (state->state == 0) {
+		spin_unlock(&state->lock);
+		next = next_state(state);
 		if (state->tree) {
 			rb_erase(&state->rb_node, &tree->state);
 			state->tree = NULL;
@@ -435,17 +474,120 @@ static int clear_state_bit(struct extent_io_tree *tree,
 		}
 	} else {
 		merge_state(tree, state);
+		spin_unlock(&state->lock);
+		next = next_state(state);
 	}
-	return ret;
+	return next;
 }
 
-static struct extent_state *
-alloc_extent_state_atomic(struct extent_state *prealloc)
+/*
+ * utility function to clear some bits in an extent state struct.
+ * it will optionally wake up any one waiting on this state (wake == 1), or
+ * forcibly remove the state from the tree (delete == 1).
+ *
+ * If no bits are set on the state struct after clearing things, the
+ * struct is freed and removed from the tree
+ */
+static int clear_state_bit(struct extent_io_tree *tree,
+			   struct extent_state *state, int *bits, int wake)
 {
-	if (!prealloc)
-		prealloc = alloc_extent_state(GFP_ATOMIC);
+	__clear_state_bit(tree, state, bits, wake, 0);
+	try_free_or_merge_state(tree, state);
 
-	return prealloc;
+	return 0;
+}
+
+static int test_merge_state(struct extent_io_tree *tree,
+			    struct extent_state *state)
+{
+	struct extent_state *other;
+	struct rb_node *other_node;
+
+	if (state->state & (EXTENT_IOBITS | EXTENT_BOUNDARY))
+		return 0;
+
+	other_node = rb_prev(&state->rb_node);
+	if (other_node) {
+		other = rb_entry(other_node, struct extent_state, rb_node);
+		/* FIXME: need other->lock? */
+		if (other->end == state->start - 1 &&
+		    other->state == state->state)
+			return 1;
+	}
+	other_node = rb_next(&state->rb_node);
+	if (other_node) {
+		other = rb_entry(other_node, struct extent_state, rb_node);
+		/* FIXME: need other->lock? */
+		if (other->start == state->end + 1 &&
+		    other->state == state->state)
+			return 1;
+	}
+
+	return 0;
+}
+
+static void process_merge_state(struct extent_io_tree *tree, u64 start)
+{
+	struct extent_state *state = NULL;
+	struct rb_node *node = NULL;
+
+	if (!tree || start == (u64)-1) {
+		WARN_ON(1);
+		return;
+	}
+
+	write_lock(&tree->lock);
+	node = tree_search(tree, start);
+	if (!node) {
+		printk(KERN_INFO "write side: not find states"
+		       " to merge %llu\n", start);
+		goto out;
+	}
+	state = rb_entry(node, struct extent_state, rb_node);
+	/* should merge all states around this one */
+	spin_lock(&state->lock);
+	merge_state(tree, state);
+	spin_unlock(&state->lock);
+out:
+	write_unlock(&tree->lock);
+}
+
+static void extent_rw_lock(struct extent_io_tree *tree, int *rw)
+{
+	int lock = *rw;
+
+	if (lock == EXTENT_READ) {
+		read_lock(&tree->lock);
+		*rw = EXTENT_RLOCKED;
+	} else if (lock == EXTENT_WRITE) {
+		write_lock(&tree->lock);
+		*rw = EXTENT_WLOCKED;
+	} else {
+		WARN_ON(1);
+	}
+}
+
+static void extent_rw_unlock(struct extent_io_tree *tree, int *rw)
+{
+	int lock = *rw;
+
+	if (lock == EXTENT_RLOCKED)
+		read_unlock(&tree->lock);
+	if (lock == EXTENT_WLOCKED)
+		write_unlock(&tree->lock);
+	*rw = EXTENT_READ;
+}
+
+static int extent_rw_flip(struct extent_io_tree *tree, int *rw)
+{
+	int lock = *rw;
+
+	if (lock == EXTENT_RLOCKED) {
+		read_unlock(&tree->lock);
+		*rw = EXTENT_WRITE;
+		return 1;
+	}
+	return 0;
 }
 
 /*
@@ -469,14 +611,18 @@ int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	struct extent_state *state;
 	struct extent_state *cached;
 	struct extent_state *prealloc = NULL;
-	struct rb_node *next_node;
 	struct rb_node *node;
 	u64 last_end;
+	u64 orig_start = start;
 	int err;
 	int set = 0;
 	int clear = 0;
+	int rw = EXTENT_READ;
+	int free = 0;
+	int merge = 0;
+	int check = 0;
 
-	if (delete)
+	if (delete == 1)
 		bits |= ~EXTENT_CTLBITS;
 	bits |= EXTENT_FIRST_DELALLOC;
 
@@ -489,7 +635,8 @@ again:
 			return -ENOMEM;
 	}
 
-	spin_lock(&tree->lock);
+	/* XXX: after this we're EXTENT_RLOCKED/EXTENT_WLOCKED */
+	extent_rw_lock(tree, &rw);
 	if (cached_state) {
 		cached = *cached_state;
 
@@ -522,14 +669,13 @@ hit_next:
 	WARN_ON(state->end < start);
 	last_end = state->end;
 
-	if (state->end < end && !need_resched())
-		next_node = rb_next(&state->rb_node);
-	else
-		next_node = NULL;
-
+	spin_lock(&state->lock);
 	/* the state doesn't have the wanted bits, go ahead */
-	if (!(state->state & bits))
+	if (!(state->state & bits)) {
+		spin_unlock(&state->lock);
+		state = next_state(state);
 		goto next;
+	}
 
 	/*
 	 *     | ---- desired range ---- |
@@ -548,18 +694,28 @@ hit_next:
 	 */
 
 	if (state->start < start) {
+		/* XXX: split need a write lock */
+		if (extent_rw_flip(tree, &rw)) {
+			spin_unlock(&state->lock);
+			goto again;
+		}
 		prealloc = alloc_extent_state_atomic(prealloc);
 		BUG_ON(!prealloc);
 		err = split_state(tree, state, prealloc, start);
 		BUG_ON(err == -EEXIST);
 		prealloc = NULL;
-		if (err)
+		if (err) {
+			spin_unlock(&state->lock);
 			goto out;
+		}
 		if (state->end <= end) {
+			/* this will unlock state->lock for us */
 			set |= clear_state_bit(tree, state, &bits, wake);
 			if (last_end == (u64)-1)
 				goto out;
 			start = last_end + 1;
+		} else {
+			spin_unlock(&state->lock);
 		}
 		goto search_again;
 	}
@@ -570,42 +726,65 @@ hit_next:
 	 * on the first half
 	 */
 	if (state->start <= end && state->end > end) {
+		/* XXX: split need a write lock */
+		if (extent_rw_flip(tree, &rw)) {
+			spin_unlock(&state->lock);
+			goto again;
+		}
 		prealloc = alloc_extent_state_atomic(prealloc);
 		BUG_ON(!prealloc);
 		err = split_state(tree, state, prealloc, end + 1);
 		BUG_ON(err == -EEXIST);
+		spin_unlock(&state->lock);
+
 		if (wake)
 			wake_up(&state->wq);
 
+		spin_lock(&prealloc->lock);
+		/* this will unlock prealloc->lock for us */
 		set |= clear_state_bit(tree, prealloc, &bits, wake);
 
 		prealloc = NULL;
 		goto out;
 	}
 
-	set |= clear_state_bit(tree, state, &bits, wake);
+	check = (rw == EXTENT_RLOCKED) ? 1 : 0;
+	free = __clear_state_bit(tree, state, &bits, wake, check);
+	if (free && rw == EXTENT_RLOCKED) {
+		/* this one will be freed, so it needs a write lock */
+		spin_unlock(&state->lock);
+		extent_rw_flip(tree, &rw);
+		goto again;
+	}
+	if (rw == EXTENT_RLOCKED) {
+		merge = test_merge_state(tree, state);
+		spin_unlock(&state->lock);
+		state = next_state(state);
+	} else {
+		/* this one will unlock state->lock for us */
+		state = try_free_or_merge_state(tree, state);
+	}
 next:
 	if (last_end == (u64)-1)
 		goto out;
 	start = last_end + 1;
-	if (start <= end && next_node) {
-		state = rb_entry(next_node, struct extent_state,
-				 rb_node);
+	if (start <= end && state && !need_resched())
 		goto hit_next;
-	}
 	goto search_again;
 
 out:
-	spin_unlock(&tree->lock);
+	extent_rw_unlock(tree, &rw);
 	if (prealloc)
 		free_extent_state(prealloc);
+	if (merge)
+		process_merge_state(tree, orig_start);
 
 	return set;
 
 search_again:
 	if (start > end)
 		goto out;
-	spin_unlock(&tree->lock);
+	extent_rw_unlock(tree, &rw);
 	if (mask & __GFP_WAIT)
 		cond_resched();
 	goto again;
@@ -618,9 +797,9 @@ static int wait_on_state(struct extent_io_tree *tree,
 {
 	DEFINE_WAIT(wait);
 	prepare_to_wait(&state->wq, &wait, TASK_UNINTERRUPTIBLE);
-	spin_unlock(&tree->lock);
+	read_unlock(&tree->lock);
 	schedule();
-	spin_lock(&tree->lock);
+	read_lock(&tree->lock);
 	finish_wait(&state->wq, &wait);
 	return 0;
 }
@@ -635,7 +814,7 @@ int wait_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, int bits)
 	struct extent_state *state;
 	struct rb_node *node;
 
-	spin_lock(&tree->lock);
+	read_lock(&tree->lock);
 again:
 	while (1) {
 		/*
@@ -651,22 +830,27 @@ again:
 		if (state->start > end)
 			goto out;
 
+		spin_lock(&state->lock);
 		if (state->state & bits) {
+			spin_unlock(&state->lock);
 			start = state->start;
 			atomic_inc(&state->refs);
 			wait_on_state(tree, state);
 			free_extent_state(state);
 			goto again;
 		}
+		spin_unlock(&state->lock);
 		start = state->end + 1;
 
 		if (start > end)
 			break;
 
-		cond_resched_lock(&tree->lock);
+		read_unlock(&tree->lock);
+		cond_resched();
+		read_lock(&tree->lock);
 	}
 out:
-	spin_unlock(&tree->lock);
+	read_unlock(&tree->lock);
 	return 0;
 }
 
@@ -716,6 +900,9 @@ int set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	int err = 0;
 	u64 last_start;
 	u64 last_end;
+	u64 orig_start = start;
+	int rw = EXTENT_READ;
+	int merge = 0;
 
 	bits |= EXTENT_FIRST_DELALLOC;
 again:
@@ -724,7 +911,8 @@ again:
 		BUG_ON(!prealloc);
 	}
 
-	spin_lock(&tree->lock);
+	/* XXX: after this we're EXTENT_RLOCKED/EXTENT_WLOCKED */
+	extent_rw_lock(tree, &rw);
 	if (cached_state && *cached_state) {
 		state = *cached_state;
 		if (state->start <= start && state->end > start &&
@@ -739,6 +927,9 @@ again:
 	 */
 	node = tree_search(tree, start);
 	if (!node) {
+		/* XXX: insert need a write lock */
+		if (extent_rw_flip(tree, &rw))
+			goto again;
 		prealloc = alloc_extent_state_atomic(prealloc);
 		BUG_ON(!prealloc);
 		err = insert_state(tree, prealloc, start, end, &bits);
@@ -751,6 +942,7 @@ hit_next:
 	last_start = state->start;
 	last_end = state->end;
 
+	spin_lock(&state->lock);
 	/*
 	 * | ---- desired range ---- |
 	 * | state |
@@ -759,16 +951,22 @@ hit_next:
 	 */
 	if (state->start == start && state->end <= end) {
 		struct rb_node *next_node;
+
 		if (state->state & exclusive_bits) {
+			spin_unlock(&state->lock);
 			*failed_start = state->start;
 			err = -EEXIST;
 			goto out;
 		}
 
 		set_state_bits(tree, state, &bits);
-
 		cache_state(state, cached_state);
-		merge_state(tree, state);
+		/* XXX */
+		if (rw == EXTENT_RLOCKED)
+			merge = test_merge_state(tree, state);
+		else
+			merge_state(tree, state);
+		spin_unlock(&state->lock);
 		if (last_end == (u64)-1)
 			goto out;
 
@@ -801,25 +999,38 @@ hit_next:
 	 */
 	if (state->start < start) {
 		if (state->state & exclusive_bits) {
+			spin_unlock(&state->lock);
 			*failed_start = start;
 			err = -EEXIST;
 			goto out;
 		}
 
+		/* XXX: split need a write lock */
+		if (extent_rw_flip(tree, &rw)) {
+			spin_unlock(&state->lock);
+			goto again;
+		}
+
+		/* split must hold a write lock */
 		prealloc = alloc_extent_state_atomic(prealloc);
 		BUG_ON(!prealloc);
 		err = split_state(tree, state, prealloc, start);
 		BUG_ON(err == -EEXIST);
 		prealloc = NULL;
-		if (err)
+		if (err) {
+			spin_unlock(&state->lock);
 			goto out;
+		}
 		if (state->end <= end) {
 			set_state_bits(tree, state, &bits);
 			cache_state(state, cached_state);
 			merge_state(tree, state);
+			spin_unlock(&state->lock);
 			if (last_end == (u64)-1)
 				goto out;
 			start = last_end + 1;
+		} else {
+			spin_unlock(&state->lock);
 		}
 		goto search_again;
 	}
@@ -832,6 +1043,12 @@ hit_next:
 	 */
 	if (state->start > start) {
 		u64 this_end;
+
+		spin_unlock(&state->lock);
+		/* XXX: split need a write lock */
+		if (extent_rw_flip(tree, &rw))
+			goto again;
+
 		if (end < last_start)
 			this_end = end;
 		else
@@ -852,7 +1069,9 @@ hit_next:
 			prealloc = NULL;
 			goto out;
 		}
+		spin_lock(&prealloc->lock);
 		cache_state(prealloc, cached_state);
+		spin_unlock(&prealloc->lock);
 		prealloc = NULL;
 		start = this_end + 1;
 		goto search_again;
@@ -865,19 +1084,31 @@ hit_next:
 	 */
 	if (state->start <= end && state->end > end) {
 		if (state->state & exclusive_bits) {
+			spin_unlock(&state->lock);
 			*failed_start = start;
 			err = -EEXIST;
 			goto out;
 		}
 
+		/* XXX: split need a write lock */
+		if (extent_rw_flip(tree, &rw)) {
+			spin_unlock(&state->lock);
+			goto again;
+		}
+
+		/* split must hold a write lock */
 		prealloc = alloc_extent_state_atomic(prealloc);
 		BUG_ON(!prealloc);
 		err = split_state(tree, state, prealloc, end + 1);
 		BUG_ON(err == -EEXIST);
 
+		spin_unlock(&state->lock);
+
+		spin_lock(&prealloc->lock);
 		set_state_bits(tree, prealloc, &bits);
 		cache_state(prealloc, cached_state);
 		merge_state(tree, prealloc);
+		spin_unlock(&prealloc->lock);
 		prealloc = NULL;
 		goto out;
 	}
@@ -885,16 +1116,18 @@ hit_next:
 	goto search_again;
 
 out:
-	spin_unlock(&tree->lock);
+	extent_rw_unlock(tree, &rw);
 	if (prealloc)
 		free_extent_state(prealloc);
+	if (merge)
+		process_merge_state(tree, orig_start);
 
 	return err;
 
 search_again:
 	if (start > end)
 		goto out;
-	spin_unlock(&tree->lock);
+	extent_rw_unlock(tree, &rw);
 	if (mask & __GFP_WAIT)
 		cond_resched();
 	goto again;
@@ -924,6 +1157,9 @@ int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	int err = 0;
 	u64 last_start;
 	u64 last_end;
+	u64 orig_start = start;
+	int rw = EXTENT_READ;
+	int merge = 0;
 
 again:
 	if (!prealloc && (mask & __GFP_WAIT)) {
@@ -932,13 +1168,18 @@ again:
 			return -ENOMEM;
 	}
 
-	spin_lock(&tree->lock);
+	/* XXX: after this we're EXTENT_RLOCKED/EXTENT_WLOCKED */
+	extent_rw_lock(tree, &rw);
 	/*
 	 * this search will find all the extents that end after
 	 * our range starts.
 	 */
 	node = tree_search(tree, start);
 	if (!node) {
+		/* XXX: insert need a write lock */
+		if (extent_rw_flip(tree, &rw))
+			goto again;
+
 		prealloc = alloc_extent_state_atomic(prealloc);
 		if (!prealloc) {
 			err = -ENOMEM;
@@ -954,6 +1195,7 @@ hit_next:
 	last_start = state->start;
 	last_end = state->end;
 
+	spin_lock(&state->lock);
 	/*
 	 * | ---- desired range ---- |
 	 * | state |
@@ -964,7 +1206,13 @@ hit_next:
 		struct rb_node *next_node;
 
 		set_state_bits(tree, state, &bits);
-		clear_state_bit(tree, state, &clear_bits, 0);
+		__clear_state_bit(tree, state, &clear_bits, 0, 0);
+		if (rw == EXTENT_LOCKED)
+			merge = test_merge_state(tree, state);
+		else
+			merge_state(tree, state);
+		spin_unlock(&state->lock);
+
 		if (last_end == (u64)-1)
 			goto out;
 
@@ -979,6 +1227,8 @@ hit_next:
 		goto search_again;
 	}
 
+	WARN_ON(1);
+
 	/*
 	 *     | ---- desired range ---- |
 	 * | state |
@@ -996,22 +1246,34 @@ hit_next:
 	 * desired bit on it.
 	 */
 	if (state->start < start) {
+		/* XXX: split need a write lock */
+		if (extent_rw_flip(tree, &rw)) {
+			spin_unlock(&state->lock);
+			goto again;
+		}
+
 		prealloc = alloc_extent_state_atomic(prealloc);
 		if (!prealloc) {
+			spin_unlock(&state->lock);
 			err = -ENOMEM;
 			goto out;
 		}
 		err = split_state(tree, state, prealloc, start);
 		BUG_ON(err == -EEXIST);
 		prealloc = NULL;
-		if (err)
+		if (err) {
+			spin_unlock(&state->lock);
 			goto out;
+		}
 		if (state->end <= end) {
 			set_state_bits(tree, state, &bits);
+			/* will unlock state lock for us */
 			clear_state_bit(tree, state, &clear_bits, 0);
 			if (last_end == (u64)-1)
 				goto out;
 			start = last_end + 1;
+		} else {
+			spin_unlock(&state->lock);
 		}
 		goto search_again;
 	}
@@ -1024,11 +1286,17 @@ hit_next:
 	 */
 	if (state->start > start) {
 		u64 this_end;
+
+		spin_unlock(&state->lock);
 		if (end < last_start)
 			this_end = end;
 		else
 			this_end = last_start - 1;
 
+		/* XXX: insert need a write lock */
+		if (extent_rw_flip(tree, &rw))
+			goto again;
+
 		prealloc = alloc_extent_state_atomic(prealloc);
 		if (!prealloc) {
 			err = -ENOMEM;
@@ -1058,6 +1326,10 @@ hit_next:
 	 * on the first half
 	 */
 	if (state->start <= end && state->end > end) {
+		/* XXX: split need a write lock */
+		if (extent_rw_flip(tree, &rw))
+			goto again;
+
 		prealloc = alloc_extent_state_atomic(prealloc);
 		if (!prealloc) {
 			err = -ENOMEM;
@@ -1067,7 +1339,11 @@ hit_next:
 		err = split_state(tree, state, prealloc, end + 1);
 		BUG_ON(err == -EEXIST);
 
+		spin_unlock(&state->lock);
+		spin_lock(&prealloc->lock);
+
 		set_state_bits(tree, prealloc, &bits);
+		/* will unlock prealloc lock for us */
 		clear_state_bit(tree, prealloc, &clear_bits, 0);
 		prealloc = NULL;
 		goto out;
@@ -1076,16 +1352,20 @@ hit_next:
 	goto search_again;
 
 out:
-	spin_unlock(&tree->lock);
+	/* XXX */
+	extent_rw_unlock(tree, &rw);
 	if (prealloc)
 		free_extent_state(prealloc);
+	if (merge)
+		process_merge_state(tree, orig_start);
 
 	return err;
 
 search_again:
 	if (start > end)
 		goto out;
-	spin_unlock(&tree->lock);
+	/* XXX */
+	extent_rw_unlock(tree, &rw);
 	if (mask & __GFP_WAIT)
 		cond_resched();
 	goto again;
@@ -1248,8 +1528,12 @@ struct extent_state *find_first_extent_bit_state(struct extent_io_tree *tree,
 
 	while (1) {
 		state = rb_entry(node, struct extent_state, rb_node);
-		if (state->end >= start && (state->state & bits))
+		spin_lock(&state->lock);
+		if (state->end >= start && (state->state & bits)) {
+			spin_unlock(&state->lock);
 			return state;
+		}
+		spin_unlock(&state->lock);
 
 		node = rb_next(node);
 		if (!node)
@@ -1272,14 +1556,14 @@ int find_first_extent_bit(struct extent_io_tree *tree, u64 start,
 	struct extent_state *state;
 	int ret = 1;
 
-	spin_lock(&tree->lock);
+	read_lock(&tree->lock);
 	state = find_first_extent_bit_state(tree, start, bits);
 	if (state) {
 		*start_ret = state->start;
 		*end_ret = state->end;
 		ret = 0;
 	}
-	spin_unlock(&tree->lock);
+	read_unlock(&tree->lock);
 	return ret;
 }
 
@@ -1299,7 +1583,7 @@ static noinline u64 find_delalloc_range(struct extent_io_tree *tree,
 	u64 found = 0;
 	u64 total_bytes = 0;
 
-	spin_lock(&tree->lock);
+	read_lock(&tree->lock);
 
 	/*
 	 * this search will find all the extents that end after
@@ -1314,15 +1598,20 @@ static noinline u64 find_delalloc_range(struct extent_io_tree *tree,
 
 	while (1) {
 		state = rb_entry(node, struct extent_state, rb_node);
+		spin_lock(&state->lock);
 		if (found && (state->start != cur_start ||
 			      (state->state & EXTENT_BOUNDARY))) {
+			spin_unlock(&state->lock);
 			goto out;
 		}
 		if (!(state->state & EXTENT_DELALLOC)) {
+			spin_unlock(&state->lock);
 			if (!found)
 				*end = state->end;
 			goto out;
 		}
+		spin_unlock(&state->lock);
+
 		if (!found) {
 			*start = state->start;
 			*cached_state = state;
@@ -1339,7 +1628,7 @@ static noinline u64 find_delalloc_range(struct extent_io_tree *tree,
 			break;
 	}
 out:
-	spin_unlock(&tree->lock);
+	read_unlock(&tree->lock);
 	return found;
 }
 
@@ -1602,7 +1891,7 @@ u64 count_range_bits(struct extent_io_tree *tree,
 		return 0;
 	}
 
-	spin_lock(&tree->lock);
+	read_lock(&tree->lock);
 	if (cur_start == 0 && bits == EXTENT_DIRTY) {
 		total_bytes = tree->dirty_bytes;
 		goto out;
@@ -1621,7 +1910,9 @@ u64 count_range_bits(struct extent_io_tree *tree,
 			break;
 		if (contig && found && state->start > last + 1)
 			break;
+		spin_lock(&state->lock);
 		if (state->end >= cur_start && (state->state & bits) == bits) {
+			spin_unlock(&state->lock);
 			total_bytes += min(search_end, state->end) + 1 -
 				       max(cur_start, state->start);
 			if (total_bytes >= max_bytes)
@@ -1632,14 +1923,18 @@ u64 count_range_bits(struct extent_io_tree *tree,
 			}
 			last = state->end;
 		} else if (contig && found) {
+			spin_unlock(&state->lock);
 			break;
+		} else {
+			spin_unlock(&state->lock);
 		}
+
 		node = rb_next(node);
 		if (!node)
 			break;
 	}
 out:
-	spin_unlock(&tree->lock);
+	read_unlock(&tree->lock);
 	return total_bytes;
 }
 
@@ -1690,7 +1985,7 @@ int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	struct rb_node *node;
 	int bitset = 0;
 
-	spin_lock(&tree->lock);
+	read_lock(&tree->lock);
 	if (cached && cached->tree && cached->start <= start &&
 	    cached->end > start)
 		node = &cached->rb_node;
@@ -1707,13 +2002,18 @@ int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end,
 		if (state->start > end)
 			break;
 
+		spin_lock(&state->lock);
 		if (state->state & bits) {
+			spin_unlock(&state->lock);
 			bitset = 1;
 			if (!filled)
 				break;
 		} else if (filled) {
+			spin_unlock(&state->lock);
 			bitset = 0;
 			break;
+		} else {
+			spin_unlock(&state->lock);
 		}
 
 		if (state->end == (u64)-1)
@@ -1729,7 +2029,7 @@ int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end,
 			break;
 		}
 	}
-	spin_unlock(&tree->lock);
+	read_unlock(&tree->lock);
 	return bitset;
 }
 
@@ -1926,11 +2226,11 @@ static int clean_io_failure(u64 start, struct page *page)
 		goto out;
 	}
 
-	spin_lock(&BTRFS_I(inode)->io_tree.lock);
+	read_lock(&BTRFS_I(inode)->io_tree.lock);
 	state = find_first_extent_bit_state(&BTRFS_I(inode)->io_tree,
 					    failrec->start,
 					    EXTENT_LOCKED);
-	spin_unlock(&BTRFS_I(inode)->io_tree.lock);
+	read_unlock(&BTRFS_I(inode)->io_tree.lock);
 
 	if (state && state->start == failrec->start) {
 		map_tree = &BTRFS_I(inode)->root->fs_info->mapping_tree;
@@ -2064,12 +2364,12 @@ static int bio_readpage_error(struct bio *failed_bio, struct page *page,
 	}
 
 	if (!state) {
-		spin_lock(&tree->lock);
+		read_lock(&tree->lock);
 		state = find_first_extent_bit_state(tree, failrec->start,
 						    EXTENT_LOCKED);
 		if (state && state->start != failrec->start)
 			state = NULL;
-		spin_unlock(&tree->lock);
+		read_unlock(&tree->lock);
 	}
 
 	/*
@@ -2641,7 +2941,7 @@ static int __extent_read_full_page(struct extent_io_tree *tree,
 			set_extent_uptodate(tree, cur, cur + iosize - 1,
 					    &cached, GFP_NOFS);
 			unlock_extent_cached(tree, cur, cur + iosize - 1,
-			                     &cached, GFP_NOFS);
+					     &cached, GFP_NOFS);
 			cur = cur + iosize;
 			pg_offset += iosize;
 			continue;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index d85e361..b9f6e7a 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -98,7 +98,7 @@ struct extent_io_tree {
 	struct radix_tree_root csum;
 	struct address_space *mapping;
 	u64 dirty_bytes;
-	spinlock_t lock;
+	rwlock_t lock;
 	spinlock_t buffer_lock;
 	spinlock_t csum_lock;
 	struct extent_io_ops *ops;
@@ -114,6 +114,7 @@ struct extent_state {
 	wait_queue_head_t wq;
 	atomic_t refs;
 	unsigned long state;
+	spinlock_t lock;
 
 	/* for use by the FS */
 	u64 private;
-- 
1.6.5.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 5/5] Btrfs: avoid splitting state when truncating pagecache
  2012-03-15  9:12 [RFC PATCH 0/5] apply rwlock for extent state Liu Bo
                   ` (3 preceding siblings ...)
  2012-03-15  9:12 ` [PATCH 4/5] Btrfs: apply rwlock for extent state Liu Bo
@ 2012-03-15  9:12 ` Liu Bo
  2012-03-21  2:52 ` [RFC PATCH 0/5] apply rwlock for extent state Liu Bo
  5 siblings, 0 replies; 11+ messages in thread
From: Liu Bo @ 2012-03-15  9:12 UTC (permalink / raw)
  To: linux-btrfs

o   While we're invalidating a page, we hold the page lock, so lock_extent
    is not needed.

o   When we're truncating pagecache in both evict and truncate_freespace,
    after waiting pages' writeback flag, it is ensured that no IO comes in,
    so we can safely clear the range's state.  And this way we can avoid
    splitting state and reduce the write locks of extent state tree.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
---
 fs/btrfs/free-space-cache.c |    5 +++++
 fs/btrfs/inode.c            |   15 +++++++--------
 2 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index b30242f..b062b82 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -217,6 +217,11 @@ int btrfs_truncate_free_space_cache(struct btrfs_root *root,
 	}
 	spin_unlock(&trans->block_rsv->lock);
 
+	filemap_fdatawait(inode->i_mapping);
+	clear_extent_bit(&BTRFS_I(inode)->io_tree, 0, (u64)-1,
+			EXTENT_DIRTY | EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING,
+			1, 1, NULL, GFP_NOFS);
+
 	oldsize = i_size_read(inode);
 	btrfs_i_size_write(inode, 0);
 	truncate_pagecache(inode, oldsize, 0);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e9c4d6c..9077966 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3501,6 +3501,10 @@ void btrfs_evict_inode(struct inode *inode)
 
 	trace_btrfs_inode_evict(inode);
 
+	filemap_fdatawait(inode->i_mapping);
+	clear_extent_bit(&BTRFS_I(inode)->io_tree, 0, (u64)-1,
+			EXTENT_DIRTY | EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING,
+			1, 1, NULL, GFP_NOFS);
 	truncate_inode_pages(&inode->i_data, 0);
 	if (inode->i_nlink && (btrfs_root_refs(&root->root_item) != 0 ||
 			       btrfs_is_free_space_inode(root, inode)))
@@ -6339,7 +6343,6 @@ static void btrfs_invalidatepage(struct page *page, unsigned long offset)
 {
 	struct extent_io_tree *tree;
 	struct btrfs_ordered_extent *ordered;
-	struct extent_state *cached_state = NULL;
 	u64 page_start = page_offset(page);
 	u64 page_end = page_start + PAGE_CACHE_SIZE - 1;
 
@@ -6358,8 +6361,7 @@ static void btrfs_invalidatepage(struct page *page, unsigned long offset)
 		btrfs_releasepage(page, GFP_NOFS);
 		return;
 	}
-	lock_extent_bits(tree, page_start, page_end, 0, &cached_state,
-			 GFP_NOFS);
+
 	ordered = btrfs_lookup_ordered_extent(page->mapping->host,
 					   page_offset(page));
 	if (ordered) {
@@ -6370,7 +6372,7 @@ static void btrfs_invalidatepage(struct page *page, unsigned long offset)
 		clear_extent_bit(tree, page_start, page_end,
 				 EXTENT_DIRTY | EXTENT_DELALLOC |
 				 EXTENT_LOCKED | EXTENT_DO_ACCOUNTING, 1, 0,
-				 &cached_state, GFP_NOFS);
+				 NULL, GFP_NOFS);
 		/*
 		 * whoever cleared the private bit is responsible
 		 * for the finish_ordered_io
@@ -6380,13 +6382,10 @@ static void btrfs_invalidatepage(struct page *page, unsigned long offset)
 						page_start, page_end);
 		}
 		btrfs_put_ordered_extent(ordered);
-		cached_state = NULL;
-		lock_extent_bits(tree, page_start, page_end, 0, &cached_state,
-				 GFP_NOFS);
 	}
 	clear_extent_bit(tree, page_start, page_end,
 		 EXTENT_LOCKED | EXTENT_DIRTY | EXTENT_DELALLOC |
-		 EXTENT_DO_ACCOUNTING, 1, 1, &cached_state, GFP_NOFS);
+		 EXTENT_DO_ACCOUNTING, 1, 1, NULL, GFP_NOFS);
 	__btrfs_releasepage(page, GFP_NOFS);
 
 	ClearPageChecked(page);
-- 
1.6.5.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 0/5] apply rwlock for extent state
  2012-03-15  9:12 [RFC PATCH 0/5] apply rwlock for extent state Liu Bo
                   ` (4 preceding siblings ...)
  2012-03-15  9:12 ` [PATCH 5/5] Btrfs: avoid splitting state when truncating pagecache Liu Bo
@ 2012-03-21  2:52 ` Liu Bo
  2012-03-21 12:34   ` Andrea Gelmini
  5 siblings, 1 reply; 11+ messages in thread
From: Liu Bo @ 2012-03-21  2:52 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Chris Mason, Josef Bacik, Arne Jansen, David Sterba

Any comments?

On 03/15/2012 05:12 PM, Liu Bo wrote:
> This patchset is against one of project ideas, RBtree lock contention:
> "Btrfs uses a number of rbtrees to index in-memory data structures.
> Some of these are dominated by reads, and the lock contention from searching
> them is showing up in profiles.  We need to look into an RCU and sequence
> counter combination to allow lockless reads."
> 
> The goal is to use RCU, but we take it as a long term one, and instead we use
> rwlock until we find a mature rcu structure for lockless read.
> 
> So what we need to do is to make the code RCU friendly, and the idea mainly
> comes from Chris Mason:
> Quoted:
> "I think the extent_state code can be much more RCU friendly if we separate
> the operations on the tree from operations on the individual state.
> In general, we can gain a lot of performance if we are able to reduce
> the write locks taken at endio time.  Especially for reads, these are
> critical."
> 
> If we read a 10M file with pagecache,
> $ dd if=/mnt/btrfs/foobar of=/dev/null bs=1M
> 
> without this patchset,
>         lock for read : lock for write ~= 2:1
> with this patchset,
>         lock for read : lock for write ~= 200:1
> 
> I've run through xfstests, and no bugs jump out by then.
> MORE TESTS ARE WELCOME!
> 
> 
> Liu Bo (5):
>   Btrfs: use radix tree for checksum
>   Btrfs: merge adjacent states as much as possible
>   Btrfs: use large extent range for read and its endio
>   Btrfs: apply rwlock for extent state
>   Btrfs: avoid splitting state when truncating pagecache
> 
>  fs/btrfs/extent_io.c        |  768 +++++++++++++++++++++++++++++++++----------
>  fs/btrfs/extent_io.h        |    5 +-
>  fs/btrfs/free-space-cache.c |    5 +
>  fs/btrfs/inode.c            |   22 +-
>  4 files changed, 614 insertions(+), 186 deletions(-)
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 0/5] apply rwlock for extent state
  2012-03-21  2:52 ` [RFC PATCH 0/5] apply rwlock for extent state Liu Bo
@ 2012-03-21 12:34   ` Andrea Gelmini
  2012-03-22  1:04     ` Chris Samuel
  0 siblings, 1 reply; 11+ messages in thread
From: Andrea Gelmini @ 2012-03-21 12:34 UTC (permalink / raw)
  To: Liu Bo; +Cc: linux-btrfs, Chris Mason, Josef Bacik, Arne Jansen, David Sterba

2012/3/21 Liu Bo <liubo2009@cn.fujitsu.com>:
> Any comments?

No comment, but I'm using this patches without problem since you
published it (compressed
/home with hourly snapshot delete/creation).

Thanks a lot for your work,
Andrea

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 0/5] apply rwlock for extent state
  2012-03-21 12:34   ` Andrea Gelmini
@ 2012-03-22  1:04     ` Chris Samuel
  2012-03-22  1:17       ` Liu Bo
  0 siblings, 1 reply; 11+ messages in thread
From: Chris Samuel @ 2012-03-22  1:04 UTC (permalink / raw)
  To: Andrea Gelmini
  Cc: Liu Bo, linux-btrfs, Chris Mason, Josef Bacik, Arne Jansen,
	David Sterba

On 21/03/12 23:34, Andrea Gelmini wrote:

> No comment, but I'm using this patches without problem since you
> published it (compressed /home with hourly snapshot delete/creation).

Well worth sending a Tested-By: tag then, it's useful information.

cheers,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 0/5] apply rwlock for extent state
  2012-03-22  1:04     ` Chris Samuel
@ 2012-03-22  1:17       ` Liu Bo
  2012-03-22 13:09         ` Josef Bacik
  0 siblings, 1 reply; 11+ messages in thread
From: Liu Bo @ 2012-03-22  1:17 UTC (permalink / raw)
  To: Andrea Gelmini
  Cc: Chris Samuel, linux-btrfs, Chris Mason, Josef Bacik, Arne Jansen,
	David Sterba

On 03/22/2012 09:04 AM, Chris Samuel wrote:
> On 21/03/12 23:34, Andrea Gelmini wrote:
> 
>> No comment, but I'm using this patches without problem since you
>> published it (compressed /home with hourly snapshot delete/creation).
> 
> Well worth sending a Tested-By: tag then, it's useful information.
> 
> cheers,
> Chris


Besides a Tested-By: tag, I'll appreciate a lot if anyone can provide
improvement/regression on performance with this patchset ;)

thanks,
liubo

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 0/5] apply rwlock for extent state
  2012-03-22  1:17       ` Liu Bo
@ 2012-03-22 13:09         ` Josef Bacik
  0 siblings, 0 replies; 11+ messages in thread
From: Josef Bacik @ 2012-03-22 13:09 UTC (permalink / raw)
  To: Liu Bo
  Cc: Andrea Gelmini, Chris Samuel, linux-btrfs, Chris Mason,
	Josef Bacik, Arne Jansen, David Sterba

On Thu, Mar 22, 2012 at 09:17:30AM +0800, Liu Bo wrote:
> On 03/22/2012 09:04 AM, Chris Samuel wrote:
> > On 21/03/12 23:34, Andrea Gelmini wrote:
> > 
> >> No comment, but I'm using this patches without problem since you
> >> published it (compressed /home with hourly snapshot delete/creation).
> > 
> > Well worth sending a Tested-By: tag then, it's useful information.
> > 
> > cheers,
> > Chris
> 
> 
> Besides a Tested-By: tag, I'll appreciate a lot if anyone can provide
> improvement/regression on performance with this patchset ;)
>

I've started reviewing some of it Liu and I'll try to get to the rest of it
today.  Thanks,

Josef 

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2012-03-22 13:09 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-03-15  9:12 [RFC PATCH 0/5] apply rwlock for extent state Liu Bo
2012-03-15  9:12 ` [PATCH 1/5] Btrfs: use radix tree for checksum Liu Bo
2012-03-15  9:12 ` [PATCH 2/5] Btrfs: merge adjacent states as much as possible Liu Bo
2012-03-15  9:12 ` [PATCH 3/5] Btrfs: use large extent range for read and its endio Liu Bo
2012-03-15  9:12 ` [PATCH 4/5] Btrfs: apply rwlock for extent state Liu Bo
2012-03-15  9:12 ` [PATCH 5/5] Btrfs: avoid splitting state when truncating pagecache Liu Bo
2012-03-21  2:52 ` [RFC PATCH 0/5] apply rwlock for extent state Liu Bo
2012-03-21 12:34   ` Andrea Gelmini
2012-03-22  1:04     ` Chris Samuel
2012-03-22  1:17       ` Liu Bo
2012-03-22 13:09         ` Josef Bacik

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).