* [Ocfs2-devel] [PATCH 0/5] Ocfs2 allocation reservations
@ 2010-03-17  6:59 Mark Fasheh
  2010-03-17  6:59 ` [Ocfs2-devel] [PATCH 1/5] ocfs2: " Mark Fasheh
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: Mark Fasheh @ 2010-03-17  6:59 UTC (permalink / raw)
  To: ocfs2-devel

Changes from the last patch set:
- added a check for overlapping reservations in ocfs2_resv_insert()
- cleaned up the comments in ocfs2_cannibalize_resv()
- the check for reservation past bitmap end in ocfs2_check_resmap() is more
  strict now.
- removed the unused m_search_start member of ocfs2_reservation_map
- optimized __ocfs2_resv_find_window() to ignore regions that are too small
  for the current alloc
- major cleanup of ocfs2_resmap_claimed_bits()
- added a set of BUG_ON()s to ocfs2_resmap_claimed_bits() to check that
  the passed allocation range is within the window.
- fixed ocfs2_local_alloc_find_clear_bits() to return actual bits allocated
- added a check for a null data_ac in ocfs2_write_begin_nolock()

I also added a fifth patch, "ocfs2: remove ocfs2_local_alloc_in_range()".
I could spin this as its own patch to go upstream earlier if we want.

Finally, thanks to Tao for an excellent review that helped me catch most of
those issues.


Original introduction message follows:

The following patches comprise my latest work on enabling larger contiguous
allocations in Ocfs2 in the presence of multiple threads. The patches have
been through much more testing since my last round. At this point, I'd say
they're ready for wider consumption and of course, more review :)

Similarly to the last series, reservations only operate on the local alloc
bitmap. The code knows nothing of inodes and allocators however, so we can
extend it to the global bitmap (should the need arise) at a future date.

Changes from the last series are numerous. The biggest one, however, is that
reservations (when enabled) are no longer 'advisory' and represent an actual
region of free bits in the local alloc file. The local alloc code obeys
reservations unconditionally.

I made this change because I saw a breakdown in allocation (back to
worst-case) on longer-running tests, or those with many threads.
Those tests, it turned out, were exposing "corner cases" in the code where
reservations could no longer be honored due to bits having been set in the
local alloc bitmap. Better window replacement (and tracking) policy became
quite convoluted when the state of the local alloc bitmap wasn't quite
known. It is far simpler to just consult the bitmap for windows, and my
testing results showed that it worked better too.

This differs from file systems like ext4 (which I used for inspiration), but
our allocation strategy differs greatly. Whereas ext4 may have many
different block groups in play during a multi-threaded write, we only have
the single (and relatively smaller) local alloc window. Reservations can
afford to be advisory for ext4; in Ocfs2, however, we need them to be honored.
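
As a rough illustration of what "consulting the bitmap for windows" can look
like, the sketch below scans a range of a byte-array bitmap for the longest
run of clear bits, stopping early once it has found as many as were asked
for. It is not code from the patch: the names sketch_test_bit() and
sketch_find_free_run() are hypothetical, and the bit ordering (least
significant bit first within each byte) is an assumption made for the
example.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: test bit 'nr' in a byte-array bitmap (LSB first). */
static int sketch_test_bit(const unsigned char *bitmap, unsigned int nr)
{
	return (bitmap[nr / CHAR_BIT] >> (nr % CHAR_BIT)) & 1;
}

/*
 * Scan bits [start, start + len) for the longest run of clear bits,
 * stopping early once a run of 'wanted' bits has been seen.
 *
 * Returns the run length, capped at 'wanted' (0 if every bit is set).
 * *rstart receives the first bit of that run when the result is nonzero.
 */
static unsigned int sketch_find_free_run(const unsigned char *bitmap,
					 unsigned int start, unsigned int len,
					 unsigned int wanted,
					 unsigned int *rstart)
{
	unsigned int bit, run = 0, best_len = 0, best_start = 0;

	for (bit = start; bit < start + len; bit++) {
		if (sketch_test_bit(bitmap, bit)) {
			run = 0;	/* an allocated bit ends the run */
			continue;
		}

		run++;
		if (run > best_len) {
			best_len = run;
			best_start = bit - run + 1;
		}
		if (run >= wanted)
			break;
	}

	if (best_len > wanted)
		best_len = wanted;
	if (best_len)
		*rstart = best_start;

	return best_len;
}
```

A caller would pass the gap between two existing reservations as
(start, len) and treat a shorter-than-wanted result as a candidate to keep
while it looks at the remaining gaps.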


As for results, I provide one of my recent test runs on a 4k/4k file
system:

dd if=/dev/urandom of=/ocfs2/1 bs=4096 count=10000 &
dd if=/dev/urandom of=/ocfs2/2 bs=4096 count=10000 &
dd if=/dev/urandom of=/ocfs2/3 bs=4096 count=10000 &


resv_level=0
Inode: 16920    % fragmented: 93.48     clusters: 10000 extents: 9348 score: 23931
Inode: 16921    % fragmented: 84.75     clusters: 10000 extents: 8475 score: 21696
Inode: 16922    % fragmented: 95.50     clusters: 10000 extents: 9550 score: 24448

resv_level=5 (defaults changed a bit, this means '128 blocks per reservation'):
Inode: 16916    % fragmented: 1.66      clusters: 10000 extents: 166 score: 425
Inode: 16917    % fragmented: 1.71      clusters: 10000 extents: 171 score: 438
Inode: 16918    % fragmented: 1.58      clusters: 10000 extents: 158 score: 404
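
For readers comparing the two runs: the "% fragmented" column appears to be
the extent count divided by the cluster count, times 100. That reading is
inferred from the numbers above, not taken from the tool that produced them,
so treat the sketch below as an assumption. Both function names are
hypothetical.

```c
/*
 * Assumed formula: percent fragmented = extents / clusters * 100.
 * Returns 0 for an empty file to avoid dividing by zero.
 */
static double sketch_pct_fragmented(unsigned int extents,
				    unsigned int clusters)
{
	if (clusters == 0)
		return 0.0;
	return 100.0 * (double)extents / (double)clusters;
}

/* Average extent length in clusters; 0 if there are no extents. */
static double sketch_avg_extent_len(unsigned int extents,
				    unsigned int clusters)
{
	if (extents == 0)
		return 0.0;
	return (double)clusters / (double)extents;
}
```

By that reading, inode 16916 (resv_level=5) averages roughly 60 clusters per
extent (10000 / 166), against about 1.07 for inode 16920 at resv_level=0
(10000 / 9348).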

Thanks in advance,
	--Mark


* [Ocfs2-devel] [PATCH 1/5] ocfs2: allocation reservations
  2010-03-17  6:59 [Ocfs2-devel] [PATCH 0/5] Ocfs2 allocation reservations Mark Fasheh
@ 2010-03-17  6:59 ` Mark Fasheh
  2010-03-19 22:40   ` Joel Becker
  2010-03-17  6:59 ` [Ocfs2-devel] [PATCH 2/5] ocfs2: use allocation reservations during file write Mark Fasheh
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Mark Fasheh @ 2010-03-17  6:59 UTC (permalink / raw)
  To: ocfs2-devel

This patch improves Ocfs2 allocation policy by allowing an inode to
reserve a portion of the local alloc bitmap for itself. The reserved
portion (allocation window) is advisory in that other allocation
windows might steal it if the local alloc bitmap becomes
full. Otherwise, the reservations are honored and guaranteed to be
free. When the local alloc window is moved to a different portion of
the bitmap, existing reservations are discarded.

Reservation windows are represented internally by a red-black
tree. Within that tree, each node represents the reservation window of
one inode. An LRU of active reservations is also maintained. When new
data is written, we allocate it from the inode's window. When all bits
in a window are exhausted, we allocate a new one as close to the
previous one as possible. Should we not find free space, an existing
reservation is pulled off the LRU and cannibalized.
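
The cannibalization step can be sketched in isolation as follows. This is a
simplified model of the policy in ocfs2_cannibalize_resv() from this patch,
with the rbtree and LRU bookkeeping left out; struct sketch_resv and the
sketch_* function names are hypothetical and exist only for this example.
The window size of 4 << level mirrors ocfs2_resv_window_bits().

```c
/* Hypothetical stand-in for a reservation window. */
struct sketch_resv {
	unsigned int start;	/* first bit of the window */
	unsigned int len;	/* number of bits; 0 means empty */
};

/* Window size in bits for a given resv_level (levels 1-8 give 8..1024). */
static unsigned int sketch_window_bits(unsigned int level)
{
	return 4u << level;
}

/*
 * Feed some or all of 'victim' (the least recently used window) to
 * 'taker'. A victim no larger than half the configured window size is
 * taken whole and left empty; a larger one gives up the upper half of
 * its range and keeps the lower half.
 */
static void sketch_cannibalize(struct sketch_resv *victim,
			       struct sketch_resv *taker,
			       unsigned int level)
{
	unsigned int min_bits = sketch_window_bits(level) >> 1;

	if (victim->len <= min_bits) {
		*taker = *victim;
		victim->start = 0;
		victim->len = 0;
	} else {
		unsigned int shrink = victim->len / 2;

		victim->len -= shrink;
		/* taker begins right after the victim's new last bit */
		taker->start = victim->start + victim->len;
		taker->len = shrink;
	}
}
```

With level 5 (a 128-bit window), a 128-bit victim starting at bit 100 keeps
bits 100-163 and hands bits 164-227 to the taker, while a 40-bit victim is
handed over entirely.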

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 Documentation/filesystems/ocfs2.txt |    3 +
 fs/ocfs2/Makefile                   |    1 +
 fs/ocfs2/cluster/masklog.c          |    1 +
 fs/ocfs2/cluster/masklog.h          |    1 +
 fs/ocfs2/localalloc.c               |   64 +++-
 fs/ocfs2/ocfs2.h                    |    5 +
 fs/ocfs2/reservations.c             |  829 +++++++++++++++++++++++++++++++++++
 fs/ocfs2/reservations.h             |  154 +++++++
 fs/ocfs2/suballoc.h                 |    2 +
 fs/ocfs2/super.c                    |   25 +
 10 files changed, 1074 insertions(+), 11 deletions(-)
 create mode 100644 fs/ocfs2/reservations.c
 create mode 100644 fs/ocfs2/reservations.h

diff --git a/Documentation/filesystems/ocfs2.txt b/Documentation/filesystems/ocfs2.txt
index c58b9f5..412df90 100644
--- a/Documentation/filesystems/ocfs2.txt
+++ b/Documentation/filesystems/ocfs2.txt
@@ -80,3 +80,6 @@ user_xattr	(*)	Enables Extended User Attributes.
 nouser_xattr		Disables Extended User Attributes.
 acl			Enables POSIX Access Control Lists support.
 noacl		(*)	Disables POSIX Access Control Lists support.
+resv_level=4	(*)	Set how aggressive allocation reservations will be.
+			Valid values range from 0 (reservations off) to 8
+			(maximum space for reservations).
diff --git a/fs/ocfs2/Makefile b/fs/ocfs2/Makefile
index 600d2d2..b1cd7fc 100644
--- a/fs/ocfs2/Makefile
+++ b/fs/ocfs2/Makefile
@@ -29,6 +29,7 @@ ocfs2-objs := \
 	mmap.o 			\
 	namei.o 		\
 	refcounttree.o		\
+	reservations.o		\
 	resize.o		\
 	slot_map.o 		\
 	suballoc.o 		\
diff --git a/fs/ocfs2/cluster/masklog.c b/fs/ocfs2/cluster/masklog.c
index 1cd2934..a56147a 100644
--- a/fs/ocfs2/cluster/masklog.c
+++ b/fs/ocfs2/cluster/masklog.c
@@ -115,6 +115,7 @@ static struct mlog_attribute mlog_attrs[MLOG_MAX_BITS] = {
 	define_mask(ERROR),
 	define_mask(NOTICE),
 	define_mask(KTHREAD),
+	define_mask(RESERVATIONS),
 };
 
 static struct attribute *mlog_attr_ptrs[MLOG_MAX_BITS] = {NULL, };
diff --git a/fs/ocfs2/cluster/masklog.h b/fs/ocfs2/cluster/masklog.h
index 9b4d117..af1a610 100644
--- a/fs/ocfs2/cluster/masklog.h
+++ b/fs/ocfs2/cluster/masklog.h
@@ -118,6 +118,7 @@
 #define ML_ERROR	0x0000000100000000ULL /* sent to KERN_ERR */
 #define ML_NOTICE	0x0000000200000000ULL /* setn to KERN_NOTICE */
 #define ML_KTHREAD	0x0000000400000000ULL /* kernel thread activity */
+#define	ML_RESERVATIONS	0x0000000800000000ULL /* ocfs2 alloc reservations */
 
 #define MLOG_INITIAL_AND_MASK (ML_ERROR|ML_NOTICE)
 #define MLOG_INITIAL_NOT_MASK (ML_ENTRY|ML_EXIT)
diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c
index ac10f83..ebab3c0 100644
--- a/fs/ocfs2/localalloc.c
+++ b/fs/ocfs2/localalloc.c
@@ -52,7 +52,8 @@ static u32 ocfs2_local_alloc_count_bits(struct ocfs2_dinode *alloc);
 
 static int ocfs2_local_alloc_find_clear_bits(struct ocfs2_super *osb,
 					     struct ocfs2_dinode *alloc,
-					     u32 numbits);
+					     u32 *numbits,
+					     struct ocfs2_alloc_reservation *resv);
 
 static void ocfs2_clear_local_alloc(struct ocfs2_dinode *alloc);
 
@@ -262,6 +263,8 @@ void ocfs2_shutdown_local_alloc(struct ocfs2_super *osb)
 
 	osb->local_alloc_state = OCFS2_LA_DISABLED;
 
+	ocfs2_resmap_uninit(&osb->osb_la_resmap);
+
 	main_bm_inode = ocfs2_get_system_file_inode(osb,
 						    GLOBAL_BITMAP_SYSTEM_INODE,
 						    OCFS2_INVALID_SLOT);
@@ -498,7 +501,7 @@ static int ocfs2_local_alloc_in_range(struct inode *inode,
 	alloc = (struct ocfs2_dinode *) osb->local_alloc_bh->b_data;
 	la = OCFS2_LOCAL_ALLOC(alloc);
 
-	start = ocfs2_local_alloc_find_clear_bits(osb, alloc, bits_wanted);
+	start = ocfs2_local_alloc_find_clear_bits(osb, alloc, &bits_wanted, NULL);
 	if (start == -1) {
 		mlog_errno(-ENOSPC);
 		return 0;
@@ -664,7 +667,8 @@ int ocfs2_claim_local_alloc_bits(struct ocfs2_super *osb,
 	alloc = (struct ocfs2_dinode *) osb->local_alloc_bh->b_data;
 	la = OCFS2_LOCAL_ALLOC(alloc);
 
-	start = ocfs2_local_alloc_find_clear_bits(osb, alloc, bits_wanted);
+	start = ocfs2_local_alloc_find_clear_bits(osb, alloc, &bits_wanted,
+						  ac->ac_resv);
 	if (start == -1) {
 		/* TODO: Shouldn't we just BUG here? */
 		status = -ENOSPC;
@@ -674,8 +678,6 @@ int ocfs2_claim_local_alloc_bits(struct ocfs2_super *osb,
 
 	bitmap = la->la_bitmap;
 	*bit_off = le32_to_cpu(la->la_bm_off) + start;
-	/* local alloc is always contiguous by nature -- we never
-	 * delete bits from it! */
 	*num_bits = bits_wanted;
 
 	status = ocfs2_journal_access_di(handle,
@@ -687,6 +689,9 @@ int ocfs2_claim_local_alloc_bits(struct ocfs2_super *osb,
 		goto bail;
 	}
 
+	ocfs2_resmap_claimed_bits(&osb->osb_la_resmap, ac->ac_resv, start,
+				  bits_wanted);
+
 	while(bits_wanted--)
 		ocfs2_set_bit(start++, bitmap);
 
@@ -722,13 +727,17 @@ static u32 ocfs2_local_alloc_count_bits(struct ocfs2_dinode *alloc)
 }
 
 static int ocfs2_local_alloc_find_clear_bits(struct ocfs2_super *osb,
-					     struct ocfs2_dinode *alloc,
-					     u32 numbits)
+				     struct ocfs2_dinode *alloc,
+				     u32 *numbits,
+				     struct ocfs2_alloc_reservation *resv)
 {
 	int numfound, bitoff, left, startoff, lastzero;
+	int local_resv = 0;
+	struct ocfs2_alloc_reservation r;
 	void *bitmap = NULL;
+	struct ocfs2_reservation_map *resmap = &osb->osb_la_resmap;
 
-	mlog_entry("(numbits wanted = %u)\n", numbits);
+	mlog_entry("(numbits wanted = %u)\n", *numbits);
 
 	if (!alloc->id1.bitmap1.i_total) {
 		mlog(0, "No bits in my window!\n");
@@ -736,6 +745,30 @@ static int ocfs2_local_alloc_find_clear_bits(struct ocfs2_super *osb,
 		goto bail;
 	}
 
+	if (!resv) {
+		local_resv = 1;
+		ocfs2_resv_init_once(&r);
+		resv = &r;
+	}
+
+	numfound = *numbits;
+	if (ocfs2_resmap_resv_bits(resmap, resv, local_resv,
+				   &bitoff, &numfound) == 0) {
+		if (numfound < *numbits)
+			*numbits = numfound;
+		goto bail;
+	}
+
+	/*
+	 * Code error. While reservations are enabled, local
+	 * allocation should _always_ go through them.
+	 */
+	BUG_ON(osb->osb_resv_level != 0);
+
+	/*
+	 * Reservations are disabled. Handle this the old way.
+	 */
+
 	bitmap = OCFS2_LOCAL_ALLOC(alloc)->la_bitmap;
 
 	numfound = bitoff = startoff = 0;
@@ -761,7 +794,7 @@ static int ocfs2_local_alloc_find_clear_bits(struct ocfs2_super *osb,
 			startoff = bitoff+1;
 		}
 		/* we got everything we needed */
-		if (numfound == numbits) {
+		if (numfound == *numbits) {
 			/* mlog(0, "Found it all!\n"); */
 			break;
 		}
@@ -770,12 +803,18 @@ static int ocfs2_local_alloc_find_clear_bits(struct ocfs2_super *osb,
 	mlog(0, "Exiting loop, bitoff = %d, numfound = %d\n", bitoff,
 	     numfound);
 
-	if (numfound == numbits)
+	if (numfound == *numbits) {
 		bitoff = startoff - numfound;
-	else
+		*numbits = numfound;
+	} else {
+		numfound = 0;
 		bitoff = -1;
+	}
 
 bail:
+	if (local_resv)
+		ocfs2_resv_discard(resmap, resv);
+
 	mlog_exit(bitoff);
 	return bitoff;
 }
@@ -1096,6 +1135,9 @@ retry_enospc:
 	memset(OCFS2_LOCAL_ALLOC(alloc)->la_bitmap, 0,
 	       le16_to_cpu(la->la_size));
 
+	ocfs2_resmap_restart(&osb->osb_la_resmap, cluster_count,
+			     OCFS2_LOCAL_ALLOC(alloc)->la_bitmap);
+
 	mlog(0, "New window allocated:\n");
 	mlog(0, "window la_bm_off = %u\n",
 	     OCFS2_LOCAL_ALLOC(alloc)->la_bm_off);
diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h
index 740f448..e0c6d5e 100644
--- a/fs/ocfs2/ocfs2.h
+++ b/fs/ocfs2/ocfs2.h
@@ -46,6 +46,7 @@
 /* For struct ocfs2_blockcheck_stats */
 #include "blockcheck.h"
 
+#include "reservations.h"
 
 /* Caching of metadata buffers */
 
@@ -346,6 +347,10 @@ struct ocfs2_super
 
 	u64 la_last_gd;
 
+	struct ocfs2_reservation_map	osb_la_resmap;
+
+	unsigned int	osb_resv_level;
+
 	/* Next three fields are for local node slot recovery during
 	 * mount. */
 	int dirty;
diff --git a/fs/ocfs2/reservations.c b/fs/ocfs2/reservations.c
new file mode 100644
index 0000000..ecffb1c
--- /dev/null
+++ b/fs/ocfs2/reservations.c
@@ -0,0 +1,829 @@
+/* -*- mode: c; c-basic-offset: 8; -*-
+ * vim: noexpandtab sw=8 ts=8 sts=0:
+ *
+ * reservations.c
+ *
+ * Allocation reservations implementation
+ *
+ * Some code borrowed from fs/ext3/balloc.c and is:
+ *
+ * Copyright (C) 1992, 1993, 1994, 1995
+ * Remy Card (card at masi.ibp.fr)
+ * Laboratoire MASI - Institut Blaise Pascal
+ * Universite Pierre et Marie Curie (Paris VI)
+ *
+ * The rest is copyright (C) 2009 Novell.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+
+#include <linux/fs.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/highmem.h>
+#include <linux/bitops.h>
+#include <linux/list.h>
+
+#define MLOG_MASK_PREFIX ML_RESERVATIONS
+#include <cluster/masklog.h>
+
+#include "ocfs2.h"
+
+#ifdef CONFIG_OCFS2_DEBUG_FS
+#define OCFS2_CHECK_RESERVATIONS
+#endif
+
+#define OCFS2_CHECK_RESERVATIONS
+
+
+DEFINE_SPINLOCK(resv_lock);
+
+#define	OCFS2_MIN_RESV_WINDOW_BITS	8
+#define	OCFS2_MAX_RESV_WINDOW_BITS	1024
+
+static unsigned int ocfs2_resv_window_bits(struct ocfs2_reservation_map *resmap)
+{
+	struct ocfs2_super *osb = resmap->m_osb;
+
+	/* 8, 16, 32, 64, 128, 256, 512, 1024 */
+	return 4 << osb->osb_resv_level;
+}
+
+static inline unsigned int ocfs2_resv_end(struct ocfs2_alloc_reservation *resv)
+{
+	if (resv->r_len)
+		return resv->r_start + resv->r_len - 1;
+	return resv->r_start;
+}
+
+static inline int ocfs2_resv_empty(struct ocfs2_alloc_reservation *resv)
+{
+	return !!(resv->r_len == 0);
+}
+
+static inline int ocfs2_resmap_disabled(struct ocfs2_reservation_map *resmap)
+{
+	if (resmap->m_osb->osb_resv_level == 0)
+		return 1;
+	return 0;
+}
+
+static void ocfs2_dump_resv(struct ocfs2_reservation_map *resmap)
+{
+	struct ocfs2_super *osb = resmap->m_osb;
+	struct rb_node *node;
+	struct ocfs2_alloc_reservation *resv;
+	int i = 0;
+
+	mlog(ML_NOTICE, "Dumping resmap for device %s. Bitmap length: %u\n",
+	     osb->dev_str, resmap->m_bitmap_len);
+
+	node = rb_first(&resmap->m_reservations);
+	while (node) {
+		resv = rb_entry(node, struct ocfs2_alloc_reservation, r_node);
+
+		mlog(ML_NOTICE, "start: %u\tend: %u\tlen: %u\tlast_start: %u"
+		     "\tlast_len: %u\n", resv->r_start,
+		     ocfs2_resv_end(resv), resv->r_len, resv->r_last_start,
+		     resv->r_last_len);
+
+		node = rb_next(node);
+		i++;
+	}
+
+	mlog(ML_NOTICE, "%d reservations found. LRU follows\n", i);
+
+	i = 0;
+	list_for_each_entry(resv, &resmap->m_lru, r_lru) {
+		mlog(ML_NOTICE, "LRU(%d) start: %u\tend: %u\tlen: %u\t"
+		     "last_start: %u\tlast_len: %u\n", i, resv->r_start,
+		     ocfs2_resv_end(resv), resv->r_len, resv->r_last_start,
+		     resv->r_last_len);
+
+		i++;
+	}
+}
+
+#ifdef OCFS2_CHECK_RESERVATIONS
+static int ocfs2_validate_resmap_bits(struct ocfs2_reservation_map *resmap,
+				      int i,
+				      struct ocfs2_alloc_reservation *resv)
+{
+	char *disk_bitmap = resmap->m_disk_bitmap;
+	unsigned int start = resv->r_start;
+	unsigned int end = ocfs2_resv_end(resv);
+
+	while(start <= end) {
+		if (ocfs2_test_bit(start, disk_bitmap)) {
+			mlog(ML_ERROR,
+			     "reservation %d covers an allocated area "
+			     "starting@bit %u!\n", i, start);
+			return 1;
+		}
+
+		start++;
+	}
+	return 0;
+}
+
+static void ocfs2_check_resmap(struct ocfs2_reservation_map *resmap)
+{
+	unsigned int off = 0;
+	int i = 0;
+	struct rb_node *node;
+	struct ocfs2_alloc_reservation *resv;
+
+	node = rb_first(&resmap->m_reservations);
+	while (node) {
+		resv = rb_entry(node, struct ocfs2_alloc_reservation, r_node);
+
+		if (i > 0 && resv->r_start <= off) {
+			mlog(ML_ERROR, "reservation %d has bad start off!\n",
+			     i);
+			goto bad;
+		}
+
+		if (resv->r_len == 0) {
+			mlog(ML_ERROR, "reservation %d has no length!\n",
+			     i);
+			goto bad;
+		}
+
+		if (resv->r_start > ocfs2_resv_end(resv)) {
+			mlog(ML_ERROR, "reservation %d has invalid range!\n",
+			     i);
+			goto bad;
+		}
+
+		if (ocfs2_resv_end(resv) >= resmap->m_bitmap_len) {
+			mlog(ML_ERROR, "reservation %d extends past bitmap!\n",
+			     i);
+			goto bad;
+		}
+
+		if (ocfs2_validate_resmap_bits(resmap, i, resv))
+			goto bad;
+
+		off = ocfs2_resv_end(resv);
+		node = rb_next(node);
+
+		i++;
+	}
+	return;
+
+bad:
+	ocfs2_dump_resv(resmap);
+	BUG();
+}
+#else
+static inline void ocfs2_check_resmap(struct ocfs2_reservation_map *resmap)
+{
+
+}
+#endif
+
+void ocfs2_resv_init_once(struct ocfs2_alloc_reservation *resv)
+{
+	memset(resv, 0, sizeof(*resv));
+	INIT_LIST_HEAD(&resv->r_lru);
+}
+
+int ocfs2_resmap_init(struct ocfs2_super *osb,
+		      struct ocfs2_reservation_map *resmap)
+{
+	memset(resmap, 0, sizeof(*resmap));
+
+	resmap->m_osb = osb;
+	resmap->m_reservations = RB_ROOT;
+	/* m_bitmap_len is initialized to zero by the above memset. */
+	INIT_LIST_HEAD(&resmap->m_lru);
+
+	return 0;
+}
+
+static void ocfs2_resv_mark_lru(struct ocfs2_reservation_map *resmap,
+				struct ocfs2_alloc_reservation *resv)
+{
+	assert_spin_locked(&resv_lock);
+
+	if (!list_empty(&resv->r_lru))
+		list_del_init(&resv->r_lru);
+
+	list_add_tail(&resv->r_lru, &resmap->m_lru);
+}
+
+static void __ocfs2_resv_trunc(struct ocfs2_alloc_reservation *resv)
+{
+	resv->r_len = 0;
+	resv->r_start = 0;
+}
+
+static void ocfs2_resv_remove(struct ocfs2_reservation_map *resmap,
+			      struct ocfs2_alloc_reservation *resv)
+{
+	if (resv->r_inuse) {
+		list_del_init(&resv->r_lru);
+		rb_erase(&resv->r_node, &resmap->m_reservations);
+		resv->r_inuse = 0;
+	}
+}
+
+static void __ocfs2_resv_discard(struct ocfs2_reservation_map *resmap,
+				 struct ocfs2_alloc_reservation *resv)
+{
+	assert_spin_locked(&resv_lock);
+
+	__ocfs2_resv_trunc(resv);
+	/*
+	 * last_len and last_start no longer make sense if
+	 * we're changing the range of our allocations.
+	 */
+	resv->r_last_len = resv->r_last_start = 0;
+
+	ocfs2_resv_remove(resmap, resv);
+}
+
+/* does nothing if 'resv' is null */
+void ocfs2_resv_discard(struct ocfs2_reservation_map *resmap,
+			struct ocfs2_alloc_reservation *resv)
+{
+	if (resv) {
+		spin_lock(&resv_lock);
+		__ocfs2_resv_discard(resmap, resv);
+		spin_unlock(&resv_lock);
+	}
+}
+
+static void ocfs2_resmap_clear_all_resv(struct ocfs2_reservation_map *resmap)
+{
+	struct rb_node *node;
+	struct ocfs2_alloc_reservation *resv;
+
+	assert_spin_locked(&resv_lock);
+
+	while ((node = rb_last(&resmap->m_reservations)) != NULL) {
+		resv = rb_entry(node, struct ocfs2_alloc_reservation, r_node);
+
+		__ocfs2_resv_discard(resmap, resv);
+	}
+}
+
+void ocfs2_resmap_restart(struct ocfs2_reservation_map *resmap,
+			  unsigned int clen, char *disk_bitmap)
+{
+	if (ocfs2_resmap_disabled(resmap))
+		return;
+
+	spin_lock(&resv_lock);
+
+	ocfs2_resmap_clear_all_resv(resmap);
+	resmap->m_bitmap_len = clen;
+	resmap->m_disk_bitmap = disk_bitmap;
+
+	spin_unlock(&resv_lock);
+}
+
+void ocfs2_resmap_uninit(struct ocfs2_reservation_map *resmap)
+{
+	/* Does nothing for now. Keep this around for API symmetry */
+}
+
+static void ocfs2_resv_insert(struct ocfs2_reservation_map *resmap,
+			      struct ocfs2_alloc_reservation *new)
+{
+	struct rb_root *root = &resmap->m_reservations;
+	struct rb_node *parent = NULL;
+	struct rb_node **p = &root->rb_node;
+	struct ocfs2_alloc_reservation *tmp;
+
+	assert_spin_locked(&resv_lock);
+
+	mlog(0, "Insert reservation start: %u len: %u\n", new->r_start,
+	     new->r_len);
+
+	while(*p) {
+		parent = *p;
+
+		tmp = rb_entry(parent, struct ocfs2_alloc_reservation, r_node);
+
+		if (new->r_start < tmp->r_start) {
+			p = &(*p)->rb_left;
+
+			/*
+			 * This is a good place to check for
+			 * overlapping reservations.
+			 */
+			BUG_ON(ocfs2_resv_end(new) >= tmp->r_start);
+		} else if (new->r_start > ocfs2_resv_end(tmp)) {
+			p = &(*p)->rb_right;
+		} else {
+			/* This should never happen! */
+			mlog(ML_ERROR, "Duplicate reservation window!\n");
+			BUG();
+		}
+	}
+
+	rb_link_node(&new->r_node, parent, p);
+	rb_insert_color(&new->r_node, root);
+	new->r_inuse = 1;
+
+	ocfs2_resv_mark_lru(resmap, new);
+
+	ocfs2_check_resmap(resmap);
+}
+
+/**
+ * ocfs2_find_resv_lhs() - find the window which contains goal
+ * @resmap: reservation map to search
+ * @goal: which bit to search for
+ *
+ * If a window containing that goal is not found, we return the window
+ * which comes before goal. Returns NULL on empty rbtree or no window
+ * before goal.
+ */
+static struct ocfs2_alloc_reservation *
+ocfs2_find_resv_lhs(struct ocfs2_reservation_map *resmap, unsigned int goal)
+{
+	struct ocfs2_alloc_reservation *resv = NULL;
+	struct ocfs2_alloc_reservation *prev_resv = NULL;
+	struct rb_node *node = resmap->m_reservations.rb_node;
+	struct rb_node *prev = NULL;
+
+	assert_spin_locked(&resv_lock);
+
+	if (!node)
+		return NULL;
+
+	node = rb_first(&resmap->m_reservations);
+	while (node) {
+		resv = rb_entry(node, struct ocfs2_alloc_reservation, r_node);
+
+		if (resv->r_start <= goal && ocfs2_resv_end(resv) >= goal)
+			break;
+
+		/* Check if we overshot the reservation just before goal? */
+		if (resv->r_start > goal) {
+			resv = prev_resv;
+			break;
+		}
+
+		prev_resv = resv;
+		prev = node;
+		node = rb_next(node);
+	}
+
+	return resv;
+}
+
+/*
+ * We are given a range within the bitmap, which corresponds to a gap
+ * inside the reservations tree (search_start, search_len). The range
+ * can be anything from the whole bitmap, to a gap between
+ * reservations.
+ *
+ * The start value of *rstart is insignificant.
+ *
+ * This function searches the bitmap range starting at search_start
+ * with length search_len for a set of contiguous free bits. We try
+ * to find up to 'wanted' bits, but can sometimes return fewer.
+ *
+ * Returns the length of allocation, 0 if no free bits are found.
+ *
+ * *rstart and *rlen will also be populated with the result.
+ */
+static int ocfs2_resmap_find_free_bits(struct ocfs2_reservation_map *resmap,
+				       unsigned int wanted,
+				       unsigned int search_start,
+				       unsigned int search_len,
+				       unsigned int *rstart,
+				       unsigned int *rlen)
+{
+	void *bitmap = resmap->m_disk_bitmap;
+	unsigned int best_start, best_len = 0;
+	int offset, start, found;
+
+	mlog(0, "Find %u bits within range (%u, len %u) resmap len: %u\n",
+	     wanted, search_start, search_len, resmap->m_bitmap_len);
+
+	found = best_start = best_len = 0;
+
+	start = search_start;
+	while((offset = ocfs2_find_next_zero_bit(bitmap, resmap->m_bitmap_len,
+						 start)) != -1) {
+		/* Search reached end of the region */
+		if (offset >= (search_start + search_len))
+			break;
+
+		if (offset == start) {
+			/* we found a zero */
+			found++;
+			/* move start to the next bit to test */
+			start++;
+		} else {
+			/* got a zero after some ones */
+			found = 1;
+			start = offset + 1;
+		}
+		if (found > best_len) {
+			best_len = found;
+			best_start = start - found;
+		}
+
+		if (found >= wanted)
+			break;
+	}
+
+	if (best_len == 0)
+		return 0;
+
+	if (best_len >= wanted)
+		best_len = wanted;
+
+	*rlen = best_len;
+	*rstart = best_start;
+
+	mlog(0, "Found start: %u len: %u\n", best_start, best_len);
+
+	return *rlen;
+}
+
+static void __ocfs2_resv_find_window(struct ocfs2_reservation_map *resmap,
+				     struct ocfs2_alloc_reservation *resv,
+				     unsigned int goal, unsigned int wanted)
+{
+	struct rb_root *root = &resmap->m_reservations;
+	unsigned int gap_start, gap_end, gap_len;
+	struct ocfs2_alloc_reservation *prev_resv, *next_resv;
+	struct rb_node *prev, *next;
+	unsigned int cstart, clen;
+	unsigned int best_start = 0, best_len = 0;
+
+	/*
+	 * Nasty cases to consider:
+	 *
+	 * - rbtree is empty
+	 * - our window should be first in all reservations
+	 * - our window should be last in all reservations
+	 * - need to make sure we don't go past end of bitmap
+	 */
+
+	mlog(0, "resv start: %u resv end: %u goal: %u wanted: %u\n",
+	     resv->r_start, ocfs2_resv_end(resv), goal, wanted);
+
+	assert_spin_locked(&resv_lock);
+
+	if (RB_EMPTY_ROOT(root)) {
+		/*
+		 * Easiest case - empty tree. We can just take
+		 * whatever window of free bits we want.
+		 */
+
+		mlog(0, "Empty root\n");
+
+		clen = ocfs2_resmap_find_free_bits(resmap, wanted, goal,
+						   resmap->m_bitmap_len - goal,
+						   &cstart, &clen);
+
+		/*
+		 * This should never happen - the local alloc window
+		 * will always have free bits when we're called.
+		 */
+		BUG_ON(goal == 0 && clen == 0);
+
+		if (clen == 0)
+			return;
+
+		resv->r_start = cstart;
+		resv->r_len = clen;
+
+		ocfs2_resv_insert(resmap, resv);
+		return;
+	}
+
+	prev_resv = ocfs2_find_resv_lhs(resmap, goal);
+
+	if (prev_resv == NULL) {
+		mlog(0, "Goal on LHS of leftmost window\n");
+
+		/*
+		 * A NULL here means that the search code couldn't
+		 * find a window that starts before goal.
+		 *
+		 * However, we can take the first window after goal,
+		 * which is also by definition, the leftmost window in
+		 * the entire tree. If we can find free bits in the
+		 * gap between goal and the LHS window, then the
+		 * reservation can safely be placed there.
+		 *
+		 * Otherwise we fall back to a linear search, checking
+		 * the gaps in between windows for a place to
+		 * allocate.
+		 */
+
+		next = rb_first(root);
+		next_resv = rb_entry(next, struct ocfs2_alloc_reservation,
+				     r_node);
+
+		/*
+		 * The search should never return such a window (see
+		 * comment above).
+		 */
+		if (next_resv->r_start <= goal) {
+			mlog(ML_ERROR, "goal: %u next_resv: start %u len %u\n",
+			     goal, next_resv->r_start, next_resv->r_len);
+			ocfs2_dump_resv(resmap);
+			BUG();
+		}
+
+		clen = ocfs2_resmap_find_free_bits(resmap, wanted, goal,
+						   next_resv->r_start - goal,
+						   &cstart, &clen);
+		if (clen) {
+			best_len = clen;
+			best_start = cstart;
+			if (best_len == wanted)
+				goto out_insert;
+		}
+
+		prev_resv = next_resv;
+		next_resv = NULL;
+	}
+
+	prev = &prev_resv->r_node;
+
+	/* Now we do a linear search for a window, starting at 'prev_resv' */
+	while (1) {
+		next = rb_next(prev);
+		if (next) {
+			mlog(0, "One more resv found in linear search\n");
+			next_resv = rb_entry(next,
+					     struct ocfs2_alloc_reservation,
+					     r_node);
+
+			gap_start = ocfs2_resv_end(prev_resv) + 1;
+			gap_end = next_resv->r_start - 1;
+			gap_len = gap_end - gap_start + 1;
+		} else {
+			mlog(0, "No next node\n");
+			/*
+			 * We're at the rightmost edge of the
+			 * tree. See if a reservation between this
+			 * window and the end of the bitmap will work.
+			 */
+			gap_start = ocfs2_resv_end(prev_resv) + 1;
+			gap_len = resmap->m_bitmap_len - gap_start;
+			gap_end = resmap->m_bitmap_len - 1;
+		}
+
+		/*
+		 * No need to check this gap if we have already found
+		 * a larger region of free bits.
+		 */
+		if (gap_len <= best_len)
+			goto next_resv;
+
+		clen = ocfs2_resmap_find_free_bits(resmap, wanted, gap_start,
+						   gap_len, &cstart, &clen);
+		if (clen == wanted) {
+			best_len = clen;
+			best_start = cstart;
+			goto out_insert;
+		} else if (clen > best_len) {
+			best_len = clen;
+			best_start = cstart;
+		}
+
+next_resv:
+		if (!next)
+			break;
+
+		prev = next;
+		prev_resv = rb_entry(prev, struct ocfs2_alloc_reservation,
+				     r_node);
+	}
+
+out_insert:
+	if (best_len) {
+		resv->r_start = best_start;
+		resv->r_len = best_len;
+		ocfs2_resv_insert(resmap, resv);
+	}
+}
+
+static void ocfs2_cannibalize_resv(struct ocfs2_reservation_map *resmap,
+				   struct ocfs2_alloc_reservation *resv,
+				   unsigned int wanted)
+{
+	struct ocfs2_alloc_reservation *lru_resv;
+	unsigned int min_bits = ocfs2_resv_window_bits(resmap) >> 1;
+
+	/*
+	 * Take the first reservation off the LRU as our 'target'. We
+	 * don't try to be smart about it. There might be a case for
+	 * searching based on size but I don't have enough data to be
+	 * sure. --Mark (3/16/2010)
+	 */
+	lru_resv = list_first_entry(&resmap->m_lru,
+				    struct ocfs2_alloc_reservation, r_lru);
+
+	mlog(0, "lru resv: start: %u len: %u end: %u\n", lru_resv->r_start,
+	     lru_resv->r_len, ocfs2_resv_end(lru_resv));
+
+	/*
+	 * Cannibalize (some or all) of the target reservation and
+	 * feed it to the current window.
+	 */
+	if (lru_resv->r_len <= min_bits) {
+		/*
+		 * Discard completely if size is less than or equal to a
+		 * reasonable threshold - 50% of window bits.
+		 */
+		resv->r_start = lru_resv->r_start;
+		resv->r_len = lru_resv->r_len;
+
+		__ocfs2_resv_discard(resmap, lru_resv);
+	} else {
+		unsigned int shrink = lru_resv->r_len / 2;
+
+		lru_resv->r_len -= shrink;
+
+		resv->r_start = ocfs2_resv_end(lru_resv) + 1;
+		resv->r_len = shrink;
+	}
+
+	mlog(0, "Reservation now looks like: r_start: %u r_end: %u "
+	     "r_len: %u r_last_start: %u r_last_len: %u\n",
+	     resv->r_start, ocfs2_resv_end(resv), resv->r_len,
+	     resv->r_last_start, resv->r_last_len);
+
+	ocfs2_resv_insert(resmap, resv);
+}
+
+static void ocfs2_resv_find_window(struct ocfs2_reservation_map *resmap,
+				   struct ocfs2_alloc_reservation *resv,
+				   unsigned int wanted)
+{
+	unsigned int goal = 0;
+
+	BUG_ON(!ocfs2_resv_empty(resv));
+
+	/*
+	 * Begin by trying to get a window as close to the previous
+	 * one as possible. Using the most recent allocation as a
+	 * start goal makes sense.
+	 */
+	if (resv->r_last_len) {
+		goal = resv->r_last_start + resv->r_last_len;
+		if (goal >= resmap->m_bitmap_len)
+			goal = 0;
+	}
+
+	__ocfs2_resv_find_window(resmap, resv, goal, wanted);
+
+	/* Search from last alloc didn't work, try once more from beginning. */
+	if (ocfs2_resv_empty(resv) && goal != 0)
+		__ocfs2_resv_find_window(resmap, resv, 0, wanted);
+
+	if (ocfs2_resv_empty(resv)) {
+		/*
+		 * Still empty? Pull oldest one off the LRU, remove it from
+		 * tree, put this one in its place.
+		 */
+		ocfs2_cannibalize_resv(resmap, resv, wanted);
+	}
+
+	BUG_ON(ocfs2_resv_empty(resv));
+}
+
+int ocfs2_resmap_resv_bits(struct ocfs2_reservation_map *resmap,
+			   struct ocfs2_alloc_reservation *resv,
+			   int tmpwindow, int *cstart, int *clen)
+{
+	unsigned int wanted = *clen;
+
+	if (resv == NULL || ocfs2_resmap_disabled(resmap))
+		return -ENOSPC;
+
+	spin_lock(&resv_lock);
+
+	/*
+	 * We don't want to over-allocate for temporary
+	 * windows. Otherwise, we run the risk of fragmenting the
+	 * allocation space.
+	 */
+	wanted = ocfs2_resv_window_bits(resmap);
+	if (tmpwindow || wanted < *clen)
+		wanted = *clen;
+
+	if (ocfs2_resv_empty(resv)) {
+		mlog(0, "empty reservation, find new window\n");
+
+		/*
+		 * Try to get a window here. If it works, we must fall
+		 * through and test the bitmap. This avoids some
+		 * ping-ponging of windows due to non-reserved space
+		 * being allocated before we initialize a window for
+		 * that inode.
+		 */
+		ocfs2_resv_find_window(resmap, resv, wanted);
+	}
+
+	BUG_ON(ocfs2_resv_empty(resv));
+
+	*cstart = resv->r_start;
+	*clen = resv->r_len;
+
+	spin_unlock(&resv_lock);
+	return 0;
+}
+
+static void
+	ocfs2_adjust_resv_from_alloc(struct ocfs2_reservation_map *resmap,
+				     struct ocfs2_alloc_reservation *resv,
+				     unsigned int start, unsigned int end)
+{
+	unsigned int lhs = 0, rhs = 0;
+
+	BUG_ON(start < resv->r_start);
+
+	/*
+	 * Completely used? We can remove it then.
+	 */
+	if (ocfs2_resv_end(resv) <= end && resv->r_start >= start) {
+		__ocfs2_resv_discard(resmap, resv);
+		return;
+	}
+
+	if (end < ocfs2_resv_end(resv))
+		rhs = ocfs2_resv_end(resv) - end;
+
+	if (start > resv->r_start)
+		lhs = start - resv->r_start;
+
+	/*
+	 * This should have been trapped above. At the very least, rhs
+	 * should be non zero.
+	 */
+	BUG_ON(rhs == 0 && lhs == 0);
+
+	if (rhs >= lhs) {
+		unsigned int old_end = ocfs2_resv_end(resv);
+
+		resv->r_start = end + 1;
+		resv->r_len = old_end - resv->r_start + 1;
+	} else {
+		resv->r_len = start - resv->r_start;
+	}
+}
+
+void ocfs2_resmap_claimed_bits(struct ocfs2_reservation_map *resmap,
+			       struct ocfs2_alloc_reservation *resv,
+			       u32 cstart, u32 clen)
+{
+	unsigned int cend = cstart + clen - 1;
+
+	if (resmap == NULL || ocfs2_resmap_disabled(resmap))
+		return;
+
+	if (resv == NULL)
+		return;
+
+	spin_lock(&resv_lock);
+
+	mlog(0, "claim bits: cstart: %u cend: %u clen: %u r_start: %u "
+	     "r_end: %u r_len: %u, r_last_start: %u r_last_len: %u\n",
+	     cstart, cend, clen, resv->r_start, ocfs2_resv_end(resv),
+	     resv->r_len, resv->r_last_start, resv->r_last_len);
+
+	BUG_ON(cstart < resv->r_start);
+	BUG_ON(cstart > ocfs2_resv_end(resv));
+	BUG_ON(cend > ocfs2_resv_end(resv));
+
+	ocfs2_adjust_resv_from_alloc(resmap, resv, cstart, cend);
+	resv->r_last_start = cstart;
+	resv->r_last_len = clen;
+
+	/*
+	 * May have been discarded above from
+	 * ocfs2_adjust_resv_from_alloc().
+	 */
+	if (!ocfs2_resv_empty(resv))
+		ocfs2_resv_mark_lru(resmap, resv);
+
+	mlog(0, "Reservation now looks like: r_start: %u r_end: %u "
+	     "r_len: %u r_last_start: %u r_last_len: %u\n",
+	     resv->r_start, ocfs2_resv_end(resv), resv->r_len,
+	     resv->r_last_start, resv->r_last_len);
+
+	ocfs2_check_resmap(resmap);
+
+	spin_unlock(&resv_lock);
+}
diff --git a/fs/ocfs2/reservations.h b/fs/ocfs2/reservations.h
new file mode 100644
index 0000000..02c460b
--- /dev/null
+++ b/fs/ocfs2/reservations.h
@@ -0,0 +1,154 @@
+/* -*- mode: c; c-basic-offset: 8; -*-
+ * vim: noexpandtab sw=8 ts=8 sts=0:
+ *
+ * reservations.h
+ *
+ * Allocation reservations function prototypes and structures.
+ *
+ * Copyright (C) 2009 Novell.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+
+#ifndef	OCFS2_RESERVATIONS_H
+#define	OCFS2_RESERVATIONS_H
+
+#include <linux/rbtree.h>
+
+#define OCFS2_DEFAULT_RESV_LEVEL	4
+#define OCFS2_MAX_RESV_LEVEL	9
+#define OCFS2_MIN_RESV_LEVEL	0
+
+struct ocfs2_alloc_reservation {
+	struct rb_node	r_node;
+
+	unsigned int	r_start;	/* Beginning of current window */
+	unsigned int	r_len;		/* Length of the window */
+
+	unsigned int	r_last_len;	/* Length of most recent alloc */
+	unsigned int	r_last_start;	/* Start of most recent alloc */
+	struct list_head	r_lru;	/* LRU list head */
+
+	int		r_inuse;	/* r_inuse is set when r_node
+					 * is part of an rbtree. */
+};
+
+struct ocfs2_reservation_map {
+	struct rb_root		m_reservations;
+	char			*m_disk_bitmap;
+
+	struct ocfs2_super	*m_osb;
+
+	/* The following are not initialized to meaningful values until a disk
+	 * bitmap is provided. */
+	u32			m_bitmap_len;	/* Number of valid
+						 * bits available */
+
+	struct list_head	m_lru;		/* LRU of reservations
+						 * structures. */
+
+};
+
+void ocfs2_resv_init_once(struct ocfs2_alloc_reservation *resv);
+
+/**
+ * ocfs2_resv_discard() - truncate a reservation
+ * @resmap: reservations bitmap the reservation belongs to
+ * @resv: the reservation to truncate.
+ *
+ * After this function is called, the reservation will be empty, and
+ * unlinked from the rbtree.
+ */
+void ocfs2_resv_discard(struct ocfs2_reservation_map *resmap,
+			struct ocfs2_alloc_reservation *resv);
+
+
+/**
+ * ocfs2_resmap_init() - Initialize fields of a reservations bitmap
+ * @osb: struct ocfs2_super to be saved in resmap
+ * @resmap: struct ocfs2_reservation_map to initialize
+ *
+ * Currently always returns 0.
+ */
+int ocfs2_resmap_init(struct ocfs2_super *osb,
+		      struct ocfs2_reservation_map *resmap);
+
+/**
+ * ocfs2_resmap_restart() - "restart" a reservation bitmap
+ * @resmap: reservations bitmap
+ * @clen: Number of valid bits in the bitmap
+ * @disk_bitmap: the disk bitmap this resmap should refer to.
+ *
+ * Re-initialize the parameters of a reservation bitmap. This is
+ * useful for local alloc window slides.
+ *
+ * This function will discard all existing reservations. A future
+ * version will recalculate existing reservations based on the new
+ * bitmap.
+ */
+void ocfs2_resmap_restart(struct ocfs2_reservation_map *resmap,
+			  unsigned int clen, char *disk_bitmap);
+
+/**
+ * ocfs2_resmap_uninit() - uninitialize a reservation bitmap structure
+ * @resmap: the struct ocfs2_reservation_map to uninitialize
+ */
+void ocfs2_resmap_uninit(struct ocfs2_reservation_map *resmap);
+
+/**
+ * ocfs2_resmap_resv_bits() - Return still-valid reservation bits
+ * @resmap: reservations bitmap
+ * @resv: reservation to base search from
+ * @tmpwindow: the reservation will immediately be discarded
+ * @cstart: start of proposed allocation
+ * @clen: length (in clusters) of proposed allocation
+ *
+ * Using the reservation data from resv, this function will compare
+ * resmap and resmap->m_disk_bitmap to determine what part (if any) of
+ * the reservation window is still clear to use. If resv is empty,
+ * this function will try to allocate a window for it.
+ *
+ * On success, zero is returned and the valid allocation area is set in cstart
+ * and clen.
+ *
+ * Returns -ENOSPC if reservations are disabled.
+ */
+int ocfs2_resmap_resv_bits(struct ocfs2_reservation_map *resmap,
+			   struct ocfs2_alloc_reservation *resv,
+			   int tmpwindow, int *cstart, int *clen);
+
+/**
+ * ocfs2_resmap_claimed_bits() - Tell the reservation code that bits were used.
+ * @resmap: reservations bitmap
+ * @resv: optional reservation to recalculate based on new bitmap
+ * @cstart: start of allocation in clusters
+ * @clen: length of allocation in clusters.
+ *
+ * Tell the reservation code that bits were used to fulfill allocation in
+ * resmap. The bits don't have to have been part of any existing
+ * reservation. But we must always call this function when bits are claimed.
+ * Internally, the reservations code will use this information to mark the
+ * reservations bitmap. If resv is passed, its next allocation window will be
+ * calculated.
+ */
+void ocfs2_resmap_claimed_bits(struct ocfs2_reservation_map *resmap,
+			       struct ocfs2_alloc_reservation *resv,
+			       u32 cstart, u32 clen);
+
+#endif	/* OCFS2_RESERVATIONS_H */
diff --git a/fs/ocfs2/suballoc.h b/fs/ocfs2/suballoc.h
index 8c9a78a..5eb7753 100644
--- a/fs/ocfs2/suballoc.h
+++ b/fs/ocfs2/suballoc.h
@@ -54,6 +54,8 @@ struct ocfs2_alloc_context {
 	u64    ac_last_group;
 	u64    ac_max_block;  /* Highest block number to allocate. 0 is
 				 is the same as ~0 - unlimited */
+
+	struct ocfs2_alloc_reservation	*ac_resv;
 };
 
 void ocfs2_free_alloc_context(struct ocfs2_alloc_context *ac);
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 755cd49..a6f9556 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -94,6 +94,7 @@ struct mount_options
 	unsigned int	atime_quantum;
 	signed short	slot;
 	unsigned int	localalloc_opt;
+	unsigned int	resv_level;
 	char		cluster_stack[OCFS2_STACK_LABEL_LEN + 1];
 };
 
@@ -175,6 +176,7 @@ enum {
 	Opt_noacl,
 	Opt_usrquota,
 	Opt_grpquota,
+	Opt_resv_level,
 	Opt_err,
 };
 
@@ -201,6 +203,7 @@ static const match_table_t tokens = {
 	{Opt_noacl, "noacl"},
 	{Opt_usrquota, "usrquota"},
 	{Opt_grpquota, "grpquota"},
+	{Opt_resv_level, "resv_level=%u"},
 	{Opt_err, NULL}
 };
 
@@ -1026,6 +1029,7 @@ static int ocfs2_fill_super(struct super_block *sb, void *data, int silent)
 	osb->osb_commit_interval = parsed_options.commit_interval;
 	osb->local_alloc_default_bits = ocfs2_megabytes_to_clusters(sb, parsed_options.localalloc_opt);
 	osb->local_alloc_bits = osb->local_alloc_default_bits;
+	osb->osb_resv_level = parsed_options.resv_level;
 
 	status = ocfs2_verify_userspace_stack(osb, &parsed_options);
 	if (status)
@@ -1286,6 +1290,7 @@ static int ocfs2_parse_options(struct super_block *sb,
 	mopt->slot = OCFS2_INVALID_SLOT;
 	mopt->localalloc_opt = OCFS2_DEFAULT_LOCAL_ALLOC_SIZE;
 	mopt->cluster_stack[0] = '\0';
+	mopt->resv_level = OCFS2_DEFAULT_RESV_LEVEL;
 
 	if (!options) {
 		status = 1;
@@ -1429,6 +1434,17 @@ static int ocfs2_parse_options(struct super_block *sb,
 			mopt->mount_opt |= OCFS2_MOUNT_NO_POSIX_ACL;
 			mopt->mount_opt &= ~OCFS2_MOUNT_POSIX_ACL;
 			break;
+		case Opt_resv_level:
+			if (is_remount)
+				break;
+			if (match_int(&args[0], &option)) {
+				status = 0;
+				goto bail;
+			}
+			if (option >= OCFS2_MIN_RESV_LEVEL &&
+			    option < OCFS2_MAX_RESV_LEVEL)
+				mopt->resv_level = option;
+			break;
 		default:
 			mlog(ML_ERROR,
 			     "Unrecognized mount option \"%s\" "
@@ -1510,6 +1526,9 @@ static int ocfs2_show_options(struct seq_file *s, struct vfsmount *mnt)
 	else
 		seq_printf(s, ",noacl");
 
+	if (osb->osb_resv_level != OCFS2_DEFAULT_RESV_LEVEL)
+		seq_printf(s, ",resv_level=%d", osb->osb_resv_level);
+
 	return 0;
 }
 
@@ -2038,6 +2057,12 @@ static int ocfs2_initialize_super(struct super_block *sb,
 
 	init_waitqueue_head(&osb->osb_mount_event);
 
+	status = ocfs2_resmap_init(osb, &osb->osb_la_resmap);
+	if (status) {
+		mlog_errno(status);
+		goto bail;
+	}
+
 	osb->vol_label = kmalloc(OCFS2_MAX_VOL_LABEL_LEN, GFP_KERNEL);
 	if (!osb->vol_label) {
 		mlog(ML_ERROR, "unable to alloc vol label\n");
-- 
1.6.4.2
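
For readers following ocfs2_adjust_resv_from_alloc() in the patch above, the
window trimming rule can be modelled in userspace roughly as follows. This is
only a sketch under stated assumptions: struct window, window_end() and
trim_window() are hypothetical names that do not appear in the patch, and the
right-hand remainder is computed as "old window end minus allocation end",
which is an assumption about the intended meaning of rhs.

```c
#include <assert.h>

/* Hypothetical userspace stand-in for struct ocfs2_alloc_reservation. */
struct window {
	unsigned int start;	/* first bit of the window */
	unsigned int len;	/* number of bits; 0 means "empty" */
};

static unsigned int window_end(const struct window *w)
{
	return w->start + w->len - 1;
}

/*
 * Trim a window after bits [start, end] (inclusive) inside it were
 * allocated. Keep whichever unused side is larger; empty the window
 * if nothing is left over.
 */
static void trim_window(struct window *w, unsigned int start,
			unsigned int end)
{
	unsigned int old_end, lhs, rhs;

	assert(w->len > 0);
	old_end = window_end(w);
	assert(start >= w->start && start <= end && end <= old_end);

	lhs = start - w->start;	/* unused bits left of the allocation */
	rhs = old_end - end;	/* unused bits right of the allocation */

	if (lhs == 0 && rhs == 0) {
		/* Completely used. */
		w->start = 0;
		w->len = 0;
	} else if (rhs >= lhs) {
		w->start = end + 1;
		w->len = rhs;
	} else {
		w->len = lhs;
	}
}
```

With a 32-bit window starting at bit 100, claiming bits 100..107 leaves the
window at 108 with length 24, while claiming the tail 120..131 leaves the
first 20 bits in place.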


* [Ocfs2-devel] [PATCH 2/5] ocfs2: use allocation reservations during file write
  2010-03-17  6:59 [Ocfs2-devel] [PATCH 0/5] Ocfs2 allocation reservations Mark Fasheh
  2010-03-17  6:59 ` [Ocfs2-devel] [PATCH 1/5] ocfs2: " Mark Fasheh
@ 2010-03-17  6:59 ` Mark Fasheh
  2010-03-17  6:59 ` [Ocfs2-devel] [PATCH 3/5] ocfs2: use allocation reservations for directory data Mark Fasheh
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Mark Fasheh @ 2010-03-17  6:59 UTC (permalink / raw)
  To: ocfs2-devel

Add a per-inode reservations structure and pass it through to the
reservations code.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/alloc.c |    2 ++
 fs/ocfs2/aops.c  |    3 +++
 fs/ocfs2/file.c  |    3 +++
 fs/ocfs2/inode.c |    4 ++++
 fs/ocfs2/inode.h |    2 ++
 fs/ocfs2/super.c |    2 ++
 6 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index d17bdc7..b8f0744 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -7307,6 +7307,8 @@ int ocfs2_convert_inline_data_to_extents(struct inode *inode,
 		}
 		did_quota = 1;
 
+		data_ac->ac_resv = &OCFS2_I(inode)->ip_la_data_resv;
+
 		ret = ocfs2_claim_clusters(osb, handle, data_ac, 1, &bit_off,
 					   &num);
 		if (ret) {
diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 7e9df11..e70f7f8 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -1734,6 +1734,9 @@ int ocfs2_write_begin_nolock(struct address_space *mapping,
 			goto out;
 		}
 
+		if (data_ac)
+			data_ac->ac_resv = &OCFS2_I(inode)->ip_la_data_resv;
+
 		credits = ocfs2_calc_extend_credits(inode->i_sb,
 						    &di->id2.i_list,
 						    clusters_to_alloc);
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 558ce03..ff928a9 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -485,6 +485,9 @@ static int ocfs2_truncate_file(struct inode *inode,
 
 	down_write(&OCFS2_I(inode)->ip_alloc_sem);
 
+	ocfs2_resv_discard(&osb->osb_la_resmap,
+			   &OCFS2_I(inode)->ip_la_data_resv);
+
 	/*
 	 * The inode lock forced other nodes to sync and drop their
 	 * pages, which (correctly) happens even if we have a truncate
diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
index 88459bd..2a00c2d 100644
--- a/fs/ocfs2/inode.c
+++ b/fs/ocfs2/inode.c
@@ -1097,6 +1097,10 @@ void ocfs2_clear_inode(struct inode *inode)
 	ocfs2_mark_lockres_freeing(&oi->ip_inode_lockres);
 	ocfs2_mark_lockres_freeing(&oi->ip_open_lockres);
 
+	ocfs2_resv_discard(&OCFS2_SB(inode->i_sb)->osb_la_resmap,
+			   &oi->ip_la_data_resv);
+	ocfs2_resv_init_once(&oi->ip_la_data_resv);
+
 	/* We very well may get a clear_inode before all an inodes
 	 * metadata has hit disk. Of course, we can't drop any cluster
 	 * locks until the journal has finished with it. The only
diff --git a/fs/ocfs2/inode.h b/fs/ocfs2/inode.h
index ba4fe07..e45edca 100644
--- a/fs/ocfs2/inode.h
+++ b/fs/ocfs2/inode.h
@@ -70,6 +70,8 @@ struct ocfs2_inode_info
 	/* Only valid if the inode is the dir. */
 	u32				ip_last_used_slot;
 	u64				ip_last_used_group;
+
+	struct ocfs2_alloc_reservation	ip_la_data_resv;
 };
 
 /*
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index a6f9556..db354d1 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -1703,6 +1703,8 @@ static void ocfs2_inode_init_once(void *data)
 	oi->ip_blkno = 0ULL;
 	oi->ip_clusters = 0;
 
+	ocfs2_resv_init_once(&oi->ip_la_data_resv);
+
 	ocfs2_lock_res_init_once(&oi->ip_rw_lockres);
 	ocfs2_lock_res_init_once(&oi->ip_inode_lockres);
 	ocfs2_lock_res_init_once(&oi->ip_open_lockres);
-- 
1.6.4.2


* [Ocfs2-devel] [PATCH 3/5] ocfs2: use allocation reservations for directory data
  2010-03-17  6:59 [Ocfs2-devel] [PATCH 0/5] Ocfs2 allocation reservations Mark Fasheh
  2010-03-17  6:59 ` [Ocfs2-devel] [PATCH 1/5] ocfs2: " Mark Fasheh
  2010-03-17  6:59 ` [Ocfs2-devel] [PATCH 2/5] ocfs2: use allocation reservations during file write Mark Fasheh
@ 2010-03-17  6:59 ` Mark Fasheh
  2010-03-19 22:43   ` Joel Becker
  2010-03-17  6:59 ` [Ocfs2-devel] [PATCH 4/5] ocfs2: allocate btree internal block groups from the global bitmap Mark Fasheh
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Mark Fasheh @ 2010-03-17  6:59 UTC (permalink / raw)
  To: ocfs2-devel

Use the reservations system for unindexed dir tree allocations. We
don't bother with the indexed tree as reads from it are mostly random
anyway. By default this behavior is turned off and can be turned on
via mount option 'dir_resv'. Workloads which create many large
directories will want to turn it on.

A future improvement will be to use a different window size for
directory inodes. Once done, we should change the default for this
feature to be enabled.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 Documentation/filesystems/ocfs2.txt |    2 ++
 fs/ocfs2/dir.c                      |    4 ++++
 fs/ocfs2/ocfs2.h                    |    2 ++
 fs/ocfs2/suballoc.c                 |    1 +
 fs/ocfs2/super.c                    |   15 +++++++++++++++
 5 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/Documentation/filesystems/ocfs2.txt b/Documentation/filesystems/ocfs2.txt
index 412df90..40fc8d1 100644
--- a/Documentation/filesystems/ocfs2.txt
+++ b/Documentation/filesystems/ocfs2.txt
@@ -83,3 +83,5 @@ noacl		(*)	Disables POSIX Access Control Lists support.
 resv_level=4	(*)	Set how agressive allocation reservations will be.
 			Valid values are between 0 (reservations off) to 8
 			(maximum space for reservations).
+no_dir_resv	(*)	Don't get allocation reservations on directory inodes.
+dir_resv		Get allocation reservations on directory inodes.
diff --git a/fs/ocfs2/dir.c b/fs/ocfs2/dir.c
index 28c3ec2..e74bd18 100644
--- a/fs/ocfs2/dir.c
+++ b/fs/ocfs2/dir.c
@@ -2993,6 +2993,8 @@ static int ocfs2_expand_inline_dir(struct inode *dir, struct buffer_head *di_bh,
 	 * if we only get one now, that's enough to continue. The rest
 	 * will be claimed after the conversion to extents.
 	 */
+	if (osb->s_mount_opt & OCFS2_MOUNT_DIR_RESV)
+		data_ac->ac_resv = &oi->ip_la_data_resv;
 	ret = ocfs2_claim_clusters(osb, handle, data_ac, 1, &bit_off, &len);
 	if (ret) {
 		mlog_errno(ret);
@@ -3371,6 +3373,8 @@ static int ocfs2_extend_dir(struct ocfs2_super *osb,
 				mlog_errno(status);
 			goto bail;
 		}
+		if (osb->s_mount_opt & OCFS2_MOUNT_DIR_RESV)
+			data_ac->ac_resv = &OCFS2_I(dir)->ip_la_data_resv;
 
 		credits = ocfs2_calc_extend_credits(sb, el, 1);
 	} else {
diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h
index e0c6d5e..c3e3758 100644
--- a/fs/ocfs2/ocfs2.h
+++ b/fs/ocfs2/ocfs2.h
@@ -255,6 +255,8 @@ enum ocfs2_mount_options
 						   control lists */
 	OCFS2_MOUNT_USRQUOTA = 1 << 10, /* We support user quotas */
 	OCFS2_MOUNT_GRPQUOTA = 1 << 11, /* We support group quotas */
+	OCFS2_MOUNT_DIR_RESV	= 1 << 12, /* Get reservations
+					    * on directories */
 };
 
 #define OCFS2_OSB_SOFT_RO			0x0001
diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
index c30b644..7b76926 100644
--- a/fs/ocfs2/suballoc.c
+++ b/fs/ocfs2/suballoc.c
@@ -137,6 +137,7 @@ void ocfs2_free_ac_resource(struct ocfs2_alloc_context *ac)
 	}
 	brelse(ac->ac_bh);
 	ac->ac_bh = NULL;
+	ac->ac_resv = NULL;
 }
 
 void ocfs2_free_alloc_context(struct ocfs2_alloc_context *ac)
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index db354d1..b1bc5d2 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -177,6 +177,8 @@ enum {
 	Opt_usrquota,
 	Opt_grpquota,
 	Opt_resv_level,
+	Opt_dir_resv,
+	Opt_no_dir_resv,
 	Opt_err,
 };
 
@@ -204,6 +206,8 @@ static const match_table_t tokens = {
 	{Opt_usrquota, "usrquota"},
 	{Opt_grpquota, "grpquota"},
 	{Opt_resv_level, "resv_level=%u"},
+	{Opt_dir_resv, "dir_resv"},
+	{Opt_no_dir_resv, "no_dir_resv"},
 	{Opt_err, NULL}
 };
 
@@ -1445,6 +1449,12 @@ static int ocfs2_parse_options(struct super_block *sb,
 			    option < OCFS2_MAX_RESV_LEVEL)
 				mopt->resv_level = option;
 			break;
+		case Opt_dir_resv:
+			mopt->mount_opt |= OCFS2_MOUNT_DIR_RESV;
+			break;
+		case Opt_no_dir_resv:
+			mopt->mount_opt &= ~OCFS2_MOUNT_DIR_RESV;
+			break;
 		default:
 			mlog(ML_ERROR,
 			     "Unrecognized mount option \"%s\" "
@@ -1529,6 +1539,11 @@ static int ocfs2_show_options(struct seq_file *s, struct vfsmount *mnt)
 	if (osb->osb_resv_level != OCFS2_DEFAULT_RESV_LEVEL)
 		seq_printf(s, ",resv_level=%d", osb->osb_resv_level);
 
+	if (opts & OCFS2_MOUNT_DIR_RESV)
+		seq_printf(s, ",dir_resv");
+	else
+		seq_printf(s, ",no_dir_resv");
+
 	return 0;
 }
 
-- 
1.6.4.2
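
A note on the resv_level range that shows up in the context lines above: the
documentation says valid values run from 0 to 8, while OCFS2_MAX_RESV_LEVEL
is 9. The two agree because the parser compares with "<", so the maximum is
exclusive. A minimal userspace sketch of that check, assuming a hypothetical
helper pick_resv_level() (not part of the patch) and local copies of the
constants:

```c
#include <assert.h>

#define DEFAULT_RESV_LEVEL	4	/* mirrors OCFS2_DEFAULT_RESV_LEVEL */
#define MAX_RESV_LEVEL		9	/* mirrors OCFS2_MAX_RESV_LEVEL (exclusive) */
#define MIN_RESV_LEVEL		0	/* mirrors OCFS2_MIN_RESV_LEVEL */

/*
 * Hypothetical helper: return the level to use for a parsed option
 * value. Out-of-range values are ignored and the current level kept,
 * which matches what the posted option parser does.
 */
static unsigned int pick_resv_level(unsigned int cur, int option)
{
	assert(cur < MAX_RESV_LEVEL);

	if (option >= MIN_RESV_LEVEL && option < MAX_RESV_LEVEL)
		return (unsigned int)option;

	return cur;
}
```

Note that an out-of-range value is silently dropped rather than rejected, so
"resv_level=9" mounts with the default level.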


* [Ocfs2-devel] [PATCH 4/5] ocfs2: allocate btree internal block groups from the global bitmap
  2010-03-17  6:59 [Ocfs2-devel] [PATCH 0/5] Ocfs2 allocation reservations Mark Fasheh
                   ` (2 preceding siblings ...)
  2010-03-17  6:59 ` [Ocfs2-devel] [PATCH 3/5] ocfs2: use allocation reservations for directory data Mark Fasheh
@ 2010-03-17  6:59 ` Mark Fasheh
  2010-03-17  6:59 ` [Ocfs2-devel] [PATCH 5/5] ocfs2: remove ocfs2_local_alloc_in_range() Mark Fasheh
  2010-03-17 20:17 ` [Ocfs2-devel] [PATCH 0/5] Ocfs2 allocation reservations Mark Fasheh
  5 siblings, 0 replies; 16+ messages in thread
From: Mark Fasheh @ 2010-03-17  6:59 UTC (permalink / raw)
  To: ocfs2-devel

Otherwise, the need for a very large contiguous allocation tends to
wreak havoc on many inode allocation reservations on the local alloc, thus
ruining any chances for contiguousness.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/suballoc.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
index 7b76926..62c570c 100644
--- a/fs/ocfs2/suballoc.c
+++ b/fs/ocfs2/suballoc.c
@@ -659,7 +659,8 @@ int ocfs2_reserve_new_metadata_blocks(struct ocfs2_super *osb,
 
 	status = ocfs2_reserve_suballoc_bits(osb, (*ac),
 					     EXTENT_ALLOC_SYSTEM_INODE,
-					     slot, NULL, ALLOC_NEW_GROUP);
+					     slot, NULL,
+					     ALLOC_GROUPS_FROM_GLOBAL|ALLOC_NEW_GROUP);
 	if (status < 0) {
 		if (status != -ENOSPC)
 			mlog_errno(status);
@@ -1821,6 +1822,8 @@ int __ocfs2_claim_clusters(struct ocfs2_super *osb,
 	       && ac->ac_which != OCFS2_AC_USE_MAIN);
 
 	if (ac->ac_which == OCFS2_AC_USE_LOCAL) {
+		WARN_ON(min_clusters > 1);
+
 		status = ocfs2_claim_local_alloc_bits(osb,
 						      handle,
 						      ac,
-- 
1.6.4.2


* [Ocfs2-devel] [PATCH 5/5] ocfs2: remove ocfs2_local_alloc_in_range()
  2010-03-17  6:59 [Ocfs2-devel] [PATCH 0/5] Ocfs2 allocation reservations Mark Fasheh
                   ` (3 preceding siblings ...)
  2010-03-17  6:59 ` [Ocfs2-devel] [PATCH 4/5] ocfs2: allocate btree internal block groups from the global bitmap Mark Fasheh
@ 2010-03-17  6:59 ` Mark Fasheh
  2010-03-17 20:17 ` [Ocfs2-devel] [PATCH 0/5] Ocfs2 allocation reservations Mark Fasheh
  5 siblings, 0 replies; 16+ messages in thread
From: Mark Fasheh @ 2010-03-17  6:59 UTC (permalink / raw)
  To: ocfs2-devel

Inodes are always allocated from the global bitmap now so we don't need this
any more. Also, the existing implementation bounces reservations around
needlessly.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/localalloc.c |   51 -------------------------------------------------
 fs/ocfs2/suballoc.c   |    6 +----
 2 files changed, 1 insertions(+), 56 deletions(-)

diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c
index ebab3c0..6515ca7 100644
--- a/fs/ocfs2/localalloc.c
+++ b/fs/ocfs2/localalloc.c
@@ -484,46 +484,6 @@ out:
 	return status;
 }
 
-/* Check to see if the local alloc window is within ac->ac_max_block */
-static int ocfs2_local_alloc_in_range(struct inode *inode,
-				      struct ocfs2_alloc_context *ac,
-				      u32 bits_wanted)
-{
-	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
-	struct ocfs2_dinode *alloc;
-	struct ocfs2_local_alloc *la;
-	int start;
-	u64 block_off;
-
-	if (!ac->ac_max_block)
-		return 1;
-
-	alloc = (struct ocfs2_dinode *) osb->local_alloc_bh->b_data;
-	la = OCFS2_LOCAL_ALLOC(alloc);
-
-	start = ocfs2_local_alloc_find_clear_bits(osb, alloc, &bits_wanted, NULL);
-	if (start == -1) {
-		mlog_errno(-ENOSPC);
-		return 0;
-	}
-
-	/*
-	 * Converting (bm_off + start + bits_wanted) to blocks gives us
-	 * the blkno just past our actual allocation.  This is perfect
-	 * to compare with ac_max_block.
-	 */
-	block_off = ocfs2_clusters_to_blocks(inode->i_sb,
-					     le32_to_cpu(la->la_bm_off) +
-					     start + bits_wanted);
-	mlog(0, "Checking %llu against %llu\n",
-	     (unsigned long long)block_off,
-	     (unsigned long long)ac->ac_max_block);
-	if (block_off > ac->ac_max_block)
-		return 0;
-
-	return 1;
-}
-
 /*
  * make sure we've got at least bits_wanted contiguous bits in the
  * local alloc. You lose them when you drop i_mutex.
@@ -616,17 +576,6 @@ int ocfs2_reserve_local_alloc_bits(struct ocfs2_super *osb,
 		mlog(0, "Calling in_range for max block %llu\n",
 		     (unsigned long long)ac->ac_max_block);
 
-	if (!ocfs2_local_alloc_in_range(local_alloc_inode, ac,
-					bits_wanted)) {
-		/*
-		 * The window is outside ac->ac_max_block.
-		 * This errno tells the caller to keep localalloc enabled
-		 * but to get the allocation from the main bitmap.
-		 */
-		status = -EFBIG;
-		goto bail;
-	}
-
 	ac->ac_inode = local_alloc_inode;
 	/* We should never use localalloc from another slot */
 	ac->ac_alloc_slot = osb->slot_num;
diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
index 62c570c..0120a4b 100644
--- a/fs/ocfs2/suballoc.c
+++ b/fs/ocfs2/suballoc.c
@@ -861,11 +861,7 @@ static int ocfs2_reserve_clusters_with_limit(struct ocfs2_super *osb,
 		status = ocfs2_reserve_local_alloc_bits(osb,
 							bits_wanted,
 							*ac);
-		if (status == -EFBIG) {
-			/* The local alloc window is outside ac_max_block.
-			 * use the main bitmap. */
-			status = -ENOSPC;
-		} else if ((status < 0) && (status != -ENOSPC)) {
+		if ((status < 0) && (status != -ENOSPC)) {
 			mlog_errno(status);
 			goto bail;
 		}
-- 
1.6.4.2


* [Ocfs2-devel] [PATCH 0/5] Ocfs2 allocation reservations
  2010-03-17  6:59 [Ocfs2-devel] [PATCH 0/5] Ocfs2 allocation reservations Mark Fasheh
                   ` (4 preceding siblings ...)
  2010-03-17  6:59 ` [Ocfs2-devel] [PATCH 5/5] ocfs2: remove ocfs2_local_alloc_in_range() Mark Fasheh
@ 2010-03-17 20:17 ` Mark Fasheh
  5 siblings, 0 replies; 16+ messages in thread
From: Mark Fasheh @ 2010-03-17 20:17 UTC (permalink / raw)
  To: ocfs2-devel

On Tue, Mar 16, 2010 at 11:59:09PM -0700, Mark Fasheh wrote:
> Changes from the last patch set:
> - added a check for overlapping reservations in ocfs2_resv_insert()
> - cleaned up the comments in ocfs2_cannibalize_resv()
> - the check for reservation past bitmap end in ocfs2_check_resmap() is more
>   strict now.
> - removed the unused m_search_start member of ocfs2_reservation_map
> - optimized __ocfs2_resv_find_window() to ignore regions that are too small
>   for the current alloc
> - major cleanup of ocfs2_resmap_claimed_bits()
> - added a set of BUG_ON's in ocfs2_resmap_claimed_bits() to check that
>   the passed allocation range is within the window.
> - fixed ocfs2_local_alloc_find_clear_bits() to return actual bits allocated
> - add a check for a null data_ac in ocfs2_write_begin_nolock()
> 
> I also added a fifth patch, "ocfs2: remove ocfs2_local_alloc_in_range()".
> I could spin this as its own patch to go upstream earlier if we want.

Attached is a very small follow-up patch to turn off some of the expensive
debug checks I placed in the code.
	--Mark

From: Mark Fasheh <mfasheh@suse.com>

[PATCH] ocfs2: turn off expensive reservations debugging by default

They can still be turned back on by enabling the "expensive checks" config
option for Ocfs2.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/reservations.c |    3 ---
 1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/fs/ocfs2/reservations.c b/fs/ocfs2/reservations.c
index ecffb1c..97f8b25 100644
--- a/fs/ocfs2/reservations.c
+++ b/fs/ocfs2/reservations.c
@@ -41,9 +41,6 @@
 #define OCFS2_CHECK_RESERVATIONS
 #endif
 
-#define OCFS2_CHECK_RESERVATIONS
-
-
 DEFINE_SPINLOCK(resv_lock);
 
 #define	OCFS2_MIN_RESV_WINDOW_BITS	8
-- 
1.6.4.2
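
The pattern this follow-up relies on is ordinary compile-time gating: the
consistency checks stay in the source but compile to nothing unless a config
symbol is set. A rough userspace sketch is below. DEBUG_EXPENSIVE_CHECKS,
CHECK_RESERVATIONS, struct window and check_windows() are all hypothetical
names chosen for illustration; the actual Kconfig symbol is not visible in
the hunk above.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for the config option; build with
 * -DDEBUG_EXPENSIVE_CHECKS to turn the checks on.
 */
#ifdef DEBUG_EXPENSIVE_CHECKS
#define CHECK_RESERVATIONS
#endif

struct window {
	unsigned int start;
	unsigned int len;
};

/*
 * Walk an array of windows sorted by start and verify that each one is
 * non-empty, fits in the bitmap and does not overlap its predecessor.
 * Returns the number of windows verified, or 0 when the checks are
 * compiled out.
 */
static unsigned int check_windows(const struct window *w, unsigned int n,
				  unsigned int bitmap_len)
{
#ifdef CHECK_RESERVATIONS
	unsigned int i;

	for (i = 0; i < n; i++) {
		assert(w[i].len > 0);
		assert(w[i].start + w[i].len <= bitmap_len);
		if (i > 0)
			assert(w[i - 1].start + w[i - 1].len <= w[i].start);
	}
	return n;
#else
	(void)w;
	(void)n;
	(void)bitmap_len;
	return 0;
#endif
}
```

In a default build the call costs nothing beyond the function call itself,
which is the point of dropping the unconditional #define.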


* [Ocfs2-devel] [PATCH 1/5] ocfs2: allocation reservations
  2010-03-17  6:59 ` [Ocfs2-devel] [PATCH 1/5] ocfs2: " Mark Fasheh
@ 2010-03-19 22:40   ` Joel Becker
  2010-03-19 23:56     ` Mark Fasheh
  0 siblings, 1 reply; 16+ messages in thread
From: Joel Becker @ 2010-03-19 22:40 UTC (permalink / raw)
  To: ocfs2-devel

On Tue, Mar 16, 2010 at 11:59:10PM -0700, Mark Fasheh wrote:
> Reservation windows are represented internally by a red-black
> tree. Within that tree, each node represents the reservation window of
> one inode. An LRU of active reservations is also maintained. When new

	I see you decided against chopping up the localalloc and went
for reservations as needed.

> +int ocfs2_resmap_init(struct ocfs2_super *osb,
> +		      struct ocfs2_reservation_map *resmap)
> +{
> +	memset(resmap, 0, sizeof(*resmap));
> +
> +	resmap->m_osb = osb;
> +	resmap->m_reservations = RB_ROOT;
> +	/* m_bitmap_len is initialized to zero by the above memset. */
> +	INIT_LIST_HEAD(&resmap->m_lru);
> +
> +	return 0;
> +}

	Why have m_osb and set it?  Can't you get the osb from
container_of(resmap, struct ocfs2_super, osb_la_resmap)?
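
For anyone unfamiliar with the idiom being suggested here, a userspace sketch
of recovering the parent from an embedded member follows. The struct names
and resmap_to_super() are hypothetical stand-ins for struct ocfs2_super and
its osb_la_resmap member, and this container_of() is a simplified version
without the kernel macro's type checking.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace container_of(): no type checking of ptr. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for struct ocfs2_reservation_map. */
struct resmap {
	unsigned int bitmap_len;
};

/* Hypothetical stand-in for struct ocfs2_super. */
struct super {
	int		slot_num;
	struct resmap	la_resmap;	/* embedded, like osb_la_resmap */
};

/* Recover the enclosing super from a pointer to its embedded map. */
static struct super *resmap_to_super(struct resmap *map)
{
	assert(map != NULL);
	return container_of(map, struct super, la_resmap);
}
```

This only works while every reservation map is embedded in the same member
of the same parent type.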

> +int ocfs2_resmap_resv_bits(struct ocfs2_reservation_map *resmap,
> +			   struct ocfs2_alloc_reservation *resv,
> +			   int tmpwindow, int *cstart, int *clen)

	The tmpwindow argument to resmap_resv_bits()...what you really
want to say is "I'm not interested in hanging on to reserved space.
Please behave like the original localalloc".  What it does is correct.
I wish I had a better way to say it.
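
To make the effect of tmpwindow concrete, the sizing step at the top of
ocfs2_resmap_resv_bits() can be reduced to the sketch below. wanted_bits()
is a hypothetical helper written for illustration; window_bits stands in for
the value returned by ocfs2_resv_window_bits().

```c
#include <assert.h>

/*
 * Hypothetical helper mirroring the sizing step in
 * ocfs2_resmap_resv_bits(): how many bits should the window search
 * ask for?
 */
static unsigned int wanted_bits(unsigned int window_bits,
				unsigned int request, int tmpwindow)
{
	unsigned int wanted = window_bits;

	assert(request > 0);

	/*
	 * Temporary windows, and requests larger than a normal window,
	 * get exactly what was asked for and nothing more.
	 */
	if (tmpwindow || wanted < request)
		wanted = request;

	return wanted;
}
```

So with a 32-bit window size, a 4-cluster request reserves 32 bits normally
but only 4 when tmpwindow is set, which is what makes it behave like the
original localalloc.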

> --- /dev/null
> +++ b/fs/ocfs2/reservations.h
> @@ -0,0 +1,154 @@
> +/* -*- mode: c; c-basic-offset: 8; -*-
> + * vim: noexpandtab sw=8 ts=8 sts=0:
> + *
> + * reservations.h
> + *
> + * Allocation reservations function prototypes and structures.
> + *
> + * Copyright (C) 2009 Novell.  All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public
> + * License as published by the Free Software Foundation; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; if not, write to the
> + * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
> + * Boston, MA 02111-1307, USA.

	Don't add the paragraph about Temple Place to new files ;-)

Joel

-- 

"Always give your best, never get discouraged, never be petty; always
 remember, others may hate you.  Those who hate you don't win unless
 you hate them.  And then you destroy yourself."
	- Richard M. Nixon

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127


* [Ocfs2-devel] [PATCH 3/5] ocfs2: use allocation reservations for directory data
  2010-03-17  6:59 ` [Ocfs2-devel] [PATCH 3/5] ocfs2: use allocation reservations for directory data Mark Fasheh
@ 2010-03-19 22:43   ` Joel Becker
  2010-03-20  0:14     ` Mark Fasheh
  0 siblings, 1 reply; 16+ messages in thread
From: Joel Becker @ 2010-03-19 22:43 UTC (permalink / raw)
  To: ocfs2-devel

On Tue, Mar 16, 2010 at 11:59:12PM -0700, Mark Fasheh wrote:
> Use the reservations system for unindexed dir tree allocations. We
> don't bother with the indexed tree as reads from it are mostly random
> anyway. By default this behavior is turned off and can be turned on
> via mount option 'dir_resv'. Workloads which create many large
> directories will want to turn it on.
> 
> A future improvement will be to use a different window size for
> directory inodes. Once done, we should change the default for this
> feature to be enabled.

	Why can't we turn this on and avoid a mount option?  Sure, the
reservation holds out some space, but it will be cannibalized if
the directory doesn't grow, right?

Joel

-- 

"I inject pure kryptonite into my brain.
 It improves my kung fu, and it eases the pain."




* [Ocfs2-devel] [PATCH 1/5] ocfs2: allocation reservations
  2010-03-19 22:40   ` Joel Becker
@ 2010-03-19 23:56     ` Mark Fasheh
  0 siblings, 0 replies; 16+ messages in thread
From: Mark Fasheh @ 2010-03-19 23:56 UTC (permalink / raw)
  To: ocfs2-devel

On Fri, Mar 19, 2010 at 03:40:47PM -0700, Joel Becker wrote:
> On Tue, Mar 16, 2010 at 11:59:10PM -0700, Mark Fasheh wrote:
> > Reservation windows are represented internally by a red-black
> > tree. Within that tree, each node represents the reservation window of
> > one inode. An LRU of active reservations is also maintained. When new
> 
> 	I see you decided against chopping up the localalloc and went
> for reservations as needed.

Yeah, in the end it wound up making the most sense for our usage. In reality
though, the bitmap is being "virtually" chopped up.
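
To illustrate the "virtual" chopping, here is a minimal userspace sketch of a reservation map that keeps windows ordered by start bit and refuses overlapping inserts. It is not the patch's code: the patch uses a red-black tree, while this sketch assumes a small sorted array, and every demo_* name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_RESV 16	/* arbitrary capacity for the sketch */

struct demo_window {
	unsigned int start;	/* first bit of the window */
	unsigned int len;	/* number of bits in the window */
};

struct demo_resmap {
	struct demo_window win[DEMO_MAX_RESV];	/* kept sorted by start */
	size_t count;
};

/*
 * Insert [start, start + len) while keeping the array sorted.
 * Returns 0 on success, -1 if the map is full or the window
 * would overlap an existing one.
 */
static int demo_resv_insert(struct demo_resmap *map,
			    unsigned int start, unsigned int len)
{
	size_t pos = 0;

	assert(len > 0);
	if (map->count == DEMO_MAX_RESV)
		return -1;

	/* Find the first window that does not start before us. */
	while (pos < map->count && map->win[pos].start < start)
		pos++;

	/* Does the predecessor run into our start? */
	if (pos > 0 &&
	    map->win[pos - 1].start + map->win[pos - 1].len > start)
		return -1;

	/* Do we run into the successor's start? */
	if (pos < map->count && start + len > map->win[pos].start)
		return -1;

	memmove(&map->win[pos + 1], &map->win[pos],
		(map->count - pos) * sizeof(map->win[0]));
	map->win[pos].start = start;
	map->win[pos].len = len;
	map->count++;
	return 0;
}
```

Each inode's window is a disjoint slice of the one shared bitmap, so the bitmap is never physically split; only the bookkeeping is.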


> > +int ocfs2_resmap_init(struct ocfs2_super *osb,
> > +		      struct ocfs2_reservation_map *resmap)
> > +{
> > +	memset(resmap, 0, sizeof(*resmap));
> > +
> > +	resmap->m_osb = osb;
> > +	resmap->m_reservations = RB_ROOT;
> > +	/* m_bitmap_len is initialized to zero by the above memset. */
> > +	INIT_LIST_HEAD(&resmap->m_lru);
> > +
> > +	return 0;
> > +}
> 
> 	Why have m_osb and set it?  Can't you get the osb from
> container_of(resmap, struct ocfs2_super, osb_la_resmap)?

I could, but I have m_osb there to divorce it from any particular parent
structure. That way a future version to use this on say, block groups
wouldn't need any changes in reservations.[ch].
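
The trade-off between the two approaches can be sketched in userspace C. This is only an illustration under assumptions: the demo_* structures are hypothetical stand-ins for ocfs2_super and ocfs2_reservation_map, and container_of() is re-implemented here with offsetof().

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct demo_super;			/* stand-in for ocfs2_super */

struct demo_resmap {
	struct demo_super *m_osb;	/* explicit back-pointer */
	unsigned int m_bitmap_len;
};

struct demo_super {
	int id;
	struct demo_resmap osb_la_resmap;	/* map embedded in the super */
};

/* A second, hypothetical owner that also embeds a map. */
struct demo_block_group {
	struct demo_super *bg_osb;
	struct demo_resmap bg_resmap;
};

/* Only valid when the map is the osb_la_resmap member of a demo_super. */
static struct demo_super *osb_via_container_of(struct demo_resmap *resmap)
{
	return container_of(resmap, struct demo_super, osb_la_resmap);
}

/* Valid for any owner, at the cost of one pointer per map. */
static struct demo_super *osb_via_backpointer(struct demo_resmap *resmap)
{
	return resmap->m_osb;
}

static void demo_resmap_init(struct demo_super *osb,
			     struct demo_resmap *resmap)
{
	assert(osb != NULL);
	resmap->m_osb = osb;
	resmap->m_bitmap_len = 0;
}
```

With the back-pointer, a map embedded in demo_block_group resolves its super the same way as one embedded in demo_super; the container_of() variant would compute a bogus address there.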


> > +int ocfs2_resmap_resv_bits(struct ocfs2_reservation_map *resmap,
> > +			   struct ocfs2_alloc_reservation *resv,
> > +			   int tmpwindow, int *cstart, int *clen)
> 
> 	The tmpwindow argument to resmap_resv_bits()...what you really
> want to say is "I'm not interested in hanging on to reserved space.
> Please behave like the original localalloc".  What it does is correct.
> I wish I had a better way to say it.

Essentially that's what happens, but with the
benefits of allocating the very small window in the right places, etc.
Otherwise we'd be searching the bitmap without tracking other reservations,
etc.


> > --- /dev/null
> > +++ b/fs/ocfs2/reservations.h
> > @@ -0,0 +1,154 @@
> > +/* -*- mode: c; c-basic-offset: 8; -*-
> > + * vim: noexpandtab sw=8 ts=8 sts=0:
> > + *
> > + * reservations.h
> > + *
> > + * Allocation reservations function prototypes and structures.
> > + *
> > + * Copyright (C) 2009 Novell.  All rights reserved.
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU General Public
> > + * License as published by the Free Software Foundation; either
> > + * version 2 of the License, or (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > + * General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public
> > + * License along with this program; if not, write to the
> > + * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
> > + * Boston, MA 021110-1307, USA.
> 
> 	Don't add the paragraph about Temple Place to new files ;-)

Doh! I thought I had already done that in the last round. I'll fix it up.
	--Mark

--
Mark Fasheh


* [Ocfs2-devel] [PATCH 3/5] ocfs2: use allocation reservations for directory data
  2010-03-19 22:43   ` Joel Becker
@ 2010-03-20  0:14     ` Mark Fasheh
  2010-03-20  1:25       ` Joel Becker
  0 siblings, 1 reply; 16+ messages in thread
From: Mark Fasheh @ 2010-03-20  0:14 UTC (permalink / raw)
  To: ocfs2-devel

On Fri, Mar 19, 2010 at 03:43:10PM -0700, Joel Becker wrote:
> On Tue, Mar 16, 2010 at 11:59:12PM -0700, Mark Fasheh wrote:
> > Use the reservations system for unindexed dir tree allocations. We
> > don't bother with the indexed tree as reads from it are mostly random
> > anyway. By default this behavior is turned off and can be turned on
> > via mount option 'dir_resv'. Workloads which create many large
> > directories will want to turn it on.
> > 
> > A future improvement will be to use a different window size for
> > directory inodes. Once done, we should change the default for this
> > feature to be enabled.
> 
> 	Why can't we turn this on and avoid a mount option?  Sure, the
> reservation holds out some space, but it will be cannibalized if
> the directory doesn't grow, right?

Yeah, it'll work. My concern was that we'd be cannibalizing those too much
since the window sizes are optimized for file data. It's only a hunch though
that directories might want smaller windows - I didn't get much data on
directory growth since my focus was on file data.


Maybe I should just go ahead and hack it up to know if a reservation is for
a directory, so it could provide a smaller number of bits (maybe half the
default)? Adjusting the window size values is something we can do very
easily until we get it right.

Adding a flag field on the resv struct could accomplish this pretty easily,
and I could re-use the flag for the 'temporary' reservations, thus losing a
function call parameter in the process.
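
A rough sketch of what such a flags field might look like, in userspace C. The flag names, the struct layout and the helper functions are all assumptions made for illustration; they are not taken from the patch.

```c
#include <assert.h>

/* Hypothetical flag names; the real code may choose different ones. */
#define DEMO_RESV_FLAG_TMP	0x01	/* do not keep the window after use */
#define DEMO_RESV_FLAG_DIR	0x02	/* reservation belongs to a directory */
#define DEMO_RESV_FLAG_MASK	(DEMO_RESV_FLAG_TMP | DEMO_RESV_FLAG_DIR)

struct demo_resv {
	unsigned int r_flags;
	unsigned int r_start;
	unsigned int r_len;	/* 0 means no window is currently held */
};

static void demo_resv_init(struct demo_resv *resv, unsigned int flags)
{
	/* Reject unknown bits, in the spirit of a BUG_ON(). */
	assert((flags & ~DEMO_RESV_FLAG_MASK) == 0);
	resv->r_flags = flags;
	resv->r_start = 0;
	resv->r_len = 0;
}

/* Replaces a separate 'tmpwindow' argument at every call site. */
static int demo_resv_is_tmp(const struct demo_resv *resv)
{
	return (resv->r_flags & DEMO_RESV_FLAG_TMP) != 0;
}

static int demo_resv_is_dir(const struct demo_resv *resv)
{
	return (resv->r_flags & DEMO_RESV_FLAG_DIR) != 0;
}

/* After an allocation, a temporary reservation gives its window back. */
static void demo_resv_alloc_done(struct demo_resv *resv)
{
	if (demo_resv_is_tmp(resv))
		resv->r_len = 0;
}
```

The caller sets the flags once at init time, so the allocation path no longer needs to thread an extra int through each call.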


I agree though that losing the extra mount option would be good.
	--Mark

--
Mark Fasheh


* [Ocfs2-devel] [PATCH 3/5] ocfs2: use allocation reservations for directory data
  2010-03-20  0:14     ` Mark Fasheh
@ 2010-03-20  1:25       ` Joel Becker
  2010-03-20  3:47         ` Mark Fasheh
  0 siblings, 1 reply; 16+ messages in thread
From: Joel Becker @ 2010-03-20  1:25 UTC (permalink / raw)
  To: ocfs2-devel

On Fri, Mar 19, 2010 at 05:14:25PM -0700, Mark Fasheh wrote:
> Yeah, it'll work. My concern was that we'd be cannibalizing those too much
> since the window sizes are optimized for file data. It's only a hunch though
> that directories might want smaller windows - I didn't get much data on
> directory growth since my focus was on file data.

	I expect that directories won't use their entire reservation.
I'm just not sure it matters.  Half of the files we create won't either.
If we actually are doing a lot of creating (untar, etc), we'll
eventually cannibalize the reservation attached to the directory anyway.
So I don't know why directories have to have smaller reservations.  Just
let the expire/cannibalize code handle it.
	I'd be interested to see how an untar of a kernel tree or any
long-running workload that is helped by reservations changes when a
directory is reserving the same space as a file.

Joel

-- 

"War doesn't determine who's right; war determines who's left."



* [Ocfs2-devel] [PATCH 3/5] ocfs2: use allocation reservations for directory data
  2010-03-20  1:25       ` Joel Becker
@ 2010-03-20  3:47         ` Mark Fasheh
  2010-03-20  6:18           ` Joel Becker
  0 siblings, 1 reply; 16+ messages in thread
From: Mark Fasheh @ 2010-03-20  3:47 UTC (permalink / raw)
  To: ocfs2-devel

On Fri, Mar 19, 2010 at 06:25:54PM -0700, Joel Becker wrote:
> On Fri, Mar 19, 2010 at 05:14:25PM -0700, Mark Fasheh wrote:
> > Yeah, it'll work. My concern was that we'd be cannibalizing those too much
> > since the window sizes are optimized for file data. It's only a hunch though
> > that directories might want smaller windows - I didn't get much data on
> > directory growth since my focus was on file data.
> 
> 	I expect that directories won't use their entire reservation.
> I'm just not sure it matters.  Half of the files we create won't either.
> If we actually are doing a lot of creating (untar, etc), we'll
> eventually cannibalize the reservation attached to the directory anyway.
> So I don't know why directories have to have smaller reservations.  Just
> let the expire/cannibalize code handle it.

Cannibalizing a reservation is a bit less optimal because you're taking
any region at that point, as opposed to trying very hard to be at least
after the previous window. Also there's the shrinking logic (though in the
last couple days I've been thinking of getting rid of region shrinking).
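
To make the "taking any region" point concrete, here is a small userspace sketch of cannibalization. It assumes an array scan with a last-used counter in place of the LRU list the patch maintains, and all demo_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct demo_resv {
	unsigned int r_start;
	unsigned int r_len;		/* 0 means no window is held */
	unsigned long r_last_used;	/* smaller value = used longer ago */
};

/* Least recently used reservation that still holds a window, or NULL. */
static struct demo_resv *demo_lru_victim(struct demo_resv *resvs, size_t n,
					 const struct demo_resv *self)
{
	struct demo_resv *victim = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		struct demo_resv *r = &resvs[i];

		if (r == self || r->r_len == 0)
			continue;
		if (!victim || r->r_last_used < victim->r_last_used)
			victim = r;
	}
	return victim;
}

/*
 * Take over the victim's whole window. Returns 1 on success and 0 if
 * no other reservation holds a window.
 */
static int demo_cannibalize(struct demo_resv *resvs, size_t n,
			    struct demo_resv *self, unsigned long now)
{
	struct demo_resv *victim;

	assert(self->r_len == 0);
	victim = demo_lru_victim(resvs, n, self);
	if (!victim)
		return 0;

	self->r_start = victim->r_start;
	self->r_len = victim->r_len;
	self->r_last_used = now;
	victim->r_len = 0;
	return 1;
}
```

Note that the new window lands wherever the victim's window happened to be; nothing in this path looks at where the taker's previous window ended.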

Anyway, I'm with you regarding what the proper parameters for directory
reservations are. The fact is, I don't really know one way or the other. A
mount option seemed to at least give the user an 'out' if things go bad.


> 	I'd be interested to see how an untar of a kernel tree or any
> long-running workload that is helped by reservations changes when a
> directory is reserving the same space as a file.

Well, we could gather up some disk images to get us that information. The
patches as they are now make it easy to turn directory reservations on and
off, so running some tests doesn't even need a reboot :)

I'm not sure that untar of a kernel tree will give us anything interesting.
A parallel build would work... We could check a few arbitrary object files
to see where they're at. Those tend to skew bigger.
	--Mark

--
Mark Fasheh


* [Ocfs2-devel] [PATCH 3/5] ocfs2: use allocation reservations for directory data
  2010-03-20  3:47         ` Mark Fasheh
@ 2010-03-20  6:18           ` Joel Becker
  2010-03-21 23:26             ` Mark Fasheh
  0 siblings, 1 reply; 16+ messages in thread
From: Joel Becker @ 2010-03-20  6:18 UTC (permalink / raw)
  To: ocfs2-devel

On Fri, Mar 19, 2010 at 08:47:58PM -0700, Mark Fasheh wrote:
> Anyway, I'm with you regarding what the proper parameters for directory
> reservations are. The fact is, I don't really know one way or the other. A
> mount option seemed to at least give the user an 'out' if things go bad.

	Yeah, I hear you.  I just figure we're stuck with the option for
the future.

> I'm not sure that untar of a kernel tree will give us anything interesting.
> A parallel build would work... We could check a few arbitrary object files
> to see where they're at. Those tend to skew bigger.

	I was thinking about a kernel tree untar where the directories
are holding on to reservations as thousands of files are created under
them.  Wouldn't that lead to cannibalization and show us that pattern?
This is just a lay guess - you have a lot more familiarity with the
code.

Joel

-- 

"In a crisis, don't hide behind anything or anybody. They're going
 to find you anyway."
	- Paul "Bear" Bryant



* [Ocfs2-devel] [PATCH 3/5] ocfs2: use allocation reservations for directory data
  2010-03-20  6:18           ` Joel Becker
@ 2010-03-21 23:26             ` Mark Fasheh
  2010-03-22 20:41               ` Joel Becker
  0 siblings, 1 reply; 16+ messages in thread
From: Mark Fasheh @ 2010-03-21 23:26 UTC (permalink / raw)
  To: ocfs2-devel

On Fri, Mar 19, 2010 at 11:18:48PM -0700, Joel Becker wrote:
> On Fri, Mar 19, 2010 at 08:47:58PM -0700, Mark Fasheh wrote:
> > Anyway, I'm with you regarding what the proper parameters for directory
> > reservations are. The fact is, I don't really know one way or the other. A
> > mount option seemed to at least give the user an 'out' if things go bad.
> 
> 	Yeah, I hear you.  I just figure we're stuck with the option for
> the future.

Updated patches are up in git:

git pull git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2-mark.git reservations

Or: http://git.kernel.org/?p=linux/kernel/git/mfasheh/ocfs2-mark.git;a=shortlog;h=refs/heads/reservations


Aside from cleaning up the license info, I removed the dir_resv/no_dir_resv
options. I also added a flags field to struct ocfs2_alloc_reservation which
saves us the 'tmpwindow' argument to ocfs2_resmap_resv_bits().


> > I'm not sure that untar of a kernel tree will give us anything interesting.
> > A parallel build would work... We could check a few arbitrary object files
> > to see where they're at. Those tend to skew bigger.
> 
> 	I was thinking about a kernel tree untar where the directories
> are holding on to reservations as thousands of files are created under
> them.  Wouldn't that lead to cannibalization and show us that pattern?
> This is just a lay guess - you have a lot more familiarity with the
> code.

I don't know how much a kernel build is going to make a difference. I took a
couple images after kernel builds with various options but didn't see
anything obvious. To be fair though, I only checked one or two files. I'll
upload the images somewhere shortly.

As an interim compromise, I changed the code to get minimum (8 bits) sized
windows on directories. That way, they'll get some amount of
contiguousness, but not as much as file data. We can easily adjust in any
direction we want.
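
Roughly, the sizing policy could look like the sketch below. Only the 8-bit minimum comes from the description above; the cap, the 'wanted' parameter and the function name are assumptions for illustration.

```c
#include <assert.h>

#define DEMO_MIN_WINDOW_BITS	8	/* the minimum mentioned above */
#define DEMO_MAX_WINDOW_BITS	32	/* assumed cap, illustration only */

/*
 * Pick a window size in bits. Directories always get the minimum;
 * file data gets what it asked for, clamped to [min, max].
 */
static unsigned int demo_window_bits(int is_dir, unsigned int wanted)
{
	assert(DEMO_MIN_WINDOW_BITS <= DEMO_MAX_WINDOW_BITS);

	if (is_dir)
		return DEMO_MIN_WINDOW_BITS;
	if (wanted < DEMO_MIN_WINDOW_BITS)
		return DEMO_MIN_WINDOW_BITS;
	if (wanted > DEMO_MAX_WINDOW_BITS)
		return DEMO_MAX_WINDOW_BITS;
	return wanted;
}
```

Tuning the directory behavior then means changing one constant or one branch, which is what makes it easy to adjust in either direction.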
	--Mark

--
Mark Fasheh


* [Ocfs2-devel] [PATCH 3/5] ocfs2: use allocation reservations for directory data
  2010-03-21 23:26             ` Mark Fasheh
@ 2010-03-22 20:41               ` Joel Becker
  0 siblings, 0 replies; 16+ messages in thread
From: Joel Becker @ 2010-03-22 20:41 UTC (permalink / raw)
  To: ocfs2-devel

On Sun, Mar 21, 2010 at 04:26:23PM -0700, Mark Fasheh wrote:
> On Fri, Mar 19, 2010 at 11:18:48PM -0700, Joel Becker wrote:

> Aside from cleaning up the license info, I removed the dir_resv/no_dir_resv
> options. I also added a flags field to struct ocfs2_alloc_reservation which
> saves us the 'tmpwindow' argument to ocfs2_resmap_resv_bits().

	I like the changes.

> > 	I was thinking about a kernel tree untar where the directories
> > are holding on to reservations as thousands of files are created under
> > them.  Wouldn't that lead to cannibalization and show us that pattern?
> > This is just a lay guess - you have a lot more familiarity with the
> > code.
> 
> I don't know how much a kernel build is going to make a difference. I took a
> couple images after kernel builds with various options but didn't see
> anything obvious. To be fair though, I only checked one or two files. I'll
> upload the images somewhere shortly.
> 
> As an interim compromise, I changed the code to get minimum (8 bits) sized
> windows on directories. That way, they'll get some amount of
> contiguousness, but not as much as file data. We can easily adjust in any
> direction we want.

	I wonder how the varying reservation sizes will impact
utilization of the localalloc.  May not matter, but there's a part of me
that wonders.  Does the new dir-resv code do as well on your tests as
the no-dir-resv code?
	Maybe something to see any help from dir reservations would be a
many-parallel-untar sort of thing?  Unpack 20 kernel trees at the same
time?  Some other package with a lot of files.  I have no idea if this
will mean anything, and I actually rather trust your observations of
previous untars.  I'm just casting about for something to show us how
dir reservations are behaving.

Joel

-- 

Life's Little Instruction Book #451

	"Don't be afraid to say, 'I'm sorry.'"



Thread overview: 16+ messages
2010-03-17  6:59 [Ocfs2-devel] [PATCH 0/5] Ocfs2 allocation reservations Mark Fasheh
2010-03-17  6:59 ` [Ocfs2-devel] [PATCH 1/5] ocfs2: " Mark Fasheh
2010-03-19 22:40   ` Joel Becker
2010-03-19 23:56     ` Mark Fasheh
2010-03-17  6:59 ` [Ocfs2-devel] [PATCH 2/5] ocfs2: use allocation reservations during file write Mark Fasheh
2010-03-17  6:59 ` [Ocfs2-devel] [PATCH 3/5] ocfs2: use allocation reservations for directory data Mark Fasheh
2010-03-19 22:43   ` Joel Becker
2010-03-20  0:14     ` Mark Fasheh
2010-03-20  1:25       ` Joel Becker
2010-03-20  3:47         ` Mark Fasheh
2010-03-20  6:18           ` Joel Becker
2010-03-21 23:26             ` Mark Fasheh
2010-03-22 20:41               ` Joel Becker
2010-03-17  6:59 ` [Ocfs2-devel] [PATCH 4/5] ocfs2: allocate btree internal block groups from the global bitmap Mark Fasheh
2010-03-17  6:59 ` [Ocfs2-devel] [PATCH 5/5] ocfs2: remove ocfs2_local_alloc_in_range() Mark Fasheh
2010-03-17 20:17 ` [Ocfs2-devel] [PATCH 0/5] Ocfs2 allocation reservations Mark Fasheh
