* [PATCH 0 of 8] O_DIRECT locking rework v5
From: Chris Mason @ 2006-12-22 1:45 UTC
To: linux-fsdevel, akpm, zach.brown
[ resend, sorry for the messed up headers ]
I took a small detour on the O_DIRECT locking rework to look at
different alternatives for range locking in the pagecache. After
benchmarking a few different types of trees, I didn't find anything that
would match the radix tree for random lookup performance.
This patchset does ranges in the radix tree by inserting a placeholder
at the last slot in the range and forcing all lookups to search forward.
It means radix_tree_gang_lookup must be used instead of
radix_tree_lookup, but this is still faster for random searches than
anything else I tried.
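As a rough sketch, this is what the forward search looks like (the
function name here is illustrative; PagePlaceHolder and the encoding,
with the start of the range kept in page->flags and the placeholder
filed at the slot of its last index, come from patch 1):

    struct page *lookup_one(struct radix_tree_root *root, unsigned long index)
    {
            struct page *p;

            if (!radix_tree_gang_lookup(root, (void **)&p, index, 1))
                    return NULL;
            if (PagePlaceHolder(p))      /* range is [p->flags, p->index] */
                    return (p->flags <= index && index <= p->index) ? p : NULL;
            return p->index == index ? p : NULL;
    }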
A bit is set on the radix root node to indicate if range searching is
required. So when O_DIRECT isn't used, or is only used for tiny
IOs, no range lookups are done.
With O_DIRECT in use, only a single placeholder is inserted
to lock down the entire range for a given IO. This should
stand up pretty well for those monster XFS workloads.
If the mapping has pages on it, I do one placeholder for every 64k
(16 page slots with 4k pages) to limit the number of pages pinned
down during the IO. There's lots of hand waving here; it may be best
to get rid of this special case.
Patches are against Linus' git from today. Testing has been light;
I'm mostly looking for comments on the range locking tricks.
-chris
* [PATCH 1 of 8] Introduce a place holder page for the pagecache
From: Chris Mason @ 2006-12-22 1:52 UTC
To: linux-fsdevel, akpm, zach.brown
mm/filemap.c is changed to wait on any placeholder pages covering an
index before adding a page into the page cache, and truncates are
changed to wait for all of the place holder pages to disappear.
Place holder pages can only be examined with the mapping's tree_lock held. They
cannot be locked, and cannot have references increased or decreased on them.
Placeholders can span a range bigger than one page. The placeholder is
inserted into the radix slot for the end of the range, and the flags field in
the page struct is used to record the start of the range.
A tag is added for the radix root (PAGE_CACHE_TAG_EXTENTS), and when
mm/filemap.c finds that tag set, lookups for an index in the pagecache
search forward to find any placeholder that index may intersect.
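For example, with illustrative offsets (this assumes, per the
description above, that set_page_placeholder(page, start, end) records
the start in page->flags and files the page at the slot for end):

    /* a placeholder spanning page offsets 10..19 sits in slot 19 only */
    set_page_placeholder(page, 10, 19);     /* flags = 10, index = 19 */

    /*
     * radix_tree_lookup() at offset 14 would miss it; the forward gang
     * lookup from 14 returns the slot-19 placeholder, and the check
     * 10 <= 14 <= 19 shows the intersection.
     */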
Signed-off-by: Chris Mason <chris.mason@oracle.com>
diff -r 511f067627ac -r 4cac7e560b53 drivers/mtd/devices/block2mtd.c
--- a/drivers/mtd/devices/block2mtd.c Thu Dec 21 00:20:01 2006 -0800
+++ b/drivers/mtd/devices/block2mtd.c Thu Dec 21 15:31:30 2006 -0500
@@ -66,7 +66,7 @@ static void cache_readahead(struct addre
INFO("Overrun end of disk in cache readahead\n");
break;
}
- page = radix_tree_lookup(&mapping->page_tree, pagei);
+ page = radix_tree_lookup_extent(&mapping->page_tree, pagei);
if (page && (!i))
break;
if (page)
diff -r 511f067627ac -r 4cac7e560b53 include/linux/fs.h
--- a/include/linux/fs.h Thu Dec 21 00:20:01 2006 -0800
+++ b/include/linux/fs.h Thu Dec 21 15:31:30 2006 -0500
@@ -489,6 +489,11 @@ struct block_device {
*/
#define PAGECACHE_TAG_DIRTY 0
#define PAGECACHE_TAG_WRITEBACK 1
+
+/*
+ * This tag is only valid on the root of the radix tree
+ */
+#define PAGE_CACHE_TAG_EXTENTS 2
int mapping_tagged(struct address_space *mapping, int tag);
diff -r 511f067627ac -r 4cac7e560b53 include/linux/page-flags.h
--- a/include/linux/page-flags.h Thu Dec 21 00:20:01 2006 -0800
+++ b/include/linux/page-flags.h Thu Dec 21 15:31:30 2006 -0500
@@ -267,4 +267,6 @@ static inline void set_page_writeback(st
test_set_page_writeback(page);
}
+void set_page_placeholder(struct page *page, pgoff_t start, pgoff_t end);
+
#endif /* PAGE_FLAGS_H */
diff -r 511f067627ac -r 4cac7e560b53 include/linux/pagemap.h
--- a/include/linux/pagemap.h Thu Dec 21 00:20:01 2006 -0800
+++ b/include/linux/pagemap.h Thu Dec 21 15:31:30 2006 -0500
@@ -76,6 +76,11 @@ extern struct page * find_get_page(struc
unsigned long index);
extern struct page * find_lock_page(struct address_space *mapping,
unsigned long index);
+int find_or_insert_placeholders(struct address_space *mapping,
+ struct page **pages, unsigned long start,
+ unsigned long end, unsigned long nr,
+ gfp_t gfp_mask,
+ int wait);
extern __deprecated_for_modules struct page * find_trylock_page(
struct address_space *mapping, unsigned long index);
extern struct page * find_or_create_page(struct address_space *mapping,
@@ -86,6 +91,15 @@ unsigned find_get_pages_contig(struct ad
unsigned int nr_pages, struct page **pages);
unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
int tag, unsigned int nr_pages, struct page **pages);
+void remove_placeholder_pages(struct address_space *mapping,
+ struct page **pages,
+ unsigned long offset,
+ unsigned long end,
+ unsigned long nr);
+void wake_up_placeholder_page(struct page *page);
+void wait_on_placeholder_pages_range(struct address_space *mapping, pgoff_t start,
+ pgoff_t end);
+
/*
* Returns locked page at given index in given cache, creating it if needed.
@@ -116,6 +130,8 @@ int add_to_page_cache_lru(struct page *p
unsigned long index, gfp_t gfp_mask);
extern void remove_from_page_cache(struct page *page);
extern void __remove_from_page_cache(struct page *page);
+struct page *radix_tree_lookup_extent(struct radix_tree_root *root,
+ unsigned long index);
/*
* Return byte-offset into filesystem object for page.
diff -r 511f067627ac -r 4cac7e560b53 include/linux/radix-tree.h
--- a/include/linux/radix-tree.h Thu Dec 21 00:20:01 2006 -0800
+++ b/include/linux/radix-tree.h Thu Dec 21 15:31:30 2006 -0500
@@ -53,6 +53,7 @@ static inline int radix_tree_is_direct_p
/*** radix-tree API starts here ***/
#define RADIX_TREE_MAX_TAGS 2
+#define RADIX_TREE_MAX_ROOT_TAGS 3
/* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
struct radix_tree_root {
@@ -168,6 +169,7 @@ radix_tree_gang_lookup_tag(struct radix_
unsigned long first_index, unsigned int max_items,
unsigned int tag);
int radix_tree_tagged(struct radix_tree_root *root, unsigned int tag);
+void radix_tree_root_tag_set(struct radix_tree_root *root, unsigned int tag);
static inline void radix_tree_preload_end(void)
{
diff -r 511f067627ac -r 4cac7e560b53 lib/radix-tree.c
--- a/lib/radix-tree.c Thu Dec 21 00:20:01 2006 -0800
+++ b/lib/radix-tree.c Thu Dec 21 15:31:30 2006 -0500
@@ -468,6 +468,12 @@ void *radix_tree_tag_set(struct radix_tr
return slot;
}
EXPORT_SYMBOL(radix_tree_tag_set);
+
+void radix_tree_root_tag_set(struct radix_tree_root *root, unsigned int tag)
+{
+ root_tag_set(root, tag);
+}
+EXPORT_SYMBOL(radix_tree_root_tag_set);
/**
* radix_tree_tag_clear - clear a tag on a radix tree node
diff -r 511f067627ac -r 4cac7e560b53 mm/filemap.c
--- a/mm/filemap.c Thu Dec 21 00:20:01 2006 -0800
+++ b/mm/filemap.c Thu Dec 21 15:31:30 2006 -0500
@@ -44,6 +44,14 @@ generic_file_direct_IO(int rw, struct ki
generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
loff_t offset, unsigned long nr_segs);
+static wait_queue_head_t *page_waitqueue(struct page *page);
+static void wait_on_placeholder_page(struct address_space *mapping,
+ struct page *page,
+ int write_lock);
+
+static struct address_space placeholder_address_space;
+#define PagePlaceHolder(page) ((page)->mapping == &placeholder_address_space)
+
/*
* Shared mappings implemented 30.11.1994. It's not fully working yet,
* though.
@@ -421,6 +429,41 @@ int filemap_write_and_wait_range(struct
return err;
}
+/*
+ * When the radix tree has the extent bit set, a lookup needs to search
+ * forward in the tree to find any extent the index might intersect.
+ * When extents are off, a faster radix_tree_lookup can be done instead.
+ *
+ * This does the appropriate lookup based on the PAGE_CACHE_TAG_EXTENTS
+ * bit on the root node
+ */
+struct page *radix_tree_lookup_extent(struct radix_tree_root *root,
+ unsigned long index)
+{
+ if (radix_tree_tagged(root, PAGE_CACHE_TAG_EXTENTS)) {
+ struct page *p;
+ unsigned int found;
+ found = radix_tree_gang_lookup(root, (void **)(&p), index, 1);
+ if (found) {
+ if (PagePlaceHolder(p)) {
+ pgoff_t start = p->flags;
+ pgoff_t end = p->index;
+ if (end >= index && start <= index)
+ return p;
+ return NULL;
+ } else {
+ if (p->index == index) {
+ return p;
+ }
+ return NULL;
+ }
+ } else
+ return NULL;
+ }
+ return radix_tree_lookup(root, index);
+}
+
+
/**
* add_to_page_cache - add newly allocated pagecache pages
* @page: page to add
@@ -437,12 +480,38 @@ int add_to_page_cache(struct page *page,
int add_to_page_cache(struct page *page, struct address_space *mapping,
pgoff_t offset, gfp_t gfp_mask)
{
- int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
+ int error;
+again:
+ error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
if (error == 0) {
+ struct page *tmp;
write_lock_irq(&mapping->tree_lock);
+ /*
+ * If extents are on for this radix tree, we have to do
+ * the more expensive search for an overlapping extent
+ * before we try to insert.
+ */
+ if (radix_tree_tagged(&mapping->page_tree,
+ PAGE_CACHE_TAG_EXTENTS)) {
+ tmp = radix_tree_lookup_extent(&mapping->page_tree,
+ offset);
+ if (tmp && PagePlaceHolder(tmp))
+ goto exists;
+ }
error = radix_tree_insert(&mapping->page_tree, offset, page);
- if (!error) {
+ if (error == -EEXIST && (gfp_mask & __GFP_WAIT)) {
+ tmp = radix_tree_lookup_extent(&mapping->page_tree,
+ offset);
+ if (tmp && PagePlaceHolder(tmp)) {
+exists:
+ radix_tree_preload_end();
+ /* drops the lock */
+ wait_on_placeholder_page(mapping, tmp, 1);
+ goto again;
+ }
+ }
+ if (!error && !PagePlaceHolder(page)) {
page_cache_get(page);
SetPageLocked(page);
page->mapping = mapping;
@@ -516,6 +585,92 @@ void fastcall wait_on_page_bit(struct pa
}
EXPORT_SYMBOL(wait_on_page_bit);
+/*
+ * Call with either a read lock or a write lock on the mapping tree.
+ *
+ * placeholder pages can't be tested or checked without the tree lock held
+ *
+ * In order to wait for the placeholders without losing a wakeup from someone
+ * removing them, we have to prepare_to_wait before dropping the tree lock.
+ *
+ * The lock is dropped just before waiting for the place holder. It is not
+ * retaken before returning.
+ */
+static void wait_on_placeholder_page(struct address_space *mapping,
+ struct page *page,
+ int write_lock)
+{
+ DEFINE_WAIT(wait);
+ wait_queue_head_t *wqh = page_waitqueue(page);
+ prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+ if (write_lock)
+ write_unlock_irq(&mapping->tree_lock);
+ else
+ read_unlock_irq(&mapping->tree_lock);
+ io_schedule();
+ finish_wait(wqh, &wait);
+}
+
+void wake_up_placeholder_page(struct page *page)
+{
+ __wake_up_bit(page_waitqueue(page), &page->flags, PG_locked);
+}
+EXPORT_SYMBOL_GPL(wake_up_placeholder_page);
+
+/**
+ * wait_on_placeholder_pages_range - gang placeholder page waiter
+ * @mapping: The address_space to search
+ * @start: The starting page index
+ * @end: The max page index (inclusive)
+ *
+ * wait_on_placeholder_pages_range() will search for and wait on any
+ * placeholder pages in the given range of the mapping.
+ *
+ * On return, the range has no placeholder pages sitting in it.
+ */
+void wait_on_placeholder_pages_range(struct address_space *mapping,
+ pgoff_t start, pgoff_t end)
+{
+ unsigned int i;
+ unsigned int ret;
+ struct page *pages[8];
+ pgoff_t cur = start;
+ pgoff_t highest = start;
+
+ /*
+ * we expect a very small number of place holder pages, so
+ * this code isn't trying to be very fast.
+ */
+again:
+ read_lock_irq(&mapping->tree_lock);
+ ret = radix_tree_gang_lookup(&mapping->page_tree,
+ (void **)pages, cur, ARRAY_SIZE(pages));
+ for (i = 0; i < ret; i++) {
+ if (PagePlaceHolder(pages[i])) {
+ if (pages[i]->flags > end)
+ goto done;
+ /* drops the lock */
+ wait_on_placeholder_page(mapping, pages[i], 0);
+ goto again;
+ }
+ if (pages[i]->index > highest)
+ highest = pages[i]->index;
+ if (pages[i]->index > end)
+ goto done;
+ }
+ if (highest < end && ret == ARRAY_SIZE(pages)) {
+ cur = highest;
+ if (need_resched()) {
+ read_unlock_irq(&mapping->tree_lock);
+ cond_resched();
+ }
+ goto again;
+ }
+done:
+ read_unlock_irq(&mapping->tree_lock);
+}
+EXPORT_SYMBOL_GPL(wait_on_placeholder_pages_range);
+
/**
* unlock_page - unlock a locked page
* @page: the page
@@ -532,6 +687,7 @@ EXPORT_SYMBOL(wait_on_page_bit);
*/
void fastcall unlock_page(struct page *page)
{
+ BUG_ON(PagePlaceHolder(page));
smp_mb__before_clear_bit();
if (!TestClearPageLocked(page))
BUG();
@@ -568,6 +724,7 @@ void fastcall __lock_page(struct page *p
{
DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
+ BUG_ON(PagePlaceHolder(page));
__wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
TASK_UNINTERRUPTIBLE);
}
@@ -580,6 +737,7 @@ void fastcall __lock_page_nosync(struct
void fastcall __lock_page_nosync(struct page *page)
{
DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
+ BUG_ON(PagePlaceHolder(page));
__wait_on_bit_lock(page_waitqueue(page), &wait, __sleep_on_page_lock,
TASK_UNINTERRUPTIBLE);
}
@@ -597,13 +755,281 @@ struct page * find_get_page(struct addre
struct page *page;
read_lock_irq(&mapping->tree_lock);
- page = radix_tree_lookup(&mapping->page_tree, offset);
- if (page)
- page_cache_get(page);
+ page = radix_tree_lookup_extent(&mapping->page_tree, offset);
+ if (page) {
+ if (PagePlaceHolder(page))
+ page = NULL;
+ else
+ page_cache_get(page);
+ }
read_unlock_irq(&mapping->tree_lock);
return page;
}
EXPORT_SYMBOL(find_get_page);
+
+/**
+ * remove_placeholder_pages - remove a range of placeholder or locked pages
+ * @mapping: the page's address_space
+ * @pages: an array of page pointers to use for gang lookups
+ * @start: the search starting point
+ * @end: the search end point (offsets >= end are not touched)
+ * @nr: the size of the pages array.
+ *
+ * Any placeholder pages in the range specified are removed and their
+ * waiters woken.  Any real pages are unlocked and released.
+ */
+void remove_placeholder_pages(struct address_space *mapping,
+ struct page **pages,
+ unsigned long start,
+ unsigned long end,
+ unsigned long nr)
+{
+ struct page *page;
+ int ret;
+ int i;
+ unsigned long num;
+
+ write_lock_irq(&mapping->tree_lock);
+ while (start < end) {
+ num = min(nr, end-start);
+ ret = radix_tree_gang_lookup(&mapping->page_tree,
+ (void **)pages, start, num);
+ for (i = 0; i < ret; i++) {
+ page = pages[i];
+ if (PagePlaceHolder(page)) {
+ if (page->index >= end)
+ break;
+ radix_tree_delete(&mapping->page_tree,
+ page->index);
+ start = page->index + 1;
+ wake_up_placeholder_page(page);
+ kfree(page);
+ } else {
+ if (page->index >= end)
+ break;
+ unlock_page(page);
+ page_cache_release(page);
+ start = page->index + 1;
+ }
+ }
+ if (need_resched()) {
+ write_unlock_irq(&mapping->tree_lock);
+ cond_resched();
+ write_lock_irq(&mapping->tree_lock);
+ }
+ }
+ write_unlock_irq(&mapping->tree_lock);
+}
+EXPORT_SYMBOL_GPL(remove_placeholder_pages);
+
+/*
+ * a helper function to insert a single placeholder covering a range of
+ * slots in the radix tree.  The caller must have preloaded the radix
+ * tree and must hold the mapping's tree_lock.
+ *
+ * returns zero on success or a negative error from radix_tree_insert.
+ */
+static int insert_placeholder(struct address_space *mapping,
+ struct page *insert)
+{
+ int err;
+ unsigned int found;
+ struct page *debug_page;
+ /* sanity check, make sure other extents don't exist in this range */
+ found = radix_tree_gang_lookup(&mapping->page_tree,
+ (void **)(&debug_page),
+ insert->flags, 1);
+ BUG_ON(found > 0 && debug_page->flags <= (insert->index));
+ err = radix_tree_insert(&mapping->page_tree, insert->index, insert);
+ return err;
+}
+
+
+static struct page *alloc_placeholder(gfp_t gfp_mask)
+{
+ struct page *p;
+ p = kmalloc(sizeof(*p), gfp_mask);
+ if (p) {
+ memset(p, 0, sizeof(*p));
+ p->mapping = &placeholder_address_space;
+ }
+ return p;
+}
+
+/**
+ * find_or_insert_placeholders - locate a group of pagecache pages or insert one
+ * @mapping: the page's address_space
+ * @pages: an array of page pointers to use for gang lookups
+ * @start: the search starting point
+ * @end: the search end point (offsets >= end are not touched)
+ * @nr: the size of the pages array.
+ * @gfp_mask: page allocation mode
+ * @iowait: 1 if you want to wait for dirty or writeback pages.
+ *
+ * This locks down a range of offsets in the address space.  Any pages
+ * already present are locked and a placeholder page is inserted into
+ * the radix tree for any offsets without pages.
+ */
+int find_or_insert_placeholders(struct address_space *mapping,
+ struct page **pages, unsigned long start,
+ unsigned long end, unsigned long nr,
+ gfp_t gfp_mask,
+ int iowait)
+{
+ int err = 0;
+ int i, ret;
+ unsigned long cur = start;
+ struct page *page;
+ int restart;
+ struct page *insert = NULL;
+ /*
+ * this gets complicated. Placeholders and page locks need to
+ * be taken in order. We use gang lookup to cut down on the cpu
+ * cost, but we need to keep track of holes in the results and
+ * insert placeholders as appropriate.
+ *
+ * If a locked page or a placeholder is found, we wait for it and
+ * pick up where we left off. If a dirty or PG_writeback page is found
+ * and iowait==1, we have to drop all of our locks, kick/wait for the
+ * io and resume again.
+ */
+repeat:
+ if (!insert) {
+ insert = alloc_placeholder(gfp_mask);
+ if (!insert) {
+ err = -ENOMEM;
+ goto fail;
+ }
+ }
+ if (cur != start)
+ cond_resched();
+ err = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
+ if (err)
+ goto fail;
+ write_lock_irq(&mapping->tree_lock);
+
+ /* only set the extent tag if we are inserting placeholders for more
+ * than one page worth of slots. This way small random ios don't
+ * suffer from slower lookups.
+ */
+ if (cur == start && end - start > 1)
+ radix_tree_root_tag_set(&mapping->page_tree,
+ PAGE_CACHE_TAG_EXTENTS);
+repeat_lock:
+ ret = radix_tree_gang_lookup(&mapping->page_tree,
+ (void **)pages, cur,
+ min(nr, end-cur));
+ for (i = 0 ; i < ret ; i++) {
+ restart = 0;
+ page = pages[i];
+
+ if (PagePlaceHolder(page) && page->flags < end) {
+ radix_tree_preload_end();
+ /* drops the lock */
+ wait_on_placeholder_page(mapping, page, 1);
+ goto repeat;
+ }
+
+ if (page->index > cur) {
+ unsigned long top = min(end, page->index);
+ insert->index = top - 1;
+ insert->flags = cur;
+ err = insert_placeholder(mapping, insert);
+ write_unlock_irq(&mapping->tree_lock);
+ radix_tree_preload_end();
+ insert = NULL;
+ if (err)
+ goto fail;
+ cur = top;
+ if (cur < end)
+ goto repeat;
+ else
+ goto done;
+ }
+ if (page->index >= end) {
+ ret = 0;
+ break;
+ }
+ page_cache_get(page);
+ BUG_ON(page->index != cur);
+ BUG_ON(PagePlaceHolder(page));
+ if (TestSetPageLocked(page)) {
+ unsigned long tmpoff = page->index;
+ page_cache_get(page);
+ write_unlock_irq(&mapping->tree_lock);
+ radix_tree_preload_end();
+ __lock_page(page);
+ /* Has the page been truncated while we slept? */
+ if (unlikely(page->mapping != mapping ||
+ page->index != tmpoff)) {
+ unlock_page(page);
+ page_cache_release(page);
+ goto repeat;
+ } else {
+ /* we've locked the page, but we need to
+ * check it for dirty/writeback
+ */
+ restart = 1;
+ }
+ }
+ if (iowait && (PageDirty(page) || PageWriteback(page))) {
+ unlock_page(page);
+ page_cache_release(page);
+ if (!restart) {
+ write_unlock_irq(&mapping->tree_lock);
+ radix_tree_preload_end();
+ }
+ err = filemap_write_and_wait_range(mapping,
+ cur << PAGE_CACHE_SHIFT,
+ end << PAGE_CACHE_SHIFT);
+ if (err)
+ goto fail;
+ goto repeat;
+ }
+ cur++;
+ if (restart)
+ goto repeat;
+ if (cur >= end)
+ break;
+ }
+
+ /* we haven't yet filled the range */
+ if (cur < end) {
+ /* if the search filled our array, there is more to do. */
+ if (ret && ret == nr)
+ goto repeat_lock;
+
+ /* otherwise insert placeholders for the remaining offsets */
+ insert->index = end - 1;
+ insert->flags = cur;
+ err = insert_placeholder(mapping, insert);
+ write_unlock_irq(&mapping->tree_lock);
+ radix_tree_preload_end();
+ if (err)
+ goto fail;
+ insert = NULL;
+ cur = end;
+ } else {
+ write_unlock_irq(&mapping->tree_lock);
+ radix_tree_preload_end();
+ }
+done:
+ BUG_ON(cur < end);
+ BUG_ON(cur > end);
+ if (insert)
+ kfree(insert);
+ return err;
+fail:
+ remove_placeholder_pages(mapping, pages, start, cur, nr);
+ if (insert)
+ kfree(insert);
+ return err;
+}
+EXPORT_SYMBOL_GPL(find_or_insert_placeholders);
/**
* find_trylock_page - find and lock a page
@@ -617,8 +1043,8 @@ struct page *find_trylock_page(struct ad
struct page *page;
read_lock_irq(&mapping->tree_lock);
- page = radix_tree_lookup(&mapping->page_tree, offset);
- if (page && TestSetPageLocked(page))
+ page = radix_tree_lookup_extent(&mapping->page_tree, offset);
+ if (page && (PagePlaceHolder(page) || TestSetPageLocked(page)))
page = NULL;
read_unlock_irq(&mapping->tree_lock);
return page;
@@ -642,8 +1068,14 @@ struct page *find_lock_page(struct addre
read_lock_irq(&mapping->tree_lock);
repeat:
- page = radix_tree_lookup(&mapping->page_tree, offset);
+ page = radix_tree_lookup_extent(&mapping->page_tree, offset);
if (page) {
+ if (PagePlaceHolder(page)) {
+ /* drops the lock */
+ wait_on_placeholder_page(mapping, page, 0);
+ read_lock_irq(&mapping->tree_lock);
+ goto repeat;
+ }
page_cache_get(page);
if (TestSetPageLocked(page)) {
read_unlock_irq(&mapping->tree_lock);
@@ -727,14 +1159,25 @@ unsigned find_get_pages(struct address_s
unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
unsigned int nr_pages, struct page **pages)
{
- unsigned int i;
+ unsigned int i = 0;
unsigned int ret;
read_lock_irq(&mapping->tree_lock);
ret = radix_tree_gang_lookup(&mapping->page_tree,
(void **)pages, start, nr_pages);
- for (i = 0; i < ret; i++)
- page_cache_get(pages[i]);
+ while (i < ret) {
+ if (PagePlaceHolder(pages[i])) {
+ /* we can't return a place holder, shift it away */
+ if (i + 1 < ret) {
+ memmove(&pages[i], &pages[i+1],
+ (ret - i - 1) * sizeof(struct page *));
+ }
+ ret--;
+ continue;
+ } else
+ page_cache_get(pages[i]);
+ i++;
+ }
read_unlock_irq(&mapping->tree_lock);
return ret;
}
@@ -761,6 +1204,8 @@ unsigned find_get_pages_contig(struct ad
ret = radix_tree_gang_lookup(&mapping->page_tree,
(void **)pages, index, nr_pages);
for (i = 0; i < ret; i++) {
+ if (PagePlaceHolder(pages[i]))
+ break;
if (pages[i]->mapping == NULL || pages[i]->index != index)
break;
@@ -785,14 +1230,25 @@ unsigned find_get_pages_tag(struct addre
unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
int tag, unsigned int nr_pages, struct page **pages)
{
- unsigned int i;
+ unsigned int i = 0;
unsigned int ret;
read_lock_irq(&mapping->tree_lock);
ret = radix_tree_gang_lookup_tag(&mapping->page_tree,
(void **)pages, *index, nr_pages, tag);
- for (i = 0; i < ret; i++)
- page_cache_get(pages[i]);
+ while (i < ret) {
+ if (PagePlaceHolder(pages[i])) {
+ /* we can't return a place holder, shift it away */
+ if (i + 1 < ret) {
+ memmove(&pages[i], &pages[i+1],
+ (ret - i - 1) * sizeof(struct page *));
+ }
+ ret--;
+ continue;
+ } else
+ page_cache_get(pages[i]);
+ i++;
+ }
if (ret)
*index = pages[ret - 1]->index + 1;
read_unlock_irq(&mapping->tree_lock);
@@ -2406,18 +2862,15 @@ generic_file_direct_IO(int rw, struct ki
unmap_mapping_range(mapping, offset, write_len, 0);
}
- retval = filemap_write_and_wait(mapping);
- if (retval == 0) {
- retval = mapping->a_ops->direct_IO(rw, iocb, iov,
- offset, nr_segs);
- if (rw == WRITE && mapping->nrpages) {
- pgoff_t end = (offset + write_len - 1)
- >> PAGE_CACHE_SHIFT;
- int err = invalidate_inode_pages2_range(mapping,
- offset >> PAGE_CACHE_SHIFT, end);
- if (err)
- retval = err;
- }
+ retval = mapping->a_ops->direct_IO(rw, iocb, iov,
+ offset, nr_segs);
+ if (rw == WRITE && mapping->nrpages) {
+ pgoff_t end = (offset + write_len - 1)
+ >> PAGE_CACHE_SHIFT;
+ int err = invalidate_inode_pages2_range(mapping,
+ offset >> PAGE_CACHE_SHIFT, end);
+ if (err)
+ retval = err;
}
return retval;
}
diff -r 511f067627ac -r 4cac7e560b53 mm/migrate.c
--- a/mm/migrate.c Thu Dec 21 00:20:01 2006 -0800
+++ b/mm/migrate.c Thu Dec 21 15:31:30 2006 -0500
@@ -305,8 +305,12 @@ static int migrate_page_move_mapping(str
write_lock_irq(&mapping->tree_lock);
+ /*
+ * we don't need to worry about placeholders here,
+ * the slot in the tree is verified
+ */
pslot = radix_tree_lookup_slot(&mapping->page_tree,
- page_index(page));
+ page_index(page));
if (page_count(page) != 2 + !!PagePrivate(page) ||
(struct page *)radix_tree_deref_slot(pslot) != page) {
diff -r 511f067627ac -r 4cac7e560b53 mm/readahead.c
--- a/mm/readahead.c Thu Dec 21 00:20:01 2006 -0800
+++ b/mm/readahead.c Thu Dec 21 15:31:30 2006 -0500
@@ -288,7 +288,8 @@ __do_page_cache_readahead(struct address
if (page_offset > end_index)
break;
- page = radix_tree_lookup(&mapping->page_tree, page_offset);
+ page = radix_tree_lookup_extent(&mapping->page_tree,
+ page_offset);
if (page)
continue;
diff -r 511f067627ac -r 4cac7e560b53 mm/truncate.c
--- a/mm/truncate.c Thu Dec 21 00:20:01 2006 -0800
+++ b/mm/truncate.c Thu Dec 21 15:31:30 2006 -0500
@@ -209,6 +209,7 @@ void truncate_inode_pages_range(struct a
}
pagevec_release(&pvec);
}
+ wait_on_placeholder_pages_range(mapping, start, end);
}
EXPORT_SYMBOL(truncate_inode_pages_range);
* [PATCH 2 of 8] Change O_DIRECT to use placeholders instead of i_mutex/i_alloc_sem locking
From: Chris Mason @ 2006-12-22 1:55 UTC
To: linux-fsdevel, akpm, zach.brown
All mutex and semaphore usage is removed from the blockdev_direct_IO paths.
Filesystems can either do this locking on their own, or ask for placeholder
pages.
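For reference, a minimal sketch of the placeholder alternative as a
filesystem would see it, built on the patch 1 helpers (error handling
trimmed; the array size and gfp mask are illustrative, and offset/len
are the byte range of the IO):

    struct page *pages[8];
    unsigned long start = offset >> PAGE_CACHE_SHIFT;
    unsigned long end = ((offset + len - 1) >> PAGE_CACHE_SHIFT) + 1;
    int err;

    /* lock the range: real pages are locked, holes get placeholders */
    err = find_or_insert_placeholders(mapping, pages, start, end,
                                      ARRAY_SIZE(pages), GFP_KERNEL, 1);
    if (err)
            return err;

    /* ... submit the direct IO; buffered paths wait on the range ... */

    remove_placeholder_pages(mapping, pages, start, end, ARRAY_SIZE(pages));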
Signed-off-by: Chris Mason <chris.mason@oracle.com>
diff -r 4cac7e560b53 -r 317779b11fe1 fs/direct-io.c
--- a/fs/direct-io.c Thu Dec 21 15:31:30 2006 -0500
+++ b/fs/direct-io.c Thu Dec 21 15:31:30 2006 -0500
@@ -36,6 +36,7 @@
#include <linux/rwsem.h>
#include <linux/uio.h>
#include <asm/atomic.h>
+#include <linux/writeback.h>
/*
* How many user pages to map in one call to get_user_pages(). This determines
@@ -95,6 +96,13 @@ struct dio {
struct buffer_head map_bh; /* last get_block() result */
/*
+ * kernel page pinning
+ */
+ struct page *tmppages[DIO_PAGES];
+ unsigned long fspages_start_off;
+ unsigned long fspages_end_off;
+
+ /*
* Deferred addition of a page to the dio. These variables are
* private to dio_send_cur_page(), submit_page_section() and
* dio_bio_add_page().
@@ -190,6 +198,31 @@ out:
return ret;
}
+static void unlock_page_range(struct dio *dio, unsigned long start,
+ unsigned long nr)
+{
+ if (dio->lock_type != DIO_NO_LOCKING) {
+ remove_placeholder_pages(dio->inode->i_mapping, dio->tmppages,
+ start, start + nr,
+ ARRAY_SIZE(dio->tmppages));
+ }
+}
+
+static int lock_page_range(struct dio *dio, unsigned long start,
+ unsigned long nr)
+{
+ struct address_space *mapping = dio->inode->i_mapping;
+ unsigned long end = start + nr;
+
+ if (dio->lock_type == DIO_NO_LOCKING)
+ return 0;
+ return find_or_insert_placeholders(mapping, dio->tmppages, start, end,
+ ARRAY_SIZE(dio->tmppages),
+ GFP_KERNEL,
+ dio->rw == READ);
+}
+
+
/*
* Get another userspace page. Returns an ERR_PTR on error. Pages are
* buffered inside the dio so that we can call get_user_pages() against a
@@ -246,9 +279,9 @@ static int dio_complete(struct dio *dio,
if (dio->end_io && dio->result)
dio->end_io(dio->iocb, offset, transferred,
dio->map_bh.b_private);
- if (dio->lock_type == DIO_LOCKING)
- /* lockdep: non-owner release */
- up_read_non_owner(&dio->inode->i_alloc_sem);
+ unlock_page_range(dio, dio->fspages_start_off,
+ dio->fspages_end_off - dio->fspages_start_off);
+ dio->fspages_end_off = dio->fspages_start_off;
if (ret == 0)
ret = dio->page_errors;
@@ -513,6 +546,8 @@ static int get_more_blocks(struct dio *d
unsigned long fs_count; /* Number of filesystem-sized blocks */
unsigned long dio_count;/* Number of dio_block-sized blocks */
unsigned long blkmask;
+ unsigned long index;
+ unsigned long end;
int create;
/*
@@ -540,7 +575,24 @@ static int get_more_blocks(struct dio *d
} else if (dio->lock_type == DIO_NO_LOCKING) {
create = 0;
}
-
+ index = fs_startblk >> (PAGE_CACHE_SHIFT -
+ dio->inode->i_blkbits);
+ end = (dio->final_block_in_request >> dio->blkfactor) >>
+ (PAGE_CACHE_SHIFT - dio->inode->i_blkbits);
+ BUG_ON(index > end);
+ while (index >= dio->fspages_end_off) {
+ unsigned long nr = end - dio->fspages_end_off + 1;
+ /* if we're hitting buffered pages,
+ * work in smaller chunks. Otherwise, just
+ * lock down the whole thing
+ */
+ if (dio->inode->i_mapping->nrpages)
+ nr = min(nr, (unsigned long)DIO_PAGES);
+ ret = lock_page_range(dio, dio->fspages_end_off, nr);
+ if (ret)
+ goto error;
+ dio->fspages_end_off += nr;
+ }
/*
* For writes inside i_size we forbid block creations: only
* overwrites are permitted. We fall back to buffered writes
@@ -550,6 +602,7 @@ static int get_more_blocks(struct dio *d
ret = (*dio->get_block)(dio->inode, fs_startblk,
map_bh, create);
}
+error:
return ret;
}
@@ -946,9 +999,6 @@ out:
return ret;
}
-/*
- * Releases both i_mutex and i_alloc_sem
- */
static ssize_t
direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
const struct iovec *iov, loff_t offset, unsigned long nr_segs,
@@ -1074,14 +1124,6 @@ direct_io_worker(int rw, struct kiocb *i
dio_cleanup(dio);
/*
- * All block lookups have been performed. For READ requests
- * we can let i_mutex go now that its achieved its purpose
- * of protecting us from looking up uninitialized blocks.
- */
- if ((rw == READ) && (dio->lock_type == DIO_LOCKING))
- mutex_unlock(&dio->inode->i_mutex);
-
- /*
* The only time we want to leave bios in flight is when a successful
* partial aio read or full aio write have been setup. In that case
* bio completion will call aio_complete. The only time it's safe to
@@ -1130,8 +1172,6 @@ direct_io_worker(int rw, struct kiocb *i
* DIO_LOCKING (simple locking for regular files)
* For writes we are called under i_mutex and return with i_mutex held, even
* though it is internally dropped.
- * For reads, i_mutex is not held on entry, but it is taken and dropped before
- * returning.
*
* DIO_OWN_LOCKING (filesystem provides synchronisation and handling of
* uninitialised data, allowing parallel direct readers and writers)
@@ -1156,8 +1196,7 @@ __blockdev_direct_IO(int rw, struct kioc
ssize_t retval = -EINVAL;
loff_t end = offset;
struct dio *dio;
- int release_i_mutex = 0;
- int acquire_i_mutex = 0;
+ struct address_space *mapping = iocb->ki_filp->f_mapping;
if (rw & WRITE)
rw = WRITE_SYNC;
@@ -1186,49 +1225,28 @@ __blockdev_direct_IO(int rw, struct kioc
goto out;
}
}
-
dio = kmalloc(sizeof(*dio), GFP_KERNEL);
retval = -ENOMEM;
if (!dio)
goto out;
+ dio->fspages_start_off = offset >> PAGE_CACHE_SHIFT;
+ dio->fspages_end_off = dio->fspages_start_off;
+
/*
* For block device access DIO_NO_LOCKING is used,
* neither readers nor writers do any locking at all
* For regular files using DIO_LOCKING,
- * readers need to grab i_mutex and i_alloc_sem
- * writers need to grab i_alloc_sem only (i_mutex is already held)
+ * No locks are taken
* For regular files using DIO_OWN_LOCKING,
* neither readers nor writers take any locks here
*/
dio->lock_type = dio_lock_type;
- if (dio_lock_type != DIO_NO_LOCKING) {
- /* watch out for a 0 len io from a tricksy fs */
- if (rw == READ && end > offset) {
- struct address_space *mapping;
-
- mapping = iocb->ki_filp->f_mapping;
- if (dio_lock_type != DIO_OWN_LOCKING) {
- mutex_lock(&inode->i_mutex);
- release_i_mutex = 1;
- }
-
- retval = filemap_write_and_wait_range(mapping, offset,
- end - 1);
- if (retval) {
- kfree(dio);
- goto out;
- }
-
- if (dio_lock_type == DIO_OWN_LOCKING) {
- mutex_unlock(&inode->i_mutex);
- acquire_i_mutex = 1;
- }
- }
-
- if (dio_lock_type == DIO_LOCKING)
- /* lockdep: not the owner will release it */
- down_read_non_owner(&inode->i_alloc_sem);
+
+ if (dio->lock_type == DIO_NO_LOCKING && end > offset) {
+ retval = filemap_write_and_wait_range(mapping, offset, end - 1);
+ if (retval)
+ goto out;
}
/*
@@ -1242,15 +1260,7 @@ __blockdev_direct_IO(int rw, struct kioc
retval = direct_io_worker(rw, iocb, inode, iov, offset,
nr_segs, blkbits, get_block, end_io, dio);
-
- if (rw == READ && dio_lock_type == DIO_LOCKING)
- release_i_mutex = 0;
-
out:
- if (release_i_mutex)
- mutex_unlock(&inode->i_mutex);
- else if (acquire_i_mutex)
- mutex_lock(&inode->i_mutex);
return retval;
}
EXPORT_SYMBOL(__blockdev_direct_IO);
* [PATCH 3 of 8] DIO: don't fall back to buffered writes
From: Chris Mason @ 2006-12-22 1:57 UTC
To: linux-fsdevel, akpm, zach.brown
Placeholder pages allow DIO to use locking rules similar to those of
writepage. DIO can now fill holes, and it can extend the file via
get_block().
i_mutex can be dropped during writes if we are writing inside i_size.
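The condition, pulled out of the hunk below for clarity:

    /*
     * i_mutex is dropped only for synchronous writes that stay inside
     * i_size; AIO writes and file-extending writes keep it held.
     */
    if ((rw & WRITE) && dio_lock_type == DIO_LOCKING &&
        end <= i_size_read(inode) && is_sync_kiocb(iocb)) {
            dio->reacquire_i_mutex = 1;
            mutex_unlock(&inode->i_mutex);
    }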
Signed-off-by: Chris Mason <chris.mason@oracle.com>
diff -r 317779b11fe1 -r ac51e7a4c7a6 fs/direct-io.c
--- a/fs/direct-io.c Thu Dec 21 15:31:30 2006 -0500
+++ b/fs/direct-io.c Thu Dec 21 15:31:30 2006 -0500
@@ -70,6 +70,7 @@ struct dio {
int rw;
loff_t i_size; /* i_size when submitted */
int lock_type; /* doesn't change */
+ int reacquire_i_mutex; /* should we get i_mutex when done? */
unsigned blkbits; /* doesn't change */
unsigned blkfactor; /* When we're using an alignment which
is finer than the filesystem's soft
@@ -218,8 +219,7 @@ static int lock_page_range(struct dio *d
return 0;
return find_or_insert_placeholders(mapping, dio->tmppages, start, end,
ARRAY_SIZE(dio->tmppages),
- GFP_KERNEL,
- dio->rw == READ);
+ GFP_KERNEL, 1);
}
@@ -282,6 +282,8 @@ static int dio_complete(struct dio *dio,
unlock_page_range(dio, dio->fspages_start_off,
dio->fspages_end_off - dio->fspages_start_off);
dio->fspages_end_off = dio->fspages_start_off;
+ if (dio->reacquire_i_mutex)
+ mutex_lock(&dio->inode->i_mutex);
if (ret == 0)
ret = dio->page_errors;
@@ -568,13 +570,8 @@ static int get_more_blocks(struct dio *d
map_bh->b_size = fs_count << dio->inode->i_blkbits;
create = dio->rw & WRITE;
- if (dio->lock_type == DIO_LOCKING) {
- if (dio->block_in_file < (i_size_read(dio->inode) >>
- dio->blkbits))
- create = 0;
- } else if (dio->lock_type == DIO_NO_LOCKING) {
+ if (dio->lock_type == DIO_NO_LOCKING)
create = 0;
- }
index = fs_startblk >> (PAGE_CACHE_SHIFT -
dio->inode->i_blkbits);
end = (dio->final_block_in_request >> dio->blkfactor) >>
@@ -1258,6 +1255,13 @@ __blockdev_direct_IO(int rw, struct kioc
dio->is_async = !is_sync_kiocb(iocb) && !((rw & WRITE) &&
(end > i_size_read(inode)));
+ /* if our write is inside i_size, we can drop i_mutex */
+ dio->reacquire_i_mutex = 0;
+ if ((rw & WRITE) && dio_lock_type == DIO_LOCKING &&
+ end <= i_size_read(inode) && is_sync_kiocb(iocb)) {
+ dio->reacquire_i_mutex = 1;
+ mutex_unlock(&inode->i_mutex);
+ }
retval = direct_io_worker(rw, iocb, inode, iov, offset,
nr_segs, blkbits, get_block, end_io, dio);
out:
diff -r 317779b11fe1 -r ac51e7a4c7a6 mm/filemap.c
--- a/mm/filemap.c Thu Dec 21 15:31:30 2006 -0500
+++ b/mm/filemap.c Thu Dec 21 15:31:30 2006 -0500
@@ -2865,10 +2865,19 @@ generic_file_direct_IO(int rw, struct ki
retval = mapping->a_ops->direct_IO(rw, iocb, iov,
offset, nr_segs);
if (rw == WRITE && mapping->nrpages) {
+ int err;
pgoff_t end = (offset + write_len - 1)
>> PAGE_CACHE_SHIFT;
- int err = invalidate_inode_pages2_range(mapping,
- offset >> PAGE_CACHE_SHIFT, end);
+
+ /* O_DIRECT is allowed to drop i_mutex, so more data
+ * could have been dirtied by others. Start io one more
+ * time
+ */
+ err = filemap_fdatawrite_range(mapping, offset,
+ offset + write_len - 1);
+ if (!err)
+ err = invalidate_inode_pages2_range(mapping,
+ offset >> PAGE_CACHE_SHIFT, end);
if (err)
retval = err;
}
* [PATCH 4 of 8] Add flags to control direct IO helpers
From: Chris Mason @ 2006-12-22 1:59 UTC
To: linux-fsdevel, akpm, zach.brown
This creates a number of flags so that filesystems can control
blockdev_direct_IO. It is based on code from Russell Cattelan.
The new flags are:
DIO_CREATE -- always pass create=1 to get_block on writes. This allows
DIO to fill holes in the file.
DIO_PLACEHOLDERS -- use placeholder pages to provide locking against buffered
io and truncates.
DIO_DROP_I_MUTEX -- drop i_mutex before doing the block mapping, io
submission, and io waiting. The mutex is dropped for AIO
as well.
Some API changes are made so that filesystems can have more control
over the DIO features.
__blockdev_direct_IO is more or less renamed to blockdev_direct_IO_flags.
All waiting and invalidating of page cache data is pushed down into
blockdev_direct_IO_flags (and removed from mm/filemap.c)
direct_io_worker is exported into the wild. Filesystems that want to be
special can pull out the bits of blockdev_direct_IO_flags they care about
and then call direct_io_worker directly.
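As an illustration, a filesystem that wants placeholder locking and
hole filling but manages i_mutex itself could call (the myfs_* names
are hypothetical):

    static ssize_t myfs_direct_IO(int rw, struct kiocb *iocb,
                                  const struct iovec *iov, loff_t offset,
                                  unsigned long nr_segs)
    {
            struct inode *inode = iocb->ki_filp->f_mapping->host;

            return blockdev_direct_IO_flags(rw, iocb, inode,
                            inode->i_sb->s_bdev, iov, offset, nr_segs,
                            myfs_get_block, NULL,
                            DIO_PLACEHOLDERS | DIO_CREATE);
    }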
Signed-off-by: Chris Mason <chris.mason@oracle.com>
diff -r ac51e7a4c7a6 -r 385bc75d9266 fs/direct-io.c
--- a/fs/direct-io.c Thu Dec 21 15:31:30 2006 -0500
+++ b/fs/direct-io.c Thu Dec 21 15:31:30 2006 -0500
@@ -54,13 +54,6 @@
*
* If blkfactor is zero then the user's request was aligned to the filesystem's
* blocksize.
- *
- * lock_type is DIO_LOCKING for regular files on direct-IO-naive filesystems.
- * This determines whether we need to do the fancy locking which prevents
- * direct-IO from being able to read uninitialised disk blocks. If its zero
- * (blockdev) this locking is not done, and if it is DIO_OWN_LOCKING i_mutex is
- * not held for the entire direct write (taken briefly, initially, during a
- * direct read though, but its never held for the duration of a direct-IO).
*/
struct dio {
@@ -69,8 +62,7 @@ struct dio {
struct inode *inode;
int rw;
loff_t i_size; /* i_size when submitted */
- int lock_type; /* doesn't change */
- int reacquire_i_mutex; /* should we get i_mutex when done? */
+ unsigned flags; /* doesn't change */
unsigned blkbits; /* doesn't change */
unsigned blkfactor; /* When we're using an alignment which
is finer than the filesystem's soft
@@ -202,7 +194,7 @@ static void unlock_page_range(struct dio
static void unlock_page_range(struct dio *dio, unsigned long start,
unsigned long nr)
{
- if (dio->lock_type != DIO_NO_LOCKING) {
+ if (dio->flags & DIO_PLACEHOLDERS) {
remove_placeholder_pages(dio->inode->i_mapping, dio->tmppages,
start, start + nr,
ARRAY_SIZE(dio->tmppages));
@@ -215,13 +207,14 @@ static int lock_page_range(struct dio *d
struct address_space *mapping = dio->inode->i_mapping;
unsigned long end = start + nr;
- if (dio->lock_type == DIO_NO_LOCKING)
- return 0;
- return find_or_insert_placeholders(mapping, dio->tmppages, start, end,
- ARRAY_SIZE(dio->tmppages),
- GFP_KERNEL, 1);
-}
-
+ if (dio->flags & DIO_PLACEHOLDERS) {
+ return find_or_insert_placeholders(mapping, dio->tmppages,
+ start, end,
+ ARRAY_SIZE(dio->tmppages),
+ GFP_KERNEL, 1);
+ }
+ return 0;
+}
/*
* Get another userspace page. Returns an ERR_PTR on error. Pages are
@@ -282,8 +275,6 @@ static int dio_complete(struct dio *dio,
unlock_page_range(dio, dio->fspages_start_off,
dio->fspages_end_off - dio->fspages_start_off);
dio->fspages_end_off = dio->fspages_start_off;
- if (dio->reacquire_i_mutex)
- mutex_lock(&dio->inode->i_mutex);
if (ret == 0)
ret = dio->page_errors;
@@ -569,8 +560,9 @@ static int get_more_blocks(struct dio *d
map_bh->b_state = 0;
map_bh->b_size = fs_count << dio->inode->i_blkbits;
- create = dio->rw & WRITE;
- if (dio->lock_type == DIO_NO_LOCKING)
+ if (dio->flags & DIO_CREATE)
+ create = dio->rw & WRITE;
+ else
create = 0;
index = fs_startblk >> (PAGE_CACHE_SHIFT -
dio->inode->i_blkbits);
@@ -996,19 +988,43 @@ out:
return ret;
}
-static ssize_t
-direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
- const struct iovec *iov, loff_t offset, unsigned long nr_segs,
+/*
+ * This does all the real work of the direct io. Most filesystems want to
+ * call blockdev_direct_IO_flags instead, but if you have exotic locking
+ * routines you can call this directly.
+ *
+ * The flags parameter is a bitmask of:
+ *
+ * DIO_PLACEHOLDERS (use placeholder pages for locking)
+ * DIO_CREATE (pass create=1 to get_block for filling holes or extending)
+ * DIO_DROP_I_MUTEX (drop inode->i_mutex during writes)
+ */
+ssize_t
+direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
+ const struct iovec *iov, loff_t offset, unsigned long nr_segs,
unsigned blkbits, get_block_t get_block, dio_iodone_t end_io,
- struct dio *dio)
-{
- unsigned long user_addr;
+ int is_async, unsigned dioflags)
+{
+ unsigned long user_addr;
unsigned long flags;
int seg;
ssize_t ret = 0;
ssize_t ret2;
size_t bytes;
-
+ struct dio *dio;
+
+ if (rw & WRITE)
+ rw = WRITE_SYNC;
+
+ dio = kmalloc(sizeof(*dio), GFP_KERNEL);
+ ret = -ENOMEM;
+ if (!dio)
+ goto out;
+
+ dio->fspages_start_off = offset >> PAGE_CACHE_SHIFT;
+ dio->fspages_end_off = dio->fspages_start_off;
+ dio->flags = dioflags;
+ dio->is_async = is_async;
dio->bio = NULL;
dio->inode = inode;
dio->rw = rw;
@@ -1156,33 +1172,24 @@ direct_io_worker(int rw, struct kiocb *i
} else
BUG_ON(ret != -EIOCBQUEUED);
+out:
return ret;
}
-
-/*
- * This is a library function for use by filesystem drivers.
- * The locking rules are governed by the dio_lock_type parameter.
- *
- * DIO_NO_LOCKING (no locking, for raw block device access)
- * For writes, i_mutex is not held on entry; it is never taken.
- *
- * DIO_LOCKING (simple locking for regular files)
- * For writes we are called under i_mutex and return with i_mutex held, even
- * though it is internally dropped.
- *
- * DIO_OWN_LOCKING (filesystem provides synchronisation and handling of
- * uninitialised data, allowing parallel direct readers and writers)
- * For writes we are called without i_mutex, return without it, never touch it.
- * For reads we are called under i_mutex and return with i_mutex held, even
- * though it may be internally dropped.
- *
- * Additional i_alloc_sem locking requirements described inline below.
- */
-ssize_t
-__blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
- struct block_device *bdev, const struct iovec *iov, loff_t offset,
- unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io,
- int dio_lock_type)
+EXPORT_SYMBOL(direct_io_worker);
+
+/*
+ * A utility function for blockdev_direct_IO_flags, this checks the
+ * alignment of an O_DIRECT iovec against filesystem and blockdevice
+ * requirements.
+ *
+ * It returns a blkbits value that will work for the io, and returns the
+ * end offset of the io (via blkbits_ret and end_ret).
+ *
+ * The function returns 0 if everything will work or -EINVAL on error
+ */
+int check_dio_alignment(struct inode *inode, struct block_device *bdev,
+ const struct iovec *iov, loff_t offset, unsigned long nr_segs,
+ unsigned *blkbits_ret, loff_t *end_ret)
{
int seg;
size_t size;
@@ -1190,13 +1197,7 @@ __blockdev_direct_IO(int rw, struct kioc
unsigned blkbits = inode->i_blkbits;
unsigned bdev_blkbits = 0;
unsigned blocksize_mask = (1 << blkbits) - 1;
- ssize_t retval = -EINVAL;
loff_t end = offset;
- struct dio *dio;
- struct address_space *mapping = iocb->ki_filp->f_mapping;
-
- if (rw & WRITE)
- rw = WRITE_SYNC;
if (bdev)
bdev_blkbits = blksize_bits(bdev_hardsect_size(bdev));
@@ -1206,7 +1207,7 @@ __blockdev_direct_IO(int rw, struct kioc
blkbits = bdev_blkbits;
blocksize_mask = (1 << blkbits) - 1;
if (offset & blocksize_mask)
- goto out;
+ return -EINVAL;
}
/* Check the memory alignment. Blocks cannot straddle pages */
@@ -1218,29 +1219,60 @@ __blockdev_direct_IO(int rw, struct kioc
if (bdev)
blkbits = bdev_blkbits;
blocksize_mask = (1 << blkbits) - 1;
- if ((addr & blocksize_mask) || (size & blocksize_mask))
- goto out;
+ if ((addr & blocksize_mask) || (size & blocksize_mask))
+ return -EINVAL;
}
}
- dio = kmalloc(sizeof(*dio), GFP_KERNEL);
- retval = -ENOMEM;
- if (!dio)
+ *end_ret = end;
+ *blkbits_ret = blkbits;
+ return 0;
+}
+EXPORT_SYMBOL(check_dio_alignment);
+
+/*
+ * This is a library function for use by filesystem drivers.
+ * The flags parameter is a bitmask of:
+ *
+ * DIO_PLACEHOLDERS (use placeholder pages for locking)
+ * DIO_CREATE (pass create=1 to get_block for filling holes)
+ * DIO_DROP_I_MUTEX (drop inode->i_mutex during writes)
+ */
+ssize_t
+blockdev_direct_IO_flags(int rw, struct kiocb *iocb, struct inode *inode,
+ struct block_device *bdev, const struct iovec *iov, loff_t offset,
+ unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io,
+ unsigned flags)
+{
+ struct address_space *mapping = iocb->ki_filp->f_mapping;
+ unsigned blkbits = 0;
+ ssize_t retval = -EINVAL;
+ loff_t end = 0;
+ int is_async;
+ int grab_i_mutex = 0;
+
+
+ if (check_dio_alignment(inode, bdev, iov, offset, nr_segs,
+ &blkbits, &end))
goto out;
- dio->fspages_start_off = offset >> PAGE_CACHE_SHIFT;
- dio->fspages_end_off = dio->fspages_start_off;
-
- /*
- * For block device access DIO_NO_LOCKING is used,
- * neither readers nor writers do any locking at all
- * For regular files using DIO_LOCKING,
- * No locks are taken
- * For regular files using DIO_OWN_LOCKING,
- * neither readers nor writers take any locks here
- */
- dio->lock_type = dio_lock_type;
-
- if (dio->lock_type == DIO_NO_LOCKING && end > offset) {
+ if (rw & WRITE) {
+ /*
+ * If it's a write, unmap all mmappings of the file up-front.
+ * This will cause any pte dirty bits to be propagated into
+ * the pageframes for the subsequent filemap_write_and_wait().
+ */
+ if (mapping_mapped(mapping))
+ unmap_mapping_range(mapping, offset, end - offset, 0);
+ if (end <= i_size_read(inode) && (flags & DIO_DROP_I_MUTEX)) {
+ mutex_unlock(&inode->i_mutex);
+ grab_i_mutex = 1;
+ }
+ }
+ /*
+ * the placeholder code does filemap_write_and_wait, so if we
+ * aren't using placeholders we have to do it here
+ */
+ if (!(flags & DIO_PLACEHOLDERS) && end > offset) {
retval = filemap_write_and_wait_range(mapping, offset, end - 1);
if (retval)
goto out;
@@ -1252,19 +1284,30 @@ __blockdev_direct_IO(int rw, struct kioc
* even for AIO, we need to wait for i/o to complete before
* returning in this case.
*/
- dio->is_async = !is_sync_kiocb(iocb) && !((rw & WRITE) &&
+ is_async = !is_sync_kiocb(iocb) && !((rw & WRITE) &&
(end > i_size_read(inode)));
- /* if our write is inside i_size, we can drop i_mutex */
- dio->reacquire_i_mutex = 0;
- if ((rw & WRITE) && dio_lock_type == DIO_LOCKING &&
- end <= i_size_read(inode) && is_sync_kiocb(iocb)) {
- dio->reacquire_i_mutex = 1;
- mutex_unlock(&inode->i_mutex);
- }
retval = direct_io_worker(rw, iocb, inode, iov, offset,
- nr_segs, blkbits, get_block, end_io, dio);
+ nr_segs, blkbits, get_block, end_io, is_async,
+ flags);
out:
+ if (grab_i_mutex)
+ mutex_lock(&inode->i_mutex);
+
+ if ((rw & WRITE) && mapping->nrpages) {
+ int err;
+ /* O_DIRECT is allowed to drop i_mutex, so more data
+ * could have been dirtied by others. Start io one more
+ * time
+ */
+ err = filemap_write_and_wait_range(mapping, offset, end - 1);
+ if (!err)
+ err = invalidate_inode_pages2_range(mapping,
+ offset >> PAGE_CACHE_SHIFT,
+ (end - 1) >> PAGE_CACHE_SHIFT);
+ if (!retval && err)
+ retval = err;
+ }
return retval;
}
-EXPORT_SYMBOL(__blockdev_direct_IO);
+EXPORT_SYMBOL(blockdev_direct_IO_flags);
diff -r ac51e7a4c7a6 -r 385bc75d9266 include/linux/fs.h
--- a/include/linux/fs.h Thu Dec 21 15:31:30 2006 -0500
+++ b/include/linux/fs.h Thu Dec 21 15:31:30 2006 -0500
@@ -1775,24 +1775,28 @@ static inline void do_generic_file_read(
}
#ifdef CONFIG_BLOCK
-ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
+int check_dio_alignment(struct inode *inode, struct block_device *bdev,
+ const struct iovec *iov, loff_t offset, unsigned long nr_segs,
+ unsigned *blkbits_ret, loff_t *end_ret);
+
+ssize_t blockdev_direct_IO_flags(int rw, struct kiocb *iocb, struct inode *inode,
struct block_device *bdev, const struct iovec *iov, loff_t offset,
unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io,
- int lock_type);
-
-enum {
- DIO_LOCKING = 1, /* need locking between buffered and direct access */
- DIO_NO_LOCKING, /* bdev; no locking at all between buffered/direct */
- DIO_OWN_LOCKING, /* filesystem locks buffered and direct internally */
-};
+ unsigned int dio_flags);
+
+#define DIO_PLACEHOLDERS (1 << 0) /* insert placeholder pages */
+#define DIO_CREATE (1 << 1) /* pass create=1 to get_block when writing */
+#define DIO_DROP_I_MUTEX (1 << 2) /* drop i_mutex during writes */
static inline ssize_t blockdev_direct_IO(int rw, struct kiocb *iocb,
struct inode *inode, struct block_device *bdev, const struct iovec *iov,
loff_t offset, unsigned long nr_segs, get_block_t get_block,
dio_iodone_t end_io)
{
- return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
- nr_segs, get_block, end_io, DIO_LOCKING);
+ /* locking is on, FS wants to fill holes w/get_block */
+ return blockdev_direct_IO_flags(rw, iocb, inode, bdev, iov, offset,
+ nr_segs, get_block, end_io, DIO_PLACEHOLDERS |
+ DIO_CREATE | DIO_DROP_I_MUTEX);
}
static inline ssize_t blockdev_direct_IO_no_locking(int rw, struct kiocb *iocb,
@@ -1800,17 +1804,9 @@ static inline ssize_t blockdev_direct_IO
loff_t offset, unsigned long nr_segs, get_block_t get_block,
dio_iodone_t end_io)
{
- return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
- nr_segs, get_block, end_io, DIO_NO_LOCKING);
-}
-
-static inline ssize_t blockdev_direct_IO_own_locking(int rw, struct kiocb *iocb,
- struct inode *inode, struct block_device *bdev, const struct iovec *iov,
- loff_t offset, unsigned long nr_segs, get_block_t get_block,
- dio_iodone_t end_io)
-{
- return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
- nr_segs, get_block, end_io, DIO_OWN_LOCKING);
+ /* locking is off, create is off */
+ return blockdev_direct_IO_flags(rw, iocb, inode, bdev, iov, offset,
+ nr_segs, get_block, end_io, 0);
}
#endif
diff -r ac51e7a4c7a6 -r 385bc75d9266 mm/filemap.c
--- a/mm/filemap.c Thu Dec 21 15:31:30 2006 -0500
+++ b/mm/filemap.c Thu Dec 21 15:31:30 2006 -0500
@@ -40,7 +40,7 @@
#include <asm/mman.h>
-static ssize_t
+static inline ssize_t
generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
loff_t offset, unsigned long nr_segs);
@@ -2842,46 +2842,12 @@ EXPORT_SYMBOL(generic_file_aio_write);
* Called under i_mutex for writes to S_ISREG files. Returns -EIO if something
* went wrong during pagecache shootdown.
*/
-static ssize_t
+static inline ssize_t
generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
loff_t offset, unsigned long nr_segs)
{
- struct file *file = iocb->ki_filp;
- struct address_space *mapping = file->f_mapping;
- ssize_t retval;
- size_t write_len = 0;
-
- /*
- * If it's a write, unmap all mmappings of the file up-front. This
- * will cause any pte dirty bits to be propagated into the pageframes
- * for the subsequent filemap_write_and_wait().
- */
- if (rw == WRITE) {
- write_len = iov_length(iov, nr_segs);
- if (mapping_mapped(mapping))
- unmap_mapping_range(mapping, offset, write_len, 0);
- }
-
- retval = mapping->a_ops->direct_IO(rw, iocb, iov,
- offset, nr_segs);
- if (rw == WRITE && mapping->nrpages) {
- int err;
- pgoff_t end = (offset + write_len - 1)
- >> PAGE_CACHE_SHIFT;
-
- /* O_DIRECT is allowed to drop i_mutex, so more data
- * could have been dirtied by others. Start io one more
- * time
- */
- err = filemap_fdatawrite_range(mapping, offset,
- offset + write_len - 1);
- if (!err)
- err = invalidate_inode_pages2_range(mapping,
- offset >> PAGE_CACHE_SHIFT, end);
- if (err)
- retval = err;
- }
- return retval;
+ return iocb->ki_filp->f_mapping->a_ops->direct_IO(rw, iocb, iov,
+ offset, nr_segs);
}
/**
* [PATCH 5 of 8] Make ext3 safe for the new DIO locking rules
From: Chris Mason @ 2006-12-22 2:00 UTC
To: linux-fsdevel, akpm, zach.brown
This creates a version of ext3_get_block that starts and ends a transaction.
By starting and ending the transaction inside get_block, it avoids
lock inversion problems when the DIO code takes page locks inside
blockdev_direct_IO (transaction locks must always be taken after
page locks).
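The resulting ordering, condensed from the hunk below (error handling
trimmed; the NULL check matters because ext3_get_block() can restart
and close the handle):

    /*
     * each get_block call opens and closes its own handle, so the
     * transaction always nests inside whatever placeholder/page locks
     * DIO already holds, never the other way around
     */
    handle = ext3_journal_start(inode, DIO_CREDITS);
    ret = ext3_get_block(inode, iblock, bh_result, create);
    handle = journal_current_handle();
    if (handle)
            ext3_journal_stop(handle);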
Signed-off-by: Chris Mason <chris.mason@oracle.com>
diff -r 385bc75d9266 -r bebaf8972a31 fs/ext3/inode.c
--- a/fs/ext3/inode.c Thu Dec 21 15:31:30 2006 -0500
+++ b/fs/ext3/inode.c Thu Dec 21 15:31:30 2006 -0500
@@ -1673,6 +1673,30 @@ static int ext3_releasepage(struct page
return journal_try_to_free_buffers(journal, page, wait);
}
+static int ext3_get_block_direct_IO(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh_result, int create)
+{
+ int ret = 0;
+ handle_t *handle = ext3_journal_start(inode, DIO_CREDITS);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto out;
+ }
+ ret = ext3_get_block(inode, iblock, bh_result, create);
+ /*
+ * Reacquire the handle: ext3_get_block() can restart the transaction
+ */
+ handle = journal_current_handle();
+ if (handle) {
+ int err;
+ err = ext3_journal_stop(handle);
+ if (!ret)
+ ret = err;
+ }
+out:
+ return ret;
+}
+
/*
* If the O_DIRECT write will extend the file then add this inode to the
* orphan list. So recovery will truncate it back to the original size
@@ -1693,39 +1717,58 @@ static ssize_t ext3_direct_IO(int rw, st
int orphan = 0;
size_t count = iov_length(iov, nr_segs);
- if (rw == WRITE) {
- loff_t final_size = offset + count;
-
+ if (rw == WRITE && (offset + count > inode->i_size)) {
handle = ext3_journal_start(inode, DIO_CREDITS);
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
goto out;
}
- if (final_size > inode->i_size) {
- ret = ext3_orphan_add(handle, inode);
- if (ret)
- goto out_stop;
- orphan = 1;
- ei->i_disksize = inode->i_size;
- }
- }
-
+ ret = ext3_orphan_add(handle, inode);
+ if (ret) {
+ ext3_journal_stop(handle);
+ goto out;
+ }
+ ei->i_disksize = inode->i_size;
+ ret = ext3_journal_stop(handle);
+ if (ret) {
+ /* something has gone horribly wrong, cleanup
+ * the orphan list in ram
+ */
+ if (inode->i_nlink)
+ ext3_orphan_del(NULL, inode);
+ goto out;
+ }
+ orphan = 1;
+ }
+
+ /*
+ * the placeholder page code may take a page lock, so we have
+ * to stop any running transactions before calling
+ * blockdev_direct_IO. Use ext3_get_block_direct_IO to start
+ * and stop a transaction on each get_block call.
+ */
ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
offset, nr_segs,
- ext3_get_block, NULL);
+ ext3_get_block_direct_IO, NULL);
/*
* Reacquire the handle: ext3_get_block() can restart the transaction
*/
handle = journal_current_handle();
-out_stop:
- if (handle) {
+ if (orphan) {
int err;
-
- if (orphan && inode->i_nlink)
+ handle = ext3_journal_start(inode, DIO_CREDITS);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ if (inode->i_nlink)
+ ext3_orphan_del(NULL, inode);
+ goto out;
+ }
+
+ if (inode->i_nlink)
ext3_orphan_del(handle, inode);
- if (orphan && ret > 0) {
+ if (ret > 0) {
loff_t end = offset + ret;
if (end > inode->i_size) {
ei->i_disksize = end;
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH 6 of 8] Make reiserfs safe for new DIO locking rules
2006-12-22 1:45 [PATCH 0 of 8] O_DIRECT locking rework v5 Chris Mason
` (4 preceding siblings ...)
2006-12-22 2:00 ` [PATCH 5 of 8] Make ext3 safe for the new DIO locking rules Chris Mason
@ 2006-12-22 2:02 ` Chris Mason
2006-12-22 2:03 ` [PATCH 7 of 8] Adapt XFS to the new blockdev_direct_IO calls Chris Mason
2006-12-22 2:05 ` [PATCH 8 of 8] Avoid too many boundary buffers in DIO Chris Mason
7 siblings, 0 replies; 10+ messages in thread
From: Chris Mason @ 2006-12-22 2:02 UTC (permalink / raw)
To: linux-fsdevel, akpm, zach.brown
reiserfs is changed to use a version of reiserfs_get_block that is safe
for filling holes without holding i_mutex.
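For readers unfamiliar with reiserfs' convention: the `create` argument of
reiserfs_get_block doubles as a flag word, with behavior bits OR'ed in on
top of the plain create bit. A sketch of the composition (the bit values
shown are hypothetical; the real definitions live in the reiserfs headers):

	/* hypothetical values, for illustration only */
	#define GET_BLOCK_NO_DANGLE	0x04	/* don't leave a dangling
						 * journal handle behind */
	#define GET_BLOCK_NO_IMUX	0x10	/* must work without
						 * i_mutex held */

	ret = reiserfs_get_block(inode, iblock, bh_result,
				 create | GET_BLOCK_NO_DANGLE |
				 GET_BLOCK_NO_IMUX);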
Signed-off-by: Chris Mason <chris.mason@oracle.com>
diff -r bebaf8972a31 -r 5a06df98f46d fs/reiserfs/inode.c
--- a/fs/reiserfs/inode.c Thu Dec 21 15:31:30 2006 -0500
+++ b/fs/reiserfs/inode.c Thu Dec 21 15:31:31 2006 -0500
@@ -469,7 +469,8 @@ static int reiserfs_get_blocks_direct_io
bh_result->b_size = (1 << inode->i_blkbits);
ret = reiserfs_get_block(inode, iblock, bh_result,
- create | GET_BLOCK_NO_DANGLE);
+ create | GET_BLOCK_NO_DANGLE |
+ GET_BLOCK_NO_IMUX);
if (ret)
goto out;
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH 7 of 8] Adapt XFS to the new blockdev_direct_IO calls
2006-12-22 1:45 [PATCH 0 of 8] O_DIRECT locking rework v5 Chris Mason
` (5 preceding siblings ...)
2006-12-22 2:02 ` [PATCH 6 of 8] Make reiserfs safe for " Chris Mason
@ 2006-12-22 2:03 ` Chris Mason
2006-12-22 2:05 ` [PATCH 8 of 8] Avoid too many boundary buffers in DIO Chris Mason
7 siblings, 0 replies; 10+ messages in thread
From: Chris Mason @ 2006-12-22 2:03 UTC (permalink / raw)
To: linux-fsdevel, akpm, zach.brown
XFS is changed to use blockdev_direct_IO flags instead of DIO_OWN_LOCKING.
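The new helper's signature, as inferred from the call site below (patch 4
introduces it; treat this prototype as a sketch, not a quote):

	ssize_t blockdev_direct_IO_flags(int rw, struct kiocb *iocb,
			struct inode *inode, struct block_device *bdev,
			const struct iovec *iov, loff_t offset,
			unsigned long nr_segs, get_block_t get_block,
			dio_iodone_t end_io, int flags);

Per the comment in the hunk, DIO_CREATE asks the helper to always pass
create=1 to get_block on writes, and no special locking is done on XFS's
behalf.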
Signed-off-by: Chris Mason <chris.mason@oracle.com>
diff -r 5a06df98f46d -r 3bd838f3dc06 fs/xfs/linux-2.6/xfs_aops.c
--- a/fs/xfs/linux-2.6/xfs_aops.c Thu Dec 21 15:31:31 2006 -0500
+++ b/fs/xfs/linux-2.6/xfs_aops.c Thu Dec 21 15:31:31 2006 -0500
@@ -1392,19 +1392,16 @@ xfs_vm_direct_IO(
iocb->private = xfs_alloc_ioend(inode, IOMAP_UNWRITTEN);
- if (rw == WRITE) {
- ret = blockdev_direct_IO_own_locking(rw, iocb, inode,
- iomap.iomap_target->bt_bdev,
- iov, offset, nr_segs,
- xfs_get_blocks_direct,
- xfs_end_io_direct);
- } else {
- ret = blockdev_direct_IO_no_locking(rw, iocb, inode,
- iomap.iomap_target->bt_bdev,
- iov, offset, nr_segs,
- xfs_get_blocks_direct,
- xfs_end_io_direct);
- }
+ /*
+ * ask DIO not to do any special locking for us, and to always
+ * pass create=1 to get_block on writes
+ */
+ ret = blockdev_direct_IO_flags(rw, iocb, inode,
+ iomap.iomap_target->bt_bdev,
+ iov, offset, nr_segs,
+ xfs_get_blocks_direct,
+ xfs_end_io_direct,
+ DIO_CREATE);
if (unlikely(ret != -EIOCBQUEUED && iocb->private))
xfs_destroy_ioend(iocb->private);
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH 8 of 8] Avoid too many boundary buffers in DIO
2006-12-22 1:45 [PATCH 0 of 8] O_DIRECT locking rework v5 Chris Mason
` (6 preceding siblings ...)
2006-12-22 2:03 ` [PATCH 7 of 8] Adapt XFS to the new blockdev_direct_IO calls Chris Mason
@ 2006-12-22 2:05 ` Chris Mason
7 siblings, 0 replies; 10+ messages in thread
From: Chris Mason @ 2006-12-22 2:05 UTC (permalink / raw)
To: linux-fsdevel, akpm, zach.brown
Dave Chinner found a 10% performance regression with ext3 when using DIO
to fill holes instead of buffered IO. On large IOs, the ext3 get_block
routine will send more than a page's worth of blocks back to DIO via a
single buffer_head with a large b_size value.
The DIO code iterates through this massive mapping and tests for a
boundary buffer over and over again. For every blocksize unit spanned
by the big map_bh, the boundary bit is tested and a bio may be forced
down to the block layer.
There are two potential fixes. One is to ignore the boundary bit on
large regions returned by the FS: DIO can't tell which part of the big
region was the boundary, so it may not be a good idea to trust the
hint. This patch takes the other route and just clears the boundary bit
after using it once. It is 10% faster for a streaming DIO write with a
512k blocksize on my SATA drive.
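To make the cost concrete, a rough sketch of the loop the description is
talking about (simplified; not the literal fs/direct-io.c code):

	unsigned i, nblocks;

	/* one get_block call can map many fs blocks at once:
	 * 512k at a 4k blocksize = 128 blocks for a single map_bh */
	nblocks = map_bh->b_size >> blkbits;
	for (i = 0; i < nblocks; i++) {
		/* old behavior: re-test the boundary bit on every
		 * block, potentially submitting a bio each time */
		dio->boundary = buffer_boundary(map_bh);
		submit_page_section(dio, page, offset, len,
				    next_block_for_io++);
	}

With the patch, buffer_boundary() is consulted only when the last chunk
of the mapped region goes down (see the dio->blocks_available test in the
final hunk), so at most one early submit happens per get_block mapping.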
Signed-off-by: Chris Mason <chris.mason@oracle.com>
diff -r 3bd838f3dc06 -r 9d3d4e0f01fe fs/direct-io.c
--- a/fs/direct-io.c Thu Dec 21 15:31:31 2006 -0500
+++ b/fs/direct-io.c Thu Dec 21 15:31:31 2006 -0500
@@ -610,7 +610,6 @@ static int dio_new_bio(struct dio *dio,
nr_pages = min(dio->pages_in_io, bio_get_nr_vecs(dio->map_bh.b_bdev));
BUG_ON(nr_pages <= 0);
ret = dio_bio_alloc(dio, dio->map_bh.b_bdev, sector, nr_pages);
- dio->boundary = 0;
out:
return ret;
}
@@ -664,12 +663,6 @@ static int dio_send_cur_page(struct dio
*/
if (dio->final_block_in_bio != dio->cur_page_block)
dio_bio_submit(dio);
- /*
- * Submit now if the underlying fs is about to perform a
- * metadata read
- */
- if (dio->boundary)
- dio_bio_submit(dio);
}
if (dio->bio == NULL) {
@@ -686,6 +679,12 @@ static int dio_send_cur_page(struct dio
BUG_ON(ret != 0);
}
}
+ /*
+ * Submit now if the underlying fs is about to perform a
+ * metadata read
+ */
+ if (dio->boundary)
+ dio_bio_submit(dio);
out:
return ret;
}
@@ -712,6 +711,10 @@ submit_page_section(struct dio *dio, str
unsigned offset, unsigned len, sector_t blocknr)
{
int ret = 0;
+ int boundary = dio->boundary;
+
+ /* don't let dio_send_cur_page do the boundary too soon */
+ dio->boundary = 0;
if (dio->rw & WRITE) {
/*
@@ -728,17 +731,7 @@ submit_page_section(struct dio *dio, str
(dio->cur_page_block +
(dio->cur_page_len >> dio->blkbits) == blocknr)) {
dio->cur_page_len += len;
-
- /*
- * If dio->boundary then we want to schedule the IO now to
- * avoid metadata seeks.
- */
- if (dio->boundary) {
- ret = dio_send_cur_page(dio);
- page_cache_release(dio->cur_page);
- dio->cur_page = NULL;
- }
- goto out;
+ goto out_send;
}
/*
@@ -757,6 +750,18 @@ submit_page_section(struct dio *dio, str
dio->cur_page_offset = offset;
dio->cur_page_len = len;
dio->cur_page_block = blocknr;
+
+out_send:
+ /*
+ * If dio->boundary then we want to schedule the IO now to
+ * avoid metadata seeks.
+ */
+ if (boundary) {
+ dio->boundary = 1;
+ ret = dio_send_cur_page(dio);
+ page_cache_release(dio->cur_page);
+ dio->cur_page = NULL;
+ }
out:
return ret;
}
@@ -962,7 +967,16 @@ do_holes:
this_chunk_bytes = this_chunk_blocks << blkbits;
BUG_ON(this_chunk_bytes == 0);
- dio->boundary = buffer_boundary(map_bh);
+ /*
+ * get_block may return more than one page worth
+ * of blocks. Make sure only the last io we
+ * send down for this region is a boundary
+ */
+ if (dio->blocks_available == this_chunk_blocks)
+ dio->boundary = buffer_boundary(map_bh);
+ else
+ dio->boundary = 0;
+
ret = submit_page_section(dio, page, offset_in_page,
this_chunk_bytes, dio->next_block_for_io);
if (ret) {
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH 3 of 8] DIO: don't fall back to buffered writes
2007-02-07 0:32 [RFC PATCH 0 of 8] O_DIRECT locking rework Chris Mason
@ 2007-02-07 0:32 ` Chris Mason
0 siblings, 0 replies; 10+ messages in thread
From: Chris Mason @ 2007-02-07 0:32 UTC (permalink / raw)
To: linux-fsdevel; +Cc: akpm
Placeholder pages allow DIO to use locking rules similar to those of
writepage. DIO can now fill holes, and it can extend the file via
get_block().
i_mutex can be dropped during writes as long as the write stays inside
i_size.
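Condensed from the hunks below, the i_mutex handling now follows this
shape (a sketch only, pulling the two halves of the patch together):

	/* __blockdev_direct_IO(): a synchronous write that stays
	 * inside i_size can drop i_mutex for the duration of the IO */
	dio->reacquire_i_mutex = 0;
	if ((rw & WRITE) && dio_lock_type == DIO_LOCKING &&
	    end <= i_size_read(inode) && is_sync_kiocb(iocb)) {
		dio->reacquire_i_mutex = 1;
		mutex_unlock(&inode->i_mutex);
	}

	/* dio_complete(): take the lock back before returning, so
	 * the caller's locking rules still hold */
	if (dio->reacquire_i_mutex)
		mutex_lock(&dio->inode->i_mutex);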
Signed-off-by: Chris Mason <chris.mason@oracle.com>
diff -r 5cd028318654 -r 1a7105ab9c19 fs/direct-io.c
--- a/fs/direct-io.c Tue Feb 06 20:02:49 2007 -0500
+++ b/fs/direct-io.c Tue Feb 06 20:02:55 2007 -0500
@@ -70,6 +71,7 @@ struct dio {
int rw;
loff_t i_size; /* i_size when submitted */
int lock_type; /* doesn't change */
+ int reacquire_i_mutex; /* should we get i_mutex when done? */
unsigned blkbits; /* doesn't change */
unsigned blkfactor; /* When we're using an alignment which
is finer than the filesystem's soft
@@ -308,6 +310,9 @@ static int dio_complete(struct dio *dio,
dio->map_bh.b_private);
dio_unlock_page_range(dio);
+ if (dio->reacquire_i_mutex)
+ mutex_lock(&dio->inode->i_mutex);
+
if (ret == 0)
ret = dio->page_errors;
if (ret == 0)
@@ -593,13 +598,8 @@ static int get_more_blocks(struct dio *d
map_bh->b_size = fs_count << dio->inode->i_blkbits;
create = dio->rw & WRITE;
- if (dio->lock_type == DIO_LOCKING) {
- if (dio->block_in_file < (i_size_read(dio->inode) >>
- dio->blkbits))
- create = 0;
- } else if (dio->lock_type == DIO_NO_LOCKING) {
+ if (dio->lock_type == DIO_NO_LOCKING)
create = 0;
- }
index = fs_startblk >> (PAGE_CACHE_SHIFT -
dio->inode->i_blkbits);
end = (dio->final_block_in_request >> dio->blkfactor) >>
@@ -1289,6 +1289,13 @@ __blockdev_direct_IO(int rw, struct kioc
dio->is_async = !is_sync_kiocb(iocb) && !((rw & WRITE) &&
(end > i_size_read(inode)));
+ /* if our write is inside i_size, we can drop i_mutex */
+ dio->reacquire_i_mutex = 0;
+ if ((rw & WRITE) && dio_lock_type == DIO_LOCKING &&
+ end <= i_size_read(inode) && is_sync_kiocb(iocb)) {
+ dio->reacquire_i_mutex = 1;
+ mutex_unlock(&inode->i_mutex);
+ }
retval = direct_io_worker(rw, iocb, inode, iov, offset,
nr_segs, blkbits, get_block, end_io, dio);
out:
diff -r 5cd028318654 -r 1a7105ab9c19 mm/filemap.c
--- a/mm/filemap.c Tue Feb 06 20:02:49 2007 -0500
+++ b/mm/filemap.c Tue Feb 06 20:02:55 2007 -0500
@@ -2840,10 +2840,19 @@ generic_file_direct_IO(int rw, struct ki
retval = mapping->a_ops->direct_IO(rw, iocb, iov,
offset, nr_segs);
if (rw == WRITE && mapping->nrpages) {
+ int err;
pgoff_t end = (offset + write_len - 1)
>> PAGE_CACHE_SHIFT;
- int err = invalidate_inode_pages2_range(mapping,
- offset >> PAGE_CACHE_SHIFT, end);
+
+ /* O_DIRECT is allowed to drop i_mutex, so more data
+ * could have been dirtied by others. Start io one more
+ * time
+ */
+ err = filemap_fdatawrite_range(mapping, offset,
+ offset + write_len - 1);
+ if (!err)
+ err = invalidate_inode_pages2_range(mapping,
+ offset >> PAGE_CACHE_SHIFT, end);
if (err)
retval = err;
}
^ permalink raw reply [flat|nested] 10+ messages in thread