* [RFC PATCH 0/3] O_DIRECT locking rework
@ 2006-10-27 18:22 Chris Mason
  2006-10-27 18:23 ` [RFC PATCH 1/3] Introduce a place holder page for the pagecache Chris Mason
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Chris Mason @ 2006-10-27 18:22 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: akpm, zach.brown

Hello everyone,

A new spin of my DIO locking patches.  I've fixed a few bugs in patch 2,
and added a new patch to extend the file and fill holes without falling
back to buffered.

This was tested on top of Zach's DIO patches, but rediffed on top of
Linus' tree for sending (minor changes required).  I also tested on top
of 2.6.19-rc2-mm2, but my code for inserting/removing placeholders
stopped working right.  I think this is either a bug in my code or a bug
in Nick's RCU stuff; I'm still trying to track it down.

-chris



* [RFC PATCH 1/3] Introduce a place holder page for the pagecache.
  2006-10-27 18:22 [RFC PATCH 0/3] O_DIRECT locking rework Chris Mason
@ 2006-10-27 18:23 ` Chris Mason
  2006-11-01 10:23   ` Suparna Bhattacharya
  2006-10-27 18:25 ` [RFC PATCH 2/3] Change O_DIRECT to use placeholders Chris Mason
  2006-10-27 18:26 ` [RFC PATCH 3/3] DIO: don't fall back to buffered writes Chris Mason
  2 siblings, 1 reply; 11+ messages in thread
From: Chris Mason @ 2006-10-27 18:23 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: akpm, zach.brown

Introduce a place holder page for the pagecache.

mm/filemap.c is changed to wait on these before adding a page into the page
cache, and truncates are changed to wait for all of the place holder pages to
disappear.

Place holder pages can only be tested or looked at with the mapping lock
held, and only PagePlaceHolder(page) can be trusted.  They cannot be locked,
and cannot have references increased or decreased on them.
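
For illustration, the intended calling pattern looks something like this
(a sketch only -- the pages array, the placeholder stub and the index
range all belong to the caller, as in the DIO patches later in this
series):

	struct page fake;
	struct page *pages[8];
	int err;

	set_page_placeholder(&fake);
	err = find_or_insert_placeholders(mapping, pages, start, end,
					  ARRAY_SIZE(pages), GFP_KERNEL,
					  &fake, 1);
	if (err)
		return err;
	/* [start, end) is now locked down; do the io */
	remove_placeholder_pages(mapping, pages, &fake, start, end,
				 ARRAY_SIZE(pages));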

Signed-off-by: Chris Mason <chris.mason@oracle.com>

diff -r 18a9e9f5c707 include/linux/mm.h
--- a/include/linux/mm.h	Thu Oct 19 08:30:00 2006 +0700
+++ b/include/linux/mm.h	Tue Oct 24 15:10:48 2006 -0400
@@ -276,6 +276,7 @@ static inline void get_page(struct page 
 	if (unlikely(PageCompound(page)))
 		page = (struct page *)page_private(page);
 	VM_BUG_ON(atomic_read(&page->_count) == 0);
+	VM_BUG_ON(PagePlaceHolder(page));
 	atomic_inc(&page->_count);
 }
 
diff -r 18a9e9f5c707 include/linux/page-flags.h
--- a/include/linux/page-flags.h	Thu Oct 19 08:30:00 2006 +0700
+++ b/include/linux/page-flags.h	Tue Oct 24 15:10:48 2006 -0400
@@ -267,4 +266,6 @@ static inline void set_page_writeback(st
 	test_set_page_writeback(page);
 }
 
+void set_page_placeholder(struct page *page);
+
 #endif	/* PAGE_FLAGS_H */
diff -r 18a9e9f5c707 include/linux/pagemap.h
--- a/include/linux/pagemap.h	Thu Oct 19 08:30:00 2006 +0700
+++ b/include/linux/pagemap.h	Tue Oct 24 15:10:48 2006 -0400
@@ -72,6 +72,12 @@ extern struct page * find_get_page(struc
 				unsigned long index);
 extern struct page * find_lock_page(struct address_space *mapping,
 				unsigned long index);
+int find_or_insert_placeholders(struct address_space *mapping,
+                                  struct page **pages, unsigned long start,
+                                  unsigned long end, unsigned long nr,
+                                  gfp_t gfp_mask,
+                                  struct page *insert,
+                                  int wait);
 extern __deprecated_for_modules struct page * find_trylock_page(
 			struct address_space *mapping, unsigned long index);
 extern struct page * find_or_create_page(struct address_space *mapping,
@@ -82,6 +88,16 @@ unsigned find_get_pages_contig(struct ad
 			       unsigned int nr_pages, struct page **pages);
 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 			int tag, unsigned int nr_pages, struct page **pages);
+void remove_placeholder_pages(struct address_space *mapping,
+                             struct page **pages,
+                             struct page *placeholder,
+                             unsigned long offset,
+                             unsigned long end,
+                             unsigned long nr);
+void wake_up_placeholder_page(struct page *page);
+void wait_on_placeholder_pages_range(struct address_space *mapping, pgoff_t start,
+			       pgoff_t end);
+
 
 /*
  * Returns locked page at given index in given cache, creating it if needed.
diff -r 18a9e9f5c707 include/linux/radix-tree.h
--- a/include/linux/radix-tree.h	Thu Oct 19 08:30:00 2006 +0700
+++ b/include/linux/radix-tree.h	Tue Oct 24 15:10:48 2006 -0400
@@ -56,6 +56,7 @@ radix_tree_gang_lookup(struct radix_tree
 radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
 			unsigned long first_index, unsigned int max_items);
 int radix_tree_preload(gfp_t gfp_mask);
+int radix_tree_needs_preload(void);
 void radix_tree_init(void);
 void *radix_tree_tag_set(struct radix_tree_root *root,
 			unsigned long index, unsigned int tag);
diff -r 18a9e9f5c707 lib/radix-tree.c
--- a/lib/radix-tree.c	Thu Oct 19 08:30:00 2006 +0700
+++ b/lib/radix-tree.c	Tue Oct 24 15:10:48 2006 -0400
@@ -139,6 +139,19 @@ out:
 out:
 	return ret;
 }
+
+/*
+ * Must be called inside the preload pair.  This tells you if another
+ * run of radix_tree_preload is required, allowing the caller to decide
+ * if a repeated insert will succeed.
+ */
+int radix_tree_needs_preload(void)
+{
+	struct radix_tree_preload *rtp;
+	rtp = &__get_cpu_var(radix_tree_preloads);
+	return rtp->nr < ARRAY_SIZE(rtp->nodes);
+}
+EXPORT_SYMBOL_GPL(radix_tree_needs_preload);
 
 static inline void tag_set(struct radix_tree_node *node, unsigned int tag,
 		int offset)
diff -r 18a9e9f5c707 mm/filemap.c
--- a/mm/filemap.c	Thu Oct 19 08:30:00 2006 +0700
+++ b/mm/filemap.c	Tue Oct 24 15:10:48 2006 -0400
@@ -44,6 +44,25 @@ generic_file_direct_IO(int rw, struct ki
 generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
 	loff_t offset, unsigned long nr_segs);
 
+static wait_queue_head_t *page_waitqueue(struct page *page);
+static void wait_on_placeholder_page(struct address_space *mapping,
+				     struct page *page,
+				     int write_lock);
+
+static struct address_space placeholder_address_space;
+#define PagePlaceHolder(page) ((page)->mapping == &placeholder_address_space)
+
+/*
+ * Turns the page struct into a marker for a placeholder page.
+ * This should not be called with a real page, only a struct page stub.
+ */
+void set_page_placeholder(struct page *page)
+{
+	memset(page, 0, sizeof(*page));
+	page->mapping = &placeholder_address_space;
+}
+EXPORT_SYMBOL_GPL(set_page_placeholder);
+
 /*
  * Shared mappings implemented 30.11.1994. It's not fully working yet,
  * though.
@@ -437,12 +456,24 @@ int add_to_page_cache(struct page *page,
 int add_to_page_cache(struct page *page, struct address_space *mapping,
 		pgoff_t offset, gfp_t gfp_mask)
 {
-	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
+	int error;
+again:
+	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 
 	if (error == 0) {
 		write_lock_irq(&mapping->tree_lock);
 		error = radix_tree_insert(&mapping->page_tree, offset, page);
-		if (!error) {
+		if (error == -EEXIST && (gfp_mask & __GFP_WAIT)) {
+			struct page *tmp;
+			tmp = radix_tree_lookup(&mapping->page_tree, offset);
+			if (tmp && PagePlaceHolder(tmp)) {
+				radix_tree_preload_end();
+				/* drops the lock */
+				wait_on_placeholder_page(mapping, tmp, 1);
+				goto again;
+			}
+		}
+		if (!error && !PagePlaceHolder(page)) {
 			page_cache_get(page);
 			SetPageLocked(page);
 			page->mapping = mapping;
@@ -526,6 +557,79 @@ void fastcall wait_on_page_bit(struct pa
 }
 EXPORT_SYMBOL(wait_on_page_bit);
 
+/*
+ * Call with either a read lock or a write lock on the mapping tree.
+ * The lock is dropped just before waiting for the place holder
+ */
+static void wait_on_placeholder_page(struct address_space *mapping,
+				     struct page *page,
+				     int write_lock)
+{
+	DEFINE_WAIT(wait);
+	wait_queue_head_t *wqh = page_waitqueue(page);
+	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+	if (write_lock)
+		write_unlock_irq(&mapping->tree_lock);
+	else
+		read_unlock_irq(&mapping->tree_lock);
+	io_schedule();
+	finish_wait(wqh, &wait);
+}
+
+void wake_up_placeholder_page(struct page *page)
+{
+	__wake_up_bit(page_waitqueue(page), &page->flags, PG_locked);
+}
+EXPORT_SYMBOL_GPL(wake_up_placeholder_page);
+
+/**
+ * wait_on_placeholder_pages_range - gang placeholder page waiter
+ * @mapping:	The address_space to search
+ * @start:	The starting page index
+ * @end:	The max page index
+ *
+ * wait_on_placeholder_pages_range() will search for and wait on a range
+ * of pages in the mapping
+ *
+ * On return, the range has no placeholder pages sitting in it.
+ */
+void wait_on_placeholder_pages_range(struct address_space *mapping,
+			       pgoff_t start, pgoff_t end)
+{
+	unsigned int i;
+	unsigned int ret;
+	struct page *pages[8];
+	pgoff_t cur = start;
+	pgoff_t highest = start;
+
+	/*
+	 * we expect a very small number of place holder pages, so
+	 * this code isn't trying to be very fast.
+	 */
+again:
+	read_lock_irq(&mapping->tree_lock);
+	ret = radix_tree_gang_lookup(&mapping->page_tree,
+				(void **)pages, cur, ARRAY_SIZE(pages));
+	for (i = 0; i < ret; i++) {
+		if (PagePlaceHolder(pages[i])) {
+			/* drops the lock */
+			wait_on_placeholder_page(mapping, pages[i], 0);
+			goto again;
+		} else if (pages[i]->index > highest)
+			highest = pages[i]->index;
+	}
+	if (highest < end && ret == ARRAY_SIZE(pages)) {
+		cur = highest;
+		read_unlock_irq(&mapping->tree_lock);
+		cond_resched();
+		goto again;
+	}
+	read_unlock_irq(&mapping->tree_lock);
+}
+EXPORT_SYMBOL_GPL(wait_on_placeholder_pages_range);
+
 /**
  * unlock_page - unlock a locked page
  * @page: the page
@@ -542,6 +646,7 @@ EXPORT_SYMBOL(wait_on_page_bit);
  */
 void fastcall unlock_page(struct page *page)
 {
+	BUG_ON(PagePlaceHolder(page));
 	smp_mb__before_clear_bit();
 	if (!TestClearPageLocked(page))
 		BUG();
@@ -578,6 +683,7 @@ void fastcall __lock_page(struct page *p
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
 
+	BUG_ON(PagePlaceHolder(page));
 	__wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
 							TASK_UNINTERRUPTIBLE);
 }
@@ -590,6 +696,7 @@ void fastcall __lock_page_nosync(struct 
 void fastcall __lock_page_nosync(struct page *page)
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
+	BUG_ON(PagePlaceHolder(page));
 	__wait_on_bit_lock(page_waitqueue(page), &wait, __sleep_on_page_lock,
 							TASK_UNINTERRUPTIBLE);
 }
@@ -608,12 +715,262 @@ struct page * find_get_page(struct addre
 
 	read_lock_irq(&mapping->tree_lock);
 	page = radix_tree_lookup(&mapping->page_tree, offset);
-	if (page)
-		page_cache_get(page);
+	if (page) {
+		if (PagePlaceHolder(page))
+			page = NULL;
+		else
+			page_cache_get(page);
+	}
 	read_unlock_irq(&mapping->tree_lock);
 	return page;
 }
 EXPORT_SYMBOL(find_get_page);
+
+/**
+ * remove_placeholder_pages - remove a range of placeholder or locked pages
+ * @mapping: the page's address_space
+ * @pages: an array of page pointers to use for gang lookups
+ * @placeholder: the placeholder page previously inserted (for verification)
+ * @start: the search starting point
+ * @end: the search end point (offsets >= end are not touched)
+ * @nr: the size of the pages array.
+ *
+ * Any placeholder pages in the range specified are removed.  Any real
+ * pages are unlocked and released.
+ */
+void remove_placeholder_pages(struct address_space *mapping,
+			     struct page **pages,
+			     struct page *placeholder,
+			     unsigned long start,
+			     unsigned long end,
+			     unsigned long nr)
+{
+	struct page *page;
+	int ret;
+	int i;
+	unsigned long num;
+
+	write_lock_irq(&mapping->tree_lock);
+	while (start < end) {
+		num = min(nr, end-start);
+		ret = radix_tree_gang_lookup(&mapping->page_tree,
+						(void **)pages, start, num);
+		BUG_ON(ret != num);
+		for (i = 0; i < ret; i++) {
+			page = pages[i];
+			if (PagePlaceHolder(page)) {
+				BUG_ON(page != placeholder);
+				radix_tree_delete(&mapping->page_tree,
+						  start + i);
+			} else {
+				BUG_ON(page->index >= end);
+				unlock_page(page);
+				page_cache_release(page);
+			}
+		}
+		start += ret;
+		if (need_resched()) {
+			write_unlock_irq(&mapping->tree_lock);
+			cond_resched();
+			write_lock_irq(&mapping->tree_lock);
+		}
+	}
+	write_unlock_irq(&mapping->tree_lock);
+	if (placeholder)
+		wake_up_placeholder_page(placeholder);
+}
+EXPORT_SYMBOL_GPL(remove_placeholder_pages);
+
+/*
+ * a helper function to insert a placeholder into multiple slots
+ * in the radix tree.  This could probably use an optimized version
+ * in the radix code.  It may insert fewer than the requested number
+ * of placeholders if we need to reschedule or the radix tree needs to
+ * be preloaded again.
+ *
+ * Returns the number actually inserted, which is zero if even the
+ * first insert failed.
+ */
+static unsigned long insert_placeholders(struct address_space *mapping,
+					 struct page *insert,
+					 unsigned long start, unsigned long nr)
+{
+	int err;
+	unsigned long i;
+	for (i = 0; i < nr; i++) {
+		err = radix_tree_insert(&mapping->page_tree, start + i,
+					  insert);
+		if (err)
+			return i;
+		if (radix_tree_needs_preload() || (i && need_resched()))
+			return i + 1;
+	}
+	return i;
+}
+
+
+/**
+ * find_or_insert_placeholders - locate a group of pagecache pages or insert one
+ * @mapping: the page's address_space
+ * @pages: an array of page pointers to use for gang lookups
+ * @start: the search starting point
+ * @end: the search end point (offsets >= end are not touched)
+ * @nr: the size of the pages array.
+ * @gfp_mask: page allocation mode
+ * @insert: the page to insert if none is found
+ * @iowait: 1 if you want to wait for dirty or writeback pages.
+ *
+ * This locks down a range of offsets in the address space.  Any pages
+ * already present are locked and a placeholder page is inserted into
+ * the radix tree for any offsets without pages.
+ */
+int find_or_insert_placeholders(struct address_space *mapping,
+				  struct page **pages, unsigned long start,
+				  unsigned long end, unsigned long nr,
+				  gfp_t gfp_mask,
+				  struct page *insert,
+				  int iowait)
+{
+	int err = 0;
+	int i, ret;
+	unsigned long cur = start;
+	unsigned long ins;
+	struct page *page;
+	int restart;
+
+	/*
+	 * this gets complicated.  Placeholders and page locks need to
+	 * be taken in order.  We use gang lookup to cut down on the cpu
+	 * cost, but we need to keep track of holes in the results and
+	 * insert placeholders as appropriate.
+	 *
+	 * If a locked page or a placeholder is found, we wait for it and
+	 * pick up where we left off.  If a dirty or PG_writeback page is found
+	 * and iowait==1, we have to drop all of our locks, kick/wait for the
+	 * io and start over.
+	 */
+repeat:
+	if (cur != start)
+		cond_resched();
+	err = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
+	if (err)
+		goto fail;
+	write_lock_irq(&mapping->tree_lock);
+repeat_lock:
+	ret = radix_tree_gang_lookup(&mapping->page_tree,
+					(void **)pages, cur,
+					min(nr, end-cur));
+	for (i = 0; i < ret; i++) {
+		restart = 0;
+		page = pages[i];
+
+		if (PagePlaceHolder(page)) {
+			radix_tree_preload_end();
+			/* drops the lock */
+			wait_on_placeholder_page(mapping, page, 1);
+			goto repeat;
+		}
+
+		if (page->index > cur) {
+			unsigned long top = min(end, page->index);
+			ins = insert_placeholders(mapping, insert, cur,
+						  top - cur);
+			cur += ins;
+			if (radix_tree_needs_preload() || cur != top) {
+				write_unlock_irq(&mapping->tree_lock);
+				radix_tree_preload_end();
+				if (ins == 0) {
+					err = -1;
+					goto fail;
+				}
+				goto repeat;
+			}
+		}
+		if (page->index >= end) {
+			ret = 0;
+			break;
+		}
+		page_cache_get(page);
+		if (TestSetPageLocked(page)) {
+			unsigned long tmpoff = page->index;
+			write_unlock_irq(&mapping->tree_lock);
+			radix_tree_preload_end();
+			__lock_page(page);
+			/* Has the page been truncated while we slept? */
+			if (unlikely(page->mapping != mapping ||
+				     page->index != tmpoff)) {
+				unlock_page(page);
+				page_cache_release(page);
+				goto repeat;
+			} else {
+				/* we've locked the page, but we need to
+				 * check it for dirty/writeback
+				 */
+				restart = 1;
+			}
+		}
+		if (iowait && (PageDirty(page) || PageWriteback(page))) {
+			/*
+			 * mpage_writepages collects a number of pages
+			 * into a bio, marking them PG_writeback as it goes.
+			 * It also locks pages while deciding what to do
+			 * with them.  What we want to do is have a bunch
+			 * of pages locked, and then wait for any io to
+			 * finish.  But if mpage_writepages is waiting for
+			 * one of our locked pages, it may never start the
+			 * io on the bio it has created.
+			 *
+			 * so, we have to drop all of our locks and
+			 * start over
+			 */
+			unlock_page(page);
+			page_cache_release(page);
+			if (!restart) {
+				write_unlock_irq(&mapping->tree_lock);
+				radix_tree_preload_end();
+			}
+			remove_placeholder_pages(mapping, pages, insert,
+			                         start, cur, nr);
+			filemap_write_and_wait_range(mapping,
+					 (loff_t)start << PAGE_CACHE_SHIFT,
+					 ((loff_t)end << PAGE_CACHE_SHIFT) - 1);
+			cur = start;
+			goto repeat;
+		}
+		cur++;
+		if (restart)
+			goto repeat;
+		if (cur >= end)
+			break;
+	}
+
+	/* we haven't yet filled the range */
+	if (cur < end) {
+		/* if the search filled our array, there is more to do. */
+		if (ret && ret == nr)
+			goto repeat_lock;
+
+		/* otherwise insert placeholders for the remaining offsets */
+		ins = insert_placeholders(mapping, insert, cur, end - cur);
+		cur += ins;
+		if (cur != end) {
+			write_unlock_irq(&mapping->tree_lock);
+			radix_tree_preload_end();
+			if (ins == 0) {
+				err = -1;
+				goto fail;
+			}
+			goto repeat;
+		}
+	}
+	write_unlock_irq(&mapping->tree_lock);
+	radix_tree_preload_end();
+	return err;
+fail:
+	remove_placeholder_pages(mapping, pages, insert, start, cur, nr);
+	return err;
+}
+EXPORT_SYMBOL_GPL(find_or_insert_placeholders);
 
 /**
  * find_trylock_page - find and lock a page
@@ -628,7 +985,7 @@ struct page *find_trylock_page(struct ad
 
 	read_lock_irq(&mapping->tree_lock);
 	page = radix_tree_lookup(&mapping->page_tree, offset);
-	if (page && TestSetPageLocked(page))
+	if (page && (PagePlaceHolder(page) || TestSetPageLocked(page)))
 		page = NULL;
 	read_unlock_irq(&mapping->tree_lock);
 	return page;
@@ -654,6 +1011,12 @@ repeat:
 repeat:
 	page = radix_tree_lookup(&mapping->page_tree, offset);
 	if (page) {
+		if (PagePlaceHolder(page)) {
+			/* drops the lock */
+			wait_on_placeholder_page(mapping, page, 0);
+			read_lock_irq(&mapping->tree_lock);
+			goto repeat;
+		}
 		page_cache_get(page);
 		if (TestSetPageLocked(page)) {
 			read_unlock_irq(&mapping->tree_lock);
@@ -737,14 +1100,25 @@ unsigned find_get_pages(struct address_s
 unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
 			    unsigned int nr_pages, struct page **pages)
 {
-	unsigned int i;
+	unsigned int i = 0;
 	unsigned int ret;
 
 	read_lock_irq(&mapping->tree_lock);
 	ret = radix_tree_gang_lookup(&mapping->page_tree,
 				(void **)pages, start, nr_pages);
-	for (i = 0; i < ret; i++)
-		page_cache_get(pages[i]);
+	while (i < ret) {
+		if (PagePlaceHolder(pages[i])) {
+			/* we can't return a place holder, shift it away */
+			if (i + 1 < ret) {
+				memmove(&pages[i], &pages[i + 1],
+					(ret - i - 1) * sizeof(struct page *));
+			}
+			ret--;
+			continue;
+		} else
+			page_cache_get(pages[i]);
+		i++;
+	}
 	read_unlock_irq(&mapping->tree_lock);
 	return ret;
 }
@@ -771,6 +1145,8 @@ unsigned find_get_pages_contig(struct ad
 	ret = radix_tree_gang_lookup(&mapping->page_tree,
 				(void **)pages, index, nr_pages);
 	for (i = 0; i < ret; i++) {
+		if (PagePlaceHolder(pages[i]))
+			break;
 		if (pages[i]->mapping == NULL || pages[i]->index != index)
 			break;
 
@@ -795,14 +1171,25 @@ unsigned find_get_pages_tag(struct addre
 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 			int tag, unsigned int nr_pages, struct page **pages)
 {
-	unsigned int i;
+	unsigned int i = 0;
 	unsigned int ret;
 
 	read_lock_irq(&mapping->tree_lock);
 	ret = radix_tree_gang_lookup_tag(&mapping->page_tree,
 				(void **)pages, *index, nr_pages, tag);
-	for (i = 0; i < ret; i++)
-		page_cache_get(pages[i]);
+	while (i < ret) {
+		if (PagePlaceHolder(pages[i])) {
+			/* we can't return a place holder, shift it away */
+			if (i + 1 < ret) {
+				memmove(&pages[i], &pages[i + 1],
+					(ret - i - 1) * sizeof(struct page *));
+			}
+			ret--;
+			continue;
+		} else
+			page_cache_get(pages[i]);
+		i++;
+	}
 	if (ret)
 		*index = pages[ret - 1]->index + 1;
 	read_unlock_irq(&mapping->tree_lock);
diff -r 18a9e9f5c707 mm/truncate.c
--- a/mm/truncate.c	Thu Oct 19 08:30:00 2006 +0700
+++ b/mm/truncate.c	Tue Oct 24 15:10:48 2006 -0400
@@ -207,6 +207,7 @@ void truncate_inode_pages_range(struct a
 		}
 		pagevec_release(&pvec);
 	}
+	wait_on_placeholder_pages_range(mapping, start, end);
 }
 EXPORT_SYMBOL(truncate_inode_pages_range);
 


* [RFC PATCH 2/3] Change O_DIRECT to use placeholders
  2006-10-27 18:22 [RFC PATCH 0/3] O_DIRECT locking rework Chris Mason
  2006-10-27 18:23 ` [RFC PATCH 1/3] Introduce a place holder page for the pagecache Chris Mason
@ 2006-10-27 18:25 ` Chris Mason
  2006-10-27 18:26 ` [RFC PATCH 3/3] DIO: don't fall back to buffered writes Chris Mason
  2 siblings, 0 replies; 11+ messages in thread
From: Chris Mason @ 2006-10-27 18:25 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: akpm, zach.brown

Change O_DIRECT to use placeholders instead of i_mutex/i_alloc_sem locking
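
In outline (a condensed sketch of the flow implemented below, using the
same names as the patch): the placeholder window is created chunk by
chunk as block lookups proceed, and torn down when the dio completes.

	/* setup in __blockdev_direct_IO() */
	set_page_placeholder(&dio->fake);
	dio->fspages_start_off = offset >> PAGE_CACHE_SHIFT;
	dio->fspages_end_off = dio->fspages_start_off;

	/* get_more_blocks() extends the locked range ahead of get_block() */
	ret = lock_page_range(dio, dio->fspages_end_off, nr);
	if (!ret)
		dio->fspages_end_off += nr;

	/* dio_complete() drops the placeholders/page locks, waking waiters */
	unlock_page_range(dio, dio->fspages_start_off,
			  dio->fspages_end_off - dio->fspages_start_off);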

Signed-off-by: Chris Mason <chris.mason@oracle.com>

diff -r 6b363967df4b fs/direct-io.c
--- a/fs/direct-io.c	Fri Oct 27 13:54:01 2006 -0400
+++ b/fs/direct-io.c	Fri Oct 27 13:55:18 2006 -0400
@@ -35,6 +35,7 @@
 #include <linux/rwsem.h>
 #include <linux/uio.h>
 #include <asm/atomic.h>
+#include <linux/writeback.h>
 
 /*
  * How many user pages to map in one call to get_user_pages().  This determines
@@ -94,6 +95,14 @@ struct dio {
 	struct buffer_head map_bh;	/* last get_block() result */
 
 	/*
+	 * kernel page pinning
+	 */
+	struct page fake;
+	struct page *tmppages[DIO_PAGES];
+	unsigned long fspages_start_off;
+	unsigned long fspages_end_off;
+
+	/*
 	 * Deferred addition of a page to the dio.  These variables are
 	 * private to dio_send_cur_page(), submit_page_section() and
 	 * dio_bio_add_page().
@@ -190,6 +199,28 @@ out:
 	return ret;	
 }
 
+static void unlock_page_range(struct dio *dio, unsigned long start,
+			      unsigned long nr)
+{
+	remove_placeholder_pages(dio->inode->i_mapping, dio->tmppages,
+				 &dio->fake,
+				 start, start + nr,
+				 ARRAY_SIZE(dio->tmppages));
+}
+
+static int lock_page_range(struct dio *dio, unsigned long start,
+			   unsigned long nr)
+{
+	struct address_space *mapping = dio->inode->i_mapping;
+	struct page *fake = &dio->fake;
+	unsigned long end = start + nr;
+	return find_or_insert_placeholders(mapping, dio->tmppages, start, end,
+	                                  ARRAY_SIZE(dio->tmppages),
+					  GFP_KERNEL, fake,
+					  dio->rw == READ);
+}
+
+
 /*
  * Get another userspace page.  Returns an ERR_PTR on error.  Pages are
  * buffered inside the dio so that we can call get_user_pages() against a
@@ -219,9 +250,9 @@ static void dio_complete(struct dio *dio
 {
 	if (dio->end_io && dio->result)
 		dio->end_io(dio->iocb, offset, bytes, dio->map_bh.b_private);
-	if (dio->lock_type == DIO_LOCKING)
-		/* lockdep: non-owner release */
-		up_read_non_owner(&dio->inode->i_alloc_sem);
+	unlock_page_range(dio, dio->fspages_start_off,
+			  dio->fspages_end_off - dio->fspages_start_off);
+	dio->fspages_end_off = dio->fspages_start_off;
 }
 
 /*
@@ -517,6 +548,7 @@ static int get_more_blocks(struct dio *d
 	unsigned long fs_count;	/* Number of filesystem-sized blocks */
 	unsigned long dio_count;/* Number of dio_block-sized blocks */
 	unsigned long blkmask;
+	unsigned long index;
 	int create;
 
 	/*
@@ -544,7 +576,21 @@ static int get_more_blocks(struct dio *d
 		} else if (dio->lock_type == DIO_NO_LOCKING) {
 			create = 0;
 		}
-
+	        index = fs_startblk >> (PAGE_CACHE_SHIFT -
+		                        dio->inode->i_blkbits);
+		if (index >= dio->fspages_end_off) {
+			unsigned long end;
+			unsigned long nr;
+			end = (dio->final_block_in_request >>
+			       dio->blkfactor) >>
+			      (PAGE_CACHE_SHIFT - dio->inode->i_blkbits);
+			nr = min(end - index + 1, (unsigned long)DIO_PAGES);
+			ret = lock_page_range(dio, dio->fspages_end_off, nr);
+			if (ret)
+				goto error;
+			dio->fspages_end_off += nr;
+			BUG_ON(index >= dio->fspages_end_off);
+		}
 		/*
 		 * For writes inside i_size we forbid block creations: only
 		 * overwrites are permitted.  We fall back to buffered writes
@@ -554,6 +600,7 @@ static int get_more_blocks(struct dio *d
 		ret = (*dio->get_block)(dio->inode, fs_startblk,
 						map_bh, create);
 	}
+error:
 	return ret;
 }
 
@@ -943,9 +990,6 @@ out:
 	return ret;
 }
 
-/*
- * Releases both i_mutex and i_alloc_sem
- */
 static ssize_t
 direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, 
 	const struct iovec *iov, loff_t offset, unsigned long nr_segs, 
@@ -1074,14 +1118,6 @@ direct_io_worker(int rw, struct kiocb *i
 	 * In that case, we need to release all the pages we got hold on.
 	 */
 	dio_cleanup(dio);
-
-	/*
-	 * All block lookups have been performed. For READ requests
-	 * we can let i_mutex go now that its achieved its purpose
-	 * of protecting us from looking up uninitialized blocks.
-	 */
-	if ((rw == READ) && (dio->lock_type == DIO_LOCKING))
-		mutex_unlock(&dio->inode->i_mutex);
 
 	/*
 	 * OK, all BIOs are submitted, so we can decrement bio_count to truly
@@ -1165,8 +1201,6 @@ direct_io_worker(int rw, struct kiocb *i
  * DIO_LOCKING (simple locking for regular files)
  * For writes we are called under i_mutex and return with i_mutex held, even
  * though it is internally dropped.
- * For reads, i_mutex is not held on entry, but it is taken and dropped before
- * returning.
  *
  * DIO_OWN_LOCKING (filesystem provides synchronisation and handling of
  *	uninitialised data, allowing parallel direct readers and writers)
@@ -1191,8 +1225,6 @@ __blockdev_direct_IO(int rw, struct kioc
 	ssize_t retval = -EINVAL;
 	loff_t end = offset;
 	struct dio *dio;
-	int release_i_mutex = 0;
-	int acquire_i_mutex = 0;
 
 	if (rw & WRITE)
 		rw = WRITE_SYNC;
@@ -1221,51 +1253,24 @@ __blockdev_direct_IO(int rw, struct kioc
 				goto out;
 		}
 	}
-
 	dio = kmalloc(sizeof(*dio), GFP_KERNEL);
 	retval = -ENOMEM;
 	if (!dio)
 		goto out;
 
+	set_page_placeholder(&dio->fake);
+	dio->fspages_start_off = offset >> PAGE_CACHE_SHIFT;
+	dio->fspages_end_off = dio->fspages_start_off;
+
 	/*
 	 * For block device access DIO_NO_LOCKING is used,
 	 *	neither readers nor writers do any locking at all
 	 * For regular files using DIO_LOCKING,
-	 *	readers need to grab i_mutex and i_alloc_sem
-	 *	writers need to grab i_alloc_sem only (i_mutex is already held)
+	 *	no locks are taken here
 	 * For regular files using DIO_OWN_LOCKING,
 	 *	neither readers nor writers take any locks here
 	 */
 	dio->lock_type = dio_lock_type;
-	if (dio_lock_type != DIO_NO_LOCKING) {
-		/* watch out for a 0 len io from a tricksy fs */
-		if (rw == READ && end > offset) {
-			struct address_space *mapping;
-
-			mapping = iocb->ki_filp->f_mapping;
-			if (dio_lock_type != DIO_OWN_LOCKING) {
-				mutex_lock(&inode->i_mutex);
-				release_i_mutex = 1;
-			}
-
-			retval = filemap_write_and_wait_range(mapping, offset,
-							      end - 1);
-			if (retval) {
-				kfree(dio);
-				goto out;
-			}
-
-			if (dio_lock_type == DIO_OWN_LOCKING) {
-				mutex_unlock(&inode->i_mutex);
-				acquire_i_mutex = 1;
-			}
-		}
-
-		if (dio_lock_type == DIO_LOCKING)
-			/* lockdep: not the owner will release it */
-			down_read_non_owner(&inode->i_alloc_sem);
-	}
-
 	/*
 	 * For file extending writes updating i_size before data
 	 * writeouts complete can expose uninitialized blocks. So
@@ -1277,15 +1282,7 @@ __blockdev_direct_IO(int rw, struct kioc
 
 	retval = direct_io_worker(rw, iocb, inode, iov, offset,
 				nr_segs, blkbits, get_block, end_io, dio);
-
-	if (rw == READ && dio_lock_type == DIO_LOCKING)
-		release_i_mutex = 0;
-
 out:
-	if (release_i_mutex)
-		mutex_unlock(&inode->i_mutex);
-	else if (acquire_i_mutex)
-		mutex_lock(&inode->i_mutex);
 	return retval;
 }
 EXPORT_SYMBOL(__blockdev_direct_IO);


* [RFC PATCH 3/3] DIO: don't fall back to buffered writes
  2006-10-27 18:22 [RFC PATCH 0/3] O_DIRECT locking rework Chris Mason
  2006-10-27 18:23 ` [RFC PATCH 1/3] Introduce a place holder page for the pagecache Chris Mason
  2006-10-27 18:25 ` [RFC PATCH 2/3] Change O_DIRECT to use placeholders Chris Mason
@ 2006-10-27 18:26 ` Chris Mason
  2006-11-01 10:51   ` Suparna Bhattacharya
  2 siblings, 1 reply; 11+ messages in thread
From: Chris Mason @ 2006-10-27 18:26 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: akpm, zach.brown

DIO: don't fall back to buffered writes

Placeholder pages allow DIO to use locking rules similar to those of
writepage.  DIO can now fill holes, and it can extend the file via
expanding truncates.  This creates holes that are filled during the DIO
write.

i_mutex can be dropped during writes once DIO decides if the file needs to be
extended.

The expanding truncate may cause some pagecache pages to be dirtied.
The call to find_or_insert_placeholders is changed to wait on dirty
or writeback pages in the case of writes as well as reads.
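
In outline, a file-extending write now does the following before any io
is issued (a sketch of the __blockdev_direct_IO() change below, error
handling omitted):

	/* still under i_mutex */
	if (end > i_size_read(inode) && iocb->ki_filp) {
		struct iattr newattrs;
		newattrs.ia_size = end;	/* expanding truncate makes the hole */
		newattrs.ia_file = iocb->ki_filp;
		newattrs.ia_valid = ATTR_SIZE | ATTR_FILE;
		retval = notify_change(iocb->ki_filp->f_dentry, &newattrs);
	}
	if (is_sync_kiocb(iocb)) {
		dio->reacquire_i_mutex = 1;	/* dio_complete() retakes it */
		mutex_unlock(&inode->i_mutex);
	}
	/* the dio write then fills the hole under placeholder protection */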

Signed-off-by: Chris Mason <chris.mason@oracle.com>

diff -r 30dcee85429c fs/direct-io.c
--- a/fs/direct-io.c	Fri Oct 27 13:49:19 2006 -0400
+++ b/fs/direct-io.c	Fri Oct 27 13:50:48 2006 -0400
@@ -69,6 +69,7 @@ struct dio {
 	int rw;
 	loff_t i_size;			/* i_size when submitted */
 	int lock_type;			/* doesn't change */
+	int reacquire_i_mutex;		/* should we get i_mutex when done? */
 	unsigned blkbits;		/* doesn't change */
 	unsigned blkfactor;		/* When we're using an alignment which
 					   is finer than the filesystem's soft
@@ -216,8 +217,7 @@ static int lock_page_range(struct dio *d
 	unsigned long end = start + nr;
 	return find_or_insert_placeholders(mapping, dio->tmppages, start, end,
 	                                  ARRAY_SIZE(dio->tmppages),
-					  GFP_KERNEL, fake,
-					  dio->rw == READ);
+					  GFP_KERNEL, fake, 1);
 }
 
 
@@ -253,6 +253,8 @@ static void dio_complete(struct dio *dio
 	unlock_page_range(dio, dio->fspages_start_off,
 			  dio->fspages_end_off - dio->fspages_start_off);
 	dio->fspages_end_off = dio->fspages_start_off;
+	if (dio->reacquire_i_mutex)
+		mutex_lock(&dio->inode->i_mutex);
 }
 
 /*
@@ -569,13 +571,8 @@ static int get_more_blocks(struct dio *d
 		map_bh->b_size = fs_count << dio->inode->i_blkbits;
 
 		create = dio->rw & WRITE;
-		if (dio->lock_type == DIO_LOCKING) {
-			if (dio->block_in_file < (i_size_read(dio->inode) >>
-							dio->blkbits))
-				create = 0;
-		} else if (dio->lock_type == DIO_NO_LOCKING) {
+		if (dio->lock_type == DIO_NO_LOCKING)
 			create = 0;
-		}
 	        index = fs_startblk >> (PAGE_CACHE_SHIFT -
 		                        dio->inode->i_blkbits);
 		if (index >= dio->fspages_end_off) {
@@ -1283,6 +1280,33 @@ __blockdev_direct_IO(int rw, struct kioc
 	dio->is_async = !is_sync_kiocb(iocb) && !((rw & WRITE) &&
 		(end > i_size_read(inode)));
 
+	/*
+	 * extend the file if needed, and drop i_mutex for non-aio writes.
+	 * We extend the file by creating a hole that is later filled in
+	 * during the O_DIRECT.  Because pages are locked or placeholders
+	 * are inserted, the locking rules end up being the same as
+	 * mmap'd writes using writepage to fill holes
+	 */
+	dio->reacquire_i_mutex = 0;
+	if (dio_lock_type == DIO_LOCKING && (rw & WRITE)) {
+		/* if our write goes past i_size, do an expanding
+		 * truncate to fill it before dropping i_mutex
+		 */
+		if (end > i_size_read(inode) && iocb->ki_filp) {
+			struct iattr newattrs;
+			newattrs.ia_size = end;
+			newattrs.ia_file = iocb->ki_filp;
+			newattrs.ia_valid = ATTR_SIZE | ATTR_FILE;
+			retval = notify_change(iocb->ki_filp->f_dentry,
+					       &newattrs);
+			if (retval)
+				goto out;
+		}
+		if (is_sync_kiocb(iocb)) {
+			dio->reacquire_i_mutex = 1;
+			mutex_unlock(&inode->i_mutex);
+		}
+	}
 	retval = direct_io_worker(rw, iocb, inode, iov, offset,
 				nr_segs, blkbits, get_block, end_io, dio);
 out:


* Re: [RFC PATCH 1/3] Introduce a place holder page for the pagecache.
  2006-10-27 18:23 ` [RFC PATCH 1/3] Introduce a place holder page for the pagecache Chris Mason
@ 2006-11-01 10:23   ` Suparna Bhattacharya
  2006-11-01 12:41     ` Chris Mason
  0 siblings, 1 reply; 11+ messages in thread
From: Suparna Bhattacharya @ 2006-11-01 10:23 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-fsdevel, akpm, zach.brown


Really appreciate your biting the bullet on this one. I have a few
minor comments/questions as I'm trying to compare this to what we
have discussed in the past and the approach from the last attempt, outlined in
http://www.kernel.org/pub/linux/kernel/people/suparna/DIO-simplify.txt
back in Feb this year.

I recall that the first time you experimented with dummy struct pages
(placeholder pages as they are now called), you ended up putting it on
hold due to regressions in performance which we attributed to radix
tree insertion overheads. I think you mention in an earlier note that
you've managed to optimize some of that -- I'm just wondering what helped
there - radix_tree_preload or something more than that?
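
For reference, the pattern the new radix_tree_needs_preload() helper is
aimed at looks roughly like this (a sketch only; more_to_insert stands
in for the caller's own loop condition):

	err = radix_tree_preload(GFP_KERNEL);
	if (err)
		return err;
	write_lock_irq(&mapping->tree_lock);
	while (more_to_insert) {
		radix_tree_insert(&mapping->page_tree, idx++, item);
		if (radix_tree_needs_preload())
			break;	/* must preload again before inserting more */
	}
	write_unlock_irq(&mapping->tree_lock);
	radix_tree_preload_end();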

This is sort of why we had discussed the alternative of tagged locking
in the radix tree.
	  Lookup and lock pages in the range by tagging the radix tree
	  slots as "TAG_LOCKED", and locking pages that were present.
	  Modify find_get_page et al to return a dummy locked cache page
	  if TAG_LOCKED. (Alternatively put the check in add_to_page_cache,
	  instead)
Putting the check in add_to_page_cache is more consistent with what you
do now.
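
Roughly the shape I have in mind, purely as a sketch -- TAG_LOCKED does
not exist today, radix_tree_tag_set() currently requires a populated
slot, and the page locking would have to happen outside tree_lock:

	/* hypothetical: tag empty slots, lock any real pages */
	for (idx = start; idx < end; idx++) {
		page = radix_tree_lookup(&mapping->page_tree, idx);
		if (page)
			lock_page(page);
		else
			radix_tree_tag_set(&mapping->page_tree, idx,
					   TAG_LOCKED);
	}
	/*
	 * find_get_page()/add_to_page_cache() would then treat a
	 * TAG_LOCKED slot much like your placeholder
	 */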

Where I think we were stuck at that point was that even this helps only up
to a point in saving on leaf nodes, but did not prevent intermediate levels
of nodes from being created.  With Nick's changes to the radix tree, I have
been thinking that some kind of sliding height scheme could become possible.

Anyway, the other reason for bringing this up is that I am wondering if
some of the complexity in find_or_insert_placeholders and a few other
places could be reduced with a tagged locking approach where we do not
actually stuff placeholder pages explicitly in the page cache and hence
possibly can avoid some of the checks you have below. I need to take a closer
look to be completely sure, but just a thought ...

I guess the important thing at this point is to get the interfaces right
in a way that allows for future optimization.

Regards
Suparna

On Fri, Oct 27, 2006 at 02:23:58PM -0400, Chris Mason wrote:
> [ full text of patch 1/3 quoted here in the original; trimmed -- see the
>   posting above ]

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India



* Re: [RFC PATCH 3/3] DIO: don't fall back to buffered writes
  2006-10-27 18:26 ` [RFC PATCH 3/3] DIO: don't fall back to buffered writes Chris Mason
@ 2006-11-01 10:51   ` Suparna Bhattacharya
  2006-11-01 12:47     ` Chris Mason
  0 siblings, 1 reply; 11+ messages in thread
From: Suparna Bhattacharya @ 2006-11-01 10:51 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-fsdevel, akpm, zach.brown

On Fri, Oct 27, 2006 at 02:26:12PM -0400, Chris Mason wrote:
> DIO: don't fall back to buffered writes
> 
> Placeholder pages allow DIO to use locking rules similar to those of
> writepage.  DIO can now fill holes, and it can extend the file via
> expanding truncates.  This creates holes that are filled during the DIO
> write.
> 
> i_mutex can be dropped during writes once DIO decides if the file needs to be
> extended.
> 
> The expanding truncate may cause some pagecache pages to be dirtied.
> The call to find_or_insert_placeholders is changed to wait on dirty
> or writeback pages in the case of writes as well as reads.
> 
> Signed-off-by: Chris Mason <chris.mason@oracle.com>
> 
> diff -r 30dcee85429c fs/direct-io.c
> --- a/fs/direct-io.c	Fri Oct 27 13:49:19 2006 -0400
> +++ b/fs/direct-io.c	Fri Oct 27 13:50:48 2006 -0400
> @@ -69,6 +69,7 @@ struct dio {
>  	int rw;
>  	loff_t i_size;			/* i_size when submitted */
>  	int lock_type;			/* doesn't change */
> +	int reacquire_i_mutex;		/* should we get i_mutex when done? */
>  	unsigned blkbits;		/* doesn't change */
>  	unsigned blkfactor;		/* When we're using an alignment which
>  					   is finer than the filesystem's soft
> @@ -216,8 +217,7 @@ static int lock_page_range(struct dio *d
>  	unsigned long end = start + nr;
>  	return find_or_insert_placeholders(mapping, dio->tmppages, start, end,
>  	                                  ARRAY_SIZE(dio->tmppages),
> -					  GFP_KERNEL, fake,
> -					  dio->rw == READ);
> +					  GFP_KERNEL, fake, 1);
>  }
>  
>  
> @@ -253,6 +253,8 @@ static void dio_complete(struct dio *dio
>  	unlock_page_range(dio, dio->fspages_start_off,
>  			  dio->fspages_end_off - dio->fspages_start_off);
>  	dio->fspages_end_off = dio->fspages_start_off;
> +	if (dio->reacquire_i_mutex)
> +		mutex_lock(&dio->inode->i_mutex);
>  }
>  
>  /*
> @@ -569,13 +571,8 @@ static int get_more_blocks(struct dio *d
>  		map_bh->b_size = fs_count << dio->inode->i_blkbits;
>  
>  		create = dio->rw & WRITE;
> -		if (dio->lock_type == DIO_LOCKING) {
> -			if (dio->block_in_file < (i_size_read(dio->inode) >>
> -							dio->blkbits))
> -				create = 0;
> -		} else if (dio->lock_type == DIO_NO_LOCKING) {
> +		if (dio->lock_type == DIO_NO_LOCKING)
>  			create = 0;
> -		}
>  	        index = fs_startblk >> (PAGE_CACHE_SHIFT -
>  		                        dio->inode->i_blkbits);
>  		if (index >= dio->fspages_end_off) {
> @@ -1283,6 +1280,33 @@ __blockdev_direct_IO(int rw, struct kioc
>  	dio->is_async = !is_sync_kiocb(iocb) && !((rw & WRITE) &&
>  		(end > i_size_read(inode)));
>  
> +	/*
> +	 * extend the file if needed, and drop i_mutex for non-aio writes.
> +	 * We extend the file by creating a hole that is later filled in
> +	 * during the O_DIRECT.  Because pages are locked or placeholders
> +	 * are inserted, the locking rules end up being the same as
> +	 * mmap'd writes using writepage to fill holes

I like the idea of extending the size in the beginning to avoid having
to hold i_sem for the extended writes case. The only question is whether
there is a semantics issue here - e.g. if there isn't enough space, would the
app which did an append still see the size change even though the blocks
themselves are not populated ? Of course we already have this happen in the
-EIO case for AIO-DIO writes, so I am not sure if this is an issue.

> +	 */
> +	dio->reacquire_i_mutex = 0;
> +	if (dio_lock_type == DIO_LOCKING && (rw & WRITE)) {

Hmm, we really do want to be able to avoid all these various locking
mode checks inside the DIO code to simplify things. I thought you
had mostly managed to do away with these in your previous patch. 
Unfortunately this change, while it should help concurrency, brings
that check back in. One of the plans I had earlier was to pull this
up to a higher level - i.e. into the locking version of blockdev_direct_IO() -
thus doing away with the flag and related confusion. Do you think
that would make sense ?


> +		/* if our write goes past i_size, do an expanding
> +		 * truncate to fill it before dropping i_mutex
> +		 */
> +		if (end > i_size_read(inode) && iocb->ki_filp) {
> +			struct iattr newattrs;
> +			newattrs.ia_size = end;
> +			newattrs.ia_file = iocb->ki_filp;
> +			newattrs.ia_valid = ATTR_SIZE | ATTR_FILE;
> +			retval = notify_change(iocb->ki_filp->f_dentry,
> +					       &newattrs);
> +			if (retval)
> +				goto out;
> +		}
> +		if (is_sync_kiocb(iocb)) {
> +			dio->reacquire_i_mutex = 1;
> +			mutex_unlock(&inode->i_mutex);

BTW, don't the page locks cover the AIO case as well ? I'm wondering
why we need this special case here.

Regards
Suparna

> +		}
> +	}
>  	retval = direct_io_worker(rw, iocb, inode, iov, offset,
>  				nr_segs, blkbits, get_block, end_io, dio);
>  out:

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 1/3] Introduce a place holder page for the pagecache.
  2006-11-01 10:23   ` Suparna Bhattacharya
@ 2006-11-01 12:41     ` Chris Mason
  2006-11-03  7:12       ` Suparna Bhattacharya
  0 siblings, 1 reply; 11+ messages in thread
From: Chris Mason @ 2006-11-01 12:41 UTC (permalink / raw)
  To: Suparna Bhattacharya; +Cc: linux-fsdevel, akpm, zach.brown

On Wed, Nov 01, 2006 at 03:53:50PM +0530, Suparna Bhattacharya wrote:
> 
> Really appreciate your biting the bullet on this one. I have a few
> minor comments/questions as I'm trying to compare this to what we
> have discussed in the past and the last attempt approach outlined in
> http://www.kernel.org/pub/linux/kernel/people/suparna/DIO-simplify.txt
> back in Feb this year.

The main difference from DIO-simplify.txt is that I'm making truncate
wait on the placeholder pages instead of using i_alloc_sem.  I haven't
coded the TAG_LOCKED part, but more on that below.  As you
noticed, I'm using expanding truncates to update i_size for extending
the file, which has pluses and minuses.
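
To make the truncate side concrete, the wait loop is roughly the shape
below (an untested sketch, not lifted verbatim from the patch; the real
helpers batch with gang lookups instead of probing one index at a time):

	/*
	 * Sketch only: truncate waits out in-flight O_DIRECT by parking
	 * on placeholder pages instead of taking i_alloc_sem.  Relies on
	 * wait_on_placeholder_page() dropping tree_lock and sleeping
	 * until the DIO removes the placeholder.
	 */
	void wait_on_placeholder_pages_range(struct address_space *mapping,
					     pgoff_t start, pgoff_t end)
	{
		pgoff_t index = start;
		struct page *page;

		read_lock_irq(&mapping->tree_lock);
		while (index <= end) {
			page = radix_tree_lookup(&mapping->page_tree, index);
			if (page && PagePlaceHolder(page)) {
				/* drops the lock, sleeps, takes no ref */
				wait_on_placeholder_page(mapping, page, 0);
				read_lock_irq(&mapping->tree_lock);
				continue;	/* recheck the same index */
			}
			index++;
		}
		read_unlock_irq(&mapping->tree_lock);
	}

The only property truncate needs is that it cannot return while any
placeholder is still parked in the range.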

> 
> I recall that the first time you experimented with dummy struct pages
> (placeholder pages as they are now called), you ended up putting it on
> hold due to regressions in performance which we attributed to radix
> tree insertion overheads. I think you mention in an earlier note that
> you've managed to optimize some of that -- I'm just wondering what helped
> there - radix_tree_preload or something more than that ?

My original code was fairly dumb and didn't use gang lookups.  It also
dropped the radix tree lock after every insert/delete.  The code posted
here is about 2x as fast (or 1/2 the regression, depending on how you
look at it).  For smaller read IOs, the performance hit is about zero.
The cost of radix tree insertion is offset by less mutex and semaphore
locking.
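
The batching difference is roughly this (illustrative only -- the posted
patch works through dio->tmppages and has real error handling, and if I
remember right the page_tree is initialized with a GFP_ATOMIC mask,
which is what lets nodes be allocated under the lock):

	/*
	 * Sketch: insert a run of placeholders under a single
	 * tree_lock hold instead of relocking per index.  On an
	 * allocation or collision failure, return how far we got and
	 * let the caller unwind or retry.
	 */
	static unsigned long insert_placeholder_run(struct address_space *mapping,
						    struct page *placeholder,
						    pgoff_t start,
						    unsigned long nr)
	{
		unsigned long i;

		write_lock_irq(&mapping->tree_lock);
		for (i = 0; i < nr; i++) {
			if (radix_tree_insert(&mapping->page_tree,
					      start + i, placeholder))
				break;
		}
		write_unlock_irq(&mapping->tree_lock);
		return i;
	}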

> 
> This is sort of why we had discussed the alternative of tagged locking
> in the radix tree.
> 	  Lookup and lock pages in the range by tagging the radix tree
> 	  slots as "TAG_LOCKED", and locking pages that were present.
> 	  Modify find_get_page et al to return a dummy locked cache page
> 	  if TAG_LOCKED. (Alternatively put the check in add_to_page_cache,
> 	  instead)
> Putting the check in add_to_page_cache is more consistent with what you
> do now.
> 
> Where I think we were stuck at that point was that even this helps only up to
> a point in saving on leaf nodes, but did not prevent intermediate levels of
> nodes from being created. With Nick's changes to the radix tree ... I have
> been thinking that some kind of sliding height scheme could become possible.
> 
> Anyway, the other reason for bringing this up is because I am wondering if
> some of the complexity in find_and_insert_placeholder_pages and a few other
> places could be reduced with a tagged locking approach where we do not
> actually stuff placeholder pages explicitly in the page cache and hence
> possibly can avoid some of the checks you have below. I need to take a closer
> look to be completely sure, but just a thought ...
> 
> I guess the important thing at this point is to get the interfaces right
> in a way that allows for future optimization.

Yes, my goal was to get the code fast enough for most machines, so that
we could test the result and see what impact it would have on the rest
of the page cache.  From here we can decide if additional radix tuning
is a good idea, or if we just want to stick with the mutex/semaphore
combo in use now.

I suspect that we won't be able to tune the radix tree enough to make it
suitable for the huge SGI 512 cpu test cases.  So we might want to stop
tuning now (letting XFS keep their current code) or switch away from
radix to something that can lock down an extent.
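
By "lock down an extent" I mean something like this completely
hypothetical sketch (nothing like it exists in the tree, and the names
are made up): a per-mapping rbtree of locked ranges, where a new locker
sleeps on whichever existing lock overlaps it, so one tree node can
cover an arbitrarily large IO.

	/* hypothetical extent-range lock */
	struct extent_lock {
		struct rb_node rb;
		pgoff_t start;			/* inclusive */
		pgoff_t end;			/* inclusive */
		wait_queue_head_t wait;		/* sleepers wanting overlap */
	};

	/* stored ranges never overlap, so an interval probe is simple */
	static struct extent_lock *extent_overlap(struct rb_root *root,
						  pgoff_t start, pgoff_t end)
	{
		struct rb_node *n = root->rb_node;

		while (n) {
			struct extent_lock *el = rb_entry(n, struct extent_lock, rb);

			if (end < el->start)
				n = n->rb_left;
			else if (start > el->end)
				n = n->rb_right;
			else
				return el;	/* intersects, wait on el->wait */
		}
		return NULL;
	}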

Radix tree gang lookups, insertions and deletions all have room for
optimization though (even without the LOCKED tag).  A new tag in the
radix tree isn't free; Andrew might prefer not to increase the space
usage across the board just for O_DIRECT.

-chris


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 3/3] DIO: don't fall back to buffered writes
  2006-11-01 10:51   ` Suparna Bhattacharya
@ 2006-11-01 12:47     ` Chris Mason
  2006-11-03  7:25       ` Suparna Bhattacharya
  0 siblings, 1 reply; 11+ messages in thread
From: Chris Mason @ 2006-11-01 12:47 UTC (permalink / raw)
  To: Suparna Bhattacharya; +Cc: linux-fsdevel, akpm, zach.brown

On Wed, Nov 01, 2006 at 04:21:37PM +0530, Suparna Bhattacharya wrote:
> > +	/*
> > +	 * extend the file if needed, and drop i_mutex for non-aio writes.
> > +	 * We extend the file by creating a hole that is later filled in
> > +	 * during the O_DIRECT.  Because pages are locked or placeholders
> > +	 * are inserted, the locking rules end up being the same as
> > +	 * mmap'd writes using writepage to fill holes
> 
> I like the idea of extending the size in the beginning to avoid having
> to hold i_sem for the extended writes case. The only question is whether
> there is a semantics issue here - e.g. if there isn't enough space, would the
> app which did an append still see the size change even though the blocks
> themselves are not populated ? Of course we already have this happen in the
> -EIO case for AIO-DIO writes, so I am not sure if this is an issue.

You're right, there is a semantic issue: I'm being a little loose with
the traditional write() expectations.  The current code creates a hole
and does not remove it in the face of errors.

> 
> > +	 */
> > +	dio->reacquire_i_mutex = 0;
> > +	if (dio_lock_type == DIO_LOCKING && (rw & WRITE)) {
> 
> Hmm, we really do want to be able to avoid all these various locking
> mode checks inside the DIO code to simplify things. I thought you
> had mostly managed to do away with these in your previous patch. 
> Unfortunately this change, while it should help concurrency, brings
> that check back in. One of the plans I had earlier was to pull this
> up to a higher level - i.e. into the locking version of blockdev_direct_IO() -
> thus doing away with the flag and related confusion. Do you think
> that would make sense ?

I went the other direction, making the flags a little more formal (patch
will go out today).  I don't have strong feelings about this, but it
seems easiest to me to keep the flags inside blockdev_direct_IO.
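
For context, the fs.h wrappers are already thin shims along these lines
(paraphrased from memory here, not a diff), so the flag never escapes
header-level plumbing:

	/* roughly the existing wrapper; the flag stays private to it */
	static inline ssize_t blockdev_direct_IO(int rw, struct kiocb *iocb,
		struct inode *inode, struct block_device *bdev,
		const struct iovec *iov, loff_t offset,
		unsigned long nr_segs, get_block_t get_block,
		dio_iodone_t end_io)
	{
		return __blockdev_direct_IO(rw, iocb, inode, bdev, iov,
					offset, nr_segs, get_block, end_io,
					DIO_LOCKING);
	}

Pulling the extend-and-unlock dance up into that wrapper would work
too; it just spreads the i_mutex rules across two layers.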

> 
> 
> > +		/* if our write goes past i_size, do an expanding
> > +		 * truncate to fill it before dropping i_mutex
> > +		 */
> > +		if (end > i_size_read(inode) && iocb->ki_filp) {
> > +			struct iattr newattrs;
> > +			newattrs.ia_size = end;
> > +			newattrs.ia_file = iocb->ki_filp;
> > +			newattrs.ia_valid = ATTR_SIZE | ATTR_FILE;
> > +			retval = notify_change(iocb->ki_filp->f_dentry,
> > +					       &newattrs);
> > +			if (retval)
> > +				goto out;
> > +		}
> > +		if (is_sync_kiocb(iocb)) {
> > +			dio->reacquire_i_mutex = 1;
> > +			mutex_unlock(&inode->i_mutex);
> 
> BTW, don't the page locks cover the AIO case as well ? I'm wondering
> why we need this special case here.

Only because i_mutex is already dropped implicitly by the aio code when
we return.  Ideally we could drop it sooner inside fs/direct-io.c, but
I'll leave that tuning for a later patch.

-chris


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 1/3] Introduce a place holder page for the pagecache.
  2006-11-01 12:41     ` Chris Mason
@ 2006-11-03  7:12       ` Suparna Bhattacharya
  2006-11-03 14:10         ` Chris Mason
  0 siblings, 1 reply; 11+ messages in thread
From: Suparna Bhattacharya @ 2006-11-03  7:12 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-fsdevel, akpm, zach.brown

On Wed, Nov 01, 2006 at 07:41:53AM -0500, Chris Mason wrote:
> On Wed, Nov 01, 2006 at 03:53:50PM +0530, Suparna Bhattacharya wrote:
> > 
> > Really appreciate your biting the bullet on this one. I have a few
> > minor comments/questions as I'm trying to compare this to what we
> > have discussed in the past and the last attempt approach outlined in
> > http://www.kernel.org/pub/linux/kernel/people/suparna/DIO-simplify.txt
> > back in Feb this year.
> 
> The main difference from DIO-simplify.txt is that I'm making truncate
> wait on the placeholder pages instead of using i_alloc_sem.  I haven't

Ah OK, that is true.

> coded the TAG_LOCKED part, but more on that below.  As you
> noticed, I'm using expanding truncates to update i_size for extending
> the file, which has pluses and minuses.
> 
> > 
> > I recall that the first time you experimented with dummy struct pages
> > (placeholder pages as they are now called), you ended up putting it on
> > hold due to regressions in performance which we attributed to radix
> > tree insertion overheads. I think you mention in an earlier note that
> > you've managed to optimize some of that -- I'm just wondering what helped
> > there - radix_tree_preload or something more than that ?
> 
> My original code was fairly dumb and didn't use gang lookups.  It also
> dropped the radix tree lock after every insert/delete.  The code posted
> here is about 2x as fast (or 1/2 the regression, depending on how you
> look at it).  For smaller read IOs, the performance hit is about zero.
> The cost of radix tree insertion is offset by less mutex and semaphore
> locking.

OK, so you think the main overheads had to do with radix-tree locking,
rather than with the additional node insertions ?

> 
> > 
> > This is sort of why we had discussed the alternative of tagged locking
> > in the radix tree.
> > 	  Lookup and lock pages in the range by tagging the radix tree
> > 	  slots as "TAG_LOCKED", and locking pages that were present.
> > 	  Modify find_get_page et al to return a dummy locked cache page
> > 	  if TAG_LOCKED. (Alternatively put the check in add_to_page_cache,
> > 	  instead)
> > Putting the check in add_to_page_cache is more consistent with what you
> > do now.
> > 
> > Where I think we were stuck at that point was that even this helps only up to
> > a point in saving on leaf nodes, but did not prevent intermediate levels of
> > nodes from being created. With Nick's changes to the radix tree ... I have
> > been thinking that some kind of sliding height scheme could become possible.
> > 
> > Anyway, the other reason for bringing this up is because I am wondering if
> > some of the complexity in find_and_insert_placeholder_pages and a few other
> > places could be reduced with a tagged locking approach where we do not
> > actually stuff placeholder pages explicitly in the page cache and hence
> > possibly can avoid some of the checks you have below. I need to take a closer
> > look to be completely sure, but just a thought ...
> > 
> > I guess the important thing at this point is to get the interfaces right
> > in a way that allows for future optimization.
> 
> Yes, my goal was to get the code fast enough for most machines, so that
> we could test the result and see what impact it would have on the rest
> of the page cache.  From here we can decide if additional radix tuning
> is a good idea, or if we just want to stick with the mutex/semaphore
> combo in use now.

Yup, that makes sense.

The complexity in find_and_insert_placeholder_pages does worry me though;
I wouldn't want to end up debugging tricky problems in that code after all
this. Do you have any ideas on that ?

> 
> I suspect that we won't be able to tune the radix tree enough to make it
> suitable for the huge SGI 512 cpu test cases.  So we might want to stop
> tuning now (letting XFS keep their current code) or switch away from
> radix to something that can lock down an extent.
> 
> Radix tree gang lookups, insertions and deletions all have room for
> optimization though (even without the LOCKED tag).  A new tag in the
> radix tree isn't free; Andrew might prefer not to increase the space
> usage across the board just for O_DIRECT.

Yes, that was a consideration on my mind as well. There are costs both
ways.

Regards
Suparna
> 
> -chris

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 3/3] DIO: don't fall back to buffered writes
  2006-11-01 12:47     ` Chris Mason
@ 2006-11-03  7:25       ` Suparna Bhattacharya
  0 siblings, 0 replies; 11+ messages in thread
From: Suparna Bhattacharya @ 2006-11-03  7:25 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-fsdevel, akpm, zach.brown

On Wed, Nov 01, 2006 at 07:47:19AM -0500, Chris Mason wrote:
> On Wed, Nov 01, 2006 at 04:21:37PM +0530, Suparna Bhattacharya wrote:
> > > +	/*
> > > +	 * extend the file if needed, and drop i_mutex for non-aio writes.
> > > +	 * We extend the file by creating a hole that is later filled in
> > > +	 * during the O_DIRECT.  Because pages are locked or placeholders
> > > +	 * are inserted, the locking rules end up being the same as
> > > +	 * mmap'd writes using writepage to fill holes
> > 
> > I like the idea of extending the size in the beginning to avoid having
> > to hold i_sem for the extended writes case. The only question is whether
> > there is a semantics issue here - e.g. if there isn't enough space, would the
> > app which did an append still see the size change even though the blocks
> > themselves are not populated ? Of course we already have this happen in the
> > -EIO case for AIO-DIO writes, so I am not sure if this is an issue.
> 
> You're right, there is a semantic issue: I'm being a little loose with
> the traditional write() expectations.  The current code creates a hole
> and does not remove it in the face of errors.
> 
> > 
> > > +	 */
> > > +	dio->reacquire_i_mutex = 0;
> > > +	if (dio_lock_type == DIO_LOCKING && (rw & WRITE)) {
> > 
> > Hmm, we really do want to be able to avoid all these various locking
> > mode checks inside the DIO code to simplify things. I thought you
> > had mostly managed to do away with these in your previous patch. 
> > Unfortunately this change, while it should help concurrency, brings
> > that check back in. One of the plans I had earlier was to pull this
> > up to a higher level - i.e. into the locking version of blockdev_direct_IO() -
> > thus doing away with the flag and related confusion. Do you think
> > that would make sense ?
> 
> I went the other direction, making the flags a little more formal (patch
> will go out today).  I don't have strong feelings about this, but it
> seems easiest to me to keep the flags inside blockdev_direct_IO.

Do we have some kind of description of the conditions under which each
flag would be used / who would use it (not just what they do)? For
example, I have forgotten why DIO_CREATE was actually introduced for
XFS. Is it still needed now that we no longer need to do the buffered
IO fallback for overwriting holes ?

That would help the discussion.

And then there used to be the special flags for cluster
filesystems which were introduced by GFS ... all of it got to be quite
confusing. The formality helps, but only to an extent.

> 
> > 
> > 
> > > +		/* if our write goes past i_size, do an expanding
> > > +		 * truncate to fill it before dropping i_mutex
> > > +		 */
> > > +		if (end > i_size_read(inode) && iocb->ki_filp) {
> > > +			struct iattr newattrs;
> > > +			newattrs.ia_size = end;
> > > +			newattrs.ia_file = iocb->ki_filp;
> > > +			newattrs.ia_valid = ATTR_SIZE | ATTR_FILE;
> > > +			retval = notify_change(iocb->ki_filp->f_dentry,
> > > +					       &newattrs);
> > > +			if (retval)
> > > +				goto out;
> > > +		}
> > > +		if (is_sync_kiocb(iocb)) {
> > > +			dio->reacquire_i_mutex = 1;
> > > +			mutex_unlock(&inode->i_mutex);
> > 
> > BTW, don't the page locks cover the AIO case as well ? I'm wondering
> > why we need this special case here.
> 
> Only because i_mutex is already dropped implicitly by the aio code when
> we return.  Ideally we could drop it sooner inside fs/direct-io.c, but
> I'll leave that tuning for a later patch.

OK, I see.

There is one other thing that I'm not sure we have covered yet. That is
ensuring ordered mode guarantees with AIO-DIO. As long as we were falling
back to buffered IO in cases involving block allocation or forcing extends
to be synchronous, we probably didn't need to worry about this. How do
we make sure we address that now?

Regards
Suparna

> 
> -chris

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 1/3] Introduce a place holder page for the pagecache.
  2006-11-03  7:12       ` Suparna Bhattacharya
@ 2006-11-03 14:10         ` Chris Mason
  0 siblings, 0 replies; 11+ messages in thread
From: Chris Mason @ 2006-11-03 14:10 UTC (permalink / raw)
  To: Suparna Bhattacharya; +Cc: linux-fsdevel, akpm, zach.brown

On Fri, Nov 03, 2006 at 12:42:57PM +0530, Suparna Bhattacharya wrote:
> > 
> > My original code was fairly dumb and didn't use gang lookups.  It also
> > dropped the radix tree lock after every insert/delete.  The code posted
> > here is about 2x as fast (or 1/2 the regression, depending on how you
> > look at it).  For smaller read IOs, the performance hit is about zero.
> > The cost of radix tree insertion is offset by less mutex and semaphore
> > locking.
> 
> OK, so you think the main overheads had to do with radix-tree locking,
> rather than with the additional node insertions ?

It depends on the machine, I suppose.  If you have a lot of CPUs trying
to work concurrently on the same file, bouncing the lock between them
(especially on NUMA) may be the most expensive part.  Hopefully we can
talk the XFS guys into some benchmarking, since they have a bunch of
numbers for comparison already.

But there are other costs, including slab churn as we allocate/free
radix nodes, etc.  I think (well, hope) that on most hardware this is
a very small cost in comparison with the actual IO.

[ ... ]
> > Yes, my goal was to get the code fast enough for most machines, so that
> > we could test the result and see what impact it would have on the rest
> > of the page cache.  From here we can decide if additional radix tuning
> > is a good idea, or if we just want to stick with the mutex/semaphore
> > combo in use now.
> 
> Yup, that makes sense.
> 
> The complexity in find_and_insert_placeholder_pages does worry me though;
> I wouldn't want to end up debugging tricky problems in that code after all
> this. Do you have any ideas on that ?

If we push bulk insertion down into the radix code,
find_and_insert_placeholders will become less complex.  We could even
preallocate a node that is already filled with place holders and pop
that into multiple spots in the radix tree.
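
With a bulk primitive the caller could shrink to something like this
(radix_tree_insert_run() is a made-up name for the not-yet-written
helper; it would fill [start, start + nr) with the shared placeholder,
dropping prebuilt fully-populated nodes in where a run covers a whole
node, and unwinding itself on failure):

	static int lock_range_with_placeholders(struct address_space *mapping,
						struct page *placeholder,
						pgoff_t start,
						unsigned long nr)
	{
		int err;

		write_lock_irq(&mapping->tree_lock);
		/* hypothetical bulk insert, all-or-nothing */
		err = radix_tree_insert_run(&mapping->page_tree, start, nr,
					    placeholder);
		write_unlock_irq(&mapping->tree_lock);
		return err;
	}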

-chris


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2006-11-03 14:11 UTC | newest]

Thread overview: 11+ messages
2006-10-27 18:22 [RFC PATCH 0/3] O_DIRECT locking rework Chris Mason
2006-10-27 18:23 ` [RFC PATCH 1/3] Introduce a place holder page for the pagecache Chris Mason
2006-11-01 10:23   ` Suparna Bhattacharya
2006-11-01 12:41     ` Chris Mason
2006-11-03  7:12       ` Suparna Bhattacharya
2006-11-03 14:10         ` Chris Mason
2006-10-27 18:25 ` [RFC PATCH 2/3] Change O_DIRECT to use placeholders Chris Mason
2006-10-27 18:26 ` [RFC PATCH 3/3] DIO: don't fall back to buffered writes Chris Mason
2006-11-01 10:51   ` Suparna Bhattacharya
2006-11-01 12:47     ` Chris Mason
2006-11-03  7:25       ` Suparna Bhattacharya

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).