[RFC PATCH 0/2] O_DIRECT locking rework

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH 0/2] O_DIRECT locking rework
@ 2006-10-20 18:32 Chris Mason
  2006-10-20 18:38 ` [RFC PATCH 1/2] placeholder pages Chris Mason
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Chris Mason @ 2006-10-20 18:32 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: akpm, zach.brown

Hello everyone,

O_DIRECT locking currently uses a few different per-inode locks to
prevent races between buffered io and direct io.  This is awkward, and
sometimes grows races where we expose old data on disk.

For example, I can't quite see how we protect from an mmap triggered
writepage from filling a hole in the middle of an O_DIRECT read.

This patch set changes O_DIRECT to use page locks instead of
mutex/semaphores.  It looks in the radix tree for pages affected by this
O_DIRECT read/wrte and locks any pages it finds.

For any pages not present, a stub page struct is inserted into the
radix tree.  The page cache routines are changed to either wait on this
place holder page or ignore it as appropriate.  Place holders are not
valid pages at all, you can't trust page->index or any other field.

The first patch introduces these place holder pages.  The second patch
changes fs/direct-io.c to use them.  Patch #2 needs work,
direct-io.c:lock_page_range can be made much faster, and it needs to be
changed to work in chunks instead of pinning down the whole range at
once.

But, this is enough for people to comment on the basic idea.  Testing
has been very light. I'm not sure I've covered all of the buffered vs
direct races yet.  The main goal of posting now is to talk about the
place holder pages and possible optimizations.

For the XFS guys, you probably want to avoid the page locking steps as
well, a later version will honor that.

-chris

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH 1/2] placeholder pages
  2006-10-20 18:32 [RFC PATCH 0/2] O_DIRECT locking rework Chris Mason
@ 2006-10-20 18:38 ` Chris Mason
  2006-10-20 18:41 ` [RFC PATCH 1/2] page cache locking for O_DIRECT Chris Mason
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: Chris Mason @ 2006-10-20 18:38 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: akpm, zach.brown

Introduce a place holder page for the radix tree.  mm/filemap.c is
changed to wait on these before adding a page into the page cache, and
truncates are changed to wait for all of the place holder pages to
disappear.

Place holder pages can only be tested or looked at with the mapping lock
held, and only page->flags can be trusted.  They cannot be locked, and
cannot have references increased or decreased on them.

Signed-off-by: Chris Mason <chris.mason@oracle.com>

diff -r 18a9e9f5c707 include/linux/mm.h
--- a/include/linux/mm.h	Thu Oct 19 08:30:00 2006 +0700
+++ b/include/linux/mm.h	Fri Oct 20 12:38:24 2006 -0400
@@ -276,6 +276,7 @@ static inline void get_page(struct page 
 	if (unlikely(PageCompound(page)))
 		page = (struct page *)page_private(page);
 	VM_BUG_ON(atomic_read(&page->_count) == 0);
+	VM_BUG_ON(PagePlaceHolder(page));
 	atomic_inc(&page->_count);
 }
 
diff -r 18a9e9f5c707 include/linux/page-flags.h
--- a/include/linux/page-flags.h	Thu Oct 19 08:30:00 2006 +0700
+++ b/include/linux/page-flags.h	Fri Oct 20 12:46:03 2006 -0400
@@ -90,6 +90,7 @@
 #define PG_reclaim		17	/* To be reclaimed asap */
 #define PG_nosave_free		18	/* Used for system suspend/resume */
 #define PG_buddy		19	/* Page is free, on buddy lists */
+#define PG_placeholder		20	/* An invalid page holding a slot */
 
 
 #if (BITS_PER_LONG > 32)
@@ -251,6 +252,10 @@ static inline void SetPageUptodate(struc
 #define SetPageUncached(page)	set_bit(PG_uncached, &(page)->flags)
 #define ClearPageUncached(page)	clear_bit(PG_uncached, &(page)->flags)
 
+#define PagePlaceHolder(page)	   test_bit(PG_placeholder, &(page)->flags)
+#define SetPagePlaceHolder(page)   set_bit(PG_placeholder, &(page)->flags)
+#define ClearPagePlaceHolder(page) clear_bit(PG_placeholder, &(page)->flags)
+
 struct page;	/* forward declaration */
 
 int test_clear_page_dirty(struct page *page);
diff -r 18a9e9f5c707 include/linux/pagemap.h
--- a/include/linux/pagemap.h	Thu Oct 19 08:30:00 2006 +0700
+++ b/include/linux/pagemap.h	Fri Oct 20 12:38:24 2006 -0400
@@ -72,6 +72,9 @@ extern struct page * find_get_page(struc
 				unsigned long index);
 extern struct page * find_lock_page(struct address_space *mapping,
 				unsigned long index);
+extern struct page *find_or_insert_page(struct address_space *mapping,
+					unsigned long index, gfp_t gfp_mask,
+					struct page *insert);
 extern __deprecated_for_modules struct page * find_trylock_page(
 			struct address_space *mapping, unsigned long index);
 extern struct page * find_or_create_page(struct address_space *mapping,
@@ -82,6 +85,12 @@ unsigned find_get_pages_contig(struct ad
 			       unsigned int nr_pages, struct page **pages);
 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 			int tag, unsigned int nr_pages, struct page **pages);
+void remove_placeholder_page(struct address_space *mapping, struct page *expected,
+		       unsigned long off);
+void wake_up_placeholder_page(struct page *page);
+void wait_on_placeholder_pages_range(struct address_space *mapping, pgoff_t start,
+			       pgoff_t end);
+
 
 /*
  * Returns locked page at given index in given cache, creating it if needed.
diff -r 18a9e9f5c707 mm/filemap.c
--- a/mm/filemap.c	Thu Oct 19 08:30:00 2006 +0700
+++ b/mm/filemap.c	Fri Oct 20 13:46:29 2006 -0400
@@ -44,6 +44,9 @@ generic_file_direct_IO(int rw, struct ki
 generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
 	loff_t offset, unsigned long nr_segs);
 
+static void wait_on_placeholder_page(struct address_space *mapping,
+			       struct page *page, unsigned long offset);
+
 /*
  * Shared mappings implemented 30.11.1994. It's not fully working yet,
  * though.
@@ -437,12 +440,24 @@ int add_to_page_cache(struct page *page,
 int add_to_page_cache(struct page *page, struct address_space *mapping,
 		pgoff_t offset, gfp_t gfp_mask)
 {
-	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
+	int error;
+again:
+	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 
 	if (error == 0) {
 		write_lock_irq(&mapping->tree_lock);
 		error = radix_tree_insert(&mapping->page_tree, offset, page);
-		if (!error) {
+		if (error == -EEXIST && (gfp_mask & __GFP_WAIT)) {
+			struct page *tmp;
+			tmp = radix_tree_lookup(&mapping->page_tree, offset);
+			if (tmp && PagePlaceHolder(tmp)) {
+				write_unlock_irq(&mapping->tree_lock);
+				radix_tree_preload_end();
+				wait_on_placeholder_page(mapping, tmp, offset);
+				goto again;
+			}
+		}
+		if (!error && !PagePlaceHolder(page)) {
 			page_cache_get(page);
 			SetPageLocked(page);
 			page->mapping = mapping;
@@ -526,6 +541,76 @@ void fastcall wait_on_page_bit(struct pa
 }
 EXPORT_SYMBOL(wait_on_page_bit);
 
+static void wait_on_placeholder_page(struct address_space *mapping,
+			       struct page *page, unsigned long offset)
+{
+	DEFINE_WAIT(wait);
+	wait_queue_head_t *wqh = page_waitqueue(page);
+	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+	read_lock_irq(&mapping->tree_lock);
+	page = radix_tree_lookup(&mapping->page_tree, offset);
+	if (page && PagePlaceHolder(page)) {
+		read_unlock_irq(&mapping->tree_lock);
+		io_schedule();
+	} else
+		read_unlock_irq(&mapping->tree_lock);
+	finish_wait(wqh, &wait);
+}
+
+void wake_up_placeholder_page(struct page *page)
+{
+	wake_up(page_waitqueue(page));
+}
+EXPORT_SYMBOL(wake_up_placeholder_page);
+
+/**
+ * wait_on_placeholder_pages - gang placeholder page waiter
+ * @mapping:	The address_space to search
+ * @start:	The starting page index
+ * @end:	The max page index
+ *
+ * wait_on_placeholder_pages() will search for and wait on a range of pages
+ * in the mapping
+ *
+ * On return, the range has no placeholder pages sitting in it.
+ */
+void wait_on_placeholder_pages_range(struct address_space *mapping,
+			       pgoff_t start, pgoff_t end)
+{
+	unsigned int i;
+	unsigned int ret;
+	struct page *pages[8];
+	pgoff_t cur = start;
+	pgoff_t highest = start;
+	DEFINE_WAIT(wait);
+
+	/*
+	 * we expect a very small number of place holder pages, so
+	 * this code isn't trying to be very fast.
+	 */
+again:
+	read_lock_irq(&mapping->tree_lock);
+	ret = radix_tree_gang_lookup(&mapping->page_tree,
+				(void **)pages, cur, ARRAY_SIZE(pages));
+	for (i = 0; i < ret; i++) {
+		if (PagePlaceHolder(pages[i])) {
+			wait_queue_head_t *wqh = page_waitqueue(pages[i]);
+			prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+			read_unlock_irq(&mapping->tree_lock);
+			io_schedule();
+			finish_wait(wqh, &wait);
+			goto again;
+		} else if (pages[i]->index > highest)
+			highest = pages[i]->index;
+	}
+	read_unlock_irq(&mapping->tree_lock);
+	if (highest < end && ret == ARRAY_SIZE(pages)) {
+		cur = highest;
+		goto again;
+	}
+}
+EXPORT_SYMBOL(wait_on_placeholder_pages_range);
+
 /**
  * unlock_page - unlock a locked page
  * @page: the page
@@ -542,6 +627,7 @@ EXPORT_SYMBOL(wait_on_page_bit);
  */
 void fastcall unlock_page(struct page *page)
 {
+	BUG_ON(PagePlaceHolder(page));
 	smp_mb__before_clear_bit();
 	if (!TestClearPageLocked(page))
 		BUG();
@@ -578,6 +664,7 @@ void fastcall __lock_page(struct page *p
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
 
+	BUG_ON(PagePlaceHolder(page));
 	__wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
 							TASK_UNINTERRUPTIBLE);
 }
@@ -590,6 +677,7 @@ void fastcall __lock_page_nosync(struct 
 void fastcall __lock_page_nosync(struct page *page)
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
+	BUG_ON(PagePlaceHolder(page));
 	__wait_on_bit_lock(page_waitqueue(page), &wait, __sleep_on_page_lock,
 							TASK_UNINTERRUPTIBLE);
 }
@@ -608,12 +696,66 @@ struct page * find_get_page(struct addre
 
 	read_lock_irq(&mapping->tree_lock);
 	page = radix_tree_lookup(&mapping->page_tree, offset);
-	if (page)
-		page_cache_get(page);
+	if (page) {
+		if (PagePlaceHolder(page))
+			page = NULL;
+		else
+			page_cache_get(page);
+	}
 	read_unlock_irq(&mapping->tree_lock);
 	return page;
 }
 EXPORT_SYMBOL(find_get_page);
+
+/**
+ * find_or_insert_page - locate a pagecache page or insert one
+ * @mapping: the page's address_space
+ * @index: the page's index into the mapping
+ * @gfp_mask: page allocation mode
+ * @insert: the page to insert if none is found
+ *
+ * Locates a page in the pagecache.  If the page is not present,
+ * @insert is added instead.  @insert is not placed on the lrus
+ * The returned page is locked and has its reference count
+ * incremented
+ *
+ * find_or_insert_page() may sleep, even if @gfp_flags specifies an atomic
+ * allocation!
+ *
+ * find_or_insert_page() returns the desired page's address, or zero on
+ * memory exhaustion.
+ */
+struct page *find_or_insert_page(struct address_space *mapping,
+		unsigned long index, gfp_t gfp_mask, struct page *insert)
+{
+	struct page *page;
+	int err;
+repeat:
+	page = find_lock_page(mapping, index);
+	if (!page) {
+		err = add_to_page_cache(insert, mapping, index, gfp_mask);
+		if (!err) {
+			page = insert;
+		} else if (err == -EEXIST)
+			goto repeat;
+	}
+	return page;
+}
+EXPORT_SYMBOL(find_or_insert_page);
+
+void remove_placeholder_page(struct address_space *mapping,
+			     struct page *expected, unsigned long offset)
+{
+	struct page *page;
+	write_lock_irq(&mapping->tree_lock);
+	page = radix_tree_lookup(&mapping->page_tree, offset);
+	BUG_ON(!page);
+	BUG_ON(!PagePlaceHolder(page));
+	BUG_ON(page != expected);
+	radix_tree_delete(&mapping->page_tree, offset);
+	write_unlock_irq(&mapping->tree_lock);
+}
+EXPORT_SYMBOL(remove_placeholder_page);
 
 /**
  * find_trylock_page - find and lock a page
@@ -628,7 +770,7 @@ struct page *find_trylock_page(struct ad
 
 	read_lock_irq(&mapping->tree_lock);
 	page = radix_tree_lookup(&mapping->page_tree, offset);
-	if (page && TestSetPageLocked(page))
+	if (page && (PagePlaceHolder(page) || TestSetPageLocked(page)))
 		page = NULL;
 	read_unlock_irq(&mapping->tree_lock);
 	return page;
@@ -654,6 +796,12 @@ repeat:
 repeat:
 	page = radix_tree_lookup(&mapping->page_tree, offset);
 	if (page) {
+		if (PagePlaceHolder(page)) {
+			read_unlock_irq(&mapping->tree_lock);
+			wait_on_placeholder_page(mapping, page, offset);
+			read_lock_irq(&mapping->tree_lock);
+			goto repeat;
+		}
 		page_cache_get(page);
 		if (TestSetPageLocked(page)) {
 			read_unlock_irq(&mapping->tree_lock);
@@ -743,8 +891,17 @@ unsigned find_get_pages(struct address_s
 	read_lock_irq(&mapping->tree_lock);
 	ret = radix_tree_gang_lookup(&mapping->page_tree,
 				(void **)pages, start, nr_pages);
-	for (i = 0; i < ret; i++)
-		page_cache_get(pages[i]);
+	for (i = 0; i < ret; i++) {
+		if (PagePlaceHolder(pages[i])) {
+			/* we can't return a place holder, shift it away */
+			if (i + 1 < ret) {
+				memmove(pages + i, pages + i + 1,
+		                       (ret - i - 1) * sizeof(struct page *));
+			}
+			ret--;
+		} else
+			page_cache_get(pages[i]);
+	}
 	read_unlock_irq(&mapping->tree_lock);
 	return ret;
 }
@@ -771,6 +928,8 @@ unsigned find_get_pages_contig(struct ad
 	ret = radix_tree_gang_lookup(&mapping->page_tree,
 				(void **)pages, index, nr_pages);
 	for (i = 0; i < ret; i++) {
+		if (PagePlaceHolder(pages[i]))
+			break;
 		if (pages[i]->mapping == NULL || pages[i]->index != index)
 			break;
 
@@ -801,8 +960,17 @@ unsigned find_get_pages_tag(struct addre
 	read_lock_irq(&mapping->tree_lock);
 	ret = radix_tree_gang_lookup_tag(&mapping->page_tree,
 				(void **)pages, *index, nr_pages, tag);
-	for (i = 0; i < ret; i++)
-		page_cache_get(pages[i]);
+	for (i = 0; i < ret; i++) {
+		if (PagePlaceHolder(pages[i])) {
+			/* we can't return a place holder, shift it away */
+			if (i + 1 < ret) {
+				memmove(pages + i, pages + i + 1,
+		                       (ret - i - 1) * sizeof(struct page *));
+			}
+			ret--;
+		} else
+			page_cache_get(pages[i]);
+	}
 	if (ret)
 		*index = pages[ret - 1]->index + 1;
 	read_unlock_irq(&mapping->tree_lock);
diff -r 18a9e9f5c707 mm/truncate.c
--- a/mm/truncate.c	Thu Oct 19 08:30:00 2006 +0700
+++ b/mm/truncate.c	Fri Oct 20 12:38:24 2006 -0400
@@ -207,6 +207,7 @@ void truncate_inode_pages_range(struct a
 		}
 		pagevec_release(&pvec);
 	}
+	wait_on_placeholder_pages_range(mapping, start, end);
 }
 EXPORT_SYMBOL(truncate_inode_pages_range);
 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 1/2] page cache locking for O_DIRECT
  2006-10-20 18:32 [RFC PATCH 0/2] O_DIRECT locking rework Chris Mason
  2006-10-20 18:38 ` [RFC PATCH 1/2] placeholder pages Chris Mason
@ 2006-10-20 18:41 ` Chris Mason
  2006-10-20 19:30 ` [RFC PATCH 0/2] O_DIRECT locking rework Andrew Morton
  2006-10-24 19:34 ` Chris Mason
  3 siblings, 0 replies; 17+ messages in thread
From: Chris Mason @ 2006-10-20 18:41 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: akpm, zach.brown

This changes O_DIRECT to take page locks or insert placeholder pages to
lock regions under direct io.

Signed-off-by: Chris Mason <chris.mason@oracle.com>

diff -r 18a9e9f5c707 fs/direct-io.c
--- a/fs/direct-io.c	Thu Oct 19 08:30:00 2006 +0700
+++ b/fs/direct-io.c	Fri Oct 20 12:38:24 2006 -0400
@@ -35,6 +35,7 @@
 #include <linux/rwsem.h>
 #include <linux/uio.h>
 #include <asm/atomic.h>
+#include <linux/writeback.h>
 
 /*
  * How many user pages to map in one call to get_user_pages().  This determines
@@ -94,6 +95,14 @@ struct dio {
 	struct buffer_head map_bh;	/* last get_block() result */
 
 	/*
+	 * kernel page pinning
+	 */
+	struct page fake;
+	struct page **fspages;
+	unsigned long nr_fspages;
+	loff_t fs_start_off;
+
+	/*
 	 * Deferred addition of a page to the dio.  These variables are
 	 * private to dio_send_cur_page(), submit_page_section() and
 	 * dio_bio_add_page().
@@ -190,6 +199,66 @@ out:
 	return ret;	
 }
 
+static void unlock_page_range(struct address_space *mapping,
+			      struct page **pages,
+			      unsigned long start,
+                              unsigned long nr)
+{
+	unsigned long i;
+	struct page *p;
+	struct page *placeholder = NULL;
+	for (i = 0; i < nr; i++) {
+		p = pages[i];
+		if (PagePlaceHolder(p)) {
+			placeholder = p;
+			remove_placeholder_page(mapping, p, start + i);
+		} else {
+			unlock_page(p);
+			page_cache_release(p);
+		}
+	}
+	if (placeholder)
+		wake_up_placeholder_page(placeholder);
+}
+
+static int lock_page_range(struct address_space *mapping,
+			   struct page **pages,
+		           unsigned long start,
+			   unsigned long nr,
+			   struct page *fake)
+{
+	struct page *p;
+	unsigned long numlock = 0;
+	unsigned long end = start + nr;
+	loff_t end_bytes = end << PAGE_CACHE_SHIFT;
+	unsigned long i;
+	for (i = start ; i < end; i++) {
+		p = find_or_insert_page(mapping, i, GFP_KERNEL, fake);
+		if (!p)
+			goto fail;
+		if (PageDirty(p)) {
+			/* this page was dirty, so someone raced in and
+			 * did a write.  Start IO on the whole region
+			 * and try again
+			 */
+			unlock_page(p);
+			page_cache_release(p);
+			__filemap_fdatawrite_range(mapping,
+			                        i << PAGE_CACHE_SHIFT,
+						end_bytes, WB_SYNC_ALL);
+			continue;
+		}
+		pages[numlock++] = p;
+	}
+	/* now that we have all the pages locked, wait for any io */
+	wait_on_page_writeback_range(mapping, start, end);
+	return 0;
+fail:
+	unlock_page_range(mapping, pages, start, numlock);
+	return -1;
+}
+
+
 /*
  * Get another userspace page.  Returns an ERR_PTR on error.  Pages are
  * buffered inside the dio so that we can call get_user_pages() against a
@@ -219,9 +288,8 @@ static void dio_complete(struct dio *dio
 {
 	if (dio->end_io && dio->result)
 		dio->end_io(dio->iocb, offset, bytes, dio->map_bh.b_private);
-	if (dio->lock_type == DIO_LOCKING)
-		/* lockdep: non-owner release */
-		up_read_non_owner(&dio->inode->i_alloc_sem);
+	unlock_page_range(dio->inode->i_mapping, dio->fspages,
+	                  dio->fs_start_off, dio->nr_fspages);
 }
 
 /*
@@ -944,7 +1012,7 @@ out:
 }
 
 /*
- * Releases both i_mutex and i_alloc_sem
+ * Releases both i_mutex
  */
 static ssize_t
 direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, 
@@ -1191,8 +1259,9 @@ __blockdev_direct_IO(int rw, struct kioc
 	ssize_t retval = -EINVAL;
 	loff_t end = offset;
 	struct dio *dio;
-	int release_i_mutex = 0;
 	int acquire_i_mutex = 0;
+	struct page **pages = NULL;
+	unsigned long nrpages;
 
 	if (rw & WRITE)
 		rw = WRITE_SYNC;
@@ -1221,12 +1290,21 @@ __blockdev_direct_IO(int rw, struct kioc
 				goto out;
 		}
 	}
-
 	dio = kmalloc(sizeof(*dio), GFP_KERNEL);
 	retval = -ENOMEM;
 	if (!dio)
 		goto out;
 
+	memset(&dio->fake, 0, sizeof(struct page));
+	SetPagePlaceHolder(&dio->fake);
+	nrpages = (end + PAGE_CACHE_SIZE - 1 - offset) >> PAGE_CACHE_SHIFT;
+	dio->fs_start_off = offset >> PAGE_CACHE_SHIFT;
+	pages = kmalloc(sizeof(struct page *) * nrpages, GFP_KERNEL);
+	dio->fspages = pages;
+	dio->nr_fspages = nrpages;
+	if (lock_page_range(inode->i_mapping, pages, dio->fs_start_off, nrpages,
+	                    &dio->fake))
+		goto out;
 	/*
 	 * For block device access DIO_NO_LOCKING is used,
 	 *	neither readers nor writers do any locking at all
@@ -1240,30 +1318,11 @@ __blockdev_direct_IO(int rw, struct kioc
 	if (dio_lock_type != DIO_NO_LOCKING) {
 		/* watch out for a 0 len io from a tricksy fs */
 		if (rw == READ && end > offset) {
-			struct address_space *mapping;
-
-			mapping = iocb->ki_filp->f_mapping;
-			if (dio_lock_type != DIO_OWN_LOCKING) {
-				mutex_lock(&inode->i_mutex);
-				release_i_mutex = 1;
-			}
-
-			retval = filemap_write_and_wait_range(mapping, offset,
-							      end - 1);
-			if (retval) {
-				kfree(dio);
-				goto out;
-			}
-
 			if (dio_lock_type == DIO_OWN_LOCKING) {
 				mutex_unlock(&inode->i_mutex);
 				acquire_i_mutex = 1;
 			}
 		}
-
-		if (dio_lock_type == DIO_LOCKING)
-			/* lockdep: not the owner will release it */
-			down_read_non_owner(&inode->i_alloc_sem);
 	}
 
 	/*
@@ -1278,13 +1337,8 @@ __blockdev_direct_IO(int rw, struct kioc
 	retval = direct_io_worker(rw, iocb, inode, iov, offset,
 				nr_segs, blkbits, get_block, end_io, dio);
 
-	if (rw == READ && dio_lock_type == DIO_LOCKING)
-		release_i_mutex = 0;
-
 out:
-	if (release_i_mutex)
-		mutex_unlock(&inode->i_mutex);
-	else if (acquire_i_mutex)
+	if (acquire_i_mutex)
 		mutex_lock(&inode->i_mutex);
 	return retval;
 }

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/2] O_DIRECT locking rework
  2006-10-20 18:32 [RFC PATCH 0/2] O_DIRECT locking rework Chris Mason
  2006-10-20 18:38 ` [RFC PATCH 1/2] placeholder pages Chris Mason
  2006-10-20 18:41 ` [RFC PATCH 1/2] page cache locking for O_DIRECT Chris Mason
@ 2006-10-20 19:30 ` Andrew Morton
  2006-10-20 20:03   ` Zach Brown
                     ` (2 more replies)
  2006-10-24 19:34 ` Chris Mason
  3 siblings, 3 replies; 17+ messages in thread
From: Andrew Morton @ 2006-10-20 19:30 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-fsdevel, zach.brown

On Fri, 20 Oct 2006 14:32:37 -0400
Chris Mason <chris.mason@oracle.com> wrote:

> Hello everyone,
> 
> O_DIRECT locking currently uses a few different per-inode locks to
> prevent races between buffered io and direct io.  This is awkward, and
> sometimes grows races where we expose old data on disk.
> 
> For example, I can't quite see how we protect from an mmap triggered
> writepage from filling a hole in the middle of an O_DIRECT read.
> 
> This patch set changes O_DIRECT to use page locks instead of
> mutex/semaphores.  It looks in the radix tree for pages affected by this
> O_DIRECT read/wrte and locks any pages it finds.
> 
> For any pages not present, a stub page struct is inserted into the
> radix tree.  The page cache routines are changed to either wait on this
> place holder page or ignore it as appropriate.  Place holders are not
> valid pages at all, you can't trust page->index or any other field.
> 
> The first patch introduces these place holder pages.  The second patch
> changes fs/direct-io.c to use them.  Patch #2 needs work,
> direct-io.c:lock_page_range can be made much faster, and it needs to be
> changed to work in chunks instead of pinning down the whole range at
> once.
> 
> But, this is enough for people to comment on the basic idea.  Testing
> has been very light. I'm not sure I've covered all of the buffered vs
> direct races yet.  The main goal of posting now is to talk about the
> place holder pages and possible optimizations.
> 
> For the XFS guys, you probably want to avoid the page locking steps as
> well, a later version will honor that.
> 

Boy, it doesn't do much to simplify the code, does it?  An opportunity for
Linus to again share with us his opinions on direct-io.

Conceptually, sticking locked pages into the mapping is a good thing to do,
because it meshes in with all our existing synchronisation design.


I think the fake placeholder page can be a kernel-wide thing rather than
per-dio?  That would be most desirable because then we have

#define PagePlaceHolder(page)	(page == global_placeholder_page)

which saves a precious page flag.



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/2] O_DIRECT locking rework
  2006-10-20 19:30 ` [RFC PATCH 0/2] O_DIRECT locking rework Andrew Morton
@ 2006-10-20 20:03   ` Zach Brown
  2006-10-20 20:12     ` Chris Mason
  2006-10-20 20:05   ` Chris Mason
  2006-10-20 20:23   ` Andi Kleen
  2 siblings, 1 reply; 17+ messages in thread
From: Zach Brown @ 2006-10-20 20:03 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Chris Mason, linux-fsdevel

> I think the fake placeholder page can be a kernel-wide thing rather than
> per-dio?  That would be most desirable because then we have
> 
> #define PagePlaceHolder(page)	(page == global_placeholder_page)

That would be nice.  The current patch hashes the page pointer with
page_waitqueue(page) to get at the zone's hashed wait queue buckets.  It
would probably want to avoid that hot spot by getting at the buckets by
hashing the mapping and file offset, I guess.

Chris and I have also talked about using tags in the nodes to mark slots
that only lead to place holder leaves instead of actually populating the
leaves with these place holder pages.  We'd avoid having to populate the
tree for larger IOs, and callers wouldn't have to filter out the place
holders from lookups, but the tree nodes would grow for a rare user :/.

- z

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/2] O_DIRECT locking rework
  2006-10-20 20:03   ` Zach Brown
@ 2006-10-20 20:12     ` Chris Mason
  0 siblings, 0 replies; 17+ messages in thread
From: Chris Mason @ 2006-10-20 20:12 UTC (permalink / raw)
  To: Zach Brown; +Cc: Andrew Morton, linux-fsdevel

On Fri, Oct 20, 2006 at 01:03:24PM -0700, Zach Brown wrote:
> 
> > I think the fake placeholder page can be a kernel-wide thing rather than
> > per-dio?  That would be most desirable because then we have
> > 
> > #define PagePlaceHolder(page)	(page == global_placeholder_page)
> 
> That would be nice.  The current patch hashes the page pointer with
> page_waitqueue(page) to get at the zone's hashed wait queue buckets.  It
> would probably want to avoid that hot spot by getting at the buckets by
> hashing the mapping and file offset, I guess.

The page_waitqueue part only happens in the contention case.  This will
either be buffered vs direct io, direct io vs direct io or direct io vs
truncate.  I think all three should be rare enough that a single
waitqueue won't hurt.

It is also very hard to know the file offset without asking the radix
tree, since we can't trust placeholder_page->index.

-chris


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/2] O_DIRECT locking rework
  2006-10-20 19:30 ` [RFC PATCH 0/2] O_DIRECT locking rework Andrew Morton
  2006-10-20 20:03   ` Zach Brown
@ 2006-10-20 20:05   ` Chris Mason
  2006-10-20 20:23   ` Andi Kleen
  2 siblings, 0 replies; 17+ messages in thread
From: Chris Mason @ 2006-10-20 20:05 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, zach.brown

On Fri, Oct 20, 2006 at 12:30:58PM -0700, Andrew Morton wrote:
> On Fri, 20 Oct 2006 14:32:37 -0400
> Chris Mason <chris.mason@oracle.com> wrote:
> 
> > Hello everyone,
> > 
> > O_DIRECT locking currently uses a few different per-inode locks to
> > prevent races between buffered io and direct io.  This is awkward, and
> > sometimes grows races where we expose old data on disk.
> > 
> > For example, I can't quite see how we protect from an mmap triggered
> > writepage from filling a hole in the middle of an O_DIRECT read.
> > 
> > This patch set changes O_DIRECT to use page locks instead of
> > mutex/semaphores.  It looks in the radix tree for pages affected by this
> > O_DIRECT read/wrte and locks any pages it finds.
> 
> Boy, it doesn't do much to simplify the code, does it?  An opportunity for
> Linus to again share with us his opinions on direct-io.

It definitely isn't trivial.  But hopefully if we make direct-io a
special case via the place holder pages, we can get rid of other (more
complex) special cases elsewhere.  I think we can eventually get rid of
the fall back to buffered io stuff with this.

> 
> Conceptually, sticking locked pages into the mapping is a good thing to do,
> because it meshes in with all our existing synchronisation design.
> 
> I think the fake placeholder page can be a kernel-wide thing rather than
> per-dio?  That would be most desirable because then we have
> 
> #define PagePlaceHolder(page)	(page == global_placeholder_page)
> 
> which saves a precious page flag.

I flipped a coin on the placeholder testing mechanism.  I can't think of
a reason to have more than one placeholder page, so the global is fine
by me.  Zach has some ideas on optimizations at the radix tree level,
but we'll still need some kind of PagePlaceHolder test.

-chris


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/2] O_DIRECT locking rework
  2006-10-20 19:30 ` [RFC PATCH 0/2] O_DIRECT locking rework Andrew Morton
  2006-10-20 20:03   ` Zach Brown
  2006-10-20 20:05   ` Chris Mason
@ 2006-10-20 20:23   ` Andi Kleen
  2 siblings, 0 replies; 17+ messages in thread
From: Andi Kleen @ 2006-10-20 20:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, zach.brown, chris.mason

Andrew Morton <akpm@osdl.org> writes:
> 
> I think the fake placeholder page can be a kernel-wide thing rather than
> per-dio?  That would be most desirable because then we have

Wouldn't that create a hot spot with lots of bounced cache line 
on a large machine for the wait queue? A little more fine grained
seems safer in this regard.

-Andi

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/2] O_DIRECT locking rework
  2006-10-20 18:32 [RFC PATCH 0/2] O_DIRECT locking rework Chris Mason
                   ` (2 preceding siblings ...)
  2006-10-20 19:30 ` [RFC PATCH 0/2] O_DIRECT locking rework Andrew Morton
@ 2006-10-24 19:34 ` Chris Mason
  2006-10-24 19:38   ` [RFC PATCH 1/2] placeholder pages Chris Mason
                     ` (2 more replies)
  3 siblings, 3 replies; 17+ messages in thread
From: Chris Mason @ 2006-10-24 19:34 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: akpm, zach.brown

Hello,

Here is a new cut of my O_DIRECT locking rework.  It has a much lower
cpu cost than the last set, and simple benchmarks no longer show a
regression here in system time.  But, the complexity for inserting
placeholder pages has gone up.

I've also changed the way I test for a place holder page (from
mm/filemap.c):

static struct address_space placeholder_address_space;
#define PagePlaceHolder(page) ((page)->mapping == &placeholder_address_space)

This is more stable than the last one but I'm just starting to run race
and load testing on it.

-chris

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH 1/2] placeholder pages
  2006-10-24 19:34 ` Chris Mason
@ 2006-10-24 19:38   ` Chris Mason
  2006-10-24 19:40   ` [RFC PATCH 2/2] O_DIRECT locking via placeholders Chris Mason
  2006-10-24 20:28   ` [RFC PATCH 0/2] O_DIRECT locking rework Badari Pulavarty
  2 siblings, 0 replies; 17+ messages in thread
From: Chris Mason @ 2006-10-24 19:38 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: akpm, zach.brown

Introduce a place holder page for the radix tree.  mm/filemap.c is
changed to wait on these before adding a page into the page cache, and
truncates are changed to wait for all of the place holder pages to
disappear.

Place holder pages can only be tested or looked at with the mapping lock
held, and only PagePlaceHolder(page) can be trusted.  They cannot be locked,
and cannot have references increased or decreased on them.

Signed-off-by: Chris Mason <chris.mason@oracle.com>

diff -r 18a9e9f5c707 include/linux/mm.h
--- a/include/linux/mm.h	Thu Oct 19 08:30:00 2006 +0700
+++ b/include/linux/mm.h	Tue Oct 24 15:10:48 2006 -0400
@@ -276,6 +276,7 @@ static inline void get_page(struct page 
 	if (unlikely(PageCompound(page)))
 		page = (struct page *)page_private(page);
 	VM_BUG_ON(atomic_read(&page->_count) == 0);
+	VM_BUG_ON(PagePlaceHolder(page));
 	atomic_inc(&page->_count);
 }
 
diff -r 18a9e9f5c707 include/linux/page-flags.h
--- a/include/linux/page-flags.h	Thu Oct 19 08:30:00 2006 +0700
+++ b/include/linux/page-flags.h	Tue Oct 24 15:10:48 2006 -0400
@@ -267,4 +266,6 @@ static inline void set_page_writeback(st
 	test_set_page_writeback(page);
 }
 
+void set_page_placeholder(struct page *page);
+
 #endif	/* PAGE_FLAGS_H */
diff -r 18a9e9f5c707 include/linux/pagemap.h
--- a/include/linux/pagemap.h	Thu Oct 19 08:30:00 2006 +0700
+++ b/include/linux/pagemap.h	Tue Oct 24 15:10:48 2006 -0400
@@ -72,6 +72,12 @@ extern struct page * find_get_page(struc
 				unsigned long index);
 extern struct page * find_lock_page(struct address_space *mapping,
 				unsigned long index);
+int find_or_insert_placeholders(struct address_space *mapping,
+                                  struct page **pages, unsigned long start,
+                                  unsigned long end, unsigned long nr,
+                                  gfp_t gfp_mask,
+                                  struct page *insert,
+                                  int wait);
 extern __deprecated_for_modules struct page * find_trylock_page(
 			struct address_space *mapping, unsigned long index);
 extern struct page * find_or_create_page(struct address_space *mapping,
@@ -82,6 +88,16 @@ unsigned find_get_pages_contig(struct ad
 			       unsigned int nr_pages, struct page **pages);
 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 			int tag, unsigned int nr_pages, struct page **pages);
+void remove_placeholder_pages(struct address_space *mapping,
+                             struct page **pages,
+                             struct page *placeholder,
+                             unsigned long offset,
+                             unsigned long end,
+                             unsigned long nr);
+void wake_up_placeholder_page(struct page *page);
+void wait_on_placeholder_pages_range(struct address_space *mapping, pgoff_t start,
+			       pgoff_t end);
+
 
 /*
  * Returns locked page at given index in given cache, creating it if needed.
diff -r 18a9e9f5c707 include/linux/radix-tree.h
--- a/include/linux/radix-tree.h	Thu Oct 19 08:30:00 2006 +0700
+++ b/include/linux/radix-tree.h	Tue Oct 24 15:10:48 2006 -0400
@@ -56,6 +56,7 @@ radix_tree_gang_lookup(struct radix_tree
 radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
 			unsigned long first_index, unsigned int max_items);
 int radix_tree_preload(gfp_t gfp_mask);
+int radix_tree_needs_preload(void);
 void radix_tree_init(void);
 void *radix_tree_tag_set(struct radix_tree_root *root,
 			unsigned long index, unsigned int tag);
diff -r 18a9e9f5c707 lib/radix-tree.c
--- a/lib/radix-tree.c	Thu Oct 19 08:30:00 2006 +0700
+++ b/lib/radix-tree.c	Tue Oct 24 15:10:48 2006 -0400
@@ -139,6 +139,19 @@ out:
 out:
 	return ret;
 }
+
+/*
+ * Must be called inside the preload pair.  This tells you if another
+ * run of radix_tree_preload is required, allowing the caller to decide
+ * if a repeated insert will succeed
+ */
+int radix_tree_needs_preload(void)
+{
+	struct radix_tree_preload *rtp;
+	rtp = &__get_cpu_var(radix_tree_preloads);
+	return rtp->nr < ARRAY_SIZE(rtp->nodes);
+}
+EXPORT_SYMBOL_GPL(radix_tree_needs_preload);
 
 static inline void tag_set(struct radix_tree_node *node, unsigned int tag,
 		int offset)
diff -r 18a9e9f5c707 mm/filemap.c
--- a/mm/filemap.c	Thu Oct 19 08:30:00 2006 +0700
+++ b/mm/filemap.c	Tue Oct 24 15:10:48 2006 -0400
@@ -44,6 +44,25 @@ generic_file_direct_IO(int rw, struct ki
 generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
 	loff_t offset, unsigned long nr_segs);
 
+static wait_queue_head_t *page_waitqueue(struct page *page);
+static void wait_on_placeholder_page(struct address_space *mapping,
+				     struct page *page,
+				     int write_lock);
+
+static struct address_space placeholder_address_space;
+#define PagePlaceHolder(page) ((page)->mapping == &placeholder_address_space)
+
+/*
+ * Turns the page struct into a marker for a placeholder page.
+ * This should not be called with a real page, only a struct page stub.
+ */
+void set_page_placeholder(struct page *page)
+{
+	memset(page, 0, sizeof(*page));
+	page->mapping = &placeholder_address_space;
+}
+EXPORT_SYMBOL_GPL(set_page_placeholder);
+
 /*
  * Shared mappings implemented 30.11.1994. It's not fully working yet,
  * though.
@@ -437,12 +456,24 @@ int add_to_page_cache(struct page *page,
 int add_to_page_cache(struct page *page, struct address_space *mapping,
 		pgoff_t offset, gfp_t gfp_mask)
 {
-	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
+	int error;
+again:
+	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 
 	if (error == 0) {
 		write_lock_irq(&mapping->tree_lock);
 		error = radix_tree_insert(&mapping->page_tree, offset, page);
-		if (!error) {
+		if (error == -EEXIST && (gfp_mask & __GFP_WAIT)) {
+			struct page *tmp;
+			tmp = radix_tree_lookup(&mapping->page_tree, offset);
+			if (tmp && PagePlaceHolder(tmp)) {
+				radix_tree_preload_end();
+				/* drops the lock */
+				wait_on_placeholder_page(mapping, tmp, 1);
+				goto again;
+			}
+		}
+		if (!error && !PagePlaceHolder(page)) {
 			page_cache_get(page);
 			SetPageLocked(page);
 			page->mapping = mapping;
@@ -526,6 +557,79 @@ void fastcall wait_on_page_bit(struct pa
 }
 EXPORT_SYMBOL(wait_on_page_bit);
 
+/*
+ * Call with either a read lock or a write lock on the mapping tree.
+ * The lock is dropped just before waiting for the place holder
+ */
+static void wait_on_placeholder_page(struct address_space *mapping,
+				     struct page *page,
+				     int write_lock)
+{
+	DEFINE_WAIT(wait);
+	wait_queue_head_t *wqh = page_waitqueue(page);
+	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+	if (write_lock)
+		write_unlock_irq(&mapping->tree_lock);
+	else
+		read_unlock_irq(&mapping->tree_lock);
+	io_schedule();
+	finish_wait(wqh, &wait);
+}
+
+void wake_up_placeholder_page(struct page *page)
+{
+	__wake_up_bit(page_waitqueue(page), &page->flags, PG_locked);
+}
+EXPORT_SYMBOL_GPL(wake_up_placeholder_page);
+
+/**
+ * wait_on_placeholder_pages - gang placeholder page waiter
+ * @mapping:	The address_space to search
+ * @start:	The starting page index
+ * @end:	The max page index
+ *
+ * wait_on_placeholder_pages() will search for and wait on a range of pages
+ * in the mapping
+ *
+ * On return, the range has no placeholder pages sitting in it.
+ */
+void wait_on_placeholder_pages_range(struct address_space *mapping,
+			       pgoff_t start, pgoff_t end)
+{
+	unsigned int i;
+	unsigned int ret;
+	struct page *pages[8];
+	pgoff_t cur = start;
+	pgoff_t highest = start;
+
+	/*
+	 * we expect a very small number of place holder pages, so
+	 * this code isn't trying to be very fast.
+	 */
+again:
+	read_lock_irq(&mapping->tree_lock);
+	ret = radix_tree_gang_lookup(&mapping->page_tree,
+				(void **)pages, cur, ARRAY_SIZE(pages));
+	for (i = 0; i < ret; i++) {
+		if (PagePlaceHolder(pages[i])) {
+			/* drops the lock */
+			wait_on_placeholder_page(mapping, pages[i], 0);
+			goto again;
+		} else if (pages[i]->index > highest)
+			highest = pages[i]->index;
+	}
+	if (highest < end && ret == ARRAY_SIZE(pages)) {
+		cur = highest;
+		if (need_resched()) {
+			read_unlock_irq(&mapping->tree_lock);
+			cond_resched();
+		}
+		goto again;
+	}
+	read_unlock_irq(&mapping->tree_lock);
+}
+EXPORT_SYMBOL_GPL(wait_on_placeholder_pages_range);
+
 /**
  * unlock_page - unlock a locked page
  * @page: the page
@@ -542,6 +646,7 @@ EXPORT_SYMBOL(wait_on_page_bit);
  */
 void fastcall unlock_page(struct page *page)
 {
+	BUG_ON(PagePlaceHolder(page));
 	smp_mb__before_clear_bit();
 	if (!TestClearPageLocked(page))
 		BUG();
@@ -578,6 +683,7 @@ void fastcall __lock_page(struct page *p
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
 
+	BUG_ON(PagePlaceHolder(page));
 	__wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
 							TASK_UNINTERRUPTIBLE);
 }
@@ -590,6 +696,7 @@ void fastcall __lock_page_nosync(struct 
 void fastcall __lock_page_nosync(struct page *page)
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
+	BUG_ON(PagePlaceHolder(page));
 	__wait_on_bit_lock(page_waitqueue(page), &wait, __sleep_on_page_lock,
 							TASK_UNINTERRUPTIBLE);
 }
@@ -608,12 +715,262 @@ struct page * find_get_page(struct addre
 
 	read_lock_irq(&mapping->tree_lock);
 	page = radix_tree_lookup(&mapping->page_tree, offset);
-	if (page)
-		page_cache_get(page);
+	if (page) {
+		if (PagePlaceHolder(page))
+			page = NULL;
+		else
+			page_cache_get(page);
+	}
 	read_unlock_irq(&mapping->tree_lock);
 	return page;
 }
 EXPORT_SYMBOL(find_get_page);
+
+/**
+ * remove_placeholder_pages - remove a range of placeholder or locked pages
+ * @mapping: the page's address_space
+ * @pages: an array of page pointers to use for gang looukps
+ * @placeholder: the placeholder page previously inserted (for verification)
+ * @start: the search starting point
+ * @end: the search end point (offsets >= end are not touched)
+ * @nr: the size of the pages array.
+ *
+ * Any placeholder pages in the range specified are removed.  Any real
+ * pages are unlocked and released.
+ */
+void remove_placeholder_pages(struct address_space *mapping,
+			     struct page **pages,
+			     struct page *placeholder,
+			     unsigned long start,
+			     unsigned long end,
+			     unsigned long nr)
+{
+	struct page *page;
+	int ret;
+	int i;
+	unsigned long num;
+
+	write_lock_irq(&mapping->tree_lock);
+	while (start < end) {
+		num = min(nr, end-start);
+		ret = radix_tree_gang_lookup(&mapping->page_tree,
+						(void **)pages, start, num);
+		BUG_ON(ret != num);
+		for (i = 0; i < ret; i++) {
+			page = pages[i];
+			if (PagePlaceHolder(page)) {
+				BUG_ON(page != placeholder);
+				radix_tree_delete(&mapping->page_tree,
+						  start + i);
+			} else {
+				BUG_ON(page->index >= end);
+				unlock_page(page);
+				page_cache_release(page);
+			}
+		}
+		start += ret;
+		if (need_resched()) {
+			write_unlock_irq(&mapping->tree_lock);
+			cond_resched();
+			write_lock_irq(&mapping->tree_lock);
+		}
+	}
+	write_unlock_irq(&mapping->tree_lock);
+	if (placeholder)
+		wake_up_placeholder_page(placeholder);
+}
+EXPORT_SYMBOL_GPL(remove_placeholder_pages);
+
+/*
+ * a helper function to insert a placeholder into multiple slots
+ * in the radix tree.  This could probably use an optimized version
+ * in the radix code.  It may insert fewer than the request number
+ * of placeholders if we need to reschedule or the radix tree needs to
+ * be preloaded again.
+ *
+ * returns zero on error or the number actually inserted.
+ */
+static unsigned long insert_placeholders(struct address_space *mapping,
+					 struct page *insert,
+					 unsigned long start, unsigned long nr)
+{
+	int err;
+	unsigned long i;
+	for (i = 0 ; i < nr; i++) {
+		err = radix_tree_insert(&mapping->page_tree, start + i,
+					  insert);
+		if (err)
+			return i;
+		if (radix_tree_needs_preload() || (i && need_resched()))
+			return i + 1;
+	}
+	return i;
+}
+
+
+/**
+ * find_or_insert_placeholders - locate a group of pagecache pages or insert one
+ * @mapping: the page's address_space
+ * @pages: an array of page pointers to use for gang looukps
+ * @start: the search starting point
+ * @end: the search end point (offsets >= end are not touched)
+ * @nr: the size of the pages array.
+ * @gfp_mask: page allocation mode
+ * @insert: the page to insert if none is found
+ * @iowait: 1 if you want to wait for dirty or writeback pages.
+ *
+ * This locks down a range of offsets in the address space.  Any pages
+ * already present are locked and a placeholder page is inserted into
+ * the radix tree for any offsets without pages.
+ */
+int find_or_insert_placeholders(struct address_space *mapping,
+				  struct page **pages, unsigned long start,
+				  unsigned long end, unsigned long nr,
+				  gfp_t gfp_mask,
+				  struct page *insert,
+				  int iowait)
+{
+	int err = 0;
+	int i, ret;
+	unsigned long cur = start;
+	unsigned long ins;
+	struct page *page;
+	int restart;
+
+	/*
+	 * this gets complicated.  Placeholders and page locks need to
+	 * be taken in order.  We use gang lookup to cut down on the cpu
+	 * cost, but we need to keep track of holes in the results and
+	 * insert placeholders as appropriate.
+	 *
+	 * If a locked page or a placeholder is found, we wait for it and
+	 * pick up where we left off.  If a dirty or PG_writeback page is found
+	 * and iowait==1, we have to drop all of our locks, kick/wait for the
+	 * io and start over.
+	 */
+repeat:
+	if (cur != start )
+		cond_resched();
+	err = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
+	if (err)
+		goto fail;
+	write_lock_irq(&mapping->tree_lock);
+repeat_lock:
+	ret = radix_tree_gang_lookup(&mapping->page_tree,
+					(void **)pages, cur,
+					min(nr, end-cur));
+	for (i = 0 ; i < ret ; i++) {
+		restart = 0;
+		page = pages[i];
+
+		if (PagePlaceHolder(page)) {
+			radix_tree_preload_end();
+			/* drops the lock */
+			wait_on_placeholder_page(mapping, page, 1);
+			goto repeat;
+		}
+
+		if (page->index > cur) {
+			unsigned long top = min(end, page->index);
+			ins = insert_placeholders(mapping, insert, cur,
+						  top - cur);
+			cur += ins;
+			if (radix_tree_needs_preload() || ins != top - cur) {
+				write_unlock_irq(&mapping->tree_lock);
+				radix_tree_preload_end();
+				if (ins == 0) {
+					err = -1;
+					goto fail;
+				}
+				goto repeat;
+			}
+		}
+		if (page->index >= end) {
+			ret = 0;
+			break;
+		}
+		page_cache_get(page);
+		if (TestSetPageLocked(page)) {
+			unsigned long tmpoff = page->index;
+			page_cache_get(page);
+			write_unlock_irq(&mapping->tree_lock);
+			radix_tree_preload_end();
+			__lock_page(page);
+			/* Has the page been truncated while we slept? */
+			if (unlikely(page->mapping != mapping ||
+				     page->index != tmpoff)) {
+				unlock_page(page);
+				page_cache_release(page);
+				goto repeat;
+			} else {
+				/* we've locked the page, but  we need to
+				 *  check it for dirty/writeback
+				 */
+				restart = 1;
+			}
+		}
+		if (iowait && (PageDirty(page) || PageWriteback(page))) {
+			/*
+			 * mapge_writepages collects a number of pages
+			 * into a bio, marking them PG_writeback as it goes.
+			 * It also locks pages while deciding what to do
+			 * with them.  What we want to do is have a bunch
+			 * of pages locked, and then wait for any io to
+			 * finish.  But if mpage_writepages is waiting for
+			 * one of our locked pages, it may never start the
+			 * io on the bio it has created.
+			 *
+			 * so, we have to drop all of our locks and
+			 * start over
+			 */
+			unlock_page(page);
+			page_cache_release(page);
+			if (!restart) {
+				write_unlock_irq(&mapping->tree_lock);
+				radix_tree_preload_end();
+			}
+			remove_placeholder_pages(mapping, pages, insert,
+			                         start, cur, nr);
+			filemap_write_and_wait_range(mapping,
+						 start << PAGE_CACHE_SHIFT,
+						 end << PAGE_CACHE_SHIFT);
+			cur = start;
+			goto repeat;
+		}
+		cur++;
+		if (restart)
+			goto repeat;
+		if (cur >= end)
+			break;
+	}
+
+	/* we haven't yet filled the range */
+	if (cur < end) {
+		/* if the search filled our array, there is more to do. */
+		if (ret && ret == nr)
+			goto repeat_lock;
+
+		/* otherwise insert placeholders for the remaining offsets */
+		ins = insert_placeholders(mapping, insert, cur, end - cur);
+		cur += ins;
+		if (ins != end -cur) {
+			write_unlock_irq(&mapping->tree_lock);
+			radix_tree_preload_end();
+			if (ins == 0) {
+				err = -1;
+				goto fail;
+			}
+			goto repeat;
+		}
+	}
+	write_unlock_irq(&mapping->tree_lock);
+	radix_tree_preload_end();
+	return err;
+fail:
+	remove_placeholder_pages(mapping, pages, insert, start, cur, nr);
+	return err;
+}
+EXPORT_SYMBOL_GPL(find_or_insert_placeholders);
 
 /**
  * find_trylock_page - find and lock a page
@@ -628,7 +985,7 @@ struct page *find_trylock_page(struct ad
 
 	read_lock_irq(&mapping->tree_lock);
 	page = radix_tree_lookup(&mapping->page_tree, offset);
-	if (page && TestSetPageLocked(page))
+	if (page && (PagePlaceHolder(page) || TestSetPageLocked(page)))
 		page = NULL;
 	read_unlock_irq(&mapping->tree_lock);
 	return page;
@@ -654,6 +1011,12 @@ repeat:
 repeat:
 	page = radix_tree_lookup(&mapping->page_tree, offset);
 	if (page) {
+		if (PagePlaceHolder(page)) {
+			/* drops the lock */
+			wait_on_placeholder_page(mapping, page, 0);
+			read_lock_irq(&mapping->tree_lock);
+			goto repeat;
+		}
 		page_cache_get(page);
 		if (TestSetPageLocked(page)) {
 			read_unlock_irq(&mapping->tree_lock);
@@ -737,14 +1100,25 @@ unsigned find_get_pages(struct address_s
 unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
 			    unsigned int nr_pages, struct page **pages)
 {
-	unsigned int i;
+	unsigned int i = 0;
 	unsigned int ret;
 
 	read_lock_irq(&mapping->tree_lock);
 	ret = radix_tree_gang_lookup(&mapping->page_tree,
 				(void **)pages, start, nr_pages);
-	for (i = 0; i < ret; i++)
-		page_cache_get(pages[i]);
+	while(i < ret) {
+		if (PagePlaceHolder(pages[i])) {
+			/* we can't return a place holder, shift it away */
+			if (i + 1 < ret) {
+				memcpy(&pages[i], &pages[i+1],
+		                       (ret - i - 1) * sizeof(struct page *));
+			}
+			ret--;
+			continue;
+		} else
+			page_cache_get(pages[i]);
+		i++;
+	}
 	read_unlock_irq(&mapping->tree_lock);
 	return ret;
 }
@@ -771,6 +1145,8 @@ unsigned find_get_pages_contig(struct ad
 	ret = radix_tree_gang_lookup(&mapping->page_tree,
 				(void **)pages, index, nr_pages);
 	for (i = 0; i < ret; i++) {
+		if (PagePlaceHolder(pages[i]))
+			break;
 		if (pages[i]->mapping == NULL || pages[i]->index != index)
 			break;
 
@@ -795,14 +1171,25 @@ unsigned find_get_pages_tag(struct addre
 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 			int tag, unsigned int nr_pages, struct page **pages)
 {
-	unsigned int i;
+	unsigned int i = 0;
 	unsigned int ret;
 
 	read_lock_irq(&mapping->tree_lock);
 	ret = radix_tree_gang_lookup_tag(&mapping->page_tree,
 				(void **)pages, *index, nr_pages, tag);
-	for (i = 0; i < ret; i++)
-		page_cache_get(pages[i]);
+	while(i < ret) {
+		if (PagePlaceHolder(pages[i])) {
+			/* we can't return a place holder, shift it away */
+			if (i + 1 < ret) {
+				memcpy(&pages[i], &pages[i+1],
+		                       (ret - i - 1) * sizeof(struct page *));
+			}
+			ret--;
+			continue;
+		} else
+			page_cache_get(pages[i]);
+		i++;
+	}
 	if (ret)
 		*index = pages[ret - 1]->index + 1;
 	read_unlock_irq(&mapping->tree_lock);
diff -r 18a9e9f5c707 mm/truncate.c
--- a/mm/truncate.c	Thu Oct 19 08:30:00 2006 +0700
+++ b/mm/truncate.c	Tue Oct 24 15:10:48 2006 -0400
@@ -207,6 +207,7 @@ void truncate_inode_pages_range(struct a
 		}
 		pagevec_release(&pvec);
 	}
+	wait_on_placeholder_pages_range(mapping, start, end);
 }
 EXPORT_SYMBOL(truncate_inode_pages_range);
 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH 2/2] O_DIRECT locking via placeholders
  2006-10-24 19:34 ` Chris Mason
  2006-10-24 19:38   ` [RFC PATCH 1/2] placeholder pages Chris Mason
@ 2006-10-24 19:40   ` Chris Mason
  2006-10-24 20:28   ` [RFC PATCH 0/2] O_DIRECT locking rework Badari Pulavarty
  2 siblings, 0 replies; 17+ messages in thread
From: Chris Mason @ 2006-10-24 19:40 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: akpm, zach.brown

This changes O_DIRECT to take page locks or insert placeholder pages to
lock regions under direct io.

Signed-off-by: Chris Mason <chris.mason@oracle.com>

diff -r 18a9e9f5c707 fs/direct-io.c
--- a/fs/direct-io.c	Thu Oct 19 08:30:00 2006 +0700
+++ b/fs/direct-io.c	Tue Oct 24 15:10:48 2006 -0400
@@ -35,6 +35,7 @@
 #include <linux/rwsem.h>
 #include <linux/uio.h>
 #include <asm/atomic.h>
+#include <linux/writeback.h>
 
 /*
  * How many user pages to map in one call to get_user_pages().  This determines
@@ -94,6 +95,14 @@ struct dio {
 	struct buffer_head map_bh;	/* last get_block() result */
 
 	/*
+	 * kernel page pinning
+	 */
+	struct page fake;
+	struct page *tmppages[DIO_PAGES];
+	unsigned long fspages_start_off;
+	unsigned long fspages_end_off;
+
+	/*
 	 * Deferred addition of a page to the dio.  These variables are
 	 * private to dio_send_cur_page(), submit_page_section() and
 	 * dio_bio_add_page().
@@ -190,6 +199,28 @@ out:
 	return ret;	
 }
 
+static void unlock_page_range(struct dio *dio, unsigned long start,
+			      unsigned long nr)
+{
+	remove_placeholder_pages(dio->inode->i_mapping, dio->tmppages,
+				 &dio->fake,
+				 start, start + nr,
+				 ARRAY_SIZE(dio->tmppages));
+}
+
+static int lock_page_range(struct dio *dio, unsigned long start,
+			   unsigned long nr)
+{
+	struct address_space *mapping = dio->inode->i_mapping;
+	struct page *fake = &dio->fake;
+	unsigned long end = start + nr;
+	return find_or_insert_placeholders(mapping, dio->tmppages, start, end,
+	                                  ARRAY_SIZE(dio->tmppages),
+					  GFP_KERNEL, fake,
+					  dio->rw == READ);
+}
+
+
 /*
  * Get another userspace page.  Returns an ERR_PTR on error.  Pages are
  * buffered inside the dio so that we can call get_user_pages() against a
@@ -219,9 +250,9 @@ static void dio_complete(struct dio *dio
 {
 	if (dio->end_io && dio->result)
 		dio->end_io(dio->iocb, offset, bytes, dio->map_bh.b_private);
-	if (dio->lock_type == DIO_LOCKING)
-		/* lockdep: non-owner release */
-		up_read_non_owner(&dio->inode->i_alloc_sem);
+	unlock_page_range(dio, dio->fspages_start_off,
+			  dio->fspages_end_off - dio->fspages_start_off);
+	dio->fspages_end_off = dio->fspages_start_off;
 }
 
 /*
@@ -517,6 +548,7 @@ static int get_more_blocks(struct dio *d
 	unsigned long fs_count;	/* Number of filesystem-sized blocks */
 	unsigned long dio_count;/* Number of dio_block-sized blocks */
 	unsigned long blkmask;
+	unsigned long index;
 	int create;
 
 	/*
@@ -544,7 +576,21 @@ static int get_more_blocks(struct dio *d
 		} else if (dio->lock_type == DIO_NO_LOCKING) {
 			create = 0;
 		}
-
+	        index = fs_startblk >> (PAGE_CACHE_SHIFT -
+		                        dio->inode->i_blkbits);
+		if (index >= dio->fspages_end_off) {
+			unsigned long end;
+			unsigned long nr;
+			end = (dio->final_block_in_request >>
+			       dio->blkfactor) >>
+			      (PAGE_CACHE_SHIFT - dio->inode->i_blkbits);
+			nr = min(end - index + 1, (unsigned long)DIO_PAGES);
+			ret = lock_page_range(dio, dio->fspages_end_off, nr);
+			if (ret)
+				goto error;
+			dio->fspages_end_off += nr;
+			BUG_ON(index >= dio->fspages_end_off);
+		}
 		/*
 		 * For writes inside i_size we forbid block creations: only
 		 * overwrites are permitted.  We fall back to buffered writes
@@ -554,6 +600,7 @@ static int get_more_blocks(struct dio *d
 		ret = (*dio->get_block)(dio->inode, fs_startblk,
 						map_bh, create);
 	}
+error:
 	return ret;
 }
 
@@ -944,7 +991,7 @@ out:
 }
 
 /*
- * Releases both i_mutex and i_alloc_sem
+ * Releases both i_mutex
  */
 static ssize_t
 direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, 
@@ -1191,7 +1238,6 @@ __blockdev_direct_IO(int rw, struct kioc
 	ssize_t retval = -EINVAL;
 	loff_t end = offset;
 	struct dio *dio;
-	int release_i_mutex = 0;
 	int acquire_i_mutex = 0;
 
 	if (rw & WRITE)
@@ -1221,11 +1267,14 @@ __blockdev_direct_IO(int rw, struct kioc
 				goto out;
 		}
 	}
-
 	dio = kmalloc(sizeof(*dio), GFP_KERNEL);
 	retval = -ENOMEM;
 	if (!dio)
 		goto out;
+
+	set_page_placeholder(&dio->fake);
+	dio->fspages_start_off = offset >> PAGE_CACHE_SHIFT;
+	dio->fspages_end_off = dio->fspages_start_off;
 
 	/*
 	 * For block device access DIO_NO_LOCKING is used,
@@ -1240,30 +1289,11 @@ __blockdev_direct_IO(int rw, struct kioc
 	if (dio_lock_type != DIO_NO_LOCKING) {
 		/* watch out for a 0 len io from a tricksy fs */
 		if (rw == READ && end > offset) {
-			struct address_space *mapping;
-
-			mapping = iocb->ki_filp->f_mapping;
-			if (dio_lock_type != DIO_OWN_LOCKING) {
-				mutex_lock(&inode->i_mutex);
-				release_i_mutex = 1;
-			}
-
-			retval = filemap_write_and_wait_range(mapping, offset,
-							      end - 1);
-			if (retval) {
-				kfree(dio);
-				goto out;
-			}
-
 			if (dio_lock_type == DIO_OWN_LOCKING) {
 				mutex_unlock(&inode->i_mutex);
 				acquire_i_mutex = 1;
 			}
 		}
-
-		if (dio_lock_type == DIO_LOCKING)
-			/* lockdep: not the owner will release it */
-			down_read_non_owner(&inode->i_alloc_sem);
 	}
 
 	/*
@@ -1278,13 +1308,8 @@ __blockdev_direct_IO(int rw, struct kioc
 	retval = direct_io_worker(rw, iocb, inode, iov, offset,
 				nr_segs, blkbits, get_block, end_io, dio);
 
-	if (rw == READ && dio_lock_type == DIO_LOCKING)
-		release_i_mutex = 0;
-
 out:
-	if (release_i_mutex)
-		mutex_unlock(&inode->i_mutex);
-	else if (acquire_i_mutex)
+	if (acquire_i_mutex)
 		mutex_lock(&inode->i_mutex);
 	return retval;
 }

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/2] O_DIRECT locking rework
  2006-10-24 19:34 ` Chris Mason
  2006-10-24 19:38   ` [RFC PATCH 1/2] placeholder pages Chris Mason
  2006-10-24 19:40   ` [RFC PATCH 2/2] O_DIRECT locking via placeholders Chris Mason
@ 2006-10-24 20:28   ` Badari Pulavarty
  2006-10-24 20:50     ` Chris Mason
  2 siblings, 1 reply; 17+ messages in thread
From: Badari Pulavarty @ 2006-10-24 20:28 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-fsdevel, akpm, Zach Brown

On Tue, 2006-10-24 at 15:34 -0400, Chris Mason wrote:
> Hello,
> 
> Here is a new cut of my O_DIRECT locking rework.  It has a much lower
> cpu cost than the last set, and simple benchmarks no longer show a
> regression here in system time.  But, the complexity for inserting
> placeholder pages has gone up.
> 
> I've also changed the way I test for a place holder page (from
> mm/filemap.c):
> 
> static struct address_space placeholder_address_space;
> #define PagePlaceHolder(page) ((page)->mapping == &placeholder_address_space)
> 
> This is more stable than the last one but I'm just starting to run race
> and load testing on it.
> 
> -chris

Gave it a spin with simple fsx tests and ran into ..

=====================================
[ BUG: bad unlock balance detected! ]
-------------------------------------
fsx-linux/2409 is trying to release lock (&inode->i_mutex) at:
[<ffffffff804a09e9>] mutex_unlock+0x9/0xb
but there are no more locks to release!

other info that might help us debug this:
no locks held by fsx-linux/2409.

stack backtrace:

Call Trace:
 [<ffffffff8020b0d7>] dump_trace+0xaa/0x3ed
 [<ffffffff8020b456>] show_trace+0x3c/0x52
 [<ffffffff8020b481>] dump_stack+0x15/0x17
 [<ffffffff8024b269>] print_unlock_inbalance_bug+0x108/0x118
 [<ffffffff8024cf32>] lock_release+0x92/0x13d
 [<ffffffff804a0977>] __mutex_unlock_slowpath+0xbf/0x128
 [<ffffffff804a09e9>] mutex_unlock+0x9/0xb
 [<ffffffff802b56b8>] __blockdev_direct_IO+0x961/0xb37
 [<ffffffff80301813>] ext3_direct_IO+0xf4/0x192
 [<ffffffff8026794a>] generic_file_direct_IO+0xa2/0xeb
 [<ffffffff8026951e>] generic_file_aio_read+0xc7/0x19d
 [<ffffffff8028f589>] do_sync_read+0xe2/0x126
 [<ffffffff8028fe39>] vfs_read+0xcc/0x172
 [<ffffffff80290240>] sys_read+0x47/0x6f
 [<ffffffff80209cee>] system_call+0x7e/0x83
DWARF2 unwinder stuck at system_call+0x7e/0x83
Leftover inexact backtrace:





^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/2] O_DIRECT locking rework
  2006-10-24 20:28   ` [RFC PATCH 0/2] O_DIRECT locking rework Badari Pulavarty
@ 2006-10-24 20:50     ` Chris Mason
  2006-10-24 21:52       ` Badari Pulavarty
  2006-10-24 22:37       ` Russell Cattelan
  0 siblings, 2 replies; 17+ messages in thread
From: Chris Mason @ 2006-10-24 20:50 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: linux-fsdevel, akpm, Zach Brown

On Tue, Oct 24, 2006 at 01:28:31PM -0700, Badari Pulavarty wrote:
> On Tue, 2006-10-24 at 15:34 -0400, Chris Mason wrote:
> > Hello,
> > 
> > Here is a new cut of my O_DIRECT locking rework.  It has a much lower
> > cpu cost than the last set, and simple benchmarks no longer show a
> > regression here in system time.  But, the complexity for inserting
> > placeholder pages has gone up.
> > 
> > I've also changed the way I test for a place holder page (from
> > mm/filemap.c):
> > 
> > static struct address_space placeholder_address_space;
> > #define PagePlaceHolder(page) ((page)->mapping == &placeholder_address_space)
> > 
> > This is more stable than the last one but I'm just starting to run race
> > and load testing on it.
> > 
> > -chris
> 
> Gave it a spin with simple fsx tests and ran into ..

Thanks, I missed an i_mutex change.  Let me audit mutexes coming in and
out again and resend.

I think the new rules should be:

reads: don't need i_mutex at all
writes: don't need i_mutex for io inside i_size.

Oh, and don't mess with i_mutex under DIO_OWN_LOCKING.

-chris

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/2] O_DIRECT locking rework
  2006-10-24 20:50     ` Chris Mason
@ 2006-10-24 21:52       ` Badari Pulavarty
  2006-10-24 22:22         ` Chris Mason
  2006-10-24 22:37       ` Russell Cattelan
  1 sibling, 1 reply; 17+ messages in thread
From: Badari Pulavarty @ 2006-10-24 21:52 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-fsdevel, akpm, Zach Brown

On Tue, 2006-10-24 at 16:50 -0400, Chris Mason wrote:
> On Tue, Oct 24, 2006 at 01:28:31PM -0700, Badari Pulavarty wrote:
> > On Tue, 2006-10-24 at 15:34 -0400, Chris Mason wrote:
> > > Hello,
> > > 
> > > Here is a new cut of my O_DIRECT locking rework.  It has a much lower
> > > cpu cost than the last set, and simple benchmarks no longer show a
> > > regression here in system time.  But, the complexity for inserting
> > > placeholder pages has gone up.
> > > 
> > > I've also changed the way I test for a place holder page (from
> > > mm/filemap.c):
> > > 
> > > static struct address_space placeholder_address_space;
> > > #define PagePlaceHolder(page) ((page)->mapping == &placeholder_address_space)
> > > 
> > > This is more stable than the last one but I'm just starting to run race
> > > and load testing on it.
> > > 
> > > -chris
> > 
> > Gave it a spin with simple fsx tests and ran into ..
> 
> Thanks, I missed an i_mutex change.  Let me audit mutexes coming in and
> out again and resend.
> 
> I think the new rules should be:
> 
> reads: don't need i_mutex at all
> writes: don't need i_mutex for io inside i_size.
> 
> Oh, and don't mess with i_mutex under DIO_OWN_LOCKING.

No problem.

I ran into one more nasty looking bug ..

------------[ cut here ]------------
kernel BUG at mm/filemap.c:818!
invalid opcode: 0000 [1] SMP
last sysfs file: /block/sda/sda1/size
CPU 0
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth sunrpc af_packet
xt_state ip_conntrack nfnetlink xt_tcpudp ip6table_filter ip6_tables
x_tables ipv6 freq_table processor binfmt_misc parport_pc lp parport
ide_cd cdrom generic tsdev floppy piix e752x_edac edac_mc ehci_hcd
shpchp uhci_hcd i2c_i801 pci_hotplug i2c_core serio_raw usbcore
dm_snapshot dm_zero dm_mirror dm_mod ide_disk ide_core
Pid: 2669, comm: fsx-linux Not tainted 2.6.19-rc2-mm1-smp #3
RIP: 0010:[<ffffffff802667cf>]  [<ffffffff802667cf>]
remove_placeholder_pages+0x6e/0x129
RSP: 0018:ffff8101090cba28  EFLAGS: 00010093
RAX: 0000000000000002 RBX: 0000000000000004 RCX: 0000000000000006
RDX: 0000000000000040 RSI: 0000000000000002 RDI: 0000000000000040
RBP: ffff8101090cba88 R08: ffff810109163fa0 R09: 0000000000001000
R10: 0000000000000002 R11: ffff810109163d88 R12: 0000000000000000
R13: 000000000000003f R14: 0000000000000000 R15: ffff810120739718
FS:  00002ab9e503b860(0000) GS:ffffffff806b6000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000608000 CR3: 00000001096af000 CR4: 00000000000006e0
Process fsx-linux (pid: 2669, threadinfo ffff8101090ca000, task
ffff8101091d3080)
Stack:  0000000000000002 0000000000000040 0000000000000043
ffff8101186506f8
 ffff810118650750 0000000000000002 0000000200000000 ffff810118650608
 0000000000000000 0000000000000000 0000000000000000 00000000fffffff1
Call Trace:
 [<ffffffff802b4d1f>] dio_complete+0x9f/0xd7
 [<ffffffff802b57ee>] __blockdev_direct_IO+0xa97/0xb37
 [<ffffffff80301813>] ext3_direct_IO+0xf4/0x192
 [<ffffffff8026794a>] generic_file_direct_IO+0xa2/0xeb
 [<ffffffff802679f7>] generic_file_direct_write+0x64/0x104
 [<ffffffff8026838e>] __generic_file_aio_write_nolock+0x2e4/0x417
 [<ffffffff80268525>] generic_file_aio_write+0x64/0xc0
 [<ffffffff802fce06>] ext3_file_write+0x1e/0x9b
 [<ffffffff8028f463>] do_sync_write+0xe2/0x126
 [<ffffffff8028fcc7>] vfs_write+0xcf/0x175
 [<ffffffff802902af>] sys_write+0x47/0x70
 [<ffffffff80209cee>] system_call+0x7e/0x83
DWARF2 unwinder stuck at system_call+0x7e/0x83
Leftover inexact backtrace:


Code: 0f 0b eb fe 49 8b 1e 48 81 7b 18 40 2d c1 80 75 18 48 3b 5d
RIP  [<ffffffff802667cf>] remove_placeholder_pages+0x6e/0x129
 RSP <ffff8101090cba28>
 <3>BUG: sleeping function called from invalid context at
kernel/rwsem.c:20
in_atomic():0, irqs_disabled():1

Call Trace:
 [<ffffffff8020b0d7>] dump_trace+0xaa/0x3ed
 [<ffffffff8020b456>] show_trace+0x3c/0x52
 [<ffffffff8020b481>] dump_stack+0x15/0x17
 [<ffffffff8022a2ad>] __might_sleep+0xb2/0xb4
 [<ffffffff80248d8e>] down_read+0x1d/0x4a
 [<ffffffff8023f867>] blocking_notifier_call_chain+0x1b/0x40
 [<ffffffff80234195>] profile_task_exit+0x15/0x17
 [<ffffffff80235c1a>] do_exit+0x25/0x8a5
 [<ffffffff8020b75c>] kernel_math_error+0x0/0x96
 [<ffffffff804a323c>] do_trap+0xdf/0xee
 [<ffffffff8020bd87>] do_invalid_op+0xac/0xb8
 [<ffffffff804a2b1d>] error_exit+0x0/0x96
DWARF2 unwinder stuck at error_exit+0x0/0x96
Leftover inexact backtrace:
 [<ffffffff802667cf>] remove_placeholder_pages+0x6e/0x129
 [<ffffffff802667b8>] remove_placeholder_pages+0x57/0x129
 [<ffffffff802b4d1f>] dio_complete+0x9f/0xd7
 [<ffffffff802b57ee>] __blockdev_direct_IO+0xa97/0xb37
 [<ffffffff80301813>] ext3_direct_IO+0xf4/0x192
 [<ffffffff802ffc10>] ext3_get_block+0x0/0xec
 [<ffffffff8026794a>] generic_file_direct_IO+0xa2/0xeb
 [<ffffffff802679f7>] generic_file_direct_write+0x64/0x104
 [<ffffffff8026838e>] __generic_file_aio_write_nolock+0x2e4/0x417
 [<ffffffff804a0c6d>] mutex_lock+0x2a/0x2e
 [<ffffffff80268525>] generic_file_aio_write+0x64/0xc0
 [<ffffffff802fce06>] ext3_file_write+0x1e/0x9b
 [<ffffffff8028f463>] do_sync_write+0xe2/0x126
 [<ffffffff80246220>] autoremove_wake_function+0x0/0x38
 [<ffffffff8028fcc7>] vfs_write+0xcf/0x175
 [<ffffffff802902af>] sys_write+0x47/0x70
 [<ffffffff80209cee>] system_call+0x7e/0x83

NMI Watchdog detected LOCKUP on CPU 0
CPU 0
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth sunrpc af_packet
xt_state ip_conntrack nfnetlink xt_tcpudp ip6table_filter ip6_tables
x_tables ipv6 freq_table processor binfmt_misc parport_pc lp parport
ide_cd cdrom generic tsdev floppy piix e752x_edac edac_mc ehci_hcd
shpchp uhci_hcd i2c_i801 pci_hotplug i2c_core serio_raw usbcore
dm_snapshot dm_zero dm_mirror dm_mod ide_disk ide_core
Pid: 106, comm: pdflush Not tainted 2.6.19-rc2-mm1-smp #3
RIP: 0010:[<ffffffff8036d445>]  [<ffffffff8036d445>] __read_lock_failed
+0x5/0x14RSP: 0018:ffff8101210d9b38  EFLAGS: 00000097
RAX: ffff810120c480c0 RBX: ffff810120739730 RCX: 0000000000000002
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff810120739730
RBP: ffff8101210d9b50 R08: 0000000000000002 R09: 0000000000000001
R10: ffffffff80266222 R11: ffff810120c480c0 R12: 0000000000000000
R13: 000000000000000e R14: ffff8101210d9c30 R15: ffff810120739730
FS:  0000000000000000(0000) GS:ffffffff806b6000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002b8ea4469b20 CR3: 000000011576d000 CR4: 00000000000006e0
Process pdflush (pid: 106, threadinfo ffff8101210d8000, task
ffff810120c480c0)
Stack:  ffffffff8036e17d ffff810120c47988 ffff810120739730
ffff8101210d9b70
 ffffffff804a25f2 ffff8101210d9c30 ffff810120739720 ffff8101210d9bb0
 ffffffff80266222 ffff8101210d9ca8 ffff8101210d9c20 ffff8101210d9e20
Call Trace:
Inexact backtrace:
 [<ffffffff8036e17d>] _raw_read_lock+0x2f/0x33
 [<ffffffff804a25f2>] _read_lock_irq+0x36/0x3b
 [<ffffffff80266222>] find_get_pages_tag+0x31/0xce
 [<ffffffff8026ed64>] pagevec_lookup_tag+0x22/0x2b
 [<ffffffff8026c920>] generic_writepages+0x309/0x36d
 [<ffffffff8030131d>] ext3_ordered_writepage+0x0/0x199
 [<ffffffff8026c9b6>] do_writepages+0x32/0x3a
 [<ffffffff802ab6e1>] __writeback_single_inode+0x1ab/0x37d
 [<ffffffff802abcbf>] generic_sync_sb_inodes+0x1e8/0x2e2
 [<ffffffff802abdd6>] sync_sb_inodes+0x1d/0x1f
 [<ffffffff802ac181>] writeback_inodes+0x8f/0xde
 [<ffffffff8026d2c9>] pdflush+0x0/0x1f5
 [<ffffffff8026cec6>] wb_kupdate+0x9f/0x113
 [<ffffffff8026d413>] pdflush+0x14a/0x1f5
 [<ffffffff8026ce27>] wb_kupdate+0x0/0x113
 [<ffffffff802460ee>] kthread+0xea/0x120
 [<ffffffff8020ab25>] child_rip+0xa/0x15
 [<ffffffff804a2622>] _spin_unlock_irq+0x2b/0x31
 [<ffffffff8020a29c>] restore_args+0x0/0x30
 [<ffffffff80242659>] run_workqueue+0x19/0xf5
 [<ffffffff80246004>] kthread+0x0/0x120
 [<ffffffff8020ab1b>] child_rip+0x0/0x15


Code: 83 3f 01 78 f9 f0 ff 0f 0f 88 ed ff ff ff c3 55 31 c9 48 89



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/2] O_DIRECT locking rework
  2006-10-24 21:52       ` Badari Pulavarty
@ 2006-10-24 22:22         ` Chris Mason
  2006-10-24 22:47           ` Badari Pulavarty
  0 siblings, 1 reply; 17+ messages in thread
From: Chris Mason @ 2006-10-24 22:22 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: linux-fsdevel, akpm, Zach Brown

On Tue, Oct 24, 2006 at 02:52:03PM -0700, Badari Pulavarty wrote:
> > 
> > Thanks, I missed an i_mutex change.  Let me audit mutexes coming in and
> > out again and resend.
> > 
> > I think the new rules should be:
> > 
> > reads: don't need i_mutex at all
> > writes: don't need i_mutex for io inside i_size.
> > 
> > Oh, and don't mess with i_mutex under DIO_OWN_LOCKING.
> 
> No problem.
> 
> I ran into one more nasty looking bug ..

Neat.  Which fsx command line did you use?

-chris


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/2] O_DIRECT locking rework
  2006-10-24 22:22         ` Chris Mason
@ 2006-10-24 22:47           ` Badari Pulavarty
  0 siblings, 0 replies; 17+ messages in thread
From: Badari Pulavarty @ 2006-10-24 22:47 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-fsdevel, akpm, Zach Brown

On Tue, 2006-10-24 at 18:22 -0400, Chris Mason wrote:
> On Tue, Oct 24, 2006 at 02:52:03PM -0700, Badari Pulavarty wrote:
> > > 
> > > Thanks, I missed an i_mutex change.  Let me audit mutexes coming in and
> > > out again and resend.
> > > 
> > > I think the new rules should be:
> > > 
> > > reads: don't need i_mutex at all
> > > writes: don't need i_mutex for io inside i_size.
> > > 
> > > Oh, and don't mess with i_mutex under DIO_OWN_LOCKING.
> > 
> > No problem.
> > 
> > I ran into one more nasty looking bug ..
> 
> Neat.  Which fsx command line did you use?
> 

Not sure exactly which one .. but one of these :)

./fsx-linux -n -N 10000 test/junkfile
./fsx-linux -n -N 10000 test/junkfile
./fsx-linux -n -N 10000 test/junkfile
./fsx-linux -n -N 10000 test/junkfile
./fsx-linux -n -N 10000 test/junkfile
./fsx-linux -n -N 10000 test/junkfile
./fsx-linux -n -N 10000 test/junkfile
./fsx-linux -n -N 10000 test/junkfile
./fsx-linux -n -N 10000 test/junkfile
./fsx-linux -n -N 10000 test/junkfile
./fsx-linux -r 4096 -N 10000 test/junkfile
./fsx-linux -r 2048 -N 10000 test/junkfile
./fsx-linux -b 1000 -N 10000 test/junkfile
./fsx-linux -s 1 -N 10000 test/junkfile
./fsx-linux -l 500000 -r 4096 -N 10000 test/junkfile
./fsx-linux -l 500000 -r 2048 -N 10000 test/junkfile
./fsx-linux -l 500000 -r 4096  -t 4096 -N 10000 test/junkfile
./fsx-linux -l 500000 -r 2048  -t 2048 -N 10000 test/junkfile
./fsx-linux -l 500000 -r 4096 -t 4096 -w 4096 -N 10000 test/junkfile
./fsx-linux -l 500000 -r 4096 -t 2048 -w 2048 -N 10000 test/junkfile
./fsx-linux -l 500000 -r 4096 -t 4096 -w 4096  -A -N 10000 test/junkfile
./fsx-linux -l 500000 -r 4096 -t 2048 -w 2048  -A -N 10000 test/junkfile
./fsx-linux -l 500000 -r 4096 -t 4096 -w 4096 -A -O -N 10000 test/junkfile
./fsx-linux -l 500000 -r 4096 -t 2048 -w 2048 -A -O -N 10000 test/junkfile
./fsx-linux -l 500000 -r 4096 -t 4096 -w 4096 -A -S 0 -N 10000 test/junkfile
./fsx-linux -l 500000 -r 4096 -t 2048 -w 2048 -A -S 2000 -N 10000 test/junkfile
./fsx-linux -l 500000 -r 4096 -t 4096 -w 4096 -W -N 10000 test/junkfile
./fsx-linux -l 500000 -r 4096 -t 2048 -w 2048 -W -N 10000 test/junkfile
./fsx-linux -l 500000 -r 4096 -t 4096 -w 4096 -W -A -N 10000 test/junkfile
./fsx-linux -l 500000 -r 4096 -t 2048 -w 2048 -W -A -N 10000 test/junkfile
./fsx-linux -l 500000 -r 4096 -t 4096 -w 4096 -A -R -N 10000 test/junkfile
./fsx-linux -l 500000 -r 4096 -t 2048 -w 2048 -A -R -N 10000 test/junkfile
./fsx-linux -l 500000 -r 4096 -t 4096 -w 4096 -Z -R -W -N 10000 test/junkfile
./fsx-linux -l 500000 -r 4096 -t 2048 -w 2048 -Z -R -W -N 10000 test/junkfile
./fsx-linux -l 500000 -r 4096 -w 4096 -Z -R -W -A  -N 10000 test/junkfile
./fsx-linux -A -N 10000 test/junkfile
./fsx-linux -N 10000 test/junkfile
./fsx-linux -N 10000 test/junkfile
./fsx-linux -N 10000 test/junkfile
./fsx-linux -N 10000 test/junkfile
./fsx-linux -N 10000 -o 1024 test/junkfile
./fsx-linux -N 10000 -o 2048 test/junkfile
./fsx-linux -N 10000 -o 4096 test/junkfile
./fsx-linux -N 10000 -o 8192 test/junkfile
./fsx-linux -N 10000 -o 16384 test/junkfile
./fsx-linux -N 10000 -o 32768 test/junkfile
./fsx-linux -N 10000 -o 128000 test/junkfile
./fsx-linux -N 10000 -o 1024 -A test/junkfile
./fsx-linux -N 10000 -o 2048 -A test/junkfile
./fsx-linux -N 10000 -o 4096 -A test/junkfile
./fsx-linux -N 10000 -o 8192 -A test/junkfile
./fsx-linux -N 10000 -o 16384 -A test/junkfile
./fsx-linux -N 10000 -o 32768 -A test/junkfile
./fsx-linux -N 10000 -o 128000 -A test/junkfile
./fsx-linux -N 10000 -o 1024 -A -l 500000 -r 2048 -t 4096 -w 4096 -Z -R -W test/junkfile
./fsx-linux -N 10000 -o 2048 -A -l 500000 -r 512 -t 2048 -w 4096 -Z -R -W test/junkfile
./fsx-linux -N 10000 -o 4096 -A -l 500000 -r 512 -t 4096 -w 2048 -Z -R -W test/junkfile
./fsx-linux -N 10000 -o 8192 -A -l 500000 -r 1024 -t 2048 -w 2048 -Z -R -W test/junkfile
./fsx-linux -N 10000 -o 16384 -A -l 500000 -r 4096 -t 4096 -w 2048 -Z -R -W test/junkfile
./fsx-linux -N 10000 -o 32768 -A -l 500000 -r 2048 -t 2048 -w 2048 -Z -R -W test/junkfile
./fsx-linux -N 10000 -o 128000 -A -l 500000 -r 512 -t 4096 -w 1024 -Z -R -W test/junkfile
./fsx-linux -N 10000 -o 32768 -r 4096 -w 2048 -Z -R -W test/junkfile
./fsx-linux -N 10000 -o 128000 -r 2048 -w 4096 -Z -R -W test/junkfile
./fsx-linux -N 10000 -o 1024 -l 500000 -r 4096 -t 4096 -w 4096 -Z -R -W test/junkfile
./fsx-linux -N 10000 -o 2048 -l 500000 -r 4096 -t 2048 -w 2048 -Z -R -W test/junkfile
./fsx-linux -N 10000 -o 4096 -l 500000 -r 4096 -t 4096 -w 4096 -Z -R -W test/junkfile
./fsx-linux -N 10000 -o 8192 -l 500000 -r 4096 -t 2048 -w 2048 -Z -R -W test/junkfile
./fsx-linux -N 10000 -o 16384 -l 500000 -r 4096 -t 4096 -w 4096 -Z -R -W test/junkfile
./fsx-linux -N 10000 -o 32768 -l 500000 -r 4096 -t 2048 -w 2048 -Z -R -W test/junkfile
./fsx-linux -N 10000 -o 128000 -l 500000 -r 4096 -t 4096 -w 4096 -Z -R -W test/junkfile
./fsx-linux -N 10000 -o 128000 -l 500000 -r 4096 -t 4096 -w 4096 -Z -R -W test/junkfile
./fsx-linux -N 10000 -o 16384 -A -l 500000 -r 4096 -t 4096 -w 4096 test/junkfile1 &
./fsx-linux -N 10000 -o 32768 -A -l 500000 -r 4096 -t 2048 -w 2048 test/junkfile2 &
./fsx-linux -N 10000 -o 128000 -A -l 500000 -r 4096 -t 4096 -w 4096 test/junkfile3 &
./fsx-linux -N 10000 -o 16384 -A -l 500000 -r 4096 -t 4096 -w 4096 test/junkfile4 &
./fsx-linux -N 10000 -o 32768 -A -l 500000 -r 4096 -t 2048 -w 2048 test/junkfile5 &
./fsx-linux -N 10000 -o 128000 -A -l 500000 -r 4096 -t 4096 -w 4096 test/junkfile6 &
./fsx-linux -N 10000 -A test/junkfile7 &
./fsx-linux -N 100000 -A test/junkfile8 &
./fsx-linux -N 100000 -A test/junkfile9 &
./fsx-linux -N 10000 -o 8192 -A -l 500000 -r 1024 -t 2048 -w 1024 -Z -R -W test/junkfile
./fsx-linux -N 10000 -o 16384 -A -l 500000 -r 2048 -t 4096 -w 1024 -Z -R -W test/junkfile



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/2] O_DIRECT locking rework
  2006-10-24 20:50     ` Chris Mason
  2006-10-24 21:52       ` Badari Pulavarty
@ 2006-10-24 22:37       ` Russell Cattelan
  1 sibling, 0 replies; 17+ messages in thread
From: Russell Cattelan @ 2006-10-24 22:37 UTC (permalink / raw)
  To: Chris Mason; +Cc: Badari Pulavarty, linux-fsdevel, akpm, Zach Brown

[-- Attachment #1: Type: text/plain, Size: 1630 bytes --]

On Tue, 2006-10-24 at 16:50 -0400, Chris Mason wrote:
> On Tue, Oct 24, 2006 at 01:28:31PM -0700, Badari Pulavarty wrote:
> > On Tue, 2006-10-24 at 15:34 -0400, Chris Mason wrote:
> > > Hello,
> > > 
> > > Here is a new cut of my O_DIRECT locking rework.  It has a much lower
> > > cpu cost than the last set, and simple benchmarks no longer show a
> > > regression here in system time.  But, the complexity for inserting
> > > placeholder pages has gone up.
> > > 
> > > I've also changed the way I test for a place holder page (from
> > > mm/filemap.c):
> > > 
> > > static struct address_space placeholder_address_space;
> > > #define PagePlaceHolder(page) ((page)->mapping == &placeholder_address_space)
> > > 
> > > This is more stable than the last one but I'm just starting to run race
> > > and load testing on it.
> > > 
> > > -chris
> > 
> > Gave it a spin with simple fsx tests and ran into ..
> 
> Thanks, I missed an i_mutex change.  Let me audit mutexes coming in and
> out again and resend.
> 
> I think the new rules should be:
> 
> reads: don't need i_mutex at all
> writes: don't need i_mutex for io inside i_size.
> 
> Oh, and don't mess with i_mutex under DIO_OWN_LOCKING.
Can we get rid of DIO_OWN_LOCKING altogether.
It's horribly confusing and overload with regards to the
create flag in get_blocks.


> -chris
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
-- 
Russell Cattelan <cattelan@thebarn.com>

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2006-10-24 22:48 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-10-20 18:32 [RFC PATCH 0/2] O_DIRECT locking rework Chris Mason
2006-10-20 18:38 ` [RFC PATCH 1/2] placeholder pages Chris Mason
2006-10-20 18:41 ` [RFC PATCH 1/2] page cache locking for O_DIRECT Chris Mason
2006-10-20 19:30 ` [RFC PATCH 0/2] O_DIRECT locking rework Andrew Morton
2006-10-20 20:03   ` Zach Brown
2006-10-20 20:12     ` Chris Mason
2006-10-20 20:05   ` Chris Mason
2006-10-20 20:23   ` Andi Kleen
2006-10-24 19:34 ` Chris Mason
2006-10-24 19:38   ` [RFC PATCH 1/2] placeholder pages Chris Mason
2006-10-24 19:40   ` [RFC PATCH 2/2] O_DIRECT locking via placeholders Chris Mason
2006-10-24 20:28   ` [RFC PATCH 0/2] O_DIRECT locking rework Badari Pulavarty
2006-10-24 20:50     ` Chris Mason
2006-10-24 21:52       ` Badari Pulavarty
2006-10-24 22:22         ` Chris Mason
2006-10-24 22:47           ` Badari Pulavarty
2006-10-24 22:37       ` Russell Cattelan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).