public inbox for linux-kernel@vger.kernel.org
* [patch] aio: invalidate async directio writes
@ 2008-06-18 18:09 Jeff Moyer
  2008-06-18 18:22 ` Christoph Hellwig
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Jeff Moyer @ 2008-06-18 18:09 UTC (permalink / raw)
  To: akpm; +Cc: zach.brown, linux-aio, linux-kernel

Hi, Andrew,

This is a follow-up to:

commit bdb76ef5a4bc8676a81034a443f1eda450b4babb
Author: Zach Brown <zach.brown@oracle.com>
Date:   Tue Oct 30 11:45:46 2007 -0700

    dio: fix cache invalidation after sync writes
    
    Commit 65b8291c4000e5f38fc94fb2ca0cb7e8683c8a1b ("dio: invalidate
    clean pages before dio write") introduced a bug which stopped dio from
    ever invalidating the page cache after writes.  It still invalidated it
    before writes, so most users were fine.
    
    Karl Schendel reported ( http://lkml.org/lkml/2007/10/26/481 ) hitting
    this bug when he had a buffered reader immediately reading file data
    after an O_DIRECT [writer] had written the data.  The kernel issued
    read-ahead beyond the position of the reader which overlapped with the
    O_DIRECT writer.  The failure to invalidate after writes caused the
    reader to see stale data from the read-ahead.
    
    The following patch is originally from Karl.  The following commentary
    is his:
    
        The below 3rd try takes on your suggestion of just invalidating
        no matter what the retval from the direct_IO call.  I ran it
        through the test-case several times and it has worked every time.
        The post-invalidate is probably still too early for async-directio,
        but I don't have a testcase for that;  just sync.  And, this
        won't be any worse in the async case.
    
    I added a test to the aio-dio-regress repository which mimics Karl's IO
    pattern.  It verified the bad behaviour and that the patch fixed it.  I
    agree with Karl that this still doesn't help the case where a buffered
    reader follows an AIO O_DIRECT writer.  That will require a bit more
    work.
    
    This gives up on the idea of returning EIO to indicate to userspace that
    stale data remains if the invalidation failed.

Note the second-to-last paragraph, where it mentions that this does not fix
the AIO case.  I updated the regression test to also perform asynchronous
I/O and verified that the problem does exist.

To fix the problem, we need to invalidate the pages that were under write
I/O after the I/O completes.  Because the I/O completion handler can be
called in interrupt context, where invalidate_inode_pages2 cannot run,
this patch defers completion to a workqueue.  That workqueue is
responsible for invalidating the page cache pages and completing the I/O.

I verified that the test case passes with the following patch applied.
Since this affects the code path for all async writers, I had our
performance group run a patched kernel through an OLTP workload.  I'm
told that the noise level for this test is in the 5% range.  The results
showed a 2.9% degradation in performance at 80 users, and 80 users was
the peak performance for the test.  For other user counts, the patched
kernel varied from a percent better to a couple of percent worse.  So,
overall I think the patch is worth taking, given that it fixes a
potential data corrupter.  (Sorry I can't report the actual numbers,
since the tests were run on unreleased hardware.)

Comments, as always, are encouraged.

Cheers,

Jeff

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 9e81add..90fa9e2 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -131,6 +131,8 @@ struct dio {
 	int is_async;			/* is IO async ? */
 	int io_error;			/* IO error in completion path */
 	ssize_t result;                 /* IO result */
+	struct list_head done_list;	/* AIO DIO writes to be completed
+					 * in process context */
 };
 
 /*
@@ -260,6 +262,40 @@ static int dio_complete(struct dio *dio, loff_t offset, int ret)
 	return ret;
 }
 
+static void aio_complete_fn(void *data);
+static DECLARE_WORK(aio_complete_work, aio_complete_fn, NULL);
+static DEFINE_SPINLOCK(iocb_completion_list_lock);
+static LIST_HEAD(iocb_completion_list);
+
+static void aio_complete_fn(void *data)
+{
+	unsigned long flags;
+	LIST_HEAD(private);
+	struct dio *dio, *tmp;
+
+	spin_lock_irqsave(&iocb_completion_list_lock, flags);
+	list_splice_init(&iocb_completion_list, &private);
+	spin_unlock_irqrestore(&iocb_completion_list_lock, flags);
+
+	list_for_each_entry_safe(dio, tmp, &private, done_list) {
+		struct kiocb *iocb = dio->iocb;
+		loff_t offset = iocb->ki_pos;
+		struct file *file = iocb->ki_filp;
+		struct address_space *mapping = file->f_mapping;
+		int ret;
+		pgoff_t end = (offset + dio->size - 1) >> PAGE_CACHE_SHIFT;
+
+		/* and now invalidate the page cache */
+		ret = dio_complete(dio, offset, 0);
+		if (mapping->nrpages)
+			invalidate_inode_pages2_range(mapping,
+						offset >> PAGE_CACHE_SHIFT, end);
+		aio_complete(dio->iocb, ret, 0);
+		list_del(&dio->done_list);
+		kfree(dio);
+	}
+}
+
 static int dio_bio_complete(struct dio *dio, struct bio *bio);
 /*
  * Asynchronous IO callback. 
@@ -280,9 +316,22 @@ static void dio_bio_end_aio(struct bio *bio, int error)
 	spin_unlock_irqrestore(&dio->bio_lock, flags);
 
 	if (remaining == 0) {
-		int ret = dio_complete(dio, dio->iocb->ki_pos, 0);
-		aio_complete(dio->iocb, ret, 0);
-		kfree(dio);
+		/* For async O_DIRECT writes, we need to invalidate the 
+		 * page cache after the write completes.  Kick off a
+		 * workqueue to do this and issue the completion in process
+		 * context.
+		 */
+		if (dio->rw == READ) {
+			int ret = dio_complete(dio, dio->iocb->ki_pos, 0);
+			aio_complete(dio->iocb, ret, 0);
+			kfree(dio);
+		} else {
+			unsigned long flags;
+			spin_lock_irqsave(&iocb_completion_list_lock, flags);
+			list_add(&dio->done_list, &iocb_completion_list);
+			spin_unlock_irqrestore(&iocb_completion_list_lock, flags);
+			schedule_work(&aio_complete_work);
+		}
 	}
 }
 

Thread overview: 10+ messages
2008-06-18 18:09 [patch] aio: invalidate async directio writes Jeff Moyer
2008-06-18 18:22 ` Christoph Hellwig
2008-06-18 19:45   ` Jeff Moyer
2008-06-18 19:48     ` Christoph Hellwig
2008-06-19  7:51 ` Peter Zijlstra
2008-06-19 13:50   ` Jeff Moyer
2008-06-19 13:58     ` Peter Zijlstra
2008-06-19 14:05       ` Jeff Moyer
2008-06-19 17:50   ` Zach Brown
2008-06-19 17:23 ` Zach Brown
