All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
To: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
Cc: linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
	"Darrick J. Wong"
	<darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>,
	Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org>,
	Lukas Czerner <lczerner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
Subject: Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch
Date: Wed, 25 Jul 2018 16:28:39 -0600	[thread overview]
Message-ID: <20180725222839.GA28304@linux.intel.com> (raw)
In-Reply-To: <20180711081741.lmr44sp4cmt3f6um-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>

On Wed, Jul 11, 2018 at 10:17:41AM +0200, Jan Kara wrote:
> On Tue 10-07-18 13:10:29, Ross Zwisler wrote:
> > Changes since v3:
> >  * Added an ext4_break_layouts() call to ext4_insert_range() to ensure
> >    that the {ext4,xfs}_break_layouts() calls have the same meaning.
> >    (Dave, Darrick and Jan)
> 
> How about the occasional WARN_ON_ONCE you mention below. Were you able to
> hunt them down?

The root cause of this issue is that while the ei->i_mmap_sem provides
synchronization between ext4_break_layouts() and page faults, it doesn't
provide synchronize us with the direct I/O path.  This exact same issue exists
in XFS AFAICT, with the synchronization tool there being the XFS_MMAPLOCK.

This allows the direct I/O path to do I/O and raise & lower page->_refcount
while we're executing a truncate/hole punch.  This leads to us trying to free
a page with an elevated refcount.

Here's one instance of the race:

CPU 0					CPU 1
-----					-----
ext4_punch_hole()
  ext4_break_layouts() # all pages have refcount=1
  					
					ext4_direct_IO()
					  ... lots of layers ...
					  follow_page_pte()
					    get_page() # elevates refcount
  
  truncate_pagecache_range()
   ... a few layers ...
   dax_disassociate_entry() # sees elevated refcount, WARN_ON_ONCE()

A similar race occurs when the refcount is being dropped while we're running
ext4_break_layouts(), and this is the one that my test was actually hitting:

CPU 0					CPU 1
-----					-----
					ext4_direct_IO()
					  ... lots of layers ...
					  follow_page_pte()
					    get_page()
					    # elevates refcount of page X
ext4_punch_hole()
  ext4_break_layouts() # two pages, X & Y, have refcount == 2
    __wait_var_event() # called for page X
  					
					  __put_devmap_managed_page()
					  # drops refcount of X to 1

   # __wait_var_events() checks X's refcount in "if (condition)", and breaks.
   # We never actually called ext4_wait_dax_page(), so 'retry' in
   # ext4_break_layouts() is still false.  Exit do/while loop in
   # ext4_break_layouts, never attempting to wait on page Y which still has an
   # elevated refcount of 2.

  truncate_pagecache_range()
   ... a few layers ...
   dax_disassociate_entry() # sees elevated refcount for Y, WARN_ON_ONCE()

This second race can be fixed with the patch at the end of this function,
which I think should go in, unless there is a benfit to the current retry
scheme which relies on the 'retry' variable in {ext4,xfs}_break_layouts()?
With this patch applied I've been able to run my unit test through
thousands of iterations, where it used to failed consistently within 10 or
so.

Even so, I wonder if the real solution is to add synchronization between
the direct I/O path and {ext4,xfs}_break_layouts()?  Other ideas on how
this should be handled?

--- >8 ---

>From a4519b0f40362f0a63ae96acaf986092aff0f0d3 Mon Sep 17 00:00:00 2001
From: Ross Zwisler <zwisler-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Date: Wed, 25 Jul 2018 16:16:05 -0600
Subject: [PATCH] ext4: Close race between direct IO and ext4_break_layouts()

If the refcount of a page is lowered between the time that it is returned
by dax_busy_page() and when the refcount is again checked in
ext4_break_layouts() => ___wait_var_event(), the waiting function
ext4_wait_dax_page() will never be called.  This means that
ext4_break_layouts() will still have 'retry' set to false, so we'll stop
looping and never check the refcount of other pages in this inode.

Instead, always continue looping as long as dax_layout_busy_page() gives us
a page which it found with an elevated refcount.

Note that this works around the race exposed by my unit test, but I think
that there is another race that needs to be addressed, probably with
additional synchronization added between direct I/O and
{ext4,xfs}_break_layouts().

Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
---
 fs/ext4/inode.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 27e9643bc13b..fedb88104bbf 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4193,9 +4193,8 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
 	return 0;
 }
 
-static void ext4_wait_dax_page(struct ext4_inode_info *ei, bool *did_unlock)
+static void ext4_wait_dax_page(struct ext4_inode_info *ei)
 {
-	*did_unlock = true;
 	up_write(&ei->i_mmap_sem);
 	schedule();
 	down_write(&ei->i_mmap_sem);
@@ -4205,14 +4204,12 @@ int ext4_break_layouts(struct inode *inode)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct page *page;
-	bool retry;
 	int error;
 
 	if (WARN_ON_ONCE(!rwsem_is_locked(&ei->i_mmap_sem)))
 		return -EINVAL;
 
 	do {
-		retry = false;
 		page = dax_layout_busy_page(inode->i_mapping);
 		if (!page)
 			return 0;
@@ -4220,8 +4217,8 @@ int ext4_break_layouts(struct inode *inode)
 		error = ___wait_var_event(&page->_refcount,
 				atomic_read(&page->_refcount) == 1,
 				TASK_INTERRUPTIBLE, 0, 0,
-				ext4_wait_dax_page(ei, &retry));
-	} while (error == 0 && retry);
+				ext4_wait_dax_page(ei));
+	} while (error == 0);
 
 	return error;
 }
-- 
2.14.4

WARNING: multiple messages have this Message-ID (diff)
From: Ross Zwisler <ross.zwisler@linux.intel.com>
To: Jan Kara <jack@suse.cz>
Cc: linux-nvdimm@lists.01.org,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	Dave Chinner <david@fromorbit.com>,
	Lukas Czerner <lczerner@redhat.com>,
	linux-ext4@vger.kernel.org, Christoph Hellwig <hch@lst.de>
Subject: Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch
Date: Wed, 25 Jul 2018 16:28:39 -0600	[thread overview]
Message-ID: <20180725222839.GA28304@linux.intel.com> (raw)
In-Reply-To: <20180711081741.lmr44sp4cmt3f6um@quack2.suse.cz>

On Wed, Jul 11, 2018 at 10:17:41AM +0200, Jan Kara wrote:
> On Tue 10-07-18 13:10:29, Ross Zwisler wrote:
> > Changes since v3:
> >  * Added an ext4_break_layouts() call to ext4_insert_range() to ensure
> >    that the {ext4,xfs}_break_layouts() calls have the same meaning.
> >    (Dave, Darrick and Jan)
> 
> How about the occasional WARN_ON_ONCE you mention below. Were you able to
> hunt them down?

The root cause of this issue is that while the ei->i_mmap_sem provides
synchronization between ext4_break_layouts() and page faults, it doesn't
provide synchronize us with the direct I/O path.  This exact same issue exists
in XFS AFAICT, with the synchronization tool there being the XFS_MMAPLOCK.

This allows the direct I/O path to do I/O and raise & lower page->_refcount
while we're executing a truncate/hole punch.  This leads to us trying to free
a page with an elevated refcount.

Here's one instance of the race:

CPU 0					CPU 1
-----					-----
ext4_punch_hole()
  ext4_break_layouts() # all pages have refcount=1
  					
					ext4_direct_IO()
					  ... lots of layers ...
					  follow_page_pte()
					    get_page() # elevates refcount
  
  truncate_pagecache_range()
   ... a few layers ...
   dax_disassociate_entry() # sees elevated refcount, WARN_ON_ONCE()

A similar race occurs when the refcount is being dropped while we're running
ext4_break_layouts(), and this is the one that my test was actually hitting:

CPU 0					CPU 1
-----					-----
					ext4_direct_IO()
					  ... lots of layers ...
					  follow_page_pte()
					    get_page()
					    # elevates refcount of page X
ext4_punch_hole()
  ext4_break_layouts() # two pages, X & Y, have refcount == 2
    __wait_var_event() # called for page X
  					
					  __put_devmap_managed_page()
					  # drops refcount of X to 1

   # __wait_var_events() checks X's refcount in "if (condition)", and breaks.
   # We never actually called ext4_wait_dax_page(), so 'retry' in
   # ext4_break_layouts() is still false.  Exit do/while loop in
   # ext4_break_layouts, never attempting to wait on page Y which still has an
   # elevated refcount of 2.

  truncate_pagecache_range()
   ... a few layers ...
   dax_disassociate_entry() # sees elevated refcount for Y, WARN_ON_ONCE()

This second race can be fixed with the patch at the end of this function,
which I think should go in, unless there is a benfit to the current retry
scheme which relies on the 'retry' variable in {ext4,xfs}_break_layouts()?
With this patch applied I've been able to run my unit test through
thousands of iterations, where it used to failed consistently within 10 or
so.

Even so, I wonder if the real solution is to add synchronization between
the direct I/O path and {ext4,xfs}_break_layouts()?  Other ideas on how
this should be handled?

--- >8 ---

>From a4519b0f40362f0a63ae96acaf986092aff0f0d3 Mon Sep 17 00:00:00 2001
From: Ross Zwisler <zwisler@kernel.org>
Date: Wed, 25 Jul 2018 16:16:05 -0600
Subject: [PATCH] ext4: Close race between direct IO and ext4_break_layouts()

If the refcount of a page is lowered between the time that it is returned
by dax_busy_page() and when the refcount is again checked in
ext4_break_layouts() => ___wait_var_event(), the waiting function
ext4_wait_dax_page() will never be called.  This means that
ext4_break_layouts() will still have 'retry' set to false, so we'll stop
looping and never check the refcount of other pages in this inode.

Instead, always continue looping as long as dax_layout_busy_page() gives us
a page which it found with an elevated refcount.

Note that this works around the race exposed by my unit test, but I think
that there is another race that needs to be addressed, probably with
additional synchronization added between direct I/O and
{ext4,xfs}_break_layouts().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/ext4/inode.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 27e9643bc13b..fedb88104bbf 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4193,9 +4193,8 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
 	return 0;
 }
 
-static void ext4_wait_dax_page(struct ext4_inode_info *ei, bool *did_unlock)
+static void ext4_wait_dax_page(struct ext4_inode_info *ei)
 {
-	*did_unlock = true;
 	up_write(&ei->i_mmap_sem);
 	schedule();
 	down_write(&ei->i_mmap_sem);
@@ -4205,14 +4204,12 @@ int ext4_break_layouts(struct inode *inode)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct page *page;
-	bool retry;
 	int error;
 
 	if (WARN_ON_ONCE(!rwsem_is_locked(&ei->i_mmap_sem)))
 		return -EINVAL;
 
 	do {
-		retry = false;
 		page = dax_layout_busy_page(inode->i_mapping);
 		if (!page)
 			return 0;
@@ -4220,8 +4217,8 @@ int ext4_break_layouts(struct inode *inode)
 		error = ___wait_var_event(&page->_refcount,
 				atomic_read(&page->_refcount) == 1,
 				TASK_INTERRUPTIBLE, 0, 0,
-				ext4_wait_dax_page(ei, &retry));
-	} while (error == 0 && retry);
+				ext4_wait_dax_page(ei));
+	} while (error == 0);
 
 	return error;
 }
-- 
2.14.4

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

  parent reply	other threads:[~2018-07-25 22:28 UTC|newest]

Thread overview: 58+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-07-10 19:10 [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch Ross Zwisler
2018-07-10 19:10 ` Ross Zwisler
     [not found] ` <20180710191031.17919-1-ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
2018-07-10 19:10   ` [PATCH v4 1/2] dax: dax_layout_busy_page() warn on !exceptional Ross Zwisler
2018-07-10 19:10     ` Ross Zwisler
     [not found]     ` <20180710191031.17919-2-ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
2018-08-10 19:52       ` Eric Sandeen
2018-08-10 19:52         ` Eric Sandeen
     [not found]         ` <6ca60cf6-f1dc-27d5-fb79-7547aea39dfa-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2018-08-10 20:33           ` Theodore Y. Ts'o
2018-08-10 20:33             ` Theodore Y. Ts'o
     [not found]             ` <20180810203349.GH627-AKGzg7BKzIDYtjvyW6yDsg@public.gmane.org>
2018-08-11  2:10               ` Theodore Y. Ts'o
2018-08-11  2:10                 ` Theodore Y. Ts'o
     [not found]                 ` <20180811021053.GB14368-AKGzg7BKzIDYtjvyW6yDsg@public.gmane.org>
2018-08-13 10:12                   ` Jan Kara
2018-08-13 10:12                     ` Jan Kara
     [not found]                     ` <20180813101252.GC8927-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
2018-08-13 12:46                       ` Theodore Y. Ts'o
2018-08-13 12:46                         ` Theodore Y. Ts'o
2018-08-24 15:44                       ` Jan Kara
2018-08-24 15:44                         ` Jan Kara
2018-08-27 16:09                       ` Jan Kara
2018-08-27 16:09                         ` Jan Kara
2018-07-10 19:10   ` [PATCH v4 2/2] ext4: handle layout changes to pinned DAX mappings Ross Zwisler
2018-07-10 19:10     ` Ross Zwisler
2018-07-11  8:17   ` [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch Jan Kara
2018-07-11  8:17     ` Jan Kara
     [not found]     ` <20180711081741.lmr44sp4cmt3f6um-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
2018-07-11 15:41       ` Ross Zwisler
2018-07-11 15:41         ` Ross Zwisler
2018-07-25 22:28       ` Ross Zwisler [this message]
2018-07-25 22:28         ` Ross Zwisler
     [not found]         ` <20180725222839.GA28304-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
2018-07-27 16:28           ` Ross Zwisler
2018-07-27 16:28             ` Ross Zwisler
2018-07-27 16:28             ` Ross Zwisler
     [not found]             ` <CAOxpaSWCT+XRtGxGQT_PVMcn79csMxbqb+cf4yycbV3eqjjnkQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-08-06  3:55               ` Dave Chinner
2018-08-06  3:55                 ` Dave Chinner
2018-08-06  3:55                 ` Dave Chinner
2018-08-06 15:49                 ` Christoph Hellwig
2018-08-06 15:49                   ` Christoph Hellwig
2018-08-06 15:49                   ` Christoph Hellwig
     [not found]                   ` <20180806154943.GA17666-jcswGhMUV9g@public.gmane.org>
2018-08-06 22:25                     ` Dave Chinner
2018-08-06 22:25                       ` Dave Chinner
2018-08-06 22:25                       ` Dave Chinner
2018-08-07  8:45               ` Jan Kara
2018-08-07  8:45                 ` Jan Kara
2018-08-07  8:45                 ` Jan Kara
     [not found]                 ` <20180807084545.6gzxrtrvb34hyhdq-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
2018-09-10 22:18                   ` Eric Sandeen
2018-09-10 22:18                     ` Eric Sandeen
2018-09-10 22:18                     ` Eric Sandeen
     [not found]                     ` <8c70d61a-fc5c-b928-334a-fbb2567b8dea-+82itfer+wXR7s880joybQ@public.gmane.org>
2018-09-11 15:14                       ` Jan Kara
2018-09-11 15:14                         ` Jan Kara
2018-09-11 15:14                         ` Jan Kara
     [not found]                         ` <20180911151421.GD6104-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
2018-09-11 15:20                           ` Jan Kara
2018-09-11 15:20                             ` Jan Kara
2018-09-11 15:20                             ` Jan Kara
     [not found]                             ` <20180911152024.GE6104-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
2018-09-11 17:28                               ` Theodore Y. Ts'o
2018-09-11 17:28                                 ` Theodore Y. Ts'o
2018-09-11 17:28                                 ` Theodore Y. Ts'o
     [not found]                                 ` <20180911172806.GC2225-AKGzg7BKzIDYtjvyW6yDsg@public.gmane.org>
2018-09-11 18:21                                   ` Eric Sandeen
2018-09-11 18:21                                     ` Eric Sandeen
2018-09-11 18:21                                     ` Eric Sandeen
2018-07-31 19:44   ` Ross Zwisler
2018-07-31 19:44     ` Ross Zwisler

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180725222839.GA28304@linux.intel.com \
    --to=ross.zwisler-vuqaysv1563yd54fqh9/ca@public.gmane.org \
    --cc=darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org \
    --cc=david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org \
    --cc=hch-jcswGhMUV9g@public.gmane.org \
    --cc=jack-AlSwsSmVLrQ@public.gmane.org \
    --cc=lczerner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org \
    --cc=linux-ext4-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.