linux-ext4.vger.kernel.org archive mirror
* [PATCH] ext4: Prevent race while walking extent tree
@ 2012-11-08 11:08 Lukas Czerner
  2012-11-08 12:01 ` Dmitry Monakhov
  2012-11-08 21:52 ` Zach Brown
  0 siblings, 2 replies; 16+ messages in thread
From: Lukas Czerner @ 2012-11-08 11:08 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, Lukas Czerner

Currently ext4_ext_walk_space() takes i_data_sem for read only while
searching for the extent at a given block with ext4_ext_find_extent().
It then drops the lock, so the extent tree can change at will. Later on
we search for the 'next' extent, but by then the tree might already
have changed, so the information might not be accurate.

In fact we can hit BUG_ON(end <= start) if an extent gets inserted into
the tree after the one we found and before the block we were searching
for. This has been reproduced by running xfstests 225 in a loop on the
s390x architecture. Theoretically we could hit this on any other
architecture as well, though probably not as often.

ext4_ext_walk_space() is currently only used from ext4_fiemap(), and
even if we do not hit the BUG_ON(), fiemap might return scrambled
information to the user.

Fix this by requiring ext4_ext_walk_space() to be called with i_data_sem
held. ext4_fiemap() only needs to take i_data_sem for read, but other
users that want to modify the extents can take it for write instead.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
 fs/ext4/extents.c |    9 +++++++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 7011ac9..f1aca06 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1959,6 +1959,11 @@ cleanup:
 	return err;
 }
 
+/*
+ * ext4_ext_walk_space() should be called with i_data_sem locked. If we're
+ * not modifying found extents, or extent tree in callback function, then
+ * read lock is ok.
+ */
 static int ext4_ext_walk_space(struct inode *inode, ext4_lblk_t block,
 			       ext4_lblk_t num, ext_prepare_callback func,
 			       void *cbdata)
@@ -1976,9 +1981,7 @@ static int ext4_ext_walk_space(struct inode *inode, ext4_lblk_t block,
 	while (block < last && block != EXT_MAX_BLOCKS) {
 		num = last - block;
 		/* find extent for this block */
-		down_read(&EXT4_I(inode)->i_data_sem);
 		path = ext4_ext_find_extent(inode, block, path);
-		up_read(&EXT4_I(inode)->i_data_sem);
 		if (IS_ERR(path)) {
 			err = PTR_ERR(path);
 			path = NULL;
@@ -5021,8 +5024,10 @@ int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		 * Walk the extent tree gathering extent information.
 		 * ext4_ext_fiemap_cb will push extents back to user.
 		 */
+		down_read(&EXT4_I(inode)->i_data_sem);
 		error = ext4_ext_walk_space(inode, start_blk, len_blks,
 					  ext4_ext_fiemap_cb, fieinfo);
+		up_read(&EXT4_I(inode)->i_data_sem);
 	}
 
 	return error;
-- 
1.7.7.6
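To make the failure mode concrete, here is a small userspace model of the race described in the changelog above. It is only a sketch: the flat sorted array and the helpers find_extent(), next_allocated(), insert_extent() and hole_after() are hypothetical stand-ins, not the kernel's extent tree code. Keeping the lookup result across an insert models the window in which i_data_sem is not held.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BLOCKS UINT32_MAX	/* stand-in for EXT_MAX_BLOCKS */

struct extent {
	uint32_t start;		/* first logical block */
	uint32_t len;		/* number of blocks */
};

/* Toy "extent tree": a small array kept sorted by start block. */
struct tree {
	struct extent ex[8];
	size_t count;
};

/* Index of the last extent starting at or before 'block', or -1. */
static int find_extent(const struct tree *t, uint32_t block)
{
	int found = -1;

	for (size_t i = 0; i < t->count && t->ex[i].start <= block; i++)
		found = (int)i;
	return found;
}

/* Start block of the extent following index 'idx', or MAX_BLOCKS. */
static uint32_t next_allocated(const struct tree *t, int idx)
{
	size_t n = (size_t)(idx + 1);

	return n < t->count ? t->ex[n].start : MAX_BLOCKS;
}

/* Sorted insert; models a writer changing the tree between two reads. */
static void insert_extent(struct tree *t, struct extent e)
{
	size_t i;

	assert(t->count < 8);
	i = t->count++;
	while (i > 0 && t->ex[i - 1].start > e.start) {
		t->ex[i] = t->ex[i - 1];
		i--;
	}
	t->ex[i] = e;
}

/*
 * Compute the hole [start, end) for a 'block' that lies past extent 'idx',
 * clamped to 'last' and to the next allocated block.
 * Returns 1 if the range is sane (end > start), 0 otherwise.
 */
static int hole_after(const struct tree *t, int idx, uint32_t block,
		      uint32_t last, uint32_t *start, uint32_t *end)
{
	uint32_t next = next_allocated(t, idx);

	*start = block;
	*end = last < next ? last : next;
	return *end > *start;
}
```

With a single extent covering blocks 0..9, looking up block 20 and then computing the hole up to block 100 gives [20, 100). If an extent starting at block 12 is inserted between the lookup and the call to hole_after(), the same call yields start = 20 and end = 12, which is the end <= start condition that the BUG_ON() checks for.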



* Re: [PATCH] ext4: Prevent race while walking extent tree
  2012-11-08 11:08 [PATCH] " Lukas Czerner
@ 2012-11-08 12:01 ` Dmitry Monakhov
  2012-11-08 13:43   ` Lukáš Czerner
  2012-11-08 21:52 ` Zach Brown
  1 sibling, 1 reply; 16+ messages in thread
From: Dmitry Monakhov @ 2012-11-08 12:01 UTC (permalink / raw)
  To: Lukas Czerner, linux-ext4; +Cc: tytso, Lukas Czerner

On Thu,  8 Nov 2012 12:08:49 +0100, Lukas Czerner <lczerner@redhat.com> wrote:
> Currently ext4_ext_walk_space() only takes i_data_sem for read when
> searching for the extent at given block with ext4_ext_find_extent().
> Then it drops the lock and the extent tree can be changed at will.
> However later on we're searching for the 'next' extent, but the extent
> tree might already have changed, so the information might not be
> accurate.
> 
> In fact we can hit BUG_ON(end <= start) if the extent got inserted into
> the tree after the one we found and before the block we were searching
> for. This has been reproduced by running xfstests 225 in loop on s390x
> architecture, but theoretically we could hit this on any other
> architecture as well, but probably not as often.
> 
> ext4_ext_walk_space() is currently only used from ext4_fiemap() and even
> if we do not hit the BUG_ON() fiemap might return scrambled information
> to the user.
> 
> Fix this by requiring ext4_ext_walk_space() to be called with i_data_sem
> held. By calling it from ext4_fiemap() we can only take the i_data_sem
> for read, but possibly other users might want to modify the extents so
> they will be able to take write lock.
Agreed as a short-term fix for the BUG_ON case, but Theodore suggested
using a seqlock approach: http://lists.openwall.net/linux-ext4/2011/10/26/25
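
For readers unfamiliar with the idea, the general seqlock read/retry pattern looks roughly like the userspace sketch below. This is not the proposal from the linked message, whose details are not reproduced here. The struct and function names are made up for illustration, C11 atomics stand in for the kernel's seqlock primitives, and a single writer is assumed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Plain copy handed back to the reader. */
struct extent_snapshot {
	uint32_t block;
	uint32_t len;
};

/* Hypothetical shared record protected by a sequence counter. */
struct seq_extent {
	atomic_uint seq;	/* even: stable, odd: writer in progress */
	_Atomic uint32_t block;
	_Atomic uint32_t len;
};

/* Writer side: bump the counter to odd, update, bump it back to even. */
static void seq_extent_write(struct seq_extent *s, uint32_t block,
			     uint32_t len)
{
	unsigned int v = atomic_load_explicit(&s->seq, memory_order_relaxed);

	assert((v & 1) == 0);	/* single writer assumed in this sketch */
	atomic_store_explicit(&s->seq, v + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&s->block, block, memory_order_relaxed);
	atomic_store_explicit(&s->len, len, memory_order_relaxed);
	atomic_store_explicit(&s->seq, v + 2, memory_order_release);
}

/*
 * Reader side: takes no lock. Copy the data, then retry if a writer was
 * active or the counter moved while we were copying.
 */
static struct extent_snapshot seq_extent_read(struct seq_extent *s)
{
	struct extent_snapshot out;
	unsigned int begin, end;

	do {
		begin = atomic_load_explicit(&s->seq, memory_order_acquire);
		out.block = atomic_load_explicit(&s->block,
						 memory_order_relaxed);
		out.len = atomic_load_explicit(&s->len, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		end = atomic_load_explicit(&s->seq, memory_order_relaxed);
	} while ((begin & 1) || begin != end);

	return out;
}
```

The reader never blocks the writer, which is what makes the approach attractive against fiemap abuse. The cost is that everything executed inside the retry loop has to tolerate seeing data that is changing underneath it.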


* Re: [PATCH] ext4: Prevent race while walking extent tree
  2012-11-08 12:01 ` Dmitry Monakhov
@ 2012-11-08 13:43   ` Lukáš Czerner
  2012-11-08 16:07     ` Lukáš Czerner
  0 siblings, 1 reply; 16+ messages in thread
From: Lukáš Czerner @ 2012-11-08 13:43 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: Lukas Czerner, linux-ext4, tytso

On Thu, 8 Nov 2012, Dmitry Monakhov wrote:

> Date: Thu, 08 Nov 2012 16:01:17 +0400
> From: Dmitry Monakhov <dmonakhov@openvz.org>
> To: Lukas Czerner <lczerner@redhat.com>, linux-ext4@vger.kernel.org
> Cc: tytso@mit.edu, Lukas Czerner <lczerner@redhat.com>
> Subject: Re: [PATCH] ext4: Prevent race while waling extent tree
> 
> On Thu,  8 Nov 2012 12:08:49 +0100, Lukas Czerner <lczerner@redhat.com> wrote:
> > Currently ext4_ext_walk_space() only takes i_data_sem for read when
> > searching for the extent at given block with ext4_ext_find_extent().
> > Then it drops the lock and the extent tree can be changed at will.
> > However later on we're searching for the 'next' extent, but the extent
> > tree might already have changed, so the information might not be
> > accurate.
> > 
> > In fact we can hit BUG_ON(end <= start) if the extent got inserted into
> > the tree after the one we found and before the block we were searching
> > for. This has been reproduced by running xfstests 225 in loop on s390x
> > architecture, but theoretically we could hit this on any other
> > architecture as well, but probably not as often.
> > 
> > ext4_ext_walk_space() is currently only used from ext4_fiemap() and even
> > if we do not hit the BUG_ON() fiemap might return scrambled information
> > to the user.
> > 
> > Fix this by requiring ext4_ext_walk_space() to be called with i_data_sem
> > held. By calling it from ext4_fiemap() we can only take the i_data_sem
> > for read, but possibly other users might want to modify the extents so
> > they will be able to take write lock.
> Agree as a short term fix for BUGON case, but Theodore suggested to use
> seqlock approach http://lists.openwall.net/linux-ext4/2011/10/26/25

Yeah, it makes sense to protect us from fiemap abuse, however using a
seqlock for walking the extent tree seems like overkill, especially
considering how much work that would require. We would have to make
sure that everything we do in ext4_ext_walk_space() and the other
functions we call from there is safe even if the extent tree changes
under our hands. I do not think this is the right way.

I was thinking about checking for contention on the semaphore from
within ext4_ext_walk_space(), possibly enabling/disabling the check
with a function parameter?

Sadly the kernel does not provide a helper to check for that, so what
about something like this at the beginning of the while loop in
ext4_ext_walk_space()?

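if (check_contention) {
	int contends = 0;
	unsigned long flags;	/* irqsave expects unsigned long */

	/* i_data_sem is embedded in ext4_inode_info, hence '.' not '->' */
	raw_spin_lock_irqsave(&EXT4_I(inode)->i_data_sem.wait_lock, flags);
	if (!list_empty(&EXT4_I(inode)->i_data_sem.wait_list))
		contends = 1;
	raw_spin_unlock_irqrestore(&EXT4_I(inode)->i_data_sem.wait_lock, flags);

	/* somebody is waiting on the semaphore, stop walking */
	if (contends)
		break;
}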

or we can add the helper to the rwsem code and use that.


What do you think ?

Thanks!
-Lukas


* Re: [PATCH] ext4: Prevent race while walking extent tree
  2012-11-08 13:43   ` Lukáš Czerner
@ 2012-11-08 16:07     ` Lukáš Czerner
  0 siblings, 0 replies; 16+ messages in thread
From: Lukáš Czerner @ 2012-11-08 16:07 UTC (permalink / raw)
  To: Lukáš Czerner; +Cc: Dmitry Monakhov, linux-ext4, tytso

On Thu, 8 Nov 2012, Lukáš Czerner wrote:

> Date: Thu, 8 Nov 2012 14:43:19 +0100 (CET)
> From: Lukáš Czerner <lczerner@redhat.com>
> To: Dmitry Monakhov <dmonakhov@openvz.org>
> Cc: Lukas Czerner <lczerner@redhat.com>, linux-ext4@vger.kernel.org,
>     tytso@mit.edu
> Subject: Re: [PATCH] ext4: Prevent race while waling extent tree
> 
> On Thu, 8 Nov 2012, Dmitry Monakhov wrote:
> 
> > Date: Thu, 08 Nov 2012 16:01:17 +0400
> > From: Dmitry Monakhov <dmonakhov@openvz.org>
> > To: Lukas Czerner <lczerner@redhat.com>, linux-ext4@vger.kernel.org
> > Cc: tytso@mit.edu, Lukas Czerner <lczerner@redhat.com>
> > Subject: Re: [PATCH] ext4: Prevent race while waling extent tree
> > 
> > On Thu,  8 Nov 2012 12:08:49 +0100, Lukas Czerner <lczerner@redhat.com> wrote:
> > > Currently ext4_ext_walk_space() only takes i_data_sem for read when
> > > searching for the extent at given block with ext4_ext_find_extent().
> > > Then it drops the lock and the extent tree can be changed at will.
> > > However later on we're searching for the 'next' extent, but the extent
> > > tree might already have changed, so the information might not be
> > > accurate.
> > > 
> > > In fact we can hit BUG_ON(end <= start) if the extent got inserted into
> > > the tree after the one we found and before the block we were searching
> > > for. This has been reproduced by running xfstests 225 in loop on s390x
> > > architecture, but theoretically we could hit this on any other
> > > architecture as well, but probably not as often.
> > > 
> > > ext4_ext_walk_space() is currently only used from ext4_fiemap() and even
> > > if we do not hit the BUG_ON() fiemap might return scrambled information
> > > to the user.
> > > 
> > > Fix this by requiring ext4_ext_walk_space() to be called with i_data_sem
> > > held. By calling it from ext4_fiemap() we can only take the i_data_sem
> > > for read, but possibly other users might want to modify the extents so
> > > they will be able to take write lock.
> > Agree as a short term fix for BUGON case, but Theodore suggested to use
> > seqlock approach http://lists.openwall.net/linux-ext4/2011/10/26/25
> 
> Yeah, it make sense to protect us from fiemap abuse, however using
> seqlock for walking the extent tree seems like an overkill
> especially considering how much work will that require. We would
> have to make sure that everything we do in the ext4_ext_walk_space()
> and other function we're calling there is safe even if the extent
> tree change under our hands. I do not think this is the right way.
> 
> I was thinking about checking for contentions on the semaphore from
> within the ext4_ext_walk_space() - possibly enabling/disabling it
> with a function parameter ?
> 
> Sadly kernel does not provide a helper to check for that so what
> about something like this in the beginning of the while loop in
> ext4_ext_walk_space ?
> 
> if (check_contention) {
> 	int contends = 0;
> 	unsigned int flags;
> 
> 	raw_spin_lock_irqsave(&EXT4_I(inode)->i_data_sem->wait_lock, flags);
> 	if (!list_empty(&EXT4_I(inode)->i_data_sem->wait_list)
> 		contends = 1
> 	raw_spin_unlock_irqrestore(&EXT4_I(inode)->i_data_sem->wait_lock, flags);
> 
> 	if (contends)
> 		break
> }
> 
> or we can add the helper to the rwsem code and use that.
> 
> 
> What do you think ?

Never mind, there is no generic way to tell how many waiters there
are on the semaphore...

-Lukas


* Re: [PATCH] ext4: Prevent race while walking extent tree
  2012-11-08 11:08 [PATCH] " Lukas Czerner
  2012-11-08 12:01 ` Dmitry Monakhov
@ 2012-11-08 21:52 ` Zach Brown
  2012-11-09  9:19   ` Lukáš Czerner
  1 sibling, 1 reply; 16+ messages in thread
From: Zach Brown @ 2012-11-08 21:52 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-ext4, tytso

On Thu, Nov 08, 2012 at 12:08:49PM +0100, Lukas Czerner wrote:
> +		down_read(&EXT4_I(inode)->i_data_sem);
>  		error = ext4_ext_walk_space(inode, start_blk, len_blks,
>  					  ext4_ext_fiemap_cb, fieinfo);
> +		up_read(&EXT4_I(inode)->i_data_sem);

Can this deadlock?  ext4_ext_fiemap_cb() seems to be doing all kinds of
exciting things that might also try and acquire the i_data_sem, like
GFP_KERNEL allocs (reclaim -> writepage) and copying to userspace (mmap
fault -> readpage -> get blocks).

It seems like the safer fix is to broaden the sampling lock coverage to
include referencing all the extent data but to release it around the
callback.

No?

- z
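
The suggestion above could be sketched in userspace as follows. This is a model built on assumptions, not the kernel code: toy_rwsem merely counts readers in place of i_data_sem, the extent map is a flat array, and all names and flag values are hypothetical. What it shows is that everything the callback needs is copied out by value while the lock is held, and the callback itself runs with the lock dropped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a reader/writer semaphore: it only counts readers. */
struct toy_rwsem { int readers; };

static void toy_down_read(struct toy_rwsem *s) { s->readers++; }
static void toy_up_read(struct toy_rwsem *s)
{
	assert(s->readers > 0);
	s->readers--;
}

struct extent { uint32_t start; uint32_t len; int unwritten; };

struct file_map {
	struct toy_rwsem sem;
	struct extent ex[8];	/* sorted by start block */
	size_t count;
};

#define SNAP_UNWRITTEN	0x1u
#define SNAP_LAST	0x2u

/* By-value copy handed to the callback; no pointers into the map. */
struct snapshot { uint32_t start; uint32_t len; unsigned int flags; };

typedef int (*walk_cb)(const struct snapshot *snap, void *data);

/* Walk extents from 'block' on; the lock only covers the sampling. */
static int walk(struct file_map *m, uint32_t block, walk_cb cb, void *data)
{
	for (;;) {
		struct snapshot snap;
		size_t i;
		int err;

		toy_down_read(&m->sem);
		/* Re-lookup by block each time: the map may have changed. */
		for (i = 0; i < m->count; i++)
			if (m->ex[i].start + m->ex[i].len > block)
				break;
		if (i == m->count) {
			toy_up_read(&m->sem);
			return 0;
		}
		snap.start = m->ex[i].start;
		snap.len = m->ex[i].len;
		snap.flags = m->ex[i].unwritten ? SNAP_UNWRITTEN : 0;
		if (i + 1 == m->count)
			snap.flags |= SNAP_LAST;
		toy_up_read(&m->sem);

		/* Callback may sleep, allocate or fault: lock not held. */
		err = cb(&snap, data);
		if (err)
			return err;
		block = snap.start + snap.len;
	}
}

struct collector {
	const struct toy_rwsem *sem;
	int calls;
	unsigned int last_flags;
};

static int collect_cb(const struct snapshot *snap, void *data)
{
	struct collector *c = data;

	assert(c->sem->readers == 0);	/* must run outside the lock */
	c->calls++;
	c->last_flags = snap->flags;
	return 0;
}
```

Because the lock is dropped on every iteration, the walk has to look the position up again by block number instead of trusting a pointer or index from the previous round, and the result is only a best-effort view if the map is being modified concurrently.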


* Re: [PATCH] ext4: Prevent race while walking extent tree
  2012-11-08 21:52 ` Zach Brown
@ 2012-11-09  9:19   ` Lukáš Czerner
  0 siblings, 0 replies; 16+ messages in thread
From: Lukáš Czerner @ 2012-11-09  9:19 UTC (permalink / raw)
  To: Zach Brown; +Cc: Lukas Czerner, linux-ext4, tytso

On Thu, 8 Nov 2012, Zach Brown wrote:

> Date: Thu, 8 Nov 2012 13:52:33 -0800
> From: Zach Brown <zab@redhat.com>
> To: Lukas Czerner <lczerner@redhat.com>
> Cc: linux-ext4@vger.kernel.org, tytso@mit.edu
> Subject: Re: [PATCH] ext4: Prevent race while waling extent tree
> 
> On Thu, Nov 08, 2012 at 12:08:49PM +0100, Lukas Czerner wrote:
> > +		down_read(&EXT4_I(inode)->i_data_sem);
> >  		error = ext4_ext_walk_space(inode, start_blk, len_blks,
> >  					  ext4_ext_fiemap_cb, fieinfo);
> > +		up_read(&EXT4_I(inode)->i_data_sem);
> 
> Can this deadlock?  ext4_ext_fiemap_cb() seems to be doing all kinds of
> exciting things that might also try and acquire the i_data_sem, like
> GFP_KERNEL allocs (reclaim -> writepage) and copying to userspace (mmap
> fault -> readpage -> get blocks).
> 
> It seems like the safer fix is to broaden the sampling lock coverage to
> include referencing all the extent data but to release it around the
> callback.
> 
> No?
> 
> - z

Yeah, you're right. Having the lock around the whole
ext4_ext_walk_space() might deadlock. I'll fix this.

Thanks for noticing this!
-Lukas


* [PATCH] ext4: Prevent race while walking extent tree
@ 2012-11-12 14:57 Lukas Czerner
  2012-11-13  8:22 ` [PATCH v3] " Lukas Czerner
  0 siblings, 1 reply; 16+ messages in thread
From: Lukas Czerner @ 2012-11-12 14:57 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, zab, dmonakhov, Lukas Czerner

Currently ext4_ext_walk_space() takes i_data_sem for read only while
searching for the extent at a given block with ext4_ext_find_extent().
It then drops the lock, so the extent tree can change at will. Later on
we search for the 'next' extent, but by then the tree might already
have changed, so the information might not be accurate.

In fact we can hit BUG_ON(end <= start) if an extent gets inserted into
the tree after the one we found and before the block we were searching
for. This has been reproduced by running xfstests 225 in a loop on the
s390x architecture. Theoretically we could hit this on any other
architecture as well, though probably not as often.

Fix this by extending the critical section to include
ext4_ext_next_allocated_block() as well. It means that if there are any
operations in progress on the particular inode, fiemap may return
inaccurate data. However this also addresses the concern about starving
writers to the extent tree, because we release and reacquire the
semaphore with every iteration. This will not be particularly fast, but
fiemap is not a performance-critical operation.

However we also need to limit access to the extent structure to the
critical section, because outside of it the content can change. So we
remove the extent and next block parameters from ext4_ext_fiemap_cb()
and pass just flags instead.

Also we have to move the path reinitialization inside the critical
section.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
 fs/ext4/ext4_extents.h |    5 ++---
 fs/ext4/extents.c      |   40 +++++++++++++++++++++-------------------
 2 files changed, 23 insertions(+), 22 deletions(-)

diff --git a/fs/ext4/ext4_extents.h b/fs/ext4/ext4_extents.h
index cb1b2c9..356ad9f 100644
--- a/fs/ext4/ext4_extents.h
+++ b/fs/ext4/ext4_extents.h
@@ -149,9 +149,8 @@ struct ext4_ext_path {
  * positive retcode - signal for ext4_ext_walk_space(), see below
  * callback must return valid extent (passed or newly created)
  */
-typedef int (*ext_prepare_callback)(struct inode *, ext4_lblk_t,
-					struct ext4_ext_cache *,
-					struct ext4_extent *, void *);
+typedef int (*ext_prepare_callback)(struct inode *, struct ext4_ext_cache *,
+				    unsigned int, void *);
 
 #define EXT_CONTINUE   0
 #define EXT_BREAK      1
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 7011ac9..c097acf 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1968,7 +1968,8 @@ static int ext4_ext_walk_space(struct inode *inode, ext4_lblk_t block,
 	struct ext4_extent *ex;
 	ext4_lblk_t next, start = 0, end = 0;
 	ext4_lblk_t last = block + num;
-	int depth, exists, err = 0;
+	int exists, depth = 0, err = 0;
+	unsigned int flags = 0;
 
 	BUG_ON(func == NULL);
 	BUG_ON(inode == NULL);
@@ -1977,9 +1978,16 @@ static int ext4_ext_walk_space(struct inode *inode, ext4_lblk_t block,
 		num = last - block;
 		/* find extent for this block */
 		down_read(&EXT4_I(inode)->i_data_sem);
+
+		if (path && ext_depth(inode) != depth) {
+			/* depth was changed. we have to realloc path */
+			kfree(path);
+			path = NULL;
+		}
+
 		path = ext4_ext_find_extent(inode, block, path);
-		up_read(&EXT4_I(inode)->i_data_sem);
 		if (IS_ERR(path)) {
+			up_read(&EXT4_I(inode)->i_data_sem);
 			err = PTR_ERR(path);
 			path = NULL;
 			break;
@@ -1987,6 +1995,7 @@ static int ext4_ext_walk_space(struct inode *inode, ext4_lblk_t block,
 
 		depth = ext_depth(inode);
 		if (unlikely(path[depth].p_hdr == NULL)) {
+			up_read(&EXT4_I(inode)->i_data_sem);
 			EXT4_ERROR_INODE(inode, "path[%d].p_hdr == NULL", depth);
 			err = -EIO;
 			break;
@@ -2037,14 +2046,21 @@ static int ext4_ext_walk_space(struct inode *inode, ext4_lblk_t block,
 			cbex.ec_block = le32_to_cpu(ex->ee_block);
 			cbex.ec_len = ext4_ext_get_actual_len(ex);
 			cbex.ec_start = ext4_ext_pblock(ex);
+			if (ext4_ext_is_uninitialized(ex))
+				flags |= FIEMAP_EXTENT_UNWRITTEN;
 		}
+		up_read(&EXT4_I(inode)->i_data_sem);
 
 		if (unlikely(cbex.ec_len == 0)) {
 			EXT4_ERROR_INODE(inode, "cbex.ec_len == 0");
 			err = -EIO;
 			break;
 		}
-		err = func(inode, next, &cbex, ex, cbdata);
+
+		if (next == EXT_MAX_BLOCKS)
+			flags |= FIEMAP_EXTENT_LAST;
+
+		err = func(inode, &cbex, flags, cbdata);
 		ext4_ext_drop_refs(path);
 
 		if (err < 0)
@@ -2057,12 +2073,6 @@ static int ext4_ext_walk_space(struct inode *inode, ext4_lblk_t block,
 			break;
 		}
 
-		if (ext_depth(inode) != depth) {
-			/* depth was changed. we have to realloc path */
-			kfree(path);
-			path = NULL;
-		}
-
 		block = cbex.ec_block + cbex.ec_len;
 	}
 
@@ -4574,14 +4584,12 @@ int ext4_convert_unwritten_extents(struct inode *inode, loff_t offset,
 /*
  * Callback function called for each extent to gather FIEMAP information.
  */
-static int ext4_ext_fiemap_cb(struct inode *inode, ext4_lblk_t next,
-		       struct ext4_ext_cache *newex, struct ext4_extent *ex,
-		       void *data)
+static int ext4_ext_fiemap_cb(struct inode *inode, struct ext4_ext_cache *newex,
+			      unsigned int flags, void *data)
 {
 	__u64	logical;
 	__u64	physical;
 	__u64	length;
-	__u32	flags = 0;
 	int		ret = 0;
 	struct fiemap_extent_info *fieinfo = data;
 	unsigned char blksize_bits;
@@ -4759,12 +4767,6 @@ found_delayed_extent:
 	physical = (__u64)newex->ec_start << blksize_bits;
 	length =   (__u64)newex->ec_len << blksize_bits;
 
-	if (ex && ext4_ext_is_uninitialized(ex))
-		flags |= FIEMAP_EXTENT_UNWRITTEN;
-
-	if (next == EXT_MAX_BLOCKS)
-		flags |= FIEMAP_EXTENT_LAST;


* [PATCH v3] ext4: Prevent race while walking extent tree
  2012-11-12 14:57 [PATCH] ext4: Prevent race while walking extent tree Lukas Czerner
@ 2012-11-13  8:22 ` Lukas Czerner
  2012-11-13 11:34   ` Peng Tao
  0 siblings, 1 reply; 16+ messages in thread
From: Lukas Czerner @ 2012-11-13  8:22 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, zab, dmonakhov, Lukas Czerner

Currently ext4_ext_walk_space() takes i_data_sem for read only while
searching for the extent at a given block with ext4_ext_find_extent().
It then drops the lock, so the extent tree can change at will. Later on
we search for the 'next' extent, but by then the tree might already
have changed, so the information might not be accurate.

In fact we can hit BUG_ON(end <= start) if an extent gets inserted into
the tree after the one we found and before the block we were searching
for. This has been reproduced by running xfstests 225 in a loop on the
s390x architecture. Theoretically we could hit this on any other
architecture as well, though probably not as often.

Fix this by extending the critical section to include
ext4_ext_next_allocated_block() as well. It means that if there are any
operations in progress on the particular inode, fiemap may return
inaccurate data. However this also addresses the concern about starving
writers to the extent tree, because we release and reacquire the
semaphore with every iteration. This will not be particularly fast, but
fiemap is not a performance-critical operation.

However we also need to limit access to the extent structure to the
critical section, because outside of it the content can change. So we
remove the extent and next block parameters from ext4_ext_fiemap_cb()
and pass just flags instead.

Also we have to move the path reinitialization inside the critical
section.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
v3: reworked

 fs/ext4/ext4_extents.h |    5 ++---
 fs/ext4/extents.c      |   40 +++++++++++++++++++++-------------------
 2 files changed, 23 insertions(+), 22 deletions(-)

diff --git a/fs/ext4/ext4_extents.h b/fs/ext4/ext4_extents.h
index cb1b2c9..356ad9f 100644
--- a/fs/ext4/ext4_extents.h
+++ b/fs/ext4/ext4_extents.h
@@ -149,9 +149,8 @@ struct ext4_ext_path {
  * positive retcode - signal for ext4_ext_walk_space(), see below
  * callback must return valid extent (passed or newly created)
  */
-typedef int (*ext_prepare_callback)(struct inode *, ext4_lblk_t,
-					struct ext4_ext_cache *,
-					struct ext4_extent *, void *);
+typedef int (*ext_prepare_callback)(struct inode *, struct ext4_ext_cache *,
+				    unsigned int, void *);
 
 #define EXT_CONTINUE   0
 #define EXT_BREAK      1
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 7011ac9..c097acf 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1968,7 +1968,8 @@ static int ext4_ext_walk_space(struct inode *inode, ext4_lblk_t block,
 	struct ext4_extent *ex;
 	ext4_lblk_t next, start = 0, end = 0;
 	ext4_lblk_t last = block + num;
-	int depth, exists, err = 0;
+	int exists, depth = 0, err = 0;
+	unsigned int flags = 0;
 
 	BUG_ON(func == NULL);
 	BUG_ON(inode == NULL);
@@ -1977,9 +1978,16 @@ static int ext4_ext_walk_space(struct inode *inode, ext4_lblk_t block,
 		num = last - block;
 		/* find extent for this block */
 		down_read(&EXT4_I(inode)->i_data_sem);
+
+		if (path && ext_depth(inode) != depth) {
+			/* depth was changed. we have to realloc path */
+			kfree(path);
+			path = NULL;
+		}
+
 		path = ext4_ext_find_extent(inode, block, path);
-		up_read(&EXT4_I(inode)->i_data_sem);
 		if (IS_ERR(path)) {
+			up_read(&EXT4_I(inode)->i_data_sem);
 			err = PTR_ERR(path);
 			path = NULL;
 			break;
@@ -1987,6 +1995,7 @@ static int ext4_ext_walk_space(struct inode *inode, ext4_lblk_t block,
 
 		depth = ext_depth(inode);
 		if (unlikely(path[depth].p_hdr == NULL)) {
+			up_read(&EXT4_I(inode)->i_data_sem);
 			EXT4_ERROR_INODE(inode, "path[%d].p_hdr == NULL", depth);
 			err = -EIO;
 			break;
@@ -2037,14 +2046,21 @@ static int ext4_ext_walk_space(struct inode *inode, ext4_lblk_t block,
 			cbex.ec_block = le32_to_cpu(ex->ee_block);
 			cbex.ec_len = ext4_ext_get_actual_len(ex);
 			cbex.ec_start = ext4_ext_pblock(ex);
+			if (ext4_ext_is_uninitialized(ex))
+				flags |= FIEMAP_EXTENT_UNWRITTEN;
 		}
+		up_read(&EXT4_I(inode)->i_data_sem);
 
 		if (unlikely(cbex.ec_len == 0)) {
 			EXT4_ERROR_INODE(inode, "cbex.ec_len == 0");
 			err = -EIO;
 			break;
 		}
-		err = func(inode, next, &cbex, ex, cbdata);
+
+		if (next == EXT_MAX_BLOCKS)
+			flags |= FIEMAP_EXTENT_LAST;
+
+		err = func(inode, &cbex, flags, cbdata);
 		ext4_ext_drop_refs(path);
 
 		if (err < 0)
@@ -2057,12 +2073,6 @@ static int ext4_ext_walk_space(struct inode *inode, ext4_lblk_t block,
 			break;
 		}
 
-		if (ext_depth(inode) != depth) {
-			/* depth was changed. we have to realloc path */
-			kfree(path);
-			path = NULL;
-		}
-
 		block = cbex.ec_block + cbex.ec_len;
 	}
 
@@ -4574,14 +4584,12 @@ int ext4_convert_unwritten_extents(struct inode *inode, loff_t offset,
 /*
  * Callback function called for each extent to gather FIEMAP information.
  */
-static int ext4_ext_fiemap_cb(struct inode *inode, ext4_lblk_t next,
-		       struct ext4_ext_cache *newex, struct ext4_extent *ex,
-		       void *data)
+static int ext4_ext_fiemap_cb(struct inode *inode, struct ext4_ext_cache *newex,
+			      unsigned int flags, void *data)
 {
 	__u64	logical;
 	__u64	physical;
 	__u64	length;
-	__u32	flags = 0;
 	int		ret = 0;
 	struct fiemap_extent_info *fieinfo = data;
 	unsigned char blksize_bits;
@@ -4759,12 +4767,6 @@ found_delayed_extent:
 	physical = (__u64)newex->ec_start << blksize_bits;
 	length =   (__u64)newex->ec_len << blksize_bits;
 
-	if (ex && ext4_ext_is_uninitialized(ex))
-		flags |= FIEMAP_EXTENT_UNWRITTEN;
-
-	if (next == EXT_MAX_BLOCKS)
-		flags |= FIEMAP_EXTENT_LAST;


* Re: [PATCH v3] ext4: Prevent race while walking extent tree
  2012-11-13  8:22 ` [PATCH v3] " Lukas Czerner
@ 2012-11-13 11:34   ` Peng Tao
  2012-11-13 12:07     ` Lukáš Czerner
  0 siblings, 1 reply; 16+ messages in thread
From: Peng Tao @ 2012-11-13 11:34 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-ext4, tytso, zab, dmonakhov

On Tue, Nov 13, 2012 at 4:22 PM, Lukas Czerner <lczerner@redhat.com> wrote:
> Currently ext4_ext_walk_space() only takes i_data_sem for read when
> searching for the extent at given block with ext4_ext_find_extent().
> Then it drops the lock and the extent tree can be changed at will.
> However later on we're searching for the 'next' extent, but the extent
> tree might already have changed, so the information might not be
> accurate.
>
> In fact we can hit BUG_ON(end <= start) if the extent got inserted into
> the tree after the one we found and before the block we were searching
> for. This has been reproduced by running xfstests 225 in loop on s390x
> architecture, but theoretically we could hit this on any other
> architecture as well, but probably not as often.
>
> Fix this by extending the critical section to include
> ext4_ext_next_allocated_block() as well. It means that if there are any
> operations going on on the particular inode, the fiemap will return
> inaccurate data. However this will also fix the concerns about starving
> writers to the extent tree, because we will put and reacquire the
> semaphore with every iteration. This will not be particularly fast, but
> fiemap is not a critical operation.
>
> However we also need to limit the access to the extent structure to the
> critical section, because outside of it the content can change. So we
> remove extent and next block parameters from ext4_ext_fiemap_cb()
> function and pass just flags instead.
>
> Also we have to move path reinitialization inside the critical section.
>
> Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> ---
> v3: reworked
>
>  fs/ext4/ext4_extents.h |    5 ++---
>  fs/ext4/extents.c      |   40 +++++++++++++++++++++-------------------
>  2 files changed, 23 insertions(+), 22 deletions(-)
>
> diff --git a/fs/ext4/ext4_extents.h b/fs/ext4/ext4_extents.h
> index cb1b2c9..356ad9f 100644
> --- a/fs/ext4/ext4_extents.h
> +++ b/fs/ext4/ext4_extents.h
> @@ -149,9 +149,8 @@ struct ext4_ext_path {
>   * positive retcode - signal for ext4_ext_walk_space(), see below
>   * callback must return valid extent (passed or newly created)
>   */
> -typedef int (*ext_prepare_callback)(struct inode *, ext4_lblk_t,
> -                                       struct ext4_ext_cache *,
> -                                       struct ext4_extent *, void *);
> +typedef int (*ext_prepare_callback)(struct inode *, struct ext4_ext_cache *,
> +                                   unsigned int, void *);
>
>  #define EXT_CONTINUE   0
>  #define EXT_BREAK      1
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 7011ac9..c097acf 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -1968,7 +1968,8 @@ static int ext4_ext_walk_space(struct inode *inode, ext4_lblk_t block,
>         struct ext4_extent *ex;
>         ext4_lblk_t next, start = 0, end = 0;
>         ext4_lblk_t last = block + num;
> -       int depth, exists, err = 0;
> +       int exists, depth = 0, err = 0;
> +       unsigned int flags = 0;
>
>         BUG_ON(func == NULL);
>         BUG_ON(inode == NULL);
> @@ -1977,9 +1978,16 @@ static int ext4_ext_walk_space(struct inode *inode, ext4_lblk_t block,
>                 num = last - block;
>                 /* find extent for this block */
>                 down_read(&EXT4_I(inode)->i_data_sem);
> +
> +               if (path && ext_depth(inode) != depth) {
> +                       /* depth was changed. we have to realloc path */
> +                       kfree(path);
> +                       path = NULL;
> +               }
> +
>                 path = ext4_ext_find_extent(inode, block, path);
> -               up_read(&EXT4_I(inode)->i_data_sem);
>                 if (IS_ERR(path)) {
> +                       up_read(&EXT4_I(inode)->i_data_sem);
>                         err = PTR_ERR(path);
>                         path = NULL;
>                         break;
> @@ -1987,6 +1995,7 @@ static int ext4_ext_walk_space(struct inode *inode, ext4_lblk_t block,
>
>                 depth = ext_depth(inode);
>                 if (unlikely(path[depth].p_hdr == NULL)) {
> +                       up_read(&EXT4_I(inode)->i_data_sem);
>                         EXT4_ERROR_INODE(inode, "path[%d].p_hdr == NULL", depth);
>                         err = -EIO;
>                         break;
> @@ -2037,14 +2046,21 @@ static int ext4_ext_walk_space(struct inode *inode, ext4_lblk_t block,
>                         cbex.ec_block = le32_to_cpu(ex->ee_block);
>                         cbex.ec_len = ext4_ext_get_actual_len(ex);
>                         cbex.ec_start = ext4_ext_pblock(ex);
> +                       if (ext4_ext_is_uninitialized(ex))
> +                               flags |= FIEMAP_EXTENT_UNWRITTEN;
>                 }
> +               up_read(&EXT4_I(inode)->i_data_sem);
>
>                 if (unlikely(cbex.ec_len == 0)) {
>                         EXT4_ERROR_INODE(inode, "cbex.ec_len == 0");
>                         err = -EIO;
>                         break;
>                 }
> -               err = func(inode, next, &cbex, ex, cbdata);
> +
> +               if (next == EXT_MAX_BLOCKS)
> +                       flags |= FIEMAP_EXTENT_LAST;
> +
> +               err = func(inode, &cbex, flags, cbdata);
You may want to include func() in the critical section as well, to fix
the cp data corruption reported by Roger Niva. It looks to be the same
race.
http://thread.gmane.org/gmane.comp.file-systems.ext4/35393

-- 
Thanks,
Tao


* Re: [PATCH v3] ext4: Prevent race while waling extent tree
  2012-11-13 11:34   ` Peng Tao
@ 2012-11-13 12:07     ` Lukáš Czerner
  2012-11-13 14:19       ` Peng Tao
  0 siblings, 1 reply; 16+ messages in thread
From: Lukáš Czerner @ 2012-11-13 12:07 UTC (permalink / raw)
  To: Peng Tao; +Cc: Lukas Czerner, linux-ext4, tytso, zab, dmonakhov

On Tue, 13 Nov 2012, Peng Tao wrote:

> > -               err = func(inode, next, &cbex, ex, cbdata);
> > +
> > +               if (next == EXT_MAX_BLOCKS)
> > +                       flags |= FIEMAP_EXTENT_LAST;
> > +
> > +               err = func(inode, &cbex, flags, cbdata);
> You may want to include func() in the critical section as well, to fix
> the cp data corruption reported by Roger Niva. It looks to be the same
> race.

That's not a good idea. As already mentioned by Zach Brown
ext4_ext_fiemap_cb() is doing all kinds of things including possibly
taking i_data_sem. Moreover even if we do that, after we drop the
semaphore and return data to the user it might no longer be valid
anyway in the case there is any IO going on on the file.

-Lukas


* Re: [PATCH v3] ext4: Prevent race while waling extent tree
  2012-11-13 12:07     ` Lukáš Czerner
@ 2012-11-13 14:19       ` Peng Tao
  2012-11-13 18:51         ` Zach Brown
  0 siblings, 1 reply; 16+ messages in thread
From: Peng Tao @ 2012-11-13 14:19 UTC (permalink / raw)
  To: Lukáš Czerner; +Cc: linux-ext4, tytso, zab, dmonakhov

Hi Lukáš,

On Tue, Nov 13, 2012 at 01:07:03PM +0100, Lukáš Czerner wrote:
> On Tue, 13 Nov 2012, Peng Tao wrote:
> 
> > > -               err = func(inode, next, &cbex, ex, cbdata);
> > > +
> > > +               if (next == EXT_MAX_BLOCKS)
> > > +                       flags |= FIEMAP_EXTENT_LAST;
> > > +
> > > +               err = func(inode, &cbex, flags, cbdata);
> > You may want to include func() in the critical section as well, to fix
> > the cp data corruption reported by Roger Niva. It looks to be the same
> > race.
> 
> That's not a good idea. As already mentioned by Zach Brown
> ext4_ext_fiemap_cb() is doing all kinds of things including possibly
> taking i_data_sem. 
Except that the race is real. If a page is written back between
ext4_ext_find_extent() and ext4_ext_fiemap_cb(), find_get_pages_tag()
cannot find the dirty page, and so ext4_fiemap() reports a hole for the
corresponding blocks even though the application has already written them.

As a result, cp(1), which relies on FIEMAP, will write zeros for the
corresponding blocks and cause data corruption.
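
For context, here is a minimal user-space sketch of the kind of FIEMAP query a copy tool makes. The helper names query_extents() and range_has_data() are hypothetical and are not taken from cp(1); FS_IOC_FIEMAP, struct fiemap and FIEMAP_FLAG_SYNC are the interface from <linux/fs.h> and <linux/fiemap.h>. A byte range that no returned extent covers is treated as a hole, which is how a wrongly reported hole can end up as zeros in the copy.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>       /* FS_IOC_FIEMAP */
#include <linux/fiemap.h>   /* struct fiemap, struct fiemap_extent */

/*
 * Hypothetical helper: ask the kernel for up to 'cap' extents of 'fd'.
 * Returns the number of extents copied into 'out', or -1 with errno set.
 */
static int query_extents(int fd, struct fiemap_extent *out, unsigned int cap)
{
	struct fiemap *fm;
	unsigned int n;

	assert(out != NULL && cap > 0);

	fm = calloc(1, sizeof(*fm) + cap * sizeof(struct fiemap_extent));
	if (fm == NULL)
		return -1;

	fm->fm_start = 0;
	fm->fm_length = FIEMAP_MAX_OFFSET;   /* map the whole file */
	fm->fm_flags = FIEMAP_FLAG_SYNC;     /* flush the file before mapping */
	fm->fm_extent_count = cap;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		int saved = errno;

		free(fm);
		errno = saved;
		return -1;
	}

	n = fm->fm_mapped_extents;
	if (n > cap)
		n = cap;
	memcpy(out, fm->fm_extents, n * sizeof(*out));
	free(fm);
	return (int)n;
}

/*
 * Hypothetical helper: does any extent overlap [off, off + len)?
 * A caller that gets 0 here would treat the range as a hole.
 */
static int range_has_data(const struct fiemap_extent *ext, unsigned int n,
			  unsigned long long off, unsigned long long len)
{
	unsigned int i;

	for (i = 0; i < n; i++) {
		unsigned long long start = ext[i].fe_logical;
		unsigned long long end = start + ext[i].fe_length;

		if (start < off + len && off < end)
			return 1;
	}
	return 0;
}
```

The sketch only shows the query side; it does not by itself say anything about whether the kernel's answer is consistent.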

The deadlock mentioned by Zach Brown can be fixed by simply switching
to GFP_NOFS.

> Moreover even if we do that, after we drop the
> semaphore and return data to the user it might no longer be valid
> anyway in the case there is any IO going on on the file.
The race is different from concurrent user space writers. It is similar
to the original bug fixed by commit 6d9c85e: ext4_fiemap reports an
incorrect file mapping for ranges where pages are written back between
ext4_ext_find_extent() and ext4_ext_fiemap_cb().
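
The window described above can be modelled in a few lines of user-space C. This is only a toy illustration under stated assumptions: struct model_block, model_writeback() and model_fiemap_lookup() are hypothetical names, and the two flags stand in for "present in the extent tree" and "dirty delayed page in the page cache". It is not the ext4 code path.

```c
#include <assert.h>

/* Toy state of one block: hypothetical model, not a kernel structure. */
struct model_block {
	int allocated;	/* block is present in the extent tree */
	int dirty;	/* block exists only as a dirty (delayed) page */
};

enum model_result { MODEL_HOLE, MODEL_EXTENT, MODEL_DELALLOC };

/* Writeback allocates the block and clears the dirty tag. */
static void model_writeback(struct model_block *b)
{
	if (b->dirty) {
		b->dirty = 0;
		b->allocated = 1;
	}
}

/*
 * Two-step lookup: extent tree first, then dirty pages.
 * 'writeback_between' simulates writeback landing in the window
 * between the two steps.
 */
static enum model_result model_fiemap_lookup(struct model_block *b,
					     int writeback_between)
{
	assert(b != 0);

	if (b->allocated)		/* step 1: extent tree lookup */
		return MODEL_EXTENT;

	if (writeback_between)		/* the race window */
		model_writeback(b);

	if (b->dirty)			/* step 2: dirty page lookup */
		return MODEL_DELALLOC;

	return MODEL_HOLE;		/* both lookups missed */
}
```

Without the interleaving the model reports a delayed extent; with it, both lookups miss and the block is reported as a hole even though it now sits in the extent tree.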

Thanks,
Tao

* Re: [PATCH v3] ext4: Prevent race while waling extent tree
  2012-11-13 14:19       ` Peng Tao
@ 2012-11-13 18:51         ` Zach Brown
  2012-11-15 16:39           ` Lukáš Czerner
  0 siblings, 1 reply; 16+ messages in thread
From: Zach Brown @ 2012-11-13 18:51 UTC (permalink / raw)
  To: Peng Tao; +Cc: Lukáš Czerner, linux-ext4, tytso, dmonakhov

> The deadlock mentioned by Zach Brown can be fixed by simply switching
> to GFP_NOFS.

That's a start, but it doesn't address the copy_to_user().  You could
pin that memory, I suppose, but that starts to feel like more trouble
than it's worth.

- z


* Re: [PATCH v3] ext4: Prevent race while waling extent tree
  2012-11-13 18:51         ` Zach Brown
@ 2012-11-15 16:39           ` Lukáš Czerner
  2012-11-15 19:10             ` Zach Brown
  2012-11-19  3:24             ` Theodore Ts'o
  0 siblings, 2 replies; 16+ messages in thread
From: Lukáš Czerner @ 2012-11-15 16:39 UTC (permalink / raw)
  To: Zach Brown
  Cc: Peng Tao, Lukáš Czerner, linux-ext4, tytso, dmonakhov


On Tue, 13 Nov 2012, Zach Brown wrote:

> Date: Tue, 13 Nov 2012 10:51:04 -0800
> From: Zach Brown <zab@redhat.com>
> To: Peng Tao <bergwolf@gmail.com>
> Cc: Lukáš Czerner <lczerner@redhat.com>, linux-ext4@vger.kernel.org,
>     tytso@mit.edu, dmonakhov@openvz.org
> Subject: Re: [PATCH v3] ext4: Prevent race while waling extent tree
> 
> > The deadlock mentioned by Zach Brown can be fixed by simply switching
> > to GFP_NOFS.
> 
> That's a start, but it doesn't address the copy_to_user().  You could
> pin that memory, I suppose, but that starts to feel like more trouble
> than it's worth.
> 
> - z

You're both right. The code has to be reorganized.
ext4_ext_fiemap_cb() is ugly as hell, but luckily that will be
resolved with the extent status tree patches from Zheng Liu. However, the
indirection in ext4_ext_walk_space() and the callback business
is also pointless.

I have prepared a patch to fix this and I am going to test it
right now. Basically it will:

1. remove the callback

2. rename functions
	ext4_ext_walk_space -> ext4_fill_fiemap_extents
	ext4_ext_fiemap_cb -> ext4_find_delayed_extent

3. put fiemap_fill_next_extent() into ext4_fill_fiemap_extents()

4. call ext4_find_delayed_extent() only for non-existing extents

5. use GFP_NOFS in ext4_find_delayed_extent()

6. hold the i_data_sem for:
	ext4_ext_find_extent()
	ext4_ext_next_allocated_block()
	ext4_find_delayed_extent()

7. call fiemap_fill_next_extent() after releasing the i_data_sem

Does that sound OK?
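
The locking order in the plan above can be sketched as a small user-space model. Everything here is a hypothetical stand-in (struct model_inode, model_down_read(), model_fill_extents() and so on are not kernel APIs), and the delayed-extent lookup for holes is left out. The point is only the shape of each iteration: look up the extent and the next allocated block under one lock hold, copy what is needed into a local snapshot, drop the lock, and report from the snapshot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the extent tree and i_data_sem. */
struct model_extent { uint32_t lblk; uint32_t len; uint64_t pblk; };

struct model_inode {
	const struct model_extent *ext;	/* sorted, non-overlapping */
	size_t nr;
	int readers;			/* models i_data_sem held for read */
};

struct model_report { uint64_t logical, physical, length; };

static void model_down_read(struct model_inode *i) { i->readers++; }

static void model_up_read(struct model_inode *i)
{
	assert(i->readers > 0);
	i->readers--;
}

/* Walk [block, block + num) and report allocated extents in bytes. */
static size_t model_fill_extents(struct model_inode *inode, uint32_t block,
				 uint32_t num, unsigned int blkbits,
				 struct model_report *out, size_t cap)
{
	uint64_t last = (uint64_t)block + num;
	uint64_t cur = block;
	size_t n = 0;

	while (cur < last && n < cap) {
		struct model_extent snap = { 0, 0, 0 };
		uint64_t next = last;
		int exists = 0;
		size_t i;

		/* Lookup and "next allocated block" share one lock hold. */
		model_down_read(inode);
		for (i = 0; i < inode->nr; i++) {
			const struct model_extent *ex = &inode->ext[i];
			uint64_t ex_end = (uint64_t)ex->lblk + ex->len;

			if (cur >= ex->lblk && cur < ex_end) {
				snap = *ex;	/* copy, keep no pointer */
				exists = 1;
				break;
			}
			if (ex->lblk > cur) {
				next = ex->lblk < last ? ex->lblk : last;
				break;
			}
		}
		if (!exists) {
			snap.lblk = (uint32_t)cur;	/* a hole up to 'next' */
			snap.len = (uint32_t)(next - cur);
		}
		model_up_read(inode);

		/* Reporting happens unlocked, from the snapshot only. */
		assert(inode->readers == 0);
		if (exists) {
			out[n].logical = (uint64_t)snap.lblk << blkbits;
			out[n].physical = snap.pblk << blkbits;
			out[n].length = (uint64_t)snap.len << blkbits;
			n++;
		}
		cur = (uint64_t)snap.lblk + snap.len;
	}
	return n;
}
```

Because the report step touches only the local snapshot, it can fault or sleep (as copy_to_user() may) without the semaphore being held.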

There are some collisions with the extent status tree patches, but they
could be easily resolved depending on which goes in first.

I am going to test the patch now, so this is entirely untested and
possibly still contains bugs, but if you're interested in seeing what it
looks like, here it is:



---
I should probably split that into two parts to make the changes
clearer, because removing the if (newex->ec_start == 0) in the old
ext4_ext_fiemap_cb() obfuscated it a lot.


diff --git a/fs/ext4/ext4_extents.h b/fs/ext4/ext4_extents.h
index cb1b2c9..1718ff1 100644
--- a/fs/ext4/ext4_extents.h
+++ b/fs/ext4/ext4_extents.h
@@ -144,20 +144,6 @@ struct ext4_ext_path {
  */
 
 /*
- * to be called by ext4_ext_walk_space()
- * negative retcode - error
- * positive retcode - signal for ext4_ext_walk_space(), see below
- * callback must return valid extent (passed or newly created)
- */
-typedef int (*ext_prepare_callback)(struct inode *, ext4_lblk_t,
-					struct ext4_ext_cache *,
-					struct ext4_extent *, void *);
-
-#define EXT_CONTINUE   0
-#define EXT_BREAK      1
-#define EXT_REPEAT     2
-
-/*
  * Maximum number of logical blocks in a file; ext4_extent's ee_block is
  * __le32.
  */
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 7011ac9..5c56194 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -109,6 +109,9 @@ static int ext4_split_extent_at(handle_t *handle,
 			     int split_flag,
 			     int flags);
 
+static int ext4_find_delayed_extent(struct inode *inode,
+				    struct ext4_ext_cache *newex);
+
 static int ext4_ext_truncate_extend_restart(handle_t *handle,
 					    struct inode *inode,
 					    int needed)
@@ -1959,27 +1962,35 @@ cleanup:
 	return err;
 }
 
-static int ext4_ext_walk_space(struct inode *inode, ext4_lblk_t block,
-			       ext4_lblk_t num, ext_prepare_callback func,
-			       void *cbdata)
+static int ext4_fill_fiemap_extents(struct inode *inode,
+				    ext4_lblk_t block, ext4_lblk_t num,
+				    struct fiemap_extent_info *fieinfo)
 {
 	struct ext4_ext_path *path = NULL;
-	struct ext4_ext_cache cbex;
+	struct ext4_ext_cache newex;
 	struct ext4_extent *ex;
 	ext4_lblk_t next, start = 0, end = 0;
 	ext4_lblk_t last = block + num;
-	int depth, exists, err = 0;
+	int exists, depth = 0, err = 0;
+	unsigned int flags = 0;
+	unsigned char blksize_bits = inode->i_sb->s_blocksize_bits;
 
-	BUG_ON(func == NULL);
 	BUG_ON(inode == NULL);
 
 	while (block < last && block != EXT_MAX_BLOCKS) {
 		num = last - block;
 		/* find extent for this block */
 		down_read(&EXT4_I(inode)->i_data_sem);
+
+		if (path && ext_depth(inode) != depth) {
+			/* depth was changed. we have to realloc path */
+			kfree(path);
+			path = NULL;
+		}
+
 		path = ext4_ext_find_extent(inode, block, path);
-		up_read(&EXT4_I(inode)->i_data_sem);
 		if (IS_ERR(path)) {
+			up_read(&EXT4_I(inode)->i_data_sem);
 			err = PTR_ERR(path);
 			path = NULL;
 			break;
@@ -1987,12 +1998,14 @@ static int ext4_ext_walk_space(struct inode *inode, ext4_lblk_t block,
 
 		depth = ext_depth(inode);
 		if (unlikely(path[depth].p_hdr == NULL)) {
+			up_read(&EXT4_I(inode)->i_data_sem);
 			EXT4_ERROR_INODE(inode, "path[%d].p_hdr == NULL", depth);
 			err = -EIO;
 			break;
 		}
 		ex = path[depth].p_ext;
 		next = ext4_ext_next_allocated_block(path);
+		ext4_ext_drop_refs(path);
 
 		exists = 0;
 		if (!ex) {
@@ -2030,40 +2043,49 @@ static int ext4_ext_walk_space(struct inode *inode, ext4_lblk_t block,
 		BUG_ON(end <= start);
 
 		if (!exists) {
-			cbex.ec_block = start;
-			cbex.ec_len = end - start;
-			cbex.ec_start = 0;
+			newex.ec_block = start;
+			newex.ec_len = end - start;
+			newex.ec_start = 0;
+			err = ext4_find_delayed_extent(inode, &newex);
 		} else {
-			cbex.ec_block = le32_to_cpu(ex->ee_block);
-			cbex.ec_len = ext4_ext_get_actual_len(ex);
-			cbex.ec_start = ext4_ext_pblock(ex);
+			newex.ec_block = le32_to_cpu(ex->ee_block);
+			newex.ec_len = ext4_ext_get_actual_len(ex);
+			newex.ec_start = ext4_ext_pblock(ex);
+			if (ext4_ext_is_uninitialized(ex))
+				flags |= FIEMAP_EXTENT_UNWRITTEN;
 		}
+		up_read(&EXT4_I(inode)->i_data_sem);
 
-		if (unlikely(cbex.ec_len == 0)) {
-			EXT4_ERROR_INODE(inode, "cbex.ec_len == 0");
+		if (unlikely(newex.ec_len == 0)) {
+			EXT4_ERROR_INODE(inode, "newex.ec_len == 0");
 			err = -EIO;
 			break;
 		}
-		err = func(inode, next, &cbex, ex, cbdata);
-		ext4_ext_drop_refs(path);
-
 		if (err < 0)
 			break;
-
-		if (err == EXT_REPEAT)
-			continue;
-		else if (err == EXT_BREAK) {
-			err = 0;
-			break;
+		if (err == 1) {
+			exists = 1;
+			flags |= FIEMAP_EXTENT_DELALLOC;
 		}
 
-		if (ext_depth(inode) != depth) {
-			/* depth was changed. we have to realloc path */
-			kfree(path);
-			path = NULL;
+		if (next == EXT_MAX_BLOCKS)
+			flags |= FIEMAP_EXTENT_LAST;
+
+		if (exists) {
+			err = fiemap_fill_next_extent(fieinfo,
+				(__u64)newex.ec_block << blksize_bits,
+				(__u64)newex.ec_start << blksize_bits,
+				(__u64)newex.ec_len << blksize_bits,
+				flags);
+			if (err < 0)
+				break;
+			if (err == 1) {
+				err = 0;
+				break;
+			}
 		}
 
-		block = cbex.ec_block + cbex.ec_len;
+		block = newex.ec_block + newex.ec_len;
 	}
 
 	if (path) {
@@ -4574,204 +4596,179 @@ int ext4_convert_unwritten_extents(struct inode *inode, loff_t offset,
 /*
  * Callback function called for each extent to gather FIEMAP information.
  */
-static int ext4_ext_fiemap_cb(struct inode *inode, ext4_lblk_t next,
-		       struct ext4_ext_cache *newex, struct ext4_extent *ex,
-		       void *data)
+static int ext4_find_delayed_extent(struct inode *inode,
+				    struct ext4_ext_cache *newex)
 {
-	__u64	logical;
-	__u64	physical;
-	__u64	length;
-	__u32	flags = 0;
 	int		ret = 0;
-	struct fiemap_extent_info *fieinfo = data;
-	unsigned char blksize_bits;
+	unsigned int flags = 0;
+	ext4_lblk_t	end = 0;
+	pgoff_t		last_offset;
+	pgoff_t		offset;
+	pgoff_t		index;
+	pgoff_t		start_index = 0;
+	struct page	**pages = NULL;
+	struct buffer_head *bh = NULL;
+	struct buffer_head *head = NULL;
+	unsigned int nr_pages = PAGE_SIZE / sizeof(struct page *);
+	unsigned char blksize_bits = inode->i_sb->s_blocksize_bits;
 
-	blksize_bits = inode->i_sb->s_blocksize_bits;
-	logical = (__u64)newex->ec_block << blksize_bits;
+	/*
+	 * No extent in extent-tree contains block @newex->ec_start,
+	 * then the block may stay in 1)a hole or 2)delayed-extent.
+	 *
+	 * Holes or delayed-extents are processed as follows.
+	 * 1. lookup dirty pages with specified range in pagecache.
+	 *    If no page is got, then there is no delayed-extent and
+	 *    return with EXT_CONTINUE.
+	 * 2. find the 1st mapped buffer,
+	 * 3. check if the mapped buffer is both in the request range
+	 *    and a delayed buffer. If not, there is no delayed-extent,
+	 *    then return.
+	 * 4. a delayed-extent is found, the extent will be collected.
+	 */
 
-	if (newex->ec_start == 0) {
-		/*
-		 * No extent in extent-tree contains block @newex->ec_start,
-		 * then the block may stay in 1)a hole or 2)delayed-extent.
-		 *
-		 * Holes or delayed-extents are processed as follows.
-		 * 1. lookup dirty pages with specified range in pagecache.
-		 *    If no page is got, then there is no delayed-extent and
-		 *    return with EXT_CONTINUE.
-		 * 2. find the 1st mapped buffer,
-		 * 3. check if the mapped buffer is both in the request range
-		 *    and a delayed buffer. If not, there is no delayed-extent,
-		 *    then return.
-		 * 4. a delayed-extent is found, the extent will be collected.
-		 */
-		ext4_lblk_t	end = 0;
-		pgoff_t		last_offset;
-		pgoff_t		offset;
-		pgoff_t		index;
-		pgoff_t		start_index = 0;
-		struct page	**pages = NULL;
-		struct buffer_head *bh = NULL;
-		struct buffer_head *head = NULL;
-		unsigned int nr_pages = PAGE_SIZE / sizeof(struct page *);
-
-		pages = kmalloc(PAGE_SIZE, GFP_KERNEL);
-		if (pages == NULL)
-			return -ENOMEM;
+	pages = kmalloc(PAGE_SIZE, GFP_NOFS);
+	if (pages == NULL)
+		return -ENOMEM;
 
-		offset = logical >> PAGE_SHIFT;
+	offset = ((__u64)newex->ec_block << blksize_bits) >>
+			PAGE_SHIFT;
 repeat:
-		last_offset = offset;
-		head = NULL;
-		ret = find_get_pages_tag(inode->i_mapping, &offset,
-					PAGECACHE_TAG_DIRTY, nr_pages, pages);
-
-		if (!(flags & FIEMAP_EXTENT_DELALLOC)) {
-			/* First time, try to find a mapped buffer. */
-			if (ret == 0) {
+	last_offset = offset;
+	head = NULL;
+	ret = find_get_pages_tag(inode->i_mapping, &offset,
+				PAGECACHE_TAG_DIRTY, nr_pages, pages);
+
+	if (!(flags & FIEMAP_EXTENT_DELALLOC)) {
+		/* First time, try to find a mapped buffer. */
+		if (ret == 0) {
 out:
-				for (index = 0; index < ret; index++)
-					page_cache_release(pages[index]);
-				/* just a hole. */
-				kfree(pages);
-				return EXT_CONTINUE;
-			}
-			index = 0;
+			for (index = 0; index < ret; index++)
+				page_cache_release(pages[index]);
+			/* just a hole. */
+			kfree(pages);
+			return 0;
+		}
+		index = 0;
 
 next_page:
-			/* Try to find the 1st mapped buffer. */
-			end = ((__u64)pages[index]->index << PAGE_SHIFT) >>
-				  blksize_bits;
-			if (!page_has_buffers(pages[index]))
-				goto out;
-			head = page_buffers(pages[index]);
-			if (!head)
-				goto out;
+		/* Try to find the 1st mapped buffer. */
+		end = ((__u64)pages[index]->index << PAGE_SHIFT) >>
+			  blksize_bits;
+		if (!page_has_buffers(pages[index]))
+			goto out;
+		head = page_buffers(pages[index]);
+		if (!head)
+			goto out;
 
-			index++;
-			bh = head;
-			do {
-				if (end >= newex->ec_block +
-					newex->ec_len)
-					/* The buffer is out of
-					 * the request range.
-					 */
-					goto out;
+		index++;
+		bh = head;
+		do {
+			if (end >= newex->ec_block +
+				newex->ec_len)
+				/* The buffer is out of
+				 * the request range.
+				 */
+				goto out;
 
-				if (buffer_mapped(bh) &&
-				    end >= newex->ec_block) {
-					start_index = index - 1;
-					/* get the 1st mapped buffer. */
-					goto found_mapped_buffer;
-				}
+			if (buffer_mapped(bh) &&
+			    end >= newex->ec_block) {
+				start_index = index - 1;
+				/* get the 1st mapped buffer. */
+				goto found_mapped_buffer;
+			}
 
-				bh = bh->b_this_page;
-				end++;
-			} while (bh != head);
+			bh = bh->b_this_page;
+			end++;
+		} while (bh != head);
 
-			/* No mapped buffer in the range found in this page,
-			 * We need to look up next page.
+		/* No mapped buffer in the range found in this page,
+		 * We need to look up next page.
+		 */
+		if (index >= ret) {
+			/* There is no page left, but we need to limit
+			 * newex->ec_len.
 			 */
-			if (index >= ret) {
-				/* There is no page left, but we need to limit
-				 * newex->ec_len.
-				 */
-				newex->ec_len = end - newex->ec_block;
-				goto out;
-			}
-			goto next_page;
-		} else {
-			/*Find contiguous delayed buffers. */
-			if (ret > 0 && pages[0]->index == last_offset)
-				head = page_buffers(pages[0]);
-			bh = head;
-			index = 1;
-			start_index = 0;
+			newex->ec_len = end - newex->ec_block;
+			goto out;
 		}
+		goto next_page;
+	} else {
+		/*Find contiguous delayed buffers. */
+		if (ret > 0 && pages[0]->index == last_offset)
+			head = page_buffers(pages[0]);
+		bh = head;
+		index = 1;
+		start_index = 0;
+	}
 
 found_mapped_buffer:
-		if (bh != NULL && buffer_delay(bh)) {
-			/* 1st or contiguous delayed buffer found. */
-			if (!(flags & FIEMAP_EXTENT_DELALLOC)) {
-				/*
-				 * 1st delayed buffer found, record
-				 * the start of extent.
-				 */
-				flags |= FIEMAP_EXTENT_DELALLOC;
-				newex->ec_block = end;
-				logical = (__u64)end << blksize_bits;
+	if (bh != NULL && buffer_delay(bh)) {
+		/* 1st or contiguous delayed buffer found. */
+		if (!(flags & FIEMAP_EXTENT_DELALLOC)) {
+			/*
+			 * 1st delayed buffer found, record
+			 * the start of extent.
+			 */
+			flags |= FIEMAP_EXTENT_DELALLOC;
+			newex->ec_block = end;
+		}
+		/* Find contiguous delayed buffers. */
+		do {
+			if (!buffer_delay(bh))
+				goto found_delayed_extent;
+			bh = bh->b_this_page;
+			end++;
+		} while (bh != head);
+
+		for (; index < ret; index++) {
+			if (!page_has_buffers(pages[index])) {
+				bh = NULL;
+				break;
+			}
+			head = page_buffers(pages[index]);
+			if (!head) {
+				bh = NULL;
+				break;
+			}
+
+			if (pages[index]->index !=
+			    pages[start_index]->index + index
+			    - start_index) {
+				/* Blocks are not contiguous. */
+				bh = NULL;
+				break;
 			}
-			/* Find contiguous delayed buffers. */
+			bh = head;
 			do {
 				if (!buffer_delay(bh))
+					/* Delayed-extent ends. */
 					goto found_delayed_extent;
 				bh = bh->b_this_page;
 				end++;
 			} while (bh != head);
-
-			for (; index < ret; index++) {
-				if (!page_has_buffers(pages[index])) {
-					bh = NULL;
-					break;
-				}
-				head = page_buffers(pages[index]);
-				if (!head) {
-					bh = NULL;
-					break;
-				}
-
-				if (pages[index]->index !=
-				    pages[start_index]->index + index
-				    - start_index) {
-					/* Blocks are not contiguous. */
-					bh = NULL;
-					break;
-				}
-				bh = head;
-				do {
-					if (!buffer_delay(bh))
-						/* Delayed-extent ends. */
-						goto found_delayed_extent;
-					bh = bh->b_this_page;
-					end++;
-				} while (bh != head);
-			}
-		} else if (!(flags & FIEMAP_EXTENT_DELALLOC))
-			/* a hole found. */
-			goto out;
-
-found_delayed_extent:
-		newex->ec_len = min(end - newex->ec_block,
-						(ext4_lblk_t)EXT_INIT_MAX_LEN);
-		if (ret == nr_pages && bh != NULL &&
-			newex->ec_len < EXT_INIT_MAX_LEN &&
-			buffer_delay(bh)) {
-			/* Have not collected an extent and continue. */
-			for (index = 0; index < ret; index++)
-				page_cache_release(pages[index]);
-			goto repeat;
 		}
+	} else if (!(flags & FIEMAP_EXTENT_DELALLOC))
+		/* a hole found. */
+		goto out;
 
+found_delayed_extent:
+	newex->ec_len = min(end - newex->ec_block,
+					(ext4_lblk_t)EXT_INIT_MAX_LEN);
+	if (ret == nr_pages && bh != NULL &&
+		newex->ec_len < EXT_INIT_MAX_LEN &&
+		buffer_delay(bh)) {
+		/* Have not collected an extent and continue. */
 		for (index = 0; index < ret; index++)
 			page_cache_release(pages[index]);
-		kfree(pages);
+		goto repeat;
 	}
 
-	physical = (__u64)newex->ec_start << blksize_bits;
-	length =   (__u64)newex->ec_len << blksize_bits;
-
-	if (ex && ext4_ext_is_uninitialized(ex))
-		flags |= FIEMAP_EXTENT_UNWRITTEN;
-
-	if (next == EXT_MAX_BLOCKS)
-		flags |= FIEMAP_EXTENT_LAST;
+	for (index = 0; index < ret; index++)
+		page_cache_release(pages[index]);
+	kfree(pages);
 
-	ret = fiemap_fill_next_extent(fieinfo, logical, physical,
-					length, flags);
-	if (ret < 0)
-		return ret;
-	if (ret == 1)
-		return EXT_BREAK;
-	return EXT_CONTINUE;
+	return 1;
 }
 /* fiemap flags we can handle specified here */
 #define EXT4_FIEMAP_FLAGS	(FIEMAP_FLAG_SYNC|FIEMAP_FLAG_XATTR)
@@ -4991,6 +4988,7 @@ out_mutex:
 	mutex_unlock(&inode->i_mutex);
 	return err;
 }
+
 int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		__u64 start, __u64 len)
 {
@@ -5021,8 +5019,8 @@ int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		 * Walk the extent tree gathering extent information.
 		 * ext4_ext_fiemap_cb will push extents back to user.
 		 */
-		error = ext4_ext_walk_space(inode, start_blk, len_blks,
-					  ext4_ext_fiemap_cb, fieinfo);
+		error = ext4_fill_fiemap_extents(inode, start_blk,
+						 len_blks, fieinfo);
 	}
 
 	return error;
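
The hunk above moves between page-cache indices and filesystem block numbers in both directions (for example when computing `offset` and `end`). As a rough illustration only, the two conversions can be written as standalone user-space helpers. `SKETCH_PAGE_SHIFT` and both function names are made up for this sketch and are not kernel identifiers; 4 KiB pages are an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: 4 KiB pages (the kernel's PAGE_SHIFT varies). */
#define SKETCH_PAGE_SHIFT 12

/* First filesystem block covered by the page with the given index. */
static uint64_t sketch_page_to_first_block(uint64_t page_index,
					   unsigned int blksize_bits)
{
	return (page_index << SKETCH_PAGE_SHIFT) >> blksize_bits;
}

/* Index of the page that contains the given filesystem block. */
static uint64_t sketch_block_to_page(uint64_t block, unsigned int blksize_bits)
{
	return (block << blksize_bits) >> SKETCH_PAGE_SHIFT;
}

/* Small self-check with 1 KiB blocks (blksize_bits == 10). */
void sketch_conversion_selfcheck(void)
{
	/* Page 3 starts at byte 12288, which is block 12. */
	assert(sketch_page_to_first_block(3, 10) == 12);
	/* Block 13 starts at byte 13312, which lies inside page 3. */
	assert(sketch_block_to_page(13, 10) == 3);
}
```

The two helpers agree only when the block size equals the page size, so which one a given line needs depends on whether its left-hand side is a page index or a block number.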



* Re: [PATCH v3] ext4: Prevent race while waling extent tree
  2012-11-15 16:39           ` Lukáš Czerner
@ 2012-11-15 19:10             ` Zach Brown
  2012-11-19  3:24             ` Theodore Ts'o
  1 sibling, 0 replies; 16+ messages in thread
From: Zach Brown @ 2012-11-15 19:10 UTC (permalink / raw)
  To: Lukáš Czerner; +Cc: Peng Tao, linux-ext4, tytso, dmonakhov

> I have prepared a patch to fix this and I am going to test this
> right now. Basically it will:
> 
> 1. remove the callback
> 
> 2. rename functions
> 	ext4_ext_walk_space -> ext4_fill_fiemap_extents
> 	ext4_ext_fiemap_cb -> ext4_find_delayed_extent
> 
> 3. put fiemap_fill_next_extent() into ext4_fill_fiemap_extents()
> 
> 4. Call ext4_find_delayed_extent() only for non-existent extents
> 
> 5. Use GFP_NOFS in ext4_find_delayed_extent()
> 
> 6. hold the i_data_sem for:
> 	ext4_ext_find_extent()
> 	ext4_ext_next_allocated_block()
> 	ext4_find_delayed_extent()
> 
> 7. call fiemap_fill_next_extent after releasing the i_data_sem
> 
> does it sound ok?

That sounds.. good?  I'm no ext* locking expert :).

> ---
> I should probably split that into two parts to make the changes
> clearer, because removing the if (newex->ec_start == 0) in the old
> ext4_ext_fiemap_cb() obfuscated it a lot.

Yeah, definitely.  I found myself squinting looking for code changes in
all the code motion.

- z
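
The locking part of the plan quoted above (look the extent up while holding i_data_sem, then report it only after the semaphore is dropped) can be sketched in user-space C. This is a toy model under stated assumptions, not the patch: every `sketch_*` name is hypothetical, the "semaphore" is a plain reader counter, the extent tree is a sorted array, and the delayed-extent lookup from step 4 is not modelled.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names exist in the kernel. */
struct sketch_extent { unsigned int start, len; };

struct sketch_inode {
	int sem_readers;                 /* toy replacement for i_data_sem */
	const struct sketch_extent *ext; /* sorted, non-overlapping extents */
	size_t nr;
};

static void sketch_down_read(struct sketch_inode *ino) { ino->sem_readers++; }
static void sketch_up_read(struct sketch_inode *ino) { ino->sem_readers--; }

/*
 * Must be called with the semaphore held. Returns 1 and fills *found if an
 * extent covers @block; otherwise returns 0 and sets *next to the start of
 * the following extent (UINT_MAX if there is none).
 */
static int sketch_find(const struct sketch_inode *ino, unsigned int block,
		       struct sketch_extent *found, unsigned int *next)
{
	size_t i;

	assert(ino->sem_readers > 0);
	*next = UINT_MAX;
	for (i = 0; i < ino->nr; i++) {
		const struct sketch_extent *e = &ino->ext[i];

		if (block >= e->start + e->len)
			continue;	/* extent lies entirely before @block */
		if (block >= e->start) {
			*found = *e;
			return 1;
		}
		*next = e->start;	/* @block is in a hole before this extent */
		return 0;
	}
	return 0;
}

/* Reporting step; it may block, so it runs with the semaphore released. */
static size_t sketch_report(const struct sketch_inode *ino,
			    const struct sketch_extent *e,
			    struct sketch_extent *out, size_t n)
{
	assert(ino->sem_readers == 0);
	out[n] = *e;
	return n + 1;
}

/* Walk [block, block + num) and copy each mapped extent found into @out. */
size_t sketch_walk(struct sketch_inode *ino, unsigned int block,
		   unsigned int num, struct sketch_extent *out, size_t max_out)
{
	unsigned int last = block + num;
	size_t n = 0;

	while (block < last && n < max_out) {
		struct sketch_extent found;
		unsigned int next;
		int mapped;

		/* Both the lookup and the "next" search happen under one hold. */
		sketch_down_read(ino);
		mapped = sketch_find(ino, block, &found, &next);
		sketch_up_read(ino);

		if (!mapped) {
			block = next;	/* skip the hole */
			continue;
		}
		/* Only the local snapshot is used once the lock is dropped. */
		n = sketch_report(ino, &found, out, n);
		block = found.start + found.len;
	}
	return n;
}
```

The point of the shape is that the extent and the start of the next one are read in a single critical section, so they are consistent with each other, while the step that may sleep works on a private copy.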


* Re: [PATCH v3] ext4: Prevent race while waling extent tree
  2012-11-15 16:39           ` Lukáš Czerner
  2012-11-15 19:10             ` Zach Brown
@ 2012-11-19  3:24             ` Theodore Ts'o
  2012-11-19 11:11               ` Lukáš Czerner
  1 sibling, 1 reply; 16+ messages in thread
From: Theodore Ts'o @ 2012-11-19  3:24 UTC (permalink / raw)
  To: Lukáš Czerner; +Cc: Zach Brown, Peng Tao, linux-ext4, dmonakhov

On Thu, Nov 15, 2012 at 05:39:06PM +0100, Lukáš Czerner wrote:
> You're both right. The code has to be reorganized. The
> ext4_ext_fiemap_cb() is ugly as hell, but luckily that will be
> resolved with extent status tree patches from Zheng Liu, however the
> indirection in ext4_ext_walk_space() and the callback business
> is also pointless.

Hi Lukas,

Is your patch going to be dependent on whether the extent status tree
patches are applied?

If so, you might want to develop your patch versus the ext4 git tree's
master branch, which includes the extent status patches currently
queued for the next merge window.

Thanks!!

					- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH v3] ext4: Prevent race while waling extent tree
  2012-11-19  3:24             ` Theodore Ts'o
@ 2012-11-19 11:11               ` Lukáš Czerner
  0 siblings, 0 replies; 16+ messages in thread
From: Lukáš Czerner @ 2012-11-19 11:11 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Lukáš Czerner, Zach Brown, Peng Tao, linux-ext4,
	dmonakhov

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1129 bytes --]

On Sun, 18 Nov 2012, Theodore Ts'o wrote:

> Date: Sun, 18 Nov 2012 22:24:30 -0500
> From: Theodore Ts'o <tytso@mit.edu>
> To: Lukáš Czerner <lczerner@redhat.com>
> Cc: Zach Brown <zab@redhat.com>, Peng Tao <bergwolf@gmail.com>,
>     linux-ext4@vger.kernel.org, dmonakhov@openvz.org
> Subject: Re: [PATCH v3] ext4: Prevent race while waling extent tree
> 
> On Thu, Nov 15, 2012 at 05:39:06PM +0100, Lukáš Czerner wrote:
> > You're both right. The code has to be reorganized. The
> > ext4_ext_fiemap_cb() is ugly as hell, but luckily that will be
> > resolved with extent status tree patches from Zheng Liu, however the
> > indirection in ext4_ext_walk_space() and the callback business
> > is also pointless.
> 
> Hi Lukas,
> 
> Is your patch going to be dependent on whether the extent status tree
> patches are applied?
> 
> If so, you might want to develop your patch versus the ext4 git tree's
> master branch, which includes the extent status patches currently
> queued for the next merge window.
> 
> Thanks!!
> 
> 					- Ted

Yes, there is a small conflict. I'll base it on top of the ext4 git
tree.

Thanks!
-Lukas


end of thread, other threads:[~2012-11-19 11:11 UTC | newest]

Thread overview: 16+ messages
2012-11-12 14:57 [PATCH] ext4: Prevent race while waling extent tree Lukas Czerner
2012-11-13  8:22 ` [PATCH v3] " Lukas Czerner
2012-11-13 11:34   ` Peng Tao
2012-11-13 12:07     ` Lukáš Czerner
2012-11-13 14:19       ` Peng Tao
2012-11-13 18:51         ` Zach Brown
2012-11-15 16:39           ` Lukáš Czerner
2012-11-15 19:10             ` Zach Brown
2012-11-19  3:24             ` Theodore Ts'o
2012-11-19 11:11               ` Lukáš Czerner
  -- strict thread matches above, loose matches on Subject: below --
2012-11-08 11:08 [PATCH] " Lukas Czerner
2012-11-08 12:01 ` Dmitry Monakhov
2012-11-08 13:43   ` Lukáš Czerner
2012-11-08 16:07     ` Lukáš Czerner
2012-11-08 21:52 ` Zach Brown
2012-11-09  9:19   ` Lukáš Czerner
