[PATCH] Btrfs: do not move em to modified list when unpinning

* [PATCH] Btrfs: do not move em to modified list when unpinning
@ 2014-11-14 21:16 Josef Bacik
  2014-11-18  7:13 ` Liu Bo
  2014-11-19  3:45 ` Dave Chinner
  0 siblings, 2 replies; 5+ messages in thread
From: Josef Bacik @ 2014-11-14 21:16 UTC (permalink / raw)
  To: linux-btrfs; +Cc: stable

We use the modified list to keep track of which extents have been modified so we
know which ones are candidates for logging at fsync() time.  Newly modified
extents are added to the list at modification time, around the same time the
ordered extent is created.  We do this so that we don't have to wait for ordered
extents to complete before we know what we need to log.  The problem is when
something like this happens:

log extent 0-4k on inode 1
copy csum for 0-4k from ordered extent into log
sync log
commit transaction
log some other extent on inode 1
ordered extent for 0-4k completes and adds itself onto modified list again
log changed extents
see ordered extent for 0-4k has already been logged
	at this point we assume the csum has been copied
sync log
crash

On replay we will see the extent 0-4k in the log and drop the original 0-4k
extent, which is the same one that we are replaying.  Dropping it also drops the
csum, and then we won't find the csum in the log for that bytenr.  This of
course causes errors about missing csums for certain ranges of our inode.  So
remove the modified list manipulation in unpin_extent_cache(); any modified
extents should have been added well before now, and we don't want them
re-logged.  This fixes the test with which I could reliably reproduce this
problem.  Thanks,

cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/extent_map.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 225302b..6a98bdd 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -287,8 +287,6 @@ int unpin_extent_cache(struct extent_map_tree *tree, u64 start, u64 len,
 	if (!em)
 		goto out;

-	if (!test_bit(EXTENT_FLAG_LOGGING, &em->flags))
-		list_move(&em->list, &tree->modified_extents);
 	em->generation = gen;
 	clear_bit(EXTENT_FLAG_PINNED, &em->flags);
 	em->mod_start = em->start;
-- 
1.8.3.1
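
The sequence in the commit message can be walked through with a small userspace
model.  This is only a sketch, not kernel code: every toy_* name is
hypothetical, the two fsync calls are collapsed onto a single extent, and the
EXTENT_FLAG_LOGGING check from the removed lines is left out.  The `requeue`
argument stands in for the list_move() that the patch deletes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an extent map plus its ordered extent. */
struct toy_extent {
	bool pinned;              /* models EXTENT_FLAG_PINNED */
	bool on_modified_list;    /* models membership in modified_extents */
	bool ordered_csum_logged; /* ordered extent already gave its csum to a log */
	unsigned long long generation;
};

/* Hypothetical stand-in for the contents of the tree log. */
struct toy_log {
	bool has_extent;
	bool has_csum;
};

/* Write path: the extent is queued for logging at modification time. */
static void toy_write(struct toy_extent *em)
{
	em->pinned = true;
	em->on_modified_list = true;
	em->ordered_csum_logged = false;
	em->generation = (unsigned long long)-1;
}

/* fsync: log a queued extent; the csum is copied only the first time. */
static void toy_fsync(struct toy_extent *em, struct toy_log *log)
{
	if (!em->on_modified_list)
		return;
	log->has_extent = true;
	if (!em->ordered_csum_logged) {
		log->has_csum = true;
		em->ordered_csum_logged = true;
	}
	em->on_modified_list = false;
}

/* Transaction commit: the old log contents are gone. */
static void toy_commit(struct toy_log *log)
{
	log->has_extent = false;
	log->has_csum = false;
}

/* Ordered extent completion; requeue=true models the pre-patch behavior. */
static void toy_unpin(struct toy_extent *em, unsigned long long gen, bool requeue)
{
	assert(em->pinned);
	if (requeue)
		em->on_modified_list = true;
	em->generation = gen;
	em->pinned = false;
}

/* Returns true when the final log holds the extent but not its csum. */
static bool toy_scenario_loses_csum(bool requeue)
{
	struct toy_extent em = { 0 };
	struct toy_log log = { 0 };

	toy_write(&em);
	toy_fsync(&em, &log);       /* log extent and csum, sync log */
	toy_commit(&log);           /* commit transaction */
	toy_unpin(&em, 2, requeue); /* ordered extent completes */
	toy_fsync(&em, &log);       /* log changed extents again */
	return log.has_extent && !log.has_csum;
}
```

In this model, toy_scenario_loses_csum(true) leaves the extent in the log
without its csum, which is the state that breaks replay, while
toy_scenario_loses_csum(false) leaves the second log untouched.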

* Re: [PATCH] Btrfs: do not move em to modified list when unpinning
  2014-11-14 21:16 [PATCH] Btrfs: do not move em to modified list when unpinning Josef Bacik
@ 2014-11-18  7:13 ` Liu Bo
  2014-11-18 16:03   ` Josef Bacik
  2014-11-19  3:45 ` Dave Chinner
  1 sibling, 1 reply; 5+ messages in thread
From: Liu Bo @ 2014-11-18  7:13 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, stable

On Fri, Nov 14, 2014 at 04:16:30PM -0500, Josef Bacik wrote:
> We use the modified list to keep track of which extents have been modified so we
> know which ones are candidates for logging at fsync() time.  Newly modified
> extents are added to the list at modification time, around the same time the
> ordered extent is created.  We do this so that we don't have to wait for ordered
> extents to complete before we know what we need to log.  The problem is when
> something like this happens
> 
> log extent 0-4k on inode 1
> copy csum for 0-4k from ordered extent into log
> sync log
> commit transaction
> log some other extent on inode 1
> ordered extent for 0-4k completes and adds itself onto modified list again
> log changed extents
> see ordered extent for 0-4k has already been logged
> 	at this point we assume the csum has been copied
> sync log
> crash
> 
> On replay we will see the extent 0-4k in the log, drop the original 0-4k extent
> which is the same one that we are replaying which also drops the csum, and then
> we won't find the csum in the log for that bytenr.  This of course causes us to
> have errors about not having csums for certain ranges of our inode.  So remove
> the modified list manipulation in unpin_extent_cache, any modified extents
> should have been added well before now, and we don't want them re-logged.  This
> fixes my test that I could reliably reproduce this problem with.  Thanks,

This will make em->generation remain -1 in the above case, no?

Csum is copied after the ordered extent is set with "IO_DONE", but "sync log"
happens before unpin_extent_cache(), so if we don't have it 're-logged',
em->generation will not get updated, and so the btrfs_file_extent_item's
generation will not be updated.

thanks,
-liubo

> 
> cc: stable@vger.kernel.org
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
>  fs/btrfs/extent_map.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
> index 225302b..6a98bdd 100644
> --- a/fs/btrfs/extent_map.c
> +++ b/fs/btrfs/extent_map.c
> @@ -287,8 +287,6 @@ int unpin_extent_cache(struct extent_map_tree *tree, u64 start, u64 len,
>  	if (!em)
>  		goto out;
>  
> -	if (!test_bit(EXTENT_FLAG_LOGGING, &em->flags))
> -		list_move(&em->list, &tree->modified_extents);
>  	em->generation = gen;
>  	clear_bit(EXTENT_FLAG_PINNED, &em->flags);
>  	em->mod_start = em->start;
> -- 
> 1.8.3.1
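
Liu Bo's concern about the generation can be illustrated with a similar toy
model.  It is a sketch under the assumption that the logged file extent item
takes a snapshot of em->generation at log time, as the reply implies; the
toy_* names and the generation value 7 are hypothetical and nothing here is
taken from the kernel sources.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_GEN_UNSET ((unsigned long long)-1)

/* Hypothetical stand-in for an extent map. */
struct toy_em {
	unsigned long long generation;
	bool on_modified_list;
};

/* Hypothetical stand-in for the file extent item written to the log. */
struct toy_logged_item {
	bool present;
	unsigned long long generation; /* snapshot taken at log time */
};

/* Log a queued extent, copying whatever generation it has right now. */
static void toy_log_extent(struct toy_em *em, struct toy_logged_item *item)
{
	if (!em->on_modified_list)
		return;
	item->present = true;
	item->generation = em->generation;
	em->on_modified_list = false;
}

/* Ordered extent completion; requeue=true models the pre-patch behavior. */
static void toy_unpin(struct toy_em *em, unsigned long long gen, bool requeue)
{
	em->generation = gen;
	if (requeue)
		em->on_modified_list = true;
}

/* Generation stored in the logged item after log, unpin, log again. */
static unsigned long long toy_logged_generation(bool requeue)
{
	struct toy_em em = { .generation = TOY_GEN_UNSET, .on_modified_list = true };
	struct toy_logged_item item = { 0 };

	toy_log_extent(&em, &item); /* fsync before the ordered extent finishes */
	assert(item.present);
	toy_unpin(&em, 7, requeue); /* em->generation itself is always updated */
	toy_log_extent(&em, &item); /* a later fsync */
	return item.generation;
}
```

With requeue=false the in-memory generation is updated by the unpin, but the
logged copy keeps the unset value because nothing logs the extent a second
time; with requeue=true the second log refreshes it.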

* Re: [PATCH] Btrfs: do not move em to modified list when unpinning
  2014-11-18  7:13 ` Liu Bo
@ 2014-11-18 16:03   ` Josef Bacik
  0 siblings, 0 replies; 5+ messages in thread
From: Josef Bacik @ 2014-11-18 16:03 UTC (permalink / raw)
  To: bo.li.liu; +Cc: linux-btrfs

On 11/18/2014 02:13 AM, Liu Bo wrote:
> On Fri, Nov 14, 2014 at 04:16:30PM -0500, Josef Bacik wrote:
>> We use the modified list to keep track of which extents have been modified so we
>> know which ones are candidates for logging at fsync() time.  Newly modified
>> extents are added to the list at modification time, around the same time the
>> ordered extent is created.  We do this so that we don't have to wait for ordered
>> extents to complete before we know what we need to log.  The problem is when
>> something like this happens
>>
>> log extent 0-4k on inode 1
>> copy csum for 0-4k from ordered extent into log
>> sync log
>> commit transaction
>> log some other extent on inode 1
>> ordered extent for 0-4k completes and adds itself onto modified list again
>> log changed extents
>> see ordered extent for 0-4k has already been logged
>> 	at this point we assume the csum has been copied
>> sync log
>> crash
>>
>> On replay we will see the extent 0-4k in the log, drop the original 0-4k extent
>> which is the same one that we are replaying which also drops the csum, and then
>> we won't find the csum in the log for that bytenr.  This of course causes us to
>> have errors about not having csums for certain ranges of our inode.  So remove
>> the modified list manipulation in unpin_extent_cache, any modified extents
>> should have been added well before now, and we don't want them re-logged.  This
>> fixes my test that I could reliably reproduce this problem with.  Thanks,
>
> This will make em->generation remain -1 in the above case, no?
>
> Csum is copied after ordered extent is set with "IO_DONE", but before
> unpin_extent_cache(), "sync log" happens, so if we don't have it 're-logged',
> em->generation will not get updated so that the btrfs_file_extent_item's generation will not be
> updated.
>

Huh, this is a good point, but it brings up another horrible thing I just
realized: if we die directly after that transaction commit, we'll lose the
extent anyway, because we'll drop the tree log and the extent won't actually
be in its actual tree.  Woo, I love this stuff.

Josef

* Re: [PATCH] Btrfs: do not move em to modified list when unpinning
  2014-11-14 21:16 [PATCH] Btrfs: do not move em to modified list when unpinning Josef Bacik
  2014-11-18  7:13 ` Liu Bo
@ 2014-11-19  3:45 ` Dave Chinner
  2014-11-19 14:57   ` Josef Bacik
  1 sibling, 1 reply; 5+ messages in thread
From: Dave Chinner @ 2014-11-19  3:45 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, stable

On Fri, Nov 14, 2014 at 04:16:30PM -0500, Josef Bacik wrote:
> We use the modified list to keep track of which extents have been modified so we
> know which ones are candidates for logging at fsync() time.  Newly modified
> extents are added to the list at modification time, around the same time the
> ordered extent is created.  We do this so that we don't have to wait for ordered
> extents to complete before we know what we need to log.  The problem is when
> something like this happens
> 
> log extent 0-4k on inode 1
> copy csum for 0-4k from ordered extent into log
> sync log
> commit transaction
> log some other extent on inode 1
> ordered extent for 0-4k completes and adds itself onto modified list again
> log changed extents
> see ordered extent for 0-4k has already been logged
> 	at this point we assume the csum has been copied
> sync log
> crash
> 
> On replay we will see the extent 0-4k in the log, drop the original 0-4k extent
> which is the same one that we are replaying which also drops the csum, and then
> we won't find the csum in the log for that bytenr.  This of course causes us to
> have errors about not having csums for certain ranges of our inode.  So remove
> the modified list manipulation in unpin_extent_cache, any modified extents
> should have been added well before now, and we don't want them re-logged.  This
> fixes my test that I could reliably reproduce this problem with.  Thanks,

Is it possible to turn this unspecified test into another
generic fsync xfstest?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH] Btrfs: do not move em to modified list when unpinning
  2014-11-19  3:45 ` Dave Chinner
@ 2014-11-19 14:57   ` Josef Bacik
  0 siblings, 0 replies; 5+ messages in thread
From: Josef Bacik @ 2014-11-19 14:57 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-btrfs, stable

On 11/18/2014 10:45 PM, Dave Chinner wrote:
> On Fri, Nov 14, 2014 at 04:16:30PM -0500, Josef Bacik wrote:
>> We use the modified list to keep track of which extents have been modified so we
>> know which ones are candidates for logging at fsync() time.  Newly modified
>> extents are added to the list at modification time, around the same time the
>> ordered extent is created.  We do this so that we don't have to wait for ordered
>> extents to complete before we know what we need to log.  The problem is when
>> something like this happens
>>
>> log extent 0-4k on inode 1
>> copy csum for 0-4k from ordered extent into log
>> sync log
>> commit transaction
>> log some other extent on inode 1
>> ordered extent for 0-4k completes and adds itself onto modified list again
>> log changed extents
>> see ordered extent for 0-4k has already been logged
>> 	at this point we assume the csum has been copied
>> sync log
>> crash
>>
>> On replay we will see the extent 0-4k in the log, drop the original 0-4k extent
>> which is the same one that we are replaying which also drops the csum, and then
>> we won't find the csum in the log for that bytenr.  This of course causes us to
>> have errors about not having csums for certain ranges of our inode.  So remove
>> the modified list manipulation in unpin_extent_cache, any modified extents
>> should have been added well before now, and we don't want them re-logged.  This
>> fixes my test that I could reliably reproduce this problem with.  Thanks,
>
> Is it possible to turn this unspecified test into another
> generic fsync xfstest?
>

It depends on a new dm target I'm working on to better test power fail
scenarios; once I have that merged I have a few xfstests I'll be
submitting in this area.  Would you actually mind taking a quick look at
it to make sure it seems sane?

https://git.kernel.org/cgit/linux/kernel/git/josef/btrfs-next.git/log/?h=dm-powerfail

The 'split' option is what is meant for ext* and xfs (I haven't tested 
that part yet), which will just return the old data in the case of 
unflushed data/metadata.  Anything you'd like to see added or changed? 
Thanks,

Josef
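
The idea of returning the old data for anything unflushed can be pictured with
a toy block device.  This is not the dm target from the link above and says
nothing about how it is implemented; it is a minimal sketch built only from the
one-line description, and the toy_dev_* names and the four-block size are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_BLOCKS 4

/* Hypothetical device: one byte per block, with a volatile write cache. */
struct toy_dev {
	char durable[TOY_BLOCKS]; /* contents that survive a power failure */
	char cache[TOY_BLOCKS];   /* contents written but not yet flushed */
	bool dirty[TOY_BLOCKS];   /* which blocks have unflushed writes */
};

static void toy_dev_init(struct toy_dev *d)
{
	memset(d, 0, sizeof(*d));
}

/* A write lands in the cache only. */
static void toy_dev_write(struct toy_dev *d, int blk, char v)
{
	assert(blk >= 0 && blk < TOY_BLOCKS);
	d->cache[blk] = v;
	d->dirty[blk] = true;
}

/* A read sees the newest data, flushed or not. */
static char toy_dev_read(const struct toy_dev *d, int blk)
{
	assert(blk >= 0 && blk < TOY_BLOCKS);
	return d->dirty[blk] ? d->cache[blk] : d->durable[blk];
}

/* A flush makes every cached write durable. */
static void toy_dev_flush(struct toy_dev *d)
{
	for (int i = 0; i < TOY_BLOCKS; i++) {
		if (d->dirty[i]) {
			d->durable[i] = d->cache[i];
			d->dirty[i] = false;
		}
	}
}

/* Power failure: unflushed writes vanish, so reads return the old data. */
static void toy_dev_powerfail(struct toy_dev *d)
{
	memset(d->dirty, 0, sizeof(d->dirty));
}
```

A test built on such a device would write, flush, write again, cut power, and
then check that only the flushed contents are visible.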
