Linux Btrfs filesystem development
From: Nikolay Borisov <nborisov@suse.com>
To: Qu Wenruo <wqu@suse.com>, linux-btrfs@vger.kernel.org
Subject: Re: [PATCH v3 2/2] btrfs: extent-tree: Ensure we trim ranges across block group boundary
Date: Wed, 23 Oct 2019 18:41:14 +0300
Message-ID: <1322eb4c-5d7a-29c0-befc-952a012f1bcc@suse.com>
In-Reply-To: <20191023135727.64358-3-wqu@suse.com>



On 23.10.19 16:57, Qu Wenruo wrote:
> [BUG]
> When deleting large files (which cross block group boundary) with discard
> mount option, we find some btrfs_discard_extent() calls only trimmed part
> of its space, not the whole range:
> 
>   btrfs_discard_extent: type=0x1 start=19626196992 len=2144530432 trimmed=1073741824 ratio=50%
> 
> type:		bbio->map_type, in above case, it's SINGLE DATA.
> start:		Logical address of this trim
> len:		Logical length of this trim
> trimmed:	Physically trimmed bytes
> ratio:		trimmed / len
> 
> Thus leaving some unused space not discarded.
> 
> [CAUSE]
> When discard mount option is specified, after a transaction is fully
> committed (super block written to disk), we begin to cleanup pinned
> extents in the following call chain:
> 
> btrfs_commit_transaction()
> |- write_all_supers()

You can remove write_all_supers

> |- btrfs_finish_extent_commit()
>    |- find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY);
>    |- btrfs_discard_extent()
> 
> However pinned extents are recorded in an extent_io_tree, which can
> merge adjacent extent states.
> 
> When a large file gets deleted and it has adjacent file extents across
> a block group boundary, we will get a large merged range.

This is not quite right: the ranges will only get merged if the extents are
contiguous across the bg boundary (this is very important!)

> 
> Then when we pass the large range into btrfs_discard_extent(),
> btrfs_discard_extent() will just trim the first part, without trimming
> the remaining part.

Here is what my testing shows: 

mkfs.btrfs -f /dev/vdc

mount -onodatasum,nospace_cache /dev/vdc /media/scratch/
xfs_io -f -c "pwrite 0 800m" /media/scratch/file1 && sync
xfs_io -f -c "pwrite 0 300m" /media/scratch/file2 && sync
umount /media/scratch

mount -odiscard /dev/vdc /media/scratch
rm -f /media/scratch/file2 && sync
trace-cmd show

umount /media/scratch

The output I get in trace-cmd is: 

sync-1014  [001] ....   534.272310: btrfs_finish_extent_commit: Discarding 1943011328-2077229055 (len: 134217728)
sync-1014  [001] ....   534.272315: btrfs_discard_extent: Requested to discard: 134217728 but discarded: 134217728

sync-1014  [001] ....   534.272325: btrfs_finish_extent_commit: Discarding 2177892352-2358247423 (len: 180355072)
sync-1014  [001] ....   534.272330: btrfs_discard_extent: Requested to discard: 180355072 but discarded: 180355072

The extents of this file look like this in the extent tree prior to the trim: 

item 18 key (1943011328 EXTENT_ITEM 134217728) itemoff 15523 itemsize 53
		refs 1 gen 7 flags DATA
		extent data backref root FS_TREE objectid 258 offset 0 count 1
item 19 key (2177892352 EXTENT_ITEM 134217728) itemoff 15470 itemsize 53
		refs 1 gen 7 flags DATA
		extent data backref root FS_TREE objectid 258 offset 134217728 count 1
item 20 key (2177892352 BLOCK_GROUP_ITEM 1073741824) itemoff 15446 itemsize 24
		block group used 180355072 chunk_objectid 256 flags DATA
item 21 key (2312110080 EXTENT_ITEM 46137344) itemoff 15393 itemsize 53
		refs 1 gen 7 flags DATA
		extent data backref root FS_TREE objectid 258 offset 268435456 count 1

So we have 3 extents, 1 of which is in bg1 and the other 2 in bg2. The 2 extents in bg2 are
merged with each other, but the range in bg1 ends at 2077229055 while bg2 only begins at
2177892352, so the two are not contiguous and hence there is no merging across the bg boundary.

This is where the contiguity requirement comes from.
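
To illustrate the merging, here is a rough Python model. It is not the kernel's
extent_io_tree code: merge_adjacent() is a hypothetical helper that only captures
"ranges that touch get merged". Fed with the 3 extents from the extent tree dump
above, it yields the same 2 ranges as the trace-cmd output:

```python
def merge_adjacent(ranges):
    """Merge inclusive (start, end) ranges that touch or overlap."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            # Touches or overlaps the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]


# Pinned extents of file2 as inclusive byte ranges (start, start + len - 1),
# taken from the EXTENT_ITEM keys in the dump above.
pinned = [
    (1943011328, 1943011328 + 134217728 - 1),  # item 18, in bg1
    (2177892352, 2177892352 + 134217728 - 1),  # item 19, in bg2
    (2312110080, 2312110080 + 46137344 - 1),   # item 21, in bg2
]

merged = merge_adjacent(pinned)
for start, end in merged:
    print(f"Discarding {start}-{end} (len: {end - start + 1})")
```

The gap between 2077229055 and 2177892352 is what keeps the first range separate.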

If I modify my test case with slightly different write offsets, such that bg1
is indeed filled and the next extent gets allocated in bg2, which is adjacent, then
the bug is reproduced:

mkfs.btrfs -f /dev/vdc

mount -onodatasum,nospace_cache /dev/vdc /media/scratch/
xfs_io -f -c "pwrite 0 800m" /media/scratch/file1 && sync
xfs_io -f -c "pwrite 0 224m" /media/scratch/file2 && sync
xfs_io -f -c "pwrite 224m 76m" /media/scratch/file2 && sync
umount /media/scratch

mount -odiscard /dev/vdc /media/scratch
rm -f /media/scratch/file2 && sync
trace-cmd show

umount /media/scratch

The 3 extents being created and subsequently deleted are: 

sync-799   [000] ....   313.938048: btrfs_update_block_group: Pinning 1943011328-2077229055
sync-799   [000] ....   313.938073: btrfs_update_block_group: Pinning 2077229056-2177892351 <- BG1 ends
sync-799   [000] ....   313.938116: btrfs_update_block_group: Pinning 2177892352-2257584127 <- BG2 begins

But we only get 1 discard request:

sync-798   [003] ....   154.077897: btrfs_finish_extent_commit: Discarding 1943011328-2257584127 (len: 314572800) <- this is the request passed to btrfs_discard_extent
sync-798   [003] ....   154.077901: btrfs_discard_extent: Discarding 234881024 length for bytenr: 1943011328 <- this is the actual range being discarded inside the for loop. 
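
A quick Python check of the numbers in the two trace lines above. The only inputs
are the requested range and the BG2 start from the pinning trace; the variable
names are made up for this illustration:

```python
MiB = 1024 * 1024

req_start, req_end = 1943011328, 2257584127  # range passed to btrfs_discard_extent
bg2_start = 2177892352                       # where BG2 begins in the pinning trace

requested = req_end - req_start + 1          # bytes we asked to discard
discarded = bg2_start - req_start            # bytes up to the bg boundary
missed = requested - discarded               # bytes in BG2 that were never trimmed

print(f"requested={requested // MiB}M discarded={discarded // MiB}M missed={missed // MiB}M")
```

The discarded length matches the 234881024 reported by the trace, i.e. the trim
stops exactly at the bg boundary and the 76m written into BG2 is left out.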

So the bug is genuine. I will test whether your patch fixes it and report back.

> Furthermore, this bug is not that reliably observed, as if the whole
> block group is empty, there will be another trim for that block group.

Not only because of this, but mainly because of the contiguity requirement.

> 
> So the most obvious way to find this missing trim is to delete large
> extents at a block group boundary without emptying the involved block groups.
> 
> [FIX]
> - Allow __btrfs_map_block_for_discard() to modify @length parameter
>   btrfs_map_block() uses its @length parameter to notify the caller how
>   many bytes are mapped in current call.
>   With __btrfs_map_block_for_discard() also modifying the @length,
>   btrfs_discard_extent() now understands when to do extra trim.
> 
> - Call btrfs_map_block() in a loop until we hit the range end
>   Since we now know how many bytes are mapped each time, we can iterate
>   through each block group boundary and issue correct trim for each
>   range.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
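
The two [FIX] steps above can be modelled roughly like this in Python. This is a
sketch of the idea only, not the patch: map_block_for_discard() and
discard_extent() are hypothetical stand-ins, and the block group layout assumes
BG1 is 1 GiB and ends where BG2 begins (only BG2's start and size appear in the
dumps in this mail):

```python
def map_block_for_discard(block_groups, start, length):
    """Hypothetical mapping step: return how many bytes of
    [start, start + length) lie in the block group containing start."""
    for bg_start, bg_len in block_groups:
        if bg_start <= start < bg_start + bg_len:
            return min(length, bg_start + bg_len - start)
    raise ValueError(f"no block group contains {start}")


def discard_extent(block_groups, start, length):
    """Call the mapping step in a loop until the range end is reached."""
    trimmed = []
    cur, end = start, start + length
    while cur < end:
        mapped = map_block_for_discard(block_groups, cur, end - cur)
        trimmed.append((cur, mapped))  # one trim per block group piece
        cur += mapped
    return trimmed


GiB = 1 << 30
# Assumed layout: two adjacent 1 GiB data block groups.
block_groups = [(2177892352 - GiB, GiB), (2177892352, GiB)]

pieces = discard_extent(block_groups, 1943011328, 314572800)
for piece_start, piece_len in pieces:
    print(f"trim start={piece_start} len={piece_len}")
```

With the merged 300m range from my second test, the model issues two trims, one
per block group, instead of stopping at the boundary.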

<snip>
