Stray 4k extents with slow buffered writes

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Stray 4k extents with slow buffered writes
@ 2016-03-03 12:28 Holger Hoffstätte
  2016-03-03 18:33 ` Liu Bo
  2016-03-03 20:55 ` Chris Mason
  0 siblings, 2 replies; 9+ messages in thread
From: Holger Hoffstätte @ 2016-03-03 12:28 UTC (permalink / raw)
  To: linux-btrfs


Here's an observation that is not a bug (as in data corruption), just
somewhat odd and unnecessary behaviour. It could be considered  a
performance or scalability bug.

I've noticed that slow slow buffered writes create a huge number of
unnecessary 4k sized extents. At first I wrote it off as odd buffering
behaviour of the application (a download manager), but it can be easily
reproduced. For example:

holger>wget --limit-rate=1m https://cdn.kernel.org/pub/linux/kernel/v4.x/testing/linux-4.5-rc6.tar.xz
(..downloads with 1 MB/s..)
holger>sync
holger>filefrag -ek linux-4.5-rc6.tar.xz 
Filesystem type is: 9123683e
File size of linux-4.5-rc6.tar.xz is 88362576 (86292 blocks of 1024 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..   14219:  230807476.. 230821695:  14220:            
   1:    14220..   14223:  230838148.. 230838151:      4:  230821696:
   2:    14224..   29215:  230822324.. 230837315:  14992:  230838152:
   3:    29216..   44207:  230838152.. 230853143:  14992:  230837316:
   4:    44208..   44211:  230869576.. 230869579:      4:  230853144:
   5:    44212..   59199:  230853968.. 230868955:  14988:  230869580:
   6:    59200..   74191:  230869588.. 230884579:  14992:  230868956:
   7:    74192..   74195:  230898332.. 230898335:      4:  230884580:
   8:    74196..   86291:  230885620.. 230897715:  12096:  230898336: last,eof
linux-4.5-rc6.tar.xz: 9 extents found

Slower writes will generate even more extents; another ~200MB file
had >900 extents.

As expected defragment will collapse these stray extents with their
successors:

holger>btrfs filesystem defragment linux-4.5-rc6.tar.xz                                              
holger>sync
holger>filefrag -ek linux-4.5-rc6.tar.xz           
Filesystem type is: 9123683e
File size of linux-4.5-rc6.tar.xz is 88362576 (86292 blocks of 1024 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..   14219:  230807476.. 230821695:  14220:            
   1:    14220..   29215:  230922128.. 230937123:  14996:  230821696:
   2:    29216..   44207:  230838152.. 230853143:  14992:  230937124:
   3:    44208..   59199:  230937124.. 230952115:  14992:  230853144:
   4:    59200..   74191:  230869588.. 230884579:  14992:  230952116:
   5:    74192..   86291:  230952116.. 230964215:  12100:  230884580: last,eof
linux-4.5-rc6.tar.xz: 6 extents found

The obviously page-sized 4k extents happen to coincide with the 30s tx commit
(2 * ~15 MB at 1 MB/s). It looks like a benign race, as if the last dirty page
gets special treatment instead of being merged into wherever it should go.
That just seems wasteful to me.

Anyone got an idea?

-h

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Stray 4k extents with slow buffered writes
  2016-03-03 12:28 Stray 4k extents with slow buffered writes Holger Hoffstätte
@ 2016-03-03 18:33 ` Liu Bo
  2016-03-03 19:53   ` Holger Hoffstätte
  2016-03-03 20:55 ` Chris Mason
  1 sibling, 1 reply; 9+ messages in thread
From: Liu Bo @ 2016-03-03 18:33 UTC (permalink / raw)
  To: Holger Hoffstätte; +Cc: linux-btrfs

On Thu, Mar 03, 2016 at 01:28:29PM +0100, Holger Hoffstätte wrote:
> 
> Here's an observation that is not a bug (as in data corruption), just
> somewhat odd and unnecessary behaviour. It could be considered  a
> performance or scalability bug.
> 
> I've noticed that slow slow buffered writes create a huge number of
> unnecessary 4k sized extents. At first I wrote it off as odd buffering
> behaviour of the application (a download manager), but it can be easily
> reproduced. For example:
> 
> holger>wget --limit-rate=1m https://cdn.kernel.org/pub/linux/kernel/v4.x/testing/linux-4.5-rc6.tar.xz
> (..downloads with 1 MB/s..)
> holger>sync
> holger>filefrag -ek linux-4.5-rc6.tar.xz 
> Filesystem type is: 9123683e
> File size of linux-4.5-rc6.tar.xz is 88362576 (86292 blocks of 1024 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..   14219:  230807476.. 230821695:  14220:            
>    1:    14220..   14223:  230838148.. 230838151:      4:  230821696:
>    2:    14224..   29215:  230822324.. 230837315:  14992:  230838152:
>    3:    29216..   44207:  230838152.. 230853143:  14992:  230837316:
>    4:    44208..   44211:  230869576.. 230869579:      4:  230853144:
>    5:    44212..   59199:  230853968.. 230868955:  14988:  230869580:
>    6:    59200..   74191:  230869588.. 230884579:  14992:  230868956:
>    7:    74192..   74195:  230898332.. 230898335:      4:  230884580:
>    8:    74196..   86291:  230885620.. 230897715:  12096:  230898336: last,eof
> linux-4.5-rc6.tar.xz: 9 extents found
> 
> Slower writes will generate even more extents; another ~200MB file
> had >900 extents.
> 
> As expected defragment will collapse these stray extents with their
> successors:
> 
> holger>btrfs filesystem defragment linux-4.5-rc6.tar.xz                                              
> holger>sync
> holger>filefrag -ek linux-4.5-rc6.tar.xz           
> Filesystem type is: 9123683e
> File size of linux-4.5-rc6.tar.xz is 88362576 (86292 blocks of 1024 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..   14219:  230807476.. 230821695:  14220:            
>    1:    14220..   29215:  230922128.. 230937123:  14996:  230821696:
>    2:    29216..   44207:  230838152.. 230853143:  14992:  230937124:
>    3:    44208..   59199:  230937124.. 230952115:  14992:  230853144:
>    4:    59200..   74191:  230869588.. 230884579:  14992:  230952116:
>    5:    74192..   86291:  230952116.. 230964215:  12100:  230884580: last,eof
> linux-4.5-rc6.tar.xz: 6 extents found
> 
> The obviously page-sized 4k extents happen to coincide with the 30s tx commit
> (2 * ~15 MB at 1 MB/s). It looks like a benign race, as if the last dirty page
> gets special treatment instead of being merged into wherever it should go.
> That just seems wasteful to me.
> 
> Anyone got an idea?

On a new fresh btrfs, I cannot reproduce the fragmented layout with "wget --limit-rate=1m",

[root@10-11-17-236 btrfs]# filefrag -v -b linux-4.5-rc6.tar.xz             
Filesystem type is: 9123683e
File size of linux-4.5-rc6.tar.xz is 88362576 (86292 blocks, blocksize
1024)
 ext logical physical expected length flags
   0       0   143744            5264 
   1    5264   149008           35884 
   2   41148   220848   184892      4 
   3   41152   184896   220852  35948 
   4   77100   220852   220844   9192 eof
linux-4.5-rc6.tar.xz: 4 extents found

My mount point has,
/dev/loop0 /mnt/btrfs btrfs rw,seclabel,relatime,space_cache,subvolid=5,subvol=/ 0 0

And I'm using 4.5.0-rc4.

# btrfs fi show /mnt/btrfs
Label: none  uuid: 599d68ae-b874-4db1-a3c3-33c4b783bfdd
        Total devices 1 FS bytes used 94.62MiB
        devid    1 size 2.00GiB used 436.75MiB path /dev/loop0

# btrfs fi df /mnt/btrfs
Data, single: total=216.00MiB, used=94.40MiB
System, DUP: total=8.00MiB, used=16.00KiB
Metadata, DUP: total=102.38MiB, used=208.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B

Can you gather your mount options and 'btrfs fi show/df' output?

Thanks,

-liubo

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Stray 4k extents with slow buffered writes
  2016-03-03 18:33 ` Liu Bo
@ 2016-03-03 19:53   ` Holger Hoffstätte
  2016-03-03 20:47     ` Austin S. Hemmelgarn
  2016-03-04 12:17     ` Duncan
  0 siblings, 2 replies; 9+ messages in thread
From: Holger Hoffstätte @ 2016-03-03 19:53 UTC (permalink / raw)
  To: bo.li.liu; +Cc: linux-btrfs

On 03/03/16 19:33, Liu Bo wrote:
> On Thu, Mar 03, 2016 at 01:28:29PM +0100, Holger Hoffstätte wrote:
(..)
>> I've noticed that slow slow buffered writes create a huge number of
>> unnecessary 4k sized extents. At first I wrote it off as odd buffering
>> behaviour of the application (a download manager), but it can be easily
>> reproduced. For example:
> 
> On a new fresh btrfs, I cannot reproduce the fragmented layout with "wget --limit-rate=1m",

For better effect lower the bandwidth, 100k or so.

> [root@10-11-17-236 btrfs]# filefrag -v -b linux-4.5-rc6.tar.xz             
> Filesystem type is: 9123683e
> File size of linux-4.5-rc6.tar.xz is 88362576 (86292 blocks, blocksize
> 1024)
>  ext logical physical expected length flags
>    0       0   143744            5264 
>    1    5264   149008           35884 
>    2   41148   220848   184892      4

So you also have one, after ~35 MB. See below.

>    3   41152   184896   220852  35948 
>    4   77100   220852   220844   9192 eof
> linux-4.5-rc6.tar.xz: 4 extents found

No sync? filefrag is a notorious liar. ;)

It changes things because you likely have a higher value set for
vm/dirty_expire_centisecs or dirty_bytes explicitly configured; I have
it set to 1000 (10s) to prevent large writebacks from choking everything.
The default is probably still 30s aka 3000.

I understand that I should get smaller extents overall, but not the stray
4k sized ones in regular intervals.

> Can you gather your mount options and 'btrfs fi show/df' output?

I can reproduce that on another machine/drive where it also initially
didn't show the 4k extents in a parallel-running filefrag, but did
after a sync (when the extents were written). That was surprising.

Anyway, it's just an external scratch drive..the mount options really
don't matter much:

$mount | grep sdf
/dev/sdf1 on /mnt/usb type btrfs (rw,relatime,space_cache=v2,subvolid=5,subvol=/)

$btrfs fi df /mnt/usb
Data, single: total=4.00GiB, used=3.31GiB
System, single: total=32.00MiB, used=16.00KiB
Metadata, single: total=1.00GiB, used=4.45MiB
GlobalReserve, single: total=16.00MiB, used=0.00B

$btrfs fi show /mnt/usb
Label: 'Test'  uuid: 1d37a067-5b7d-4dcf-b2c1-7c5745b9c7a5
	Total devices 1 FS bytes used 3.32GiB
	devid    1 size 111.79GiB used 5.03GiB path /dev/sdf1

I then remounted with -ocommit=300 and set dirty_expire_centisecs=10000
(100s). That results in a single large extent, even after sync, so
writeback expiry and commit definitely play a part.

Here is what it looks like when both dirty_expire and commit are set
to very low 5s:

$filefrag -ek linux-4.4.4.tar.bz2
Filesystem type is: 9123683e
File size of linux-4.4.4.tar.bz2 is 105008928 (102548 blocks of 1024 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..    5199:  227197920.. 227203119:   5200:            
   1:     5200..    5203:  227169600.. 227169603:      4:  227203120:
   2:     5204..   15407:  227203124.. 227213327:  10204:  227169604:
   3:    15408..   20623:  227213332.. 227218547:   5216:  227213328:
   4:    20624..   20627:  227169604.. 227169607:      4:  227218548:
   5:    20628..   30831:  227218552.. 227228755:  10204:  227169608:
   6:    30832..   36047:  227228760.. 227233975:   5216:  227228756:
   7:    36048..   36051:  227169608.. 227169611:      4:  227233976:
   8:    36052..   41263:  227233980.. 227239191:   5212:  227169612:
   9:    41264..   46479:  227271164.. 227276379:   5216:  227239192:
  10:    46480..   46483:  227239196.. 227239199:      4:  227276380:
  11:    46484..   51695:  227276384.. 227281595:   5212:  227239200:
  12:    51696..   61903:  227281600.. 227291807:  10208:  227281596:
  13:    61904..   61907:  227239200.. 227239203:      4:  227291808:
  14:    61908..   67119:  227291812.. 227297023:   5212:  227239204:
  15:    67120..   77327:  227297028.. 227307235:  10208:  227297024:
  16:    77328..   77331:  227239204.. 227239207:      4:  227307236:
  17:    77332..   82543:  227307240.. 227312451:   5212:  227239208:
  18:    82544..   92751:  227312456.. 227322663:  10208:  227312452:
  19:    92752..   92755:  227239208.. 227239211:      4:  227322664:
  20:    92756..   97967:  227322668.. 227327879:   5212:  227239212:
  21:    97968..  102547:  227239212.. 227243791:   4580:  227327880: last,eof
linux-4.4.4.tar.bz2: 22 extents found

There's definitely a pattern here.

Out of curiosity I also tried the above run with autodefrag enabled, and
that helped a little bit: it merges those 4k extents into 256k-sized ones
with the adjacent followup extent. That was nice, but still a bit unexpected
since we've been told autodefrag is for random writes.
It also doesn't really explain the original behaviour.

I guess I need to add autodefrag everywhere now. :)

Thanks,
Holger


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Stray 4k extents with slow buffered writes
  2016-03-03 19:53   ` Holger Hoffstätte
@ 2016-03-03 20:47     ` Austin S. Hemmelgarn
  2016-03-03 21:50       ` Holger Hoffstätte
  2016-03-04 12:17     ` Duncan
  1 sibling, 1 reply; 9+ messages in thread
From: Austin S. Hemmelgarn @ 2016-03-03 20:47 UTC (permalink / raw)
  To: Holger Hoffstätte, bo.li.liu; +Cc: linux-btrfs

On 2016-03-03 14:53, Holger Hoffstätte wrote:
> On 03/03/16 19:33, Liu Bo wrote:
>> On Thu, Mar 03, 2016 at 01:28:29PM +0100, Holger Hoffstätte wrote:
> (..)
>>> I've noticed that slow slow buffered writes create a huge number of
>>> unnecessary 4k sized extents. At first I wrote it off as odd buffering
>>> behaviour of the application (a download manager), but it can be easily
>>> reproduced. For example:
>>
>> On a new fresh btrfs, I cannot reproduce the fragmented layout with "wget --limit-rate=1m",
>
> For better effect lower the bandwidth, 100k or so.
>
>> [root@10-11-17-236 btrfs]# filefrag -v -b linux-4.5-rc6.tar.xz
>> Filesystem type is: 9123683e
>> File size of linux-4.5-rc6.tar.xz is 88362576 (86292 blocks, blocksize
>> 1024)
>>   ext logical physical expected length flags
>>     0       0   143744            5264
>>     1    5264   149008           35884
>>     2   41148   220848   184892      4
>
> So you also have one, after ~35 MB. See below.
>
>>     3   41152   184896   220852  35948
>>     4   77100   220852   220844   9192 eof
>> linux-4.5-rc6.tar.xz: 4 extents found
>
> No sync? filefrag is a notorious liar. ;)
>
> It changes things because you likely have a higher value set for
> vm/dirty_expire_centisecs or dirty_bytes explicitly configured; I have
> it set to 1000 (10s) to prevent large writebacks from choking everything.
> The default is probably still 30s aka 3000.
Last I looked (about a month ago), the default was still 3000.
>
> I understand that I should get smaller extents overall, but not the stray
> 4k sized ones in regular intervals.
>
>> Can you gather your mount options and 'btrfs fi show/df' output?
>
> I can reproduce that on another machine/drive where it also initially
> didn't show the 4k extents in a parallel-running filefrag, but did
> after a sync (when the extents were written). That was surprising.
>
> Anyway, it's just an external scratch drive..the mount options really
> don't matter much:
>
> $mount | grep sdf
> /dev/sdf1 on /mnt/usb type btrfs (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
Do you still see the same behavior with the old space_cache format? 
This appears to be an issue of space management and allocation, so this 
may be playing a part.
>
> $btrfs fi df /mnt/usb
> Data, single: total=4.00GiB, used=3.31GiB
> System, single: total=32.00MiB, used=16.00KiB
> Metadata, single: total=1.00GiB, used=4.45MiB
> GlobalReserve, single: total=16.00MiB, used=0.00B
>
> $btrfs fi show /mnt/usb
> Label: 'Test'  uuid: 1d37a067-5b7d-4dcf-b2c1-7c5745b9c7a5
> 	Total devices 1 FS bytes used 3.32GiB
> 	devid    1 size 111.79GiB used 5.03GiB path /dev/sdf1
>
> I then remounted with -ocommit=300 and set dirty_expire_centisecs=10000
> (100s). That results in a single large extent, even after sync, so
> writeback expiry and commit definitely play a part.
>
> Here is what it looks like when both dirty_expire and commit are set
> to very low 5s:
I'd be somewhat curious to see if something similar happens on other 
filesystems with such low writeback timeouts.  My thought in this case 
is that the issue is that BTRFS's allocator isn't smart enough to try 
and merge new extents into existing ones when possible.
>
> $filefrag -ek linux-4.4.4.tar.bz2
> Filesystem type is: 9123683e
> File size of linux-4.4.4.tar.bz2 is 105008928 (102548 blocks of 1024 bytes)
>   ext:     logical_offset:        physical_offset: length:   expected: flags:
>     0:        0..    5199:  227197920.. 227203119:   5200:
>     1:     5200..    5203:  227169600.. 227169603:      4:  227203120:
>     2:     5204..   15407:  227203124.. 227213327:  10204:  227169604:
>     3:    15408..   20623:  227213332.. 227218547:   5216:  227213328:
>     4:    20624..   20627:  227169604.. 227169607:      4:  227218548:
>     5:    20628..   30831:  227218552.. 227228755:  10204:  227169608:
>     6:    30832..   36047:  227228760.. 227233975:   5216:  227228756:
>     7:    36048..   36051:  227169608.. 227169611:      4:  227233976:
>     8:    36052..   41263:  227233980.. 227239191:   5212:  227169612:
>     9:    41264..   46479:  227271164.. 227276379:   5216:  227239192:
>    10:    46480..   46483:  227239196.. 227239199:      4:  227276380:
>    11:    46484..   51695:  227276384.. 227281595:   5212:  227239200:
>    12:    51696..   61903:  227281600.. 227291807:  10208:  227281596:
>    13:    61904..   61907:  227239200.. 227239203:      4:  227291808:
>    14:    61908..   67119:  227291812.. 227297023:   5212:  227239204:
>    15:    67120..   77327:  227297028.. 227307235:  10208:  227297024:
>    16:    77328..   77331:  227239204.. 227239207:      4:  227307236:
>    17:    77332..   82543:  227307240.. 227312451:   5212:  227239208:
>    18:    82544..   92751:  227312456.. 227322663:  10208:  227312452:
>    19:    92752..   92755:  227239208.. 227239211:      4:  227322664:
>    20:    92756..   97967:  227322668.. 227327879:   5212:  227239212:
>    21:    97968..  102547:  227239212.. 227243791:   4580:  227327880: last,eof
> linux-4.4.4.tar.bz2: 22 extents found
>
> There's definitely a pattern here.
What I find particularly interesting here is that the small extents 
appear to be packed out of order into the spaces being left between the 
bigger ones.  For something that you don't need super fast access to, 
this is actually a good thing because it reduces free space 
fragmentation, but BTRFS has no way of knowing whether this trade off is 
worth it for that particular file.
>
> Out of curiosity I also tried the above run with autodefrag enabled, and
> that helped a little bit: it merges those 4k extents into 256k-sized ones
> with the adjacent followup extent. That was nice, but still a bit unexpected
> since we've been told autodefrag is for random writes.
> It also doesn't really explain the original behaviour.
>
> I guess I need to add autodefrag everywhere now. :)


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Stray 4k extents with slow buffered writes
  2016-03-03 20:47     ` Austin S. Hemmelgarn
@ 2016-03-03 21:50       ` Holger Hoffstätte
  2016-03-03 22:13         ` Liu Bo
  0 siblings, 1 reply; 9+ messages in thread
From: Holger Hoffstätte @ 2016-03-03 21:50 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: linux-btrfs

On 03/03/16 21:47, Austin S. Hemmelgarn wrote:
>> $mount | grep sdf
>> /dev/sdf1 on /mnt/usb type btrfs (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
> Do you still see the same behavior with the old space_cache format?
> This appears to be an issue of space management and allocation, so
> this may be playing a part.

I just did the clear_cache,space_cache=v1 dance. Now a download with
bandwidth-limit=1M, dirty_expire=20s, commit=30 and *no* autodefrag
first ended up looking like this:

$filefrag -ek linux-4.5-rc6.tar.xz 
Filesystem type is: 9123683e
File size of linux-4.5-rc6.tar.xz is 88362576 (86292 blocks of 1024 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..    7427:  227197920.. 227205347:   7428:            
   1:     7428..   33027:  227205348.. 227230947:  25600:            
   2:    33028..   53011:  227271164.. 227291147:  19984:  227230948:
   3:    53012..   72995:  227291148.. 227311131:  19984:            
   4:    72996..   86291:  227311132.. 227324427:  13296:             last,eof
linux-4.5-rc6.tar.xz: 2 extents found

Yay! But wait, there's more!

$sync
$filefrag -ek linux-4.5-rc6.tar.xz
Filesystem type is: 9123683e
File size of linux-4.5-rc6.tar.xz is 88362576 (86292 blocks of 1024 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..    7423:  227197920.. 227205343:   7424:            
   1:     7424..    7427:  227169600.. 227169603:      4:  227205344:
   2:     7428..   33023:  227205348.. 227230943:  25596:  227169604:
   3:    33024..   33027:  227169604.. 227169607:      4:  227230944:
   4:    33028..   53007:  227271164.. 227291143:  19980:  227169608:
   5:    53008..   53011:  227230948.. 227230951:      4:  227291144:
   6:    53012..   72991:  227291148.. 227311127:  19980:  227230952:
   7:    72992..   72995:  227230952.. 227230955:      4:  227311128:
   8:    72996..   86291:  227311132.. 227324427:  13296:  227230956: last,eof
linux-4.5-rc6.tar.xz: 9 extents found

Now I'm like ¯\(ツ)/¯

With autodefrag the same happens, though it then eventually does the
merging from 4k -> 256k. I went searching for that hardcoded 256k value
and found it as default in ioctl.c:btrfs_defrag_file() when no threshold
has been passed, as is the case for autodefrag. I'll try to increase that
and see how much I can destroy.

Also, rsync with --bwlimit=1m does _not_ seem to create files like this:

$rsync (..)
$filefrag -ek linux-4.4.4.tar.bz2 
Filesystem type is: 9123683e
File size of linux-4.4.4.tar.bz2 is 105008928 (102548 blocks of 1024 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..    4095:  227197920.. 227202015:   4096:            
   1:     4096..   25599:  227202016.. 227223519:  21504:            
   2:    25600..   51199:  227271164.. 227296763:  25600:  227223520:
   3:    51200..   76799:  227296764.. 227322363:  25600:            
   4:    76800..  102547:  227322364.. 227348111:  25748:             last,eof
linux-4.4.4.tar.bz2: 2 extents found

Which looks exactly as one would expect, probably - as Chris' mail
just explained - it doesn't use O_APPEND, whereas wget apparently does.

> I'd be somewhat curious to see if something similar happens on other
> filesystems with such low writeback timeouts.  My thought in this
> case is that the issue is that BTRFS's allocator isn't smart enough
> to try and merge new extents into existing ones when possible.

ext4 creates 1-2 extents, regardless of method.

Holger


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Stray 4k extents with slow buffered writes
  2016-03-03 21:50       ` Holger Hoffstätte
@ 2016-03-03 22:13         ` Liu Bo
  2016-03-04  1:37           ` Liu Bo
  0 siblings, 1 reply; 9+ messages in thread
From: Liu Bo @ 2016-03-03 22:13 UTC (permalink / raw)
  To: Holger Hoffstätte; +Cc: Austin S. Hemmelgarn, linux-btrfs

On Thu, Mar 03, 2016 at 10:50:58PM +0100, Holger Hoffstätte wrote:
> On 03/03/16 21:47, Austin S. Hemmelgarn wrote:
> >> $mount | grep sdf
> >> /dev/sdf1 on /mnt/usb type btrfs (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
> > Do you still see the same behavior with the old space_cache format?
> > This appears to be an issue of space management and allocation, so
> > this may be playing a part.
> 
> I just did the clear_cache,space_cache=v1 dance. Now a download with
> bandwidth-limit=1M, dirty_expire=20s, commit=30 and *no* autodefrag
> first ended up looking like this:
> 
> $filefrag -ek linux-4.5-rc6.tar.xz 
> Filesystem type is: 9123683e
> File size of linux-4.5-rc6.tar.xz is 88362576 (86292 blocks of 1024 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..    7427:  227197920.. 227205347:   7428:            
>    1:     7428..   33027:  227205348.. 227230947:  25600:            
>    2:    33028..   53011:  227271164.. 227291147:  19984:  227230948:
>    3:    53012..   72995:  227291148.. 227311131:  19984:            
>    4:    72996..   86291:  227311132.. 227324427:  13296:             last,eof
> linux-4.5-rc6.tar.xz: 2 extents found
> 
> Yay! But wait, there's more!
> 
> $sync
> $filefrag -ek linux-4.5-rc6.tar.xz
> Filesystem type is: 9123683e
> File size of linux-4.5-rc6.tar.xz is 88362576 (86292 blocks of 1024 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..    7423:  227197920.. 227205343:   7424:            
>    1:     7424..    7427:  227169600.. 227169603:      4:  227205344:
>    2:     7428..   33023:  227205348.. 227230943:  25596:  227169604:
>    3:    33024..   33027:  227169604.. 227169607:      4:  227230944:
>    4:    33028..   53007:  227271164.. 227291143:  19980:  227169608:
>    5:    53008..   53011:  227230948.. 227230951:      4:  227291144:
>    6:    53012..   72991:  227291148.. 227311127:  19980:  227230952:
>    7:    72992..   72995:  227230952.. 227230955:      4:  227311128:
>    8:    72996..   86291:  227311132.. 227324427:  13296:  227230956: last,eof
> linux-4.5-rc6.tar.xz: 9 extents found
> 
> Now I'm like ¯\(ツ)/¯

Yeah, after sync, I also get this file layout.

> 
> With autodefrag the same happens, though it then eventually does the
> merging from 4k -> 256k. I went searching for that hardcoded 256k value
> and found it as default in ioctl.c:btrfs_defrag_file() when no threshold
> has been passed, as is the case for autodefrag. I'll try to increase that
> and see how much I can destroy.
> 
> Also, rsync with --bwlimit=1m does _not_ seem to create files like this:
> 
> $rsync (..)
> $filefrag -ek linux-4.4.4.tar.bz2 
> Filesystem type is: 9123683e
> File size of linux-4.4.4.tar.bz2 is 105008928 (102548 blocks of 1024 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..    4095:  227197920.. 227202015:   4096:            
>    1:     4096..   25599:  227202016.. 227223519:  21504:            
>    2:    25600..   51199:  227271164.. 227296763:  25600:  227223520:
>    3:    51200..   76799:  227296764.. 227322363:  25600:            
>    4:    76800..  102547:  227322364.. 227348111:  25748:             last,eof
> linux-4.4.4.tar.bz2: 2 extents found
> 
> Which looks exactly as one would expect, probably - as Chris' mail
> just explained - it doesn't use O_APPEND, whereas wget apparently does.

Interesting, my strace log shows wget doesn't open the file with O_APPEND.

open("linux-4.5-rc6.tar.xz", O_WRONLY|O_CREAT|O_EXCL, 0666) = 4

Thanks,

-liubo

> 
> > I'd be somewhat curious to see if something similar happens on other
> > filesystems with such low writeback timeouts.  My thought in this
> > case is that the issue is that BTRFS's allocator isn't smart enough
> > to try and merge new extents into existing ones when possible.
> 
> ext4 creates 1-2 extents, regardless of method.
> 
> Holger
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Stray 4k extents with slow buffered writes
  2016-03-03 22:13         ` Liu Bo
@ 2016-03-04  1:37           ` Liu Bo
  0 siblings, 0 replies; 9+ messages in thread
From: Liu Bo @ 2016-03-04  1:37 UTC (permalink / raw)
  To: Holger Hoffstätte; +Cc: Austin S. Hemmelgarn, linux-btrfs, Chris Mason

On Thu, Mar 03, 2016 at 02:13:09PM -0800, Liu Bo wrote:
> On Thu, Mar 03, 2016 at 10:50:58PM +0100, Holger Hoffstätte wrote:
> > On 03/03/16 21:47, Austin S. Hemmelgarn wrote:
> > >> $mount | grep sdf
> > >> /dev/sdf1 on /mnt/usb type btrfs (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
> > > Do you still see the same behavior with the old space_cache format?
> > > This appears to be an issue of space management and allocation, so
> > > this may be playing a part.
> > 
> > I just did the clear_cache,space_cache=v1 dance. Now a download with
> > bandwidth-limit=1M, dirty_expire=20s, commit=30 and *no* autodefrag
> > first ended up looking like this:
> > 
> > $filefrag -ek linux-4.5-rc6.tar.xz 
> > Filesystem type is: 9123683e
> > File size of linux-4.5-rc6.tar.xz is 88362576 (86292 blocks of 1024 bytes)
> >  ext:     logical_offset:        physical_offset: length:   expected: flags:
> >    0:        0..    7427:  227197920.. 227205347:   7428:            
> >    1:     7428..   33027:  227205348.. 227230947:  25600:            
> >    2:    33028..   53011:  227271164.. 227291147:  19984:  227230948:
> >    3:    53012..   72995:  227291148.. 227311131:  19984:            
> >    4:    72996..   86291:  227311132.. 227324427:  13296:             last,eof
> > linux-4.5-rc6.tar.xz: 2 extents found
> > 
> > Yay! But wait, there's more!
> > 
> > $sync
> > $filefrag -ek linux-4.5-rc6.tar.xz
> > Filesystem type is: 9123683e
> > File size of linux-4.5-rc6.tar.xz is 88362576 (86292 blocks of 1024 bytes)
> >  ext:     logical_offset:        physical_offset: length:   expected: flags:
> >    0:        0..    7423:  227197920.. 227205343:   7424:            
> >    1:     7424..    7427:  227169600.. 227169603:      4:  227205344:
> >    2:     7428..   33023:  227205348.. 227230943:  25596:  227169604:
> >    3:    33024..   33027:  227169604.. 227169607:      4:  227230944:
> >    4:    33028..   53007:  227271164.. 227291143:  19980:  227169608:
> >    5:    53008..   53011:  227230948.. 227230951:      4:  227291144:
> >    6:    53012..   72991:  227291148.. 227311127:  19980:  227230952:
> >    7:    72992..   72995:  227230952.. 227230955:      4:  227311128:
> >    8:    72996..   86291:  227311132.. 227324427:  13296:  227230956: last,eof
> > linux-4.5-rc6.tar.xz: 9 extents found
> > 
> > Now I'm like ¯\(ツ)/¯
> 
> Yeah, after sync, I also get this file layout.

OK...I think I've found why we get this weird layout, it's because btrfs
applies COW for overwrites while ext4 just updates it in place.

Here is my filefrag output after sync,

# !filefrag                                                                     
filefrag -vb /mnt/btrfs/linux-4.5-rc6.tar.xz   
Filesystem type is: 9123683e
File size of /mnt/btrfs/linux-4.5-rc6.tar.xz is 88362576 (86292 blocks,
blocksize 1024)
 ext logical physical expected length flags
   0       0    12352            5020 
   1    5020    17376    17372      4 
   2    5024   133504    17380  30908 
   3   35932   195296   164412      4 
   4   35936   164416   195300  30876 
   5   66812   195300   195292  19480 eof
/mnt/btrfs/linux-4.5-rc6.tar.xz: 6 extents found

And the output of btrfs_dirty_pages, I grep for the first 4k single extent,
# trace-cmd report -i /tmp/trace.dat | grep "dirty_page" | grep $((5020 << 10)) -A 2 -B 2
wget-29482 [003] 783746.039682: bprint: btrfs_dirty_pages: page start 5124096 end 5132287
wget-29482 [003] 783746.039771: bprint: btrfs_dirty_pages: page start 5128192 end 5144575
wget-29482 [003] 783746.263238: bprint: btrfs_dirty_pages: page start 5140480 end 5148671
wget-29482 [003] 783746.263304: bprint: btrfs_dirty_pages: page start 5144576 end 5160959
wget-29482 [003] 783746.263546: bprint: btrfs_dirty_pages: page start 5156864 end 5165055


So it turns out to be that wget writes the data as an overlapped way,
extent [5140480, 4096) is written twice, and the second write to the
extent can trigger a COW write when the first write to the extent has
finish the endio.

With mount -onodatacow,

# !filefrag                                                                     
filefrag -vb /mnt/btrfs/linux-4.5-rc6.tar.xz    
Filesystem type is: 9123683e
File size of /mnt/btrfs/linux-4.5-rc6.tar.xz is 88362576 (86292 blocks,
blocksize 1024)
 ext logical physical expected length flags
   0       0    12416            5292 
   1    5292   133504    17708  35872 
   2   41164   169376           30880 
   3   72044   200256           14248 eof
/mnt/btrfs/linux-4.5-rc6.tar.xz: 2 extents found


Anyway it's not due to any btrfs allocator bug (although I was thinking it
was and trying to find it out...).

Thanks,

-liubo

> 
> > 
> > With autodefrag the same happens, though it then eventually does the
> > merging from 4k -> 256k. I went searching for that hardcoded 256k value
> > and found it as default in ioctl.c:btrfs_defrag_file() when no threshold
> > has been passed, as is the case for autodefrag. I'll try to increase that
> > and see how much I can destroy.
> > 
> > Also, rsync with --bwlimit=1m does _not_ seem to create files like this:
> > 
> > $rsync (..)
> > $filefrag -ek linux-4.4.4.tar.bz2 
> > Filesystem type is: 9123683e
> > File size of linux-4.4.4.tar.bz2 is 105008928 (102548 blocks of 1024 bytes)
> >  ext:     logical_offset:        physical_offset: length:   expected: flags:
> >    0:        0..    4095:  227197920.. 227202015:   4096:            
> >    1:     4096..   25599:  227202016.. 227223519:  21504:            
> >    2:    25600..   51199:  227271164.. 227296763:  25600:  227223520:
> >    3:    51200..   76799:  227296764.. 227322363:  25600:            
> >    4:    76800..  102547:  227322364.. 227348111:  25748:             last,eof
> > linux-4.4.4.tar.bz2: 2 extents found
> > 
> > Which looks exactly as one would expect, probably - as Chris' mail
> > just explained - it doesn't use O_APPEND, whereas wget apparently does.
> 
> Interesting, my strace log shows wget doesn't open the file with O_APPEND.
> 
> open("linux-4.5-rc6.tar.xz", O_WRONLY|O_CREAT|O_EXCL, 0666) = 4
> 
> Thanks,
> 
> -liubo
> 
> > 
> > > I'd be somewhat curious to see if something similar happens on other
> > > filesystems with such low writeback timeouts.  My thought in this
> > > case is that the issue is that BTRFS's allocator isn't smart enough
> > > to try and merge new extents into existing ones when possible.
> > 
> > ext4 creates 1-2 extents, regardless of method.
> > 
> > Holger
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Stray 4k extents with slow buffered writes
  2016-03-03 19:53   ` Holger Hoffstätte
  2016-03-03 20:47     ` Austin S. Hemmelgarn
@ 2016-03-04 12:17     ` Duncan
  1 sibling, 0 replies; 9+ messages in thread
From: Duncan @ 2016-03-04 12:17 UTC (permalink / raw)
  To: linux-btrfs

Holger Hoffstätte posted on Thu, 03 Mar 2016 20:53:57 +0100 as excerpted:

> It changes things because you likely have a higher value set for
> vm/dirty_expire_centisecs or dirty_bytes explicitly configured; I have
> it set to 1000 (10s) to prevent large writebacks from choking
> everything.
> The default is probably still 30s aka 3000.

[This is a comment on write-cache management strategy in the above 
context, not on the buffered write 4k extents issue.]

Interesting strategy.

I left the expire (aka foreground priority) timeout at its default 30 
seconds, here, and actually upped the writeback (background priority) 
timeout to 10 seconds, from its default 5.

I don't like large writebacks choking things either, but I chose to use 
the size knobs (after all, the descriptor at issue here is "large", not 
"old") to deal with that, not the time knobs.  As a result, I set the 
background ratio to 1% (it defaults to 5%), and foreground ratio to 3% 
(it defaults to 10%).

With 16 gig of RAM, that's ~160 MiB background, about five seconds worth 
at the 30 MiB/sec random write that's a reasonable if perhaps a bit 
pessimistic estimate for spinning rust, ~ half a GiB foreground (15 
seconds worth).  Tho obviously as I'm at 1%, I'd have to switch to the 
bytes instead of ratio knobs if I were to up my memory or want to go 
lower or wanted more precise knobs.

It just seems to me more logical to fiddle the size knob if I'm worried 
about the /size/ of the writeback, and with size now dealt with and off 
the table, I decided doubling the low priority background writeback 
timeout to 10 seconds was equally logical, giving writes additional time 
to accumulate, for better write combining and elevator ordering, before 
actually starting low priority writeback, since if sizes started 
ballooning the lower size thresholds would trigger writeback anyway.  And 
with size already dealt with, the higher priority foreground expiry 
default of 30 seconds was fine, so I didn't change that at all.

And, it seems to me, that might be part of your problem, since you're 
forcing high priority writeback at only 10 seconds, even if it's under 
the 4K block size, thus forcing COW rewrites when more data is appended 
to those blocks, with COW predictably playing havoc with your extents, 
now that you're forcing sub-4K writes to storage at high priority so fast 
they have little chance to be consolidated into whole 4K block writes.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Stray 4k extents with slow buffered writes
  2016-03-03 12:28 Stray 4k extents with slow buffered writes Holger Hoffstätte
  2016-03-03 18:33 ` Liu Bo
@ 2016-03-03 20:55 ` Chris Mason
  1 sibling, 0 replies; 9+ messages in thread
From: Chris Mason @ 2016-03-03 20:55 UTC (permalink / raw)
  To: Holger Hoffstätte; +Cc: linux-btrfs

On Thu, Mar 03, 2016 at 01:28:29PM +0100, Holger Hoffstätte wrote:
> 
> Here's an observation that is not a bug (as in data corruption), just
> somewhat odd and unnecessary behaviour. It could be considered  a
> performance or scalability bug.
> 
> I've noticed that slow slow buffered writes create a huge number of
> unnecessary 4k sized extents. At first I wrote it off as odd buffering
> behaviour of the application (a download manager), but it can be easily
> reproduced. For example:

We saw this here with slowly appending log files.  The basic problem is
the VM triggers dirty writeback on the tail end of the file and either
starts on the last page or our clustering code pulls in the last page.

It leads to latencies because we start writing the last page in the file
to disk and the application has to wait for the IO to finish before it
can append to the file again.

I'll get our patches in for the next merge window, it's basically just:
don't write the last incomplete page in the file if we're O_APPEND and
it isn't a data integrity writeback.  We may want to rework it to drop
the O_APPEND check.

-chris

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2016-03-04 12:17 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-03-03 12:28 Stray 4k extents with slow buffered writes Holger Hoffstätte
2016-03-03 18:33 ` Liu Bo
2016-03-03 19:53   ` Holger Hoffstätte
2016-03-03 20:47     ` Austin S. Hemmelgarn
2016-03-03 21:50       ` Holger Hoffstätte
2016-03-03 22:13         ` Liu Bo
2016-03-04  1:37           ` Liu Bo
2016-03-04 12:17     ` Duncan
2016-03-03 20:55 ` Chris Mason

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).