* Stray 4k extents with slow buffered writes @ 2016-03-03 12:28 Holger Hoffstätte 2016-03-03 18:33 ` Liu Bo 2016-03-03 20:55 ` Chris Mason 0 siblings, 2 replies; 9+ messages in thread From: Holger Hoffstätte @ 2016-03-03 12:28 UTC (permalink / raw) To: linux-btrfs Here's an observation that is not a bug (as in data corruption), just somewhat odd and unnecessary behaviour. It could be considered a performance or scalability bug. I've noticed that slow slow buffered writes create a huge number of unnecessary 4k sized extents. At first I wrote it off as odd buffering behaviour of the application (a download manager), but it can be easily reproduced. For example: holger>wget --limit-rate=1m https://cdn.kernel.org/pub/linux/kernel/v4.x/testing/linux-4.5-rc6.tar.xz (..downloads with 1 MB/s..) holger>sync holger>filefrag -ek linux-4.5-rc6.tar.xz Filesystem type is: 9123683e File size of linux-4.5-rc6.tar.xz is 88362576 (86292 blocks of 1024 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 14219: 230807476.. 230821695: 14220: 1: 14220.. 14223: 230838148.. 230838151: 4: 230821696: 2: 14224.. 29215: 230822324.. 230837315: 14992: 230838152: 3: 29216.. 44207: 230838152.. 230853143: 14992: 230837316: 4: 44208.. 44211: 230869576.. 230869579: 4: 230853144: 5: 44212.. 59199: 230853968.. 230868955: 14988: 230869580: 6: 59200.. 74191: 230869588.. 230884579: 14992: 230868956: 7: 74192.. 74195: 230898332.. 230898335: 4: 230884580: 8: 74196.. 86291: 230885620.. 230897715: 12096: 230898336: last,eof linux-4.5-rc6.tar.xz: 9 extents found Slower writes will generate even more extents; another ~200MB file had >900 extents. As expected defragment will collapse these stray extents with their successors: holger>btrfs filesystem defragment linux-4.5-rc6.tar.xz holger>sync holger>filefrag -ek linux-4.5-rc6.tar.xz Filesystem type is: 9123683e File size of linux-4.5-rc6.tar.xz is 88362576 (86292 blocks of 1024 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 14219: 230807476.. 230821695: 14220: 1: 14220.. 29215: 230922128.. 230937123: 14996: 230821696: 2: 29216.. 44207: 230838152.. 230853143: 14992: 230937124: 3: 44208.. 59199: 230937124.. 230952115: 14992: 230853144: 4: 59200.. 74191: 230869588.. 230884579: 14992: 230952116: 5: 74192.. 86291: 230952116.. 230964215: 12100: 230884580: last,eof linux-4.5-rc6.tar.xz: 6 extents found The obviously page-sized 4k extents happen to coincide with the 30s tx commit (2 * ~15 MB at 1 MB/s). It looks like a benign race, as if the last dirty page gets special treatment instead of being merged into wherever it should go. That just seems wasteful to me. Anyone got an idea? -h ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Stray 4k extents with slow buffered writes 2016-03-03 12:28 Stray 4k extents with slow buffered writes Holger Hoffstätte @ 2016-03-03 18:33 ` Liu Bo 2016-03-03 19:53 ` Holger Hoffstätte 2016-03-03 20:55 ` Chris Mason 1 sibling, 1 reply; 9+ messages in thread From: Liu Bo @ 2016-03-03 18:33 UTC (permalink / raw) To: Holger Hoffstätte; +Cc: linux-btrfs On Thu, Mar 03, 2016 at 01:28:29PM +0100, Holger Hoffstätte wrote: > > Here's an observation that is not a bug (as in data corruption), just > somewhat odd and unnecessary behaviour. It could be considered a > performance or scalability bug. > > I've noticed that slow slow buffered writes create a huge number of > unnecessary 4k sized extents. At first I wrote it off as odd buffering > behaviour of the application (a download manager), but it can be easily > reproduced. For example: > > holger>wget --limit-rate=1m https://cdn.kernel.org/pub/linux/kernel/v4.x/testing/linux-4.5-rc6.tar.xz > (..downloads with 1 MB/s..) > holger>sync > holger>filefrag -ek linux-4.5-rc6.tar.xz > Filesystem type is: 9123683e > File size of linux-4.5-rc6.tar.xz is 88362576 (86292 blocks of 1024 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 14219: 230807476.. 230821695: 14220: > 1: 14220.. 14223: 230838148.. 230838151: 4: 230821696: > 2: 14224.. 29215: 230822324.. 230837315: 14992: 230838152: > 3: 29216.. 44207: 230838152.. 230853143: 14992: 230837316: > 4: 44208.. 44211: 230869576.. 230869579: 4: 230853144: > 5: 44212.. 59199: 230853968.. 230868955: 14988: 230869580: > 6: 59200.. 74191: 230869588.. 230884579: 14992: 230868956: > 7: 74192.. 74195: 230898332.. 230898335: 4: 230884580: > 8: 74196.. 86291: 230885620.. 230897715: 12096: 230898336: last,eof > linux-4.5-rc6.tar.xz: 9 extents found > > Slower writes will generate even more extents; another ~200MB file > had >900 extents. > > As expected defragment will collapse these stray extents with their > successors: > > holger>btrfs filesystem defragment linux-4.5-rc6.tar.xz > holger>sync > holger>filefrag -ek linux-4.5-rc6.tar.xz > Filesystem type is: 9123683e > File size of linux-4.5-rc6.tar.xz is 88362576 (86292 blocks of 1024 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 14219: 230807476.. 230821695: 14220: > 1: 14220.. 29215: 230922128.. 230937123: 14996: 230821696: > 2: 29216.. 44207: 230838152.. 230853143: 14992: 230937124: > 3: 44208.. 59199: 230937124.. 230952115: 14992: 230853144: > 4: 59200.. 74191: 230869588.. 230884579: 14992: 230952116: > 5: 74192.. 86291: 230952116.. 230964215: 12100: 230884580: last,eof > linux-4.5-rc6.tar.xz: 6 extents found > > The obviously page-sized 4k extents happen to coincide with the 30s tx commit > (2 * ~15 MB at 1 MB/s). It looks like a benign race, as if the last dirty page > gets special treatment instead of being merged into wherever it should go. > That just seems wasteful to me. > > Anyone got an idea? On a new fresh btrfs, I cannot reproduce the fragmented layout with "wget --limit-rate=1m", [root@10-11-17-236 btrfs]# filefrag -v -b linux-4.5-rc6.tar.xz Filesystem type is: 9123683e File size of linux-4.5-rc6.tar.xz is 88362576 (86292 blocks, blocksize 1024) ext logical physical expected length flags 0 0 143744 5264 1 5264 149008 35884 2 41148 220848 184892 4 3 41152 184896 220852 35948 4 77100 220852 220844 9192 eof linux-4.5-rc6.tar.xz: 4 extents found My mount point has, /dev/loop0 /mnt/btrfs btrfs rw,seclabel,relatime,space_cache,subvolid=5,subvol=/ 0 0 And I'm using 4.5.0-rc4. # btrfs fi show /mnt/btrfs Label: none uuid: 599d68ae-b874-4db1-a3c3-33c4b783bfdd Total devices 1 FS bytes used 94.62MiB devid 1 size 2.00GiB used 436.75MiB path /dev/loop0 # btrfs fi df /mnt/btrfs Data, single: total=216.00MiB, used=94.40MiB System, DUP: total=8.00MiB, used=16.00KiB Metadata, DUP: total=102.38MiB, used=208.00KiB GlobalReserve, single: total=16.00MiB, used=0.00B Can you gather your mount options and 'btrfs fi show/df' output? Thanks, -liubo ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Stray 4k extents with slow buffered writes 2016-03-03 18:33 ` Liu Bo @ 2016-03-03 19:53 ` Holger Hoffstätte 2016-03-03 20:47 ` Austin S. Hemmelgarn 2016-03-04 12:17 ` Duncan 0 siblings, 2 replies; 9+ messages in thread From: Holger Hoffstätte @ 2016-03-03 19:53 UTC (permalink / raw) To: bo.li.liu; +Cc: linux-btrfs On 03/03/16 19:33, Liu Bo wrote: > On Thu, Mar 03, 2016 at 01:28:29PM +0100, Holger Hoffstätte wrote: (..) >> I've noticed that slow slow buffered writes create a huge number of >> unnecessary 4k sized extents. At first I wrote it off as odd buffering >> behaviour of the application (a download manager), but it can be easily >> reproduced. For example: > > On a new fresh btrfs, I cannot reproduce the fragmented layout with "wget --limit-rate=1m", For better effect lower the bandwidth, 100k or so. > [root@10-11-17-236 btrfs]# filefrag -v -b linux-4.5-rc6.tar.xz > Filesystem type is: 9123683e > File size of linux-4.5-rc6.tar.xz is 88362576 (86292 blocks, blocksize > 1024) > ext logical physical expected length flags > 0 0 143744 5264 > 1 5264 149008 35884 > 2 41148 220848 184892 4 So you also have one, after ~35 MB. See below. > 3 41152 184896 220852 35948 > 4 77100 220852 220844 9192 eof > linux-4.5-rc6.tar.xz: 4 extents found No sync? filefrag is a notorious liar. ;) It changes things because you likely have a higher value set for vm/dirty_expire_centisecs or dirty_bytes explicitly configured; I have it set to 1000 (10s) to prevent large writebacks from choking everything. The default is probably still 30s aka 3000. I understand that I should get smaller extents overall, but not the stray 4k sized ones in regular intervals. > Can you gather your mount options and 'btrfs fi show/df' output? I can reproduce that on another machine/drive where it also initially didn't show the 4k extents in a parallel-running filefrag, but did after a sync (when the extents were written). That was surprising. Anyway, it's just an external scratch drive..the mount options really don't matter much: $mount | grep sdf /dev/sdf1 on /mnt/usb type btrfs (rw,relatime,space_cache=v2,subvolid=5,subvol=/) $btrfs fi df /mnt/usb Data, single: total=4.00GiB, used=3.31GiB System, single: total=32.00MiB, used=16.00KiB Metadata, single: total=1.00GiB, used=4.45MiB GlobalReserve, single: total=16.00MiB, used=0.00B $btrfs fi show /mnt/usb Label: 'Test' uuid: 1d37a067-5b7d-4dcf-b2c1-7c5745b9c7a5 Total devices 1 FS bytes used 3.32GiB devid 1 size 111.79GiB used 5.03GiB path /dev/sdf1 I then remounted with -ocommit=300 and set dirty_expire_centisecs=10000 (100s). That results in a single large extent, even after sync, so writeback expiry and commit definitely play a part. Here is what it looks like when both dirty_expire and commit are set to very low 5s: $filefrag -ek linux-4.4.4.tar.bz2 Filesystem type is: 9123683e File size of linux-4.4.4.tar.bz2 is 105008928 (102548 blocks of 1024 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 5199: 227197920.. 227203119: 5200: 1: 5200.. 5203: 227169600.. 227169603: 4: 227203120: 2: 5204.. 15407: 227203124.. 227213327: 10204: 227169604: 3: 15408.. 20623: 227213332.. 227218547: 5216: 227213328: 4: 20624.. 20627: 227169604.. 227169607: 4: 227218548: 5: 20628.. 30831: 227218552.. 227228755: 10204: 227169608: 6: 30832.. 36047: 227228760.. 227233975: 5216: 227228756: 7: 36048.. 36051: 227169608.. 227169611: 4: 227233976: 8: 36052.. 41263: 227233980.. 227239191: 5212: 227169612: 9: 41264.. 46479: 227271164.. 227276379: 5216: 227239192: 10: 46480.. 46483: 227239196.. 227239199: 4: 227276380: 11: 46484.. 51695: 227276384.. 227281595: 5212: 227239200: 12: 51696.. 61903: 227281600.. 227291807: 10208: 227281596: 13: 61904.. 61907: 227239200.. 227239203: 4: 227291808: 14: 61908.. 67119: 227291812.. 227297023: 5212: 227239204: 15: 67120.. 77327: 227297028.. 227307235: 10208: 227297024: 16: 77328.. 77331: 227239204.. 227239207: 4: 227307236: 17: 77332.. 82543: 227307240.. 227312451: 5212: 227239208: 18: 82544.. 92751: 227312456.. 227322663: 10208: 227312452: 19: 92752.. 92755: 227239208.. 227239211: 4: 227322664: 20: 92756.. 97967: 227322668.. 227327879: 5212: 227239212: 21: 97968.. 102547: 227239212.. 227243791: 4580: 227327880: last,eof linux-4.4.4.tar.bz2: 22 extents found There's definitely a pattern here. Out of curiosity I also tried the above run with autodefrag enabled, and that helped a little bit: it merges those 4k extents into 256k-sized ones with the adjacent followup extent. That was nice, but still a bit unexpected since we've been told autodefrag is for random writes. It also doesn't really explain the original behaviour. I guess I need to add autodefrag everywhere now. :) Thanks, Holger ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Stray 4k extents with slow buffered writes 2016-03-03 19:53 ` Holger Hoffstätte @ 2016-03-03 20:47 ` Austin S. Hemmelgarn 2016-03-03 21:50 ` Holger Hoffstätte 2016-03-04 12:17 ` Duncan 1 sibling, 1 reply; 9+ messages in thread From: Austin S. Hemmelgarn @ 2016-03-03 20:47 UTC (permalink / raw) To: Holger Hoffstätte, bo.li.liu; +Cc: linux-btrfs On 2016-03-03 14:53, Holger Hoffstätte wrote: > On 03/03/16 19:33, Liu Bo wrote: >> On Thu, Mar 03, 2016 at 01:28:29PM +0100, Holger Hoffstätte wrote: > (..) >>> I've noticed that slow slow buffered writes create a huge number of >>> unnecessary 4k sized extents. At first I wrote it off as odd buffering >>> behaviour of the application (a download manager), but it can be easily >>> reproduced. For example: >> >> On a new fresh btrfs, I cannot reproduce the fragmented layout with "wget --limit-rate=1m", > > For better effect lower the bandwidth, 100k or so. > >> [root@10-11-17-236 btrfs]# filefrag -v -b linux-4.5-rc6.tar.xz >> Filesystem type is: 9123683e >> File size of linux-4.5-rc6.tar.xz is 88362576 (86292 blocks, blocksize >> 1024) >> ext logical physical expected length flags >> 0 0 143744 5264 >> 1 5264 149008 35884 >> 2 41148 220848 184892 4 > > So you also have one, after ~35 MB. See below. > >> 3 41152 184896 220852 35948 >> 4 77100 220852 220844 9192 eof >> linux-4.5-rc6.tar.xz: 4 extents found > > No sync? filefrag is a notorious liar. ;) > > It changes things because you likely have a higher value set for > vm/dirty_expire_centisecs or dirty_bytes explicitly configured; I have > it set to 1000 (10s) to prevent large writebacks from choking everything. > The default is probably still 30s aka 3000. Last I looked (about a month ago), the default was still 3000. > > I understand that I should get smaller extents overall, but not the stray > 4k sized ones in regular intervals. > >> Can you gather your mount options and 'btrfs fi show/df' output? > > I can reproduce that on another machine/drive where it also initially > didn't show the 4k extents in a parallel-running filefrag, but did > after a sync (when the extents were written). That was surprising. > > Anyway, it's just an external scratch drive..the mount options really > don't matter much: > > $mount | grep sdf > /dev/sdf1 on /mnt/usb type btrfs (rw,relatime,space_cache=v2,subvolid=5,subvol=/) Do you still see the same behavior with the old space_cache format? This appears to be an issue of space management and allocation, so this may be playing a part. > > $btrfs fi df /mnt/usb > Data, single: total=4.00GiB, used=3.31GiB > System, single: total=32.00MiB, used=16.00KiB > Metadata, single: total=1.00GiB, used=4.45MiB > GlobalReserve, single: total=16.00MiB, used=0.00B > > $btrfs fi show /mnt/usb > Label: 'Test' uuid: 1d37a067-5b7d-4dcf-b2c1-7c5745b9c7a5 > Total devices 1 FS bytes used 3.32GiB > devid 1 size 111.79GiB used 5.03GiB path /dev/sdf1 > > I then remounted with -ocommit=300 and set dirty_expire_centisecs=10000 > (100s). That results in a single large extent, even after sync, so > writeback expiry and commit definitely play a part. > > Here is what it looks like when both dirty_expire and commit are set > to very low 5s: I'd be somewhat curious to see if something similar happens on other filesystems with such low writeback timeouts. My thought in this case is that the issue is that BTRFS's allocator isn't smart enough to try and merge new extents into existing ones when possible. > > $filefrag -ek linux-4.4.4.tar.bz2 > Filesystem type is: 9123683e > File size of linux-4.4.4.tar.bz2 is 105008928 (102548 blocks of 1024 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 5199: 227197920.. 227203119: 5200: > 1: 5200.. 5203: 227169600.. 227169603: 4: 227203120: > 2: 5204.. 15407: 227203124.. 227213327: 10204: 227169604: > 3: 15408.. 20623: 227213332.. 227218547: 5216: 227213328: > 4: 20624.. 20627: 227169604.. 227169607: 4: 227218548: > 5: 20628.. 30831: 227218552.. 227228755: 10204: 227169608: > 6: 30832.. 36047: 227228760.. 227233975: 5216: 227228756: > 7: 36048.. 36051: 227169608.. 227169611: 4: 227233976: > 8: 36052.. 41263: 227233980.. 227239191: 5212: 227169612: > 9: 41264.. 46479: 227271164.. 227276379: 5216: 227239192: > 10: 46480.. 46483: 227239196.. 227239199: 4: 227276380: > 11: 46484.. 51695: 227276384.. 227281595: 5212: 227239200: > 12: 51696.. 61903: 227281600.. 227291807: 10208: 227281596: > 13: 61904.. 61907: 227239200.. 227239203: 4: 227291808: > 14: 61908.. 67119: 227291812.. 227297023: 5212: 227239204: > 15: 67120.. 77327: 227297028.. 227307235: 10208: 227297024: > 16: 77328.. 77331: 227239204.. 227239207: 4: 227307236: > 17: 77332.. 82543: 227307240.. 227312451: 5212: 227239208: > 18: 82544.. 92751: 227312456.. 227322663: 10208: 227312452: > 19: 92752.. 92755: 227239208.. 227239211: 4: 227322664: > 20: 92756.. 97967: 227322668.. 227327879: 5212: 227239212: > 21: 97968.. 102547: 227239212.. 227243791: 4580: 227327880: last,eof > linux-4.4.4.tar.bz2: 22 extents found > > There's definitely a pattern here. What I find particularly interesting here is that the small extents appear to be packed out of order into the spaces being left between the bigger ones. For something that you don't need super fast access to, this is actually a good thing because it reduces free space fragmentation, but BTRFS has no way of knowing whether this trade off is worth it for that particular file. > > Out of curiosity I also tried the above run with autodefrag enabled, and > that helped a little bit: it merges those 4k extents into 256k-sized ones > with the adjacent followup extent. That was nice, but still a bit unexpected > since we've been told autodefrag is for random writes. > It also doesn't really explain the original behaviour. > > I guess I need to add autodefrag everywhere now. :) ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Stray 4k extents with slow buffered writes 2016-03-03 20:47 ` Austin S. Hemmelgarn @ 2016-03-03 21:50 ` Holger Hoffstätte 2016-03-03 22:13 ` Liu Bo 0 siblings, 1 reply; 9+ messages in thread From: Holger Hoffstätte @ 2016-03-03 21:50 UTC (permalink / raw) To: Austin S. Hemmelgarn; +Cc: linux-btrfs On 03/03/16 21:47, Austin S. Hemmelgarn wrote: >> $mount | grep sdf >> /dev/sdf1 on /mnt/usb type btrfs (rw,relatime,space_cache=v2,subvolid=5,subvol=/) > Do you still see the same behavior with the old space_cache format? > This appears to be an issue of space management and allocation, so > this may be playing a part. I just did the clear_cache,space_cache=v1 dance. Now a download with bandwidth-limit=1M, dirty_expire=20s, commit=30 and *no* autodefrag first ended up looking like this: $filefrag -ek linux-4.5-rc6.tar.xz Filesystem type is: 9123683e File size of linux-4.5-rc6.tar.xz is 88362576 (86292 blocks of 1024 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 7427: 227197920.. 227205347: 7428: 1: 7428.. 33027: 227205348.. 227230947: 25600: 2: 33028.. 53011: 227271164.. 227291147: 19984: 227230948: 3: 53012.. 72995: 227291148.. 227311131: 19984: 4: 72996.. 86291: 227311132.. 227324427: 13296: last,eof linux-4.5-rc6.tar.xz: 2 extents found Yay! But wait, there's more! $sync $filefrag -ek linux-4.5-rc6.tar.xz Filesystem type is: 9123683e File size of linux-4.5-rc6.tar.xz is 88362576 (86292 blocks of 1024 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 7423: 227197920.. 227205343: 7424: 1: 7424.. 7427: 227169600.. 227169603: 4: 227205344: 2: 7428.. 33023: 227205348.. 227230943: 25596: 227169604: 3: 33024.. 33027: 227169604.. 227169607: 4: 227230944: 4: 33028.. 53007: 227271164.. 227291143: 19980: 227169608: 5: 53008.. 53011: 227230948.. 227230951: 4: 227291144: 6: 53012.. 72991: 227291148.. 227311127: 19980: 227230952: 7: 72992.. 72995: 227230952.. 227230955: 4: 227311128: 8: 72996.. 86291: 227311132.. 227324427: 13296: 227230956: last,eof linux-4.5-rc6.tar.xz: 9 extents found Now I'm like ¯\(ツ)/¯ With autodefrag the same happens, though it then eventually does the merging from 4k -> 256k. I went searching for that hardcoded 256k value and found it as default in ioctl.c:btrfs_defrag_file() when no threshold has been passed, as is the case for autodefrag. I'll try to increase that and see how much I can destroy. Also, rsync with --bwlimit=1m does _not_ seem to create files like this: $rsync (..) $filefrag -ek linux-4.4.4.tar.bz2 Filesystem type is: 9123683e File size of linux-4.4.4.tar.bz2 is 105008928 (102548 blocks of 1024 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 4095: 227197920.. 227202015: 4096: 1: 4096.. 25599: 227202016.. 227223519: 21504: 2: 25600.. 51199: 227271164.. 227296763: 25600: 227223520: 3: 51200.. 76799: 227296764.. 227322363: 25600: 4: 76800.. 102547: 227322364.. 227348111: 25748: last,eof linux-4.4.4.tar.bz2: 2 extents found Which looks exactly as one would expect, probably - as Chris' mail just explained - it doesn't use O_APPEND, whereas wget apparently does. > I'd be somewhat curious to see if something similar happens on other > filesystems with such low writeback timeouts. My thought in this > case is that the issue is that BTRFS's allocator isn't smart enough > to try and merge new extents into existing ones when possible. ext4 creates 1-2 extents, regardless of method. Holger ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Stray 4k extents with slow buffered writes 2016-03-03 21:50 ` Holger Hoffstätte @ 2016-03-03 22:13 ` Liu Bo 2016-03-04 1:37 ` Liu Bo 0 siblings, 1 reply; 9+ messages in thread From: Liu Bo @ 2016-03-03 22:13 UTC (permalink / raw) To: Holger Hoffstätte; +Cc: Austin S. Hemmelgarn, linux-btrfs On Thu, Mar 03, 2016 at 10:50:58PM +0100, Holger Hoffstätte wrote: > On 03/03/16 21:47, Austin S. Hemmelgarn wrote: > >> $mount | grep sdf > >> /dev/sdf1 on /mnt/usb type btrfs (rw,relatime,space_cache=v2,subvolid=5,subvol=/) > > Do you still see the same behavior with the old space_cache format? > > This appears to be an issue of space management and allocation, so > > this may be playing a part. > > I just did the clear_cache,space_cache=v1 dance. Now a download with > bandwidth-limit=1M, dirty_expire=20s, commit=30 and *no* autodefrag > first ended up looking like this: > > $filefrag -ek linux-4.5-rc6.tar.xz > Filesystem type is: 9123683e > File size of linux-4.5-rc6.tar.xz is 88362576 (86292 blocks of 1024 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 7427: 227197920.. 227205347: 7428: > 1: 7428.. 33027: 227205348.. 227230947: 25600: > 2: 33028.. 53011: 227271164.. 227291147: 19984: 227230948: > 3: 53012.. 72995: 227291148.. 227311131: 19984: > 4: 72996.. 86291: 227311132.. 227324427: 13296: last,eof > linux-4.5-rc6.tar.xz: 2 extents found > > Yay! But wait, there's more! > > $sync > $filefrag -ek linux-4.5-rc6.tar.xz > Filesystem type is: 9123683e > File size of linux-4.5-rc6.tar.xz is 88362576 (86292 blocks of 1024 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 7423: 227197920.. 227205343: 7424: > 1: 7424.. 7427: 227169600.. 227169603: 4: 227205344: > 2: 7428.. 33023: 227205348.. 227230943: 25596: 227169604: > 3: 33024.. 33027: 227169604.. 227169607: 4: 227230944: > 4: 33028.. 53007: 227271164.. 227291143: 19980: 227169608: > 5: 53008.. 53011: 227230948.. 227230951: 4: 227291144: > 6: 53012.. 72991: 227291148.. 227311127: 19980: 227230952: > 7: 72992.. 72995: 227230952.. 227230955: 4: 227311128: > 8: 72996.. 86291: 227311132.. 227324427: 13296: 227230956: last,eof > linux-4.5-rc6.tar.xz: 9 extents found > > Now I'm like ¯\(ツ)/¯ Yeah, after sync, I also get this file layout. > > With autodefrag the same happens, though it then eventually does the > merging from 4k -> 256k. I went searching for that hardcoded 256k value > and found it as default in ioctl.c:btrfs_defrag_file() when no threshold > has been passed, as is the case for autodefrag. I'll try to increase that > and see how much I can destroy. > > Also, rsync with --bwlimit=1m does _not_ seem to create files like this: > > $rsync (..) > $filefrag -ek linux-4.4.4.tar.bz2 > Filesystem type is: 9123683e > File size of linux-4.4.4.tar.bz2 is 105008928 (102548 blocks of 1024 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 4095: 227197920.. 227202015: 4096: > 1: 4096.. 25599: 227202016.. 227223519: 21504: > 2: 25600.. 51199: 227271164.. 227296763: 25600: 227223520: > 3: 51200.. 76799: 227296764.. 227322363: 25600: > 4: 76800.. 102547: 227322364.. 227348111: 25748: last,eof > linux-4.4.4.tar.bz2: 2 extents found > > Which looks exactly as one would expect, probably - as Chris' mail > just explained - it doesn't use O_APPEND, whereas wget apparently does. Interesting, my strace log shows wget doesn't open the file with O_APPEND. open("linux-4.5-rc6.tar.xz", O_WRONLY|O_CREAT|O_EXCL, 0666) = 4 Thanks, -liubo > > > I'd be somewhat curious to see if something similar happens on other > > filesystems with such low writeback timeouts. My thought in this > > case is that the issue is that BTRFS's allocator isn't smart enough > > to try and merge new extents into existing ones when possible. > > ext4 creates 1-2 extents, regardless of method. > > Holger > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Stray 4k extents with slow buffered writes 2016-03-03 22:13 ` Liu Bo @ 2016-03-04 1:37 ` Liu Bo 0 siblings, 0 replies; 9+ messages in thread From: Liu Bo @ 2016-03-04 1:37 UTC (permalink / raw) To: Holger Hoffstätte; +Cc: Austin S. Hemmelgarn, linux-btrfs, Chris Mason On Thu, Mar 03, 2016 at 02:13:09PM -0800, Liu Bo wrote: > On Thu, Mar 03, 2016 at 10:50:58PM +0100, Holger Hoffstätte wrote: > > On 03/03/16 21:47, Austin S. Hemmelgarn wrote: > > >> $mount | grep sdf > > >> /dev/sdf1 on /mnt/usb type btrfs (rw,relatime,space_cache=v2,subvolid=5,subvol=/) > > > Do you still see the same behavior with the old space_cache format? > > > This appears to be an issue of space management and allocation, so > > > this may be playing a part. > > > > I just did the clear_cache,space_cache=v1 dance. Now a download with > > bandwidth-limit=1M, dirty_expire=20s, commit=30 and *no* autodefrag > > first ended up looking like this: > > > > $filefrag -ek linux-4.5-rc6.tar.xz > > Filesystem type is: 9123683e > > File size of linux-4.5-rc6.tar.xz is 88362576 (86292 blocks of 1024 bytes) > > ext: logical_offset: physical_offset: length: expected: flags: > > 0: 0.. 7427: 227197920.. 227205347: 7428: > > 1: 7428.. 33027: 227205348.. 227230947: 25600: > > 2: 33028.. 53011: 227271164.. 227291147: 19984: 227230948: > > 3: 53012.. 72995: 227291148.. 227311131: 19984: > > 4: 72996.. 86291: 227311132.. 227324427: 13296: last,eof > > linux-4.5-rc6.tar.xz: 2 extents found > > > > Yay! But wait, there's more! > > > > $sync > > $filefrag -ek linux-4.5-rc6.tar.xz > > Filesystem type is: 9123683e > > File size of linux-4.5-rc6.tar.xz is 88362576 (86292 blocks of 1024 bytes) > > ext: logical_offset: physical_offset: length: expected: flags: > > 0: 0.. 7423: 227197920.. 227205343: 7424: > > 1: 7424.. 7427: 227169600.. 227169603: 4: 227205344: > > 2: 7428.. 33023: 227205348.. 227230943: 25596: 227169604: > > 3: 33024.. 33027: 227169604.. 227169607: 4: 227230944: > > 4: 33028.. 53007: 227271164.. 227291143: 19980: 227169608: > > 5: 53008.. 53011: 227230948.. 227230951: 4: 227291144: > > 6: 53012.. 72991: 227291148.. 227311127: 19980: 227230952: > > 7: 72992.. 72995: 227230952.. 227230955: 4: 227311128: > > 8: 72996.. 86291: 227311132.. 227324427: 13296: 227230956: last,eof > > linux-4.5-rc6.tar.xz: 9 extents found > > > > Now I'm like ¯\(ツ)/¯ > > Yeah, after sync, I also get this file layout. OK...I think I've found why we get this weird layout, it's because btrfs applies COW for overwrites while ext4 just updates it in place. Here is my filefrag output after sync, # !filefrag filefrag -vb /mnt/btrfs/linux-4.5-rc6.tar.xz Filesystem type is: 9123683e File size of /mnt/btrfs/linux-4.5-rc6.tar.xz is 88362576 (86292 blocks, blocksize 1024) ext logical physical expected length flags 0 0 12352 5020 1 5020 17376 17372 4 2 5024 133504 17380 30908 3 35932 195296 164412 4 4 35936 164416 195300 30876 5 66812 195300 195292 19480 eof /mnt/btrfs/linux-4.5-rc6.tar.xz: 6 extents found And the output of btrfs_dirty_pages, I grep for the first 4k single extent, # trace-cmd report -i /tmp/trace.dat | grep "dirty_page" | grep $((5020 << 10)) -A 2 -B 2 wget-29482 [003] 783746.039682: bprint: btrfs_dirty_pages: page start 5124096 end 5132287 wget-29482 [003] 783746.039771: bprint: btrfs_dirty_pages: page start 5128192 end 5144575 wget-29482 [003] 783746.263238: bprint: btrfs_dirty_pages: page start 5140480 end 5148671 wget-29482 [003] 783746.263304: bprint: btrfs_dirty_pages: page start 5144576 end 5160959 wget-29482 [003] 783746.263546: bprint: btrfs_dirty_pages: page start 5156864 end 5165055 So it turns out to be that wget writes the data as an overlapped way, extent [5140480, 4096) is written twice, and the second write to the extent can trigger a COW write when the first write to the extent has finish the endio. With mount -onodatacow, # !filefrag filefrag -vb /mnt/btrfs/linux-4.5-rc6.tar.xz Filesystem type is: 9123683e File size of /mnt/btrfs/linux-4.5-rc6.tar.xz is 88362576 (86292 blocks, blocksize 1024) ext logical physical expected length flags 0 0 12416 5292 1 5292 133504 17708 35872 2 41164 169376 30880 3 72044 200256 14248 eof /mnt/btrfs/linux-4.5-rc6.tar.xz: 2 extents found Anyway it's not due to any btrfs allocator bug (although I was thinking it was and trying to find it out...). Thanks, -liubo > > > > > With autodefrag the same happens, though it then eventually does the > > merging from 4k -> 256k. I went searching for that hardcoded 256k value > > and found it as default in ioctl.c:btrfs_defrag_file() when no threshold > > has been passed, as is the case for autodefrag. I'll try to increase that > > and see how much I can destroy. > > > > Also, rsync with --bwlimit=1m does _not_ seem to create files like this: > > > > $rsync (..) > > $filefrag -ek linux-4.4.4.tar.bz2 > > Filesystem type is: 9123683e > > File size of linux-4.4.4.tar.bz2 is 105008928 (102548 blocks of 1024 bytes) > > ext: logical_offset: physical_offset: length: expected: flags: > > 0: 0.. 4095: 227197920.. 227202015: 4096: > > 1: 4096.. 25599: 227202016.. 227223519: 21504: > > 2: 25600.. 51199: 227271164.. 227296763: 25600: 227223520: > > 3: 51200.. 76799: 227296764.. 227322363: 25600: > > 4: 76800.. 102547: 227322364.. 227348111: 25748: last,eof > > linux-4.4.4.tar.bz2: 2 extents found > > > > Which looks exactly as one would expect, probably - as Chris' mail > > just explained - it doesn't use O_APPEND, whereas wget apparently does. > > Interesting, my strace log shows wget doesn't open the file with O_APPEND. > > open("linux-4.5-rc6.tar.xz", O_WRONLY|O_CREAT|O_EXCL, 0666) = 4 > > Thanks, > > -liubo > > > > > > I'd be somewhat curious to see if something similar happens on other > > > filesystems with such low writeback timeouts. My thought in this > > > case is that the issue is that BTRFS's allocator isn't smart enough > > > to try and merge new extents into existing ones when possible. > > > > ext4 creates 1-2 extents, regardless of method. > > > > Holger > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Stray 4k extents with slow buffered writes 2016-03-03 19:53 ` Holger Hoffstätte 2016-03-03 20:47 ` Austin S. Hemmelgarn @ 2016-03-04 12:17 ` Duncan 1 sibling, 0 replies; 9+ messages in thread From: Duncan @ 2016-03-04 12:17 UTC (permalink / raw) To: linux-btrfs Holger Hoffstätte posted on Thu, 03 Mar 2016 20:53:57 +0100 as excerpted: > It changes things because you likely have a higher value set for > vm/dirty_expire_centisecs or dirty_bytes explicitly configured; I have > it set to 1000 (10s) to prevent large writebacks from choking > everything. > The default is probably still 30s aka 3000. [This is a comment on write-cache management strategy in the above context, not on the buffered write 4k extents issue.] Interesting strategy. I left the expire (aka foreground priority) timeout at its default 30 seconds, here, and actually upped the writeback (background priority) timeout to 10 seconds, from its default 5. I don't like large writebacks choking things either, but I chose to use the size knobs (after all, the descriptor at issue here is "large", not "old") to deal with that, not the time knobs. As a result, I set the background ratio to 1% (it defaults to 5%), and foreground ratio to 3% (it defaults to 10%). With 16 gig of RAM, that's ~160 MiB background, about five seconds worth at the 30 MiB/sec random write that's a reasonable if perhaps a bit pessimistic estimate for spinning rust, ~ half a GiB foreground (15 seconds worth). Tho obviously as I'm at 1%, I'd have to switch to the bytes instead of ratio knobs if I were to up my memory or want to go lower or wanted more precise knobs. It just seems to me more logical to fiddle the size knob if I'm worried about the /size/ of the writeback, and with size now dealt with and off the table, I decided doubling the low priority background writeback timeout to 10 seconds was equally logical, giving writes additional time to accumulate, for better write combining and elevator ordering, before actually starting low priority writeback, since if sizes started ballooning the lower size thresholds would trigger writeback anyway. And with size already dealt with, the higher priority foreground expiry default of 30 seconds was fine, so I didn't change that at all. And, it seems to me, that might be part of your problem, since you're forcing high priority writeback at only 10 seconds, even if it's under the 4K block size, thus forcing COW rewrites when more data is appended to those blocks, with COW predictably playing havoc with your extents, now that you're forcing sub-4K writes to storage at high priority so fast they have little chance to be consolidated into whole 4K block writes. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Stray 4k extents with slow buffered writes 2016-03-03 12:28 Stray 4k extents with slow buffered writes Holger Hoffstätte 2016-03-03 18:33 ` Liu Bo @ 2016-03-03 20:55 ` Chris Mason 1 sibling, 0 replies; 9+ messages in thread From: Chris Mason @ 2016-03-03 20:55 UTC (permalink / raw) To: Holger Hoffstätte; +Cc: linux-btrfs On Thu, Mar 03, 2016 at 01:28:29PM +0100, Holger Hoffstätte wrote: > > Here's an observation that is not a bug (as in data corruption), just > somewhat odd and unnecessary behaviour. It could be considered a > performance or scalability bug. > > I've noticed that slow slow buffered writes create a huge number of > unnecessary 4k sized extents. At first I wrote it off as odd buffering > behaviour of the application (a download manager), but it can be easily > reproduced. For example: We saw this here with slowly appending log files. The basic problem is the VM triggers dirty writeback on the tail end of the file and either starts on the last page or our clustering code pulls in the last page. It leads to latencies because we start writing the last page in the file to disk and the application has to wait for the IO to finish before it can append to the file again. I'll get our patches in for the next merge window, it's basically just: don't write the last incomplete page in the file if we're O_APPEND and it isn't a data integrity writeback. We may want to rework it to drop the O_APPEND check. -chris ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2016-03-04 12:17 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2016-03-03 12:28 Stray 4k extents with slow buffered writes Holger Hoffstätte 2016-03-03 18:33 ` Liu Bo 2016-03-03 19:53 ` Holger Hoffstätte 2016-03-03 20:47 ` Austin S. Hemmelgarn 2016-03-03 21:50 ` Holger Hoffstätte 2016-03-03 22:13 ` Liu Bo 2016-03-04 1:37 ` Liu Bo 2016-03-04 12:17 ` Duncan 2016-03-03 20:55 ` Chris Mason
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).