* is BTRFS_IOC_DEFRAG behavior optimal?

From: Chris Murphy @ 2021-02-07 22:06 UTC
To: Btrfs BTRFS

systemd-journald journals on Btrfs default to nodatacow; upon log
rotation a journal is submitted for defragmenting with BTRFS_IOC_DEFRAG.
The result looks curious, and I can't tell what the logic is from the
results.

The journal file starts out fallocated with a size of 8MB, and as it
grows it is extended in 8MB increments, also fallocated. This leads to a
filefrag -v that looks like this (ext4 and btrfs nodatacow follow the
same behavior; both are provided for reference):

ext4
https://pastebin.com/6vuufwXt

btrfs
https://pastebin.com/Y18B2m4h

Following defragment with BTRFS_IOC_DEFRAG it looks like this:
https://pastebin.com/1ufErVMs

At first glance it appears significantly more fragmented. Closer
inspection shows that most of the extents weren't relocated. But what's
up with the peculiar interleaving? Is this an improvement over the
original allocation?

If I unwind the interleaving, it looks like all the extents fall into
two localities, and within each locality the extents aren't that far
apart - so my guess is that this file is also not meaningfully
fragmented in practice. Surely the drive firmware will reorder the reads
to arrive at the fewest seeks?

-- 
Chris Murphy
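[Editor's note: the "two localities" intuition above can be made concrete. The sketch below is purely illustrative - it is not anything btrfs or journald does - and the extent tuples are hypothetical numbers in the spirit of the pasted filefrag output: group extents by physical proximity and count the clusters.]

```python
def localities(extents, gap_blocks=1_000_000):
    """Group (logical, physical, length) extents into clusters of
    physically nearby blocks; an extent whose physical start lies within
    gap_blocks of the previous cluster's end joins that cluster."""
    clusters = []  # list of [lo, hi] physical block ranges
    for _, phys, length in sorted(extents, key=lambda e: e[1]):
        if clusters and phys - clusters[-1][1] <= gap_blocks:
            clusters[-1][1] = max(clusters[-1][1], phys + length - 1)
        else:
            clusters.append([phys, phys + length - 1])
    return clusters

# Hypothetical extents: most near physical block ~1.6M, one far away
# near ~76M, mirroring the shape of the pastebin listings.
ext = [(0, 1597171, 1), (1, 1601255, 8), (9, 1648394, 232), (86, 76053306, 182)]
print(len(localities(ext)))  # -> 2
```

A file whose extents collapse into a couple of such clusters is "interleaved" on paper but cheap to read in practice, which is the point being argued.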
* Re: is BTRFS_IOC_DEFRAG behavior optimal?

From: Goffredo Baroncelli @ 2021-02-08 22:11 UTC
To: Chris Murphy, Btrfs BTRFS

On 2/7/21 11:06 PM, Chris Murphy wrote:
> systemd-journald journals on Btrfs default to nodatacow, upon log
> rotation it's submitted for defragmenting with BTRFS_IOC_DEFRAG. The
> result looks curious. I can't tell what the logic is from the results.
> [...]
> It appears at first glance to be significantly more fragmented. Closer
> inspection shows that most of the extents weren't relocated. But
> what's up with the peculiar interleaving? Is this an improvement over
> the original allocation?

I am not sure how to read the filefrag output: I see several lines like

[...]
   5: 1691.. 1693: 125477.. 125479: 3:
   6: 1694.. 1694: 125480.. 125480: 1: unwritten
[...]

What does "unwritten" mean? The kernel documentation [*] says:

  * FIEMAP_EXTENT_UNWRITTEN
    Unwritten extent - the extent is allocated but its data has not been
    initialized. This indicates the extent's data will be all zero if read
    through the filesystem but the contents are undefined if read directly
    from the device.

So it seems that the data didn't touch the platters (!)
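[Editor's note: the "unwritten" string filefrag prints is decoded from the per-extent flags word FIEMAP returns. A small sketch, with the bit values copied from include/uapi/linux/fiemap.h (the document cited above); the name selection here is illustrative, roughly what filefrag does.]

```python
# Per-extent flag bits from include/uapi/linux/fiemap.h.
FIEMAP_FLAG_NAMES = {
    0x0001: "last",       # FIEMAP_EXTENT_LAST
    0x0004: "delalloc",   # FIEMAP_EXTENT_DELALLOC
    0x0800: "unwritten",  # FIEMAP_EXTENT_UNWRITTEN: allocated, not initialized
    0x1000: "merged",     # FIEMAP_EXTENT_MERGED
    0x2000: "shared",     # FIEMAP_EXTENT_SHARED
}

def decode_fiemap_flags(flags):
    """Return the names of the known flag bits set in a FIEMAP extent."""
    return [name for bit, name in sorted(FIEMAP_FLAG_NAMES.items()) if flags & bit]

print(decode_fiemap_flags(0x0800))  # the extent in question: ['unwritten']
```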
My educated guess is that there is something strange in the sequence:
- write
- sync
- close log
- move log
- defrag log

Maybe the defrag starts before all the data reaches the platters?

For what it's worth, I created a file with the same fragmentation as yours:

$ sudo filefrag -v data.txt
Filesystem type is: 9123683e
File size of data.txt is 25165824 (6144 blocks of 4096 bytes)
 ext: logical_offset: physical_offset: length: expected: flags:
   0: 0.. 0: 1597171.. 1597171: 1:
   1: 1.. 1599: 163433285.. 163434883: 1599: 1597172:
   2: 1600.. 1607: 1601255.. 1601262: 8: 163434884:
   3: 1608.. 1689: 1604137.. 1604218: 82: 1601263:
   4: 1690.. 1690: 1597484.. 1597484: 1: 1604219:
   5: 1691.. 1693: 1597465.. 1597467: 3: 1597485:
   6: 1694.. 1694: 1597966.. 1597966: 1: 1597468:
   7: 1695.. 1722: 1599557.. 1599584: 28: 1597967:
   8: 1723.. 1723: 1599211.. 1599211: 1: 1599585:
   9: 1724.. 1955: 1648394.. 1648625: 232: 1599212:
  10: 1956.. 1956: 1599695.. 1599695: 1: 1648626:
  11: 1957.. 2047: 1625881.. 1625971: 91: 1599696:
  12: 2048.. 2417: 1648804.. 1649173: 370: 1625972:
  13: 2418.. 2420: 1597468.. 1597470: 3: 1649174:
  14: 2421.. 2478: 1624667.. 1624724: 58: 1597471:
  15: 2479.. 2479: 1596416.. 1596416: 1: 1624725:
  16: 2480.. 2482: 1601045.. 1601047: 3: 1596417:
  17: 2483.. 2483: 1596854.. 1596854: 1: 1601048:
  18: 2484.. 2523: 1602715.. 1602754: 40: 1596855:
  19: 2524.. 2527: 1597471.. 1597474: 4: 1602755:
  20: 2528.. 2598: 1624725.. 1624795: 71: 1597475:
  21: 2599.. 2599: 1596858.. 1596858: 1: 1624796:
  22: 2600.. 2607: 1601263.. 1601270: 8: 1596859:
  23: 2608.. 2608: 1596863.. 1596863: 1: 1601271:
  24: 2609.. 2611: 1601271.. 1601273: 3: 1596864:
  25: 2612.. 2612: 1596864.. 1596864: 1: 1601274:
  26: 2613.. 2615: 1601274.. 1601276: 3: 1596865:
  27: 2616.. 2616: 1596981.. 1596981: 1: 1601277:
  28: 2617.. 2691: 1649174.. 1649248: 75: 1596982:
  29: 2692.. 2696: 1597475.. 1597479: 5: 1649249:
  30: 2697.. 2756: 1634995.. 1635054: 60: 1597480:
  31: 2757.. 2758: 1597480.. 1597481: 2: 1635055:
  32: 2759.. 2762: 1601351.. 1601354: 4: 1597482:
  33: 2763.. 2764: 1597482.. 1597483: 2: 1601355:
  34: 2765.. 2837: 1649249.. 1649321: 73: 1597484:
  35: 2838.. 2838: 1597038.. 1597038: 1: 1649322:
  36: 2839.. 2855: 1601538.. 1601554: 17: 1597039:
  37: 2856.. 2856: 1597045.. 1597045: 1: 1601555:
  38: 2857.. 2904: 1624547.. 1624594: 48: 1597046:
  39: 2905.. 2926: 1600795.. 1600816: 22: 1624595:
  40: 2927.. 2942: 1602034.. 1602049: 16: 1600817:
  41: 2943.. 2963: 1600817.. 1600837: 21: 1602050:
  42: 2964.. 2979: 1602183.. 1602198: 16: 1600838:
  43: 2980.. 3001: 1600927.. 1600948: 22: 1602199:
  44: 3002.. 3043: 1621164.. 1621205: 42: 1600949:
  45: 3044.. 3053: 1599231.. 1599240: 10: 1621206:
  46: 3054.. 3066: 1601952.. 1601964: 13: 1599241:
  47: 3067.. 3067: 1597056.. 1597056: 1: 1601965:
  48: 3068.. 3084: 1602375.. 1602391: 17: 1597057:
  49: 3085.. 3094: 1599290.. 1599299: 10: 1602392:
  50: 3095.. 3096: 1601355.. 1601356: 2: 1599300:
  51: 3097.. 3107: 1600717.. 1600727: 11: 1601357:
  52: 3108.. 3156: 1642892.. 1642940: 49: 1600728:
  53: 3157.. 3157: 1597059.. 1597059: 1: 1642941:
  54: 3158.. 3251: 1649322.. 1649415: 94: 1597060:
  55: 3252.. 3254: 1599241.. 1599243: 3: 1649416:
  56: 3255.. 3304: 1645466.. 1645515: 50: 1599244:
  57: 3305.. 3305: 1597100.. 1597100: 1: 1645516:
  58: 3306.. 3312: 1601357.. 1601363: 7: 1597101:
  59: 3313.. 3319: 1599300.. 1599306: 7: 1601364:
  60: 3320.. 3331: 1601611.. 1601622: 12: 1599307:
  61: 3332.. 3339: 1600838.. 1600845: 8: 1601623:
  62: 3340.. 3343: 1601419.. 1601422: 4: 1600846:
  63: 3344.. 3351: 1600846.. 1600853: 8: 1601423:
  64: 3352.. 3432: 1649416.. 1649496: 81: 1600854:
  65: 3433.. 3433: 1597109.. 1597109: 1: 1649497:
  66: 3434.. 3489: 1649497.. 1649552: 56: 1597110:
  67: 3490.. 3491: 1599227.. 1599228: 2: 1649553:
  68: 3492.. 3521: 1619348.. 1619377: 30: 1599229:
  69: 3522.. 3523: 1599307.. 1599308: 2: 1619378:
  70: 3524.. 3530: 1601688.. 1601694: 7: 1599309:
  71: 3531.. 3539: 1600949.. 1600957: 9: 1601695:
  72: 3540.. 3579: 1629356.. 1629395: 40: 1600958:
  73: 3580.. 3580: 1597124.. 1597124: 1: 1629396:
  74: 3581.. 3601: 1604219.. 1604239: 21: 1597125:
  75: 3602.. 3603: 1599585.. 1599586: 2: 1604240:
  76: 3604.. 3614: 1602636.. 1602646: 11: 1599587:
  77: 3615.. 3616: 1599587.. 1599588: 2: 1602647:
  78: 3617.. 3677: 1649553.. 1649613: 61: 1599589:
  79: 3678.. 3680: 1599692.. 1599694: 3: 1649614:
  80: 3681.. 3723: 1647818.. 1647860: 43: 1599695:
  81: 3724.. 3726: 1599821.. 1599823: 3: 1647861:
  82: 3727.. 3756: 1622218.. 1622247: 30: 1599824:
  83: 3757.. 3759: 1600630.. 1600632: 3: 1622248:
  84: 3760.. 3766: 1603288.. 1603294: 7: 1600633:
  85: 3767.. 3768: 1600633.. 1600634: 2: 1603295:
  86: 3769.. 3950: 76053306.. 76053487: 182: 1600635:
  87: 3951.. 3958: 1600958.. 1600965: 8: 76053488:
  88: 3959.. 3986: 1619921.. 1619948: 28: 1600966:
  89: 3987.. 3995: 1600966.. 1600974: 9: 1619949:
  90: 3996.. 4036: 1649614.. 1649654: 41: 1600975:
  91: 4037.. 4045: 1600975.. 1600983: 9: 1649655:
  92: 4046.. 4050: 1601423.. 1601427: 5: 1600984:
  93: 4051.. 4052: 1600854.. 1600855: 2: 1601428:
  94: 4053.. 4055: 1601555.. 1601557: 3: 1600856:
  95: 4056.. 4056: 1597129.. 1597129: 1: 1601558:
  96: 4057.. 4059: 1601745.. 1601747: 3: 1597130:
  97: 4060.. 4060: 1597134.. 1597134: 1: 1601748:
  98: 4061.. 4063: 1602050.. 1602052: 3: 1597135:
  99: 4064.. 4064: 1597137.. 1597137: 1: 1602053:
 100: 4065.. 4079: 1604297.. 1604311: 15: 1597138:
 101: 4080.. 4088: 1600987.. 1600995: 9: 1604312:
 102: 4089.. 4095: 1603295.. 1603301: 7: 1600996:
 103: 4096.. 4106: 1600996.. 1601006: 11: 1603302:
 104: 4107.. 4117: 1622600.. 1622610: 11: 1601007:
 105: 4118.. 4119: 1601007.. 1601008: 2: 1622611:
 106: 4120.. 4129: 1622611.. 1622620: 10: 1601009:
 107: 4130.. 4131: 1601009.. 1601010: 2: 1622621:
 108: 4132.. 4141: 1622621.. 1622630: 10: 1601011:
 109: 4142.. 4145: 1601011.. 1601014: 4: 1622631:
 110: 4146.. 4155: 1622986.. 1622995: 10: 1601015:
 111: 4156.. 4157: 1601015.. 1601016: 2: 1622996:
 112: 4158.. 4168: 1622996.. 1623006: 11: 1601017:
 113: 4169.. 4170: 1601017.. 1601018: 2: 1623007:
 114: 4171.. 4180: 1623007.. 1623016: 10: 1601019:
 115: 4181.. 4182: 1601019.. 1601020: 2: 1623017:
 116: 4183.. 4192: 1624473.. 1624482: 10: 1601021:
 117: 4193.. 4195: 1601021.. 1601023: 3: 1624483:
 118: 4196.. 4205: 1624796.. 1624805: 10: 1601024:
 119: 4206.. 4207: 1601024.. 1601025: 2: 1624806:
 120: 4208.. 4217: 1624806.. 1624815: 10: 1601026:
 121: 4218.. 4220: 1601026.. 1601028: 3: 1624816:
 122: 4221.. 4230: 1625972.. 1625981: 10: 1601029:
 123: 4231.. 4408: 1648626.. 1648803: 178: 1625982:
 124: 4409.. 4411: 1602199.. 1602201: 3: 1648804:
 125: 4412.. 4434: 1601328.. 1601350: 23: 1602202:
 126: 4435.. 4437: 1602647.. 1602649: 3: 1601351:
 127: 4438.. 4439: 1601029.. 1601030: 2: 1602650:
 128: 4440.. 4442: 1602755.. 1602757: 3: 1601031:
 129: 4443.. 4480: 1601650.. 1601687: 38: 1602758:
 130: 4481.. 4491: 1629530.. 1629540: 11: 1601688:
 131: 4492.. 4560: 1624404.. 1624472: 69: 1629541:
 132: 4561.. 4571: 1629541.. 1629551: 11: 1624473:
 133: 4572.. 4582: 1601031.. 1601041: 11: 1629552:
 134: 4583.. 4586: 1603302.. 1603305: 4: 1601042:
 135: 4587.. 4620: 1602537.. 1602570: 34: 1603306:
 136: 4621.. 4631: 1629716.. 1629726: 11: 1602571:
 137: 4632.. 4634: 1601042.. 1601044: 3: 1629727:
 138: 4635.. 6143: 156004864.. 156006372: 1509: 1601045: last,eof
data.txt: 139 extents found

Then I tried to defrag it:

$ btrfs fi defrag data.txt
$ sudo filefrag -v data.txt
Filesystem type is: 9123683e
File size of data.txt is 25165824 (6144 blocks of 4096 bytes)
 ext: logical_offset: physical_offset: length: expected: flags:
   0: 0.. 6143: 164002967.. 164009110: 6144: last,eof
data.txt: 1 extent found

So it seems that the defrag works.

[*] https://www.kernel.org/doc/Documentation/filesystems/fiemap.txt

> https://pastebin.com/1ufErVMs
>
> If I unwind the interleaving, it looks like all the extents fall into
> two localities and within each locality the extents aren't that far
> apart - so my guess is that this file is also not meaningfully
> fragmented, in practice. Surely the drive firmware will reorder the
> reads to arrive at the least amount of seeks?

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
* Re: is BTRFS_IOC_DEFRAG behavior optimal?

From: Zygo Blaxell @ 2021-02-08 22:21 UTC
To: kreijack; +Cc: Chris Murphy, Btrfs BTRFS

On Mon, Feb 08, 2021 at 11:11:47PM +0100, Goffredo Baroncelli wrote:
> On 2/7/21 11:06 PM, Chris Murphy wrote:
> > systemd-journald journals on Btrfs default to nodatacow, upon log
> > rotation it's submitted for defragmenting with BTRFS_IOC_DEFRAG. The
> > result looks curious. I can't tell what the logic is from the results.
> > [...]
>
> I am not sure how to read the filefrag output: I see several lines like
> [...]
>    5: 1691.. 1693: 125477.. 125479: 3:
>    6: 1694.. 1694: 125480.. 125480: 1: unwritten
> [...]
>
> What does "unwritten" mean? The kernel documentation [*] says:
> [...]
> So it seems that the data didn't touch the platters (!)
> My educated guess is that there is something strange in the sequence:
> - write
> - sync
> - close log
> - move log
> - defrag log
>
> Maybe the defrag starts before all the data reaches the platters?

defrag will put the file's contents back into delalloc, and it won't be
allocated until a flush (fsync, sync, or commit interval). Defrag is
roughly equivalent to simply copying the data to a new file in btrfs,
except the logical extents are atomically updated to point to the new
location.

FIEMAP has an option flag to sync the data before returning a map.
DEFRAG has an option to start IO immediately, so it will presumably be
done by the time you look at the extents with FIEMAP.

> For what it's worth, I created a file with the same fragmentation as yours:
>
> $ sudo filefrag -v data.txt
> Filesystem type is: 9123683e
> File size of data.txt is 25165824 (6144 blocks of 4096 bytes)
> [... 139-extent listing trimmed ...]
> data.txt: 139 extents found
>
> Then I tried to defrag it:
>
> $ btrfs fi defrag data.txt
> $ sudo filefrag -v data.txt
> Filesystem type is: 9123683e
> File size of data.txt is 25165824 (6144 blocks of 4096 bytes)
>  ext: logical_offset: physical_offset: length: expected: flags:
>    0: 0.. 6143: 164002967.. 164009110: 6144: last,eof
> data.txt: 1 extent found
>
> So it seems that the defrag works.

Be very careful how you set up this test case. If you use fallocate on
a file, it has a _permanent_ effect on the inode, and alters a lot of
normal btrfs behavior downstream. You won't see these effects if you
just write some data to a file without using prealloc.

> [*] https://www.kernel.org/doc/Documentation/filesystems/fiemap.txt
> [...]
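[Editor's note: the fallocate caveat is easy to see from userspace. A minimal illustration using plain POSIX preallocation (not btrfs-specific): the preallocated range has a real on-disk size but reads back as zeros, which is the "allocated but not initialized" state FIEMAP reports as unwritten.]

```python
import os
import tempfile

# Preallocate an 8 MiB file, like journald's growth unit, then show
# that the unwritten space reads back as zeros through the filesystem.
fd, path = tempfile.mkstemp()
try:
    os.posix_fallocate(fd, 0, 8 * 1024 * 1024)
    assert os.fstat(fd).st_size == 8 * 1024 * 1024  # size is real...
    assert os.pread(fd, 4096, 0) == b"\0" * 4096    # ...but data is all zeros
finally:
    os.close(fd)
    os.unlink(path)
print("ok")
```

On btrfs, `filefrag -v` on such a file would show the still-untouched ranges with the "unwritten" flag until data actually lands in them.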
* Re: is BTRFS_IOC_DEFRAG behavior optimal?

From: Chris Murphy @ 2021-02-09 1:05 UTC
To: Zygo Blaxell; +Cc: Goffredo Baroncelli, Chris Murphy, Btrfs BTRFS

On Mon, Feb 8, 2021 at 3:21 PM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
> defrag will put the file's contents back into delalloc, and it won't be
> allocated until a flush (fsync, sync, or commit interval). Defrag is
> roughly equivalent to simply copying the data to a new file in btrfs,
> except the logical extents are atomically updated to point to the new
> location.

BTRFS_IOC_DEFRAG results:
https://pastebin.com/1ufErVMs

BTRFS_IOC_DEFRAG_RANGE results:
https://pastebin.com/429fZmNB

They're different. Questions: is this a bug? Is it intentional? Does the
interleaved BTRFS_IOC_DEFRAG version improve things over the
non-defragmented file, which had only three 8MB extents for a 24MB file,
plus one 4KiB block? Should BTRFS_IOC_DEFRAG be capable of estimating
fragmentation and just doing a no-op in that case?

> FIEMAP has an option flag to sync the data before returning a map.
> DEFRAG has an option to start IO immediately so it will presumably be
> done by the time you look at the extents with FIEMAP.

I waited for the defrag result to settle, so the results I've posted are
stable.

> Be very careful how you set up this test case. If you use fallocate on
> a file, it has a _permanent_ effect on the inode, and alters a lot of
> normal btrfs behavior downstream. You won't see these effects if you
> just write some data to a file without using prealloc.

OK. That might answer the idempotency question. Following
BTRFS_IOC_DEFRAG, most unwritten extents are no longer present. I can't
figure out the pattern. Some of the archived journals have them, others
have one, but none have the four or more that I see in journals in
active use.

And then when defragged with BTRFS_IOC_DEFRAG_RANGE, none of those have
unwritten extents. Since the file is changing each time it goes through
the ioctl, it makes sense that what comes out the back end is different.

While BTRFS_IOC_DEFRAG_RANGE is a no-op if an extent is bigger than the
-l (len=) value, I can't tell that BTRFS_IOC_DEFRAG has any sort of
no-op unless there are no fragments at all *shrug*.

Maybe they should use BTRFS_IOC_DEFRAG_RANGE and specify an 8MB extent?
Because in the nodatacow case, that's what they already have, and it'd
be a no-op. And then for the datacow case... well, I don't like
unconditional write amplification on SSDs just to satisfy the HDD case.
But it'd be avoidable by just using the default (nodatacow for the
journals).

-- 
Chris Murphy
* Re: is BTRFS_IOC_DEFRAG behavior optimal?

From: Chris Murphy @ 2021-02-09 0:42 UTC
To: Goffredo Baroncelli; +Cc: Chris Murphy, Btrfs BTRFS

On Mon, Feb 8, 2021 at 3:11 PM Goffredo Baroncelli <kreijack@libero.it> wrote:
>
> On 2/7/21 11:06 PM, Chris Murphy wrote:
> > systemd-journald journals on Btrfs default to nodatacow, upon log
> > rotation it's submitted for defragmenting with BTRFS_IOC_DEFRAG. The
> > result looks curious. I can't tell what the logic is from the results.
> > [...]
>
> I am not sure how to read the filefrag output: I see several lines like
> [...]
>    5: 1691.. 1693: 125477.. 125479: 3:
>    6: 1694.. 1694: 125480.. 125480: 1: unwritten
> [...]
>
> What does "unwritten" mean? The kernel documentation [*] says:

My understanding is that it's an extent that's been fallocated but not
yet written to. What I don't know is whether such extents are possibly
tripping up BTRFS_IOC_DEFRAG.
I'm not skilled enough to create a bunch of these journal logs quickly (I'd have to just let a system run and age its own journals, which sucks, it takes forever) and then a small program that runs the same file through BTRFS_IOC_DEFRAG twice to see if it's idempotent. The resulting file after one submission does not have unwritten extents. Another thing I'm not sure of is whether ssd vs nossd affects the defrag results. Or datacow versus nodatacow. Another thing I'm not sure of is if autodefrag is a better solution to the problem. Whereby it acts as a no op when the file is nodatacow, and does the expected thing if it's datacow. But then we'd need an autodefrag xattr to set on the enclosing directory for these journals because there's no reliable way to set autodefrag mount option globally, not knowing all the work loads. It can make some workloads worse. > My educate guess is that there is something strange in the sequence: > - write > - sync > - close log > - move log > - defrag log > > May be the defrag starts before all the data reach the platters ? Perhaps. Attach strace to journald before --rotate, and then --rotate https://pastebin.com/UGihfCG9 > > For what matters, I create a file with the same fragmentation like your one > > $ sudo filefrag -v data.txt > Filesystem type is: 9123683e > File size of data.txt is 25165824 (6144 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 0: 1597171.. 1597171: 1: > 1: 1.. 1599: 163433285.. 163434883: 1599: 1597172: > 2: 1600.. 1607: 1601255.. 1601262: 8: 163434884: > 3: 1608.. 1689: 1604137.. 1604218: 82: 1601263: > 4: 1690.. 1690: 1597484.. 1597484: 1: 1604219: > 5: 1691.. 1693: 1597465.. 1597467: 3: 1597485: > 6: 1694.. 1694: 1597966.. 1597966: 1: 1597468: > 7: 1695.. 1722: 1599557.. 1599584: 28: 1597967: > 8: 1723.. 1723: 1599211.. 1599211: 1: 1599585: > 9: 1724.. 1955: 1648394.. 1648625: 232: 1599212: > 10: 1956.. 1956: 1599695.. 1599695: 1: 1648626: > 11: 1957.. 
2047: 1625881.. 1625971: 91: 1599696: > 12: 2048.. 2417: 1648804.. 1649173: 370: 1625972: > 13: 2418.. 2420: 1597468.. 1597470: 3: 1649174: > 14: 2421.. 2478: 1624667.. 1624724: 58: 1597471: > 15: 2479.. 2479: 1596416.. 1596416: 1: 1624725: > 16: 2480.. 2482: 1601045.. 1601047: 3: 1596417: > 17: 2483.. 2483: 1596854.. 1596854: 1: 1601048: > 18: 2484.. 2523: 1602715.. 1602754: 40: 1596855: > 19: 2524.. 2527: 1597471.. 1597474: 4: 1602755: > 20: 2528.. 2598: 1624725.. 1624795: 71: 1597475: > 21: 2599.. 2599: 1596858.. 1596858: 1: 1624796: > 22: 2600.. 2607: 1601263.. 1601270: 8: 1596859: > 23: 2608.. 2608: 1596863.. 1596863: 1: 1601271: > 24: 2609.. 2611: 1601271.. 1601273: 3: 1596864: > 25: 2612.. 2612: 1596864.. 1596864: 1: 1601274: > 26: 2613.. 2615: 1601274.. 1601276: 3: 1596865: > 27: 2616.. 2616: 1596981.. 1596981: 1: 1601277: > 28: 2617.. 2691: 1649174.. 1649248: 75: 1596982: > 29: 2692.. 2696: 1597475.. 1597479: 5: 1649249: > 30: 2697.. 2756: 1634995.. 1635054: 60: 1597480: > 31: 2757.. 2758: 1597480.. 1597481: 2: 1635055: > 32: 2759.. 2762: 1601351.. 1601354: 4: 1597482: > 33: 2763.. 2764: 1597482.. 1597483: 2: 1601355: > 34: 2765.. 2837: 1649249.. 1649321: 73: 1597484: > 35: 2838.. 2838: 1597038.. 1597038: 1: 1649322: > 36: 2839.. 2855: 1601538.. 1601554: 17: 1597039: > 37: 2856.. 2856: 1597045.. 1597045: 1: 1601555: > 38: 2857.. 2904: 1624547.. 1624594: 48: 1597046: > 39: 2905.. 2926: 1600795.. 1600816: 22: 1624595: > 40: 2927.. 2942: 1602034.. 1602049: 16: 1600817: > 41: 2943.. 2963: 1600817.. 1600837: 21: 1602050: > 42: 2964.. 2979: 1602183.. 1602198: 16: 1600838: > 43: 2980.. 3001: 1600927.. 1600948: 22: 1602199: > 44: 3002.. 3043: 1621164.. 1621205: 42: 1600949: > 45: 3044.. 3053: 1599231.. 1599240: 10: 1621206: > 46: 3054.. 3066: 1601952.. 1601964: 13: 1599241: > 47: 3067.. 3067: 1597056.. 1597056: 1: 1601965: > 48: 3068.. 3084: 1602375.. 1602391: 17: 1597057: > 49: 3085.. 3094: 1599290.. 1599299: 10: 1602392: > 50: 3095.. 3096: 1601355.. 
1601356: 2: 1599300: > 51: 3097.. 3107: 1600717.. 1600727: 11: 1601357: > 52: 3108.. 3156: 1642892.. 1642940: 49: 1600728: > 53: 3157.. 3157: 1597059.. 1597059: 1: 1642941: > 54: 3158.. 3251: 1649322.. 1649415: 94: 1597060: > 55: 3252.. 3254: 1599241.. 1599243: 3: 1649416: > 56: 3255.. 3304: 1645466.. 1645515: 50: 1599244: > 57: 3305.. 3305: 1597100.. 1597100: 1: 1645516: > 58: 3306.. 3312: 1601357.. 1601363: 7: 1597101: > 59: 3313.. 3319: 1599300.. 1599306: 7: 1601364: > 60: 3320.. 3331: 1601611.. 1601622: 12: 1599307: > 61: 3332.. 3339: 1600838.. 1600845: 8: 1601623: > 62: 3340.. 3343: 1601419.. 1601422: 4: 1600846: > 63: 3344.. 3351: 1600846.. 1600853: 8: 1601423: > 64: 3352.. 3432: 1649416.. 1649496: 81: 1600854: > 65: 3433.. 3433: 1597109.. 1597109: 1: 1649497: > 66: 3434.. 3489: 1649497.. 1649552: 56: 1597110: > 67: 3490.. 3491: 1599227.. 1599228: 2: 1649553: > 68: 3492.. 3521: 1619348.. 1619377: 30: 1599229: > 69: 3522.. 3523: 1599307.. 1599308: 2: 1619378: > 70: 3524.. 3530: 1601688.. 1601694: 7: 1599309: > 71: 3531.. 3539: 1600949.. 1600957: 9: 1601695: > 72: 3540.. 3579: 1629356.. 1629395: 40: 1600958: > 73: 3580.. 3580: 1597124.. 1597124: 1: 1629396: > 74: 3581.. 3601: 1604219.. 1604239: 21: 1597125: > 75: 3602.. 3603: 1599585.. 1599586: 2: 1604240: > 76: 3604.. 3614: 1602636.. 1602646: 11: 1599587: > 77: 3615.. 3616: 1599587.. 1599588: 2: 1602647: > 78: 3617.. 3677: 1649553.. 1649613: 61: 1599589: > 79: 3678.. 3680: 1599692.. 1599694: 3: 1649614: > 80: 3681.. 3723: 1647818.. 1647860: 43: 1599695: > 81: 3724.. 3726: 1599821.. 1599823: 3: 1647861: > 82: 3727.. 3756: 1622218.. 1622247: 30: 1599824: > 83: 3757.. 3759: 1600630.. 1600632: 3: 1622248: > 84: 3760.. 3766: 1603288.. 1603294: 7: 1600633: > 85: 3767.. 3768: 1600633.. 1600634: 2: 1603295: > 86: 3769.. 3950: 76053306.. 76053487: 182: 1600635: > 87: 3951.. 3958: 1600958.. 1600965: 8: 76053488: > 88: 3959.. 3986: 1619921.. 1619948: 28: 1600966: > 89: 3987.. 3995: 1600966.. 
1600974: 9: 1619949: > 90: 3996.. 4036: 1649614.. 1649654: 41: 1600975: > 91: 4037.. 4045: 1600975.. 1600983: 9: 1649655: > 92: 4046.. 4050: 1601423.. 1601427: 5: 1600984: > 93: 4051.. 4052: 1600854.. 1600855: 2: 1601428: > 94: 4053.. 4055: 1601555.. 1601557: 3: 1600856: > 95: 4056.. 4056: 1597129.. 1597129: 1: 1601558: > 96: 4057.. 4059: 1601745.. 1601747: 3: 1597130: > 97: 4060.. 4060: 1597134.. 1597134: 1: 1601748: > 98: 4061.. 4063: 1602050.. 1602052: 3: 1597135: > 99: 4064.. 4064: 1597137.. 1597137: 1: 1602053: > 100: 4065.. 4079: 1604297.. 1604311: 15: 1597138: > 101: 4080.. 4088: 1600987.. 1600995: 9: 1604312: > 102: 4089.. 4095: 1603295.. 1603301: 7: 1600996: > 103: 4096.. 4106: 1600996.. 1601006: 11: 1603302: > 104: 4107.. 4117: 1622600.. 1622610: 11: 1601007: > 105: 4118.. 4119: 1601007.. 1601008: 2: 1622611: > 106: 4120.. 4129: 1622611.. 1622620: 10: 1601009: > 107: 4130.. 4131: 1601009.. 1601010: 2: 1622621: > 108: 4132.. 4141: 1622621.. 1622630: 10: 1601011: > 109: 4142.. 4145: 1601011.. 1601014: 4: 1622631: > 110: 4146.. 4155: 1622986.. 1622995: 10: 1601015: > 111: 4156.. 4157: 1601015.. 1601016: 2: 1622996: > 112: 4158.. 4168: 1622996.. 1623006: 11: 1601017: > 113: 4169.. 4170: 1601017.. 1601018: 2: 1623007: > 114: 4171.. 4180: 1623007.. 1623016: 10: 1601019: > 115: 4181.. 4182: 1601019.. 1601020: 2: 1623017: > 116: 4183.. 4192: 1624473.. 1624482: 10: 1601021: > 117: 4193.. 4195: 1601021.. 1601023: 3: 1624483: > 118: 4196.. 4205: 1624796.. 1624805: 10: 1601024: > 119: 4206.. 4207: 1601024.. 1601025: 2: 1624806: > 120: 4208.. 4217: 1624806.. 1624815: 10: 1601026: > 121: 4218.. 4220: 1601026.. 1601028: 3: 1624816: > 122: 4221.. 4230: 1625972.. 1625981: 10: 1601029: > 123: 4231.. 4408: 1648626.. 1648803: 178: 1625982: > 124: 4409.. 4411: 1602199.. 1602201: 3: 1648804: > 125: 4412.. 4434: 1601328.. 1601350: 23: 1602202: > 126: 4435.. 4437: 1602647.. 1602649: 3: 1601351: > 127: 4438.. 4439: 1601029.. 1601030: 2: 1602650: > 128: 4440.. 4442: 1602755.. 
1602757: 3: 1601031: > 129: 4443.. 4480: 1601650.. 1601687: 38: 1602758: > 130: 4481.. 4491: 1629530.. 1629540: 11: 1601688: > 131: 4492.. 4560: 1624404.. 1624472: 69: 1629541: > 132: 4561.. 4571: 1629541.. 1629551: 11: 1624473: > 133: 4572.. 4582: 1601031.. 1601041: 11: 1629552: > 134: 4583.. 4586: 1603302.. 1603305: 4: 1601042: > 135: 4587.. 4620: 1602537.. 1602570: 34: 1603306: > 136: 4621.. 4631: 1629716.. 1629726: 11: 1602571: > 137: 4632.. 4634: 1601042.. 1601044: 3: 1629727: > 138: 4635.. 6143: 156004864.. 156006372: 1509: 1601045: last,eof > data.txt: 139 extents found > > the I tried to defrag it > > $ btrfs fi defra data.txt > $ sudo filefrag -v data.txt > Filesystem type is: 9123683e > File size of data.txt is 25165824 (6144 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 6143: 164002967.. 164009110: 6144: last,eof > data.txt: 1 extent found > > So it seems that the defrag works I get different results between BTRFS_IOC_DEFRAG which is what systemd-journald uses, and BTRFS_IOC_DEFRAG_RANGE which is what 'btrfs fi defrag' is using with a default len of 32M. Another question about BTRFS_IOC_DEFRAG is if it's intended to be minimalist? Does it have a way to estimate fragmentation and just not do anything? Because the journald nodatacow journals are not meaningfully fragmented. They are the same on ext4 and on Btrfs - it's (so far) always 8MB extents, directly related to each fallocate grow of the journal file. This kind of faux-fragmentation I think is minor even on a HDD because it's the same on ext4 and XFS and no one complains there (as far as I'm aware). -- Chris Murphy ^ permalink raw reply [flat|nested] 19+ messages in thread
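For reference, the two ioctls compared above already differ at the ABI level: BTRFS_IOC_DEFRAG takes no arguments (a NULL vol_args pointer), while BTRFS_IOC_DEFRAG_RANGE takes a range/threshold struct, which is where btrfs-progs passes its default 32M len. A rough Python sketch of the request numbers and argument packing (struct layouts as I read them from linux/btrfs.h; the helper names are mine, not journald's or btrfs-progs' actual code):

```python
import struct

def _ioc(direction, ioc_type, nr, size):
    # Linux _IOC() encoding: dir (2 bits) | size (14 bits) | type (8 bits) | nr (8 bits)
    return (direction << 30) | (size << 16) | (ioc_type << 8) | nr

_IOC_WRITE = 1
BTRFS_IOCTL_MAGIC = 0x94

# struct btrfs_ioctl_vol_args: __s64 fd + char name[4088] -> 4096 bytes
BTRFS_IOC_DEFRAG = _ioc(_IOC_WRITE, BTRFS_IOCTL_MAGIC, 2, 4096)

# struct btrfs_ioctl_defrag_range_args:
#   u64 start, u64 len, u64 flags, u32 extent_thresh,
#   u32 compress_type, u32 unused[4] -> 48 bytes
BTRFS_IOC_DEFRAG_RANGE = _ioc(_IOC_WRITE, BTRFS_IOCTL_MAGIC, 16, 48)

def defrag_range_args(start=0, length=32 * 1024 * 1024, extent_thresh=0):
    """Pack defrag_range args; length=32M mirrors the btrfs-progs default."""
    return struct.pack("=QQQII16x", start, length, 0, extent_thresh, 0)

if __name__ == "__main__":
    print(hex(BTRFS_IOC_DEFRAG))        # 0x50009402
    print(hex(BTRFS_IOC_DEFRAG_RANGE))  # 0x40309410
    # On a real btrfs file one would then do something like:
    #   fd = os.open(path, os.O_RDWR)
    #   fcntl.ioctl(fd, BTRFS_IOC_DEFRAG_RANGE, defrag_range_args())
```

So the "knobs" (len, extent_thresh, flags) exist only on the RANGE variant, which may explain why the two ioctls behave differently on the same file.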
* Re: is BTRFS_IOC_DEFRAG behavior optimal?
  2021-02-09  0:42 ` Chris Murphy
@ 2021-02-09 18:13 ` Goffredo Baroncelli
  2021-02-09 19:01   ` Chris Murphy
  0 siblings, 1 reply; 19+ messages in thread
From: Goffredo Baroncelli @ 2021-02-09 18:13 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

On 2/9/21 1:42 AM, Chris Murphy wrote:
> Perhaps. Attach strace to journald before --rotate, and then --rotate
>
> https://pastebin.com/UGihfCG9

I looked at this strace.

Line 115 calls ioctl(<BTRFS-DEFRAG>).
Line 123 calls ioctl(<BTRFS-DEFRAG>).

However, the two descriptors on which the defrag is invoked are never synced first.

I was expecting to see a sync (flushing the data to the platters) and then an
ioctl(<BTRFS-DEFRAG>). That does not seem to happen, judging from the strace.

I wrote a script (see below) which basically:
- creates a fragmented file
- runs filefrag on it
- optionally syncs the file   <-----
- runs btrfs fi defrag on it
- runs filefrag on it

If I don't perform the sync, the defrag is ineffective. But if I sync the
file BEFORE doing the defrag, I get only one extent.

My hypothesis is that the journal log files end up poorly defragmented because they
are not synced first. This could be tested quite easily by putting an fsync() before the
ioctl(<BTRFS_DEFRAG>).

Any thoughts?
Regards
Goffredo

-----
$ cat test.py
import os, time, sys

def create_file(fn):
    """ Create a fragmented file """
    # the data below are from a real case
    data = [(0, 0), (1, 1599), (1600, 1607), (1608, 1689), (1690, 1690),
            (1691, 1693), (1694, 1694), (1695, 1722), (1723, 1723),
            (1724, 1955), (1956, 1956), (1957, 2047), (2048, 2417),
            (2418, 2420), (2421, 2478), (2479, 2479), (2480, 2482),
            (2483, 2483), (2484, 2523), (2524, 2527), (2528, 2598),
            (2599, 2599), (2600, 2607), (2608, 2608), (2609, 2611),
            (2612, 2612), (2613, 2615), (2616, 2616), (2617, 2691),
            (2692, 2696)]
    blocksize = 4096

    f = os.open(fn, os.O_RDWR + os.O_TRUNC + os.O_CREAT)
    os.close(f)

    # write the odd extents, then sync...
    ldata = len(data)
    i = 1
    f = os.open(fn, os.O_RDWR)
    while i < ldata:
        (from_, to_) = data[ldata - i - 1]
        l = (to_ - from_ + 1) * blocksize
        pos = from_ * blocksize
        os.lseek(f, pos, os.SEEK_SET)
        os.write(f, b"X" * l)
        i += 2
    os.fsync(f)
    os.close(f)
    os.system("sync")
    os.system("sync")
    print("sleep 5s")
    #time.sleep(5)
    os.system("sync")
    os.system("sync")

    # ... then write the even extents
    i = 0
    f = os.open(fn, os.O_RDWR)
    while i < ldata:
        (from_, to_) = data[ldata - i - 1]
        l = (to_ - from_ + 1) * blocksize
        pos = from_ * blocksize
        os.lseek(f, pos, os.SEEK_SET)
        os.write(f, b"X" * l)
        i += 2
    os.close(f)

def test_without_sync(fn):
    create_file(fn)
    print("\nCreated fragmented file")
    os.system("sudo filefrag -v " + fn)
    print("\nStart defrag without sync\n", end="")
    os.system("btrfs fi defra " + fn)
    print("End defrag")
    os.system("sync")
    os.system("sync")
    print("End sync")
    os.system("sudo filefrag -v " + fn)

def test_with_sync(fn):
    create_file(fn)
    print("\nCreated fragmented file")
    os.system("sync")
    os.system("sync")
    os.system("sudo filefrag -v " + fn)
    print("\nStart defrag with sync\n", end="")
    os.system("btrfs fi defra " + fn)
    print("End defrag")
    os.system("sync")
    os.system("sync")
    print("End sync")
    os.system("sudo filefrag -v " + fn)

fn = sys.argv[1]
assert(len(fn))
os.system("sudo true") # to start sudo
test_without_sync(fn)
test_with_sync(fn)
-----
$ python3 test.py
/mnt/btrfs-raid1/home/ghigo/data.txt sleep 5s Created fragmented file Filesystem type is: 9123683e File size of /mnt/btrfs-raid1/home/ghigo/data.txt is 11046912 (2697 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 0: 1596416.. 1596416: 1: 1: 1.. 1599: 0.. 1598: 1599: 1596417: unknown_loc,delalloc 2: 1600.. 1607: 1597465.. 1597472: 8: 1599: 3: 1608.. 1689: 0.. 81: 82: 1597473: unknown_loc,delalloc 4: 1690.. 1690: 1596854.. 1596854: 1: 82: 5: 1691.. 1693: 0.. 2: 3: 1596855: unknown_loc,delalloc 6: 1694.. 1694: 1596858.. 1596858: 1: 3: 7: 1695.. 1722: 0.. 27: 28: 1596859: unknown_loc,delalloc 8: 1723.. 1723: 1596863.. 1596863: 1: 28: 9: 1724.. 1955: 0.. 231: 232: 1596864: unknown_loc,delalloc 10: 1956.. 1956: 1596864.. 1596864: 1: 232: 11: 1957.. 2047: 0.. 90: 91: 1596865: unknown_loc,delalloc 12: 2048.. 2417: 1648394.. 1648763: 370: 91: 13: 2418.. 2420: 0.. 2: 3: 1648764: unknown_loc,delalloc 14: 2421.. 2478: 1600795.. 1600852: 58: 3: 15: 2479.. 2479: 0.. 0: 1: 1600853: unknown_loc,delalloc 16: 2480.. 2482: 1597473.. 1597475: 3: 1: 17: 2483.. 2483: 0.. 0: 1: 1597476: unknown_loc,delalloc 18: 2484.. 2523: 1600927.. 1600966: 40: 1: 19: 2524.. 2527: 0.. 3: 4: 1600967: unknown_loc,delalloc 20: 2528.. 2598: 1624667.. 1624737: 71: 4: 21: 2599.. 2599: 0.. 0: 1: 1624738: unknown_loc,delalloc 22: 2600.. 2607: 1597476.. 1597483: 8: 1: 23: 2608.. 2608: 0.. 0: 1: 1597484: unknown_loc,delalloc 24: 2609.. 2611: 1599231.. 1599233: 3: 1: 25: 2612.. 2612: 0.. 0: 1: 1599234: unknown_loc,delalloc 26: 2613.. 2615: 1599234.. 1599236: 3: 1: 27: 2616.. 2616: 0.. 0: 1: 1599237: unknown_loc,delalloc 28: 2617.. 2691: 1624738.. 1624812: 75: 1: 29: 2692.. 2696: 0.. 
4: 5: 1624813: last,unknown_loc,delalloc,eof /mnt/btrfs-raid1/home/ghigo/data.txt: 30 extents found Start defrag without sync End defrag End sync Filesystem type is: 9123683e File size of /mnt/btrfs-raid1/home/ghigo/data.txt is 11046912 (2697 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 0: 1596416.. 1596416: 1: 1: 1.. 1599: 163433285.. 163434883: 1599: 1596417: 2: 1600.. 1607: 1597465.. 1597472: 8: 163434884: 3: 1608.. 1689: 1604137.. 1604218: 82: 1597473: 4: 1690.. 1690: 1596854.. 1596854: 1: 1604219: 5: 1691.. 1693: 1599237.. 1599239: 3: 1596855: 6: 1694.. 1694: 1596858.. 1596858: 1: 1599240: 7: 1695.. 1722: 1599557.. 1599584: 28: 1596859: 8: 1723.. 1723: 1596863.. 1596863: 1: 1599585: 9: 1724.. 1955: 1651669.. 1651900: 232: 1596864: 10: 1956.. 1956: 1596864.. 1596864: 1: 1651901: 11: 1957.. 2047: 1850859.. 1850949: 91: 1596865: 12: 2048.. 2417: 1648394.. 1648763: 370: 1850950: 13: 2418.. 2420: 1599240.. 1599242: 3: 1648764: 14: 2421.. 2478: 1600795.. 1600852: 58: 1599243: 15: 2479.. 2479: 1596981.. 1596981: 1: 1600853: 16: 2480.. 2482: 1597473.. 1597475: 3: 1596982: 17: 2483.. 2483: 1597038.. 1597038: 1: 1597476: 18: 2484.. 2523: 1600927.. 1600966: 40: 1597039: 19: 2524.. 2527: 1599290.. 1599293: 4: 1600967: 20: 2528.. 2598: 1624667.. 1624737: 71: 1599294: 21: 2599.. 2599: 1597045.. 1597045: 1: 1624738: 22: 2600.. 2607: 1597476.. 1597483: 8: 1597046: 23: 2608.. 2608: 1597056.. 1597056: 1: 1597484: 24: 2609.. 2611: 1599231.. 1599233: 3: 1597057: 25: 2612.. 2612: 1597059.. 1597059: 1: 1599234: 26: 2613.. 2615: 1599234.. 1599236: 3: 1597060: 27: 2616.. 2616: 1597100.. 1597100: 1: 1599237: 28: 2617.. 2691: 1624738.. 1624812: 75: 1597101: 29: 2692.. 2696: 1599294.. 
1599298: 5: 1624813: last,eof /mnt/btrfs-raid1/home/ghigo/data.txt: 30 extents found sleep 5s Created fragmented file Filesystem type is: 9123683e File size of /mnt/btrfs-raid1/home/ghigo/data.txt is 11046912 (2697 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 0: 1597109.. 1597109: 1: 1: 1.. 1599: 0.. 1598: 1599: 1597110: unknown_loc,delalloc 2: 1600.. 1607: 1599299.. 1599306: 8: 1599: 3: 1608.. 1689: 0.. 81: 82: 1599307: unknown_loc,delalloc 4: 1690.. 1690: 1597124.. 1597124: 1: 82: 5: 1691.. 1693: 0.. 2: 3: 1597125: unknown_loc,delalloc 6: 1694.. 1694: 1597129.. 1597129: 1: 3: 7: 1695.. 1722: 0.. 27: 28: 1597130: unknown_loc,delalloc 8: 1723.. 1723: 1597134.. 1597134: 1: 28: 9: 1724.. 1955: 0.. 231: 232: 1597135: unknown_loc,delalloc 10: 1956.. 1956: 1597137.. 1597137: 1: 232: 11: 1957.. 2047: 0.. 90: 91: 1597138: unknown_loc,delalloc 12: 2048.. 2417: 88373891.. 88374260: 370: 91: 13: 2418.. 2420: 0.. 2: 3: 88374261: unknown_loc,delalloc 14: 2421.. 2478: 1600987.. 1601044: 58: 3: 15: 2479.. 2479: 0.. 0: 1: 1601045: unknown_loc,delalloc 16: 2480.. 2482: 1599585.. 1599587: 3: 1: 17: 2483.. 2483: 0.. 0: 1: 1599588: unknown_loc,delalloc 18: 2484.. 2523: 1601650.. 1601689: 40: 1: 19: 2524.. 2527: 0.. 3: 4: 1601690: unknown_loc,delalloc 20: 2528.. 2598: 1625881.. 1625951: 71: 4: 21: 2599.. 2599: 0.. 0: 1: 1625952: unknown_loc,delalloc 22: 2600.. 2607: 1600717.. 1600724: 8: 1: 23: 2608.. 2608: 0.. 0: 1: 1600725: unknown_loc,delalloc 24: 2609.. 2611: 1599692.. 1599694: 3: 1: 25: 2612.. 2612: 0.. 0: 1: 1599695: unknown_loc,delalloc 26: 2613.. 2615: 1599821.. 1599823: 3: 1: 27: 2616.. 2616: 0.. 0: 1: 1599824: unknown_loc,delalloc 28: 2617.. 2691: 1629466.. 1629540: 75: 1: 29: 2692.. 2696: 0.. 
4: 5: 1629541: last,unknown_loc,delalloc,eof /mnt/btrfs-raid1/home/ghigo/data.txt: 30 extents found Start defrag with sync End defrag End sync Filesystem type is: 9123683e File size of /mnt/btrfs-raid1/home/ghigo/data.txt is 11046912 (2697 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 2696: 163503187.. 163505883: 2697: last,eof /mnt/btrfs-raid1/home/ghigo/data.txt: 1 extent found ---- -- gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: is BTRFS_IOC_DEFRAG behavior optimal? 2021-02-09 18:13 ` Goffredo Baroncelli @ 2021-02-09 19:01 ` Chris Murphy 2021-02-09 19:45 ` Goffredo Baroncelli 0 siblings, 1 reply; 19+ messages in thread From: Chris Murphy @ 2021-02-09 19:01 UTC (permalink / raw) To: Goffredo Baroncelli; +Cc: Chris Murphy, Btrfs BTRFS On Tue, Feb 9, 2021 at 11:13 AM Goffredo Baroncelli <kreijack@inwind.it> wrote: > > On 2/9/21 1:42 AM, Chris Murphy wrote: > > Perhaps. Attach strace to journald before --rotate, and then --rotate > > > > https://pastebin.com/UGihfCG9 > > I looked to this strace. > > in line 115: it is called a ioctl(<BTRFS-DEFRAG>) > in line 123: it is called a ioctl(<BTRFS-DEFRAG>) > > However the two descriptors for which the defrag is invoked are never sync-ed before. > > I was expecting is to see a sync (flush the data on the platters) and then a > ioctl(<BTRFS-defrag>. This doesn't seems to be looking from the strace. > > I wrote a script (see below) which basically: > - create a fragmented file > - run filefrag on it > - optionally sync the file <----- > - run btrfs fi defrag on it > - run filefrag on it > > If I don't perform the sync, the defrag is ineffective. But if I sync the > file BEFORE doing the defrag, I got only one extent. > Now my hypothesis is: the journal log files are bad de-fragmented because these > are not sync-ed before. > This could be tested quite easily putting an fsync() before the > ioctl(<BTRFS_DEFRAG>). > > Any thought ? No idea. If it's a full sync then it could be expensive on either slower devices or heavier workloads. On the one hand, there's no point of doing an ineffective defrag so maybe the defrag ioctl should just do the sync first? On the other hand, this would effectively make the defrag ioctl a full file system sync which might be unexpected. It's a set of tradeoffs and I don't know what the expectation is. What about fdatasync() on the journal file rather than a full sync? 
-- Chris Murphy ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: is BTRFS_IOC_DEFRAG behavior optimal? 2021-02-09 19:01 ` Chris Murphy @ 2021-02-09 19:45 ` Goffredo Baroncelli 2021-02-09 20:26 ` Chris Murphy 0 siblings, 1 reply; 19+ messages in thread From: Goffredo Baroncelli @ 2021-02-09 19:45 UTC (permalink / raw) To: Chris Murphy; +Cc: Btrfs BTRFS On 2/9/21 8:01 PM, Chris Murphy wrote: > On Tue, Feb 9, 2021 at 11:13 AM Goffredo Baroncelli <kreijack@inwind.it> wrote: >> >> On 2/9/21 1:42 AM, Chris Murphy wrote: >>> Perhaps. Attach strace to journald before --rotate, and then --rotate >>> >>> https://pastebin.com/UGihfCG9 >> >> I looked to this strace. >> >> in line 115: it is called a ioctl(<BTRFS-DEFRAG>) >> in line 123: it is called a ioctl(<BTRFS-DEFRAG>) >> >> However the two descriptors for which the defrag is invoked are never sync-ed before. >> >> I was expecting is to see a sync (flush the data on the platters) and then a >> ioctl(<BTRFS-defrag>. This doesn't seems to be looking from the strace. >> >> I wrote a script (see below) which basically: >> - create a fragmented file >> - run filefrag on it >> - optionally sync the file <----- >> - run btrfs fi defrag on it >> - run filefrag on it >> >> If I don't perform the sync, the defrag is ineffective. But if I sync the >> file BEFORE doing the defrag, I got only one extent. >> Now my hypothesis is: the journal log files are bad de-fragmented because these >> are not sync-ed before. >> This could be tested quite easily putting an fsync() before the >> ioctl(<BTRFS_DEFRAG>). >> >> Any thought ? > > No idea. If it's a full sync then it could be expensive on either > slower devices or heavier workloads. On the one hand, there's no point > of doing an ineffective defrag so maybe the defrag ioctl should just > do the sync first? On the other hand, this would effectively make the > defrag ioctl a full file system sync which might be unexpected. It's a > set of tradeoffs and I don't know what the expectation is. 
>
> What about fdatasync() on the journal file rather than a full sync?

I tried a fsync(2) call, and the result is the same.
Only after reading your reply did I realize that I had used sync(2) when
I meant to use fsync(2).

I updated my python test code:

----
import os, time, sys

def create_file(fn):
    """ Create a fragmented file """
    # the data below are from a real case
    data = [(0, 0), (1, 1599), (1600, 1607), (1608, 1689), (1690, 1690),
            (1691, 1693), (1694, 1694), (1695, 1722), (1723, 1723),
            (1724, 1955), (1956, 1956), (1957, 2047), (2048, 2417),
            (2418, 2420), (2421, 2478), (2479, 2479), (2480, 2482),
            (2483, 2483), (2484, 2523), (2524, 2527), (2528, 2598),
            (2599, 2599), (2600, 2607), (2608, 2608), (2609, 2611),
            (2612, 2612), (2613, 2615), (2616, 2616), (2617, 2691),
            (2692, 2696)]
    blocksize = 4096

    # write the odd extents...
    f = os.open(fn, os.O_RDWR + os.O_TRUNC + os.O_CREAT)
    os.close(f)
    ldata = len(data)
    i = 1
    f = os.open(fn, os.O_RDWR)
    while i < ldata:
        (from_, to_) = data[ldata - i - 1]
        l = (to_ - from_ + 1) * blocksize
        pos = from_ * blocksize
        os.lseek(f, pos, os.SEEK_SET)
        os.write(f, b"X" * l)
        i += 2

    # ... sync and then write the even extents
    os.fsync(f)
    os.close(f)
    i = 0
    f = os.open(fn, os.O_RDWR)
    while i < ldata:
        (from_, to_) = data[ldata - i - 1]
        l = (to_ - from_ + 1) * blocksize
        pos = from_ * blocksize
        os.lseek(f, pos, os.SEEK_SET)
        os.write(f, b"X" * l)
        i += 2
    os.close(f)

def fsync(nf):
    f = os.open(nf, os.O_RDWR)
    os.fsync(f)
    os.close(f)

def test_without_sync(fn):
    create_file(fn)
    print("\nCreated fragmented file")
    os.system("sudo filefrag -v " + fn)
    print("\nStart defrag without sync\n", end="")
    os.system("btrfs fi defra " + fn)
    print("End defrag")
    fsync(fn)
    print("End sync")
    os.system("sudo filefrag -v " + fn)

def test_with_sync(fn):
    create_file(fn)
    print("\nCreated fragmented file")
    fsync(fn)
    os.system("sudo filefrag -v " + fn)
    print("\nStart defrag with sync\n", end="")
    os.system("btrfs fi defra " + fn)
    print("End defrag")
    fsync(fn)
    print("End sync")
    os.system("sudo filefrag -v " + fn)

fn = sys.argv[1]
assert(len(fn))
os.system("sudo true") # to start sudo
test_without_sync(fn)
test_with_sync(fn)
----

--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: is BTRFS_IOC_DEFRAG behavior optimal? 2021-02-09 19:45 ` Goffredo Baroncelli @ 2021-02-09 20:26 ` Chris Murphy 2021-02-10 6:37 ` Chris Murphy 0 siblings, 1 reply; 19+ messages in thread From: Chris Murphy @ 2021-02-09 20:26 UTC (permalink / raw) To: Goffredo Baroncelli; +Cc: Chris Murphy, Btrfs BTRFS On Tue, Feb 9, 2021 at 12:45 PM Goffredo Baroncelli <kreijack@inwind.it> wrote: > > On 2/9/21 8:01 PM, Chris Murphy wrote: > > On Tue, Feb 9, 2021 at 11:13 AM Goffredo Baroncelli <kreijack@inwind.it> wrote: > >> > >> On 2/9/21 1:42 AM, Chris Murphy wrote: > >>> Perhaps. Attach strace to journald before --rotate, and then --rotate > >>> > >>> https://pastebin.com/UGihfCG9 > >> > >> I looked to this strace. > >> > >> in line 115: it is called a ioctl(<BTRFS-DEFRAG>) > >> in line 123: it is called a ioctl(<BTRFS-DEFRAG>) > >> > >> However the two descriptors for which the defrag is invoked are never sync-ed before. > >> > >> I was expecting is to see a sync (flush the data on the platters) and then a > >> ioctl(<BTRFS-defrag>. This doesn't seems to be looking from the strace. > >> > >> I wrote a script (see below) which basically: > >> - create a fragmented file > >> - run filefrag on it > >> - optionally sync the file <----- > >> - run btrfs fi defrag on it > >> - run filefrag on it > >> > >> If I don't perform the sync, the defrag is ineffective. But if I sync the > >> file BEFORE doing the defrag, I got only one extent. > >> Now my hypothesis is: the journal log files are bad de-fragmented because these > >> are not sync-ed before. > >> This could be tested quite easily putting an fsync() before the > >> ioctl(<BTRFS_DEFRAG>). > >> > >> Any thought ? > > > > No idea. If it's a full sync then it could be expensive on either > > slower devices or heavier workloads. On the one hand, there's no point > > of doing an ineffective defrag so maybe the defrag ioctl should just > > do the sync first? 
do the sync first? On the other hand, this would effectively make the
> > defrag ioctl a full file system sync which might be unexpected. It's a
> > set of tradeoffs and I don't know what the expectation is.
> >
> > What about fdatasync() on the journal file rather than a full sync?
>
> I tried a fsync(2) call, and the result is the same.
> Only after reading your reply I realized that I used a sync(2), when
> I meant to use fsync(2).
>
> I updated my python test code

OK, fsync should be the least costly of the three.

The three unique things about systemd-journald that might be factors:

* nodatacow file
* fallocated file in 8MB increments, multiple times, up to 128M
* BTRFS_IOC_DEFRAG, whereas btrfs-progs uses BTRFS_IOC_DEFRAG_RANGE

So maybe it's all explained by the lack of fsync; I'm not sure. But the
commit that added this doesn't show any form of sync.

https://github.com/systemd/systemd/commit/f27a386430cc7a27ebd06899d93310fb3bd4cee7

--
Chris Murphy

^ permalink raw reply	[flat|nested] 19+ messages in thread
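For illustration, the fallocate growth pattern described in this thread can be sketched roughly as follows (the helper and constants are mine, not journald's actual code; the 8MB step and 128M cap are taken from the observations above):

```python
import os

GROW_STEP = 8 * 1024 * 1024      # journals observed growing in 8 MiB steps
MAX_SIZE  = 128 * 1024 * 1024    # ...up to a 128M cap

def grow_journal(fd, needed):
    """fallocate the file up to the next 8 MiB boundary covering `needed` bytes."""
    steps = (needed + GROW_STEP - 1) // GROW_STEP
    new_size = min(steps * GROW_STEP, MAX_SIZE)
    # preallocate without writing; on btrfs this produces the
    # "prealloc" extents visible in filefrag/dump-tree output
    os.posix_fallocate(fd, 0, new_size)
    return new_size
```

Each such preallocation tends to become its own extent, which is why a file grown this way shows one extent per 8MB append even on ext4.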
* Re: is BTRFS_IOC_DEFRAG behavior optimal? 2021-02-09 20:26 ` Chris Murphy @ 2021-02-10 6:37 ` Chris Murphy 2021-02-10 19:14 ` Goffredo Baroncelli 0 siblings, 1 reply; 19+ messages in thread From: Chris Murphy @ 2021-02-10 6:37 UTC (permalink / raw) To: Chris Murphy; +Cc: Goffredo Baroncelli, Btrfs BTRFS This is an active (but idle) system.journal file. That is, it's open but not being written to. I did a sync right before this: https://pastebin.com/jHh5tfpe And then: btrfs fi defrag -l 8M system.journal https://pastebin.com/Kq1GjJuh Looks like most of it was a no op. So it seems btrfs in this case is not confused by so many small extent items, it know they are contiguous? It doesn't answer the question what the "too small" threshold is for BTRFS_IOC_DEFRAG, which is what sd-journald is using, though. Another sync, and then, 'journalctl --rotate' and the resulting archived file is now: https://pastebin.com/aqac0dRj These are not the same results between the two ioctls for the same file, and not the same result as what you get with -l 32M (which I do get if I use the default 32M). The BTRFS_IOC_DEFRAG interleaved result is peculiar, but I don't think we can say it's ineffective, it might be an intentional no op either because it's nodatacow or it sees that these many extents are mostly contiguous and not worth defragmenting (which would be good for keeping write amplification down). So I don't know, maybe it's not wrong. -- Chris Murphy ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: is BTRFS_IOC_DEFRAG behavior optimal?
  2021-02-10  6:37 ` Chris Murphy
@ 2021-02-10 19:14 ` Goffredo Baroncelli
  2021-02-11  0:19   ` Chris Murphy
  2021-02-11  3:08   ` kreijack
  0 siblings, 2 replies; 19+ messages in thread
From: Goffredo Baroncelli @ 2021-02-10 19:14 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

Hi Chris,

it seems that systemd-journald is smarter and more complex than I thought:

1) systemd-journald sets the "live" journal as NOCOW; *when* (see below) it
closes the files, it marks them as COW again and then defrags them [1]

2) looking at the code, I suspect that systemd-journald closes the
file asynchronously [2]. This means that looking at the "live" journal
is not sufficient. In fact:

/var/log/journal/e84907d099904117b355a99c98378dca$ sudo lsattr $(ls -rt *)
[...]
--------------------- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd4f-0005baed61106a18.journal
--------------------- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000bd64-0005baed659feff4.journal
--------------------- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd67-0005baed65a0901f.journal
---------------C----- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cc63-0005bafed4f12f0a.journal
---------------C----- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cc85-0005baff0ce27e49.journal
---------------C----- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cd38-0005baffe9080b4d.journal
---------------C----- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cd3b-0005baffe908f244.journal
---------------C----- user-1000.journal
---------------C----- system.journal

The output above means that the last 6 files are still "pending" defragmentation.
When these are "closed", the NOCOW flag will be removed and a defragmentation will start.

Now my journals have only a few extents (2 or 3). But I saw cases where the more
recent files had hundreds of extents, while after a few "journalctl --rotate" calls the
older files became less fragmented.
[1] https://github.com/systemd/systemd/blob/fee6441601c979165ebcbb35472036439f8dad5f/src/libsystemd/sd-journal/journal-file.c#L383 [2] https://github.com/systemd/systemd/blob/fee6441601c979165ebcbb35472036439f8dad5f/src/libsystemd/sd-journal/journal-file.c#L3687 On 2/10/21 7:37 AM, Chris Murphy wrote: > This is an active (but idle) system.journal file. That is, it's open > but not being written to. I did a sync right before this: > > https://pastebin.com/jHh5tfpe > > And then: btrfs fi defrag -l 8M system.journal > > https://pastebin.com/Kq1GjJuh > > Looks like most of it was a no op. So it seems btrfs in this case is > not confused by so many small extent items, it know they are > contiguous? > > It doesn't answer the question what the "too small" threshold is for > BTRFS_IOC_DEFRAG, which is what sd-journald is using, though. > > Another sync, and then, 'journalctl --rotate' and the resulting > archived file is now: > > https://pastebin.com/aqac0dRj > > These are not the same results between the two ioctls for the same > file, and not the same result as what you get with -l 32M (which I do > get if I use the default 32M). The BTRFS_IOC_DEFRAG interleaved result > is peculiar, but I don't think we can say it's ineffective, it might > be an intentional no op either because it's nodatacow or it sees that > these many extents are mostly contiguous and not worth defragmenting > (which would be good for keeping write amplification down). > > So I don't know, maybe it's not wrong. > > -- > Chris Murphy > -- gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 ^ permalink raw reply [flat|nested] 19+ messages in thread
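For reference, the 'C' attribute that lsattr prints corresponds to the FS_NOCOW_FL inode flag, readable through the FS_IOC_GETFLAGS ioctl. A rough sketch of checking it programmatically (constant values as I read them from linux/fs.h, assuming 64-bit Linux; the function name is mine):

```python
import fcntl, os, struct

# From linux/fs.h (FS_IOC_GETFLAGS is _IOR('f', 1, long); 8-byte long assumed):
FS_IOC_GETFLAGS = 0x80086601
FS_NOCOW_FL     = 0x00800000   # the 'C' attribute shown by lsattr

def is_nocow(path):
    """Return True if the file has the NOCOW ('C') inode flag set."""
    fd = os.open(path, os.O_RDONLY)
    try:
        buf = fcntl.ioctl(fd, FS_IOC_GETFLAGS, struct.pack("l", 0))
        (flags,) = struct.unpack("l", buf)
        return bool(flags & FS_NOCOW_FL)
    finally:
        os.close(fd)
```

This is essentially what lsattr does internally, so a script watching the journal directory could distinguish the still-NOCOW, defrag-pending files from the already-converted ones.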
* Re: is BTRFS_IOC_DEFRAG behavior optimal? 2021-02-10 19:14 ` Goffredo Baroncelli @ 2021-02-11 0:19 ` Chris Murphy 2021-02-11 3:08 ` kreijack 1 sibling, 0 replies; 19+ messages in thread From: Chris Murphy @ 2021-02-11 0:19 UTC (permalink / raw) To: Goffredo Baroncelli; +Cc: Chris Murphy, Btrfs BTRFS On Wed, Feb 10, 2021 at 12:14 PM Goffredo Baroncelli <kreijack@inwind.it> wrote: > > Hi Chris, > > it seems that systemd-journald is more smart/complex than I thought: > > 1) systemd-journald set the "live" journal as NOCOW; *when* (see below) it > closes the files, it mark again these as COW then defrag [1] Found that in commit 11689d2a021d95a8447d938180e0962cd9439763 from 2015. But archived journals are still all nocow for me on systemd 247. Is it because the enclosing directory has file attribute 'C' ? Another example: Active journal "system.journal" INODE_ITEM contains sequence 4515 flags 0x13(NODATASUM|NODATACOW|PREALLOC) 7 day old archived journal "systemd.journal" INODE_ITEM shows: sequence 227 flags 0x13(NODATASUM|NODATACOW|PREALLOC) So if it ever was COW, it flipped to NOCOW before the defrag. Is it expected? and also this archived file's INODE_ITEM shows generation 1748644 transid 1760983 size 16777216 nbytes 16777216 with EXTENT_ITEMs show generation 1755533 type 1 (regular) generation 1753668 type 1 (regular) generation 1755533 type 1 (regular) generation 1753989 type 1 (regular) generation 1755533 type 1 (regular) generation 1753526 type 1 (regular) generation 1755533 type 1 (regular) generation 1755531 type 1 (regular) generation 1755533 type 1 (regular) generation 1755531 type 2 (prealloc) file tree output for this file https://pastebin.com/6uDFNDdd > 2) looking at the code, I suspect that systemd-journald closes the > file asynchronously [2]. This means that looking at the "live" journal > is not sufficient. In fact: > > /var/log/journal/e84907d099904117b355a99c98378dca$ sudo lsattr $(ls -rt *) > [...] 
> --------------------- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd4f-0005baed61106a18.journal > --------------------- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000bd64-0005baed659feff4.journal > --------------------- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd67-0005baed65a0901f.journal > ---------------C----- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cc63-0005bafed4f12f0a.journal > ---------------C----- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cc85-0005baff0ce27e49.journal > ---------------C----- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cd38-0005baffe9080b4d.journal > ---------------C----- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cd3b-0005baffe908f244.journal > ---------------C----- user-1000.journal > ---------------C----- system.journal > > The output above means that the last 6 files are "pending" for a de-fragmentation. When these will be > "closed", the NOCOW flag will be removed and a defragmentation will start. > > Now my journals have few (2 or 3 extents). But I saw cases where the extents > of the more recent files are hundreds, but after few "journalct --rotate" the older files become less > fragmented. Josef explained to me that BTRFS_IOC_DEFRAG is pretty simple and just dirties extents it considers too small, and they end up just going through the normal write path, along with anything else pending. And also that fsync() will set the extents on disk so that the defrag ioctl know what to dirty, but that ordinarily it's not required and might have to do with the interleaving write pattern for the journals. I'm not sure what this ioctl considers big enough that it's worth just leaving alone. But in any case it sounds like the current write workload at the time of defrag could affect the allocation, unlike BTRFS_IOC_DEFRAG_RANGE which has a few knobs to control the outcome. Or maybe the knobs just influence the outcome. Not sure. 
If the device is HDD, it might be nice if the nodatacow journals are datacow again so they could be compressed. But my evaluation shows that nodatacow journals stick to an 8MB extent pattern, correlating to fallocated append as they grow. It's not significantly fragmented to start out with, whether HDD or SSD. -- Chris Murphy ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: is BTRFS_IOC_DEFRAG behavior optimal? 2021-02-10 19:14 ` Goffredo Baroncelli 2021-02-11 0:19 ` Chris Murphy @ 2021-02-11 3:08 ` kreijack 2021-02-11 3:13 ` Zygo Blaxell 1 sibling, 1 reply; 19+ messages in thread From: kreijack @ 2021-02-11 3:08 UTC (permalink / raw) To: Goffredo Baroncelli; +Cc: Chris Murphy, Btrfs BTRFS On Wed, Feb 10, 2021 at 08:14:09PM +0100, Goffredo Baroncelli wrote: > Hi Chris, > > it seems that systemd-journald is more smart/complex than I thought: > > 1) systemd-journald set the "live" journal as NOCOW; *when* (see below) it > closes the files, it mark again these as COW then defrag [1] > > 2) looking at the code, I suspect that systemd-journald closes the > file asynchronously [2]. This means that looking at the "live" journal > is not sufficient. In fact: > > /var/log/journal/e84907d099904117b355a99c98378dca$ sudo lsattr $(ls -rt *) > [...] > --------------------- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd4f-0005baed61106a18.journal > --------------------- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000bd64-0005baed659feff4.journal > --------------------- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd67-0005baed65a0901f.journal > ---------------C----- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cc63-0005bafed4f12f0a.journal > ---------------C----- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cc85-0005baff0ce27e49.journal > ---------------C----- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cd38-0005baffe9080b4d.journal > ---------------C----- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cd3b-0005baffe908f244.journal > ---------------C----- user-1000.journal > ---------------C----- system.journal > > The output above means that the last 6 files are "pending" for a de-fragmentation. When these will be > "closed", the NOCOW flag will be removed and a defragmentation will start. Wait what? > Now my journals have few (2 or 3 extents). 
But I saw cases where the extents > of the more recent files are hundreds, but after few "journalct --rotate" the older files become less > fragmented. > > [1] https://github.com/systemd/systemd/blob/fee6441601c979165ebcbb35472036439f8dad5f/src/libsystemd/sd-journal/journal-file.c#L383 That line doesn't work, and systemd ignores the error. The NOCOW flag cannot be set or cleared unless the file is empty. This is checked in btrfs_ioctl_setflags. This is not something that can be changed easily--if the NOCOW bit is cleared on a non-empty file, btrfs data read code will expect csums that aren't present on disk because they were written while the file was NODATASUM, and the reads will fail pretty badly. The entire file would have to have csums added or removed at the same time as the flag change (or all nodatacow file reads take a performance hit looking for csums that may or may not be present). At file close, the systemd should copy the data to a new file with no special attributes and discard or recycle the old inode. This copy will be mostly contiguous and have desirable properties like csums and compression, and will have iops equivalent to btrfs fi defrag. > [2] https://github.com/systemd/systemd/blob/fee6441601c979165ebcbb35472036439f8dad5f/src/libsystemd/sd-journal/journal-file.c#L3687 > > On 2/10/21 7:37 AM, Chris Murphy wrote: > > This is an active (but idle) system.journal file. That is, it's open > > but not being written to. I did a sync right before this: > > > > https://pastebin.com/jHh5tfpe > > > > And then: btrfs fi defrag -l 8M system.journal > > > > https://pastebin.com/Kq1GjJuh > > > > Looks like most of it was a no op. So it seems btrfs in this case is > > not confused by so many small extent items, it know they are > > contiguous? > > > > It doesn't answer the question what the "too small" threshold is for > > BTRFS_IOC_DEFRAG, which is what sd-journald is using, though. 
> > > > Another sync, and then, 'journalctl --rotate' and the resulting > > archived file is now: > > > > https://pastebin.com/aqac0dRj > > > > These are not the same results between the two ioctls for the same > > file, and not the same result as what you get with -l 32M (which I do > > get if I use the default 32M). The BTRFS_IOC_DEFRAG interleaved result > > is peculiar, but I don't think we can say it's ineffective, it might > > be an intentional no op either because it's nodatacow or it sees that > > these many extents are mostly contiguous and not worth defragmenting > > (which would be good for keeping write amplification down). > > > > So I don't know, maybe it's not wrong. > > > > -- > > Chris Murphy > > > > > -- > gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> > Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 ^ permalink raw reply [flat|nested] 19+ messages in thread
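The copy-at-close approach suggested above (fresh inode with default attributes, then atomic rename over the old name) can be sketched as follows. This is an illustration of the idea, not systemd code; all names are made up:

```python
import os

def archive_copy(src_path, tmp_path, bufsize=1 << 20):
    """Copy a closed journal to a fresh inode so the new file gets the
    filesystem's default attributes (datacow, csums, compression) and a
    mostly contiguous layout, then atomically replace the old name."""
    with open(src_path, "rb") as src, open(tmp_path, "wb") as dst:
        while True:
            buf = src.read(bufsize)
            if not buf:
                break
            dst.write(buf)
        dst.flush()
        os.fsync(dst.fileno())      # data must be durable before the rename
    os.replace(tmp_path, src_path)  # atomic: old inode is discarded
```

The sequential rewrite costs about the same iops as `btrfs fi defrag` would, which is the trade-off described above.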
* Re: is BTRFS_IOC_DEFRAG behavior optimal? 2021-02-11 3:08 ` kreijack @ 2021-02-11 3:13 ` Zygo Blaxell 2021-02-11 3:39 ` Chris Murphy 2021-02-11 3:52 ` Chris Murphy 0 siblings, 2 replies; 19+ messages in thread From: Zygo Blaxell @ 2021-02-11 3:13 UTC (permalink / raw) To: kreijack; +Cc: Chris Murphy, Btrfs BTRFS [-- Attachment #1: Type: text/plain, Size: 4866 bytes --] Sorry, I busted my mail client. That was from me. :-P On Wed, Feb 10, 2021 at 10:08:37PM -0500, kreijack@inwind.it wrote: > On Wed, Feb 10, 2021 at 08:14:09PM +0100, Goffredo Baroncelli wrote: > > Hi Chris, > > > > it seems that systemd-journald is more smart/complex than I thought: > > > > 1) systemd-journald set the "live" journal as NOCOW; *when* (see below) it > > closes the files, it mark again these as COW then defrag [1] > > > > 2) looking at the code, I suspect that systemd-journald closes the > > file asynchronously [2]. This means that looking at the "live" journal > > is not sufficient. In fact: > > > > /var/log/journal/e84907d099904117b355a99c98378dca$ sudo lsattr $(ls -rt *) > > [...] 
> > --------------------- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd4f-0005baed61106a18.journal > > --------------------- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000bd64-0005baed659feff4.journal > > --------------------- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd67-0005baed65a0901f.journal > > ---------------C----- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cc63-0005bafed4f12f0a.journal > > ---------------C----- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cc85-0005baff0ce27e49.journal > > ---------------C----- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cd38-0005baffe9080b4d.journal > > ---------------C----- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cd3b-0005baffe908f244.journal > > ---------------C----- user-1000.journal > > ---------------C----- system.journal > > > > The output above means that the last 6 files are "pending" for a de-fragmentation. When these will be > > "closed", the NOCOW flag will be removed and a defragmentation will start. > > Wait what? > > > Now my journals have few (2 or 3 extents). But I saw cases where the extents > > of the more recent files are hundreds, but after few "journalct --rotate" the older files become less > > fragmented. > > > > [1] https://github.com/systemd/systemd/blob/fee6441601c979165ebcbb35472036439f8dad5f/src/libsystemd/sd-journal/journal-file.c#L383 > > That line doesn't work, and systemd ignores the error. > > The NOCOW flag cannot be set or cleared unless the file is empty. > This is checked in btrfs_ioctl_setflags. > > This is not something that can be changed easily--if the NOCOW bit is > cleared on a non-empty file, btrfs data read code will expect csums > that aren't present on disk because they were written while the file was > NODATASUM, and the reads will fail pretty badly. 
The entire file would > have to have csums added or removed at the same time as the flag change > (or all nodatacow file reads take a performance hit looking for csums > that may or may not be present). > > At file close, the systemd should copy the data to a new file with no > special attributes and discard or recycle the old inode. This copy > will be mostly contiguous and have desirable properties like csums and > compression, and will have iops equivalent to btrfs fi defrag. > > > [2] https://github.com/systemd/systemd/blob/fee6441601c979165ebcbb35472036439f8dad5f/src/libsystemd/sd-journal/journal-file.c#L3687 > > > > On 2/10/21 7:37 AM, Chris Murphy wrote: > > > This is an active (but idle) system.journal file. That is, it's open > > > but not being written to. I did a sync right before this: > > > > > > https://pastebin.com/jHh5tfpe > > > > > > And then: btrfs fi defrag -l 8M system.journal > > > > > > https://pastebin.com/Kq1GjJuh > > > > > > Looks like most of it was a no op. So it seems btrfs in this case is > > > not confused by so many small extent items, it know they are > > > contiguous? > > > > > > It doesn't answer the question what the "too small" threshold is for > > > BTRFS_IOC_DEFRAG, which is what sd-journald is using, though. > > > > > > Another sync, and then, 'journalctl --rotate' and the resulting > > > archived file is now: > > > > > > https://pastebin.com/aqac0dRj > > > > > > These are not the same results between the two ioctls for the same > > > file, and not the same result as what you get with -l 32M (which I do > > > get if I use the default 32M). The BTRFS_IOC_DEFRAG interleaved result > > > is peculiar, but I don't think we can say it's ineffective, it might > > > be an intentional no op either because it's nodatacow or it sees that > > > these many extents are mostly contiguous and not worth defragmenting > > > (which would be good for keeping write amplification down). > > > > > > So I don't know, maybe it's not wrong. 
> > > > > > -- > > > Chris Murphy > > > > > > > > > -- > > gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> > > Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 195 bytes --] ^ permalink raw reply [flat|nested] 19+ messages in thread
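For reference, the flag change being discussed is a plain FS_IOC_GETFLAGS/FS_IOC_SETFLAGS round trip. A hedged Python sketch — the ioctl numbers are the linux/fs.h values as seen on x86-64, and on btrfs the SETFLAGS call is rejected (in btrfs_ioctl_setflags) unless the file is empty, which is exactly the error systemd ignores:

```python
import fcntl
import struct

# Values from linux/fs.h as seen on x86-64; illustrative, not a portable ABI.
FS_IOC_GETFLAGS = 0x80086601
FS_IOC_SETFLAGS = 0x40086602
FS_NOCOW_FL     = 0x00800000   # the 'C' bit that lsattr displays

def toggle_nocow(flags, enable):
    """Pure helper: set or clear FS_NOCOW_FL in an inode flags word."""
    return flags | FS_NOCOW_FL if enable else flags & ~FS_NOCOW_FL

def set_nocow(fd, enable=True):
    """Read-modify-write the inode flags. On btrfs this succeeds only
    while the file is empty; on a non-empty file the kernel rejects the
    change, because flipping it would invalidate csum expectations."""
    buf = bytearray(struct.pack("l", 0))
    fcntl.ioctl(fd, FS_IOC_GETFLAGS, buf)
    flags = struct.unpack("l", buf)[0]
    fcntl.ioctl(fd, FS_IOC_SETFLAGS, struct.pack("l", toggle_nocow(flags, enable)))
```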
* Re: is BTRFS_IOC_DEFRAG behavior optimal? 2021-02-11 3:13 ` Zygo Blaxell @ 2021-02-11 3:39 ` Chris Murphy 2021-02-11 6:12 ` Zygo Blaxell 2021-02-11 3:52 ` Chris Murphy 1 sibling, 1 reply; 19+ messages in thread From: Chris Murphy @ 2021-02-11 3:39 UTC (permalink / raw) To: Zygo Blaxell; +Cc: Goffredo Baroncelli, Chris Murphy, Btrfs BTRFS On Wed, Feb 10, 2021 at 8:13 PM Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote: > > Sorry, I busted my mail client. That was from me. :-P > > On Wed, Feb 10, 2021 at 10:08:37PM -0500, kreijack@inwind.it wrote: > > On Wed, Feb 10, 2021 at 08:14:09PM +0100, Goffredo Baroncelli wrote: > > > Hi Chris, > > > > > > it seems that systemd-journald is more smart/complex than I thought: > > > > > > 1) systemd-journald set the "live" journal as NOCOW; *when* (see below) it > > > closes the files, it mark again these as COW then defrag [1] > > > > > > 2) looking at the code, I suspect that systemd-journald closes the > > > file asynchronously [2]. This means that looking at the "live" journal > > > is not sufficient. In fact: > > > > > > /var/log/journal/e84907d099904117b355a99c98378dca$ sudo lsattr $(ls -rt *) > > > [...] 
> > > --------------------- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd4f-0005baed61106a18.journal > > > --------------------- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000bd64-0005baed659feff4.journal > > > --------------------- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd67-0005baed65a0901f.journal > > > ---------------C----- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cc63-0005bafed4f12f0a.journal > > > ---------------C----- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cc85-0005baff0ce27e49.journal > > > ---------------C----- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cd38-0005baffe9080b4d.journal > > > ---------------C----- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cd3b-0005baffe908f244.journal > > > ---------------C----- user-1000.journal > > > ---------------C----- system.journal > > > > > > The output above means that the last 6 files are "pending" for a de-fragmentation. When these will be > > > "closed", the NOCOW flag will be removed and a defragmentation will start. > > > > Wait what? > > > > > Now my journals have few (2 or 3 extents). But I saw cases where the extents > > > of the more recent files are hundreds, but after few "journalct --rotate" the older files become less > > > fragmented. > > > > > > [1] https://github.com/systemd/systemd/blob/fee6441601c979165ebcbb35472036439f8dad5f/src/libsystemd/sd-journal/journal-file.c#L383 > > > > That line doesn't work, and systemd ignores the error. > > > > The NOCOW flag cannot be set or cleared unless the file is empty. > > This is checked in btrfs_ioctl_setflags. > > > > This is not something that can be changed easily--if the NOCOW bit is > > cleared on a non-empty file, btrfs data read code will expect csums > > that aren't present on disk because they were written while the file was > > NODATASUM, and the reads will fail pretty badly. 
The entire file would > > have to have csums added or removed at the same time as the flag change > > (or all nodatacow file reads take a performance hit looking for csums > > that may or may not be present). > > > > At file close, the systemd should copy the data to a new file with no > > special attributes and discard or recycle the old inode. This copy > > will be mostly contiguous and have desirable properties like csums and > > compression, and will have iops equivalent to btrfs fi defrag. Journals implement their own checksumming. Yeah, if there's corruption, Btrfs raid can't do a transparent fixup. But the whole journal isn't lost, just the affected record. *shrug* I think if (a) nodatacow and/or (b) SSD, just leave it alone. Why add more writes? In particular the nodatacow case where I'm seeing consistently the file made from multiples of 8MB contiguous blocks, even on HDD the seek latency here can't be worth defragging the file. I think defrag makes sense for (a) datacow journals, i.e. when the default nodatacow is inhibited, and (b) HDD. In that case the fragmentation is quite considerable, hundreds to thousands of extents. It's sufficiently bad that it'd probably be better if they were defragmented automatically with a trigger that tests for number of non-contiguous small blocks that somehow cheaply estimates latency reading all of them. Since the files are interleaved, doing something like "systemctl status dbus" might actually read many blocks even if the result isn't a whole heck of a lot of visible data. But on SSD, cow or nocow, and HDD nocow - I think just leave them alone. -- Chris Murphy ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: is BTRFS_IOC_DEFRAG behavior optimal? 2021-02-11 3:39 ` Chris Murphy @ 2021-02-11 6:12 ` Zygo Blaxell 2021-02-11 8:46 ` Chris Murphy 0 siblings, 1 reply; 19+ messages in thread From: Zygo Blaxell @ 2021-02-11 6:12 UTC (permalink / raw) To: Chris Murphy; +Cc: Goffredo Baroncelli, Btrfs BTRFS On Wed, Feb 10, 2021 at 08:39:12PM -0700, Chris Murphy wrote: > On Wed, Feb 10, 2021 at 8:13 PM Zygo Blaxell > <ce3g8jdj@umail.furryterror.org> wrote: > > > > Sorry, I busted my mail client. That was from me. :-P > > > > On Wed, Feb 10, 2021 at 10:08:37PM -0500, kreijack@inwind.it wrote: > > > On Wed, Feb 10, 2021 at 08:14:09PM +0100, Goffredo Baroncelli wrote: > > > > Hi Chris, > > > > > > > > it seems that systemd-journald is more smart/complex than I thought: > > > > > > > > 1) systemd-journald set the "live" journal as NOCOW; *when* (see below) it > > > > closes the files, it mark again these as COW then defrag [1] > > > > > > > > 2) looking at the code, I suspect that systemd-journald closes the > > > > file asynchronously [2]. This means that looking at the "live" journal > > > > is not sufficient. In fact: > > > > > > > > /var/log/journal/e84907d099904117b355a99c98378dca$ sudo lsattr $(ls -rt *) > > > > [...] 
> > > > --------------------- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd4f-0005baed61106a18.journal > > > > --------------------- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000bd64-0005baed659feff4.journal > > > > --------------------- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd67-0005baed65a0901f.journal > > > > ---------------C----- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cc63-0005bafed4f12f0a.journal > > > > ---------------C----- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cc85-0005baff0ce27e49.journal > > > > ---------------C----- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cd38-0005baffe9080b4d.journal > > > > ---------------C----- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cd3b-0005baffe908f244.journal > > > > ---------------C----- user-1000.journal > > > > ---------------C----- system.journal > > > > > > > > The output above means that the last 6 files are "pending" for a de-fragmentation. When these will be > > > > "closed", the NOCOW flag will be removed and a defragmentation will start. > > > > > > Wait what? > > > > > > > Now my journals have few (2 or 3 extents). But I saw cases where the extents > > > > of the more recent files are hundreds, but after few "journalct --rotate" the older files become less > > > > fragmented. > > > > > > > > [1] https://github.com/systemd/systemd/blob/fee6441601c979165ebcbb35472036439f8dad5f/src/libsystemd/sd-journal/journal-file.c#L383 > > > > > > That line doesn't work, and systemd ignores the error. > > > > > > The NOCOW flag cannot be set or cleared unless the file is empty. > > > This is checked in btrfs_ioctl_setflags. > > > > > > This is not something that can be changed easily--if the NOCOW bit is > > > cleared on a non-empty file, btrfs data read code will expect csums > > > that aren't present on disk because they were written while the file was > > > NODATASUM, and the reads will fail pretty badly. 
The entire file would > > > have to have csums added or removed at the same time as the flag change > > > (or all nodatacow file reads take a performance hit looking for csums > > > that may or may not be present). > > > > > > At file close, the systemd should copy the data to a new file with no > > > special attributes and discard or recycle the old inode. This copy > > > will be mostly contiguous and have desirable properties like csums and > > > compression, and will have iops equivalent to btrfs fi defrag. > > Journals implement their own checksumming. Yeah, Lennart said the same thing six years ago. I'm using btrfs data csums to detect disk failures (the most important benefit being that we can stop buying SSD models where silent data corruption is a problem). On our systems that have systemd journals, the journals are pretty big--10% of the writable media. That's 10% of the media where defects can hide undetected without csums. Checking journal csums with a separate tool is crazy. We used to do that with git and svn and archive files and media files and a hundred database formats with ext4, and it was the equivalent of a full time employee's job trying to figure out where all the chaos was coming from when a bad disk model came through the fleet. Never again. Now btrfs scrub just sends us an email telling us which disk models are garbage, we stop buying them, and now all the hardware that we buy (more than once) just works. If I had to, I'd remove the FS_NOCOW_FL flag support from my kernels to prevent applications from breaking that. > Yeah, if there's > corruption, Btrfs raid can't do a transparent fixup. But the whole > journal isn't lost, just the affected record. *shrug* I think if (a) > nodatacow and/or (b) SSD, just leave it alone. Why add more writes? Well, I'm trying to guess the original intent here. There are comments in the systemd git history talking about getting btrfs features back by turning off nodatacow as systemd closes the journal file. 
We can assume that the existing code turning off the FS_NOCOW_FL bit was intended to restore data csums (which implies at least reading all the data), but nobody noticed it doesn't work. The defrag command that follows implies an intended copy of the data. Though with this code it's hard to tell what's a bug, what's intent, and what's cargo cult programming. If we want the data compressed (and who doesn't? journal data compresses 8:1 with btrfs zstd) then we'll always need to make a copy at close. Because systemd used prealloc, the copy is necessarily to a new inode, as there's no way to re-enable compression on an inode once prealloc is used (this has deep disk-format reasons, but not as deep as the nodatacow ones). If we don't care about compression or datasums, then keep the file nodatacow and do nothing at close. The defrag isn't needed and the FS_NOCOW_FL flag change doesn't work. > In particular the nodatacow case where I'm seeing consistently the > file made from multiples of 8MB contiguous blocks, even on HDD the > seek latency here can't be worth defragging the file. > > I think defrag makes sense for (a) datacow journals, i.e. when the default > nodatacow is inhibited, and (b) HDD. It makes sense for SSD too. It's 4K extents, so the metadata and small-IO overheads will be non-trivial even on SSD. Deleting or truncating datacow journal files will put a lot of tiny free space holes into the filesystem. It will flood the next commit with delayed refs and push up latency. > In that case the fragmentation is > quite considerable, hundreds to thousands of extents. It's > sufficiently bad that it'd probably be better if they were > defragmented automatically with a trigger that tests for number of > non-contiguous small blocks that somehow cheaply estimates latency > reading all of them. Yeah it would be nice if autodefrag could be made to not suck. Even systemd running defrag_range after writing every 128K-512K would be so much better than no defrag at all or autodefrag.
Short bursts of latency, and a small but not unreasonable target extent size. > Since the files are interleaved, doing something > like "systemctl status dbus" might actually read many blocks even if > the result isn't a whole heck of a lot of visible data. > > But on SSD, cow or nocow, and HDD nocow - I think just leave them alone. > > -- > Chris Murphy ^ permalink raw reply [flat|nested] 19+ messages in thread
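The periodic defrag_range call suggested above goes through BTRFS_IOC_DEFRAG_RANGE, which takes a btrfs_ioctl_defrag_range_args struct. A sketch of how userspace would pack it (ioctl number and layout as in the btrfs ioctl ABI; treat the packing as illustrative):

```python
import struct

# _IOW(0x94, 16, 48-byte struct btrfs_ioctl_defrag_range_args)
BTRFS_IOC_DEFRAG_RANGE = 0x40309416
BTRFS_DEFRAG_RANGE_COMPRESS = 1 << 0
BTRFS_DEFRAG_RANGE_START_IO = 1 << 1

def defrag_range_args(start, length, extent_thresh=128 * 1024,
                      compress_type=0, flags=0):
    """Pack struct btrfs_ioctl_defrag_range_args:
    u64 start, u64 len, u64 flags, u32 extent_thresh,
    u32 compress_type, u32 unused[4] -- 48 bytes total."""
    return struct.pack("<QQQII16x", start, length, flags,
                       extent_thresh, compress_type)

# Intended use (on a btrfs fd), roughly:
#   fcntl.ioctl(fd, BTRFS_IOC_DEFRAG_RANGE,
#               defrag_range_args(tail_offset, 128 * 1024))
```

A writer could call this on just the last 128K of the file after each append burst, which is the "short bursts of latency" approach described above.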
* Re: is BTRFS_IOC_DEFRAG behavior optimal? 2021-02-11 6:12 ` Zygo Blaxell @ 2021-02-11 8:46 ` Chris Murphy 2021-02-13 0:16 ` Zygo Blaxell 0 siblings, 1 reply; 19+ messages in thread From: Chris Murphy @ 2021-02-11 8:46 UTC (permalink / raw) To: Zygo Blaxell; +Cc: Chris Murphy, Goffredo Baroncelli, Btrfs BTRFS On Wed, Feb 10, 2021 at 11:12 PM Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote: > > If we want the data compressed (and who doesn't? journal data compresses > 8:1 with btrfs zstd) then we'll always need to make a copy at close. > Because systemd used prealloc, the copy is necessarily to a new inode, > as there's no way to re-enable compression on an inode once prealloc > is used (this has deep disk-format reasons, but not as deep as the > nodatacow ones). Pretty sure sd-journald still fallocates when datacow by touching /etc/tmpfiles.d/journal-nocow.conf And I know for sure those datacow files do compress on rotation. Preallocated datacow might not be so bad if it weren't for that one damn header or indexing block, whatever the proper term is, that sd-journald hammers every time it fsyncs. I don't know if I wanna know what it means to snapshot a datacow file that's prealloc. But in theory if the same blocks weren't all being hammered, a preallocated file shouldn't fragment like hell if each prealloc block gets just one write. > If we don't care about compression or datasums, then keep the file > nodatacow and do nothing at close. The defrag isn't needed and the > FS_NOCOW_FL flag change doesn't work. Agreed. > It makes sense for SSD too. It's 4K extents, so the metadata and small-IO > overheads will be non-trivial even on SSD. Deleting or truncating datacow > journal files will put a lot of tiny free space holes into the filesystem. > It will flood the next commit with delayed refs and push up latency. I haven't seen meaningful latency on a single journal file, datacow and heavily fragmented, on ssd. 
But to test on more than one file at a time I need to revert the defrag commits, and build systemd, and let a bunch of journals accumulate somehow. If I dump too much data artificially to try and mimic aging, I know I will get nowhere near as many of those 4KiB extents. So I dunno. > > > In that case the fragmentation is > > quite considerable, hundreds to thousands of extents. It's > > sufficiently bad that it'd probably be better if they were > > defragmented automatically with a trigger that tests for number of > > non-contiguous small blocks that somehow cheaply estimates latency > > reading all of them. > > Yeah it would be nice if autodefrag could be made to not suck. It triggers on inserts, not appends. So it doesn't do anything for the sd-journald case. I would think the active journals are the ones more likely to get searched for recent events than archived journals. So in the datacow case, you only get relief once it's rotated. It'd be nice to find a decent, not necessarily perfect, way for them to not get so fragmented in the first place. Or just defrag once a file has 16M of non-contiguous extents. Estimating extents though is another issue, especially with compression enabled. -- Chris Murphy ^ permalink raw reply [flat|nested] 19+ messages in thread
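One cheap way to approximate the "how fragmented is it really" trigger discussed above is to parse `filefrag -v` output and count tiny extents. A hypothetical helper (the 64 KiB threshold is arbitrary; the line format matches the pastebin output earlier in the thread, assuming 4 KiB blocks):

```python
def count_small_extents(filefrag_v_output, small_kb=64):
    """Count extents smaller than `small_kb` KiB in `filefrag -v` output.
    Data rows look like:
        5:     1691..    1693:     125477..    125479:      3:
    where the fourth colon-separated field is the length in 4 KiB blocks.
    Returns (small_extents, total_extents)."""
    small = total = 0
    for line in filefrag_v_output.splitlines():
        parts = [p.strip() for p in line.split(":")]
        if len(parts) < 4 or not parts[0].isdigit():
            continue  # skip the header and summary lines
        length_blocks = int(parts[3])
        total += 1
        if length_blocks * 4 < small_kb:
            small += 1
    return small, total
```

A rotation-time policy could then defrag only when `small / total` (or `small * small_kb`) crosses some threshold, leaving the clean 8MB-extent nodatacow files alone.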
* Re: is BTRFS_IOC_DEFRAG behavior optimal? 2021-02-11 8:46 ` Chris Murphy @ 2021-02-13 0:16 ` Zygo Blaxell 0 siblings, 0 replies; 19+ messages in thread From: Zygo Blaxell @ 2021-02-13 0:16 UTC (permalink / raw) To: Chris Murphy; +Cc: Goffredo Baroncelli, Btrfs BTRFS On Thu, Feb 11, 2021 at 01:46:07AM -0700, Chris Murphy wrote: > On Wed, Feb 10, 2021 at 11:12 PM Zygo Blaxell > <ce3g8jdj@umail.furryterror.org> wrote: > > > > > > If we want the data compressed (and who doesn't? journal data compresses > > 8:1 with btrfs zstd) then we'll always need to make a copy at close. > > Because systemd used prealloc, the copy is necessarily to a new inode, > > as there's no way to re-enable compression on an inode once prealloc > > is used (this has deep disk-format reasons, but not as deep as the > > nodatacow ones). > > Pretty sure sd-journald still fallocates when datacow by touching > /etc/tmpfiles.d/journal-nocow.conf Fallocate on datacow just wastes space and CPU time if the application is not doing sequential 4K writes with no overwrites (sequential keeps the metadata at bounded size, otherwise it grows too). Datacow takes precedence over fallocate. It works only when you're overwriting a prealloc block with a data block for the first time, and after that it's just datacow with compress disabled and a reference to a big extent that doesn't go away until the last block is overwritten. I think fallocate on datacow should be deprecated and removed from btrfs. Fixing it doesn't seem to be possible without Pyrrhic time or space costs. On the other hand, it does have that one working use case, and I could be convinced to back down if someone shows me one example of an application in the wild that is using fallocate + datacow on btrfs correctly. > And I know for sure those datacow files do compress on rotation. 
Hmmm...OK, I missed that defrag can force compression in a prealloc file because it bypasses the inode check for prealloc (same for reflinks, you can reflink a compressed extent into a prealloc file if you wrote the extent in a non-prealloc file). It is only normal compressed writes directly to the inode that are blocked by prealloc. So we can keep the inode, but compression still only happens by making a copy of all the data with defrag. If the data is still in page cache then we can skip the read, at least. > Preallocated datacow might not be so bad if it weren't for that one > damn header or indexing block, whatever the proper term is, that > sd-journald hammers every time it fsyncs. It seems to write every other block more than once too. > I don't know if I wanna know > what it means to snapshot a datacow file that's prealloc. The first subvol to write to the prealloc data blocks gets to write in-place. All others get datacow, just like nodatacow files when they have a reflink. It is basically the same as the nodatacow extent-sharing check, except competing prealloc refs can be ignored (they will read as zero, and if they are written they will do their own extent-sharing check and notice they have lost the race to use the allocated block). > But in > theory if the same blocks weren't all being hammered, a preallocated > file shouldn't fragment like hell if each prealloc block gets just one > write. That is the key, each block must have only one 4K write, ever. Writing 2x adjacent 2K blocks seems to count as 2 writes even if they are 4K aligned and there is no flush or commit in between. > > If we don't care about compression or datasums, then keep the file > > nodatacow and do nothing at close. The defrag isn't needed and the > > FS_NOCOW_FL flag change doesn't work. > > Agreed. > > > > It makes sense for SSD too. It's 4K extents, so the metadata and small-IO > > overheads will be non-trivial even on SSD. 
Deleting or truncating datacow > > journal files will put a lot of tiny free space holes into the filesystem. > > It will flood the next commit with delayed refs and push up latency. > > I haven't seen meaningful latency on a single journal file, datacow > and heavily fragmented, on ssd. Someone pushed back last time I proposed simply letting datacow be datacow, citing high latency on NVME devices. I'm not sure what "meaningful" latency is...journalctl takes a crazy long time to start up compared to, say, 'tail -F' or 'less'. I've always assumed journald's file format was an interim thing that would have been deprecated and replaced years ago (you know you've failed to design a file format when 'less' is winning races against you). I never started using it, so I've never investigated what's really wrong with it (or what compelling advantage offsets the problems it seems to have). > But to test on more than one file at a > time I need to revert the defrag commits, and build systemd, and let a > bunch of journals accumulate somehow. If I dump too much data > artificially to try and mimic aging, I know I will get nowhere near as > many of those 4KiB extents. So I dunno. Something like:

    while :; do
        date > /dev/kmsg
        date >> logfile
        sync logfile
    done

should be the worst case for both journald and a plaintext logfile. Maybe needs a 'sleep 1' to space things out for journald. > > > In that case the fragmentation is > > > quite considerable, hundreds to thousands of extents. It's > > > sufficiently bad that it'd probably be better if they were > > > defragmented automatically with a trigger that tests for number of > > > non-contiguous small blocks that somehow cheaply estimates latency > > > reading all of them. > > > > Yeah it would be nice if autodefrag could be made to not suck. > > It triggers on inserts, not appends. So it doesn't do anything for the > sd-journald case.
Appends are probably where autodefrag is most useful, and also cheapest (the cold data is more likely to still be in page cache for appends than it is for mid-file inserts), and also really common (lots of programs have log files). It would be nice if autodefrag could be configured to do those and nothing else--I might even be able to use it then. > I would think the active journals are the ones more likely to get > searched for recent events than archived journals. So in the datacow > case, you only get relief once it's rotated. It'd be nice to find a > decent, not necessarily perfect, way for them to not get so fragmented > in the first place. Or just defrag once a file has 16M of > non-contiguous extents. Or run defrag_range on the tail of the file every time the file grows by 128K. Huge extents aren't required to get OK performance, we only need to avoid tiny extents because they are cripplingly slow. 64K is almost an OOM better than 4K for sequential reading over SATA. 128K isn't much bigger and would line up nicely with compressed extent size. > Estimating extents though is another issue, especially with compression enabled. Shouldn't be necessary. Either it's nodatacow and the extent sizes are all 8M (or whatever size you requested in fallocate), or it's datacow and the extent size is always 4K (or you have truly huge journal data volumes and none of this matters because even datacow will give good extent sizes on a firehose of data). There will not be compression if there are no 8K single-commit writes (have to save at least 4K per write, or btrfs won't be able to compress). > > -- > Chris Murphy ^ permalink raw reply [flat|nested] 19+ messages in thread
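The "each 4K block written exactly once" condition discussed earlier in this message can be enforced in userspace by buffering appends to block boundaries, so prealloc/nodatacow space is consumed one whole block per write. A hypothetical sketch, not journald code:

```python
import os

BLOCK = 4096

class BlockAppender:
    """Buffer appends so the underlying file only ever sees whole,
    aligned 4 KiB pwrites, each block written exactly once -- the
    condition under which a fallocated file stays unfragmented."""

    def __init__(self, fd):
        self.fd = fd
        self.buf = bytearray()
        self.written = 0  # byte offset of the next block to write

    def append(self, data):
        self.buf += data
        while len(self.buf) >= BLOCK:
            os.pwrite(self.fd, bytes(self.buf[:BLOCK]), self.written)
            self.written += BLOCK
            del self.buf[:BLOCK]

    def close(self):
        if self.buf:  # final short block: pad so it is still one write
            os.pwrite(self.fd, bytes(self.buf).ljust(BLOCK, b"\0"),
                      self.written)
            self.written += BLOCK
            self.buf.clear()
```

The cost is that a record straddling a block boundary is not durable until the block fills (or close pads it), which is the usual trade-off against journald's hammer-the-header fsync pattern.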
* Re: is BTRFS_IOC_DEFRAG behavior optimal? 2021-02-11 3:13 ` Zygo Blaxell 2021-02-11 3:39 ` Chris Murphy @ 2021-02-11 3:52 ` Chris Murphy 1 sibling, 0 replies; 19+ messages in thread From: Chris Murphy @ 2021-02-11 3:52 UTC (permalink / raw) To: Zygo Blaxell; +Cc: Goffredo Baroncelli, Chris Murphy, Btrfs BTRFS On Wed, Feb 10, 2021 at 8:13 PM Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote: > > > At file close, the systemd should copy the data to a new file with no > > special attributes and discard or recycle the old inode. This copy > > will be mostly contiguous and have desirable properties like csums and > > compression, and will have iops equivalent to btrfs fi defrag. Or switch to a cow-friendly format that's no worse on overwriting file systems, but improves things on Btrfs and ZFS. RocksDB does well. -- Chris Murphy ^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2021-02-13 0:17 UTC | newest] Thread overview: 19+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2021-02-07 22:06 is BTRFS_IOC_DEFRAG behavior optimal? Chris Murphy 2021-02-08 22:11 ` Goffredo Baroncelli 2021-02-08 22:21 ` Zygo Blaxell 2021-02-09 1:05 ` Chris Murphy 2021-02-09 0:42 ` Chris Murphy 2021-02-09 18:13 ` Goffredo Baroncelli 2021-02-09 19:01 ` Chris Murphy 2021-02-09 19:45 ` Goffredo Baroncelli 2021-02-09 20:26 ` Chris Murphy 2021-02-10 6:37 ` Chris Murphy 2021-02-10 19:14 ` Goffredo Baroncelli 2021-02-11 0:19 ` Chris Murphy 2021-02-11 3:08 ` kreijack 2021-02-11 3:13 ` Zygo Blaxell 2021-02-11 3:39 ` Chris Murphy 2021-02-11 6:12 ` Zygo Blaxell 2021-02-11 8:46 ` Chris Murphy 2021-02-13 0:16 ` Zygo Blaxell 2021-02-11 3:52 ` Chris Murphy
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).