* is BTRFS_IOC_DEFRAG behavior optimal?

From: Chris Murphy @ 2021-02-07 22:06 UTC
To: Btrfs BTRFS

systemd-journald journals on Btrfs default to nodatacow; upon log
rotation a journal is submitted for defragmenting with BTRFS_IOC_DEFRAG.
The result looks curious, and I can't tell what the logic is from the
results.

The journal file starts out fallocated with a size of 8MB, and as it
grows it is extended in 8MB increments, also fallocated. This leads to a
filefrag -v that looks like this (ext4 and btrfs nodatacow follow the
same behavior; both are provided for reference):

ext4
https://pastebin.com/6vuufwXt

btrfs
https://pastebin.com/Y18B2m4h

Following defragment with BTRFS_IOC_DEFRAG it looks like this:
https://pastebin.com/1ufErVMs

At first glance it appears significantly more fragmented. Closer
inspection shows that most of the extents weren't relocated. But what's
up with the peculiar interleaving? Is this an improvement over the
original allocation?

If I unwind the interleaving, it looks like all the extents fall into
two localities, and within each locality the extents aren't that far
apart - so my guess is that this file is also not meaningfully
fragmented in practice. Surely the drive firmware will reorder the reads
to arrive at the fewest seeks?

-- 
Chris Murphy
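[Editor's note: the "two localities" intuition above can be made concrete. The sketch below is purely illustrative - it is not anything btrfs or journald does - and the extent tuples are hypothetical numbers in the spirit of the pasted filefrag output: group extents by physical proximity and count the clusters.]

```python
def localities(extents, gap_blocks=1_000_000):
    """Group (logical, physical, length) extents into clusters of
    physically nearby blocks; an extent whose physical start lies within
    gap_blocks of the previous cluster's end joins that cluster."""
    clusters = []  # list of [lo, hi] physical block ranges
    for _, phys, length in sorted(extents, key=lambda e: e[1]):
        if clusters and phys - clusters[-1][1] <= gap_blocks:
            clusters[-1][1] = max(clusters[-1][1], phys + length - 1)
        else:
            clusters.append([phys, phys + length - 1])
    return clusters

# Hypothetical extents: most near physical block ~1.6M, one far away
# near ~76M, mirroring the shape of the pastebin listings.
ext = [(0, 1597171, 1), (1, 1601255, 8), (9, 1648394, 232), (86, 76053306, 182)]
print(len(localities(ext)))  # -> 2
```

A file whose extents collapse into a couple of such clusters is "interleaved" on paper but cheap to read in practice, which is the point being argued.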
* Re: is BTRFS_IOC_DEFRAG behavior optimal?

From: Goffredo Baroncelli @ 2021-02-08 22:11 UTC
To: Chris Murphy, Btrfs BTRFS

On 2/7/21 11:06 PM, Chris Murphy wrote:
> systemd-journald journals on Btrfs default to nodatacow, upon log
> rotation it's submitted for defragmenting with BTRFS_IOC_DEFRAG. The
> result looks curious. I can't tell what the logic is from the results.
> [...]
> It appears at first glance to be significantly more fragmented. Closer
> inspection shows that most of the extents weren't relocated. But
> what's up with the peculiar interleaving? Is this an improvement over
> the original allocation?

I am not sure how to read the filefrag output: I see several lines like

[...]
   5: 1691.. 1693: 125477.. 125479: 3:
   6: 1694.. 1694: 125480.. 125480: 1: unwritten
[...]

What does "unwritten" mean? The kernel documentation [*] says:

  * FIEMAP_EXTENT_UNWRITTEN
    Unwritten extent - the extent is allocated but its data has not been
    initialized. This indicates the extent's data will be all zero if read
    through the filesystem but the contents are undefined if read directly
    from the device.

So it seems that the data didn't touch the platters (!)
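[Editor's note: the "unwritten" string filefrag prints is decoded from the per-extent flags word FIEMAP returns. A small sketch, with the bit values copied from include/uapi/linux/fiemap.h (the document cited above); the name selection here is illustrative, roughly what filefrag does.]

```python
# Per-extent flag bits from include/uapi/linux/fiemap.h.
FIEMAP_FLAG_NAMES = {
    0x0001: "last",       # FIEMAP_EXTENT_LAST
    0x0004: "delalloc",   # FIEMAP_EXTENT_DELALLOC
    0x0800: "unwritten",  # FIEMAP_EXTENT_UNWRITTEN: allocated, not initialized
    0x1000: "merged",     # FIEMAP_EXTENT_MERGED
    0x2000: "shared",     # FIEMAP_EXTENT_SHARED
}

def decode_fiemap_flags(flags):
    """Return the names of the known flag bits set in a FIEMAP extent."""
    return [name for bit, name in sorted(FIEMAP_FLAG_NAMES.items()) if flags & bit]

print(decode_fiemap_flags(0x0800))  # the extent in question: ['unwritten']
```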
My educated guess is that there is something strange in the sequence:
- write
- sync
- close log
- move log
- defrag log

Maybe the defrag starts before all the data reaches the platters?

For what it's worth, I created a file with the same fragmentation as yours:

$ sudo filefrag -v data.txt
Filesystem type is: 9123683e
File size of data.txt is 25165824 (6144 blocks of 4096 bytes)
 ext: logical_offset: physical_offset: length: expected: flags:
   0: 0.. 0: 1597171.. 1597171: 1:
   1: 1.. 1599: 163433285.. 163434883: 1599: 1597172:
   2: 1600.. 1607: 1601255.. 1601262: 8: 163434884:
   3: 1608.. 1689: 1604137.. 1604218: 82: 1601263:
   4: 1690.. 1690: 1597484.. 1597484: 1: 1604219:
   5: 1691.. 1693: 1597465.. 1597467: 3: 1597485:
   6: 1694.. 1694: 1597966.. 1597966: 1: 1597468:
   7: 1695.. 1722: 1599557.. 1599584: 28: 1597967:
   8: 1723.. 1723: 1599211.. 1599211: 1: 1599585:
   9: 1724.. 1955: 1648394.. 1648625: 232: 1599212:
  10: 1956.. 1956: 1599695.. 1599695: 1: 1648626:
  11: 1957.. 2047: 1625881.. 1625971: 91: 1599696:
  12: 2048.. 2417: 1648804.. 1649173: 370: 1625972:
  13: 2418.. 2420: 1597468.. 1597470: 3: 1649174:
  14: 2421.. 2478: 1624667.. 1624724: 58: 1597471:
  15: 2479.. 2479: 1596416.. 1596416: 1: 1624725:
  16: 2480.. 2482: 1601045.. 1601047: 3: 1596417:
  17: 2483.. 2483: 1596854.. 1596854: 1: 1601048:
  18: 2484.. 2523: 1602715.. 1602754: 40: 1596855:
  19: 2524.. 2527: 1597471.. 1597474: 4: 1602755:
  20: 2528.. 2598: 1624725.. 1624795: 71: 1597475:
  21: 2599.. 2599: 1596858.. 1596858: 1: 1624796:
  22: 2600.. 2607: 1601263.. 1601270: 8: 1596859:
  23: 2608.. 2608: 1596863.. 1596863: 1: 1601271:
  24: 2609.. 2611: 1601271.. 1601273: 3: 1596864:
  25: 2612.. 2612: 1596864.. 1596864: 1: 1601274:
  26: 2613.. 2615: 1601274.. 1601276: 3: 1596865:
  27: 2616.. 2616: 1596981.. 1596981: 1: 1601277:
  28: 2617.. 2691: 1649174.. 1649248: 75: 1596982:
  29: 2692.. 2696: 1597475.. 1597479: 5: 1649249:
  30: 2697.. 2756: 1634995.. 1635054: 60: 1597480:
  31: 2757.. 2758: 1597480.. 1597481: 2: 1635055:
  32: 2759.. 2762: 1601351.. 1601354: 4: 1597482:
  33: 2763.. 2764: 1597482.. 1597483: 2: 1601355:
  34: 2765.. 2837: 1649249.. 1649321: 73: 1597484:
  35: 2838.. 2838: 1597038.. 1597038: 1: 1649322:
  36: 2839.. 2855: 1601538.. 1601554: 17: 1597039:
  37: 2856.. 2856: 1597045.. 1597045: 1: 1601555:
  38: 2857.. 2904: 1624547.. 1624594: 48: 1597046:
  39: 2905.. 2926: 1600795.. 1600816: 22: 1624595:
  40: 2927.. 2942: 1602034.. 1602049: 16: 1600817:
  41: 2943.. 2963: 1600817.. 1600837: 21: 1602050:
  42: 2964.. 2979: 1602183.. 1602198: 16: 1600838:
  43: 2980.. 3001: 1600927.. 1600948: 22: 1602199:
  44: 3002.. 3043: 1621164.. 1621205: 42: 1600949:
  45: 3044.. 3053: 1599231.. 1599240: 10: 1621206:
  46: 3054.. 3066: 1601952.. 1601964: 13: 1599241:
  47: 3067.. 3067: 1597056.. 1597056: 1: 1601965:
  48: 3068.. 3084: 1602375.. 1602391: 17: 1597057:
  49: 3085.. 3094: 1599290.. 1599299: 10: 1602392:
  50: 3095.. 3096: 1601355.. 1601356: 2: 1599300:
  51: 3097.. 3107: 1600717.. 1600727: 11: 1601357:
  52: 3108.. 3156: 1642892.. 1642940: 49: 1600728:
  53: 3157.. 3157: 1597059.. 1597059: 1: 1642941:
  54: 3158.. 3251: 1649322.. 1649415: 94: 1597060:
  55: 3252.. 3254: 1599241.. 1599243: 3: 1649416:
  56: 3255.. 3304: 1645466.. 1645515: 50: 1599244:
  57: 3305.. 3305: 1597100.. 1597100: 1: 1645516:
  58: 3306.. 3312: 1601357.. 1601363: 7: 1597101:
  59: 3313.. 3319: 1599300.. 1599306: 7: 1601364:
  60: 3320.. 3331: 1601611.. 1601622: 12: 1599307:
  61: 3332.. 3339: 1600838.. 1600845: 8: 1601623:
  62: 3340.. 3343: 1601419.. 1601422: 4: 1600846:
  63: 3344.. 3351: 1600846.. 1600853: 8: 1601423:
  64: 3352.. 3432: 1649416.. 1649496: 81: 1600854:
  65: 3433.. 3433: 1597109.. 1597109: 1: 1649497:
  66: 3434.. 3489: 1649497.. 1649552: 56: 1597110:
  67: 3490.. 3491: 1599227.. 1599228: 2: 1649553:
  68: 3492.. 3521: 1619348.. 1619377: 30: 1599229:
  69: 3522.. 3523: 1599307.. 1599308: 2: 1619378:
  70: 3524.. 3530: 1601688.. 1601694: 7: 1599309:
  71: 3531.. 3539: 1600949.. 1600957: 9: 1601695:
  72: 3540.. 3579: 1629356.. 1629395: 40: 1600958:
  73: 3580.. 3580: 1597124.. 1597124: 1: 1629396:
  74: 3581.. 3601: 1604219.. 1604239: 21: 1597125:
  75: 3602.. 3603: 1599585.. 1599586: 2: 1604240:
  76: 3604.. 3614: 1602636.. 1602646: 11: 1599587:
  77: 3615.. 3616: 1599587.. 1599588: 2: 1602647:
  78: 3617.. 3677: 1649553.. 1649613: 61: 1599589:
  79: 3678.. 3680: 1599692.. 1599694: 3: 1649614:
  80: 3681.. 3723: 1647818.. 1647860: 43: 1599695:
  81: 3724.. 3726: 1599821.. 1599823: 3: 1647861:
  82: 3727.. 3756: 1622218.. 1622247: 30: 1599824:
  83: 3757.. 3759: 1600630.. 1600632: 3: 1622248:
  84: 3760.. 3766: 1603288.. 1603294: 7: 1600633:
  85: 3767.. 3768: 1600633.. 1600634: 2: 1603295:
  86: 3769.. 3950: 76053306.. 76053487: 182: 1600635:
  87: 3951.. 3958: 1600958.. 1600965: 8: 76053488:
  88: 3959.. 3986: 1619921.. 1619948: 28: 1600966:
  89: 3987.. 3995: 1600966.. 1600974: 9: 1619949:
  90: 3996.. 4036: 1649614.. 1649654: 41: 1600975:
  91: 4037.. 4045: 1600975.. 1600983: 9: 1649655:
  92: 4046.. 4050: 1601423.. 1601427: 5: 1600984:
  93: 4051.. 4052: 1600854.. 1600855: 2: 1601428:
  94: 4053.. 4055: 1601555.. 1601557: 3: 1600856:
  95: 4056.. 4056: 1597129.. 1597129: 1: 1601558:
  96: 4057.. 4059: 1601745.. 1601747: 3: 1597130:
  97: 4060.. 4060: 1597134.. 1597134: 1: 1601748:
  98: 4061.. 4063: 1602050.. 1602052: 3: 1597135:
  99: 4064.. 4064: 1597137.. 1597137: 1: 1602053:
 100: 4065.. 4079: 1604297.. 1604311: 15: 1597138:
 101: 4080.. 4088: 1600987.. 1600995: 9: 1604312:
 102: 4089.. 4095: 1603295.. 1603301: 7: 1600996:
 103: 4096.. 4106: 1600996.. 1601006: 11: 1603302:
 104: 4107.. 4117: 1622600.. 1622610: 11: 1601007:
 105: 4118.. 4119: 1601007.. 1601008: 2: 1622611:
 106: 4120.. 4129: 1622611.. 1622620: 10: 1601009:
 107: 4130.. 4131: 1601009.. 1601010: 2: 1622621:
 108: 4132.. 4141: 1622621.. 1622630: 10: 1601011:
 109: 4142.. 4145: 1601011.. 1601014: 4: 1622631:
 110: 4146.. 4155: 1622986.. 1622995: 10: 1601015:
 111: 4156.. 4157: 1601015.. 1601016: 2: 1622996:
 112: 4158.. 4168: 1622996.. 1623006: 11: 1601017:
 113: 4169.. 4170: 1601017.. 1601018: 2: 1623007:
 114: 4171.. 4180: 1623007.. 1623016: 10: 1601019:
 115: 4181.. 4182: 1601019.. 1601020: 2: 1623017:
 116: 4183.. 4192: 1624473.. 1624482: 10: 1601021:
 117: 4193.. 4195: 1601021.. 1601023: 3: 1624483:
 118: 4196.. 4205: 1624796.. 1624805: 10: 1601024:
 119: 4206.. 4207: 1601024.. 1601025: 2: 1624806:
 120: 4208.. 4217: 1624806.. 1624815: 10: 1601026:
 121: 4218.. 4220: 1601026.. 1601028: 3: 1624816:
 122: 4221.. 4230: 1625972.. 1625981: 10: 1601029:
 123: 4231.. 4408: 1648626.. 1648803: 178: 1625982:
 124: 4409.. 4411: 1602199.. 1602201: 3: 1648804:
 125: 4412.. 4434: 1601328.. 1601350: 23: 1602202:
 126: 4435.. 4437: 1602647.. 1602649: 3: 1601351:
 127: 4438.. 4439: 1601029.. 1601030: 2: 1602650:
 128: 4440.. 4442: 1602755.. 1602757: 3: 1601031:
 129: 4443.. 4480: 1601650.. 1601687: 38: 1602758:
 130: 4481.. 4491: 1629530.. 1629540: 11: 1601688:
 131: 4492.. 4560: 1624404.. 1624472: 69: 1629541:
 132: 4561.. 4571: 1629541.. 1629551: 11: 1624473:
 133: 4572.. 4582: 1601031.. 1601041: 11: 1629552:
 134: 4583.. 4586: 1603302.. 1603305: 4: 1601042:
 135: 4587.. 4620: 1602537.. 1602570: 34: 1603306:
 136: 4621.. 4631: 1629716.. 1629726: 11: 1602571:
 137: 4632.. 4634: 1601042.. 1601044: 3: 1629727:
 138: 4635.. 6143: 156004864.. 156006372: 1509: 1601045: last,eof
data.txt: 139 extents found

Then I tried to defrag it:

$ btrfs fi defrag data.txt
$ sudo filefrag -v data.txt
Filesystem type is: 9123683e
File size of data.txt is 25165824 (6144 blocks of 4096 bytes)
 ext: logical_offset: physical_offset: length: expected: flags:
   0: 0.. 6143: 164002967.. 164009110: 6144: last,eof
data.txt: 1 extent found

So it seems that the defrag works.

[*] https://www.kernel.org/doc/Documentation/filesystems/fiemap.txt

> https://pastebin.com/1ufErVMs
>
> If I unwind the interleaving, it looks like all the extents fall into
> two localities and within each locality the extents aren't that far
> apart - so my guess is that this file is also not meaningfully
> fragmented, in practice. Surely the drive firmware will reorder the
> reads to arrive at the least amount of seeks?

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
* Re: is BTRFS_IOC_DEFRAG behavior optimal?

From: Zygo Blaxell @ 2021-02-08 22:21 UTC
To: kreijack; +Cc: Chris Murphy, Btrfs BTRFS

On Mon, Feb 08, 2021 at 11:11:47PM +0100, Goffredo Baroncelli wrote:
> On 2/7/21 11:06 PM, Chris Murphy wrote:
> > systemd-journald journals on Btrfs default to nodatacow, upon log
> > rotation it's submitted for defragmenting with BTRFS_IOC_DEFRAG. The
> > result looks curious. I can't tell what the logic is from the results.
> > [...]
>
> I am not sure how to read the filefrag output: I see several lines like
> [...]
>    5: 1691.. 1693: 125477.. 125479: 3:
>    6: 1694.. 1694: 125480.. 125480: 1: unwritten
> [...]
>
> What does "unwritten" mean? The kernel documentation [*] says:
> [...]
> So it seems that the data didn't touch the platters (!)
> My educated guess is that there is something strange in the sequence:
> - write
> - sync
> - close log
> - move log
> - defrag log
>
> Maybe the defrag starts before all the data reaches the platters?

defrag will put the file's contents back into delalloc, and it won't be
allocated until a flush (fsync, sync, or commit interval). Defrag is
roughly equivalent to simply copying the data to a new file in btrfs,
except the logical extents are atomically updated to point to the new
location.

FIEMAP has an option flag to sync the data before returning a map.
DEFRAG has an option to start IO immediately, so it will presumably be
done by the time you look at the extents with FIEMAP.

> For what it's worth, I created a file with the same fragmentation as yours:
>
> $ sudo filefrag -v data.txt
> Filesystem type is: 9123683e
> File size of data.txt is 25165824 (6144 blocks of 4096 bytes)
> [... 139-extent listing trimmed ...]
> data.txt: 139 extents found
>
> Then I tried to defrag it:
>
> $ btrfs fi defrag data.txt
> $ sudo filefrag -v data.txt
> Filesystem type is: 9123683e
> File size of data.txt is 25165824 (6144 blocks of 4096 bytes)
>  ext: logical_offset: physical_offset: length: expected: flags:
>    0: 0.. 6143: 164002967.. 164009110: 6144: last,eof
> data.txt: 1 extent found
>
> So it seems that the defrag works.

Be very careful how you set up this test case. If you use fallocate on
a file, it has a _permanent_ effect on the inode, and alters a lot of
normal btrfs behavior downstream. You won't see these effects if you
just write some data to a file without using prealloc.

> [*] https://www.kernel.org/doc/Documentation/filesystems/fiemap.txt
> [...]
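[Editor's note: the fallocate caveat is easy to see from userspace. A minimal illustration using plain POSIX preallocation (not btrfs-specific): the preallocated range has a real on-disk size but reads back as zeros, which is the "allocated but not initialized" state FIEMAP reports as unwritten.]

```python
import os
import tempfile

# Preallocate an 8 MiB file, like journald's growth unit, then show
# that the unwritten space reads back as zeros through the filesystem.
fd, path = tempfile.mkstemp()
try:
    os.posix_fallocate(fd, 0, 8 * 1024 * 1024)
    assert os.fstat(fd).st_size == 8 * 1024 * 1024  # size is real...
    assert os.pread(fd, 4096, 0) == b"\0" * 4096    # ...but data is all zeros
finally:
    os.close(fd)
    os.unlink(path)
print("ok")
```

On btrfs, `filefrag -v` on such a file would show the still-untouched ranges with the "unwritten" flag until data actually lands in them.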
* Re: is BTRFS_IOC_DEFRAG behavior optimal?

From: Chris Murphy @ 2021-02-09 1:05 UTC
To: Zygo Blaxell; +Cc: Goffredo Baroncelli, Chris Murphy, Btrfs BTRFS

On Mon, Feb 8, 2021 at 3:21 PM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
> defrag will put the file's contents back into delalloc, and it won't be
> allocated until a flush (fsync, sync, or commit interval). Defrag is
> roughly equivalent to simply copying the data to a new file in btrfs,
> except the logical extents are atomically updated to point to the new
> location.

BTRFS_IOC_DEFRAG results:
https://pastebin.com/1ufErVMs

BTRFS_IOC_DEFRAG_RANGE results:
https://pastebin.com/429fZmNB

They're different. Questions: is this a bug? Is it intentional? Does the
interleaved BTRFS_IOC_DEFRAG version improve things over the
non-defragmented file, which had only three 8MB extents for a 24MB file,
plus one 4KiB block? Should BTRFS_IOC_DEFRAG be capable of estimating
fragmentation and just doing a no-op in that case?

> FIEMAP has an option flag to sync the data before returning a map.
> DEFRAG has an option to start IO immediately so it will presumably be
> done by the time you look at the extents with FIEMAP.

I waited for the defrag result to settle, so the results I've posted are
stable.

> Be very careful how you set up this test case. If you use fallocate on
> a file, it has a _permanent_ effect on the inode, and alters a lot of
> normal btrfs behavior downstream. You won't see these effects if you
> just write some data to a file without using prealloc.

OK. That might answer the idempotency question. Following
BTRFS_IOC_DEFRAG, most unwritten extents are no longer present. I can't
figure out the pattern. Some of the archived journals have them, others
have one, but none have the four or more that I see in journals in
active use.

And then when defragged with BTRFS_IOC_DEFRAG_RANGE, none of those have
unwritten extents. Since the file is changing each time it goes through
the ioctl, it makes sense that what comes out the back end is different.

While BTRFS_IOC_DEFRAG_RANGE is a no-op if an extent is bigger than the
-l (len=) value, I can't tell that BTRFS_IOC_DEFRAG has any sort of
no-op unless there are no fragments at all *shrug*.

Maybe they should use BTRFS_IOC_DEFRAG_RANGE and specify an 8MB extent?
Because in the nodatacow case, that's what they already have, and it'd
be a no-op. And then for the datacow case... well, I don't like
unconditional write amplification on SSDs just to satisfy the HDD case.
But it'd be avoidable by just using the default (nodatacow for the
journals).

-- 
Chris Murphy
* Re: is BTRFS_IOC_DEFRAG behavior optimal?

From: Chris Murphy @ 2021-02-09 0:42 UTC
To: Goffredo Baroncelli; +Cc: Chris Murphy, Btrfs BTRFS

On Mon, Feb 8, 2021 at 3:11 PM Goffredo Baroncelli <kreijack@libero.it> wrote:
>
> On 2/7/21 11:06 PM, Chris Murphy wrote:
> > systemd-journald journals on Btrfs default to nodatacow, upon log
> > rotation it's submitted for defragmenting with BTRFS_IOC_DEFRAG. The
> > result looks curious. I can't tell what the logic is from the results.
> > [...]
>
> I am not sure how to read the filefrag output: I see several lines like
> [...]
>    5: 1691.. 1693: 125477.. 125479: 3:
>    6: 1694.. 1694: 125480.. 125480: 1: unwritten
> [...]
>
> What does "unwritten" mean? The kernel documentation [*] says:

My understanding is that it's an extent that's been fallocated but not
yet written to. What I don't know is whether such extents are possibly
tripping up BTRFS_IOC_DEFRAG.
I'm not skilled enough to create a bunch of these journal logs quickly (I'd have to just let a system run and age its own journals, which sucks, it takes forever) and then a small program that runs the same file through BTRFS_IOC_DEFRAG twice to see if it's idempotent. The resulting file after one submission does not have unwritten extents. Another thing I'm not sure of is whether ssd vs nossd affects the defrag results. Or datacow versus nodatacow. Another thing I'm not sure of is if autodefrag is a better solution to the problem. Whereby it acts as a no op when the file is nodatacow, and does the expected thing if it's datacow. But then we'd need an autodefrag xattr to set on the enclosing directory for these journals because there's no reliable way to set autodefrag mount option globally, not knowing all the work loads. It can make some workloads worse. > My educate guess is that there is something strange in the sequence: > - write > - sync > - close log > - move log > - defrag log > > May be the defrag starts before all the data reach the platters ? Perhaps. Attach strace to journald before --rotate, and then --rotate https://pastebin.com/UGihfCG9 > > For what matters, I create a file with the same fragmentation like your one > > $ sudo filefrag -v data.txt > Filesystem type is: 9123683e > File size of data.txt is 25165824 (6144 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 0: 1597171.. 1597171: 1: > 1: 1.. 1599: 163433285.. 163434883: 1599: 1597172: > 2: 1600.. 1607: 1601255.. 1601262: 8: 163434884: > 3: 1608.. 1689: 1604137.. 1604218: 82: 1601263: > 4: 1690.. 1690: 1597484.. 1597484: 1: 1604219: > 5: 1691.. 1693: 1597465.. 1597467: 3: 1597485: > 6: 1694.. 1694: 1597966.. 1597966: 1: 1597468: > 7: 1695.. 1722: 1599557.. 1599584: 28: 1597967: > 8: 1723.. 1723: 1599211.. 1599211: 1: 1599585: > 9: 1724.. 1955: 1648394.. 1648625: 232: 1599212: > 10: 1956.. 1956: 1599695.. 1599695: 1: 1648626: > 11: 1957.. 
2047: 1625881.. 1625971: 91: 1599696: > 12: 2048.. 2417: 1648804.. 1649173: 370: 1625972: > 13: 2418.. 2420: 1597468.. 1597470: 3: 1649174: > 14: 2421.. 2478: 1624667.. 1624724: 58: 1597471: > 15: 2479.. 2479: 1596416.. 1596416: 1: 1624725: > 16: 2480.. 2482: 1601045.. 1601047: 3: 1596417: > 17: 2483.. 2483: 1596854.. 1596854: 1: 1601048: > 18: 2484.. 2523: 1602715.. 1602754: 40: 1596855: > 19: 2524.. 2527: 1597471.. 1597474: 4: 1602755: > 20: 2528.. 2598: 1624725.. 1624795: 71: 1597475: > 21: 2599.. 2599: 1596858.. 1596858: 1: 1624796: > 22: 2600.. 2607: 1601263.. 1601270: 8: 1596859: > 23: 2608.. 2608: 1596863.. 1596863: 1: 1601271: > 24: 2609.. 2611: 1601271.. 1601273: 3: 1596864: > 25: 2612.. 2612: 1596864.. 1596864: 1: 1601274: > 26: 2613.. 2615: 1601274.. 1601276: 3: 1596865: > 27: 2616.. 2616: 1596981.. 1596981: 1: 1601277: > 28: 2617.. 2691: 1649174.. 1649248: 75: 1596982: > 29: 2692.. 2696: 1597475.. 1597479: 5: 1649249: > 30: 2697.. 2756: 1634995.. 1635054: 60: 1597480: > 31: 2757.. 2758: 1597480.. 1597481: 2: 1635055: > 32: 2759.. 2762: 1601351.. 1601354: 4: 1597482: > 33: 2763.. 2764: 1597482.. 1597483: 2: 1601355: > 34: 2765.. 2837: 1649249.. 1649321: 73: 1597484: > 35: 2838.. 2838: 1597038.. 1597038: 1: 1649322: > 36: 2839.. 2855: 1601538.. 1601554: 17: 1597039: > 37: 2856.. 2856: 1597045.. 1597045: 1: 1601555: > 38: 2857.. 2904: 1624547.. 1624594: 48: 1597046: > 39: 2905.. 2926: 1600795.. 1600816: 22: 1624595: > 40: 2927.. 2942: 1602034.. 1602049: 16: 1600817: > 41: 2943.. 2963: 1600817.. 1600837: 21: 1602050: > 42: 2964.. 2979: 1602183.. 1602198: 16: 1600838: > 43: 2980.. 3001: 1600927.. 1600948: 22: 1602199: > 44: 3002.. 3043: 1621164.. 1621205: 42: 1600949: > 45: 3044.. 3053: 1599231.. 1599240: 10: 1621206: > 46: 3054.. 3066: 1601952.. 1601964: 13: 1599241: > 47: 3067.. 3067: 1597056.. 1597056: 1: 1601965: > 48: 3068.. 3084: 1602375.. 1602391: 17: 1597057: > 49: 3085.. 3094: 1599290.. 1599299: 10: 1602392: > 50: 3095.. 3096: 1601355.. 
1601356: 2: 1599300: > 51: 3097.. 3107: 1600717.. 1600727: 11: 1601357: > 52: 3108.. 3156: 1642892.. 1642940: 49: 1600728: > 53: 3157.. 3157: 1597059.. 1597059: 1: 1642941: > 54: 3158.. 3251: 1649322.. 1649415: 94: 1597060: > 55: 3252.. 3254: 1599241.. 1599243: 3: 1649416: > 56: 3255.. 3304: 1645466.. 1645515: 50: 1599244: > 57: 3305.. 3305: 1597100.. 1597100: 1: 1645516: > 58: 3306.. 3312: 1601357.. 1601363: 7: 1597101: > 59: 3313.. 3319: 1599300.. 1599306: 7: 1601364: > 60: 3320.. 3331: 1601611.. 1601622: 12: 1599307: > 61: 3332.. 3339: 1600838.. 1600845: 8: 1601623: > 62: 3340.. 3343: 1601419.. 1601422: 4: 1600846: > 63: 3344.. 3351: 1600846.. 1600853: 8: 1601423: > 64: 3352.. 3432: 1649416.. 1649496: 81: 1600854: > 65: 3433.. 3433: 1597109.. 1597109: 1: 1649497: > 66: 3434.. 3489: 1649497.. 1649552: 56: 1597110: > 67: 3490.. 3491: 1599227.. 1599228: 2: 1649553: > 68: 3492.. 3521: 1619348.. 1619377: 30: 1599229: > 69: 3522.. 3523: 1599307.. 1599308: 2: 1619378: > 70: 3524.. 3530: 1601688.. 1601694: 7: 1599309: > 71: 3531.. 3539: 1600949.. 1600957: 9: 1601695: > 72: 3540.. 3579: 1629356.. 1629395: 40: 1600958: > 73: 3580.. 3580: 1597124.. 1597124: 1: 1629396: > 74: 3581.. 3601: 1604219.. 1604239: 21: 1597125: > 75: 3602.. 3603: 1599585.. 1599586: 2: 1604240: > 76: 3604.. 3614: 1602636.. 1602646: 11: 1599587: > 77: 3615.. 3616: 1599587.. 1599588: 2: 1602647: > 78: 3617.. 3677: 1649553.. 1649613: 61: 1599589: > 79: 3678.. 3680: 1599692.. 1599694: 3: 1649614: > 80: 3681.. 3723: 1647818.. 1647860: 43: 1599695: > 81: 3724.. 3726: 1599821.. 1599823: 3: 1647861: > 82: 3727.. 3756: 1622218.. 1622247: 30: 1599824: > 83: 3757.. 3759: 1600630.. 1600632: 3: 1622248: > 84: 3760.. 3766: 1603288.. 1603294: 7: 1600633: > 85: 3767.. 3768: 1600633.. 1600634: 2: 1603295: > 86: 3769.. 3950: 76053306.. 76053487: 182: 1600635: > 87: 3951.. 3958: 1600958.. 1600965: 8: 76053488: > 88: 3959.. 3986: 1619921.. 1619948: 28: 1600966: > 89: 3987.. 3995: 1600966.. 
1600974: 9: 1619949: > 90: 3996.. 4036: 1649614.. 1649654: 41: 1600975: > 91: 4037.. 4045: 1600975.. 1600983: 9: 1649655: > 92: 4046.. 4050: 1601423.. 1601427: 5: 1600984: > 93: 4051.. 4052: 1600854.. 1600855: 2: 1601428: > 94: 4053.. 4055: 1601555.. 1601557: 3: 1600856: > 95: 4056.. 4056: 1597129.. 1597129: 1: 1601558: > 96: 4057.. 4059: 1601745.. 1601747: 3: 1597130: > 97: 4060.. 4060: 1597134.. 1597134: 1: 1601748: > 98: 4061.. 4063: 1602050.. 1602052: 3: 1597135: > 99: 4064.. 4064: 1597137.. 1597137: 1: 1602053: > 100: 4065.. 4079: 1604297.. 1604311: 15: 1597138: > 101: 4080.. 4088: 1600987.. 1600995: 9: 1604312: > 102: 4089.. 4095: 1603295.. 1603301: 7: 1600996: > 103: 4096.. 4106: 1600996.. 1601006: 11: 1603302: > 104: 4107.. 4117: 1622600.. 1622610: 11: 1601007: > 105: 4118.. 4119: 1601007.. 1601008: 2: 1622611: > 106: 4120.. 4129: 1622611.. 1622620: 10: 1601009: > 107: 4130.. 4131: 1601009.. 1601010: 2: 1622621: > 108: 4132.. 4141: 1622621.. 1622630: 10: 1601011: > 109: 4142.. 4145: 1601011.. 1601014: 4: 1622631: > 110: 4146.. 4155: 1622986.. 1622995: 10: 1601015: > 111: 4156.. 4157: 1601015.. 1601016: 2: 1622996: > 112: 4158.. 4168: 1622996.. 1623006: 11: 1601017: > 113: 4169.. 4170: 1601017.. 1601018: 2: 1623007: > 114: 4171.. 4180: 1623007.. 1623016: 10: 1601019: > 115: 4181.. 4182: 1601019.. 1601020: 2: 1623017: > 116: 4183.. 4192: 1624473.. 1624482: 10: 1601021: > 117: 4193.. 4195: 1601021.. 1601023: 3: 1624483: > 118: 4196.. 4205: 1624796.. 1624805: 10: 1601024: > 119: 4206.. 4207: 1601024.. 1601025: 2: 1624806: > 120: 4208.. 4217: 1624806.. 1624815: 10: 1601026: > 121: 4218.. 4220: 1601026.. 1601028: 3: 1624816: > 122: 4221.. 4230: 1625972.. 1625981: 10: 1601029: > 123: 4231.. 4408: 1648626.. 1648803: 178: 1625982: > 124: 4409.. 4411: 1602199.. 1602201: 3: 1648804: > 125: 4412.. 4434: 1601328.. 1601350: 23: 1602202: > 126: 4435.. 4437: 1602647.. 1602649: 3: 1601351: > 127: 4438.. 4439: 1601029.. 1601030: 2: 1602650: > 128: 4440.. 4442: 1602755.. 
1602757: 3: 1601031: > 129: 4443.. 4480: 1601650.. 1601687: 38: 1602758: > 130: 4481.. 4491: 1629530.. 1629540: 11: 1601688: > 131: 4492.. 4560: 1624404.. 1624472: 69: 1629541: > 132: 4561.. 4571: 1629541.. 1629551: 11: 1624473: > 133: 4572.. 4582: 1601031.. 1601041: 11: 1629552: > 134: 4583.. 4586: 1603302.. 1603305: 4: 1601042: > 135: 4587.. 4620: 1602537.. 1602570: 34: 1603306: > 136: 4621.. 4631: 1629716.. 1629726: 11: 1602571: > 137: 4632.. 4634: 1601042.. 1601044: 3: 1629727: > 138: 4635.. 6143: 156004864.. 156006372: 1509: 1601045: last,eof > data.txt: 139 extents found > > the I tried to defrag it > > $ btrfs fi defra data.txt > $ sudo filefrag -v data.txt > Filesystem type is: 9123683e > File size of data.txt is 25165824 (6144 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 6143: 164002967.. 164009110: 6144: last,eof > data.txt: 1 extent found > > So it seems that the defrag works I get different results between BTRFS_IOC_DEFRAG which is what systemd-journald uses, and BTRFS_IOC_DEFRAG_RANGE which is what 'btrfs fi defrag' is using with a default len of 32M. Another question about BTRFS_IOC_DEFRAG is if it's intended to be minimalist? Does it have a way to estimate fragmentation and just not do anything? Because the journald nodatacow journals are not meaningfully fragmented. They are the same on ext4 and on Btrfs - it's (so far) always 8MB extents, directly related to each fallocate grow of the journal file. This kind of faux-fragmentation I think is minor even on a HDD because it's the same on ext4 and XFS and no one complains there (as far as I'm aware). -- Chris Murphy ^ permalink raw reply [flat|nested] 19+ messages in thread
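For reference, the two ioctls compared above already differ at the ABI level: BTRFS_IOC_DEFRAG takes no arguments (a NULL vol_args pointer), while BTRFS_IOC_DEFRAG_RANGE takes a range/threshold struct, which is where btrfs-progs passes its default 32M len. A rough Python sketch of the request numbers and argument packing (struct layouts as I read them from linux/btrfs.h; the helper names are mine, not journald's or btrfs-progs' actual code):

```python
import struct

def _ioc(direction, ioc_type, nr, size):
    # Linux _IOC() encoding: dir (2 bits) | size (14 bits) | type (8 bits) | nr (8 bits)
    return (direction << 30) | (size << 16) | (ioc_type << 8) | nr

_IOC_WRITE = 1
BTRFS_IOCTL_MAGIC = 0x94

# struct btrfs_ioctl_vol_args: __s64 fd + char name[4088] -> 4096 bytes
BTRFS_IOC_DEFRAG = _ioc(_IOC_WRITE, BTRFS_IOCTL_MAGIC, 2, 4096)

# struct btrfs_ioctl_defrag_range_args:
#   u64 start, u64 len, u64 flags, u32 extent_thresh,
#   u32 compress_type, u32 unused[4] -> 48 bytes
BTRFS_IOC_DEFRAG_RANGE = _ioc(_IOC_WRITE, BTRFS_IOCTL_MAGIC, 16, 48)

def defrag_range_args(start=0, length=32 * 1024 * 1024, extent_thresh=0):
    """Pack defrag_range args; length=32M mirrors the btrfs-progs default."""
    return struct.pack("=QQQII16x", start, length, 0, extent_thresh, 0)

if __name__ == "__main__":
    print(hex(BTRFS_IOC_DEFRAG))        # 0x50009402
    print(hex(BTRFS_IOC_DEFRAG_RANGE))  # 0x40309410
    # On a real btrfs file one would then do something like:
    #   fd = os.open(path, os.O_RDWR)
    #   fcntl.ioctl(fd, BTRFS_IOC_DEFRAG_RANGE, defrag_range_args())
```

So the "knobs" (len, extent_thresh, flags) exist only on the RANGE variant, which may explain why the two ioctls behave differently on the same file.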
* Re: is BTRFS_IOC_DEFRAG behavior optimal?
  2021-02-09  0:42 ` Chris Murphy
@ 2021-02-09 18:13 ` Goffredo Baroncelli
  2021-02-09 19:01   ` Chris Murphy
  0 siblings, 1 reply; 19+ messages in thread
From: Goffredo Baroncelli @ 2021-02-09 18:13 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

On 2/9/21 1:42 AM, Chris Murphy wrote:
> Perhaps. Attach strace to journald before --rotate, and then --rotate
>
> https://pastebin.com/UGihfCG9

I looked at this strace.

Line 115 calls ioctl(<BTRFS-DEFRAG>).
Line 123 calls ioctl(<BTRFS-DEFRAG>).

However, the two descriptors on which the defrag is invoked are never synced first.

I was expecting to see a sync (flushing the data to the platters) and then an
ioctl(<BTRFS-DEFRAG>). That does not seem to happen, judging from the strace.

I wrote a script (see below) which basically:
- creates a fragmented file
- runs filefrag on it
- optionally syncs the file   <-----
- runs btrfs fi defrag on it
- runs filefrag on it

If I don't perform the sync, the defrag is ineffective. But if I sync the
file BEFORE doing the defrag, I get only one extent.

My hypothesis is that the journal log files end up poorly defragmented because they
are not synced first. This could be tested quite easily by putting an fsync() before the
ioctl(<BTRFS_DEFRAG>).

Any thoughts?
Regards
Goffredo

-----
$ cat test.py
import os, time, sys

def create_file(fn):
    """ Create a fragmented file """
    # the data below are from a real case
    data = [(0, 0), (1, 1599), (1600, 1607), (1608, 1689), (1690, 1690),
            (1691, 1693), (1694, 1694), (1695, 1722), (1723, 1723),
            (1724, 1955), (1956, 1956), (1957, 2047), (2048, 2417),
            (2418, 2420), (2421, 2478), (2479, 2479), (2480, 2482),
            (2483, 2483), (2484, 2523), (2524, 2527), (2528, 2598),
            (2599, 2599), (2600, 2607), (2608, 2608), (2609, 2611),
            (2612, 2612), (2613, 2615), (2616, 2616), (2617, 2691),
            (2692, 2696)]
    blocksize = 4096

    f = os.open(fn, os.O_RDWR + os.O_TRUNC + os.O_CREAT)
    os.close(f)

    # write the odd extents, then sync...
    ldata = len(data)
    i = 1
    f = os.open(fn, os.O_RDWR)
    while i < ldata:
        (from_, to_) = data[ldata - i - 1]
        l = (to_ - from_ + 1) * blocksize
        pos = from_ * blocksize
        os.lseek(f, pos, os.SEEK_SET)
        os.write(f, b"X" * l)
        i += 2
    os.fsync(f)
    os.close(f)
    os.system("sync")
    os.system("sync")
    print("sleep 5s")
    #time.sleep(5)
    os.system("sync")
    os.system("sync")

    # ... then write the even extents
    i = 0
    f = os.open(fn, os.O_RDWR)
    while i < ldata:
        (from_, to_) = data[ldata - i - 1]
        l = (to_ - from_ + 1) * blocksize
        pos = from_ * blocksize
        os.lseek(f, pos, os.SEEK_SET)
        os.write(f, b"X" * l)
        i += 2
    os.close(f)

def test_without_sync(fn):
    create_file(fn)
    print("\nCreated fragmented file")
    os.system("sudo filefrag -v " + fn)
    print("\nStart defrag without sync\n", end="")
    os.system("btrfs fi defra " + fn)
    print("End defrag")
    os.system("sync")
    os.system("sync")
    print("End sync")
    os.system("sudo filefrag -v " + fn)

def test_with_sync(fn):
    create_file(fn)
    print("\nCreated fragmented file")
    os.system("sync")
    os.system("sync")
    os.system("sudo filefrag -v " + fn)
    print("\nStart defrag with sync\n", end="")
    os.system("btrfs fi defra " + fn)
    print("End defrag")
    os.system("sync")
    os.system("sync")
    print("End sync")
    os.system("sudo filefrag -v " + fn)

fn = sys.argv[1]
assert(len(fn))
os.system("sudo true") # to start sudo
test_without_sync(fn)
test_with_sync(fn)
-----
$ python3 test.py
/mnt/btrfs-raid1/home/ghigo/data.txt sleep 5s Created fragmented file Filesystem type is: 9123683e File size of /mnt/btrfs-raid1/home/ghigo/data.txt is 11046912 (2697 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 0: 1596416.. 1596416: 1: 1: 1.. 1599: 0.. 1598: 1599: 1596417: unknown_loc,delalloc 2: 1600.. 1607: 1597465.. 1597472: 8: 1599: 3: 1608.. 1689: 0.. 81: 82: 1597473: unknown_loc,delalloc 4: 1690.. 1690: 1596854.. 1596854: 1: 82: 5: 1691.. 1693: 0.. 2: 3: 1596855: unknown_loc,delalloc 6: 1694.. 1694: 1596858.. 1596858: 1: 3: 7: 1695.. 1722: 0.. 27: 28: 1596859: unknown_loc,delalloc 8: 1723.. 1723: 1596863.. 1596863: 1: 28: 9: 1724.. 1955: 0.. 231: 232: 1596864: unknown_loc,delalloc 10: 1956.. 1956: 1596864.. 1596864: 1: 232: 11: 1957.. 2047: 0.. 90: 91: 1596865: unknown_loc,delalloc 12: 2048.. 2417: 1648394.. 1648763: 370: 91: 13: 2418.. 2420: 0.. 2: 3: 1648764: unknown_loc,delalloc 14: 2421.. 2478: 1600795.. 1600852: 58: 3: 15: 2479.. 2479: 0.. 0: 1: 1600853: unknown_loc,delalloc 16: 2480.. 2482: 1597473.. 1597475: 3: 1: 17: 2483.. 2483: 0.. 0: 1: 1597476: unknown_loc,delalloc 18: 2484.. 2523: 1600927.. 1600966: 40: 1: 19: 2524.. 2527: 0.. 3: 4: 1600967: unknown_loc,delalloc 20: 2528.. 2598: 1624667.. 1624737: 71: 4: 21: 2599.. 2599: 0.. 0: 1: 1624738: unknown_loc,delalloc 22: 2600.. 2607: 1597476.. 1597483: 8: 1: 23: 2608.. 2608: 0.. 0: 1: 1597484: unknown_loc,delalloc 24: 2609.. 2611: 1599231.. 1599233: 3: 1: 25: 2612.. 2612: 0.. 0: 1: 1599234: unknown_loc,delalloc 26: 2613.. 2615: 1599234.. 1599236: 3: 1: 27: 2616.. 2616: 0.. 0: 1: 1599237: unknown_loc,delalloc 28: 2617.. 2691: 1624738.. 1624812: 75: 1: 29: 2692.. 2696: 0.. 
4: 5: 1624813: last,unknown_loc,delalloc,eof /mnt/btrfs-raid1/home/ghigo/data.txt: 30 extents found Start defrag without sync End defrag End sync Filesystem type is: 9123683e File size of /mnt/btrfs-raid1/home/ghigo/data.txt is 11046912 (2697 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 0: 1596416.. 1596416: 1: 1: 1.. 1599: 163433285.. 163434883: 1599: 1596417: 2: 1600.. 1607: 1597465.. 1597472: 8: 163434884: 3: 1608.. 1689: 1604137.. 1604218: 82: 1597473: 4: 1690.. 1690: 1596854.. 1596854: 1: 1604219: 5: 1691.. 1693: 1599237.. 1599239: 3: 1596855: 6: 1694.. 1694: 1596858.. 1596858: 1: 1599240: 7: 1695.. 1722: 1599557.. 1599584: 28: 1596859: 8: 1723.. 1723: 1596863.. 1596863: 1: 1599585: 9: 1724.. 1955: 1651669.. 1651900: 232: 1596864: 10: 1956.. 1956: 1596864.. 1596864: 1: 1651901: 11: 1957.. 2047: 1850859.. 1850949: 91: 1596865: 12: 2048.. 2417: 1648394.. 1648763: 370: 1850950: 13: 2418.. 2420: 1599240.. 1599242: 3: 1648764: 14: 2421.. 2478: 1600795.. 1600852: 58: 1599243: 15: 2479.. 2479: 1596981.. 1596981: 1: 1600853: 16: 2480.. 2482: 1597473.. 1597475: 3: 1596982: 17: 2483.. 2483: 1597038.. 1597038: 1: 1597476: 18: 2484.. 2523: 1600927.. 1600966: 40: 1597039: 19: 2524.. 2527: 1599290.. 1599293: 4: 1600967: 20: 2528.. 2598: 1624667.. 1624737: 71: 1599294: 21: 2599.. 2599: 1597045.. 1597045: 1: 1624738: 22: 2600.. 2607: 1597476.. 1597483: 8: 1597046: 23: 2608.. 2608: 1597056.. 1597056: 1: 1597484: 24: 2609.. 2611: 1599231.. 1599233: 3: 1597057: 25: 2612.. 2612: 1597059.. 1597059: 1: 1599234: 26: 2613.. 2615: 1599234.. 1599236: 3: 1597060: 27: 2616.. 2616: 1597100.. 1597100: 1: 1599237: 28: 2617.. 2691: 1624738.. 1624812: 75: 1597101: 29: 2692.. 2696: 1599294.. 
1599298: 5: 1624813: last,eof /mnt/btrfs-raid1/home/ghigo/data.txt: 30 extents found sleep 5s Created fragmented file Filesystem type is: 9123683e File size of /mnt/btrfs-raid1/home/ghigo/data.txt is 11046912 (2697 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 0: 1597109.. 1597109: 1: 1: 1.. 1599: 0.. 1598: 1599: 1597110: unknown_loc,delalloc 2: 1600.. 1607: 1599299.. 1599306: 8: 1599: 3: 1608.. 1689: 0.. 81: 82: 1599307: unknown_loc,delalloc 4: 1690.. 1690: 1597124.. 1597124: 1: 82: 5: 1691.. 1693: 0.. 2: 3: 1597125: unknown_loc,delalloc 6: 1694.. 1694: 1597129.. 1597129: 1: 3: 7: 1695.. 1722: 0.. 27: 28: 1597130: unknown_loc,delalloc 8: 1723.. 1723: 1597134.. 1597134: 1: 28: 9: 1724.. 1955: 0.. 231: 232: 1597135: unknown_loc,delalloc 10: 1956.. 1956: 1597137.. 1597137: 1: 232: 11: 1957.. 2047: 0.. 90: 91: 1597138: unknown_loc,delalloc 12: 2048.. 2417: 88373891.. 88374260: 370: 91: 13: 2418.. 2420: 0.. 2: 3: 88374261: unknown_loc,delalloc 14: 2421.. 2478: 1600987.. 1601044: 58: 3: 15: 2479.. 2479: 0.. 0: 1: 1601045: unknown_loc,delalloc 16: 2480.. 2482: 1599585.. 1599587: 3: 1: 17: 2483.. 2483: 0.. 0: 1: 1599588: unknown_loc,delalloc 18: 2484.. 2523: 1601650.. 1601689: 40: 1: 19: 2524.. 2527: 0.. 3: 4: 1601690: unknown_loc,delalloc 20: 2528.. 2598: 1625881.. 1625951: 71: 4: 21: 2599.. 2599: 0.. 0: 1: 1625952: unknown_loc,delalloc 22: 2600.. 2607: 1600717.. 1600724: 8: 1: 23: 2608.. 2608: 0.. 0: 1: 1600725: unknown_loc,delalloc 24: 2609.. 2611: 1599692.. 1599694: 3: 1: 25: 2612.. 2612: 0.. 0: 1: 1599695: unknown_loc,delalloc 26: 2613.. 2615: 1599821.. 1599823: 3: 1: 27: 2616.. 2616: 0.. 0: 1: 1599824: unknown_loc,delalloc 28: 2617.. 2691: 1629466.. 1629540: 75: 1: 29: 2692.. 2696: 0.. 
4: 5: 1629541: last,unknown_loc,delalloc,eof /mnt/btrfs-raid1/home/ghigo/data.txt: 30 extents found Start defrag with sync End defrag End sync Filesystem type is: 9123683e File size of /mnt/btrfs-raid1/home/ghigo/data.txt is 11046912 (2697 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 2696: 163503187.. 163505883: 2697: last,eof /mnt/btrfs-raid1/home/ghigo/data.txt: 1 extent found ---- -- gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: is BTRFS_IOC_DEFRAG behavior optimal? 2021-02-09 18:13 ` Goffredo Baroncelli @ 2021-02-09 19:01 ` Chris Murphy 2021-02-09 19:45 ` Goffredo Baroncelli 0 siblings, 1 reply; 19+ messages in thread From: Chris Murphy @ 2021-02-09 19:01 UTC (permalink / raw) To: Goffredo Baroncelli; +Cc: Chris Murphy, Btrfs BTRFS On Tue, Feb 9, 2021 at 11:13 AM Goffredo Baroncelli <kreijack@inwind.it> wrote: > > On 2/9/21 1:42 AM, Chris Murphy wrote: > > Perhaps. Attach strace to journald before --rotate, and then --rotate > > > > https://pastebin.com/UGihfCG9 > > I looked to this strace. > > in line 115: it is called a ioctl(<BTRFS-DEFRAG>) > in line 123: it is called a ioctl(<BTRFS-DEFRAG>) > > However the two descriptors for which the defrag is invoked are never sync-ed before. > > I was expecting is to see a sync (flush the data on the platters) and then a > ioctl(<BTRFS-defrag>. This doesn't seems to be looking from the strace. > > I wrote a script (see below) which basically: > - create a fragmented file > - run filefrag on it > - optionally sync the file <----- > - run btrfs fi defrag on it > - run filefrag on it > > If I don't perform the sync, the defrag is ineffective. But if I sync the > file BEFORE doing the defrag, I got only one extent. > Now my hypothesis is: the journal log files are bad de-fragmented because these > are not sync-ed before. > This could be tested quite easily putting an fsync() before the > ioctl(<BTRFS_DEFRAG>). > > Any thought ? No idea. If it's a full sync then it could be expensive on either slower devices or heavier workloads. On the one hand, there's no point of doing an ineffective defrag so maybe the defrag ioctl should just do the sync first? On the other hand, this would effectively make the defrag ioctl a full file system sync which might be unexpected. It's a set of tradeoffs and I don't know what the expectation is. What about fdatasync() on the journal file rather than a full sync? 
-- Chris Murphy ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: is BTRFS_IOC_DEFRAG behavior optimal? 2021-02-09 19:01 ` Chris Murphy @ 2021-02-09 19:45 ` Goffredo Baroncelli 2021-02-09 20:26 ` Chris Murphy 0 siblings, 1 reply; 19+ messages in thread From: Goffredo Baroncelli @ 2021-02-09 19:45 UTC (permalink / raw) To: Chris Murphy; +Cc: Btrfs BTRFS On 2/9/21 8:01 PM, Chris Murphy wrote: > On Tue, Feb 9, 2021 at 11:13 AM Goffredo Baroncelli <kreijack@inwind.it> wrote: >> >> On 2/9/21 1:42 AM, Chris Murphy wrote: >>> Perhaps. Attach strace to journald before --rotate, and then --rotate >>> >>> https://pastebin.com/UGihfCG9 >> >> I looked to this strace. >> >> in line 115: it is called a ioctl(<BTRFS-DEFRAG>) >> in line 123: it is called a ioctl(<BTRFS-DEFRAG>) >> >> However the two descriptors for which the defrag is invoked are never sync-ed before. >> >> I was expecting is to see a sync (flush the data on the platters) and then a >> ioctl(<BTRFS-defrag>. This doesn't seems to be looking from the strace. >> >> I wrote a script (see below) which basically: >> - create a fragmented file >> - run filefrag on it >> - optionally sync the file <----- >> - run btrfs fi defrag on it >> - run filefrag on it >> >> If I don't perform the sync, the defrag is ineffective. But if I sync the >> file BEFORE doing the defrag, I got only one extent. >> Now my hypothesis is: the journal log files are bad de-fragmented because these >> are not sync-ed before. >> This could be tested quite easily putting an fsync() before the >> ioctl(<BTRFS_DEFRAG>). >> >> Any thought ? > > No idea. If it's a full sync then it could be expensive on either > slower devices or heavier workloads. On the one hand, there's no point > of doing an ineffective defrag so maybe the defrag ioctl should just > do the sync first? On the other hand, this would effectively make the > defrag ioctl a full file system sync which might be unexpected. It's a > set of tradeoffs and I don't know what the expectation is. 
>
> What about fdatasync() on the journal file rather than a full sync?

I tried a fsync(2) call, and the result is the same.
Only after reading your reply did I realize that I had used sync(2) when
I meant to use fsync(2).

I updated my python test code:

----
import os, time, sys

def create_file(fn):
    """ Create a fragmented file """
    # the data below are from a real case
    data = [(0, 0), (1, 1599), (1600, 1607), (1608, 1689), (1690, 1690),
            (1691, 1693), (1694, 1694), (1695, 1722), (1723, 1723),
            (1724, 1955), (1956, 1956), (1957, 2047), (2048, 2417),
            (2418, 2420), (2421, 2478), (2479, 2479), (2480, 2482),
            (2483, 2483), (2484, 2523), (2524, 2527), (2528, 2598),
            (2599, 2599), (2600, 2607), (2608, 2608), (2609, 2611),
            (2612, 2612), (2613, 2615), (2616, 2616), (2617, 2691),
            (2692, 2696)]
    blocksize = 4096

    # write the odd extents...
    f = os.open(fn, os.O_RDWR + os.O_TRUNC + os.O_CREAT)
    os.close(f)
    ldata = len(data)
    i = 1
    f = os.open(fn, os.O_RDWR)
    while i < ldata:
        (from_, to_) = data[ldata - i - 1]
        l = (to_ - from_ + 1) * blocksize
        pos = from_ * blocksize
        os.lseek(f, pos, os.SEEK_SET)
        os.write(f, b"X" * l)
        i += 2

    # ... sync and then write the even extents
    os.fsync(f)
    os.close(f)
    i = 0
    f = os.open(fn, os.O_RDWR)
    while i < ldata:
        (from_, to_) = data[ldata - i - 1]
        l = (to_ - from_ + 1) * blocksize
        pos = from_ * blocksize
        os.lseek(f, pos, os.SEEK_SET)
        os.write(f, b"X" * l)
        i += 2
    os.close(f)

def fsync(nf):
    f = os.open(nf, os.O_RDWR)
    os.fsync(f)
    os.close(f)

def test_without_sync(fn):
    create_file(fn)
    print("\nCreated fragmented file")
    os.system("sudo filefrag -v " + fn)
    print("\nStart defrag without sync\n", end="")
    os.system("btrfs fi defra " + fn)
    print("End defrag")
    fsync(fn)
    print("End sync")
    os.system("sudo filefrag -v " + fn)

def test_with_sync(fn):
    create_file(fn)
    print("\nCreated fragmented file")
    fsync(fn)
    os.system("sudo filefrag -v " + fn)
    print("\nStart defrag with sync\n", end="")
    os.system("btrfs fi defra " + fn)
    print("End defrag")
    fsync(fn)
    print("End sync")
    os.system("sudo filefrag -v " + fn)

fn = sys.argv[1]
assert(len(fn))
os.system("sudo true") # to start sudo
test_without_sync(fn)
test_with_sync(fn)
----

--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: is BTRFS_IOC_DEFRAG behavior optimal? 2021-02-09 19:45 ` Goffredo Baroncelli @ 2021-02-09 20:26 ` Chris Murphy 2021-02-10 6:37 ` Chris Murphy 0 siblings, 1 reply; 19+ messages in thread From: Chris Murphy @ 2021-02-09 20:26 UTC (permalink / raw) To: Goffredo Baroncelli; +Cc: Chris Murphy, Btrfs BTRFS On Tue, Feb 9, 2021 at 12:45 PM Goffredo Baroncelli <kreijack@inwind.it> wrote: > > On 2/9/21 8:01 PM, Chris Murphy wrote: > > On Tue, Feb 9, 2021 at 11:13 AM Goffredo Baroncelli <kreijack@inwind.it> wrote: > >> > >> On 2/9/21 1:42 AM, Chris Murphy wrote: > >>> Perhaps. Attach strace to journald before --rotate, and then --rotate > >>> > >>> https://pastebin.com/UGihfCG9 > >> > >> I looked to this strace. > >> > >> in line 115: it is called a ioctl(<BTRFS-DEFRAG>) > >> in line 123: it is called a ioctl(<BTRFS-DEFRAG>) > >> > >> However the two descriptors for which the defrag is invoked are never sync-ed before. > >> > >> I was expecting is to see a sync (flush the data on the platters) and then a > >> ioctl(<BTRFS-defrag>. This doesn't seems to be looking from the strace. > >> > >> I wrote a script (see below) which basically: > >> - create a fragmented file > >> - run filefrag on it > >> - optionally sync the file <----- > >> - run btrfs fi defrag on it > >> - run filefrag on it > >> > >> If I don't perform the sync, the defrag is ineffective. But if I sync the > >> file BEFORE doing the defrag, I got only one extent. > >> Now my hypothesis is: the journal log files are bad de-fragmented because these > >> are not sync-ed before. > >> This could be tested quite easily putting an fsync() before the > >> ioctl(<BTRFS_DEFRAG>). > >> > >> Any thought ? > > > > No idea. If it's a full sync then it could be expensive on either > > slower devices or heavier workloads. On the one hand, there's no point > > of doing an ineffective defrag so maybe the defrag ioctl should just > > do the sync first? 
do the sync first? On the other hand, this would effectively make the
> > defrag ioctl a full file system sync which might be unexpected. It's a
> > set of tradeoffs and I don't know what the expectation is.
> >
> > What about fdatasync() on the journal file rather than a full sync?
>
> I tried a fsync(2) call, and the result is the same.
> Only after reading your reply I realized that I used a sync(2), when
> I meant to use fsync(2).
>
> I updated my python test code

OK, fsync should be the least costly of the three.

The three unique things about systemd-journald that might be factors:

* nodatacow file
* fallocated file in 8MB increments, multiple times, up to 128M
* BTRFS_IOC_DEFRAG, whereas btrfs-progs uses BTRFS_IOC_DEFRAG_RANGE

So maybe it's all explained by the lack of fsync; I'm not sure. But the
commit that added this doesn't show any form of sync.

https://github.com/systemd/systemd/commit/f27a386430cc7a27ebd06899d93310fb3bd4cee7

--
Chris Murphy

^ permalink raw reply	[flat|nested] 19+ messages in thread
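For illustration, the fallocate growth pattern described in this thread can be sketched roughly as follows (the helper and constants are mine, not journald's actual code; the 8MB step and 128M cap are taken from the observations above):

```python
import os

GROW_STEP = 8 * 1024 * 1024      # journals observed growing in 8 MiB steps
MAX_SIZE  = 128 * 1024 * 1024    # ...up to a 128M cap

def grow_journal(fd, needed):
    """fallocate the file up to the next 8 MiB boundary covering `needed` bytes."""
    steps = (needed + GROW_STEP - 1) // GROW_STEP
    new_size = min(steps * GROW_STEP, MAX_SIZE)
    # preallocate without writing; on btrfs this produces the
    # "prealloc" extents visible in filefrag/dump-tree output
    os.posix_fallocate(fd, 0, new_size)
    return new_size
```

Each such preallocation tends to become its own extent, which is why a file grown this way shows one extent per 8MB append even on ext4.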
* Re: is BTRFS_IOC_DEFRAG behavior optimal? 2021-02-09 20:26 ` Chris Murphy @ 2021-02-10 6:37 ` Chris Murphy 2021-02-10 19:14 ` Goffredo Baroncelli 0 siblings, 1 reply; 19+ messages in thread From: Chris Murphy @ 2021-02-10 6:37 UTC (permalink / raw) To: Chris Murphy; +Cc: Goffredo Baroncelli, Btrfs BTRFS This is an active (but idle) system.journal file. That is, it's open but not being written to. I did a sync right before this: https://pastebin.com/jHh5tfpe And then: btrfs fi defrag -l 8M system.journal https://pastebin.com/Kq1GjJuh Looks like most of it was a no op. So it seems btrfs in this case is not confused by so many small extent items, it know they are contiguous? It doesn't answer the question what the "too small" threshold is for BTRFS_IOC_DEFRAG, which is what sd-journald is using, though. Another sync, and then, 'journalctl --rotate' and the resulting archived file is now: https://pastebin.com/aqac0dRj These are not the same results between the two ioctls for the same file, and not the same result as what you get with -l 32M (which I do get if I use the default 32M). The BTRFS_IOC_DEFRAG interleaved result is peculiar, but I don't think we can say it's ineffective, it might be an intentional no op either because it's nodatacow or it sees that these many extents are mostly contiguous and not worth defragmenting (which would be good for keeping write amplification down). So I don't know, maybe it's not wrong. -- Chris Murphy ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: is BTRFS_IOC_DEFRAG behavior optimal?
  2021-02-10  6:37 ` Chris Murphy
@ 2021-02-10 19:14 ` Goffredo Baroncelli
  2021-02-11  0:19   ` Chris Murphy
  2021-02-11  3:08   ` kreijack
  0 siblings, 2 replies; 19+ messages in thread
From: Goffredo Baroncelli @ 2021-02-10 19:14 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

Hi Chris,

it seems that systemd-journald is smarter and more complex than I thought:

1) systemd-journald sets the "live" journal as NOCOW; *when* (see below) it
closes the files, it marks them as COW again and then defrags them [1]

2) looking at the code, I suspect that systemd-journald closes the
file asynchronously [2]. This means that looking at the "live" journal
is not sufficient. In fact:

/var/log/journal/e84907d099904117b355a99c98378dca$ sudo lsattr $(ls -rt *)
[...]
--------------------- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd4f-0005baed61106a18.journal
--------------------- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000bd64-0005baed659feff4.journal
--------------------- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd67-0005baed65a0901f.journal
---------------C----- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cc63-0005bafed4f12f0a.journal
---------------C----- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cc85-0005baff0ce27e49.journal
---------------C----- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cd38-0005baffe9080b4d.journal
---------------C----- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cd3b-0005baffe908f244.journal
---------------C----- user-1000.journal
---------------C----- system.journal

The output above means that the last 6 files are still "pending" defragmentation.
When these are "closed", the NOCOW flag will be removed and a defragmentation will start.

Now my journals have only a few extents (2 or 3). But I saw cases where the more
recent files had hundreds of extents, while after a few "journalctl --rotate" calls the
older files became less fragmented.
[1] https://github.com/systemd/systemd/blob/fee6441601c979165ebcbb35472036439f8dad5f/src/libsystemd/sd-journal/journal-file.c#L383 [2] https://github.com/systemd/systemd/blob/fee6441601c979165ebcbb35472036439f8dad5f/src/libsystemd/sd-journal/journal-file.c#L3687 On 2/10/21 7:37 AM, Chris Murphy wrote: > This is an active (but idle) system.journal file. That is, it's open > but not being written to. I did a sync right before this: > > https://pastebin.com/jHh5tfpe > > And then: btrfs fi defrag -l 8M system.journal > > https://pastebin.com/Kq1GjJuh > > Looks like most of it was a no op. So it seems btrfs in this case is > not confused by so many small extent items, it know they are > contiguous? > > It doesn't answer the question what the "too small" threshold is for > BTRFS_IOC_DEFRAG, which is what sd-journald is using, though. > > Another sync, and then, 'journalctl --rotate' and the resulting > archived file is now: > > https://pastebin.com/aqac0dRj > > These are not the same results between the two ioctls for the same > file, and not the same result as what you get with -l 32M (which I do > get if I use the default 32M). The BTRFS_IOC_DEFRAG interleaved result > is peculiar, but I don't think we can say it's ineffective, it might > be an intentional no op either because it's nodatacow or it sees that > these many extents are mostly contiguous and not worth defragmenting > (which would be good for keeping write amplification down). > > So I don't know, maybe it's not wrong. > > -- > Chris Murphy > -- gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 ^ permalink raw reply [flat|nested] 19+ messages in thread
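For reference, the 'C' attribute that lsattr prints corresponds to the FS_NOCOW_FL inode flag, readable through the FS_IOC_GETFLAGS ioctl. A rough sketch of checking it programmatically (constant values as I read them from linux/fs.h, assuming 64-bit Linux; the function name is mine):

```python
import fcntl, os, struct

# From linux/fs.h (FS_IOC_GETFLAGS is _IOR('f', 1, long); 8-byte long assumed):
FS_IOC_GETFLAGS = 0x80086601
FS_NOCOW_FL     = 0x00800000   # the 'C' attribute shown by lsattr

def is_nocow(path):
    """Return True if the file has the NOCOW ('C') inode flag set."""
    fd = os.open(path, os.O_RDONLY)
    try:
        buf = fcntl.ioctl(fd, FS_IOC_GETFLAGS, struct.pack("l", 0))
        (flags,) = struct.unpack("l", buf)
        return bool(flags & FS_NOCOW_FL)
    finally:
        os.close(fd)
```

This is essentially what lsattr does internally, so a script watching the journal directory could distinguish the still-NOCOW, defrag-pending files from the already-converted ones.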
* Re: is BTRFS_IOC_DEFRAG behavior optimal? 2021-02-10 19:14 ` Goffredo Baroncelli @ 2021-02-11 0:19 ` Chris Murphy 2021-02-11 3:08 ` kreijack 1 sibling, 0 replies; 19+ messages in thread From: Chris Murphy @ 2021-02-11 0:19 UTC (permalink / raw) To: Goffredo Baroncelli; +Cc: Chris Murphy, Btrfs BTRFS On Wed, Feb 10, 2021 at 12:14 PM Goffredo Baroncelli <kreijack@inwind.it> wrote: > > Hi Chris, > > it seems that systemd-journald is more smart/complex than I thought: > > 1) systemd-journald set the "live" journal as NOCOW; *when* (see below) it > closes the files, it mark again these as COW then defrag [1] Found that in commit 11689d2a021d95a8447d938180e0962cd9439763 from 2015. But archived journals are still all nocow for me on systemd 247. Is it because the enclosing directory has file attribute 'C' ? Another example: Active journal "system.journal" INODE_ITEM contains sequence 4515 flags 0x13(NODATASUM|NODATACOW|PREALLOC) 7 day old archived journal "systemd.journal" INODE_ITEM shows: sequence 227 flags 0x13(NODATASUM|NODATACOW|PREALLOC) So if it ever was COW, it flipped to NOCOW before the defrag. Is it expected? and also this archived file's INODE_ITEM shows generation 1748644 transid 1760983 size 16777216 nbytes 16777216 with EXTENT_ITEMs show generation 1755533 type 1 (regular) generation 1753668 type 1 (regular) generation 1755533 type 1 (regular) generation 1753989 type 1 (regular) generation 1755533 type 1 (regular) generation 1753526 type 1 (regular) generation 1755533 type 1 (regular) generation 1755531 type 1 (regular) generation 1755533 type 1 (regular) generation 1755531 type 2 (prealloc) file tree output for this file https://pastebin.com/6uDFNDdd > 2) looking at the code, I suspect that systemd-journald closes the > file asynchronously [2]. This means that looking at the "live" journal > is not sufficient. In fact: > > /var/log/journal/e84907d099904117b355a99c98378dca$ sudo lsattr $(ls -rt *) > [...] 
> --------------------- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd4f-0005baed61106a18.journal > --------------------- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000bd64-0005baed659feff4.journal > --------------------- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd67-0005baed65a0901f.journal > ---------------C----- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cc63-0005bafed4f12f0a.journal > ---------------C----- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cc85-0005baff0ce27e49.journal > ---------------C----- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cd38-0005baffe9080b4d.journal > ---------------C----- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cd3b-0005baffe908f244.journal > ---------------C----- user-1000.journal > ---------------C----- system.journal > > The output above means that the last 6 files are "pending" for a de-fragmentation. When these will be > "closed", the NOCOW flag will be removed and a defragmentation will start. > > Now my journals have few (2 or 3 extents). But I saw cases where the extents > of the more recent files are hundreds, but after few "journalct --rotate" the older files become less > fragmented. Josef explained to me that BTRFS_IOC_DEFRAG is pretty simple and just dirties extents it considers too small, and they end up just going through the normal write path, along with anything else pending. And also that fsync() will set the extents on disk so that the defrag ioctl know what to dirty, but that ordinarily it's not required and might have to do with the interleaving write pattern for the journals. I'm not sure what this ioctl considers big enough that it's worth just leaving alone. But in any case it sounds like the current write workload at the time of defrag could affect the allocation, unlike BTRFS_IOC_DEFRAG_RANGE which has a few knobs to control the outcome. Or maybe the knobs just influence the outcome. Not sure. 
If the device is HDD, it might be nice if the nodatacow journals are datacow again so they could be compressed. But my evaluation shows that nodatacow journals stick to an 8MB extent pattern, correlating to fallocated append as they grow. It's not significantly fragmented to start out with, whether HDD or SSD. -- Chris Murphy ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: is BTRFS_IOC_DEFRAG behavior optimal? 2021-02-10 19:14 ` Goffredo Baroncelli 2021-02-11 0:19 ` Chris Murphy @ 2021-02-11 3:08 ` kreijack 2021-02-11 3:13 ` Zygo Blaxell 1 sibling, 1 reply; 19+ messages in thread From: kreijack @ 2021-02-11 3:08 UTC (permalink / raw) To: Goffredo Baroncelli; +Cc: Chris Murphy, Btrfs BTRFS On Wed, Feb 10, 2021 at 08:14:09PM +0100, Goffredo Baroncelli wrote: > Hi Chris, > > it seems that systemd-journald is more smart/complex than I thought: > > 1) systemd-journald set the "live" journal as NOCOW; *when* (see below) it > closes the files, it mark again these as COW then defrag [1] > > 2) looking at the code, I suspect that systemd-journald closes the > file asynchronously [2]. This means that looking at the "live" journal > is not sufficient. In fact: > > /var/log/journal/e84907d099904117b355a99c98378dca$ sudo lsattr $(ls -rt *) > [...] > --------------------- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd4f-0005baed61106a18.journal > --------------------- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000bd64-0005baed659feff4.journal > --------------------- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd67-0005baed65a0901f.journal > ---------------C----- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cc63-0005bafed4f12f0a.journal > ---------------C----- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cc85-0005baff0ce27e49.journal > ---------------C----- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cd38-0005baffe9080b4d.journal > ---------------C----- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cd3b-0005baffe908f244.journal > ---------------C----- user-1000.journal > ---------------C----- system.journal > > The output above means that the last 6 files are "pending" for a de-fragmentation. When these will be > "closed", the NOCOW flag will be removed and a defragmentation will start. Wait what? > Now my journals have few (2 or 3 extents). 
But I saw cases where the extents > of the more recent files are hundreds, but after few "journalct --rotate" the older files become less > fragmented. > > [1] https://github.com/systemd/systemd/blob/fee6441601c979165ebcbb35472036439f8dad5f/src/libsystemd/sd-journal/journal-file.c#L383 That line doesn't work, and systemd ignores the error. The NOCOW flag cannot be set or cleared unless the file is empty. This is checked in btrfs_ioctl_setflags. This is not something that can be changed easily--if the NOCOW bit is cleared on a non-empty file, btrfs data read code will expect csums that aren't present on disk because they were written while the file was NODATASUM, and the reads will fail pretty badly. The entire file would have to have csums added or removed at the same time as the flag change (or all nodatacow file reads take a performance hit looking for csums that may or may not be present). At file close, the systemd should copy the data to a new file with no special attributes and discard or recycle the old inode. This copy will be mostly contiguous and have desirable properties like csums and compression, and will have iops equivalent to btrfs fi defrag. > [2] https://github.com/systemd/systemd/blob/fee6441601c979165ebcbb35472036439f8dad5f/src/libsystemd/sd-journal/journal-file.c#L3687 > > On 2/10/21 7:37 AM, Chris Murphy wrote: > > This is an active (but idle) system.journal file. That is, it's open > > but not being written to. I did a sync right before this: > > > > https://pastebin.com/jHh5tfpe > > > > And then: btrfs fi defrag -l 8M system.journal > > > > https://pastebin.com/Kq1GjJuh > > > > Looks like most of it was a no op. So it seems btrfs in this case is > > not confused by so many small extent items, it know they are > > contiguous? > > > > It doesn't answer the question what the "too small" threshold is for > > BTRFS_IOC_DEFRAG, which is what sd-journald is using, though. 
> > > > Another sync, and then, 'journalctl --rotate' and the resulting > > archived file is now: > > > > https://pastebin.com/aqac0dRj > > > > These are not the same results between the two ioctls for the same > > file, and not the same result as what you get with -l 32M (which I do > > get if I use the default 32M). The BTRFS_IOC_DEFRAG interleaved result > > is peculiar, but I don't think we can say it's ineffective, it might > > be an intentional no op either because it's nodatacow or it sees that > > these many extents are mostly contiguous and not worth defragmenting > > (which would be good for keeping write amplification down). > > > > So I don't know, maybe it's not wrong. > > > > -- > > Chris Murphy > > > > > -- > gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> > Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 ^ permalink raw reply [flat|nested] 19+ messages in thread
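The copy-at-close approach suggested above (fresh inode with default attributes, then atomic rename over the old name) can be sketched as follows. This is an illustration of the idea, not systemd code; all names are made up:

```python
import os

def archive_copy(src_path, tmp_path, bufsize=1 << 20):
    """Copy a closed journal to a fresh inode so the new file gets the
    filesystem's default attributes (datacow, csums, compression) and a
    mostly contiguous layout, then atomically replace the old name."""
    with open(src_path, "rb") as src, open(tmp_path, "wb") as dst:
        while True:
            buf = src.read(bufsize)
            if not buf:
                break
            dst.write(buf)
        dst.flush()
        os.fsync(dst.fileno())      # data must be durable before the rename
    os.replace(tmp_path, src_path)  # atomic: old inode is discarded
```

The sequential rewrite costs about the same iops as `btrfs fi defrag` would, which is the trade-off described above.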
* Re: is BTRFS_IOC_DEFRAG behavior optimal? 2021-02-11 3:08 ` kreijack @ 2021-02-11 3:13 ` Zygo Blaxell 2021-02-11 3:39 ` Chris Murphy 2021-02-11 3:52 ` Chris Murphy 0 siblings, 2 replies; 19+ messages in thread From: Zygo Blaxell @ 2021-02-11 3:13 UTC (permalink / raw) To: kreijack; +Cc: Chris Murphy, Btrfs BTRFS [-- Attachment #1: Type: text/plain, Size: 4866 bytes --] Sorry, I busted my mail client. That was from me. :-P On Wed, Feb 10, 2021 at 10:08:37PM -0500, kreijack@inwind.it wrote: > On Wed, Feb 10, 2021 at 08:14:09PM +0100, Goffredo Baroncelli wrote: > > Hi Chris, > > > > it seems that systemd-journald is more smart/complex than I thought: > > > > 1) systemd-journald set the "live" journal as NOCOW; *when* (see below) it > > closes the files, it mark again these as COW then defrag [1] > > > > 2) looking at the code, I suspect that systemd-journald closes the > > file asynchronously [2]. This means that looking at the "live" journal > > is not sufficient. In fact: > > > > /var/log/journal/e84907d099904117b355a99c98378dca$ sudo lsattr $(ls -rt *) > > [...] 
> > --------------------- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd4f-0005baed61106a18.journal > > --------------------- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000bd64-0005baed659feff4.journal > > --------------------- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd67-0005baed65a0901f.journal > > ---------------C----- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cc63-0005bafed4f12f0a.journal > > ---------------C----- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cc85-0005baff0ce27e49.journal > > ---------------C----- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cd38-0005baffe9080b4d.journal > > ---------------C----- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cd3b-0005baffe908f244.journal > > ---------------C----- user-1000.journal > > ---------------C----- system.journal > > > > The output above means that the last 6 files are "pending" for a de-fragmentation. When these will be > > "closed", the NOCOW flag will be removed and a defragmentation will start. > > Wait what? > > > Now my journals have few (2 or 3 extents). But I saw cases where the extents > > of the more recent files are hundreds, but after few "journalct --rotate" the older files become less > > fragmented. > > > > [1] https://github.com/systemd/systemd/blob/fee6441601c979165ebcbb35472036439f8dad5f/src/libsystemd/sd-journal/journal-file.c#L383 > > That line doesn't work, and systemd ignores the error. > > The NOCOW flag cannot be set or cleared unless the file is empty. > This is checked in btrfs_ioctl_setflags. > > This is not something that can be changed easily--if the NOCOW bit is > cleared on a non-empty file, btrfs data read code will expect csums > that aren't present on disk because they were written while the file was > NODATASUM, and the reads will fail pretty badly. 
The entire file would > have to have csums added or removed at the same time as the flag change > (or all nodatacow file reads take a performance hit looking for csums > that may or may not be present). > > At file close, the systemd should copy the data to a new file with no > special attributes and discard or recycle the old inode. This copy > will be mostly contiguous and have desirable properties like csums and > compression, and will have iops equivalent to btrfs fi defrag. > > > [2] https://github.com/systemd/systemd/blob/fee6441601c979165ebcbb35472036439f8dad5f/src/libsystemd/sd-journal/journal-file.c#L3687 > > > > On 2/10/21 7:37 AM, Chris Murphy wrote: > > > This is an active (but idle) system.journal file. That is, it's open > > > but not being written to. I did a sync right before this: > > > > > > https://pastebin.com/jHh5tfpe > > > > > > And then: btrfs fi defrag -l 8M system.journal > > > > > > https://pastebin.com/Kq1GjJuh > > > > > > Looks like most of it was a no op. So it seems btrfs in this case is > > > not confused by so many small extent items, it know they are > > > contiguous? > > > > > > It doesn't answer the question what the "too small" threshold is for > > > BTRFS_IOC_DEFRAG, which is what sd-journald is using, though. > > > > > > Another sync, and then, 'journalctl --rotate' and the resulting > > > archived file is now: > > > > > > https://pastebin.com/aqac0dRj > > > > > > These are not the same results between the two ioctls for the same > > > file, and not the same result as what you get with -l 32M (which I do > > > get if I use the default 32M). The BTRFS_IOC_DEFRAG interleaved result > > > is peculiar, but I don't think we can say it's ineffective, it might > > > be an intentional no op either because it's nodatacow or it sees that > > > these many extents are mostly contiguous and not worth defragmenting > > > (which would be good for keeping write amplification down). > > > > > > So I don't know, maybe it's not wrong. 
> > > > > > -- > > > Chris Murphy > > > > > > > > > -- > > gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> > > Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 195 bytes --] ^ permalink raw reply [flat|nested] 19+ messages in thread
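For reference, the flag change being discussed is a plain FS_IOC_GETFLAGS/FS_IOC_SETFLAGS round trip. A hedged Python sketch — the ioctl numbers are the linux/fs.h values as seen on x86-64, and on btrfs the SETFLAGS call is rejected (in btrfs_ioctl_setflags) unless the file is empty, which is exactly the error systemd ignores:

```python
import fcntl
import struct

# Values from linux/fs.h as seen on x86-64; illustrative, not a portable ABI.
FS_IOC_GETFLAGS = 0x80086601
FS_IOC_SETFLAGS = 0x40086602
FS_NOCOW_FL     = 0x00800000   # the 'C' bit that lsattr displays

def toggle_nocow(flags, enable):
    """Pure helper: set or clear FS_NOCOW_FL in an inode flags word."""
    return flags | FS_NOCOW_FL if enable else flags & ~FS_NOCOW_FL

def set_nocow(fd, enable=True):
    """Read-modify-write the inode flags. On btrfs this succeeds only
    while the file is empty; on a non-empty file the kernel rejects the
    change, because flipping it would invalidate csum expectations."""
    buf = bytearray(struct.pack("l", 0))
    fcntl.ioctl(fd, FS_IOC_GETFLAGS, buf)
    flags = struct.unpack("l", buf)[0]
    fcntl.ioctl(fd, FS_IOC_SETFLAGS, struct.pack("l", toggle_nocow(flags, enable)))
```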
* Re: is BTRFS_IOC_DEFRAG behavior optimal? 2021-02-11 3:13 ` Zygo Blaxell @ 2021-02-11 3:39 ` Chris Murphy 2021-02-11 6:12 ` Zygo Blaxell 2021-02-11 3:52 ` Chris Murphy 1 sibling, 1 reply; 19+ messages in thread From: Chris Murphy @ 2021-02-11 3:39 UTC (permalink / raw) To: Zygo Blaxell; +Cc: Goffredo Baroncelli, Chris Murphy, Btrfs BTRFS On Wed, Feb 10, 2021 at 8:13 PM Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote: > > Sorry, I busted my mail client. That was from me. :-P > > On Wed, Feb 10, 2021 at 10:08:37PM -0500, kreijack@inwind.it wrote: > > On Wed, Feb 10, 2021 at 08:14:09PM +0100, Goffredo Baroncelli wrote: > > > Hi Chris, > > > > > > it seems that systemd-journald is more smart/complex than I thought: > > > > > > 1) systemd-journald set the "live" journal as NOCOW; *when* (see below) it > > > closes the files, it mark again these as COW then defrag [1] > > > > > > 2) looking at the code, I suspect that systemd-journald closes the > > > file asynchronously [2]. This means that looking at the "live" journal > > > is not sufficient. In fact: > > > > > > /var/log/journal/e84907d099904117b355a99c98378dca$ sudo lsattr $(ls -rt *) > > > [...] 
> > > --------------------- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd4f-0005baed61106a18.journal > > > --------------------- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000bd64-0005baed659feff4.journal > > > --------------------- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd67-0005baed65a0901f.journal > > > ---------------C----- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cc63-0005bafed4f12f0a.journal > > > ---------------C----- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cc85-0005baff0ce27e49.journal > > > ---------------C----- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cd38-0005baffe9080b4d.journal > > > ---------------C----- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cd3b-0005baffe908f244.journal > > > ---------------C----- user-1000.journal > > > ---------------C----- system.journal > > > > > > The output above means that the last 6 files are "pending" for a de-fragmentation. When these will be > > > "closed", the NOCOW flag will be removed and a defragmentation will start. > > > > Wait what? > > > > > Now my journals have few (2 or 3 extents). But I saw cases where the extents > > > of the more recent files are hundreds, but after few "journalct --rotate" the older files become less > > > fragmented. > > > > > > [1] https://github.com/systemd/systemd/blob/fee6441601c979165ebcbb35472036439f8dad5f/src/libsystemd/sd-journal/journal-file.c#L383 > > > > That line doesn't work, and systemd ignores the error. > > > > The NOCOW flag cannot be set or cleared unless the file is empty. > > This is checked in btrfs_ioctl_setflags. > > > > This is not something that can be changed easily--if the NOCOW bit is > > cleared on a non-empty file, btrfs data read code will expect csums > > that aren't present on disk because they were written while the file was > > NODATASUM, and the reads will fail pretty badly. 
The entire file would > > have to have csums added or removed at the same time as the flag change > > (or all nodatacow file reads take a performance hit looking for csums > > that may or may not be present). > > > > At file close, the systemd should copy the data to a new file with no > > special attributes and discard or recycle the old inode. This copy > > will be mostly contiguous and have desirable properties like csums and > > compression, and will have iops equivalent to btrfs fi defrag. Journals implement their own checksumming. Yeah, if there's corruption, Btrfs raid can't do a transparent fixup. But the whole journal isn't lost, just the affected record. *shrug* I think if (a) nodatacow and/or (b) SSD, just leave it alone. Why add more writes? In particular the nodatacow case where I'm seeing consistently the file made from multiples of 8MB contiguous blocks, even on HDD the seek latency here can't be worth defragging the file. I think defrag makes sense for (a) datacow journals, i.e. when the default nodatacow is inhibited, and (b) HDD. In that case the fragmentation is quite considerable, hundreds to thousands of extents. It's sufficiently bad that it'd probably be better if they were defragmented automatically with a trigger that tests for number of non-contiguous small blocks that somehow cheaply estimates latency reading all of them. Since the files are interleaved, doing something like "systemctl status dbus" might actually read many blocks even if the result isn't a whole heck of a lot of visible data. But on SSD, cow or nocow, and HDD nocow - I think just leave them alone. -- Chris Murphy ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: is BTRFS_IOC_DEFRAG behavior optimal? 2021-02-11 3:39 ` Chris Murphy @ 2021-02-11 6:12 ` Zygo Blaxell 2021-02-11 8:46 ` Chris Murphy 0 siblings, 1 reply; 19+ messages in thread From: Zygo Blaxell @ 2021-02-11 6:12 UTC (permalink / raw) To: Chris Murphy; +Cc: Goffredo Baroncelli, Btrfs BTRFS On Wed, Feb 10, 2021 at 08:39:12PM -0700, Chris Murphy wrote: > On Wed, Feb 10, 2021 at 8:13 PM Zygo Blaxell > <ce3g8jdj@umail.furryterror.org> wrote: > > > > Sorry, I busted my mail client. That was from me. :-P > > > > On Wed, Feb 10, 2021 at 10:08:37PM -0500, kreijack@inwind.it wrote: > > > On Wed, Feb 10, 2021 at 08:14:09PM +0100, Goffredo Baroncelli wrote: > > > > Hi Chris, > > > > > > > > it seems that systemd-journald is more smart/complex than I thought: > > > > > > > > 1) systemd-journald set the "live" journal as NOCOW; *when* (see below) it > > > > closes the files, it mark again these as COW then defrag [1] > > > > > > > > 2) looking at the code, I suspect that systemd-journald closes the > > > > file asynchronously [2]. This means that looking at the "live" journal > > > > is not sufficient. In fact: > > > > > > > > /var/log/journal/e84907d099904117b355a99c98378dca$ sudo lsattr $(ls -rt *) > > > > [...] 
> > > > --------------------- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd4f-0005baed61106a18.journal > > > > --------------------- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000bd64-0005baed659feff4.journal > > > > --------------------- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd67-0005baed65a0901f.journal > > > > ---------------C----- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cc63-0005bafed4f12f0a.journal > > > > ---------------C----- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cc85-0005baff0ce27e49.journal > > > > ---------------C----- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cd38-0005baffe9080b4d.journal > > > > ---------------C----- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cd3b-0005baffe908f244.journal > > > > ---------------C----- user-1000.journal > > > > ---------------C----- system.journal > > > > > > > > The output above means that the last 6 files are "pending" for a de-fragmentation. When these will be > > > > "closed", the NOCOW flag will be removed and a defragmentation will start. > > > > > > Wait what? > > > > > > > Now my journals have few (2 or 3 extents). But I saw cases where the extents > > > > of the more recent files are hundreds, but after few "journalct --rotate" the older files become less > > > > fragmented. > > > > > > > > [1] https://github.com/systemd/systemd/blob/fee6441601c979165ebcbb35472036439f8dad5f/src/libsystemd/sd-journal/journal-file.c#L383 > > > > > > That line doesn't work, and systemd ignores the error. > > > > > > The NOCOW flag cannot be set or cleared unless the file is empty. > > > This is checked in btrfs_ioctl_setflags. > > > > > > This is not something that can be changed easily--if the NOCOW bit is > > > cleared on a non-empty file, btrfs data read code will expect csums > > > that aren't present on disk because they were written while the file was > > > NODATASUM, and the reads will fail pretty badly. 
The entire file would > > > have to have csums added or removed at the same time as the flag change > > > (or all nodatacow file reads take a performance hit looking for csums > > > that may or may not be present). > > > > > > At file close, the systemd should copy the data to a new file with no > > > special attributes and discard or recycle the old inode. This copy > > > will be mostly contiguous and have desirable properties like csums and > > > compression, and will have iops equivalent to btrfs fi defrag. > > Journals implement their own checksumming. Yeah, Lennart said the same thing six years ago. I'm using btrfs data csums to detect disk failures (the most important benefit being that we can stop buying SSD models where silent data corruption is a problem). On our systems that have systemd journals, the journals are pretty big--10% of the writable media. That's 10% of the media where defects can hide undetected without csums. Checking journal csums with a separate tool is crazy. We used to do that with git and svn and archive files and media files and a hundred database formats with ext4, and it was the equivalent of a full time employee's job trying to figure out where all the chaos was coming from when a bad disk model came through the fleet. Never again. Now btrfs scrub just sends us an email telling us which disk models are garbage, we stop buying them, and now all the hardware that we buy (more than once) just works. If I had to, I'd remove the FS_NOCOW_FL flag support from my kernels to prevent applications from breaking that. > Yeah, if there's > corruption, Btrfs raid can't do a transparent fixup. But the whole > journal isn't lost, just the affected record. *shrug* I think if (a) > nodatacow and/or (b) SSD, just leave it alone. Why add more writes? Well, I'm trying to guess the original intent here. There are comments in the systemd git history talking about getting btrfs features back by turning off nodatacow as systemd closes the journal file. 
We can assume that the existing code turning off the FS_NOCOW_FL bit was intended to restore data csums (which implies at least reading all the data), but nobody noticed it doesn't work. The defrag command that follows implies an intended copy of the data. Though with this code it's hard to tell what's a bug, what's intent, and what's cargo cult programming. If we want the data compressed (and who doesn't? journal data compresses 8:1 with btrfs zstd) then we'll always need to make a copy at close. Because systemd used prealloc, the copy is necessarily to a new inode, as there's no way to re-enable compression on an inode once prealloc is used (this has deep disk-format reasons, but not as deep as the nodatacow ones). If we don't care about compression or datasums, then keep the file nodatacow and do nothing at close. The defrag isn't needed and the FS_NOCOW_FL flag change doesn't work. > In particular the nodatacow case where I'm seeing consistently the > file made from multiples of 8MB contiguous blocks, even on HDD the > seek latency here can't be worth defragging the file. > > I think defrag makes sense for (a) datacow journals, i.e. when the default > nodatacow is inhibited, and (b) HDD. It makes sense for SSD too. It's 4K extents, so the metadata and small-IO overheads will be non-trivial even on SSD. Deleting or truncating datacow journal files will put a lot of tiny free space holes into the filesystem. It will flood the next commit with delayed refs and push up latency. > In that case the fragmentation is > quite considerable, hundreds to thousands of extents. It's > sufficiently bad that it'd probably be better if they were > defragmented automatically with a trigger that tests for number of > non-contiguous small blocks that somehow cheaply estimates latency > reading all of them. Yeah it would be nice if autodefrag could be made to not suck. Even systemd running defrag_range after writing every 128K-512K would be so much better than no defrag at all or autodefrag.
Short bursts of latency, and a small but not unreasonable target extent size. > Since the files are interleaved, doing something > like "systemctl status dbus" might actually read many blocks even if > the result isn't a whole heck of a lot of visible data. > > But on SSD, cow or nocow, and HDD nocow - I think just leave them alone. > > -- > Chris Murphy ^ permalink raw reply [flat|nested] 19+ messages in thread
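The periodic defrag_range call suggested above goes through BTRFS_IOC_DEFRAG_RANGE, which takes a btrfs_ioctl_defrag_range_args struct. A sketch of how userspace would pack it (ioctl number and layout as in the btrfs ioctl ABI; treat the packing as illustrative):

```python
import struct

# _IOW(0x94, 16, 48-byte struct btrfs_ioctl_defrag_range_args)
BTRFS_IOC_DEFRAG_RANGE = 0x40309416
BTRFS_DEFRAG_RANGE_COMPRESS = 1 << 0
BTRFS_DEFRAG_RANGE_START_IO = 1 << 1

def defrag_range_args(start, length, extent_thresh=128 * 1024,
                      compress_type=0, flags=0):
    """Pack struct btrfs_ioctl_defrag_range_args:
    u64 start, u64 len, u64 flags, u32 extent_thresh,
    u32 compress_type, u32 unused[4] -- 48 bytes total."""
    return struct.pack("<QQQII16x", start, length, flags,
                       extent_thresh, compress_type)

# Intended use (on a btrfs fd), roughly:
#   fcntl.ioctl(fd, BTRFS_IOC_DEFRAG_RANGE,
#               defrag_range_args(tail_offset, 128 * 1024))
```

A writer could call this on just the last 128K of the file after each append burst, which is the "short bursts of latency" approach described above.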
* Re: is BTRFS_IOC_DEFRAG behavior optimal? 2021-02-11 6:12 ` Zygo Blaxell @ 2021-02-11 8:46 ` Chris Murphy 2021-02-13 0:16 ` Zygo Blaxell 0 siblings, 1 reply; 19+ messages in thread From: Chris Murphy @ 2021-02-11 8:46 UTC (permalink / raw) To: Zygo Blaxell; +Cc: Chris Murphy, Goffredo Baroncelli, Btrfs BTRFS On Wed, Feb 10, 2021 at 11:12 PM Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote: > > If we want the data compressed (and who doesn't? journal data compresses > 8:1 with btrfs zstd) then we'll always need to make a copy at close. > Because systemd used prealloc, the copy is necessarily to a new inode, > as there's no way to re-enable compression on an inode once prealloc > is used (this has deep disk-format reasons, but not as deep as the > nodatacow ones). Pretty sure sd-journald still fallocates when datacow by touching /etc/tmpfiles.d/journal-nocow.conf And I know for sure those datacow files do compress on rotation. Preallocated datacow might not be so bad if it weren't for that one damn header or indexing block, whatever the proper term is, that sd-journald hammers every time it fsyncs. I don't know if I wanna know what it means to snapshot a datacow file that's prealloc. But in theory if the same blocks weren't all being hammered, a preallocated file shouldn't fragment like hell if each prealloc block gets just one write. > If we don't care about compression or datasums, then keep the file > nodatacow and do nothing at close. The defrag isn't needed and the > FS_NOCOW_FL flag change doesn't work. Agreed. > It makes sense for SSD too. It's 4K extents, so the metadata and small-IO > overheads will be non-trivial even on SSD. Deleting or truncating datacow > journal files will put a lot of tiny free space holes into the filesystem. > It will flood the next commit with delayed refs and push up latency. I haven't seen meaningful latency on a single journal file, datacow and heavily fragmented, on ssd. 
But to test on more than one file at a time I need to revert the defrag commits, and build systemd, and let a bunch of journals accumulate somehow. If I dump too much data artificially to try and mimic aging, I know I will get nowhere near as many of those 4KiB extents. So I dunno. > > > In that case the fragmentation is > > quite considerable, hundreds to thousands of extents. It's > > sufficiently bad that it'd probably be better if they were > > defragmented automatically with a trigger that tests for number of > > non-contiguous small blocks that somehow cheaply estimates latency > > reading all of them. > > Yeah it would be nice if autodefrag could be made to not suck. It triggers on inserts, not appends. So it doesn't do anything for the sd-journald case. I would think the active journals are the ones more likely to get searched for recent events than archived journals. So in the datacow case, you only get relief once it's rotated. It'd be nice to find a decent, not necessarily perfect, way for them to not get so fragmented in the first place. Or just defrag once a file has 16M of non-contiguous extents. Estimating extents though is another issue, especially with compression enabled. -- Chris Murphy ^ permalink raw reply [flat|nested] 19+ messages in thread
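One cheap way to approximate the "how fragmented is it really" trigger discussed above is to parse `filefrag -v` output and count tiny extents. A hypothetical helper (the 64 KiB threshold is arbitrary; the line format matches the pastebin output earlier in the thread, assuming 4 KiB blocks):

```python
def count_small_extents(filefrag_v_output, small_kb=64):
    """Count extents smaller than `small_kb` KiB in `filefrag -v` output.
    Data rows look like:
        5:     1691..    1693:     125477..    125479:      3:
    where the fourth colon-separated field is the length in 4 KiB blocks.
    Returns (small_extents, total_extents)."""
    small = total = 0
    for line in filefrag_v_output.splitlines():
        parts = [p.strip() for p in line.split(":")]
        if len(parts) < 4 or not parts[0].isdigit():
            continue  # skip the header and summary lines
        length_blocks = int(parts[3])
        total += 1
        if length_blocks * 4 < small_kb:
            small += 1
    return small, total
```

A rotation-time policy could then defrag only when `small / total` (or `small * small_kb`) crosses some threshold, leaving the clean 8MB-extent nodatacow files alone.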
* Re: is BTRFS_IOC_DEFRAG behavior optimal? 2021-02-11 8:46 ` Chris Murphy @ 2021-02-13 0:16 ` Zygo Blaxell 0 siblings, 0 replies; 19+ messages in thread From: Zygo Blaxell @ 2021-02-13 0:16 UTC (permalink / raw) To: Chris Murphy; +Cc: Goffredo Baroncelli, Btrfs BTRFS On Thu, Feb 11, 2021 at 01:46:07AM -0700, Chris Murphy wrote: > On Wed, Feb 10, 2021 at 11:12 PM Zygo Blaxell > <ce3g8jdj@umail.furryterror.org> wrote: > > > > > > If we want the data compressed (and who doesn't? journal data compresses > > 8:1 with btrfs zstd) then we'll always need to make a copy at close. > > Because systemd used prealloc, the copy is necessarily to a new inode, > > as there's no way to re-enable compression on an inode once prealloc > > is used (this has deep disk-format reasons, but not as deep as the > > nodatacow ones). > > Pretty sure sd-journald still fallocates when datacow by touching > /etc/tmpfiles.d/journal-nocow.conf Fallocate on datacow just wastes space and CPU time if the application is not doing sequential 4K writes with no overwrites (sequential keeps the metadata at bounded size, otherwise it grows too). Datacow takes precedence over fallocate. It works only when you're overwriting a prealloc block with a data block for the first time, and after that it's just datacow with compress disabled and a reference to a big extent that doesn't go away until the last block is overwritten. I think fallocate on datacow should be deprecated and removed from btrfs. Fixing it doesn't seem to be possible without Pyrrhic time or space costs. On the other hand, it does have that one working use case, and I could be convinced to back down if someone shows me one example of an application in the wild that is using fallocate + datacow on btrfs correctly. > And I know for sure those datacow files do compress on rotation. 
Hmmm...OK, I missed that defrag can force compression in a prealloc file because it bypasses the inode check for prealloc (same for reflinks, you can reflink a compressed extent into a prealloc file if you wrote the extent in a non-prealloc file). It is only normal compressed writes directly to the inode that are blocked by prealloc. So we can keep the inode, but compression still only happens by making a copy of all the data with defrag. If the data is still in page cache then we can skip the read, at least. > Preallocated datacow might not be so bad if it weren't for that one > damn header or indexing block, whatever the proper term is, that > sd-journald hammers every time it fsyncs. It seems to write every other block more than once too. > I don't know if I wanna know > what it means to snapshot a datacow file that's prealloc. The first subvol to write to the prealloc data blocks gets to write in-place. All others get datacow, just like nodatacow files when they have a reflink. It is basically the same as the nodatacow extent-sharing check, except competing prealloc refs can be ignored (they will read as zero, and if they are written they will do their own extent-sharing check and notice they have lost the race to use the allocated block). > But in > theory if the same blocks weren't all being hammered, a preallocated > file shouldn't fragment like hell if each prealloc block gets just one > write. That is the key, each block must have only one 4K write, ever. Writing 2x adjacent 2K blocks seems to count as 2 writes even if they are 4K aligned and there is no flush or commit in between. > > If we don't care about compression or datasums, then keep the file > > nodatacow and do nothing at close. The defrag isn't needed and the > > FS_NOCOW_FL flag change doesn't work. > > Agreed. > > > > It makes sense for SSD too. It's 4K extents, so the metadata and small-IO > > overheads will be non-trivial even on SSD. 
Deleting or truncating datacow > > journal files will put a lot of tiny free space holes into the filesystem. > > It will flood the next commit with delayed refs and push up latency. > > I haven't seen meaningful latency on a single journal file, datacow > and heavily fragmented, on ssd. Someone pushed back last time I proposed simply letting datacow be datacow, citing high latency on NVME devices. I'm not sure what "meaningful" latency is...journalctl takes a crazy long time to start up compared to, say, 'tail -F' or 'less'. I've always assumed journald's file format was an interim thing that would have been deprecated and replaced years ago (you know you've failed to design a file format when 'less' is winning races against you). I never started using it, so I've never investigated what's really wrong with it (or what compelling advantage offsets the problems it seems to have). > But to test on more than one file at a > time I need to revert the defrag commits, and build systemd, and let a > bunch of journals accumulate somehow. If I dump too much data > artificially to try and mimic aging, I know I will get nowhere near as > many of those 4KiB extents. So I dunno. Something like:

    while :; do
        date > /dev/kmsg
        date >> logfile
        sync logfile
    done

should be the worst case for both journald and a plaintext logfile. Maybe needs a 'sleep 1' to space things out for journald. > > > In that case the fragmentation is > > > quite considerable, hundreds to thousands of extents. It's > > > sufficiently bad that it'd probably be better if they were > > > defragmented automatically with a trigger that tests for number of > > > non-contiguous small blocks that somehow cheaply estimates latency > > > reading all of them. > > > > Yeah it would be nice if autodefrag could be made to not suck. > > It triggers on inserts, not appends. So it doesn't do anything for the > sd-journald case.
Appends are probably where autodefrag is most useful, and also cheapest (the cold data is more likely to still be in page cache for appends than it is for mid-file inserts), and also really common (lots of programs have log files). It would be nice if autodefrag could be configured to do those and nothing else--I might even be able to use it then. > I would think the active journals are the ones more likely to get > searched for recent events than archived journals. So in the datacow > case, you only get relief once it's rotated. It'd be nice to find a > decent, not necessarily perfect, way for them to not get so fragmented > in the first place. Or just defrag once a file has 16M of > non-contiguous extents. Or run defrag_range on the tail of the file every time the file grows by 128K. Huge extents aren't required to get OK performance, we only need to avoid tiny extents because they are cripplingly slow. 64K is almost an OOM better than 4K for sequential reading over SATA. 128K isn't much bigger and would line up nicely with compressed extent size. > Estimating extents though is another issue, especially with compression enabled. Shouldn't be necessary. Either it's nodatacow and the extent sizes are all 8M (or whatever size you requested in fallocate), or it's datacow and the extent size is always 4K (or you have truly huge journal data volumes and none of this matters because even datacow will give good extent sizes on a firehose of data). There will not be compression if there are no 8K single-commit writes (have to save at least 4K per write, or btrfs won't be able to compress). > > -- > Chris Murphy ^ permalink raw reply [flat|nested] 19+ messages in thread
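The "each 4K block written exactly once" condition discussed earlier in this message can be enforced in userspace by buffering appends to block boundaries, so prealloc/nodatacow space is consumed one whole block per write. A hypothetical sketch, not journald code:

```python
import os

BLOCK = 4096

class BlockAppender:
    """Buffer appends so the underlying file only ever sees whole,
    aligned 4 KiB pwrites, each block written exactly once -- the
    condition under which a fallocated file stays unfragmented."""

    def __init__(self, fd):
        self.fd = fd
        self.buf = bytearray()
        self.written = 0  # byte offset of the next block to write

    def append(self, data):
        self.buf += data
        while len(self.buf) >= BLOCK:
            os.pwrite(self.fd, bytes(self.buf[:BLOCK]), self.written)
            self.written += BLOCK
            del self.buf[:BLOCK]

    def close(self):
        if self.buf:  # final short block: pad so it is still one write
            os.pwrite(self.fd, bytes(self.buf).ljust(BLOCK, b"\0"),
                      self.written)
            self.written += BLOCK
            self.buf.clear()
```

The cost is that a record straddling a block boundary is not durable until the block fills (or close pads it), which is the usual trade-off against journald's hammer-the-header fsync pattern.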
* Re: is BTRFS_IOC_DEFRAG behavior optimal? 2021-02-11 3:13 ` Zygo Blaxell 2021-02-11 3:39 ` Chris Murphy @ 2021-02-11 3:52 ` Chris Murphy 1 sibling, 0 replies; 19+ messages in thread From: Chris Murphy @ 2021-02-11 3:52 UTC (permalink / raw) To: Zygo Blaxell; +Cc: Goffredo Baroncelli, Chris Murphy, Btrfs BTRFS On Wed, Feb 10, 2021 at 8:13 PM Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote: > > > At file close, the systemd should copy the data to a new file with no > > special attributes and discard or recycle the old inode. This copy > > will be mostly contiguous and have desirable properties like csums and > > compression, and will have iops equivalent to btrfs fi defrag. Or switch to a cow-friendly format that's no worse on overwriting file systems, but improves things on Btrfs and ZFS. RocksDB does well. -- Chris Murphy ^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2021-02-13 0:17 UTC | newest] Thread overview: 19+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2021-02-07 22:06 is BTRFS_IOC_DEFRAG behavior optimal? Chris Murphy 2021-02-08 22:11 ` Goffredo Baroncelli 2021-02-08 22:21 ` Zygo Blaxell 2021-02-09 1:05 ` Chris Murphy 2021-02-09 0:42 ` Chris Murphy 2021-02-09 18:13 ` Goffredo Baroncelli 2021-02-09 19:01 ` Chris Murphy 2021-02-09 19:45 ` Goffredo Baroncelli 2021-02-09 20:26 ` Chris Murphy 2021-02-10 6:37 ` Chris Murphy 2021-02-10 19:14 ` Goffredo Baroncelli 2021-02-11 0:19 ` Chris Murphy 2021-02-11 3:08 ` kreijack 2021-02-11 3:13 ` Zygo Blaxell 2021-02-11 3:39 ` Chris Murphy 2021-02-11 6:12 ` Zygo Blaxell 2021-02-11 8:46 ` Chris Murphy 2021-02-13 0:16 ` Zygo Blaxell 2021-02-11 3:52 ` Chris Murphy
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).