* [PATCH 1/5] mkfs: hoist the internal log size clamp code
2022-03-15 23:23 [PATCHSET 0/5] xfsprogs: stop allowing tiny filesystems Darrick J. Wong
@ 2022-03-15 23:23 ` Darrick J. Wong
2022-03-16 18:18 ` Eric Sandeen
2022-03-15 23:23 ` [PATCH 2/5] mkfs: don't let internal logs consume more than 95% of an AG Darrick J. Wong
` (3 subsequent siblings)
4 siblings, 1 reply; 13+ messages in thread
From: Darrick J. Wong @ 2022-03-15 23:23 UTC (permalink / raw)
To: sandeen, djwong; +Cc: linux-xfs, allison.henderson
From: Darrick J. Wong <djwong@kernel.org>
Move the code that clamps the computation of the internal log size so
that we can begin to enhnace mkfs without turning calculate_log_size
into more spaghetti.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
mkfs/xfs_mkfs.c | 49 +++++++++++++++++++++++++++++--------------------
1 file changed, 29 insertions(+), 20 deletions(-)
diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index 96682f9a..b97bd360 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -3259,6 +3259,34 @@ validate_log_size(uint64_t logblocks, int blocklog, int min_logblocks)
}
}
+static void
+clamp_internal_log_size(
+ struct mkfs_params *cfg,
+ struct xfs_mount *mp,
+ int min_logblocks)
+{
+ /* Ensure the chosen size meets minimum log size requirements */
+ cfg->logblocks = max(min_logblocks, cfg->logblocks);
+
+ /*
+ * Make sure the log fits wholly within an AG
+ *
+ * XXX: If agf->freeblks ends up as 0 because the log uses all
+ * the free space, it causes the kernel all sorts of problems
+ * with per-ag reservations. Right now just back it off one
+ * block, but there's a whole can of worms here that needs to be
+ * opened to decide what is the valid maximum size of a log in
+ * an AG.
+ */
+ cfg->logblocks = min(cfg->logblocks,
+ libxfs_alloc_ag_max_usable(mp) - 1);
+
+ /* and now clamp the size to the maximum supported size */
+ cfg->logblocks = min(cfg->logblocks, XFS_MAX_LOG_BLOCKS);
+ if ((cfg->logblocks << cfg->blocklog) > XFS_MAX_LOG_BYTES)
+ cfg->logblocks = XFS_MAX_LOG_BYTES >> cfg->blocklog;
+}
+
static void
calculate_log_size(
struct mkfs_params *cfg,
@@ -3331,26 +3359,7 @@ _("external log device size %lld blocks too small, must be at least %lld blocks\
cfg->logblocks = cfg->logblocks >> cfg->blocklog;
}
- /* Ensure the chosen size meets minimum log size requirements */
- cfg->logblocks = max(min_logblocks, cfg->logblocks);
-
- /*
- * Make sure the log fits wholly within an AG
- *
- * XXX: If agf->freeblks ends up as 0 because the log uses all
- * the free space, it causes the kernel all sorts of problems
- * with per-ag reservations. Right now just back it off one
- * block, but there's a whole can of worms here that needs to be
- * opened to decide what is the valid maximum size of a log in
- * an AG.
- */
- cfg->logblocks = min(cfg->logblocks,
- libxfs_alloc_ag_max_usable(mp) - 1);
-
- /* and now clamp the size to the maximum supported size */
- cfg->logblocks = min(cfg->logblocks, XFS_MAX_LOG_BLOCKS);
- if ((cfg->logblocks << cfg->blocklog) > XFS_MAX_LOG_BYTES)
- cfg->logblocks = XFS_MAX_LOG_BYTES >> cfg->blocklog;
+ clamp_internal_log_size(cfg, mp, min_logblocks);
validate_log_size(cfg->logblocks, cfg->blocklog, min_logblocks);
}
^ permalink raw reply related [flat|nested] 13+ messages in thread* Re: [PATCH 1/5] mkfs: hoist the internal log size clamp code
2022-03-15 23:23 ` [PATCH 1/5] mkfs: hoist the internal log size clamp code Darrick J. Wong
@ 2022-03-16 18:18 ` Eric Sandeen
0 siblings, 0 replies; 13+ messages in thread
From: Eric Sandeen @ 2022-03-16 18:18 UTC (permalink / raw)
To: Darrick J. Wong, sandeen; +Cc: linux-xfs, allison.henderson
On 3/15/22 6:23 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Move the code that clamps the computation of the internal log size so
> that we can begin to enhnace mkfs without turning calculate_log_size
> into more spaghetti.
>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
> mkfs/xfs_mkfs.c | 49 +++++++++++++++++++++++++++++--------------------
> 1 file changed, 29 insertions(+), 20 deletions(-)
no functional changes detected ;)
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 2/5] mkfs: don't let internal logs consume more than 95% of an AG
2022-03-15 23:23 [PATCHSET 0/5] xfsprogs: stop allowing tiny filesystems Darrick J. Wong
2022-03-15 23:23 ` [PATCH 1/5] mkfs: hoist the internal log size clamp code Darrick J. Wong
@ 2022-03-15 23:23 ` Darrick J. Wong
2022-03-16 18:50 ` Eric Sandeen
2022-03-15 23:23 ` [PATCH 3/5] mkfs: increase the minimum log size to 64MB when possible Darrick J. Wong
` (2 subsequent siblings)
4 siblings, 1 reply; 13+ messages in thread
From: Darrick J. Wong @ 2022-03-15 23:23 UTC (permalink / raw)
To: sandeen, djwong; +Cc: linux-xfs, allison.henderson
From: Darrick J. Wong <djwong@kernel.org>
Currently, we don't let an internal log consume every last block in an
AG. According to the comment, we're doing this to avoid tripping AGF
verifiers if freeblks==0, but on a modern filesystem this isn't
sufficient to avoid problems. First, the per-AG reservations for
reflink and rmap claim up to about 1.7% of each AG for btree expansion,
and secondly, we need to have enough space in the AG to allocate the
root inode chunk, if it should be the case that the log ends up in AG 0.
We don't care about nonredundant (i.e. agcount==1) filesystems, but it
can also happen if the user passes in -lagnum=0.
Change this constraint so that we can't leave less than 5% free space
after allocating the log. This is perhaps a bit much, but as we're
about to disallow tiny filesystems anyway, it seems unlikely to cause
problems with scenarios that we care about.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
mkfs/xfs_mkfs.c | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)
diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index b97bd360..ad776492 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -3271,15 +3271,18 @@ clamp_internal_log_size(
/*
* Make sure the log fits wholly within an AG
*
- * XXX: If agf->freeblks ends up as 0 because the log uses all
- * the free space, it causes the kernel all sorts of problems
- * with per-ag reservations. Right now just back it off one
- * block, but there's a whole can of worms here that needs to be
- * opened to decide what is the valid maximum size of a log in
- * an AG.
+ * XXX: If agf->freeblks ends up as 0 because the log uses all the free
+ * space, it causes the kernel all sorts of problems with per-AG
+ * reservations. The reservations are only supposed to take 2% of the
+ * AG, but there's a further problem that if the log ends up in AG 0,
+ * we also need space to allocate the root directory inode chunk.
+ *
+ * Right now just back it off by 5%, but there's a whole can of worms
+ * here that needs to be opened to decide what is the valid maximum
+ * size of a log in an AG.
*/
cfg->logblocks = min(cfg->logblocks,
- libxfs_alloc_ag_max_usable(mp) - 1);
+ libxfs_alloc_ag_max_usable(mp) * 95 / 100);
/* and now clamp the size to the maximum supported size */
cfg->logblocks = min(cfg->logblocks, XFS_MAX_LOG_BLOCKS);
^ permalink raw reply related [flat|nested] 13+ messages in thread* Re: [PATCH 2/5] mkfs: don't let internal logs consume more than 95% of an AG
2022-03-15 23:23 ` [PATCH 2/5] mkfs: don't let internal logs consume more than 95% of an AG Darrick J. Wong
@ 2022-03-16 18:50 ` Eric Sandeen
2022-03-31 5:21 ` Eric Sandeen
0 siblings, 1 reply; 13+ messages in thread
From: Eric Sandeen @ 2022-03-16 18:50 UTC (permalink / raw)
To: Darrick J. Wong, sandeen; +Cc: linux-xfs, allison.henderson
On 3/15/22 6:23 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Currently, we don't let an internal log consume every last block in an
> AG. According to the comment, we're doing this to avoid tripping AGF
> verifiers if freeblks==0, but on a modern filesystem this isn't
> sufficient to avoid problems. First, the per-AG reservations for
> reflink and rmap claim up to about 1.7% of each AG for btree expansion,
Hm, will that be a factor if the log consumes every last block in that
AG? Or is the problem that if we consume "most" blocks, that leaves the
possibility of reflink/rmap btree expansion subsequently failing because
we do have a little room for new allocations in that AG?
Or is it a problem right out of the gate because the per-ag reservations
collide with a maximal log before the filesystem is even in use?
> and secondly, we need to have enough space in the AG to allocate the
> root inode chunk, if it should be the case that the log ends up in AG 0.
> We don't care about nonredundant (i.e. agcount==1) filesystems, but it
> can also happen if the user passes in -lagnum=0.
>
> Change this constraint so that we can't leave less than 5% free space
> after allocating the log. This is perhaps a bit much, but as we're
> about to disallow tiny filesystems anyway, it seems unlikely to cause
> problems with scenarios that we care about.
This is only modifying the case where we automatically calculated a
log size, and doesn't affect a manually-specified size. Is that
intentional? (I guess we already had this discrepancy, whether it was
the old "-1" heuristic or the new "95%" heuristic...
But 5% is likely to be a fair bit bigger than 1 block, so I'm wondering
if the manually-specified case needs to be limited as well.
Thanks,
-Eric
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/5] mkfs: don't let internal logs consume more than 95% of an AG
2022-03-16 18:50 ` Eric Sandeen
@ 2022-03-31 5:21 ` Eric Sandeen
2022-03-31 16:20 ` Darrick J. Wong
0 siblings, 1 reply; 13+ messages in thread
From: Eric Sandeen @ 2022-03-31 5:21 UTC (permalink / raw)
To: Eric Sandeen, Darrick J. Wong; +Cc: linux-xfs, allison.henderson
On 3/16/22 8:50 AM, Eric Sandeen wrote:
> On 3/15/22 6:23 PM, Darrick J. Wong wrote:
>> From: Darrick J. Wong <djwong@kernel.org>
>>
>> Currently, we don't let an internal log consume every last block in an
>> AG. According to the comment, we're doing this to avoid tripping AGF
>> verifiers if freeblks==0, but on a modern filesystem this isn't
>> sufficient to avoid problems. First, the per-AG reservations for
>> reflink and rmap claim up to about 1.7% of each AG for btree expansion,
>
> Hm, will that be a factor if the log consumes every last block in that
> AG? Or is the problem that if we consume "most" blocks, that leaves the
> possibility of reflink/rmap btree expansion subsequently failing because
> we do have a little room for new allocations in that AG?
>
> Or is it a problem right out of the gate because the per-ag reservations
> collide with a maximal log before the filesystem is even in use?
Darrick, any comment on this? What did you actually run into that prompted
this change?
Still bugs me a little that a manually-sized log escapes this limit, and if
it's needed for proper functioning, we should probably enforce it everywhere.
I do understand that the existing code only validates auto-sized logs. But
I don't want to sweep this under the rug, even if we choose to not fix it all
right now.
Mostly looking for clarification on the what fails and how, with the current
code.
Thanks,
-Eric
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/5] mkfs: don't let internal logs consume more than 95% of an AG
2022-03-31 5:21 ` Eric Sandeen
@ 2022-03-31 16:20 ` Darrick J. Wong
0 siblings, 0 replies; 13+ messages in thread
From: Darrick J. Wong @ 2022-03-31 16:20 UTC (permalink / raw)
To: Eric Sandeen; +Cc: Eric Sandeen, linux-xfs, allison.henderson
On Wed, Mar 30, 2022 at 07:21:36PM -1000, Eric Sandeen wrote:
> On 3/16/22 8:50 AM, Eric Sandeen wrote:
> > On 3/15/22 6:23 PM, Darrick J. Wong wrote:
> >> From: Darrick J. Wong <djwong@kernel.org>
> >>
> >> Currently, we don't let an internal log consume every last block in an
> >> AG. According to the comment, we're doing this to avoid tripping AGF
> >> verifiers if freeblks==0, but on a modern filesystem this isn't
> >> sufficient to avoid problems. First, the per-AG reservations for
> >> reflink and rmap claim up to about 1.7% of each AG for btree expansion,
> >
> > Hm, will that be a factor if the log consumes every last block in that
> > AG? Or is the problem that if we consume "most" blocks, that leaves the
> > possibility of reflink/rmap btree expansion subsequently failing because
> > we do have a little room for new allocations in that AG?
> >
> > Or is it a problem right out of the gate because the per-ag reservations
> > collide with a maximal log before the filesystem is even in use?
>
> Darrick, any comment on this? What did you actually run into that prompted
> this change?
Oops, sorry, forgot to reply to this.
The per-AG reservations shouldn't be a problem on a modern kernel
because we assume an internal log never moves or grows, so we only need
to account for rmap/refcount btree expansions to cover space used
elsewhere in the AG. That said, *one* block is cutting it really close.
Also, earlier kernels (4.9 era) weren't smart about that, though 4.9 is
dead according to gregkh and any kernel that says rmap/reflink are
experimental should not be used.
The problem that I saw is that you could trick the default calculations
into making the log to consume so much space that either (a) the root
directory gets allocated in the next AG, or if you were <cough>
formatting a single AG then mkfs just exits with a cryptic message about
ENOSPC.
It's fairly difficult to trigger this for a plain old block device, but
if mkfs thinks the disk has a gigantic stripe unit, it'll go around
putting the root directory at a much higher block number than is usual.
For the most part the "max_ag_blocks - 1" was actually fine, but not so
much for the case where the AG size is close to 64MB.
So in the end the 5% figure is still rather handwavy wormcanning.
> Still bugs me a little that a manually-sized log escapes this limit, and if
> it's needed for proper functioning, we should probably enforce it everywhere.
<nod> I guess the same limits ought to be applied to explicit logsize to
improve the "Well if it hurts don't do that!" experience. ;)
> I do understand that the existing code only validates auto-sized logs. But
> I don't want to sweep this under the rug, even if we choose to not fix it all
> right now.
>
> Mostly looking for clarification on the what fails and how, with the current
> code.
<nod> I'll try to add a sample mkfs failure to the commit message.
--D
>
> Thanks,
> -Eric
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 3/5] mkfs: increase the minimum log size to 64MB when possible
2022-03-15 23:23 [PATCHSET 0/5] xfsprogs: stop allowing tiny filesystems Darrick J. Wong
2022-03-15 23:23 ` [PATCH 1/5] mkfs: hoist the internal log size clamp code Darrick J. Wong
2022-03-15 23:23 ` [PATCH 2/5] mkfs: don't let internal logs consume more than 95% of an AG Darrick J. Wong
@ 2022-03-15 23:23 ` Darrick J. Wong
2022-03-16 19:27 ` Eric Sandeen
2022-03-25 17:48 ` Eric Sandeen
2022-03-15 23:23 ` [PATCH 4/5] mkfs: stop allowing tiny filesystems Darrick J. Wong
2022-03-15 23:23 ` [PATCH 5/5] mkfs: simplify the default log size ratio computation Darrick J. Wong
4 siblings, 2 replies; 13+ messages in thread
From: Darrick J. Wong @ 2022-03-15 23:23 UTC (permalink / raw)
To: sandeen, djwong; +Cc: linux-xfs, allison.henderson
From: Darrick J. Wong <djwong@kernel.org>
Recently, the upstream maintainers have been taking a lot of heat on
account of writer threads encountering high latency when asking for log
grant space when the log is small. The reported use case is a heavily
threaded indexing product logging trace information to a filesystem
ranging in size between 20 and 250GB. The meetings that result from the
complaints about latency and stall warnings in dmesg both from this use
case and also a large well known cloud product are now consuming 25% of
the maintainer's weekly time and have been for months.
For small filesystems, the log is small by default because we have
defaulted to a ratio of 1:2048 (or even less). For grown filesystems,
this is even worse, because big filesystems generate big metadata.
However, the log size is still insufficient even if it is formatted at
the larger size.
On a 220GB filesystem, the 99.95% latencies observed with a 200-writer
file synchronous append workload running on a 44-AG filesystem (with 44
CPUs) spread across 4 hard disks showed:
99.5%
Log(MB) Latency(ms) BW (MB/s) xlog_grant_head_wait
10 520 243 1875
20 220 308 540
40 140 360 6
80 92 363 0
160 86 364 0
For 4 NVME, the results were:
10 201 409 898
20 177 488 144
40 122 550 0
80 120 549 0
160 121 545 0
This shows pretty clearly that we could reduce the amount of time that
threads spend waiting on the XFS log by increasing the log size to at
least 40MB regardless of size. We then repeated the benchmark with a
cloud system and an old machine to see if there were any ill effects on
less stable hardware.
For cloudy iscsi block storage, the results were:
10 390 176 2584
20 173 186 357
40 37 187 0
80 40 183 0
160 37 183 0
A decade-old machine w/ 24 CPUs and a giant spinning disk RAID6 array
produced this:
10 55 5.4 0
20 40 5.9 0
40 62 5.7 0
80 66 5.7 0
160 25 5.4 0
From the first three scenarios, it is clear that there are gains to be
had by sizing the log somewhere between 40 and 80MB -- the long tail
latency drops quite a bit, and programs are no longer blocking on the
log's transaction space grant heads. Split the difference and set the
log size floor to 64MB.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
mkfs/xfs_mkfs.c | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)
diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index ad776492..84dbb799 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -18,6 +18,14 @@
#define GIGABYTES(count, blog) ((uint64_t)(count) << (30 - (blog)))
#define MEGABYTES(count, blog) ((uint64_t)(count) << (20 - (blog)))
+/*
+ * Realistically, the log should never be smaller than 64MB. Studies by the
+ * kernel maintainer in early 2022 have shown a dramatic reduction in long tail
+ * latency of the xlog grant head waitqueue when running a heavy metadata
+ * update workload when the log size is at least 64MB.
+ */
+#define XFS_MIN_REALISTIC_LOG_BLOCKS(blog) (MEGABYTES(64, (blog)))
+
/*
* Use this macro before we have superblock and mount structure to
* convert from basic blocks to filesystem blocks.
@@ -3259,6 +3267,28 @@ validate_log_size(uint64_t logblocks, int blocklog, int min_logblocks)
}
}
+/*
+ * Ensure that the log is large enough to provide reasonable performance on a
+ * modern system.
+ */
+static void
+calc_realistic_log_size(
+ struct mkfs_params *cfg)
+{
+ unsigned int realistic_log_blocks;
+
+ realistic_log_blocks = XFS_MIN_REALISTIC_LOG_BLOCKS(cfg->blocklog);
+
+ /*
+ * If the "realistic" size is more than 7/8 of the AG, this is a tiny
+ * filesystem and we don't care.
+ */
+ if (realistic_log_blocks > (cfg->agsize * 7 / 8))
+ return;
+
+ cfg->logblocks = max(cfg->logblocks, realistic_log_blocks);
+}
+
static void
clamp_internal_log_size(
struct mkfs_params *cfg,
@@ -3362,6 +3392,8 @@ _("external log device size %lld blocks too small, must be at least %lld blocks\
cfg->logblocks = cfg->logblocks >> cfg->blocklog;
}
+ calc_realistic_log_size(cfg);
+
clamp_internal_log_size(cfg, mp, min_logblocks);
validate_log_size(cfg->logblocks, cfg->blocklog, min_logblocks);
^ permalink raw reply related [flat|nested] 13+ messages in thread* Re: [PATCH 3/5] mkfs: increase the minimum log size to 64MB when possible
2022-03-15 23:23 ` [PATCH 3/5] mkfs: increase the minimum log size to 64MB when possible Darrick J. Wong
@ 2022-03-16 19:27 ` Eric Sandeen
2022-03-25 17:48 ` Eric Sandeen
1 sibling, 0 replies; 13+ messages in thread
From: Eric Sandeen @ 2022-03-16 19:27 UTC (permalink / raw)
To: Darrick J. Wong, sandeen; +Cc: linux-xfs, allison.henderson
On 3/15/22 6:23 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Recently, the upstream maintainers have been taking a lot of heat on
> account of writer threads encountering high latency when asking for log
> grant space when the log is small. The reported use case is a heavily
> threaded indexing product logging trace information to a filesystem
> ranging in size between 20 and 250GB. The meetings that result from the
> complaints about latency and stall warnings in dmesg both from this use
> case and also a large well known cloud product are now consuming 25% of
> the maintainer's weekly time and have been for months.
And we don't want that!
> For small filesystems, the log is small by default because we have
> defaulted to a ratio of 1:2048 (or even less). For grown filesystems,
> this is even worse, because big filesystems generate big metadata.
> However, the log size is still insufficient even if it is formatted at
> the larger size.
I have no complaints about raising the log size like this; I think it's prudent,
even if it doesn't solve world hunger and bring global peace.
I do want to give a little more thought to how calculate_log_size() looks now;
it's inherited spaghetti but my eye twitches a little bit when we follow this:
* For small filesystems, we want to use the
* XFS_MIN_LOG_BYTES for filesystems smaller than 16G if
* at all possible,
with "haha no actually we should calculate something realistic" ;)
I don't want to hold this up for long but I want to put a little thought into
whether the resulting code can be a bit more understandable, there's so much
pingponging around about clamping the size up, down, bigger, smaller, this
heuristic, that heuristic I'm having trouble keeping it straight in my
old brain.
Thanks,
-Eric
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: [PATCH 3/5] mkfs: increase the minimum log size to 64MB when possible
2022-03-15 23:23 ` [PATCH 3/5] mkfs: increase the minimum log size to 64MB when possible Darrick J. Wong
2022-03-16 19:27 ` Eric Sandeen
@ 2022-03-25 17:48 ` Eric Sandeen
2022-03-25 22:08 ` Darrick J. Wong
1 sibling, 1 reply; 13+ messages in thread
From: Eric Sandeen @ 2022-03-25 17:48 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs, allison.henderson
Darrick, it bugged me a bit that the way this patch worked was to go through
one set of calculations to sort out a log size, then throw in a new heuristic
or two to change that result. And TBH the 7/8 bit seems very ad hoc when
we already had the prior 95% heuristic.
While I know that most of this goes away by the last patch, I'm considering
deferring patches 4 & 5 for the next release because of the impact on
fstests, so would like to land somewhere clean by patch 3.
What do you think of this patch as a replacement for this patch 3/5? It's a
bit clearer to me, and results in (almost) the same net results as your patch,
with minor differences around 512MB filesystems due to the removal of the
7/8 limit.
Depending on what you think, I can tweak your commit log as needed, leave
your authorship, and add a:
[sandeen: consolidate heuristics into something a bit more clear]
or something like that... or, claim authorship (and all ensuing bugs) ;)
Thanks,
-Eric
diff --git a/include/xfs_multidisk.h b/include/xfs_multidisk.h
index a16a9fe2..ef4443b0 100644
--- a/include/xfs_multidisk.h
+++ b/include/xfs_multidisk.h
@@ -17,8 +17,6 @@
#define XFS_MIN_INODE_PERBLOCK 2 /* min inodes per block */
#define XFS_DFL_IMAXIMUM_PCT 25 /* max % of space for inodes */
#define XFS_MIN_REC_DIRSIZE 12 /* 4096 byte dirblocks (V2) */
-#define XFS_DFL_LOG_FACTOR 5 /* default log size, factor */
- /* with max trans reservation */
#define XFS_MAX_INODE_SIG_BITS 32 /* most significant bits in an
* inode number that we'll
* accept w/o warnings
diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index ad776492..cf1d64a2 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -18,6 +18,14 @@
#define GIGABYTES(count, blog) ((uint64_t)(count) << (30 - (blog)))
#define MEGABYTES(count, blog) ((uint64_t)(count) << (20 - (blog)))
+/*
+ * Realistically, the log should never be smaller than 64MB. Studies by the
+ * kernel maintainer in early 2022 have shown a dramatic reduction in long tail
+ * latency of the xlog grant head waitqueue when running a heavy metadata
+ * update workload when the log size is at least 64MB.
+ */
+#define XFS_MIN_REALISTIC_LOG_BLOCKS(blog) (MEGABYTES(64, (blog)))
+
/*
* Use this macro before we have superblock and mount structure to
* convert from basic blocks to filesystem blocks.
@@ -3297,7 +3305,7 @@ calculate_log_size(
struct xfs_mount *mp)
{
struct xfs_sb *sbp = &mp->m_sb;
- int min_logblocks;
+ int min_logblocks; /* absolute minimum */
struct xfs_mount mount;
/* we need a temporary mount to calculate the minimum log size. */
@@ -3339,29 +3347,19 @@ _("external log device size %lld blocks too small, must be at least %lld blocks\
/* internal log - if no size specified, calculate automatically */
if (!cfg->logblocks) {
- if (cfg->dblocks < GIGABYTES(1, cfg->blocklog)) {
- /* tiny filesystems get minimum sized logs. */
- cfg->logblocks = min_logblocks;
- } else if (cfg->dblocks < GIGABYTES(16, cfg->blocklog)) {
+ /* Use a 2048:1 fs:log ratio for most filesystems */
+ cfg->logblocks = (cfg->dblocks << cfg->blocklog) / 2048;
+ cfg->logblocks = cfg->logblocks >> cfg->blocklog;
- /*
- * For small filesystems, we want to use the
- * XFS_MIN_LOG_BYTES for filesystems smaller than 16G if
- * at all possible, ramping up to 128MB at 256GB.
- */
- cfg->logblocks = min(XFS_MIN_LOG_BYTES >> cfg->blocklog,
- min_logblocks * XFS_DFL_LOG_FACTOR);
- } else {
- /*
- * With a 2GB max log size, default to maximum size
- * at 4TB. This keeps the same ratio from the older
- * max log size of 128M at 256GB fs size. IOWs,
- * the ratio of fs size to log size is 2048:1.
- */
- cfg->logblocks = (cfg->dblocks << cfg->blocklog) / 2048;
- cfg->logblocks = cfg->logblocks >> cfg->blocklog;
- }
+ /* But don't go below a reasonable size */
+ cfg->logblocks = max(cfg->logblocks,
+ XFS_MIN_REALISTIC_LOG_BLOCKS(cfg->blocklog));
+
+ /* And for a tiny filesystem, use the absolute minimum size */
+ if (cfg->dblocks < MEGABYTES(512, cfg->blocklog))
+ cfg->logblocks = min_logblocks;
+ /* Enforce min/max limits */
clamp_internal_log_size(cfg, mp, min_logblocks);
validate_log_size(cfg->logblocks, cfg->blocklog, min_logblocks);
^ permalink raw reply related [flat|nested] 13+ messages in thread* Re: [PATCH 3/5] mkfs: increase the minimum log size to 64MB when possible
2022-03-25 17:48 ` Eric Sandeen
@ 2022-03-25 22:08 ` Darrick J. Wong
0 siblings, 0 replies; 13+ messages in thread
From: Darrick J. Wong @ 2022-03-25 22:08 UTC (permalink / raw)
To: Eric Sandeen; +Cc: linux-xfs, allison.henderson
On Fri, Mar 25, 2022 at 07:48:08AM -1000, Eric Sandeen wrote:
> Darrick, it bugged me a bit that the way this patch worked was to go through
> one set of calculations to sort out a log size, then throw in a new heuristic
> or two to change that result. And TBH the 7/8 bit seems very ad hoc when
> we already had the prior 95% heuristic.
>
> While I know that most of this goes away by the last patch, I'm considering
> deferring patches 4 & 5 for the next release because of the impact on
> fstests, so would like to land somewhere clean by patch 3.
>
> What do you think of this patch as a replacement for this patch 3/5? It's a
> bit clearer to me, and results in (almost) the same net results as your patch,
> with minor differences around 512MB filesystems due to the removal of the
> 7/8 limit.
>
> Depending on what you think, I can tweak your commit log as needed, leave
> your authorship, and add a:
>
> [sandeen: consolidate heuristics into something a bit more clear]
>
> or something like that... or, claim authorship (and all ensuing bugs) ;)
Man, my email is slow today!
This looks ok to me, though at this point you've basically reauthored
the whole thing. You might as well own it and add:
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
--D
>
> Thanks,
> -Eric
>
> diff --git a/include/xfs_multidisk.h b/include/xfs_multidisk.h
> index a16a9fe2..ef4443b0 100644
> --- a/include/xfs_multidisk.h
> +++ b/include/xfs_multidisk.h
> @@ -17,8 +17,6 @@
> #define XFS_MIN_INODE_PERBLOCK 2 /* min inodes per block */
> #define XFS_DFL_IMAXIMUM_PCT 25 /* max % of space for inodes */
> #define XFS_MIN_REC_DIRSIZE 12 /* 4096 byte dirblocks (V2) */
> -#define XFS_DFL_LOG_FACTOR 5 /* default log size, factor */
> - /* with max trans reservation */
> #define XFS_MAX_INODE_SIG_BITS 32 /* most significant bits in an
> * inode number that we'll
> * accept w/o warnings
> diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
> index ad776492..cf1d64a2 100644
> --- a/mkfs/xfs_mkfs.c
> +++ b/mkfs/xfs_mkfs.c
> @@ -18,6 +18,14 @@
> #define GIGABYTES(count, blog) ((uint64_t)(count) << (30 - (blog)))
> #define MEGABYTES(count, blog) ((uint64_t)(count) << (20 - (blog)))
>
> +/*
> + * Realistically, the log should never be smaller than 64MB. Studies by the
> + * kernel maintainer in early 2022 have shown a dramatic reduction in long tail
> + * latency of the xlog grant head waitqueue when running a heavy metadata
> + * update workload when the log size is at least 64MB.
> + */
> +#define XFS_MIN_REALISTIC_LOG_BLOCKS(blog) (MEGABYTES(64, (blog)))
> +
> /*
> * Use this macro before we have superblock and mount structure to
> * convert from basic blocks to filesystem blocks.
> @@ -3297,7 +3305,7 @@ calculate_log_size(
> struct xfs_mount *mp)
> {
> struct xfs_sb *sbp = &mp->m_sb;
> - int min_logblocks;
> + int min_logblocks; /* absolute minimum */
> struct xfs_mount mount;
>
> /* we need a temporary mount to calculate the minimum log size. */
> @@ -3339,29 +3347,19 @@ _("external log device size %lld blocks too small, must be at least %lld blocks\
>
> /* internal log - if no size specified, calculate automatically */
> if (!cfg->logblocks) {
> - if (cfg->dblocks < GIGABYTES(1, cfg->blocklog)) {
> - /* tiny filesystems get minimum sized logs. */
> - cfg->logblocks = min_logblocks;
> - } else if (cfg->dblocks < GIGABYTES(16, cfg->blocklog)) {
> + /* Use a 2048:1 fs:log ratio for most filesystems */
> + cfg->logblocks = (cfg->dblocks << cfg->blocklog) / 2048;
> + cfg->logblocks = cfg->logblocks >> cfg->blocklog;
>
> - /*
> - * For small filesystems, we want to use the
> - * XFS_MIN_LOG_BYTES for filesystems smaller than 16G if
> - * at all possible, ramping up to 128MB at 256GB.
> - */
> - cfg->logblocks = min(XFS_MIN_LOG_BYTES >> cfg->blocklog,
> - min_logblocks * XFS_DFL_LOG_FACTOR);
> - } else {
> - /*
> - * With a 2GB max log size, default to maximum size
> - * at 4TB. This keeps the same ratio from the older
> - * max log size of 128M at 256GB fs size. IOWs,
> - * the ratio of fs size to log size is 2048:1.
> - */
> - cfg->logblocks = (cfg->dblocks << cfg->blocklog) / 2048;
> - cfg->logblocks = cfg->logblocks >> cfg->blocklog;
> - }
> + /* But don't go below a reasonable size */
> + cfg->logblocks = max(cfg->logblocks,
> + XFS_MIN_REALISTIC_LOG_BLOCKS(cfg->blocklog));
> +
> + /* And for a tiny filesystem, use the absolute minimum size */
> + if (cfg->dblocks < MEGABYTES(512, cfg->blocklog))
> + cfg->logblocks = min_logblocks;
>
> + /* Enforce min/max limits */
> clamp_internal_log_size(cfg, mp, min_logblocks);
>
> validate_log_size(cfg->logblocks, cfg->blocklog, min_logblocks);
>
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 4/5] mkfs: stop allowing tiny filesystems
2022-03-15 23:23 [PATCHSET 0/5] xfsprogs: stop allowing tiny filesystems Darrick J. Wong
` (2 preceding siblings ...)
2022-03-15 23:23 ` [PATCH 3/5] mkfs: increase the minimum log size to 64MB when possible Darrick J. Wong
@ 2022-03-15 23:23 ` Darrick J. Wong
2022-03-15 23:23 ` [PATCH 5/5] mkfs: simplify the default log size ratio computation Darrick J. Wong
4 siblings, 0 replies; 13+ messages in thread
From: Darrick J. Wong @ 2022-03-15 23:23 UTC (permalink / raw)
To: sandeen, djwong; +Cc: linux-xfs, allison.henderson
From: Darrick J. Wong <djwong@kernel.org>
Refuse to format a filesystem that are "too small", because these
configurations are known to have performance and redundancy problems
that are not present on the volume sizes that XFS is best at handling.
Specifically, this means that we won't allow logs smaller than 64MB, we
won't allow single-AG filesystems, and we won't allow volumes smaller
than 300MB. There are two exceptions: the first is an undocumented CLI
option that can be used for crafting debug filesystems.
The second exception is that if fstests is detected, because there are a
lot of fstests that use tiny filesystems to perform targeted regression
and functional testing in a controlled environment. Fixing the ~40 or
so tests to run more slowly with larger filesystems isn't worth the risk
of breaking the tests.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
mkfs/xfs_mkfs.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 81 insertions(+), 1 deletion(-)
diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index 84dbb799..239d529c 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -838,6 +838,7 @@ struct cli_params {
int64_t logagno;
int loginternal;
int lsunit;
+ int has_warranty;
/* parameters where 0 is not a valid value */
int64_t agcount;
@@ -2463,6 +2464,68 @@ _("illegal CoW extent size hint %lld, must be less than %u.\n"),
}
}
+/* Complain if this filesystem is not a supported configuration. */
+static void
+validate_warranty(
+ struct xfs_mount *mp,
+ struct cli_params *cli)
+{
+ /* Undocumented option to enable unsupported tiny filesystems. */
+ if (!cli->has_warranty) {
+ printf(
+ _("Filesystems formatted with --yes-i-know-what-i-am-doing are not supported!!\n"));
+ return;
+ }
+
+ /*
+ * fstests has a large number of tests that create tiny filesystems to
+ * perform specific regression and resource depletion tests in a
+ * controlled environment. Avoid breaking fstests by allowing
+ * unsupported configurations if TEST_DIR, TEST_DEV, and QA_CHECK_FS
+ * are all set.
+ */
+ if (getenv("TEST_DIR") && getenv("TEST_DEV") && getenv("QA_CHECK_FS"))
+ return;
+
+ /*
+ * We don't support filesystems smaller than 300MB anymore. Tiny
+ * filesystems have never been XFS' design target. This limit has been
+ * carefully calculated to prevent formatting with a log smaller than
+ * the "realistic" size.
+ *
+ * If the realistic log size is 64MB, there are four AGs, and the log
+ * AG should be at least 1/8 free after formatting, this gives us:
+ *
+ * 64MB * (8 / 7) * 4 = 293MB
+ */
+ if (mp->m_sb.sb_dblocks < MEGABYTES(300, mp->m_sb.sb_blocklog)) {
+ fprintf(stderr,
+ _("Filesystem must be larger than 300MB.\n"));
+ usage();
+ }
+
+ /*
+ * For best performance, we don't allow unrealistically small logs.
+ * See the comment for XFS_MIN_REALISTIC_LOG_BLOCKS.
+ */
+ if (mp->m_sb.sb_logblocks <
+ XFS_MIN_REALISTIC_LOG_BLOCKS(mp->m_sb.sb_blocklog)) {
+ fprintf(stderr,
+ _("Log size must be at least 64MB.\n"));
+ usage();
+ }
+
+ /*
+ * Filesystems should not have fewer than two AGs, because we need to
+ * have redundant superblocks.
+ */
+ if (mp->m_sb.sb_agcount < 2) {
+ fprintf(stderr,
+ _("Filesystem must have redundant superblocks!\n"));
+ usage();
+ }
+}
+
/*
* Validate the configured stripe geometry, or is none is specified, pull
* the configuration from the underlying device.
@@ -3892,9 +3955,21 @@ main(
struct cli_params cli = {
.xi = &xi,
.loginternal = 1,
+ .has_warranty = 1,
};
struct mkfs_params cfg = {};
+ struct option long_options[] = {
+ {
+ .name = "yes-i-know-what-i-am-doing",
+ .has_arg = no_argument,
+ .flag = &cli.has_warranty,
+ .val = 0,
+ },
+ {NULL, 0, NULL, 0 },
+ };
+ int option_index = 0;
+
/* build time defaults */
struct mkfs_default_params dft = {
.source = _("package build definitions"),
@@ -3953,8 +4028,11 @@ main(
memcpy(&cli.sb_feat, &dft.sb_feat, sizeof(cli.sb_feat));
memcpy(&cli.fsx, &dft.fsx, sizeof(cli.fsx));
- while ((c = getopt(argc, argv, "b:c:d:i:l:L:m:n:KNp:qr:s:CfV")) != EOF) {
+ while ((c = getopt_long(argc, argv, "b:c:d:i:l:L:m:n:KNp:qr:s:CfV",
+ long_options, &option_index)) != EOF) {
switch (c) {
+ case 0:
+ break;
case 'C':
case 'f':
force_overwrite = 1;
@@ -4092,6 +4170,8 @@ main(
validate_extsize_hint(mp, &cli);
validate_cowextsize_hint(mp, &cli);
+ validate_warranty(mp, &cli);
+
/* Print the intended geometry of the fs. */
if (!quiet || dry_run) {
struct xfs_fsop_geom geo;
^ permalink raw reply related [flat|nested] 13+ messages in thread* [PATCH 5/5] mkfs: simplify the default log size ratio computation
2022-03-15 23:23 [PATCHSET 0/5] xfsprogs: stop allowing tiny filesystems Darrick J. Wong
` (3 preceding siblings ...)
2022-03-15 23:23 ` [PATCH 4/5] mkfs: stop allowing tiny filesystems Darrick J. Wong
@ 2022-03-15 23:23 ` Darrick J. Wong
4 siblings, 0 replies; 13+ messages in thread
From: Darrick J. Wong @ 2022-03-15 23:23 UTC (permalink / raw)
To: sandeen, djwong; +Cc: linux-xfs, allison.henderson
From: Darrick J. Wong <djwong@kernel.org>
Now that the minimum default log size is 64MB, we can simplify the ratio
computation because the alternate calculations no longer matter:
fssize oldlogsize newlogsize
16m 3m 3m
256m 5m 64m
512m 5m 64m
1g 10m 64m
4g 10m 64m
8g 10m 64m
16g 10m 64m
32g 16m 64m
64g 32m 64m
128g 64m 64m
220g 110m 110m
256g 128m 128m
512g 256m 256m
1t 512m 512m
10t 2038m 2038m
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
mkfs/xfs_mkfs.c | 30 ++++++++----------------------
1 file changed, 8 insertions(+), 22 deletions(-)
diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index 239d529c..15dcf48a 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -3432,28 +3432,14 @@ _("external log device size %lld blocks too small, must be at least %lld blocks\
/* internal log - if no size specified, calculate automatically */
if (!cfg->logblocks) {
- if (cfg->dblocks < GIGABYTES(1, cfg->blocklog)) {
- /* tiny filesystems get minimum sized logs. */
- cfg->logblocks = min_logblocks;
- } else if (cfg->dblocks < GIGABYTES(16, cfg->blocklog)) {
-
- /*
- * For small filesystems, we want to use the
- * XFS_MIN_LOG_BYTES for filesystems smaller than 16G if
- * at all possible, ramping up to 128MB at 256GB.
- */
- cfg->logblocks = min(XFS_MIN_LOG_BYTES >> cfg->blocklog,
- min_logblocks * XFS_DFL_LOG_FACTOR);
- } else {
- /*
- * With a 2GB max log size, default to maximum size
- * at 4TB. This keeps the same ratio from the older
- * max log size of 128M at 256GB fs size. IOWs,
- * the ratio of fs size to log size is 2048:1.
- */
- cfg->logblocks = (cfg->dblocks << cfg->blocklog) / 2048;
- cfg->logblocks = cfg->logblocks >> cfg->blocklog;
- }
+ /*
+ * With a 2GB max log size, default to maximum size at 4TB.
+ * This keeps the same ratio from the older max log size of
+ * 128M at 256GB fs size. IOWs, the ratio of fs size to log
+ * size is 2048:1.
+ */
+ cfg->logblocks = (cfg->dblocks << cfg->blocklog) / 2048;
+ cfg->logblocks = cfg->logblocks >> cfg->blocklog;
calc_realistic_log_size(cfg);
^ permalink raw reply related [flat|nested] 13+ messages in thread