* xfs_extent_busy_flush vs. aio
@ 2018-01-23 14:57 Avi Kivity
  2018-01-23 15:28 ` Brian Foster
  0 siblings, 1 reply; 20+ messages in thread

From: Avi Kivity @ 2018-01-23 14:57 UTC (permalink / raw)
  To: linux-xfs

I'm seeing the equivalent[*] of xfs_extent_busy_flush() sleeping in my
beautiful io_submit() calls.


Questions:

- Is it correct that RWF_NOWAIT will not detect the condition that led to
the log being forced?

- If so, can it be fixed?

- Can I do something to reduce the odds of this occurring? larger logs,
more logs, flush more often, resurrect extinct species and sacrifice them to
the xfs gods?

- Can an xfs developer do something? For example, make it RWF_NOWAIT
friendly (if the answer to the first question was "correct")


[*] equivalent, because I'm actually looking at an older kernel that lacks
this function. But I'm moderately confident that the xfs_log_force I'm
seeing was transformed into xfs_extent_busy_flush by
ebf55872616c7d4754db5a318591a72a8d5e6896

^ permalink raw reply [flat|nested] 20+ messages in thread
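For reference, the io_submit()-side pattern under discussion looks roughly
like the sketch below: a single aio write tagged with RWF_NOWAIT, so that a
write which would need block allocation (and could therefore end up in a log
force) comes back as -EAGAIN instead of blocking the submitting thread. This
is an illustrative sketch only, not code from the thread: it uses the raw aio
syscalls rather than libaio, assumes a 4.13+ kernel with matching headers for
aio_rw_flags and RWF_NOWAIT, and the file path, I/O size and alignment are
placeholders.

/*
 * Illustrative sketch only (not from the thread): submit one aio write with
 * RWF_NOWAIT so that a write which would need block allocation completes
 * with -EAGAIN instead of blocking.  Raw aio syscalls are used so this does
 * not depend on a libaio version that exposes aio_rw_flags; the file path,
 * I/O size and alignment are placeholders.
 */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/aio_abi.h>              /* struct iocb, IOCB_CMD_PWRITE */
#include <linux/fs.h>                   /* RWF_NOWAIT (4.13+ headers) */

static long sys_io_setup(unsigned nr, aio_context_t *ctx)
{
        return syscall(SYS_io_setup, nr, ctx);
}

static long sys_io_submit(aio_context_t ctx, long n, struct iocb **iocbpp)
{
        return syscall(SYS_io_submit, ctx, n, iocbpp);
}

static long sys_io_getevents(aio_context_t ctx, long min_nr, long nr,
                             struct io_event *ev, struct timespec *tmo)
{
        return syscall(SYS_io_getevents, ctx, min_nr, nr, ev, tmo);
}

int main(void)
{
        aio_context_t ctx = 0;
        struct iocb cb;
        struct iocb *cbs[1] = { &cb };
        struct io_event ev;
        size_t len = 128 * 1024;
        void *buf;
        int fd;

        if (sys_io_setup(8, &ctx) < 0) {
                perror("io_setup");
                return 1;
        }

        /* placeholder path; O_DIRECT because RWF_NOWAIT targets direct I/O */
        fd = open("/var/lib/scylla/testfile",
                  O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0 || posix_memalign(&buf, 4096, len))
                return 1;
        memset(buf, 'x', len);

        memset(&cb, 0, sizeof(cb));
        cb.aio_lio_opcode = IOCB_CMD_PWRITE;
        cb.aio_fildes     = fd;
        cb.aio_buf        = (unsigned long)buf;
        cb.aio_nbytes     = len;
        cb.aio_offset     = 0;
        cb.aio_rw_flags   = RWF_NOWAIT; /* fail fast instead of blocking */

        if (sys_io_submit(ctx, 1, cbs) < 0) {
                /* older kernels reject a nonzero aio_rw_flags with EINVAL;
                 * some versions may also report EAGAIN here directly */
                perror("io_submit");
                return 1;
        }
        if (sys_io_getevents(ctx, 1, 1, &ev, NULL) != 1)
                return 1;

        if ((long)ev.res == -EAGAIN)
                /* would have had to allocate/block: retry later or hand the
                 * write to a helper thread without RWF_NOWAIT */
                fprintf(stderr, "kernel returned EAGAIN for this write\n");
        else if ((long)ev.res < 0)
                fprintf(stderr, "aio error: %ld\n", (long)ev.res);

        close(fd);
        return 0;
}

On older kernels the aio_rw_flags field does not exist (the same bytes are a
reserved field that must be zero), so a nonzero value is simply rejected by
io_submit() with EINVAL there.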
* Re: xfs_extent_busy_flush vs. aio 2018-01-23 14:57 xfs_extent_busy_flush vs. aio Avi Kivity @ 2018-01-23 15:28 ` Brian Foster 2018-01-23 15:45 ` Avi Kivity 0 siblings, 1 reply; 20+ messages in thread From: Brian Foster @ 2018-01-23 15:28 UTC (permalink / raw) To: Avi Kivity; +Cc: linux-xfs On Tue, Jan 23, 2018 at 04:57:03PM +0200, Avi Kivity wrote: > I'm seeing the equivalent[*] of xfs_extent_busy_flush() sleeping in my > beautiful io_submit() calls. > > > Questions: > > - Is it correct that RWF_NOWAIT will not detect the condition that led to > the log being forced? > > - If so, can it be fixed? > > - Can I do something to reduce the odds of this occurring? larger logs, > more logs, flush more often, resurrect extinct species and sacrifice them to > the xfs gods? > > - Can an xfs developer do something? For example, make it RWF_NOWAIT > friendly (if the answer to the first question was "correct") > So RWF_NOWAIT eventually works its way to IOMAP_NOWAIT, which looks like it skips any write call that would require allocation in xfs_file_iomap_begin(). The busy flush should only happen in the block allocation path, so something is missing here. Do you have a backtrace for the log force you're seeing? Brian > > [*] equivalent, because I'm actually looking at an older kernel that lacks > this function. But I'm moderately confident that the xfs_log_force I'm > seeing was transformed into xfs_extent_busy_flush by > ebf55872616c7d4754db5a318591a72a8d5e6896 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: xfs_extent_busy_flush vs. aio 2018-01-23 15:28 ` Brian Foster @ 2018-01-23 15:45 ` Avi Kivity 2018-01-23 16:11 ` Brian Foster 0 siblings, 1 reply; 20+ messages in thread From: Avi Kivity @ 2018-01-23 15:45 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On 01/23/2018 05:28 PM, Brian Foster wrote: > On Tue, Jan 23, 2018 at 04:57:03PM +0200, Avi Kivity wrote: >> I'm seeing the equivalent[*] of xfs_extent_busy_flush() sleeping in my >> beautiful io_submit() calls. >> >> >> Questions: >> >> - Is it correct that RWF_NOWAIT will not detect the condition that led to >> the log being forced? >> >> - If so, can it be fixed? >> >> - Can I do something to reduce the odds of this occurring? larger logs, >> more logs, flush more often, resurrect extinct species and sacrifice them to >> the xfs gods? >> >> - Can an xfs developer do something? For example, make it RWF_NOWAIT >> friendly (if the answer to the first question was "correct") >> > So RWF_NOWAIT eventually works its way to IOMAP_NOWAIT, which looks like > it skips any write call that would require allocation in > xfs_file_iomap_begin(). The busy flush should only happen in the block > allocation path, so something is missing here. Do you have a backtrace > for the log force you're seeing? > > Here's a trace. It's from a kernel that lacks RWF_NOWAIT. 0xffffffff816ab231 : __schedule+0x531/0x9b0 [kernel] 0xffffffff816ab6d9 : schedule+0x29/0x70 [kernel] 0xffffffff816a90e9 : schedule_timeout+0x239/0x2c0 [kernel] 0xffffffff816aba8d : wait_for_completion+0xfd/0x140 [kernel] 0xffffffff810ab41d : flush_work+0xfd/0x190 [kernel] 0xffffffffc00ddb3a : xlog_cil_force_lsn+0x8a/0x210 [xfs] 0xffffffffc00dbbf5 : _xfs_log_force+0x85/0x2c0 [xfs] 0xffffffffc00dbe5c : xfs_log_force+0x2c/0x70 [xfs] 0xffffffffc0078f60 : xfs_alloc_ag_vextent_size+0x250/0x630 [xfs] 0xffffffffc0079ed5 : xfs_alloc_ag_vextent+0xe5/0x150 [xfs] 0xffffffffc007abc6 : xfs_alloc_vextent+0x446/0x5f0 [xfs] 0xffffffffc008b123 : xfs_bmap_btalloc+0x3f3/0x780 [xfs] 0xffffffffc008b4be : xfs_bmap_alloc+0xe/0x10 [xfs] 0xffffffffc008bef9 : xfs_bmapi_write+0x499/0xab0 [xfs] 0xffffffffc00c6ec8 : xfs_iomap_write_direct+0x1b8/0x390 [xfs] ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: xfs_extent_busy_flush vs. aio 2018-01-23 15:45 ` Avi Kivity @ 2018-01-23 16:11 ` Brian Foster 2018-01-23 16:22 ` Avi Kivity 0 siblings, 1 reply; 20+ messages in thread From: Brian Foster @ 2018-01-23 16:11 UTC (permalink / raw) To: Avi Kivity; +Cc: linux-xfs On Tue, Jan 23, 2018 at 05:45:39PM +0200, Avi Kivity wrote: > > > On 01/23/2018 05:28 PM, Brian Foster wrote: > > On Tue, Jan 23, 2018 at 04:57:03PM +0200, Avi Kivity wrote: > > > I'm seeing the equivalent[*] of xfs_extent_busy_flush() sleeping in my > > > beautiful io_submit() calls. > > > > > > > > > Questions: > > > > > > - Is it correct that RWF_NOWAIT will not detect the condition that led to > > > the log being forced? > > > > > > - If so, can it be fixed? > > > > > > - Can I do something to reduce the odds of this occurring? larger logs, > > > more logs, flush more often, resurrect extinct species and sacrifice them to > > > the xfs gods? > > > > > > - Can an xfs developer do something? For example, make it RWF_NOWAIT > > > friendly (if the answer to the first question was "correct") > > > > > So RWF_NOWAIT eventually works its way to IOMAP_NOWAIT, which looks like > > it skips any write call that would require allocation in > > xfs_file_iomap_begin(). The busy flush should only happen in the block > > allocation path, so something is missing here. Do you have a backtrace > > for the log force you're seeing? > > > > > > Here's a trace. It's from a kernel that lacks RWF_NOWAIT. > Oh, so the case below is roughly how I would have expected to hit the flush/wait without RWF_NOWAIT. The latter flag should prevent this, to answer your first question. For the follow up question, I think this should only occur when the fs is fairly low on free space. Is that the case here? I'm not sure there's a specific metric, fwiw, but it's just a matter of attempting an (user data) allocation that only finds busy extents in the free space btrees and thus has to the force the log to satisfy the allocation. I suppose running with more free space available would avoid this. I think running with less in-core log space could indirectly reduce extent busy time, but that may also have other performance ramifications and so is probably not a great idea. Brian > 0xffffffff816ab231 : __schedule+0x531/0x9b0 [kernel] > 0xffffffff816ab6d9 : schedule+0x29/0x70 [kernel] > 0xffffffff816a90e9 : schedule_timeout+0x239/0x2c0 [kernel] > 0xffffffff816aba8d : wait_for_completion+0xfd/0x140 [kernel] > 0xffffffff810ab41d : flush_work+0xfd/0x190 [kernel] > 0xffffffffc00ddb3a : xlog_cil_force_lsn+0x8a/0x210 [xfs] > 0xffffffffc00dbbf5 : _xfs_log_force+0x85/0x2c0 [xfs] > 0xffffffffc00dbe5c : xfs_log_force+0x2c/0x70 [xfs] > 0xffffffffc0078f60 : xfs_alloc_ag_vextent_size+0x250/0x630 [xfs] > 0xffffffffc0079ed5 : xfs_alloc_ag_vextent+0xe5/0x150 [xfs] > 0xffffffffc007abc6 : xfs_alloc_vextent+0x446/0x5f0 [xfs] > 0xffffffffc008b123 : xfs_bmap_btalloc+0x3f3/0x780 [xfs] > 0xffffffffc008b4be : xfs_bmap_alloc+0xe/0x10 [xfs] > 0xffffffffc008bef9 : xfs_bmapi_write+0x499/0xab0 [xfs] > 0xffffffffc00c6ec8 : xfs_iomap_write_direct+0x1b8/0x390 [xfs] > ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: xfs_extent_busy_flush vs. aio 2018-01-23 16:11 ` Brian Foster @ 2018-01-23 16:22 ` Avi Kivity 2018-01-23 16:47 ` Brian Foster 0 siblings, 1 reply; 20+ messages in thread From: Avi Kivity @ 2018-01-23 16:22 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On 01/23/2018 06:11 PM, Brian Foster wrote: > On Tue, Jan 23, 2018 at 05:45:39PM +0200, Avi Kivity wrote: >> >> On 01/23/2018 05:28 PM, Brian Foster wrote: >>> On Tue, Jan 23, 2018 at 04:57:03PM +0200, Avi Kivity wrote: >>>> I'm seeing the equivalent[*] of xfs_extent_busy_flush() sleeping in my >>>> beautiful io_submit() calls. >>>> >>>> >>>> Questions: >>>> >>>> - Is it correct that RWF_NOWAIT will not detect the condition that led to >>>> the log being forced? >>>> >>>> - If so, can it be fixed? >>>> >>>> - Can I do something to reduce the odds of this occurring? larger logs, >>>> more logs, flush more often, resurrect extinct species and sacrifice them to >>>> the xfs gods? >>>> >>>> - Can an xfs developer do something? For example, make it RWF_NOWAIT >>>> friendly (if the answer to the first question was "correct") >>>> >>> So RWF_NOWAIT eventually works its way to IOMAP_NOWAIT, which looks like >>> it skips any write call that would require allocation in >>> xfs_file_iomap_begin(). The busy flush should only happen in the block >>> allocation path, so something is missing here. Do you have a backtrace >>> for the log force you're seeing? >>> >>> >> Here's a trace. It's from a kernel that lacks RWF_NOWAIT. >> > Oh, so the case below is roughly how I would have expected to hit the > flush/wait without RWF_NOWAIT. The latter flag should prevent this, to > answer your first question. Thanks, that's very encouraging. We are exploring recommending upstream-ish kernels to users and customers, given their relative stability these days and aio-related improvements (not to mention the shame of having to admit to running an old kernel when reporting a problem to an upstream list). > > For the follow up question, I think this should only occur when the fs > is fairly low on free space. Is that the case here? No: /dev/md0 3.0T 1.2T 1.8T 40% /var/lib/scylla > I'm not sure there's > a specific metric, fwiw, but it's just a matter of attempting an (user > data) allocation that only finds busy extents in the free space btrees > and thus has to the force the log to satisfy the allocation. What does "busy" mean here? recently freed so we want to force the log to make sure the extent isn't doubly-allocated? (wild guess) > I suppose > running with more free space available would avoid this. I think running > with less in-core log space could indirectly reduce extent busy time, > but that may also have other performance ramifications and so is > probably not a great idea. At 60%, I hope low free space is not a problem. 
btw, I'm also seeing 10ms+ periods of high CPU utilization: 0xffffffff816ab97a : _cond_resched+0x3a/0x50 [kernel] 0xffffffff811e1495 : kmem_cache_alloc+0x35/0x1e0 [kernel] 0xffffffffc00d8477 : kmem_zone_alloc+0x97/0x130 [xfs] 0xffffffffc00deae2 : xfs_buf_item_init+0x42/0x190 [xfs] 0xffffffffc00e89c3 : _xfs_trans_bjoin+0x23/0x60 [xfs] 0xffffffffc00e8f17 : xfs_trans_read_buf_map+0x247/0x400 [xfs] 0xffffffffc008f248 : xfs_btree_read_buf_block.constprop.29+0x78/0xc0 [xfs] 0xffffffffc009221e : xfs_btree_increment+0x21e/0x350 [xfs] 0xffffffffc00796a8 : xfs_alloc_ag_vextent_near+0x368/0xab0 [xfs] 0xffffffffc0079efd : xfs_alloc_ag_vextent+0x10d/0x150 [xfs] 0xffffffffc007abc6 : xfs_alloc_vextent+0x446/0x5f0 [xfs] 0xffffffffc008b123 : xfs_bmap_btalloc+0x3f3/0x780 [xfs] 0xffffffffc008b4be : xfs_bmap_alloc+0xe/0x10 [xfs] Is it normal for xfs to spend 10ms+ of CPU time to allocate an extent? Should I be increasing my extent hint (currently at 32MB)? > > Brian > >> 0xffffffff816ab231 : __schedule+0x531/0x9b0 [kernel] >> 0xffffffff816ab6d9 : schedule+0x29/0x70 [kernel] >> 0xffffffff816a90e9 : schedule_timeout+0x239/0x2c0 [kernel] >> 0xffffffff816aba8d : wait_for_completion+0xfd/0x140 [kernel] >> 0xffffffff810ab41d : flush_work+0xfd/0x190 [kernel] >> 0xffffffffc00ddb3a : xlog_cil_force_lsn+0x8a/0x210 [xfs] >> 0xffffffffc00dbbf5 : _xfs_log_force+0x85/0x2c0 [xfs] >> 0xffffffffc00dbe5c : xfs_log_force+0x2c/0x70 [xfs] >> 0xffffffffc0078f60 : xfs_alloc_ag_vextent_size+0x250/0x630 [xfs] >> 0xffffffffc0079ed5 : xfs_alloc_ag_vextent+0xe5/0x150 [xfs] >> 0xffffffffc007abc6 : xfs_alloc_vextent+0x446/0x5f0 [xfs] >> 0xffffffffc008b123 : xfs_bmap_btalloc+0x3f3/0x780 [xfs] >> 0xffffffffc008b4be : xfs_bmap_alloc+0xe/0x10 [xfs] >> 0xffffffffc008bef9 : xfs_bmapi_write+0x499/0xab0 [xfs] >> 0xffffffffc00c6ec8 : xfs_iomap_write_direct+0x1b8/0x390 [xfs] >> ^ permalink raw reply [flat|nested] 20+ messages in thread
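As context for the 32MB figure above: that is the per-inode extent size hint,
which is set through the FSSETXATTR ioctl, either directly on a file or on a
directory so that new files inherit it (xfs_io -c "extsize 32m" does the same
from the command line). Below is a minimal sketch of setting such a hint; the
path and the 32MB value are placeholders, and it assumes kernel headers new
enough (4.5+) to provide struct fsxattr in <linux/fs.h>; older systems expose
the same structure as XFS_IOC_FSSETXATTR in the XFS headers.

/*
 * Sketch (assumption, not from the thread) of setting an XFS extent size
 * hint, e.g. the 32MB hint mentioned above, via FS_IOC_FSSETXATTR.  The
 * path and size are placeholders; for a regular file the hint has to be set
 * before the file has any data extents, or the ioctl fails with EINVAL.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   /* struct fsxattr, FS_IOC_FSGETXATTR/FS_IOC_FSSETXATTR */

static int set_extsize_hint(const char *path, unsigned int hint_bytes)
{
        struct fsxattr fsx;
        int fd = open(path, O_RDONLY);

        if (fd < 0) {
                perror("open");
                return -1;
        }
        if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0) {
                perror("FS_IOC_FSGETXATTR");
                close(fd);
                return -1;
        }

        fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;     /* honor fsx_extsize */
        fsx.fsx_extsize = hint_bytes;           /* hint, in bytes */
        /* on a directory, FS_XFLAG_EXTSZINHERIT would make new files
         * inherit the hint instead */

        if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0) {
                perror("FS_IOC_FSSETXATTR");
                close(fd);
                return -1;
        }
        close(fd);
        return 0;
}

int main(void)
{
        /* placeholder path, 32MB hint */
        return set_extsize_hint("/var/lib/scylla/testfile",
                                32u * 1024 * 1024) ? 1 : 0;
}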
* Re: xfs_extent_busy_flush vs. aio 2018-01-23 16:22 ` Avi Kivity @ 2018-01-23 16:47 ` Brian Foster 2018-01-23 17:00 ` Avi Kivity 0 siblings, 1 reply; 20+ messages in thread From: Brian Foster @ 2018-01-23 16:47 UTC (permalink / raw) To: Avi Kivity; +Cc: linux-xfs On Tue, Jan 23, 2018 at 06:22:07PM +0200, Avi Kivity wrote: > > > On 01/23/2018 06:11 PM, Brian Foster wrote: > > On Tue, Jan 23, 2018 at 05:45:39PM +0200, Avi Kivity wrote: > > > > > > On 01/23/2018 05:28 PM, Brian Foster wrote: > > > > On Tue, Jan 23, 2018 at 04:57:03PM +0200, Avi Kivity wrote: > > > > > I'm seeing the equivalent[*] of xfs_extent_busy_flush() sleeping in my > > > > > beautiful io_submit() calls. > > > > > > > > > > > > > > > Questions: > > > > > > > > > > - Is it correct that RWF_NOWAIT will not detect the condition that led to > > > > > the log being forced? > > > > > > > > > > - If so, can it be fixed? > > > > > > > > > > - Can I do something to reduce the odds of this occurring? larger logs, > > > > > more logs, flush more often, resurrect extinct species and sacrifice them to > > > > > the xfs gods? > > > > > > > > > > - Can an xfs developer do something? For example, make it RWF_NOWAIT > > > > > friendly (if the answer to the first question was "correct") > > > > > > > > > So RWF_NOWAIT eventually works its way to IOMAP_NOWAIT, which looks like > > > > it skips any write call that would require allocation in > > > > xfs_file_iomap_begin(). The busy flush should only happen in the block > > > > allocation path, so something is missing here. Do you have a backtrace > > > > for the log force you're seeing? > > > > > > > > > > > Here's a trace. It's from a kernel that lacks RWF_NOWAIT. > > > > > Oh, so the case below is roughly how I would have expected to hit the > > flush/wait without RWF_NOWAIT. The latter flag should prevent this, to > > answer your first question. > > Thanks, that's very encouraging. We are exploring recommending upstream-ish > kernels to users and customers, given their relative stability these days > and aio-related improvements (not to mention the shame of having to admit to > running an old kernel when reporting a problem to an upstream list). > > > > > For the follow up question, I think this should only occur when the fs > > is fairly low on free space. Is that the case here? > > No: > > /dev/md0 3.0T 1.2T 1.8T 40% /var/lib/scylla > > > > I'm not sure there's > > a specific metric, fwiw, but it's just a matter of attempting an (user > > data) allocation that only finds busy extents in the free space btrees > > and thus has to the force the log to satisfy the allocation. > > What does "busy" mean here? recently freed so we want to force the log to > make sure the extent isn't doubly-allocated? (wild guess) > Recently freed and the transaction that freed the blocks has not yet been persisted to the on-disk log. A subsequent attempt to allocate those blocks for user data waits for the transaction to commit to disk to ensure that the block is not written before the filesystem has persisted the fact that it has been freed. Otherwise, my understanding is that if the blocks are written to and the filesystem crashes before the previous free was persisted, we'd have allowed an overwrite of a still-used metadata block. > > I suppose > > running with more free space available would avoid this. I think running > > with less in-core log space could indirectly reduce extent busy time, > > but that may also have other performance ramifications and so is > > probably not a great idea. 
> > At 60%, I hope low free space is not a problem. > Yeah, that seems strange. I wouldn't expect busy extents to be a problem with that much free space. > btw, I'm also seeing 10ms+ periods of high CPU utilization: > > 0xffffffff816ab97a : _cond_resched+0x3a/0x50 [kernel] > 0xffffffff811e1495 : kmem_cache_alloc+0x35/0x1e0 [kernel] > 0xffffffffc00d8477 : kmem_zone_alloc+0x97/0x130 [xfs] > 0xffffffffc00deae2 : xfs_buf_item_init+0x42/0x190 [xfs] > 0xffffffffc00e89c3 : _xfs_trans_bjoin+0x23/0x60 [xfs] > 0xffffffffc00e8f17 : xfs_trans_read_buf_map+0x247/0x400 [xfs] > 0xffffffffc008f248 : xfs_btree_read_buf_block.constprop.29+0x78/0xc0 [xfs] > 0xffffffffc009221e : xfs_btree_increment+0x21e/0x350 [xfs] > 0xffffffffc00796a8 : xfs_alloc_ag_vextent_near+0x368/0xab0 [xfs] > 0xffffffffc0079efd : xfs_alloc_ag_vextent+0x10d/0x150 [xfs] > 0xffffffffc007abc6 : xfs_alloc_vextent+0x446/0x5f0 [xfs] > 0xffffffffc008b123 : xfs_bmap_btalloc+0x3f3/0x780 [xfs] > 0xffffffffc008b4be : xfs_bmap_alloc+0xe/0x10 [xfs] > > Is it normal for xfs to spend 10ms+ of CPU time to allocate an extent? > Should I be increasing my extent hint (currently at 32MB)? > I haven't done enough performance testing to have an intuition on the typical CPU time required to allocate blocks. Somebody else may be able to chime in on that. I suppose it could depend on the level of free space fragmentation, which can be observed via 'xfs_db -c "freesp -s" <dev>', whether I/Os or btree splits/joins were required, etc. FWIW, the above stack looks like it's stuck waiting on a memory allocation for a btree buffer xfs_buf_log_item, which is an internal data structure used to track metadata objects through the log subsystem. We have a kmem zone for such objects because they are allocated/freed frequently, but perhaps the zone had to grow..? We do pass KM_SLEEP there.. Brian > > > > Brian > > > > > 0xffffffff816ab231 : __schedule+0x531/0x9b0 [kernel] > > > 0xffffffff816ab6d9 : schedule+0x29/0x70 [kernel] > > > 0xffffffff816a90e9 : schedule_timeout+0x239/0x2c0 [kernel] > > > 0xffffffff816aba8d : wait_for_completion+0xfd/0x140 [kernel] > > > 0xffffffff810ab41d : flush_work+0xfd/0x190 [kernel] > > > 0xffffffffc00ddb3a : xlog_cil_force_lsn+0x8a/0x210 [xfs] > > > 0xffffffffc00dbbf5 : _xfs_log_force+0x85/0x2c0 [xfs] > > > 0xffffffffc00dbe5c : xfs_log_force+0x2c/0x70 [xfs] > > > 0xffffffffc0078f60 : xfs_alloc_ag_vextent_size+0x250/0x630 [xfs] > > > 0xffffffffc0079ed5 : xfs_alloc_ag_vextent+0xe5/0x150 [xfs] > > > 0xffffffffc007abc6 : xfs_alloc_vextent+0x446/0x5f0 [xfs] > > > 0xffffffffc008b123 : xfs_bmap_btalloc+0x3f3/0x780 [xfs] > > > 0xffffffffc008b4be : xfs_bmap_alloc+0xe/0x10 [xfs] > > > 0xffffffffc008bef9 : xfs_bmapi_write+0x499/0xab0 [xfs] > > > 0xffffffffc00c6ec8 : xfs_iomap_write_direct+0x1b8/0x390 [xfs] > > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: xfs_extent_busy_flush vs. aio 2018-01-23 16:47 ` Brian Foster @ 2018-01-23 17:00 ` Avi Kivity 2018-01-23 17:39 ` Brian Foster 0 siblings, 1 reply; 20+ messages in thread From: Avi Kivity @ 2018-01-23 17:00 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On 01/23/2018 06:47 PM, Brian Foster wrote: > On Tue, Jan 23, 2018 at 06:22:07PM +0200, Avi Kivity wrote: >> >> On 01/23/2018 06:11 PM, Brian Foster wrote: >>> On Tue, Jan 23, 2018 at 05:45:39PM +0200, Avi Kivity wrote: >>>> On 01/23/2018 05:28 PM, Brian Foster wrote: >>>>> On Tue, Jan 23, 2018 at 04:57:03PM +0200, Avi Kivity wrote: >>>>>> I'm seeing the equivalent[*] of xfs_extent_busy_flush() sleeping in my >>>>>> beautiful io_submit() calls. >>>>>> >>>>>> >>>>>> Questions: >>>>>> >>>>>> - Is it correct that RWF_NOWAIT will not detect the condition that led to >>>>>> the log being forced? >>>>>> >>>>>> - If so, can it be fixed? >>>>>> >>>>>> - Can I do something to reduce the odds of this occurring? larger logs, >>>>>> more logs, flush more often, resurrect extinct species and sacrifice them to >>>>>> the xfs gods? >>>>>> >>>>>> - Can an xfs developer do something? For example, make it RWF_NOWAIT >>>>>> friendly (if the answer to the first question was "correct") >>>>>> >>>>> So RWF_NOWAIT eventually works its way to IOMAP_NOWAIT, which looks like >>>>> it skips any write call that would require allocation in >>>>> xfs_file_iomap_begin(). The busy flush should only happen in the block >>>>> allocation path, so something is missing here. Do you have a backtrace >>>>> for the log force you're seeing? >>>>> >>>>> >>>> Here's a trace. It's from a kernel that lacks RWF_NOWAIT. >>>> >>> Oh, so the case below is roughly how I would have expected to hit the >>> flush/wait without RWF_NOWAIT. The latter flag should prevent this, to >>> answer your first question. >> Thanks, that's very encouraging. We are exploring recommending upstream-ish >> kernels to users and customers, given their relative stability these days >> and aio-related improvements (not to mention the shame of having to admit to >> running an old kernel when reporting a problem to an upstream list). >> >>> For the follow up question, I think this should only occur when the fs >>> is fairly low on free space. Is that the case here? >> No: >> >> /dev/md0 3.0T 1.2T 1.8T 40% /var/lib/scylla >> >> >>> I'm not sure there's >>> a specific metric, fwiw, but it's just a matter of attempting an (user >>> data) allocation that only finds busy extents in the free space btrees >>> and thus has to the force the log to satisfy the allocation. >> What does "busy" mean here? recently freed so we want to force the log to >> make sure the extent isn't doubly-allocated? (wild guess) >> > Recently freed and the transaction that freed the blocks has not yet > been persisted to the on-disk log. A subsequent attempt to allocate > those blocks for user data waits for the transaction to commit to disk > to ensure that the block is not written before the filesystem has > persisted the fact that it has been freed. Otherwise, my understanding > is that if the blocks are written to and the filesystem crashes before > the previous free was persisted, we'd have allowed an overwrite of a > still-used metadata block. Understood, thanks. > >>> I suppose >>> running with more free space available would avoid this. I think running >>> with less in-core log space could indirectly reduce extent busy time, >>> but that may also have other performance ramifications and so is >>> probably not a great idea. 
>> At 60%, I hope low free space is not a problem. >> > Yeah, that seems strange. I wouldn't expect busy extents to be a problem > with that much free space. The workload creates new files, appends to them, lets them stew for a while, then deletes them. Maybe something is preventing xfs from seeing non-busy extents? The disk is writing at 300-600MB/s for several days, so quite some churn. > >> btw, I'm also seeing 10ms+ periods of high CPU utilization: >> >> 0xffffffff816ab97a : _cond_resched+0x3a/0x50 [kernel] >> 0xffffffff811e1495 : kmem_cache_alloc+0x35/0x1e0 [kernel] >> 0xffffffffc00d8477 : kmem_zone_alloc+0x97/0x130 [xfs] >> 0xffffffffc00deae2 : xfs_buf_item_init+0x42/0x190 [xfs] >> 0xffffffffc00e89c3 : _xfs_trans_bjoin+0x23/0x60 [xfs] >> 0xffffffffc00e8f17 : xfs_trans_read_buf_map+0x247/0x400 [xfs] >> 0xffffffffc008f248 : xfs_btree_read_buf_block.constprop.29+0x78/0xc0 [xfs] >> 0xffffffffc009221e : xfs_btree_increment+0x21e/0x350 [xfs] >> 0xffffffffc00796a8 : xfs_alloc_ag_vextent_near+0x368/0xab0 [xfs] >> 0xffffffffc0079efd : xfs_alloc_ag_vextent+0x10d/0x150 [xfs] >> 0xffffffffc007abc6 : xfs_alloc_vextent+0x446/0x5f0 [xfs] >> 0xffffffffc008b123 : xfs_bmap_btalloc+0x3f3/0x780 [xfs] >> 0xffffffffc008b4be : xfs_bmap_alloc+0xe/0x10 [xfs] >> >> Is it normal for xfs to spend 10ms+ of CPU time to allocate an extent? >> Should I be increasing my extent hint (currently at 32MB)? >> > I haven't done enough performance testing to have an intuition on the > typical CPU time required to allocate blocks. Somebody else may be able > to chime in on that. I suppose it could depend on the level of free > space fragmentation, which can be observed via 'xfs_db -c "freesp -s" > <dev>', whether I/Os or btree splits/joins were required, etc. > > FWIW, the above stack looks like it's stuck waiting on a memory > allocation for a btree buffer xfs_buf_log_item, which is an internal > data structure used to track metadata objects through the log subsystem. > We have a kmem zone for such objects because they are allocated/freed > frequently, but perhaps the zone had to grow..? We do pass KM_SLEEP > there.. It's not really waiting, that's a cond_resched. The scheduler switched away because some other task needed its attention, not because memory was not available. That's understandable since xfs hogged the cpu for 10ms. I will look at xfs_bmap output later, after I renew my friendship with trace-cmd. 
> Brian > >>> Brian >>> >>>> 0xffffffff816ab231 : __schedule+0x531/0x9b0 [kernel] >>>> 0xffffffff816ab6d9 : schedule+0x29/0x70 [kernel] >>>> 0xffffffff816a90e9 : schedule_timeout+0x239/0x2c0 [kernel] >>>> 0xffffffff816aba8d : wait_for_completion+0xfd/0x140 [kernel] >>>> 0xffffffff810ab41d : flush_work+0xfd/0x190 [kernel] >>>> 0xffffffffc00ddb3a : xlog_cil_force_lsn+0x8a/0x210 [xfs] >>>> 0xffffffffc00dbbf5 : _xfs_log_force+0x85/0x2c0 [xfs] >>>> 0xffffffffc00dbe5c : xfs_log_force+0x2c/0x70 [xfs] >>>> 0xffffffffc0078f60 : xfs_alloc_ag_vextent_size+0x250/0x630 [xfs] >>>> 0xffffffffc0079ed5 : xfs_alloc_ag_vextent+0xe5/0x150 [xfs] >>>> 0xffffffffc007abc6 : xfs_alloc_vextent+0x446/0x5f0 [xfs] >>>> 0xffffffffc008b123 : xfs_bmap_btalloc+0x3f3/0x780 [xfs] >>>> 0xffffffffc008b4be : xfs_bmap_alloc+0xe/0x10 [xfs] >>>> 0xffffffffc008bef9 : xfs_bmapi_write+0x499/0xab0 [xfs] >>>> 0xffffffffc00c6ec8 : xfs_iomap_write_direct+0x1b8/0x390 [xfs] >>>> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: xfs_extent_busy_flush vs. aio 2018-01-23 17:00 ` Avi Kivity @ 2018-01-23 17:39 ` Brian Foster 2018-01-25 8:50 ` Avi Kivity 0 siblings, 1 reply; 20+ messages in thread From: Brian Foster @ 2018-01-23 17:39 UTC (permalink / raw) To: Avi Kivity; +Cc: linux-xfs On Tue, Jan 23, 2018 at 07:00:31PM +0200, Avi Kivity wrote: > > > On 01/23/2018 06:47 PM, Brian Foster wrote: > > On Tue, Jan 23, 2018 at 06:22:07PM +0200, Avi Kivity wrote: > > > > > > On 01/23/2018 06:11 PM, Brian Foster wrote: > > > > On Tue, Jan 23, 2018 at 05:45:39PM +0200, Avi Kivity wrote: > > > > > On 01/23/2018 05:28 PM, Brian Foster wrote: > > > > > > On Tue, Jan 23, 2018 at 04:57:03PM +0200, Avi Kivity wrote: > > > > > > > I'm seeing the equivalent[*] of xfs_extent_busy_flush() sleeping in my > > > > > > > beautiful io_submit() calls. > > > > > > > > > > > > > > > > > > > > > Questions: > > > > > > > > > > > > > > - Is it correct that RWF_NOWAIT will not detect the condition that led to > > > > > > > the log being forced? > > > > > > > > > > > > > > - If so, can it be fixed? > > > > > > > > > > > > > > - Can I do something to reduce the odds of this occurring? larger logs, > > > > > > > more logs, flush more often, resurrect extinct species and sacrifice them to > > > > > > > the xfs gods? > > > > > > > > > > > > > > - Can an xfs developer do something? For example, make it RWF_NOWAIT > > > > > > > friendly (if the answer to the first question was "correct") > > > > > > > > > > > > > So RWF_NOWAIT eventually works its way to IOMAP_NOWAIT, which looks like > > > > > > it skips any write call that would require allocation in > > > > > > xfs_file_iomap_begin(). The busy flush should only happen in the block > > > > > > allocation path, so something is missing here. Do you have a backtrace > > > > > > for the log force you're seeing? > > > > > > > > > > > > > > > > > Here's a trace. It's from a kernel that lacks RWF_NOWAIT. > > > > > > > > > Oh, so the case below is roughly how I would have expected to hit the > > > > flush/wait without RWF_NOWAIT. The latter flag should prevent this, to > > > > answer your first question. > > > Thanks, that's very encouraging. We are exploring recommending upstream-ish > > > kernels to users and customers, given their relative stability these days > > > and aio-related improvements (not to mention the shame of having to admit to > > > running an old kernel when reporting a problem to an upstream list). > > > > > > > For the follow up question, I think this should only occur when the fs > > > > is fairly low on free space. Is that the case here? > > > No: > > > > > > /dev/md0 3.0T 1.2T 1.8T 40% /var/lib/scylla > > > > > > > > > > I'm not sure there's > > > > a specific metric, fwiw, but it's just a matter of attempting an (user > > > > data) allocation that only finds busy extents in the free space btrees > > > > and thus has to the force the log to satisfy the allocation. > > > What does "busy" mean here? recently freed so we want to force the log to > > > make sure the extent isn't doubly-allocated? (wild guess) > > > > > Recently freed and the transaction that freed the blocks has not yet > > been persisted to the on-disk log. A subsequent attempt to allocate > > those blocks for user data waits for the transaction to commit to disk > > to ensure that the block is not written before the filesystem has > > persisted the fact that it has been freed. 
Otherwise, my understanding > > is that if the blocks are written to and the filesystem crashes before > > the previous free was persisted, we'd have allowed an overwrite of a > > still-used metadata block. > > Understood, thanks. > > > > > > > I suppose > > > > running with more free space available would avoid this. I think running > > > > with less in-core log space could indirectly reduce extent busy time, > > > > but that may also have other performance ramifications and so is > > > > probably not a great idea. > > > At 60%, I hope low free space is not a problem. > > > > > Yeah, that seems strange. I wouldn't expect busy extents to be a problem > > with that much free space. > > The workload creates new files, appends to them, lets them stew for a while, > then deletes them. Maybe something is preventing xfs from seeing non-busy > extents? > Yeah, could be.. perhaps the issue is that despite the large amount of total free space, the free space is too fragmented to satisfy a particular allocation request..? > The disk is writing at 300-600MB/s for several days, so quite some churn. > > > > > > btw, I'm also seeing 10ms+ periods of high CPU utilization: > > > > > > 0xffffffff816ab97a : _cond_resched+0x3a/0x50 [kernel] > > > 0xffffffff811e1495 : kmem_cache_alloc+0x35/0x1e0 [kernel] > > > 0xffffffffc00d8477 : kmem_zone_alloc+0x97/0x130 [xfs] > > > 0xffffffffc00deae2 : xfs_buf_item_init+0x42/0x190 [xfs] > > > 0xffffffffc00e89c3 : _xfs_trans_bjoin+0x23/0x60 [xfs] > > > 0xffffffffc00e8f17 : xfs_trans_read_buf_map+0x247/0x400 [xfs] > > > 0xffffffffc008f248 : xfs_btree_read_buf_block.constprop.29+0x78/0xc0 [xfs] > > > 0xffffffffc009221e : xfs_btree_increment+0x21e/0x350 [xfs] > > > 0xffffffffc00796a8 : xfs_alloc_ag_vextent_near+0x368/0xab0 [xfs] > > > 0xffffffffc0079efd : xfs_alloc_ag_vextent+0x10d/0x150 [xfs] > > > 0xffffffffc007abc6 : xfs_alloc_vextent+0x446/0x5f0 [xfs] > > > 0xffffffffc008b123 : xfs_bmap_btalloc+0x3f3/0x780 [xfs] > > > 0xffffffffc008b4be : xfs_bmap_alloc+0xe/0x10 [xfs] > > > > > > Is it normal for xfs to spend 10ms+ of CPU time to allocate an extent? > > > Should I be increasing my extent hint (currently at 32MB)? > > > > > I haven't done enough performance testing to have an intuition on the > > typical CPU time required to allocate blocks. Somebody else may be able > > to chime in on that. I suppose it could depend on the level of free > > space fragmentation, which can be observed via 'xfs_db -c "freesp -s" > > <dev>', whether I/Os or btree splits/joins were required, etc. > > > > FWIW, the above stack looks like it's stuck waiting on a memory > > allocation for a btree buffer xfs_buf_log_item, which is an internal > > data structure used to track metadata objects through the log subsystem. > > We have a kmem zone for such objects because they are allocated/freed > > frequently, but perhaps the zone had to grow..? We do pass KM_SLEEP > > there.. > > It's not really waiting, that's a cond_resched. The scheduler switched away > because some other task needed its attention, not because memory was not > available. That's understandable since xfs hogged the cpu for 10ms. > Ah, I misread it as you were blocked in that callchain. I suppose ftrace or something could help annotate the time spent in the allocation path. Free space fragmentation could potentially be a factor here as well, causing the search algorithm(s) to run through a lot of records/blocks to find something usable, for example. 
Brian > I will look at xfs_bmap output later, after I renew my friendship with > trace-cmd. > > > Brian > > > > > > Brian > > > > > > > > > 0xffffffff816ab231 : __schedule+0x531/0x9b0 [kernel] > > > > > 0xffffffff816ab6d9 : schedule+0x29/0x70 [kernel] > > > > > 0xffffffff816a90e9 : schedule_timeout+0x239/0x2c0 [kernel] > > > > > 0xffffffff816aba8d : wait_for_completion+0xfd/0x140 [kernel] > > > > > 0xffffffff810ab41d : flush_work+0xfd/0x190 [kernel] > > > > > 0xffffffffc00ddb3a : xlog_cil_force_lsn+0x8a/0x210 [xfs] > > > > > 0xffffffffc00dbbf5 : _xfs_log_force+0x85/0x2c0 [xfs] > > > > > 0xffffffffc00dbe5c : xfs_log_force+0x2c/0x70 [xfs] > > > > > 0xffffffffc0078f60 : xfs_alloc_ag_vextent_size+0x250/0x630 [xfs] > > > > > 0xffffffffc0079ed5 : xfs_alloc_ag_vextent+0xe5/0x150 [xfs] > > > > > 0xffffffffc007abc6 : xfs_alloc_vextent+0x446/0x5f0 [xfs] > > > > > 0xffffffffc008b123 : xfs_bmap_btalloc+0x3f3/0x780 [xfs] > > > > > 0xffffffffc008b4be : xfs_bmap_alloc+0xe/0x10 [xfs] > > > > > 0xffffffffc008bef9 : xfs_bmapi_write+0x499/0xab0 [xfs] > > > > > 0xffffffffc00c6ec8 : xfs_iomap_write_direct+0x1b8/0x390 [xfs] > > > > > > > > -- > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > the body of a message to majordomo@vger.kernel.org > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: xfs_extent_busy_flush vs. aio 2018-01-23 17:39 ` Brian Foster @ 2018-01-25 8:50 ` Avi Kivity 2018-01-25 13:08 ` Brian Foster 0 siblings, 1 reply; 20+ messages in thread From: Avi Kivity @ 2018-01-25 8:50 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On 01/23/2018 07:39 PM, Brian Foster wrote: > On Tue, Jan 23, 2018 at 07:00:31PM +0200, Avi Kivity wrote: >> >> On 01/23/2018 06:47 PM, Brian Foster wrote: >>> On Tue, Jan 23, 2018 at 06:22:07PM +0200, Avi Kivity wrote: >>>> On 01/23/2018 06:11 PM, Brian Foster wrote: >>>>> On Tue, Jan 23, 2018 at 05:45:39PM +0200, Avi Kivity wrote: >>>>>> On 01/23/2018 05:28 PM, Brian Foster wrote: >>>>>>> On Tue, Jan 23, 2018 at 04:57:03PM +0200, Avi Kivity wrote: >>>>>>>> I'm seeing the equivalent[*] of xfs_extent_busy_flush() sleeping in my >>>>>>>> beautiful io_submit() calls. >>>>>>>> >>>>>>>> >>>>>>>> Questions: >>>>>>>> >>>>>>>> - Is it correct that RWF_NOWAIT will not detect the condition that led to >>>>>>>> the log being forced? >>>>>>>> >>>>>>>> - If so, can it be fixed? >>>>>>>> >>>>>>>> - Can I do something to reduce the odds of this occurring? larger logs, >>>>>>>> more logs, flush more often, resurrect extinct species and sacrifice them to >>>>>>>> the xfs gods? >>>>>>>> >>>>>>>> - Can an xfs developer do something? For example, make it RWF_NOWAIT >>>>>>>> friendly (if the answer to the first question was "correct") >>>>>>>> >>>>>>> So RWF_NOWAIT eventually works its way to IOMAP_NOWAIT, which looks like >>>>>>> it skips any write call that would require allocation in >>>>>>> xfs_file_iomap_begin(). The busy flush should only happen in the block >>>>>>> allocation path, so something is missing here. Do you have a backtrace >>>>>>> for the log force you're seeing? >>>>>>> >>>>>>> >>>>>> Here's a trace. It's from a kernel that lacks RWF_NOWAIT. >>>>>> >>>>> Oh, so the case below is roughly how I would have expected to hit the >>>>> flush/wait without RWF_NOWAIT. The latter flag should prevent this, to >>>>> answer your first question. >>>> Thanks, that's very encouraging. We are exploring recommending upstream-ish >>>> kernels to users and customers, given their relative stability these days >>>> and aio-related improvements (not to mention the shame of having to admit to >>>> running an old kernel when reporting a problem to an upstream list). >>>> >>>>> For the follow up question, I think this should only occur when the fs >>>>> is fairly low on free space. Is that the case here? >>>> No: >>>> >>>> /dev/md0 3.0T 1.2T 1.8T 40% /var/lib/scylla >>>> >>>> >>>>> I'm not sure there's >>>>> a specific metric, fwiw, but it's just a matter of attempting an (user >>>>> data) allocation that only finds busy extents in the free space btrees >>>>> and thus has to the force the log to satisfy the allocation. >>>> What does "busy" mean here? recently freed so we want to force the log to >>>> make sure the extent isn't doubly-allocated? (wild guess) >>>> >>> Recently freed and the transaction that freed the blocks has not yet >>> been persisted to the on-disk log. A subsequent attempt to allocate >>> those blocks for user data waits for the transaction to commit to disk >>> to ensure that the block is not written before the filesystem has >>> persisted the fact that it has been freed. Otherwise, my understanding >>> is that if the blocks are written to and the filesystem crashes before >>> the previous free was persisted, we'd have allowed an overwrite of a >>> still-used metadata block. >> Understood, thanks. 
>> >>>>> I suppose >>>>> running with more free space available would avoid this. I think running >>>>> with less in-core log space could indirectly reduce extent busy time, >>>>> but that may also have other performance ramifications and so is >>>>> probably not a great idea. >>>> At 60%, I hope low free space is not a problem. >>>> >>> Yeah, that seems strange. I wouldn't expect busy extents to be a problem >>> with that much free space. >> The workload creates new files, appends to them, lets them stew for a while, >> then deletes them. Maybe something is preventing xfs from seeing non-busy >> extents? >> > Yeah, could be.. perhaps the issue is that despite the large amount of > total free space, the free space is too fragmented to satisfy a > particular allocation request..? from to extents blocks pct 1 1 2702 2702 0.00 2 3 690 1547 0.00 4 7 115 568 0.00 8 15 60 634 0.00 16 31 63 1457 0.00 32 63 102 4751 0.00 64 127 7940 895365 0.19 128 255 49680 12422100 2.67 256 511 1025 417078 0.09 512 1023 4170 3660771 0.79 1024 2047 2168 3503054 0.75 2048 4095 2567 7729442 1.66 4096 8191 8688 59394413 12.76 8192 16383 310 3100186 0.67 16384 32767 112 2339935 0.50 32768 65535 35 1381122 0.30 65536 131071 8 651391 0.14 131072 262143 2 344196 0.07 524288 1048575 4 2909925 0.62 1048576 2097151 3 3550680 0.76 4194304 8388607 10 82497658 17.72 8388608 16777215 10 158022653 33.94 16777216 24567552 5 122778062 26.37 total free extents 80469 total free blocks 465609690 average free extent size 5786.2 Looks like plenty of free large extents, with most of the free space completely, unfragmented. Lots of 16MB-32MB extents, too. 32MB is our allocation hint size, could have something to do with it. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: xfs_extent_busy_flush vs. aio 2018-01-25 8:50 ` Avi Kivity @ 2018-01-25 13:08 ` Brian Foster 2018-01-29 9:40 ` Avi Kivity 2018-02-02 9:48 ` Christoph Hellwig 0 siblings, 2 replies; 20+ messages in thread From: Brian Foster @ 2018-01-25 13:08 UTC (permalink / raw) To: Avi Kivity; +Cc: linux-xfs On Thu, Jan 25, 2018 at 10:50:40AM +0200, Avi Kivity wrote: > On 01/23/2018 07:39 PM, Brian Foster wrote: > > On Tue, Jan 23, 2018 at 07:00:31PM +0200, Avi Kivity wrote: > > > > > > On 01/23/2018 06:47 PM, Brian Foster wrote: > > > > On Tue, Jan 23, 2018 at 06:22:07PM +0200, Avi Kivity wrote: > > > > > On 01/23/2018 06:11 PM, Brian Foster wrote: > > > > > > On Tue, Jan 23, 2018 at 05:45:39PM +0200, Avi Kivity wrote: > > > > > > > On 01/23/2018 05:28 PM, Brian Foster wrote: > > > > > > > > On Tue, Jan 23, 2018 at 04:57:03PM +0200, Avi Kivity wrote: > > > > > > > > > I'm seeing the equivalent[*] of xfs_extent_busy_flush() sleeping in my > > > > > > > > > beautiful io_submit() calls. > > > > > > > > > > > > > > > > > > > > > > > > > > > Questions: > > > > > > > > > > > > > > > > > > - Is it correct that RWF_NOWAIT will not detect the condition that led to > > > > > > > > > the log being forced? > > > > > > > > > > > > > > > > > > - If so, can it be fixed? > > > > > > > > > > > > > > > > > > - Can I do something to reduce the odds of this occurring? larger logs, > > > > > > > > > more logs, flush more often, resurrect extinct species and sacrifice them to > > > > > > > > > the xfs gods? > > > > > > > > > > > > > > > > > > - Can an xfs developer do something? For example, make it RWF_NOWAIT > > > > > > > > > friendly (if the answer to the first question was "correct") > > > > > > > > > > > > > > > > > So RWF_NOWAIT eventually works its way to IOMAP_NOWAIT, which looks like > > > > > > > > it skips any write call that would require allocation in > > > > > > > > xfs_file_iomap_begin(). The busy flush should only happen in the block > > > > > > > > allocation path, so something is missing here. Do you have a backtrace > > > > > > > > for the log force you're seeing? > > > > > > > > > > > > > > > > > > > > > > > Here's a trace. It's from a kernel that lacks RWF_NOWAIT. > > > > > > > > > > > > > Oh, so the case below is roughly how I would have expected to hit the > > > > > > flush/wait without RWF_NOWAIT. The latter flag should prevent this, to > > > > > > answer your first question. > > > > > Thanks, that's very encouraging. We are exploring recommending upstream-ish > > > > > kernels to users and customers, given their relative stability these days > > > > > and aio-related improvements (not to mention the shame of having to admit to > > > > > running an old kernel when reporting a problem to an upstream list). > > > > > > > > > > > For the follow up question, I think this should only occur when the fs > > > > > > is fairly low on free space. Is that the case here? > > > > > No: > > > > > > > > > > /dev/md0 3.0T 1.2T 1.8T 40% /var/lib/scylla > > > > > > > > > > > > > > > > I'm not sure there's > > > > > > a specific metric, fwiw, but it's just a matter of attempting an (user > > > > > > data) allocation that only finds busy extents in the free space btrees > > > > > > and thus has to the force the log to satisfy the allocation. > > > > > What does "busy" mean here? recently freed so we want to force the log to > > > > > make sure the extent isn't doubly-allocated? 
(wild guess) > > > > > > > > > Recently freed and the transaction that freed the blocks has not yet > > > > been persisted to the on-disk log. A subsequent attempt to allocate > > > > those blocks for user data waits for the transaction to commit to disk > > > > to ensure that the block is not written before the filesystem has > > > > persisted the fact that it has been freed. Otherwise, my understanding > > > > is that if the blocks are written to and the filesystem crashes before > > > > the previous free was persisted, we'd have allowed an overwrite of a > > > > still-used metadata block. > > > Understood, thanks. > > > > > > > > > I suppose > > > > > > running with more free space available would avoid this. I think running > > > > > > with less in-core log space could indirectly reduce extent busy time, > > > > > > but that may also have other performance ramifications and so is > > > > > > probably not a great idea. > > > > > At 60%, I hope low free space is not a problem. > > > > > > > > > Yeah, that seems strange. I wouldn't expect busy extents to be a problem > > > > with that much free space. > > > The workload creates new files, appends to them, lets them stew for a while, > > > then deletes them. Maybe something is preventing xfs from seeing non-busy > > > extents? > > > > > Yeah, could be.. perhaps the issue is that despite the large amount of > > total free space, the free space is too fragmented to satisfy a > > particular allocation request..? > > from to extents blocks pct > 1 1 2702 2702 0.00 > 2 3 690 1547 0.00 > 4 7 115 568 0.00 > 8 15 60 634 0.00 > 16 31 63 1457 0.00 > 32 63 102 4751 0.00 > 64 127 7940 895365 0.19 > 128 255 49680 12422100 2.67 > 256 511 1025 417078 0.09 > 512 1023 4170 3660771 0.79 > 1024 2047 2168 3503054 0.75 > 2048 4095 2567 7729442 1.66 > 4096 8191 8688 59394413 12.76 > 8192 16383 310 3100186 0.67 > 16384 32767 112 2339935 0.50 > 32768 65535 35 1381122 0.30 > 65536 131071 8 651391 0.14 > 131072 262143 2 344196 0.07 > 524288 1048575 4 2909925 0.62 > 1048576 2097151 3 3550680 0.76 > 4194304 8388607 10 82497658 17.72 > 8388608 16777215 10 158022653 33.94 > 16777216 24567552 5 122778062 26.37 > total free extents 80469 > total free blocks 465609690 > average free extent size 5786.2 > > Looks like plenty of free large extents, with most of the free space > completely, unfragmented. > Indeed.. > Lots of 16MB-32MB extents, too. 32MB is our allocation hint size, could have > something to do with it. > Most likely. Based on this, it's hard to say for certain why you'd be running into allocation latency caused by busy extents. Does this filesystem use the '-o discard' mount option by any chance? I suppose it's possible that this was some kind of transient state, or perhaps only a small set of AGs are affected, etc. It's also possible this may have been improved in more recent kernels by Christoph's rework of some of that code. In any event, this would probably require a bit more runtime analysis to figure out where/why allocations are getting stalled as such. I'd probably start by looking at the xfs_extent_busy_* tracepoints (also note that if there's potentially something to be improved on here, it's more useful to do so against current upstream). Or you could just move to something that supports RWF_NOWAIT.. 
;) Brian > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: xfs_extent_busy_flush vs. aio 2018-01-25 13:08 ` Brian Foster @ 2018-01-29 9:40 ` Avi Kivity 2018-01-29 11:35 ` Dave Chinner 2018-02-02 9:48 ` Christoph Hellwig 1 sibling, 1 reply; 20+ messages in thread From: Avi Kivity @ 2018-01-29 9:40 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On 01/25/2018 03:08 PM, Brian Foster wrote: > On Thu, Jan 25, 2018 at 10:50:40AM +0200, Avi Kivity wrote: >> On 01/23/2018 07:39 PM, Brian Foster wrote: >>> On Tue, Jan 23, 2018 at 07:00:31PM +0200, Avi Kivity wrote: >>>> On 01/23/2018 06:47 PM, Brian Foster wrote: >>>>> On Tue, Jan 23, 2018 at 06:22:07PM +0200, Avi Kivity wrote: >>>>>> On 01/23/2018 06:11 PM, Brian Foster wrote: >>>>>>> On Tue, Jan 23, 2018 at 05:45:39PM +0200, Avi Kivity wrote: >>>>>>>> On 01/23/2018 05:28 PM, Brian Foster wrote: >>>>>>>>> On Tue, Jan 23, 2018 at 04:57:03PM +0200, Avi Kivity wrote: >>>>>>>>>> I'm seeing the equivalent[*] of xfs_extent_busy_flush() sleeping in my >>>>>>>>>> beautiful io_submit() calls. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Questions: >>>>>>>>>> >>>>>>>>>> - Is it correct that RWF_NOWAIT will not detect the condition that led to >>>>>>>>>> the log being forced? >>>>>>>>>> >>>>>>>>>> - If so, can it be fixed? >>>>>>>>>> >>>>>>>>>> - Can I do something to reduce the odds of this occurring? larger logs, >>>>>>>>>> more logs, flush more often, resurrect extinct species and sacrifice them to >>>>>>>>>> the xfs gods? >>>>>>>>>> >>>>>>>>>> - Can an xfs developer do something? For example, make it RWF_NOWAIT >>>>>>>>>> friendly (if the answer to the first question was "correct") >>>>>>>>>> >>>>>>>>> So RWF_NOWAIT eventually works its way to IOMAP_NOWAIT, which looks like >>>>>>>>> it skips any write call that would require allocation in >>>>>>>>> xfs_file_iomap_begin(). The busy flush should only happen in the block >>>>>>>>> allocation path, so something is missing here. Do you have a backtrace >>>>>>>>> for the log force you're seeing? >>>>>>>>> >>>>>>>>> >>>>>>>> Here's a trace. It's from a kernel that lacks RWF_NOWAIT. >>>>>>>> >>>>>>> Oh, so the case below is roughly how I would have expected to hit the >>>>>>> flush/wait without RWF_NOWAIT. The latter flag should prevent this, to >>>>>>> answer your first question. >>>>>> Thanks, that's very encouraging. We are exploring recommending upstream-ish >>>>>> kernels to users and customers, given their relative stability these days >>>>>> and aio-related improvements (not to mention the shame of having to admit to >>>>>> running an old kernel when reporting a problem to an upstream list). >>>>>> >>>>>>> For the follow up question, I think this should only occur when the fs >>>>>>> is fairly low on free space. Is that the case here? >>>>>> No: >>>>>> >>>>>> /dev/md0 3.0T 1.2T 1.8T 40% /var/lib/scylla >>>>>> >>>>>> >>>>>>> I'm not sure there's >>>>>>> a specific metric, fwiw, but it's just a matter of attempting an (user >>>>>>> data) allocation that only finds busy extents in the free space btrees >>>>>>> and thus has to the force the log to satisfy the allocation. >>>>>> What does "busy" mean here? recently freed so we want to force the log to >>>>>> make sure the extent isn't doubly-allocated? (wild guess) >>>>>> >>>>> Recently freed and the transaction that freed the blocks has not yet >>>>> been persisted to the on-disk log. A subsequent attempt to allocate >>>>> those blocks for user data waits for the transaction to commit to disk >>>>> to ensure that the block is not written before the filesystem has >>>>> persisted the fact that it has been freed. 
Otherwise, my understanding >>>>> is that if the blocks are written to and the filesystem crashes before >>>>> the previous free was persisted, we'd have allowed an overwrite of a >>>>> still-used metadata block. >>>> Understood, thanks. >>>> >>>>>>> I suppose >>>>>>> running with more free space available would avoid this. I think running >>>>>>> with less in-core log space could indirectly reduce extent busy time, >>>>>>> but that may also have other performance ramifications and so is >>>>>>> probably not a great idea. >>>>>> At 60%, I hope low free space is not a problem. >>>>>> >>>>> Yeah, that seems strange. I wouldn't expect busy extents to be a problem >>>>> with that much free space. >>>> The workload creates new files, appends to them, lets them stew for a while, >>>> then deletes them. Maybe something is preventing xfs from seeing non-busy >>>> extents? >>>> >>> Yeah, could be.. perhaps the issue is that despite the large amount of >>> total free space, the free space is too fragmented to satisfy a >>> particular allocation request..? >> from to extents blocks pct >> 1 1 2702 2702 0.00 >> 2 3 690 1547 0.00 >> 4 7 115 568 0.00 >> 8 15 60 634 0.00 >> 16 31 63 1457 0.00 >> 32 63 102 4751 0.00 >> 64 127 7940 895365 0.19 >> 128 255 49680 12422100 2.67 >> 256 511 1025 417078 0.09 >> 512 1023 4170 3660771 0.79 >> 1024 2047 2168 3503054 0.75 >> 2048 4095 2567 7729442 1.66 >> 4096 8191 8688 59394413 12.76 >> 8192 16383 310 3100186 0.67 >> 16384 32767 112 2339935 0.50 >> 32768 65535 35 1381122 0.30 >> 65536 131071 8 651391 0.14 >> 131072 262143 2 344196 0.07 >> 524288 1048575 4 2909925 0.62 >> 1048576 2097151 3 3550680 0.76 >> 4194304 8388607 10 82497658 17.72 >> 8388608 16777215 10 158022653 33.94 >> 16777216 24567552 5 122778062 26.37 >> total free extents 80469 >> total free blocks 465609690 >> average free extent size 5786.2 >> >> Looks like plenty of free large extents, with most of the free space >> completely, unfragmented. >> > Indeed.. > >> Lots of 16MB-32MB extents, too. 32MB is our allocation hint size, could have >> something to do with it. >> > Most likely. Based on this, it's hard to say for certain why you'd be > running into allocation latency caused by busy extents. Does this > filesystem use the '-o discard' mount option by any chance? No. We'd love to but have had bad experience + strong recommendation from this list not to use it. > I suppose it's possible that this was some kind of transient state, or > perhaps only a small set of AGs are affected, etc. It's also possible > this may have been improved in more recent kernels by Christoph's rework > of some of that code. In any event, this would probably require a bit > more runtime analysis to figure out where/why allocations are getting > stalled as such. I'd probably start by looking at the xfs_extent_busy_* > tracepoints (also note that if there's potentially something to be > improved on here, it's more useful to do so against current upstream). > > Or you could just move to something that supports RWF_NOWAIT.. ;) > NOWAIT only helps with blocking, not high CPU usage. Even if we moved it to another thread, it would still take the same time (but granted, our important thread could still do work, sharing the core with the one doing the allocation). Regardless, we are recommending to our users and customers to move to NOWAIT kernels, but of course some are justifiable cautious. Upstream kernels are now quite stable, so enterprise distribution kernels don't provide the same value they used to. 
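The "moved it to another thread" fallback alluded to above is roughly the
following pattern: try the write with RWF_NOWAIT on the latency-sensitive
thread, and when the kernel reports EAGAIN, hand the identical write to a
helper thread that issues it without the flag, so any allocation or log force
happens there instead. This is a minimal sketch under stated assumptions:
pwritev2() from a recent glibc, a file opened with O_DIRECT, placeholder path
and sizes, and a single detached thread standing in for a real worker pool.

/*
 * Sketch (assumption, not code from the thread) of punting a write that
 * RWF_NOWAIT refuses to a helper thread that performs it without the flag.
 */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_NOWAIT
#define RWF_NOWAIT 0x00000008           /* kernel >= 4.13 */
#endif

struct punted_write {                   /* one deferred write for the helper */
        int fd;
        struct iovec iov;
        off_t off;
};

static void *blocking_writer(void *arg)
{
        struct punted_write *p = arg;

        /* plain blocking write: may allocate blocks or force the log, but
         * that now happens off the latency-sensitive thread */
        if (pwritev2(p->fd, &p->iov, 1, p->off, 0) < 0)
                perror("pwritev2 (helper)");
        free(p);
        return NULL;
}

/* Called from the latency-sensitive thread.  The buffer must stay valid
 * until the helper finishes; a detached thread stands in for a worker pool. */
static int submit_write(int fd, void *buf, size_t len, off_t off)
{
        struct iovec iov = { .iov_base = buf, .iov_len = len };
        ssize_t n = pwritev2(fd, &iov, 1, off, RWF_NOWAIT);

        if (n >= 0)
                return 0;               /* completed without blocking */
        if (errno != EAGAIN)
                return -1;              /* real error */

        /* the write would block (e.g. needs allocation): punt it */
        struct punted_write *p = malloc(sizeof(*p));
        pthread_t tid;

        if (!p)
                return -1;
        p->fd = fd;
        p->iov = iov;
        p->off = off;
        if (pthread_create(&tid, NULL, blocking_writer, p)) {
                free(p);
                return -1;
        }
        pthread_detach(tid);
        return 0;
}

int main(void)
{
        size_t len = 128 * 1024;        /* placeholder size, O_DIRECT-aligned */
        void *buf;
        int fd;

        fd = open("/var/lib/scylla/testfile",
                  O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0 || posix_memalign(&buf, 4096, len))
                return 1;
        memset(buf, 'x', len);

        if (submit_write(fd, buf, len, 0) < 0)
                perror("submit_write");
        sleep(1);                       /* crude: let a punted write finish */
        close(fd);
        return 0;
}

The total CPU spent in the allocator is of course unchanged; the gain is only
that the submitting thread stays responsive while the helper absorbs the
latency.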
* Re: xfs_extent_busy_flush vs. aio 2018-01-29 9:40 ` Avi Kivity @ 2018-01-29 11:35 ` Dave Chinner 2018-01-29 11:44 ` Avi Kivity 0 siblings, 1 reply; 20+ messages in thread From: Dave Chinner @ 2018-01-29 11:35 UTC (permalink / raw) To: Avi Kivity; +Cc: Brian Foster, linux-xfs On Mon, Jan 29, 2018 at 11:40:27AM +0200, Avi Kivity wrote: > > > On 01/25/2018 03:08 PM, Brian Foster wrote: > > On Thu, Jan 25, 2018 at 10:50:40AM +0200, Avi Kivity wrote: > > > On 01/23/2018 07:39 PM, Brian Foster wrote: > > > > Yeah, could be.. perhaps the issue is that despite the large amount of > > > > total free space, the free space is too fragmented to satisfy a > > > > particular allocation request..? > > > from to extents blocks pct > > > 1 1 2702 2702 0.00 > > > 2 3 690 1547 0.00 > > > 4 7 115 568 0.00 > > > 8 15 60 634 0.00 > > > 16 31 63 1457 0.00 > > > 32 63 102 4751 0.00 > > > 64 127 7940 895365 0.19 > > > 128 255 49680 12422100 2.67 > > > 256 511 1025 417078 0.09 > > > 512 1023 4170 3660771 0.79 > > > 1024 2047 2168 3503054 0.75 > > > 2048 4095 2567 7729442 1.66 > > > 4096 8191 8688 59394413 12.76 > > > 8192 16383 310 3100186 0.67 > > > 16384 32767 112 2339935 0.50 > > > 32768 65535 35 1381122 0.30 > > > 65536 131071 8 651391 0.14 > > > 131072 262143 2 344196 0.07 > > > 524288 1048575 4 2909925 0.62 > > > 1048576 2097151 3 3550680 0.76 > > > 4194304 8388607 10 82497658 17.72 > > > 8388608 16777215 10 158022653 33.94 > > > 16777216 24567552 5 122778062 26.37 > > > total free extents 80469 > > > total free blocks 465609690 > > > average free extent size 5786.2 > > > > > > Looks like plenty of free large extents, with most of the free space > > > completely, unfragmented. > > > > > Indeed.. You need to look at each AG, not the overall summary. You could have a suboptimal AG hidden in amongst that (e.g. near ENOSPC) and it's that one AG that is causing all your problems. There's many reasons this can happen, but the most common is the working files in a directory (or subset of directories in the same AG) have a combined space usage of larger than an AG .... > > > Lots of 16MB-32MB extents, too. 32MB is our allocation hint size, could have > > > something to do with it. > > > > > Most likely. Based on this, it's hard to say for certain why you'd be > > running into allocation latency caused by busy extents. One of only two reasons: 1. the AG has a large enough free space, but they are all marked busy (i.e. just been freed), or 2. The extent selected has had the busy range trimmed out of it and it's now less than the minimum extent length requested. Both cases imply that we're allocating extents that have been very recently freed, and that implies there is no other suitable non-busy free space in the AG. Hence the need to look at the per-AG freespace pattern rather than the global summary. Also, it's worth dumping the freespace via xfs_spaceman as it walks the in memory trees rather than the on-disk trees and so is properly coherent with operations in progress. (i.e. xfs_spaceman -c "freesp ..." /mntpt) Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: xfs_extent_busy_flush vs. aio 2018-01-29 11:35 ` Dave Chinner @ 2018-01-29 11:44 ` Avi Kivity 2018-01-29 21:56 ` Dave Chinner 0 siblings, 1 reply; 20+ messages in thread From: Avi Kivity @ 2018-01-29 11:44 UTC (permalink / raw) To: Dave Chinner; +Cc: Brian Foster, linux-xfs On 01/29/2018 01:35 PM, Dave Chinner wrote: > On Mon, Jan 29, 2018 at 11:40:27AM +0200, Avi Kivity wrote: >> >> On 01/25/2018 03:08 PM, Brian Foster wrote: >>> On Thu, Jan 25, 2018 at 10:50:40AM +0200, Avi Kivity wrote: >>>> On 01/23/2018 07:39 PM, Brian Foster wrote: >>>>> Yeah, could be.. perhaps the issue is that despite the large amount of >>>>> total free space, the free space is too fragmented to satisfy a >>>>> particular allocation request..? >>>> from to extents blocks pct >>>> 1 1 2702 2702 0.00 >>>> 2 3 690 1547 0.00 >>>> 4 7 115 568 0.00 >>>> 8 15 60 634 0.00 >>>> 16 31 63 1457 0.00 >>>> 32 63 102 4751 0.00 >>>> 64 127 7940 895365 0.19 >>>> 128 255 49680 12422100 2.67 >>>> 256 511 1025 417078 0.09 >>>> 512 1023 4170 3660771 0.79 >>>> 1024 2047 2168 3503054 0.75 >>>> 2048 4095 2567 7729442 1.66 >>>> 4096 8191 8688 59394413 12.76 >>>> 8192 16383 310 3100186 0.67 >>>> 16384 32767 112 2339935 0.50 >>>> 32768 65535 35 1381122 0.30 >>>> 65536 131071 8 651391 0.14 >>>> 131072 262143 2 344196 0.07 >>>> 524288 1048575 4 2909925 0.62 >>>> 1048576 2097151 3 3550680 0.76 >>>> 4194304 8388607 10 82497658 17.72 >>>> 8388608 16777215 10 158022653 33.94 >>>> 16777216 24567552 5 122778062 26.37 >>>> total free extents 80469 >>>> total free blocks 465609690 >>>> average free extent size 5786.2 >>>> >>>> Looks like plenty of free large extents, with most of the free space >>>> completely, unfragmented. >>>> >>> Indeed.. > You need to look at each AG, not the overall summary. You could have > a suboptimal AG hidden in amongst that (e.g. near ENOSPC) and it's > that one AG that is causing all your problems. > > There's many reasons this can happen, but the most common is the > working files in a directory (or subset of directories in the same > AG) have a combined space usage of larger than an AG .... That's certainly possible, even likely (one huge directory with all of the files). This layout is imposed on us by the compatibility gods. Is there a way to tell XFS to change its policy of on-ag-per-directory? If not, then we'll have to find some workaround to the compatibility problem. > >>>> Lots of 16MB-32MB extents, too. 32MB is our allocation hint size, could have >>>> something to do with it. >>>> >>> Most likely. Based on this, it's hard to say for certain why you'd be >>> running into allocation latency caused by busy extents. > One of only two reasons: > > 1. the AG has a large enough free space, but they are all > marked busy (i.e. just been freed), or > > 2. The extent selected has had the busy range trimmed out of > it and it's now less than the minimum extent length > requested. > > Both cases imply that we're allocating extents that have been very > recently freed, and that implies there is no other suitable non-busy > free space in the AG. Hence the need to look at the per-AG freespace > pattern rather than the global summary. > > Also, it's worth dumping the freespace via xfs_spaceman as it walks > the in memory trees rather than the on-disk trees and so is properly > coherent with operations in progress. (i.e. xfs_spaceman -c "freesp > ..." /mntpt) > That system has since been recycled, but we'll keep it in mind the next time we see it. 
Also, we'll have to fix the one-huge-directory problem for sure, one way or another. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: xfs_extent_busy_flush vs. aio 2018-01-29 11:44 ` Avi Kivity @ 2018-01-29 21:56 ` Dave Chinner 2018-01-30 8:58 ` Avi Kivity 2018-02-06 14:10 ` Avi Kivity 0 siblings, 2 replies; 20+ messages in thread From: Dave Chinner @ 2018-01-29 21:56 UTC (permalink / raw) To: Avi Kivity; +Cc: Brian Foster, linux-xfs On Mon, Jan 29, 2018 at 01:44:14PM +0200, Avi Kivity wrote: > > > On 01/29/2018 01:35 PM, Dave Chinner wrote: > > On Mon, Jan 29, 2018 at 11:40:27AM +0200, Avi Kivity wrote: > > > > > > On 01/25/2018 03:08 PM, Brian Foster wrote: > > > > On Thu, Jan 25, 2018 at 10:50:40AM +0200, Avi Kivity wrote: > > > > > On 01/23/2018 07:39 PM, Brian Foster wrote: > > > > > > Yeah, could be.. perhaps the issue is that despite the large amount of > > > > > > total free space, the free space is too fragmented to satisfy a > > > > > > particular allocation request..? > > > > > from to extents blocks pct > > > > > 1 1 2702 2702 0.00 > > > > > 2 3 690 1547 0.00 > > > > > 4 7 115 568 0.00 > > > > > 8 15 60 634 0.00 > > > > > 16 31 63 1457 0.00 > > > > > 32 63 102 4751 0.00 > > > > > 64 127 7940 895365 0.19 > > > > > 128 255 49680 12422100 2.67 > > > > > 256 511 1025 417078 0.09 > > > > > 512 1023 4170 3660771 0.79 > > > > > 1024 2047 2168 3503054 0.75 > > > > > 2048 4095 2567 7729442 1.66 > > > > > 4096 8191 8688 59394413 12.76 > > > > > 8192 16383 310 3100186 0.67 > > > > > 16384 32767 112 2339935 0.50 > > > > > 32768 65535 35 1381122 0.30 > > > > > 65536 131071 8 651391 0.14 > > > > > 131072 262143 2 344196 0.07 > > > > > 524288 1048575 4 2909925 0.62 > > > > > 1048576 2097151 3 3550680 0.76 > > > > > 4194304 8388607 10 82497658 17.72 > > > > > 8388608 16777215 10 158022653 33.94 > > > > > 16777216 24567552 5 122778062 26.37 > > > > > total free extents 80469 > > > > > total free blocks 465609690 > > > > > average free extent size 5786.2 > > > > > > > > > > Looks like plenty of free large extents, with most of the free space > > > > > completely, unfragmented. > > > > > > > > > Indeed.. > > You need to look at each AG, not the overall summary. You could have > > a suboptimal AG hidden in amongst that (e.g. near ENOSPC) and it's > > that one AG that is causing all your problems. > > > > There's many reasons this can happen, but the most common is the > > working files in a directory (or subset of directories in the same > > AG) have a combined space usage of larger than an AG .... > > That's certainly possible, even likely (one huge directory with all of the > files). > > This layout is imposed on us by the compatibility gods. Is there a way to > tell XFS to change its policy of on-ag-per-directory? mount with inode32. That rotors files around all AGs in a round robin fashion instead of trying to keep directory locality for a working set. i.e. it distributes the files evenly across the filesystem. However, this can have substantial impact on performance if the workload requires locality of files for performance and you're running on spinning rust. OTOH, if locality is not needed then distributing all files across all AGs evenly should give more even capacity usage. And, based on what was discussed on #xfs overnight, don't bother with the filestreams allocator - it will not solve the "full AG" problem, but it will introduce a whole bunch of new problems for you. > That system has since been recycled, but we'll keep it in mind the next time > we see it. Also, we'll have to fix the one-huge-directory problem for sure, > one way or another. 
If inode32 doesn't work for you, the easiest way is to use a small directory hash - create top level directory with ~2x AG count child directories and hash your files into those. This will distribute the files roughly evenly across all AGs in the filesystem whilst still maintaining locality within directories. It's kind of half way between inode64 and inode32.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 20+ messages in thread
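A minimal sketch of the directory-hash layout described above, with roughly 2x the AG count of top-level buckets. The mount point, the bucket naming and the use of cksum as the hash are illustrative assumptions, not anything from the original workload.

    mnt=/mnt/data        # hypothetical mount point
    agcount=$(xfs_info "$mnt" | sed -n 's/.*agcount=\([0-9]*\).*/\1/p' | head -n1)
    buckets=$((2 * agcount))

    # Pick a bucket from a cheap, stable hash of the file name, so files spread
    # roughly evenly over the AGs while each bucket keeps its own locality.
    bucket_for() {
            local name h
            name=$(basename "$1")
            h=$(printf '%s' "$name" | cksum | cut -d' ' -f1)
            echo "$mnt/bucket-$((h % buckets))"
    }

    dir=$(bucket_for "somefile.dat")    # example file name
    mkdir -p "$dir"
    touch "$dir/somefile.dat"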
* Re: xfs_extent_busy_flush vs. aio 2018-01-29 21:56 ` Dave Chinner @ 2018-01-30 8:58 ` Avi Kivity 2018-02-06 14:10 ` Avi Kivity 1 sibling, 0 replies; 20+ messages in thread From: Avi Kivity @ 2018-01-30 8:58 UTC (permalink / raw) To: Dave Chinner; +Cc: Brian Foster, linux-xfs On 01/29/2018 11:56 PM, Dave Chinner wrote: > On Mon, Jan 29, 2018 at 01:44:14PM +0200, Avi Kivity wrote: >> >> On 01/29/2018 01:35 PM, Dave Chinner wrote: >>> On Mon, Jan 29, 2018 at 11:40:27AM +0200, Avi Kivity wrote: >>>> On 01/25/2018 03:08 PM, Brian Foster wrote: >>>>> On Thu, Jan 25, 2018 at 10:50:40AM +0200, Avi Kivity wrote: >>>>>> On 01/23/2018 07:39 PM, Brian Foster wrote: >>>>>>> Yeah, could be.. perhaps the issue is that despite the large amount of >>>>>>> total free space, the free space is too fragmented to satisfy a >>>>>>> particular allocation request..? >>>>>> from to extents blocks pct >>>>>> 1 1 2702 2702 0.00 >>>>>> 2 3 690 1547 0.00 >>>>>> 4 7 115 568 0.00 >>>>>> 8 15 60 634 0.00 >>>>>> 16 31 63 1457 0.00 >>>>>> 32 63 102 4751 0.00 >>>>>> 64 127 7940 895365 0.19 >>>>>> 128 255 49680 12422100 2.67 >>>>>> 256 511 1025 417078 0.09 >>>>>> 512 1023 4170 3660771 0.79 >>>>>> 1024 2047 2168 3503054 0.75 >>>>>> 2048 4095 2567 7729442 1.66 >>>>>> 4096 8191 8688 59394413 12.76 >>>>>> 8192 16383 310 3100186 0.67 >>>>>> 16384 32767 112 2339935 0.50 >>>>>> 32768 65535 35 1381122 0.30 >>>>>> 65536 131071 8 651391 0.14 >>>>>> 131072 262143 2 344196 0.07 >>>>>> 524288 1048575 4 2909925 0.62 >>>>>> 1048576 2097151 3 3550680 0.76 >>>>>> 4194304 8388607 10 82497658 17.72 >>>>>> 8388608 16777215 10 158022653 33.94 >>>>>> 16777216 24567552 5 122778062 26.37 >>>>>> total free extents 80469 >>>>>> total free blocks 465609690 >>>>>> average free extent size 5786.2 >>>>>> >>>>>> Looks like plenty of free large extents, with most of the free space >>>>>> completely, unfragmented. >>>>>> >>>>> Indeed.. >>> You need to look at each AG, not the overall summary. You could have >>> a suboptimal AG hidden in amongst that (e.g. near ENOSPC) and it's >>> that one AG that is causing all your problems. >>> >>> There's many reasons this can happen, but the most common is the >>> working files in a directory (or subset of directories in the same >>> AG) have a combined space usage of larger than an AG .... >> That's certainly possible, even likely (one huge directory with all of the >> files). >> >> This layout is imposed on us by the compatibility gods. Is there a way to >> tell XFS to change its policy of on-ag-per-directory? > mount with inode32. That rotors files around all AGs in a round > robin fashion instead of trying to keep directory locality for a > working set. i.e. it distributes the files evenly across the > filesystem. I remember you recommending it a couple of years ago, with the qualification to only do it if we see a problem. I think we qualify now. > > However, this can have substantial impact on performance if the > workload requires locality of files for performance and you're > running on spinning rust. This is not a problem. We do have a few users on spinning disks, but they're a small minority, and our installer can avoid the inode32 option if it sees a rotational device. > OTOH, if locality is not needed then > distributing all files across all AGs evenly should give more even > capacity usage. > > And, based on what was discussed on #xfs overnight, don't bother > with the filestreams allocator - it will not solve the "full AG" > problem, but it will introduce a whole bunch of new problems for > you. 
> >> That system has since been recycled, but we'll keep it in mind the next time >> we see it. Also, we'll have to fix the one-huge-directory problem for sure, >> one way or another. > If inode32 doesn't work for you, the easiest way is to use a small > directory hash - create top level directory with ~2x AG count child > directories and hash your files into those. This will distribute the > files roughly evenly across all AGs in the filesystem whilst still > maintaining locality within directories. It's kind of half way > between inode64 and inode32.... > Yeah. Directory layout is sorta imposed on us, but if we have to break it, we will. ^ permalink raw reply [flat|nested] 20+ messages in thread
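A sketch of the installer-side check mentioned above, keyed off the kernel's rotational flag to decide whether to add inode32. The device name, mount point and baseline options are placeholders rather than what any real installer does.

    dev=sdb                       # hypothetical data disk
    opts=noatime                  # assumed baseline mount options
    if [ "$(cat /sys/block/$dev/queue/rotational)" = "0" ]; then
            opts="$opts,inode32"  # SSD/NVMe: directory locality matters less
    fi
    mount -o "$opts" "/dev/$dev" /mnt/data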
* Re: xfs_extent_busy_flush vs. aio 2018-01-29 21:56 ` Dave Chinner 2018-01-30 8:58 ` Avi Kivity @ 2018-02-06 14:10 ` Avi Kivity 2018-02-07 1:57 ` Dave Chinner 1 sibling, 1 reply; 20+ messages in thread From: Avi Kivity @ 2018-02-06 14:10 UTC (permalink / raw) To: Dave Chinner; +Cc: Brian Foster, linux-xfs On 01/29/2018 11:56 PM, Dave Chinner wrote: > On Mon, Jan 29, 2018 at 01:44:14PM +0200, Avi Kivity wrote: >> >> On 01/29/2018 01:35 PM, Dave Chinner wrote: >>> On Mon, Jan 29, 2018 at 11:40:27AM +0200, Avi Kivity wrote: >>>> On 01/25/2018 03:08 PM, Brian Foster wrote: >>>>> On Thu, Jan 25, 2018 at 10:50:40AM +0200, Avi Kivity wrote: >>>>>> On 01/23/2018 07:39 PM, Brian Foster wrote: >>>>>>> Yeah, could be.. perhaps the issue is that despite the large amount of >>>>>>> total free space, the free space is too fragmented to satisfy a >>>>>>> particular allocation request..? >>>>>> from to extents blocks pct >>>>>> 1 1 2702 2702 0.00 >>>>>> 2 3 690 1547 0.00 >>>>>> 4 7 115 568 0.00 >>>>>> 8 15 60 634 0.00 >>>>>> 16 31 63 1457 0.00 >>>>>> 32 63 102 4751 0.00 >>>>>> 64 127 7940 895365 0.19 >>>>>> 128 255 49680 12422100 2.67 >>>>>> 256 511 1025 417078 0.09 >>>>>> 512 1023 4170 3660771 0.79 >>>>>> 1024 2047 2168 3503054 0.75 >>>>>> 2048 4095 2567 7729442 1.66 >>>>>> 4096 8191 8688 59394413 12.76 >>>>>> 8192 16383 310 3100186 0.67 >>>>>> 16384 32767 112 2339935 0.50 >>>>>> 32768 65535 35 1381122 0.30 >>>>>> 65536 131071 8 651391 0.14 >>>>>> 131072 262143 2 344196 0.07 >>>>>> 524288 1048575 4 2909925 0.62 >>>>>> 1048576 2097151 3 3550680 0.76 >>>>>> 4194304 8388607 10 82497658 17.72 >>>>>> 8388608 16777215 10 158022653 33.94 >>>>>> 16777216 24567552 5 122778062 26.37 >>>>>> total free extents 80469 >>>>>> total free blocks 465609690 >>>>>> average free extent size 5786.2 >>>>>> >>>>>> Looks like plenty of free large extents, with most of the free space >>>>>> completely, unfragmented. >>>>>> >>>>> Indeed.. >>> You need to look at each AG, not the overall summary. You could have >>> a suboptimal AG hidden in amongst that (e.g. near ENOSPC) and it's >>> that one AG that is causing all your problems. >>> >>> There's many reasons this can happen, but the most common is the >>> working files in a directory (or subset of directories in the same >>> AG) have a combined space usage of larger than an AG .... >> That's certainly possible, even likely (one huge directory with all of the >> files). >> >> This layout is imposed on us by the compatibility gods. Is there a way to >> tell XFS to change its policy of on-ag-per-directory? > mount with inode32. That rotors files around all AGs in a round > robin fashion instead of trying to keep directory locality for a > working set. i.e. it distributes the files evenly across the > filesystem. http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch06s09.html says: "When 32 bit inode numbers are used on a volume larger than 1TB in size, several changes occur. A 100TB volume using 256 byte inodes mounted in the default inode32 mode has just one percent of its space available for allocating inodes. XFS will reserve the first 1TB of disk space exclusively for inodes to ensure that the imbalance is no worse than this due to file data allocations." Does this mean that a 1.1TB disk has 1TB reserved for inodes and 0.1TB left over for data? Or is it driven by the "one percent" which is mentioned above, so it would be 0.011TB? ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: xfs_extent_busy_flush vs. aio 2018-02-06 14:10 ` Avi Kivity @ 2018-02-07 1:57 ` Dave Chinner 2018-02-07 10:54 ` Avi Kivity 0 siblings, 1 reply; 20+ messages in thread From: Dave Chinner @ 2018-02-07 1:57 UTC (permalink / raw) To: Avi Kivity; +Cc: Brian Foster, linux-xfs On Tue, Feb 06, 2018 at 04:10:12PM +0200, Avi Kivity wrote: > > On 01/29/2018 11:56 PM, Dave Chinner wrote: > > On Mon, Jan 29, 2018 at 01:44:14PM +0200, Avi Kivity wrote: > > > > There's many reasons this can happen, but the most common is the > > > > working files in a directory (or subset of directories in the same > > > > AG) have a combined space usage of larger than an AG .... > > > That's certainly possible, even likely (one huge directory with all of the > > > files). > > > > > > This layout is imposed on us by the compatibility gods. Is there a way to > > > tell XFS to change its policy of on-ag-per-directory? > > mount with inode32. That rotors files around all AGs in a round > > robin fashion instead of trying to keep directory locality for a > > working set. i.e. it distributes the files evenly across the > > filesystem. > > http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch06s09.html > says: > > "When 32 bit inode numbers are used on a volume larger than 1TB in size, > several changes occur. > > A 100TB volume using 256 byte inodes mounted in the default inode32 mode has > just one percent of its space available for allocating inodes. > > XFS will reserve the first 1TB of disk space exclusively for inodes to > ensure that the imbalance is no worse than this due to file data > allocations." s/exclusively// > Does this mean that a 1.1TB disk has 1TB reserved for inodes and 0.1TB left > over for data? No, that would be silly. > Or is it driven by the "one percent" which is mentioned > above, so it would be 0.011TB? No, you're inferring behavioural rules that don't exist from a simple example. Maximum number of inodes is controlled by min(imaxpct, free space). For inode32, "free space" is what's in the first 32 bits of the inode address space. For inode64, it's global free space. To enable this, inode32 sets the AGs wholly within the first 32 bits of the inode address space to be "metadata prefered" and "inode capable". Important things to note: - "first 32 bits of inode address space" means the range of space that inode32 reserves for inodes changes according to inode size. 256 byte inodes = 1TB, 2kB inodes = 8TB. If the filesystem is smaller than this threshold, then it will silently use the inode64 allocation policy until the filesystem is grown beyond 32 bit inode address space size. - "inode capable" means inodes can be allocated in the AG - "metadata preferred" means user data will not get allocated in this AG unless all non-prefered AGs are full. So, assuming 256 byte inodes, you 1.1TB fs will have a imaxpct of ~25%, allowing a maximum of 256GB of inodes or about a billion inodes. But once you put more than 0.1TB of data into the filesystem, data will start filling up the inode capable AGs as well, and then your limit for inodes looks just like inode64 (i.e. depedent on free space). IOWs, inode32 limits where and how many inodes you can create, not how much user data you can write inode the filesystem. -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 20+ messages in thread
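A back-of-the-envelope check of the numbers above, in shell arithmetic purely for illustration. It assumes the usual inode32 mapping where the first 2^32 inode numbers cover 2^32 * inode-size bytes of filesystem address space.

    # 32-bit inode address space for a given inode size, in TiB:
    echo $(( (1 << 32) * 256  / (1 << 40) ))     # 256-byte inodes -> 1
    echo $(( (1 << 32) * 2048 / (1 << 40) ))     # 2 KiB inodes    -> 8

    # 1.1 TB filesystem with imaxpct ~25%: about 0.275 TB available for inodes,
    # which at 256 bytes per inode is on the order of a billion inodes.
    echo $(( 275 * 1000 * 1000 * 1000 / 256 ))   # ~1.07 billion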
* Re: xfs_extent_busy_flush vs. aio 2018-02-07 1:57 ` Dave Chinner @ 2018-02-07 10:54 ` Avi Kivity 2018-02-07 23:43 ` Dave Chinner 0 siblings, 1 reply; 20+ messages in thread From: Avi Kivity @ 2018-02-07 10:54 UTC (permalink / raw) To: Dave Chinner; +Cc: Brian Foster, linux-xfs On 02/07/2018 03:57 AM, Dave Chinner wrote: > On Tue, Feb 06, 2018 at 04:10:12PM +0200, Avi Kivity wrote: >> On 01/29/2018 11:56 PM, Dave Chinner wrote: >>> On Mon, Jan 29, 2018 at 01:44:14PM +0200, Avi Kivity wrote: >>>>> There's many reasons this can happen, but the most common is the >>>>> working files in a directory (or subset of directories in the same >>>>> AG) have a combined space usage of larger than an AG .... >>>> That's certainly possible, even likely (one huge directory with all of the >>>> files). >>>> >>>> This layout is imposed on us by the compatibility gods. Is there a way to >>>> tell XFS to change its policy of on-ag-per-directory? >>> mount with inode32. That rotors files around all AGs in a round >>> robin fashion instead of trying to keep directory locality for a >>> working set. i.e. it distributes the files evenly across the >>> filesystem. >> http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch06s09.html >> says: >> >> "When 32 bit inode numbers are used on a volume larger than 1TB in size, >> several changes occur. >> >> A 100TB volume using 256 byte inodes mounted in the default inode32 mode has >> just one percent of its space available for allocating inodes. >> >> XFS will reserve the first 1TB of disk space exclusively for inodes to >> ensure that the imbalance is no worse than this due to file data >> allocations." > s/exclusively// > >> Does this mean that a 1.1TB disk has 1TB reserved for inodes and 0.1TB left >> over for data? > No, that would be silly. Suggest doc changes for both. > >> Or is it driven by the "one percent" which is mentioned >> above, so it would be 0.011TB? > No, you're inferring behavioural rules that don't exist from a > simple example. > > Maximum number of inodes is controlled by min(imaxpct, free space). > For inode32, "free space" is what's in the first 32 bits of the inode > address space. For inode64, it's global free space. > > To enable this, inode32 sets the AGs wholly within the first 32 bits > of the inode address space to be "metadata prefered" and "inode > capable". > > Important things to note: > > - "first 32 bits of inode address space" means the range of > space that inode32 reserves for inodes changes according > to inode size. 256 byte inodes = 1TB, 2kB inodes = 8TB. If > the filesystem is smaller than this threshold, then it > will silently use the inode64 allocation policy until the > filesystem is grown beyond 32 bit inode address space > size. > > - "inode capable" means inodes can be allocated in the AG > > - "metadata preferred" means user data will not get > allocated in this AG unless all non-prefered AGs are full. > > > So, assuming 256 byte inodes, you 1.1TB fs will have a imaxpct of > ~25%, allowing a maximum of 256GB of inodes or about a billion > inodes. But once you put more than 0.1TB of data into the > filesystem, data will start filling up the inode capable AGs as > well, and then your limit for inodes looks just like inode64 (i.e. > depedent on free space). > > IOWs, inode32 limits where and how many inodes you can > create, not how much user data you can write inode the filesystem. > Thanks a lot for the clarifications. Looks like inode32 can be used to reduce some of our pain. 
There's a danger that when switching from inode64 to inode32 you end up with the inode32 address space already exhausted, right? Does that result in ENOSPC or what? Anyway, it can probably be fixed by stopping the load, copying files around, and moving them back. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: xfs_extent_busy_flush vs. aio 2018-02-07 10:54 ` Avi Kivity @ 2018-02-07 23:43 ` Dave Chinner 0 siblings, 0 replies; 20+ messages in thread From: Dave Chinner @ 2018-02-07 23:43 UTC (permalink / raw) To: Avi Kivity; +Cc: Brian Foster, linux-xfs On Wed, Feb 07, 2018 at 12:54:43PM +0200, Avi Kivity wrote: > On 02/07/2018 03:57 AM, Dave Chinner wrote: > >IOWs, inode32 limits where and how many inodes you can > >create, not how much user data you can write inode the filesystem. > > Thanks a lot for the clarifications. Looks like inode32 can be used > to reduce some of our pain. > > There's a danger that when switching from inode64 to inode32 you end > up with the inode32 address space already exhausted, right? Does > that result in ENOSPC or what? ENOSPC on inode allocation. > Anyway, can probably be fixed by stopping the load, copying files > around, and moving them back. Yup, assuming you're able to find the files that need to be moved in a finite period of time. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: xfs_extent_busy_flush vs. aio 2018-01-25 13:08 ` Brian Foster 2018-01-29 9:40 ` Avi Kivity @ 2018-02-02 9:48 ` Christoph Hellwig 1 sibling, 0 replies; 20+ messages in thread From: Christoph Hellwig @ 2018-02-02 9:48 UTC (permalink / raw) To: Brian Foster; +Cc: Avi Kivity, linux-xfs On Thu, Jan 25, 2018 at 08:08:31AM -0500, Brian Foster wrote: > I suppose it's possible that this was some kind of transient state, or > perhaps only a small set of AGs are affected, etc. It's also possible > this may have been improved in more recent kernels by Christoph's rework > of some of that code. In any event, this would probably require a bit > more runtime analysis to figure out where/why allocations are getting > stalled as such. I'd probably start by looking at the xfs_extent_busy_* > tracepoints (also note that if there's potentially something to be > improved on here, it's more useful to do so against current upstream). > > Or you could just move to something that supports RWF_NOWAIT.. ;) The way the XFS allocator works has had a fundamental flaw ever since we introduced the concept of busy extents, and that is that we need to lock ourselves into an AG, or sometimes even a specific range, without taking said busy extents into account. The proper fix is to separate the on-disk and in-memory data structures for free space tracking, and only release the busy extents to the in-memory one once they aren't busy anymore. Looking into this has been on my todo list for a long time, but I never got to it. ^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2018-02-07 23:43 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed):
  2018-01-23 14:57 xfs_extent_busy_flush vs. aio Avi Kivity
  2018-01-23 15:28 ` Brian Foster
  2018-01-23 15:45 ` Avi Kivity
  2018-01-23 16:11 ` Brian Foster
  2018-01-23 16:22 ` Avi Kivity
  2018-01-23 16:47 ` Brian Foster
  2018-01-23 17:00 ` Avi Kivity
  2018-01-23 17:39 ` Brian Foster
  2018-01-25 8:50 ` Avi Kivity
  2018-01-25 13:08 ` Brian Foster
  2018-01-29 9:40 ` Avi Kivity
  2018-01-29 11:35 ` Dave Chinner
  2018-01-29 11:44 ` Avi Kivity
  2018-01-29 21:56 ` Dave Chinner
  2018-01-30 8:58 ` Avi Kivity
  2018-02-06 14:10 ` Avi Kivity
  2018-02-07 1:57 ` Dave Chinner
  2018-02-07 10:54 ` Avi Kivity
  2018-02-07 23:43 ` Dave Chinner
  2018-02-02 9:48 ` Christoph Hellwig