* xfs_extent_busy_flush vs. aio
@ 2018-01-23 14:57 Avi Kivity
  2018-01-23 15:28 ` Brian Foster
  0 siblings, 1 reply; 20+ messages in thread

From: Avi Kivity @ 2018-01-23 14:57 UTC (permalink / raw)
  To: linux-xfs

I'm seeing the equivalent[*] of xfs_extent_busy_flush() sleeping in my
beautiful io_submit() calls.


Questions:

- Is it correct that RWF_NOWAIT will not detect the condition that led to
the log being forced?

- If so, can it be fixed?

- Can I do something to reduce the odds of this occurring? larger logs,
more logs, flush more often, resurrect extinct species and sacrifice them to
the xfs gods?

- Can an xfs developer do something? For example, make it RWF_NOWAIT
friendly (if the answer to the first question was "correct")


[*] equivalent, because I'm actually looking at an older kernel that lacks
this function. But I'm moderately confident that the xfs_log_force I'm
seeing was transformed into xfs_extent_busy_flush by
ebf55872616c7d4754db5a318591a72a8d5e6896

^ permalink raw reply [flat|nested] 20+ messages in thread
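For reference, the io_submit()-side pattern under discussion looks roughly
like the sketch below: a single aio write tagged with RWF_NOWAIT, so that a
write which would need block allocation (and could therefore end up in a log
force) comes back as -EAGAIN instead of blocking the submitting thread. This
is an illustrative sketch only, not code from the thread: it uses the raw aio
syscalls rather than libaio, assumes a 4.13+ kernel with matching headers for
aio_rw_flags and RWF_NOWAIT, and the file path, I/O size and alignment are
placeholders.

/*
 * Illustrative sketch only (not from the thread): submit one aio write with
 * RWF_NOWAIT so that a write which would need block allocation completes
 * with -EAGAIN instead of blocking.  Raw aio syscalls are used so this does
 * not depend on a libaio version that exposes aio_rw_flags; the file path,
 * I/O size and alignment are placeholders.
 */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/aio_abi.h>              /* struct iocb, IOCB_CMD_PWRITE */
#include <linux/fs.h>                   /* RWF_NOWAIT (4.13+ headers) */

static long sys_io_setup(unsigned nr, aio_context_t *ctx)
{
        return syscall(SYS_io_setup, nr, ctx);
}

static long sys_io_submit(aio_context_t ctx, long n, struct iocb **iocbpp)
{
        return syscall(SYS_io_submit, ctx, n, iocbpp);
}

static long sys_io_getevents(aio_context_t ctx, long min_nr, long nr,
                             struct io_event *ev, struct timespec *tmo)
{
        return syscall(SYS_io_getevents, ctx, min_nr, nr, ev, tmo);
}

int main(void)
{
        aio_context_t ctx = 0;
        struct iocb cb;
        struct iocb *cbs[1] = { &cb };
        struct io_event ev;
        size_t len = 128 * 1024;
        void *buf;
        int fd;

        if (sys_io_setup(8, &ctx) < 0) {
                perror("io_setup");
                return 1;
        }

        /* placeholder path; O_DIRECT because RWF_NOWAIT targets direct I/O */
        fd = open("/var/lib/scylla/testfile",
                  O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0 || posix_memalign(&buf, 4096, len))
                return 1;
        memset(buf, 'x', len);

        memset(&cb, 0, sizeof(cb));
        cb.aio_lio_opcode = IOCB_CMD_PWRITE;
        cb.aio_fildes     = fd;
        cb.aio_buf        = (unsigned long)buf;
        cb.aio_nbytes     = len;
        cb.aio_offset     = 0;
        cb.aio_rw_flags   = RWF_NOWAIT; /* fail fast instead of blocking */

        if (sys_io_submit(ctx, 1, cbs) < 0) {
                /* older kernels reject a nonzero aio_rw_flags with EINVAL;
                 * some versions may also report EAGAIN here directly */
                perror("io_submit");
                return 1;
        }
        if (sys_io_getevents(ctx, 1, 1, &ev, NULL) != 1)
                return 1;

        if ((long)ev.res == -EAGAIN)
                /* would have had to allocate/block: retry later or hand the
                 * write to a helper thread without RWF_NOWAIT */
                fprintf(stderr, "kernel returned EAGAIN for this write\n");
        else if ((long)ev.res < 0)
                fprintf(stderr, "aio error: %ld\n", (long)ev.res);

        close(fd);
        return 0;
}

On older kernels the aio_rw_flags field does not exist (the same bytes are a
reserved field that must be zero), so a nonzero value is simply rejected by
io_submit() with EINVAL there.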
* Re: xfs_extent_busy_flush vs. aio 2018-01-23 14:57 xfs_extent_busy_flush vs. aio Avi Kivity @ 2018-01-23 15:28 ` Brian Foster 2018-01-23 15:45 ` Avi Kivity 0 siblings, 1 reply; 20+ messages in thread From: Brian Foster @ 2018-01-23 15:28 UTC (permalink / raw) To: Avi Kivity; +Cc: linux-xfs On Tue, Jan 23, 2018 at 04:57:03PM +0200, Avi Kivity wrote: > I'm seeing the equivalent[*] of xfs_extent_busy_flush() sleeping in my > beautiful io_submit() calls. > > > Questions: > > - Is it correct that RWF_NOWAIT will not detect the condition that led to > the log being forced? > > - If so, can it be fixed? > > - Can I do something to reduce the odds of this occurring? larger logs, > more logs, flush more often, resurrect extinct species and sacrifice them to > the xfs gods? > > - Can an xfs developer do something? For example, make it RWF_NOWAIT > friendly (if the answer to the first question was "correct") > So RWF_NOWAIT eventually works its way to IOMAP_NOWAIT, which looks like it skips any write call that would require allocation in xfs_file_iomap_begin(). The busy flush should only happen in the block allocation path, so something is missing here. Do you have a backtrace for the log force you're seeing? Brian > > [*] equivalent, because I'm actually looking at an older kernel that lacks > this function. But I'm moderately confident that the xfs_log_force I'm > seeing was transformed into xfs_extent_busy_flush by > ebf55872616c7d4754db5a318591a72a8d5e6896 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: xfs_extent_busy_flush vs. aio 2018-01-23 15:28 ` Brian Foster @ 2018-01-23 15:45 ` Avi Kivity 2018-01-23 16:11 ` Brian Foster 0 siblings, 1 reply; 20+ messages in thread From: Avi Kivity @ 2018-01-23 15:45 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On 01/23/2018 05:28 PM, Brian Foster wrote: > On Tue, Jan 23, 2018 at 04:57:03PM +0200, Avi Kivity wrote: >> I'm seeing the equivalent[*] of xfs_extent_busy_flush() sleeping in my >> beautiful io_submit() calls. >> >> >> Questions: >> >> - Is it correct that RWF_NOWAIT will not detect the condition that led to >> the log being forced? >> >> - If so, can it be fixed? >> >> - Can I do something to reduce the odds of this occurring? larger logs, >> more logs, flush more often, resurrect extinct species and sacrifice them to >> the xfs gods? >> >> - Can an xfs developer do something? For example, make it RWF_NOWAIT >> friendly (if the answer to the first question was "correct") >> > So RWF_NOWAIT eventually works its way to IOMAP_NOWAIT, which looks like > it skips any write call that would require allocation in > xfs_file_iomap_begin(). The busy flush should only happen in the block > allocation path, so something is missing here. Do you have a backtrace > for the log force you're seeing? > > Here's a trace. It's from a kernel that lacks RWF_NOWAIT. 0xffffffff816ab231 : __schedule+0x531/0x9b0 [kernel] 0xffffffff816ab6d9 : schedule+0x29/0x70 [kernel] 0xffffffff816a90e9 : schedule_timeout+0x239/0x2c0 [kernel] 0xffffffff816aba8d : wait_for_completion+0xfd/0x140 [kernel] 0xffffffff810ab41d : flush_work+0xfd/0x190 [kernel] 0xffffffffc00ddb3a : xlog_cil_force_lsn+0x8a/0x210 [xfs] 0xffffffffc00dbbf5 : _xfs_log_force+0x85/0x2c0 [xfs] 0xffffffffc00dbe5c : xfs_log_force+0x2c/0x70 [xfs] 0xffffffffc0078f60 : xfs_alloc_ag_vextent_size+0x250/0x630 [xfs] 0xffffffffc0079ed5 : xfs_alloc_ag_vextent+0xe5/0x150 [xfs] 0xffffffffc007abc6 : xfs_alloc_vextent+0x446/0x5f0 [xfs] 0xffffffffc008b123 : xfs_bmap_btalloc+0x3f3/0x780 [xfs] 0xffffffffc008b4be : xfs_bmap_alloc+0xe/0x10 [xfs] 0xffffffffc008bef9 : xfs_bmapi_write+0x499/0xab0 [xfs] 0xffffffffc00c6ec8 : xfs_iomap_write_direct+0x1b8/0x390 [xfs] ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: xfs_extent_busy_flush vs. aio 2018-01-23 15:45 ` Avi Kivity @ 2018-01-23 16:11 ` Brian Foster 2018-01-23 16:22 ` Avi Kivity 0 siblings, 1 reply; 20+ messages in thread From: Brian Foster @ 2018-01-23 16:11 UTC (permalink / raw) To: Avi Kivity; +Cc: linux-xfs On Tue, Jan 23, 2018 at 05:45:39PM +0200, Avi Kivity wrote: > > > On 01/23/2018 05:28 PM, Brian Foster wrote: > > On Tue, Jan 23, 2018 at 04:57:03PM +0200, Avi Kivity wrote: > > > I'm seeing the equivalent[*] of xfs_extent_busy_flush() sleeping in my > > > beautiful io_submit() calls. > > > > > > > > > Questions: > > > > > > - Is it correct that RWF_NOWAIT will not detect the condition that led to > > > the log being forced? > > > > > > - If so, can it be fixed? > > > > > > - Can I do something to reduce the odds of this occurring? larger logs, > > > more logs, flush more often, resurrect extinct species and sacrifice them to > > > the xfs gods? > > > > > > - Can an xfs developer do something? For example, make it RWF_NOWAIT > > > friendly (if the answer to the first question was "correct") > > > > > So RWF_NOWAIT eventually works its way to IOMAP_NOWAIT, which looks like > > it skips any write call that would require allocation in > > xfs_file_iomap_begin(). The busy flush should only happen in the block > > allocation path, so something is missing here. Do you have a backtrace > > for the log force you're seeing? > > > > > > Here's a trace. It's from a kernel that lacks RWF_NOWAIT. > Oh, so the case below is roughly how I would have expected to hit the flush/wait without RWF_NOWAIT. The latter flag should prevent this, to answer your first question. For the follow up question, I think this should only occur when the fs is fairly low on free space. Is that the case here? I'm not sure there's a specific metric, fwiw, but it's just a matter of attempting an (user data) allocation that only finds busy extents in the free space btrees and thus has to the force the log to satisfy the allocation. I suppose running with more free space available would avoid this. I think running with less in-core log space could indirectly reduce extent busy time, but that may also have other performance ramifications and so is probably not a great idea. Brian > 0xffffffff816ab231 : __schedule+0x531/0x9b0 [kernel] > 0xffffffff816ab6d9 : schedule+0x29/0x70 [kernel] > 0xffffffff816a90e9 : schedule_timeout+0x239/0x2c0 [kernel] > 0xffffffff816aba8d : wait_for_completion+0xfd/0x140 [kernel] > 0xffffffff810ab41d : flush_work+0xfd/0x190 [kernel] > 0xffffffffc00ddb3a : xlog_cil_force_lsn+0x8a/0x210 [xfs] > 0xffffffffc00dbbf5 : _xfs_log_force+0x85/0x2c0 [xfs] > 0xffffffffc00dbe5c : xfs_log_force+0x2c/0x70 [xfs] > 0xffffffffc0078f60 : xfs_alloc_ag_vextent_size+0x250/0x630 [xfs] > 0xffffffffc0079ed5 : xfs_alloc_ag_vextent+0xe5/0x150 [xfs] > 0xffffffffc007abc6 : xfs_alloc_vextent+0x446/0x5f0 [xfs] > 0xffffffffc008b123 : xfs_bmap_btalloc+0x3f3/0x780 [xfs] > 0xffffffffc008b4be : xfs_bmap_alloc+0xe/0x10 [xfs] > 0xffffffffc008bef9 : xfs_bmapi_write+0x499/0xab0 [xfs] > 0xffffffffc00c6ec8 : xfs_iomap_write_direct+0x1b8/0x390 [xfs] > ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: xfs_extent_busy_flush vs. aio 2018-01-23 16:11 ` Brian Foster @ 2018-01-23 16:22 ` Avi Kivity 2018-01-23 16:47 ` Brian Foster 0 siblings, 1 reply; 20+ messages in thread From: Avi Kivity @ 2018-01-23 16:22 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On 01/23/2018 06:11 PM, Brian Foster wrote: > On Tue, Jan 23, 2018 at 05:45:39PM +0200, Avi Kivity wrote: >> >> On 01/23/2018 05:28 PM, Brian Foster wrote: >>> On Tue, Jan 23, 2018 at 04:57:03PM +0200, Avi Kivity wrote: >>>> I'm seeing the equivalent[*] of xfs_extent_busy_flush() sleeping in my >>>> beautiful io_submit() calls. >>>> >>>> >>>> Questions: >>>> >>>> - Is it correct that RWF_NOWAIT will not detect the condition that led to >>>> the log being forced? >>>> >>>> - If so, can it be fixed? >>>> >>>> - Can I do something to reduce the odds of this occurring? larger logs, >>>> more logs, flush more often, resurrect extinct species and sacrifice them to >>>> the xfs gods? >>>> >>>> - Can an xfs developer do something? For example, make it RWF_NOWAIT >>>> friendly (if the answer to the first question was "correct") >>>> >>> So RWF_NOWAIT eventually works its way to IOMAP_NOWAIT, which looks like >>> it skips any write call that would require allocation in >>> xfs_file_iomap_begin(). The busy flush should only happen in the block >>> allocation path, so something is missing here. Do you have a backtrace >>> for the log force you're seeing? >>> >>> >> Here's a trace. It's from a kernel that lacks RWF_NOWAIT. >> > Oh, so the case below is roughly how I would have expected to hit the > flush/wait without RWF_NOWAIT. The latter flag should prevent this, to > answer your first question. Thanks, that's very encouraging. We are exploring recommending upstream-ish kernels to users and customers, given their relative stability these days and aio-related improvements (not to mention the shame of having to admit to running an old kernel when reporting a problem to an upstream list). > > For the follow up question, I think this should only occur when the fs > is fairly low on free space. Is that the case here? No: /dev/md0 3.0T 1.2T 1.8T 40% /var/lib/scylla > I'm not sure there's > a specific metric, fwiw, but it's just a matter of attempting an (user > data) allocation that only finds busy extents in the free space btrees > and thus has to the force the log to satisfy the allocation. What does "busy" mean here? recently freed so we want to force the log to make sure the extent isn't doubly-allocated? (wild guess) > I suppose > running with more free space available would avoid this. I think running > with less in-core log space could indirectly reduce extent busy time, > but that may also have other performance ramifications and so is > probably not a great idea. At 60%, I hope low free space is not a problem. 
btw, I'm also seeing 10ms+ periods of high CPU utilization: 0xffffffff816ab97a : _cond_resched+0x3a/0x50 [kernel] 0xffffffff811e1495 : kmem_cache_alloc+0x35/0x1e0 [kernel] 0xffffffffc00d8477 : kmem_zone_alloc+0x97/0x130 [xfs] 0xffffffffc00deae2 : xfs_buf_item_init+0x42/0x190 [xfs] 0xffffffffc00e89c3 : _xfs_trans_bjoin+0x23/0x60 [xfs] 0xffffffffc00e8f17 : xfs_trans_read_buf_map+0x247/0x400 [xfs] 0xffffffffc008f248 : xfs_btree_read_buf_block.constprop.29+0x78/0xc0 [xfs] 0xffffffffc009221e : xfs_btree_increment+0x21e/0x350 [xfs] 0xffffffffc00796a8 : xfs_alloc_ag_vextent_near+0x368/0xab0 [xfs] 0xffffffffc0079efd : xfs_alloc_ag_vextent+0x10d/0x150 [xfs] 0xffffffffc007abc6 : xfs_alloc_vextent+0x446/0x5f0 [xfs] 0xffffffffc008b123 : xfs_bmap_btalloc+0x3f3/0x780 [xfs] 0xffffffffc008b4be : xfs_bmap_alloc+0xe/0x10 [xfs] Is it normal for xfs to spend 10ms+ of CPU time to allocate an extent? Should I be increasing my extent hint (currently at 32MB)? > > Brian > >> 0xffffffff816ab231 : __schedule+0x531/0x9b0 [kernel] >> 0xffffffff816ab6d9 : schedule+0x29/0x70 [kernel] >> 0xffffffff816a90e9 : schedule_timeout+0x239/0x2c0 [kernel] >> 0xffffffff816aba8d : wait_for_completion+0xfd/0x140 [kernel] >> 0xffffffff810ab41d : flush_work+0xfd/0x190 [kernel] >> 0xffffffffc00ddb3a : xlog_cil_force_lsn+0x8a/0x210 [xfs] >> 0xffffffffc00dbbf5 : _xfs_log_force+0x85/0x2c0 [xfs] >> 0xffffffffc00dbe5c : xfs_log_force+0x2c/0x70 [xfs] >> 0xffffffffc0078f60 : xfs_alloc_ag_vextent_size+0x250/0x630 [xfs] >> 0xffffffffc0079ed5 : xfs_alloc_ag_vextent+0xe5/0x150 [xfs] >> 0xffffffffc007abc6 : xfs_alloc_vextent+0x446/0x5f0 [xfs] >> 0xffffffffc008b123 : xfs_bmap_btalloc+0x3f3/0x780 [xfs] >> 0xffffffffc008b4be : xfs_bmap_alloc+0xe/0x10 [xfs] >> 0xffffffffc008bef9 : xfs_bmapi_write+0x499/0xab0 [xfs] >> 0xffffffffc00c6ec8 : xfs_iomap_write_direct+0x1b8/0x390 [xfs] >> ^ permalink raw reply [flat|nested] 20+ messages in thread
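As context for the 32MB figure above: that is the per-inode extent size hint,
which is set through the FSSETXATTR ioctl, either directly on a file or on a
directory so that new files inherit it (xfs_io -c "extsize 32m" does the same
from the command line). Below is a minimal sketch of setting such a hint; the
path and the 32MB value are placeholders, and it assumes kernel headers new
enough (4.5+) to provide struct fsxattr in <linux/fs.h>; older systems expose
the same structure as XFS_IOC_FSSETXATTR in the XFS headers.

/*
 * Sketch (assumption, not from the thread) of setting an XFS extent size
 * hint, e.g. the 32MB hint mentioned above, via FS_IOC_FSSETXATTR.  The
 * path and size are placeholders; for a regular file the hint has to be set
 * before the file has any data extents, or the ioctl fails with EINVAL.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   /* struct fsxattr, FS_IOC_FSGETXATTR/FS_IOC_FSSETXATTR */

static int set_extsize_hint(const char *path, unsigned int hint_bytes)
{
        struct fsxattr fsx;
        int fd = open(path, O_RDONLY);

        if (fd < 0) {
                perror("open");
                return -1;
        }
        if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0) {
                perror("FS_IOC_FSGETXATTR");
                close(fd);
                return -1;
        }

        fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;     /* honor fsx_extsize */
        fsx.fsx_extsize = hint_bytes;           /* hint, in bytes */
        /* on a directory, FS_XFLAG_EXTSZINHERIT would make new files
         * inherit the hint instead */

        if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0) {
                perror("FS_IOC_FSSETXATTR");
                close(fd);
                return -1;
        }
        close(fd);
        return 0;
}

int main(void)
{
        /* placeholder path, 32MB hint */
        return set_extsize_hint("/var/lib/scylla/testfile",
                                32u * 1024 * 1024) ? 1 : 0;
}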
* Re: xfs_extent_busy_flush vs. aio 2018-01-23 16:22 ` Avi Kivity @ 2018-01-23 16:47 ` Brian Foster 2018-01-23 17:00 ` Avi Kivity 0 siblings, 1 reply; 20+ messages in thread From: Brian Foster @ 2018-01-23 16:47 UTC (permalink / raw) To: Avi Kivity; +Cc: linux-xfs On Tue, Jan 23, 2018 at 06:22:07PM +0200, Avi Kivity wrote: > > > On 01/23/2018 06:11 PM, Brian Foster wrote: > > On Tue, Jan 23, 2018 at 05:45:39PM +0200, Avi Kivity wrote: > > > > > > On 01/23/2018 05:28 PM, Brian Foster wrote: > > > > On Tue, Jan 23, 2018 at 04:57:03PM +0200, Avi Kivity wrote: > > > > > I'm seeing the equivalent[*] of xfs_extent_busy_flush() sleeping in my > > > > > beautiful io_submit() calls. > > > > > > > > > > > > > > > Questions: > > > > > > > > > > - Is it correct that RWF_NOWAIT will not detect the condition that led to > > > > > the log being forced? > > > > > > > > > > - If so, can it be fixed? > > > > > > > > > > - Can I do something to reduce the odds of this occurring? larger logs, > > > > > more logs, flush more often, resurrect extinct species and sacrifice them to > > > > > the xfs gods? > > > > > > > > > > - Can an xfs developer do something? For example, make it RWF_NOWAIT > > > > > friendly (if the answer to the first question was "correct") > > > > > > > > > So RWF_NOWAIT eventually works its way to IOMAP_NOWAIT, which looks like > > > > it skips any write call that would require allocation in > > > > xfs_file_iomap_begin(). The busy flush should only happen in the block > > > > allocation path, so something is missing here. Do you have a backtrace > > > > for the log force you're seeing? > > > > > > > > > > > Here's a trace. It's from a kernel that lacks RWF_NOWAIT. > > > > > Oh, so the case below is roughly how I would have expected to hit the > > flush/wait without RWF_NOWAIT. The latter flag should prevent this, to > > answer your first question. > > Thanks, that's very encouraging. We are exploring recommending upstream-ish > kernels to users and customers, given their relative stability these days > and aio-related improvements (not to mention the shame of having to admit to > running an old kernel when reporting a problem to an upstream list). > > > > > For the follow up question, I think this should only occur when the fs > > is fairly low on free space. Is that the case here? > > No: > > /dev/md0 3.0T 1.2T 1.8T 40% /var/lib/scylla > > > > I'm not sure there's > > a specific metric, fwiw, but it's just a matter of attempting an (user > > data) allocation that only finds busy extents in the free space btrees > > and thus has to the force the log to satisfy the allocation. > > What does "busy" mean here? recently freed so we want to force the log to > make sure the extent isn't doubly-allocated? (wild guess) > Recently freed and the transaction that freed the blocks has not yet been persisted to the on-disk log. A subsequent attempt to allocate those blocks for user data waits for the transaction to commit to disk to ensure that the block is not written before the filesystem has persisted the fact that it has been freed. Otherwise, my understanding is that if the blocks are written to and the filesystem crashes before the previous free was persisted, we'd have allowed an overwrite of a still-used metadata block. > > I suppose > > running with more free space available would avoid this. I think running > > with less in-core log space could indirectly reduce extent busy time, > > but that may also have other performance ramifications and so is > > probably not a great idea. 
> > At 60%, I hope low free space is not a problem. > Yeah, that seems strange. I wouldn't expect busy extents to be a problem with that much free space. > btw, I'm also seeing 10ms+ periods of high CPU utilization: > > 0xffffffff816ab97a : _cond_resched+0x3a/0x50 [kernel] > 0xffffffff811e1495 : kmem_cache_alloc+0x35/0x1e0 [kernel] > 0xffffffffc00d8477 : kmem_zone_alloc+0x97/0x130 [xfs] > 0xffffffffc00deae2 : xfs_buf_item_init+0x42/0x190 [xfs] > 0xffffffffc00e89c3 : _xfs_trans_bjoin+0x23/0x60 [xfs] > 0xffffffffc00e8f17 : xfs_trans_read_buf_map+0x247/0x400 [xfs] > 0xffffffffc008f248 : xfs_btree_read_buf_block.constprop.29+0x78/0xc0 [xfs] > 0xffffffffc009221e : xfs_btree_increment+0x21e/0x350 [xfs] > 0xffffffffc00796a8 : xfs_alloc_ag_vextent_near+0x368/0xab0 [xfs] > 0xffffffffc0079efd : xfs_alloc_ag_vextent+0x10d/0x150 [xfs] > 0xffffffffc007abc6 : xfs_alloc_vextent+0x446/0x5f0 [xfs] > 0xffffffffc008b123 : xfs_bmap_btalloc+0x3f3/0x780 [xfs] > 0xffffffffc008b4be : xfs_bmap_alloc+0xe/0x10 [xfs] > > Is it normal for xfs to spend 10ms+ of CPU time to allocate an extent? > Should I be increasing my extent hint (currently at 32MB)? > I haven't done enough performance testing to have an intuition on the typical CPU time required to allocate blocks. Somebody else may be able to chime in on that. I suppose it could depend on the level of free space fragmentation, which can be observed via 'xfs_db -c "freesp -s" <dev>', whether I/Os or btree splits/joins were required, etc. FWIW, the above stack looks like it's stuck waiting on a memory allocation for a btree buffer xfs_buf_log_item, which is an internal data structure used to track metadata objects through the log subsystem. We have a kmem zone for such objects because they are allocated/freed frequently, but perhaps the zone had to grow..? We do pass KM_SLEEP there.. Brian > > > > Brian > > > > > 0xffffffff816ab231 : __schedule+0x531/0x9b0 [kernel] > > > 0xffffffff816ab6d9 : schedule+0x29/0x70 [kernel] > > > 0xffffffff816a90e9 : schedule_timeout+0x239/0x2c0 [kernel] > > > 0xffffffff816aba8d : wait_for_completion+0xfd/0x140 [kernel] > > > 0xffffffff810ab41d : flush_work+0xfd/0x190 [kernel] > > > 0xffffffffc00ddb3a : xlog_cil_force_lsn+0x8a/0x210 [xfs] > > > 0xffffffffc00dbbf5 : _xfs_log_force+0x85/0x2c0 [xfs] > > > 0xffffffffc00dbe5c : xfs_log_force+0x2c/0x70 [xfs] > > > 0xffffffffc0078f60 : xfs_alloc_ag_vextent_size+0x250/0x630 [xfs] > > > 0xffffffffc0079ed5 : xfs_alloc_ag_vextent+0xe5/0x150 [xfs] > > > 0xffffffffc007abc6 : xfs_alloc_vextent+0x446/0x5f0 [xfs] > > > 0xffffffffc008b123 : xfs_bmap_btalloc+0x3f3/0x780 [xfs] > > > 0xffffffffc008b4be : xfs_bmap_alloc+0xe/0x10 [xfs] > > > 0xffffffffc008bef9 : xfs_bmapi_write+0x499/0xab0 [xfs] > > > 0xffffffffc00c6ec8 : xfs_iomap_write_direct+0x1b8/0x390 [xfs] > > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: xfs_extent_busy_flush vs. aio 2018-01-23 16:47 ` Brian Foster @ 2018-01-23 17:00 ` Avi Kivity 2018-01-23 17:39 ` Brian Foster 0 siblings, 1 reply; 20+ messages in thread From: Avi Kivity @ 2018-01-23 17:00 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On 01/23/2018 06:47 PM, Brian Foster wrote: > On Tue, Jan 23, 2018 at 06:22:07PM +0200, Avi Kivity wrote: >> >> On 01/23/2018 06:11 PM, Brian Foster wrote: >>> On Tue, Jan 23, 2018 at 05:45:39PM +0200, Avi Kivity wrote: >>>> On 01/23/2018 05:28 PM, Brian Foster wrote: >>>>> On Tue, Jan 23, 2018 at 04:57:03PM +0200, Avi Kivity wrote: >>>>>> I'm seeing the equivalent[*] of xfs_extent_busy_flush() sleeping in my >>>>>> beautiful io_submit() calls. >>>>>> >>>>>> >>>>>> Questions: >>>>>> >>>>>> - Is it correct that RWF_NOWAIT will not detect the condition that led to >>>>>> the log being forced? >>>>>> >>>>>> - If so, can it be fixed? >>>>>> >>>>>> - Can I do something to reduce the odds of this occurring? larger logs, >>>>>> more logs, flush more often, resurrect extinct species and sacrifice them to >>>>>> the xfs gods? >>>>>> >>>>>> - Can an xfs developer do something? For example, make it RWF_NOWAIT >>>>>> friendly (if the answer to the first question was "correct") >>>>>> >>>>> So RWF_NOWAIT eventually works its way to IOMAP_NOWAIT, which looks like >>>>> it skips any write call that would require allocation in >>>>> xfs_file_iomap_begin(). The busy flush should only happen in the block >>>>> allocation path, so something is missing here. Do you have a backtrace >>>>> for the log force you're seeing? >>>>> >>>>> >>>> Here's a trace. It's from a kernel that lacks RWF_NOWAIT. >>>> >>> Oh, so the case below is roughly how I would have expected to hit the >>> flush/wait without RWF_NOWAIT. The latter flag should prevent this, to >>> answer your first question. >> Thanks, that's very encouraging. We are exploring recommending upstream-ish >> kernels to users and customers, given their relative stability these days >> and aio-related improvements (not to mention the shame of having to admit to >> running an old kernel when reporting a problem to an upstream list). >> >>> For the follow up question, I think this should only occur when the fs >>> is fairly low on free space. Is that the case here? >> No: >> >> /dev/md0 3.0T 1.2T 1.8T 40% /var/lib/scylla >> >> >>> I'm not sure there's >>> a specific metric, fwiw, but it's just a matter of attempting an (user >>> data) allocation that only finds busy extents in the free space btrees >>> and thus has to the force the log to satisfy the allocation. >> What does "busy" mean here? recently freed so we want to force the log to >> make sure the extent isn't doubly-allocated? (wild guess) >> > Recently freed and the transaction that freed the blocks has not yet > been persisted to the on-disk log. A subsequent attempt to allocate > those blocks for user data waits for the transaction to commit to disk > to ensure that the block is not written before the filesystem has > persisted the fact that it has been freed. Otherwise, my understanding > is that if the blocks are written to and the filesystem crashes before > the previous free was persisted, we'd have allowed an overwrite of a > still-used metadata block. Understood, thanks. > >>> I suppose >>> running with more free space available would avoid this. I think running >>> with less in-core log space could indirectly reduce extent busy time, >>> but that may also have other performance ramifications and so is >>> probably not a great idea. 
>> At 60%, I hope low free space is not a problem. >> > Yeah, that seems strange. I wouldn't expect busy extents to be a problem > with that much free space. The workload creates new files, appends to them, lets them stew for a while, then deletes them. Maybe something is preventing xfs from seeing non-busy extents? The disk is writing at 300-600MB/s for several days, so quite some churn. > >> btw, I'm also seeing 10ms+ periods of high CPU utilization: >> >> 0xffffffff816ab97a : _cond_resched+0x3a/0x50 [kernel] >> 0xffffffff811e1495 : kmem_cache_alloc+0x35/0x1e0 [kernel] >> 0xffffffffc00d8477 : kmem_zone_alloc+0x97/0x130 [xfs] >> 0xffffffffc00deae2 : xfs_buf_item_init+0x42/0x190 [xfs] >> 0xffffffffc00e89c3 : _xfs_trans_bjoin+0x23/0x60 [xfs] >> 0xffffffffc00e8f17 : xfs_trans_read_buf_map+0x247/0x400 [xfs] >> 0xffffffffc008f248 : xfs_btree_read_buf_block.constprop.29+0x78/0xc0 [xfs] >> 0xffffffffc009221e : xfs_btree_increment+0x21e/0x350 [xfs] >> 0xffffffffc00796a8 : xfs_alloc_ag_vextent_near+0x368/0xab0 [xfs] >> 0xffffffffc0079efd : xfs_alloc_ag_vextent+0x10d/0x150 [xfs] >> 0xffffffffc007abc6 : xfs_alloc_vextent+0x446/0x5f0 [xfs] >> 0xffffffffc008b123 : xfs_bmap_btalloc+0x3f3/0x780 [xfs] >> 0xffffffffc008b4be : xfs_bmap_alloc+0xe/0x10 [xfs] >> >> Is it normal for xfs to spend 10ms+ of CPU time to allocate an extent? >> Should I be increasing my extent hint (currently at 32MB)? >> > I haven't done enough performance testing to have an intuition on the > typical CPU time required to allocate blocks. Somebody else may be able > to chime in on that. I suppose it could depend on the level of free > space fragmentation, which can be observed via 'xfs_db -c "freesp -s" > <dev>', whether I/Os or btree splits/joins were required, etc. > > FWIW, the above stack looks like it's stuck waiting on a memory > allocation for a btree buffer xfs_buf_log_item, which is an internal > data structure used to track metadata objects through the log subsystem. > We have a kmem zone for such objects because they are allocated/freed > frequently, but perhaps the zone had to grow..? We do pass KM_SLEEP > there.. It's not really waiting, that's a cond_resched. The scheduler switched away because some other task needed its attention, not because memory was not available. That's understandable since xfs hogged the cpu for 10ms. I will look at xfs_bmap output later, after I renew my friendship with trace-cmd. 
> Brian > >>> Brian >>> >>>> 0xffffffff816ab231 : __schedule+0x531/0x9b0 [kernel] >>>> 0xffffffff816ab6d9 : schedule+0x29/0x70 [kernel] >>>> 0xffffffff816a90e9 : schedule_timeout+0x239/0x2c0 [kernel] >>>> 0xffffffff816aba8d : wait_for_completion+0xfd/0x140 [kernel] >>>> 0xffffffff810ab41d : flush_work+0xfd/0x190 [kernel] >>>> 0xffffffffc00ddb3a : xlog_cil_force_lsn+0x8a/0x210 [xfs] >>>> 0xffffffffc00dbbf5 : _xfs_log_force+0x85/0x2c0 [xfs] >>>> 0xffffffffc00dbe5c : xfs_log_force+0x2c/0x70 [xfs] >>>> 0xffffffffc0078f60 : xfs_alloc_ag_vextent_size+0x250/0x630 [xfs] >>>> 0xffffffffc0079ed5 : xfs_alloc_ag_vextent+0xe5/0x150 [xfs] >>>> 0xffffffffc007abc6 : xfs_alloc_vextent+0x446/0x5f0 [xfs] >>>> 0xffffffffc008b123 : xfs_bmap_btalloc+0x3f3/0x780 [xfs] >>>> 0xffffffffc008b4be : xfs_bmap_alloc+0xe/0x10 [xfs] >>>> 0xffffffffc008bef9 : xfs_bmapi_write+0x499/0xab0 [xfs] >>>> 0xffffffffc00c6ec8 : xfs_iomap_write_direct+0x1b8/0x390 [xfs] >>>> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: xfs_extent_busy_flush vs. aio 2018-01-23 17:00 ` Avi Kivity @ 2018-01-23 17:39 ` Brian Foster 2018-01-25 8:50 ` Avi Kivity 0 siblings, 1 reply; 20+ messages in thread From: Brian Foster @ 2018-01-23 17:39 UTC (permalink / raw) To: Avi Kivity; +Cc: linux-xfs On Tue, Jan 23, 2018 at 07:00:31PM +0200, Avi Kivity wrote: > > > On 01/23/2018 06:47 PM, Brian Foster wrote: > > On Tue, Jan 23, 2018 at 06:22:07PM +0200, Avi Kivity wrote: > > > > > > On 01/23/2018 06:11 PM, Brian Foster wrote: > > > > On Tue, Jan 23, 2018 at 05:45:39PM +0200, Avi Kivity wrote: > > > > > On 01/23/2018 05:28 PM, Brian Foster wrote: > > > > > > On Tue, Jan 23, 2018 at 04:57:03PM +0200, Avi Kivity wrote: > > > > > > > I'm seeing the equivalent[*] of xfs_extent_busy_flush() sleeping in my > > > > > > > beautiful io_submit() calls. > > > > > > > > > > > > > > > > > > > > > Questions: > > > > > > > > > > > > > > - Is it correct that RWF_NOWAIT will not detect the condition that led to > > > > > > > the log being forced? > > > > > > > > > > > > > > - If so, can it be fixed? > > > > > > > > > > > > > > - Can I do something to reduce the odds of this occurring? larger logs, > > > > > > > more logs, flush more often, resurrect extinct species and sacrifice them to > > > > > > > the xfs gods? > > > > > > > > > > > > > > - Can an xfs developer do something? For example, make it RWF_NOWAIT > > > > > > > friendly (if the answer to the first question was "correct") > > > > > > > > > > > > > So RWF_NOWAIT eventually works its way to IOMAP_NOWAIT, which looks like > > > > > > it skips any write call that would require allocation in > > > > > > xfs_file_iomap_begin(). The busy flush should only happen in the block > > > > > > allocation path, so something is missing here. Do you have a backtrace > > > > > > for the log force you're seeing? > > > > > > > > > > > > > > > > > Here's a trace. It's from a kernel that lacks RWF_NOWAIT. > > > > > > > > > Oh, so the case below is roughly how I would have expected to hit the > > > > flush/wait without RWF_NOWAIT. The latter flag should prevent this, to > > > > answer your first question. > > > Thanks, that's very encouraging. We are exploring recommending upstream-ish > > > kernels to users and customers, given their relative stability these days > > > and aio-related improvements (not to mention the shame of having to admit to > > > running an old kernel when reporting a problem to an upstream list). > > > > > > > For the follow up question, I think this should only occur when the fs > > > > is fairly low on free space. Is that the case here? > > > No: > > > > > > /dev/md0 3.0T 1.2T 1.8T 40% /var/lib/scylla > > > > > > > > > > I'm not sure there's > > > > a specific metric, fwiw, but it's just a matter of attempting an (user > > > > data) allocation that only finds busy extents in the free space btrees > > > > and thus has to the force the log to satisfy the allocation. > > > What does "busy" mean here? recently freed so we want to force the log to > > > make sure the extent isn't doubly-allocated? (wild guess) > > > > > Recently freed and the transaction that freed the blocks has not yet > > been persisted to the on-disk log. A subsequent attempt to allocate > > those blocks for user data waits for the transaction to commit to disk > > to ensure that the block is not written before the filesystem has > > persisted the fact that it has been freed. 
Otherwise, my understanding > > is that if the blocks are written to and the filesystem crashes before > > the previous free was persisted, we'd have allowed an overwrite of a > > still-used metadata block. > > Understood, thanks. > > > > > > > I suppose > > > > running with more free space available would avoid this. I think running > > > > with less in-core log space could indirectly reduce extent busy time, > > > > but that may also have other performance ramifications and so is > > > > probably not a great idea. > > > At 60%, I hope low free space is not a problem. > > > > > Yeah, that seems strange. I wouldn't expect busy extents to be a problem > > with that much free space. > > The workload creates new files, appends to them, lets them stew for a while, > then deletes them. Maybe something is preventing xfs from seeing non-busy > extents? > Yeah, could be.. perhaps the issue is that despite the large amount of total free space, the free space is too fragmented to satisfy a particular allocation request..? > The disk is writing at 300-600MB/s for several days, so quite some churn. > > > > > > btw, I'm also seeing 10ms+ periods of high CPU utilization: > > > > > > 0xffffffff816ab97a : _cond_resched+0x3a/0x50 [kernel] > > > 0xffffffff811e1495 : kmem_cache_alloc+0x35/0x1e0 [kernel] > > > 0xffffffffc00d8477 : kmem_zone_alloc+0x97/0x130 [xfs] > > > 0xffffffffc00deae2 : xfs_buf_item_init+0x42/0x190 [xfs] > > > 0xffffffffc00e89c3 : _xfs_trans_bjoin+0x23/0x60 [xfs] > > > 0xffffffffc00e8f17 : xfs_trans_read_buf_map+0x247/0x400 [xfs] > > > 0xffffffffc008f248 : xfs_btree_read_buf_block.constprop.29+0x78/0xc0 [xfs] > > > 0xffffffffc009221e : xfs_btree_increment+0x21e/0x350 [xfs] > > > 0xffffffffc00796a8 : xfs_alloc_ag_vextent_near+0x368/0xab0 [xfs] > > > 0xffffffffc0079efd : xfs_alloc_ag_vextent+0x10d/0x150 [xfs] > > > 0xffffffffc007abc6 : xfs_alloc_vextent+0x446/0x5f0 [xfs] > > > 0xffffffffc008b123 : xfs_bmap_btalloc+0x3f3/0x780 [xfs] > > > 0xffffffffc008b4be : xfs_bmap_alloc+0xe/0x10 [xfs] > > > > > > Is it normal for xfs to spend 10ms+ of CPU time to allocate an extent? > > > Should I be increasing my extent hint (currently at 32MB)? > > > > > I haven't done enough performance testing to have an intuition on the > > typical CPU time required to allocate blocks. Somebody else may be able > > to chime in on that. I suppose it could depend on the level of free > > space fragmentation, which can be observed via 'xfs_db -c "freesp -s" > > <dev>', whether I/Os or btree splits/joins were required, etc. > > > > FWIW, the above stack looks like it's stuck waiting on a memory > > allocation for a btree buffer xfs_buf_log_item, which is an internal > > data structure used to track metadata objects through the log subsystem. > > We have a kmem zone for such objects because they are allocated/freed > > frequently, but perhaps the zone had to grow..? We do pass KM_SLEEP > > there.. > > It's not really waiting, that's a cond_resched. The scheduler switched away > because some other task needed its attention, not because memory was not > available. That's understandable since xfs hogged the cpu for 10ms. > Ah, I misread it as you were blocked in that callchain. I suppose ftrace or something could help annotate the time spent in the allocation path. Free space fragmentation could potentially be a factor here as well, causing the search algorithm(s) to run through a lot of records/blocks to find something usable, for example. 
Brian > I will look at xfs_bmap output later, after I renew my friendship with > trace-cmd. > > > Brian > > > > > > Brian > > > > > > > > > 0xffffffff816ab231 : __schedule+0x531/0x9b0 [kernel] > > > > > 0xffffffff816ab6d9 : schedule+0x29/0x70 [kernel] > > > > > 0xffffffff816a90e9 : schedule_timeout+0x239/0x2c0 [kernel] > > > > > 0xffffffff816aba8d : wait_for_completion+0xfd/0x140 [kernel] > > > > > 0xffffffff810ab41d : flush_work+0xfd/0x190 [kernel] > > > > > 0xffffffffc00ddb3a : xlog_cil_force_lsn+0x8a/0x210 [xfs] > > > > > 0xffffffffc00dbbf5 : _xfs_log_force+0x85/0x2c0 [xfs] > > > > > 0xffffffffc00dbe5c : xfs_log_force+0x2c/0x70 [xfs] > > > > > 0xffffffffc0078f60 : xfs_alloc_ag_vextent_size+0x250/0x630 [xfs] > > > > > 0xffffffffc0079ed5 : xfs_alloc_ag_vextent+0xe5/0x150 [xfs] > > > > > 0xffffffffc007abc6 : xfs_alloc_vextent+0x446/0x5f0 [xfs] > > > > > 0xffffffffc008b123 : xfs_bmap_btalloc+0x3f3/0x780 [xfs] > > > > > 0xffffffffc008b4be : xfs_bmap_alloc+0xe/0x10 [xfs] > > > > > 0xffffffffc008bef9 : xfs_bmapi_write+0x499/0xab0 [xfs] > > > > > 0xffffffffc00c6ec8 : xfs_iomap_write_direct+0x1b8/0x390 [xfs] > > > > > > > > -- > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > the body of a message to majordomo@vger.kernel.org > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: xfs_extent_busy_flush vs. aio 2018-01-23 17:39 ` Brian Foster @ 2018-01-25 8:50 ` Avi Kivity 2018-01-25 13:08 ` Brian Foster 0 siblings, 1 reply; 20+ messages in thread From: Avi Kivity @ 2018-01-25 8:50 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On 01/23/2018 07:39 PM, Brian Foster wrote: > On Tue, Jan 23, 2018 at 07:00:31PM +0200, Avi Kivity wrote: >> >> On 01/23/2018 06:47 PM, Brian Foster wrote: >>> On Tue, Jan 23, 2018 at 06:22:07PM +0200, Avi Kivity wrote: >>>> On 01/23/2018 06:11 PM, Brian Foster wrote: >>>>> On Tue, Jan 23, 2018 at 05:45:39PM +0200, Avi Kivity wrote: >>>>>> On 01/23/2018 05:28 PM, Brian Foster wrote: >>>>>>> On Tue, Jan 23, 2018 at 04:57:03PM +0200, Avi Kivity wrote: >>>>>>>> I'm seeing the equivalent[*] of xfs_extent_busy_flush() sleeping in my >>>>>>>> beautiful io_submit() calls. >>>>>>>> >>>>>>>> >>>>>>>> Questions: >>>>>>>> >>>>>>>> - Is it correct that RWF_NOWAIT will not detect the condition that led to >>>>>>>> the log being forced? >>>>>>>> >>>>>>>> - If so, can it be fixed? >>>>>>>> >>>>>>>> - Can I do something to reduce the odds of this occurring? larger logs, >>>>>>>> more logs, flush more often, resurrect extinct species and sacrifice them to >>>>>>>> the xfs gods? >>>>>>>> >>>>>>>> - Can an xfs developer do something? For example, make it RWF_NOWAIT >>>>>>>> friendly (if the answer to the first question was "correct") >>>>>>>> >>>>>>> So RWF_NOWAIT eventually works its way to IOMAP_NOWAIT, which looks like >>>>>>> it skips any write call that would require allocation in >>>>>>> xfs_file_iomap_begin(). The busy flush should only happen in the block >>>>>>> allocation path, so something is missing here. Do you have a backtrace >>>>>>> for the log force you're seeing? >>>>>>> >>>>>>> >>>>>> Here's a trace. It's from a kernel that lacks RWF_NOWAIT. >>>>>> >>>>> Oh, so the case below is roughly how I would have expected to hit the >>>>> flush/wait without RWF_NOWAIT. The latter flag should prevent this, to >>>>> answer your first question. >>>> Thanks, that's very encouraging. We are exploring recommending upstream-ish >>>> kernels to users and customers, given their relative stability these days >>>> and aio-related improvements (not to mention the shame of having to admit to >>>> running an old kernel when reporting a problem to an upstream list). >>>> >>>>> For the follow up question, I think this should only occur when the fs >>>>> is fairly low on free space. Is that the case here? >>>> No: >>>> >>>> /dev/md0 3.0T 1.2T 1.8T 40% /var/lib/scylla >>>> >>>> >>>>> I'm not sure there's >>>>> a specific metric, fwiw, but it's just a matter of attempting an (user >>>>> data) allocation that only finds busy extents in the free space btrees >>>>> and thus has to the force the log to satisfy the allocation. >>>> What does "busy" mean here? recently freed so we want to force the log to >>>> make sure the extent isn't doubly-allocated? (wild guess) >>>> >>> Recently freed and the transaction that freed the blocks has not yet >>> been persisted to the on-disk log. A subsequent attempt to allocate >>> those blocks for user data waits for the transaction to commit to disk >>> to ensure that the block is not written before the filesystem has >>> persisted the fact that it has been freed. Otherwise, my understanding >>> is that if the blocks are written to and the filesystem crashes before >>> the previous free was persisted, we'd have allowed an overwrite of a >>> still-used metadata block. >> Understood, thanks. 
>> >>>>> I suppose >>>>> running with more free space available would avoid this. I think running >>>>> with less in-core log space could indirectly reduce extent busy time, >>>>> but that may also have other performance ramifications and so is >>>>> probably not a great idea. >>>> At 60%, I hope low free space is not a problem. >>>> >>> Yeah, that seems strange. I wouldn't expect busy extents to be a problem >>> with that much free space. >> The workload creates new files, appends to them, lets them stew for a while, >> then deletes them. Maybe something is preventing xfs from seeing non-busy >> extents? >> > Yeah, could be.. perhaps the issue is that despite the large amount of > total free space, the free space is too fragmented to satisfy a > particular allocation request..? from to extents blocks pct 1 1 2702 2702 0.00 2 3 690 1547 0.00 4 7 115 568 0.00 8 15 60 634 0.00 16 31 63 1457 0.00 32 63 102 4751 0.00 64 127 7940 895365 0.19 128 255 49680 12422100 2.67 256 511 1025 417078 0.09 512 1023 4170 3660771 0.79 1024 2047 2168 3503054 0.75 2048 4095 2567 7729442 1.66 4096 8191 8688 59394413 12.76 8192 16383 310 3100186 0.67 16384 32767 112 2339935 0.50 32768 65535 35 1381122 0.30 65536 131071 8 651391 0.14 131072 262143 2 344196 0.07 524288 1048575 4 2909925 0.62 1048576 2097151 3 3550680 0.76 4194304 8388607 10 82497658 17.72 8388608 16777215 10 158022653 33.94 16777216 24567552 5 122778062 26.37 total free extents 80469 total free blocks 465609690 average free extent size 5786.2 Looks like plenty of free large extents, with most of the free space completely, unfragmented. Lots of 16MB-32MB extents, too. 32MB is our allocation hint size, could have something to do with it. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: xfs_extent_busy_flush vs. aio 2018-01-25 8:50 ` Avi Kivity @ 2018-01-25 13:08 ` Brian Foster 2018-01-29 9:40 ` Avi Kivity 2018-02-02 9:48 ` Christoph Hellwig 0 siblings, 2 replies; 20+ messages in thread From: Brian Foster @ 2018-01-25 13:08 UTC (permalink / raw) To: Avi Kivity; +Cc: linux-xfs On Thu, Jan 25, 2018 at 10:50:40AM +0200, Avi Kivity wrote: > On 01/23/2018 07:39 PM, Brian Foster wrote: > > On Tue, Jan 23, 2018 at 07:00:31PM +0200, Avi Kivity wrote: > > > > > > On 01/23/2018 06:47 PM, Brian Foster wrote: > > > > On Tue, Jan 23, 2018 at 06:22:07PM +0200, Avi Kivity wrote: > > > > > On 01/23/2018 06:11 PM, Brian Foster wrote: > > > > > > On Tue, Jan 23, 2018 at 05:45:39PM +0200, Avi Kivity wrote: > > > > > > > On 01/23/2018 05:28 PM, Brian Foster wrote: > > > > > > > > On Tue, Jan 23, 2018 at 04:57:03PM +0200, Avi Kivity wrote: > > > > > > > > > I'm seeing the equivalent[*] of xfs_extent_busy_flush() sleeping in my > > > > > > > > > beautiful io_submit() calls. > > > > > > > > > > > > > > > > > > > > > > > > > > > Questions: > > > > > > > > > > > > > > > > > > - Is it correct that RWF_NOWAIT will not detect the condition that led to > > > > > > > > > the log being forced? > > > > > > > > > > > > > > > > > > - If so, can it be fixed? > > > > > > > > > > > > > > > > > > - Can I do something to reduce the odds of this occurring? larger logs, > > > > > > > > > more logs, flush more often, resurrect extinct species and sacrifice them to > > > > > > > > > the xfs gods? > > > > > > > > > > > > > > > > > > - Can an xfs developer do something? For example, make it RWF_NOWAIT > > > > > > > > > friendly (if the answer to the first question was "correct") > > > > > > > > > > > > > > > > > So RWF_NOWAIT eventually works its way to IOMAP_NOWAIT, which looks like > > > > > > > > it skips any write call that would require allocation in > > > > > > > > xfs_file_iomap_begin(). The busy flush should only happen in the block > > > > > > > > allocation path, so something is missing here. Do you have a backtrace > > > > > > > > for the log force you're seeing? > > > > > > > > > > > > > > > > > > > > > > > Here's a trace. It's from a kernel that lacks RWF_NOWAIT. > > > > > > > > > > > > > Oh, so the case below is roughly how I would have expected to hit the > > > > > > flush/wait without RWF_NOWAIT. The latter flag should prevent this, to > > > > > > answer your first question. > > > > > Thanks, that's very encouraging. We are exploring recommending upstream-ish > > > > > kernels to users and customers, given their relative stability these days > > > > > and aio-related improvements (not to mention the shame of having to admit to > > > > > running an old kernel when reporting a problem to an upstream list). > > > > > > > > > > > For the follow up question, I think this should only occur when the fs > > > > > > is fairly low on free space. Is that the case here? > > > > > No: > > > > > > > > > > /dev/md0 3.0T 1.2T 1.8T 40% /var/lib/scylla > > > > > > > > > > > > > > > > I'm not sure there's > > > > > > a specific metric, fwiw, but it's just a matter of attempting an (user > > > > > > data) allocation that only finds busy extents in the free space btrees > > > > > > and thus has to the force the log to satisfy the allocation. > > > > > What does "busy" mean here? recently freed so we want to force the log to > > > > > make sure the extent isn't doubly-allocated? 
(wild guess) > > > > > > > > > Recently freed and the transaction that freed the blocks has not yet > > > > been persisted to the on-disk log. A subsequent attempt to allocate > > > > those blocks for user data waits for the transaction to commit to disk > > > > to ensure that the block is not written before the filesystem has > > > > persisted the fact that it has been freed. Otherwise, my understanding > > > > is that if the blocks are written to and the filesystem crashes before > > > > the previous free was persisted, we'd have allowed an overwrite of a > > > > still-used metadata block. > > > Understood, thanks. > > > > > > > > > I suppose > > > > > > running with more free space available would avoid this. I think running > > > > > > with less in-core log space could indirectly reduce extent busy time, > > > > > > but that may also have other performance ramifications and so is > > > > > > probably not a great idea. > > > > > At 60%, I hope low free space is not a problem. > > > > > > > > > Yeah, that seems strange. I wouldn't expect busy extents to be a problem > > > > with that much free space. > > > The workload creates new files, appends to them, lets them stew for a while, > > > then deletes them. Maybe something is preventing xfs from seeing non-busy > > > extents? > > > > > Yeah, could be.. perhaps the issue is that despite the large amount of > > total free space, the free space is too fragmented to satisfy a > > particular allocation request..? > > from to extents blocks pct > 1 1 2702 2702 0.00 > 2 3 690 1547 0.00 > 4 7 115 568 0.00 > 8 15 60 634 0.00 > 16 31 63 1457 0.00 > 32 63 102 4751 0.00 > 64 127 7940 895365 0.19 > 128 255 49680 12422100 2.67 > 256 511 1025 417078 0.09 > 512 1023 4170 3660771 0.79 > 1024 2047 2168 3503054 0.75 > 2048 4095 2567 7729442 1.66 > 4096 8191 8688 59394413 12.76 > 8192 16383 310 3100186 0.67 > 16384 32767 112 2339935 0.50 > 32768 65535 35 1381122 0.30 > 65536 131071 8 651391 0.14 > 131072 262143 2 344196 0.07 > 524288 1048575 4 2909925 0.62 > 1048576 2097151 3 3550680 0.76 > 4194304 8388607 10 82497658 17.72 > 8388608 16777215 10 158022653 33.94 > 16777216 24567552 5 122778062 26.37 > total free extents 80469 > total free blocks 465609690 > average free extent size 5786.2 > > Looks like plenty of free large extents, with most of the free space > completely, unfragmented. > Indeed.. > Lots of 16MB-32MB extents, too. 32MB is our allocation hint size, could have > something to do with it. > Most likely. Based on this, it's hard to say for certain why you'd be running into allocation latency caused by busy extents. Does this filesystem use the '-o discard' mount option by any chance? I suppose it's possible that this was some kind of transient state, or perhaps only a small set of AGs are affected, etc. It's also possible this may have been improved in more recent kernels by Christoph's rework of some of that code. In any event, this would probably require a bit more runtime analysis to figure out where/why allocations are getting stalled as such. I'd probably start by looking at the xfs_extent_busy_* tracepoints (also note that if there's potentially something to be improved on here, it's more useful to do so against current upstream). Or you could just move to something that supports RWF_NOWAIT.. 
;) Brian > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: xfs_extent_busy_flush vs. aio 2018-01-25 13:08 ` Brian Foster @ 2018-01-29 9:40 ` Avi Kivity 2018-01-29 11:35 ` Dave Chinner 2018-02-02 9:48 ` Christoph Hellwig 1 sibling, 1 reply; 20+ messages in thread From: Avi Kivity @ 2018-01-29 9:40 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On 01/25/2018 03:08 PM, Brian Foster wrote: > On Thu, Jan 25, 2018 at 10:50:40AM +0200, Avi Kivity wrote: >> On 01/23/2018 07:39 PM, Brian Foster wrote: >>> On Tue, Jan 23, 2018 at 07:00:31PM +0200, Avi Kivity wrote: >>>> On 01/23/2018 06:47 PM, Brian Foster wrote: >>>>> On Tue, Jan 23, 2018 at 06:22:07PM +0200, Avi Kivity wrote: >>>>>> On 01/23/2018 06:11 PM, Brian Foster wrote: >>>>>>> On Tue, Jan 23, 2018 at 05:45:39PM +0200, Avi Kivity wrote: >>>>>>>> On 01/23/2018 05:28 PM, Brian Foster wrote: >>>>>>>>> On Tue, Jan 23, 2018 at 04:57:03PM +0200, Avi Kivity wrote: >>>>>>>>>> I'm seeing the equivalent[*] of xfs_extent_busy_flush() sleeping in my >>>>>>>>>> beautiful io_submit() calls. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Questions: >>>>>>>>>> >>>>>>>>>> - Is it correct that RWF_NOWAIT will not detect the condition that led to >>>>>>>>>> the log being forced? >>>>>>>>>> >>>>>>>>>> - If so, can it be fixed? >>>>>>>>>> >>>>>>>>>> - Can I do something to reduce the odds of this occurring? larger logs, >>>>>>>>>> more logs, flush more often, resurrect extinct species and sacrifice them to >>>>>>>>>> the xfs gods? >>>>>>>>>> >>>>>>>>>> - Can an xfs developer do something? For example, make it RWF_NOWAIT >>>>>>>>>> friendly (if the answer to the first question was "correct") >>>>>>>>>> >>>>>>>>> So RWF_NOWAIT eventually works its way to IOMAP_NOWAIT, which looks like >>>>>>>>> it skips any write call that would require allocation in >>>>>>>>> xfs_file_iomap_begin(). The busy flush should only happen in the block >>>>>>>>> allocation path, so something is missing here. Do you have a backtrace >>>>>>>>> for the log force you're seeing? >>>>>>>>> >>>>>>>>> >>>>>>>> Here's a trace. It's from a kernel that lacks RWF_NOWAIT. >>>>>>>> >>>>>>> Oh, so the case below is roughly how I would have expected to hit the >>>>>>> flush/wait without RWF_NOWAIT. The latter flag should prevent this, to >>>>>>> answer your first question. >>>>>> Thanks, that's very encouraging. We are exploring recommending upstream-ish >>>>>> kernels to users and customers, given their relative stability these days >>>>>> and aio-related improvements (not to mention the shame of having to admit to >>>>>> running an old kernel when reporting a problem to an upstream list). >>>>>> >>>>>>> For the follow up question, I think this should only occur when the fs >>>>>>> is fairly low on free space. Is that the case here? >>>>>> No: >>>>>> >>>>>> /dev/md0 3.0T 1.2T 1.8T 40% /var/lib/scylla >>>>>> >>>>>> >>>>>>> I'm not sure there's >>>>>>> a specific metric, fwiw, but it's just a matter of attempting an (user >>>>>>> data) allocation that only finds busy extents in the free space btrees >>>>>>> and thus has to the force the log to satisfy the allocation. >>>>>> What does "busy" mean here? recently freed so we want to force the log to >>>>>> make sure the extent isn't doubly-allocated? (wild guess) >>>>>> >>>>> Recently freed and the transaction that freed the blocks has not yet >>>>> been persisted to the on-disk log. A subsequent attempt to allocate >>>>> those blocks for user data waits for the transaction to commit to disk >>>>> to ensure that the block is not written before the filesystem has >>>>> persisted the fact that it has been freed. 
Otherwise, my understanding >>>>> is that if the blocks are written to and the filesystem crashes before >>>>> the previous free was persisted, we'd have allowed an overwrite of a >>>>> still-used metadata block. >>>> Understood, thanks. >>>> >>>>>>> I suppose >>>>>>> running with more free space available would avoid this. I think running >>>>>>> with less in-core log space could indirectly reduce extent busy time, >>>>>>> but that may also have other performance ramifications and so is >>>>>>> probably not a great idea. >>>>>> At 60%, I hope low free space is not a problem. >>>>>> >>>>> Yeah, that seems strange. I wouldn't expect busy extents to be a problem >>>>> with that much free space. >>>> The workload creates new files, appends to them, lets them stew for a while, >>>> then deletes them. Maybe something is preventing xfs from seeing non-busy >>>> extents? >>>> >>> Yeah, could be.. perhaps the issue is that despite the large amount of >>> total free space, the free space is too fragmented to satisfy a >>> particular allocation request..? >> from to extents blocks pct >> 1 1 2702 2702 0.00 >> 2 3 690 1547 0.00 >> 4 7 115 568 0.00 >> 8 15 60 634 0.00 >> 16 31 63 1457 0.00 >> 32 63 102 4751 0.00 >> 64 127 7940 895365 0.19 >> 128 255 49680 12422100 2.67 >> 256 511 1025 417078 0.09 >> 512 1023 4170 3660771 0.79 >> 1024 2047 2168 3503054 0.75 >> 2048 4095 2567 7729442 1.66 >> 4096 8191 8688 59394413 12.76 >> 8192 16383 310 3100186 0.67 >> 16384 32767 112 2339935 0.50 >> 32768 65535 35 1381122 0.30 >> 65536 131071 8 651391 0.14 >> 131072 262143 2 344196 0.07 >> 524288 1048575 4 2909925 0.62 >> 1048576 2097151 3 3550680 0.76 >> 4194304 8388607 10 82497658 17.72 >> 8388608 16777215 10 158022653 33.94 >> 16777216 24567552 5 122778062 26.37 >> total free extents 80469 >> total free blocks 465609690 >> average free extent size 5786.2 >> >> Looks like plenty of free large extents, with most of the free space >> completely, unfragmented. >> > Indeed.. > >> Lots of 16MB-32MB extents, too. 32MB is our allocation hint size, could have >> something to do with it. >> > Most likely. Based on this, it's hard to say for certain why you'd be > running into allocation latency caused by busy extents. Does this > filesystem use the '-o discard' mount option by any chance? No. We'd love to but have had bad experience + strong recommendation from this list not to use it. > I suppose it's possible that this was some kind of transient state, or > perhaps only a small set of AGs are affected, etc. It's also possible > this may have been improved in more recent kernels by Christoph's rework > of some of that code. In any event, this would probably require a bit > more runtime analysis to figure out where/why allocations are getting > stalled as such. I'd probably start by looking at the xfs_extent_busy_* > tracepoints (also note that if there's potentially something to be > improved on here, it's more useful to do so against current upstream). > > Or you could just move to something that supports RWF_NOWAIT.. ;) > NOWAIT only helps with blocking, not high CPU usage. Even if we moved it to another thread, it would still take the same time (but granted, our important thread could still do work, sharing the core with the one doing the allocation). Regardless, we are recommending to our users and customers to move to NOWAIT kernels, but of course some are justifiable cautious. Upstream kernels are now quite stable, so enterprise distribution kernels don't provide the same value they used to. 
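The "moved it to another thread" fallback alluded to above is roughly the
following pattern: try the write with RWF_NOWAIT on the latency-sensitive
thread, and when the kernel reports EAGAIN, hand the identical write to a
helper thread that issues it without the flag, so any allocation or log force
happens there instead. This is a minimal sketch under stated assumptions:
pwritev2() from a recent glibc, a file opened with O_DIRECT, placeholder path
and sizes, and a single detached thread standing in for a real worker pool.

/*
 * Sketch (assumption, not code from the thread) of punting a write that
 * RWF_NOWAIT refuses to a helper thread that performs it without the flag.
 */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_NOWAIT
#define RWF_NOWAIT 0x00000008           /* kernel >= 4.13 */
#endif

struct punted_write {                   /* one deferred write for the helper */
        int fd;
        struct iovec iov;
        off_t off;
};

static void *blocking_writer(void *arg)
{
        struct punted_write *p = arg;

        /* plain blocking write: may allocate blocks or force the log, but
         * that now happens off the latency-sensitive thread */
        if (pwritev2(p->fd, &p->iov, 1, p->off, 0) < 0)
                perror("pwritev2 (helper)");
        free(p);
        return NULL;
}

/* Called from the latency-sensitive thread.  The buffer must stay valid
 * until the helper finishes; a detached thread stands in for a worker pool. */
static int submit_write(int fd, void *buf, size_t len, off_t off)
{
        struct iovec iov = { .iov_base = buf, .iov_len = len };
        ssize_t n = pwritev2(fd, &iov, 1, off, RWF_NOWAIT);

        if (n >= 0)
                return 0;               /* completed without blocking */
        if (errno != EAGAIN)
                return -1;              /* real error */

        /* the write would block (e.g. needs allocation): punt it */
        struct punted_write *p = malloc(sizeof(*p));
        pthread_t tid;

        if (!p)
                return -1;
        p->fd = fd;
        p->iov = iov;
        p->off = off;
        if (pthread_create(&tid, NULL, blocking_writer, p)) {
                free(p);
                return -1;
        }
        pthread_detach(tid);
        return 0;
}

int main(void)
{
        size_t len = 128 * 1024;        /* placeholder size, O_DIRECT-aligned */
        void *buf;
        int fd;

        fd = open("/var/lib/scylla/testfile",
                  O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0 || posix_memalign(&buf, 4096, len))
                return 1;
        memset(buf, 'x', len);

        if (submit_write(fd, buf, len, 0) < 0)
                perror("submit_write");
        sleep(1);                       /* crude: let a punted write finish */
        close(fd);
        return 0;
}

The total CPU spent in the allocator is of course unchanged; the gain is only
that the submitting thread stays responsive while the helper absorbs the
latency.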
* Re: xfs_extent_busy_flush vs. aio 2018-01-29 9:40 ` Avi Kivity @ 2018-01-29 11:35 ` Dave Chinner 2018-01-29 11:44 ` Avi Kivity 0 siblings, 1 reply; 20+ messages in thread From: Dave Chinner @ 2018-01-29 11:35 UTC (permalink / raw) To: Avi Kivity; +Cc: Brian Foster, linux-xfs On Mon, Jan 29, 2018 at 11:40:27AM +0200, Avi Kivity wrote: > > > On 01/25/2018 03:08 PM, Brian Foster wrote: > > On Thu, Jan 25, 2018 at 10:50:40AM +0200, Avi Kivity wrote: > > > On 01/23/2018 07:39 PM, Brian Foster wrote: > > > > Yeah, could be.. perhaps the issue is that despite the large amount of > > > > total free space, the free space is too fragmented to satisfy a > > > > particular allocation request..? > > > from to extents blocks pct > > > 1 1 2702 2702 0.00 > > > 2 3 690 1547 0.00 > > > 4 7 115 568 0.00 > > > 8 15 60 634 0.00 > > > 16 31 63 1457 0.00 > > > 32 63 102 4751 0.00 > > > 64 127 7940 895365 0.19 > > > 128 255 49680 12422100 2.67 > > > 256 511 1025 417078 0.09 > > > 512 1023 4170 3660771 0.79 > > > 1024 2047 2168 3503054 0.75 > > > 2048 4095 2567 7729442 1.66 > > > 4096 8191 8688 59394413 12.76 > > > 8192 16383 310 3100186 0.67 > > > 16384 32767 112 2339935 0.50 > > > 32768 65535 35 1381122 0.30 > > > 65536 131071 8 651391 0.14 > > > 131072 262143 2 344196 0.07 > > > 524288 1048575 4 2909925 0.62 > > > 1048576 2097151 3 3550680 0.76 > > > 4194304 8388607 10 82497658 17.72 > > > 8388608 16777215 10 158022653 33.94 > > > 16777216 24567552 5 122778062 26.37 > > > total free extents 80469 > > > total free blocks 465609690 > > > average free extent size 5786.2 > > > > > > Looks like plenty of free large extents, with most of the free space > > > completely, unfragmented. > > > > > Indeed.. You need to look at each AG, not the overall summary. You could have a suboptimal AG hidden in amongst that (e.g. near ENOSPC) and it's that one AG that is causing all your problems. There's many reasons this can happen, but the most common is the working files in a directory (or subset of directories in the same AG) have a combined space usage of larger than an AG .... > > > Lots of 16MB-32MB extents, too. 32MB is our allocation hint size, could have > > > something to do with it. > > > > > Most likely. Based on this, it's hard to say for certain why you'd be > > running into allocation latency caused by busy extents. One of only two reasons: 1. the AG has a large enough free space, but they are all marked busy (i.e. just been freed), or 2. The extent selected has had the busy range trimmed out of it and it's now less than the minimum extent length requested. Both cases imply that we're allocating extents that have been very recently freed, and that implies there is no other suitable non-busy free space in the AG. Hence the need to look at the per-AG freespace pattern rather than the global summary. Also, it's worth dumping the freespace via xfs_spaceman as it walks the in memory trees rather than the on-disk trees and so is properly coherent with operations in progress. (i.e. xfs_spaceman -c "freesp ..." /mntpt) Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: xfs_extent_busy_flush vs. aio 2018-01-29 11:35 ` Dave Chinner @ 2018-01-29 11:44 ` Avi Kivity 2018-01-29 21:56 ` Dave Chinner 0 siblings, 1 reply; 20+ messages in thread From: Avi Kivity @ 2018-01-29 11:44 UTC (permalink / raw) To: Dave Chinner; +Cc: Brian Foster, linux-xfs On 01/29/2018 01:35 PM, Dave Chinner wrote: > On Mon, Jan 29, 2018 at 11:40:27AM +0200, Avi Kivity wrote: >> >> On 01/25/2018 03:08 PM, Brian Foster wrote: >>> On Thu, Jan 25, 2018 at 10:50:40AM +0200, Avi Kivity wrote: >>>> On 01/23/2018 07:39 PM, Brian Foster wrote: >>>>> Yeah, could be.. perhaps the issue is that despite the large amount of >>>>> total free space, the free space is too fragmented to satisfy a >>>>> particular allocation request..? >>>> from to extents blocks pct >>>> 1 1 2702 2702 0.00 >>>> 2 3 690 1547 0.00 >>>> 4 7 115 568 0.00 >>>> 8 15 60 634 0.00 >>>> 16 31 63 1457 0.00 >>>> 32 63 102 4751 0.00 >>>> 64 127 7940 895365 0.19 >>>> 128 255 49680 12422100 2.67 >>>> 256 511 1025 417078 0.09 >>>> 512 1023 4170 3660771 0.79 >>>> 1024 2047 2168 3503054 0.75 >>>> 2048 4095 2567 7729442 1.66 >>>> 4096 8191 8688 59394413 12.76 >>>> 8192 16383 310 3100186 0.67 >>>> 16384 32767 112 2339935 0.50 >>>> 32768 65535 35 1381122 0.30 >>>> 65536 131071 8 651391 0.14 >>>> 131072 262143 2 344196 0.07 >>>> 524288 1048575 4 2909925 0.62 >>>> 1048576 2097151 3 3550680 0.76 >>>> 4194304 8388607 10 82497658 17.72 >>>> 8388608 16777215 10 158022653 33.94 >>>> 16777216 24567552 5 122778062 26.37 >>>> total free extents 80469 >>>> total free blocks 465609690 >>>> average free extent size 5786.2 >>>> >>>> Looks like plenty of free large extents, with most of the free space >>>> completely, unfragmented. >>>> >>> Indeed.. > You need to look at each AG, not the overall summary. You could have > a suboptimal AG hidden in amongst that (e.g. near ENOSPC) and it's > that one AG that is causing all your problems. > > There's many reasons this can happen, but the most common is the > working files in a directory (or subset of directories in the same > AG) have a combined space usage of larger than an AG .... That's certainly possible, even likely (one huge directory with all of the files). This layout is imposed on us by the compatibility gods. Is there a way to tell XFS to change its policy of on-ag-per-directory? If not, then we'll have to find some workaround to the compatibility problem. > >>>> Lots of 16MB-32MB extents, too. 32MB is our allocation hint size, could have >>>> something to do with it. >>>> >>> Most likely. Based on this, it's hard to say for certain why you'd be >>> running into allocation latency caused by busy extents. > One of only two reasons: > > 1. the AG has a large enough free space, but they are all > marked busy (i.e. just been freed), or > > 2. The extent selected has had the busy range trimmed out of > it and it's now less than the minimum extent length > requested. > > Both cases imply that we're allocating extents that have been very > recently freed, and that implies there is no other suitable non-busy > free space in the AG. Hence the need to look at the per-AG freespace > pattern rather than the global summary. > > Also, it's worth dumping the freespace via xfs_spaceman as it walks > the in memory trees rather than the on-disk trees and so is properly > coherent with operations in progress. (i.e. xfs_spaceman -c "freesp > ..." /mntpt) > That system has since been recycled, but we'll keep it in mind the next time we see it. 
Also, we'll have to fix the one-huge-directory problem for sure, one way or another. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: xfs_extent_busy_flush vs. aio 2018-01-29 11:44 ` Avi Kivity @ 2018-01-29 21:56 ` Dave Chinner 2018-01-30 8:58 ` Avi Kivity 2018-02-06 14:10 ` Avi Kivity 0 siblings, 2 replies; 20+ messages in thread From: Dave Chinner @ 2018-01-29 21:56 UTC (permalink / raw) To: Avi Kivity; +Cc: Brian Foster, linux-xfs On Mon, Jan 29, 2018 at 01:44:14PM +0200, Avi Kivity wrote: > > > On 01/29/2018 01:35 PM, Dave Chinner wrote: > > On Mon, Jan 29, 2018 at 11:40:27AM +0200, Avi Kivity wrote: > > > > > > On 01/25/2018 03:08 PM, Brian Foster wrote: > > > > On Thu, Jan 25, 2018 at 10:50:40AM +0200, Avi Kivity wrote: > > > > > On 01/23/2018 07:39 PM, Brian Foster wrote: > > > > > > Yeah, could be.. perhaps the issue is that despite the large amount of > > > > > > total free space, the free space is too fragmented to satisfy a > > > > > > particular allocation request..? > > > > > from to extents blocks pct > > > > > 1 1 2702 2702 0.00 > > > > > 2 3 690 1547 0.00 > > > > > 4 7 115 568 0.00 > > > > > 8 15 60 634 0.00 > > > > > 16 31 63 1457 0.00 > > > > > 32 63 102 4751 0.00 > > > > > 64 127 7940 895365 0.19 > > > > > 128 255 49680 12422100 2.67 > > > > > 256 511 1025 417078 0.09 > > > > > 512 1023 4170 3660771 0.79 > > > > > 1024 2047 2168 3503054 0.75 > > > > > 2048 4095 2567 7729442 1.66 > > > > > 4096 8191 8688 59394413 12.76 > > > > > 8192 16383 310 3100186 0.67 > > > > > 16384 32767 112 2339935 0.50 > > > > > 32768 65535 35 1381122 0.30 > > > > > 65536 131071 8 651391 0.14 > > > > > 131072 262143 2 344196 0.07 > > > > > 524288 1048575 4 2909925 0.62 > > > > > 1048576 2097151 3 3550680 0.76 > > > > > 4194304 8388607 10 82497658 17.72 > > > > > 8388608 16777215 10 158022653 33.94 > > > > > 16777216 24567552 5 122778062 26.37 > > > > > total free extents 80469 > > > > > total free blocks 465609690 > > > > > average free extent size 5786.2 > > > > > > > > > > Looks like plenty of free large extents, with most of the free space > > > > > completely, unfragmented. > > > > > > > > > Indeed.. > > You need to look at each AG, not the overall summary. You could have > > a suboptimal AG hidden in amongst that (e.g. near ENOSPC) and it's > > that one AG that is causing all your problems. > > > > There's many reasons this can happen, but the most common is the > > working files in a directory (or subset of directories in the same > > AG) have a combined space usage of larger than an AG .... > > That's certainly possible, even likely (one huge directory with all of the > files). > > This layout is imposed on us by the compatibility gods. Is there a way to > tell XFS to change its policy of on-ag-per-directory? mount with inode32. That rotors files around all AGs in a round robin fashion instead of trying to keep directory locality for a working set. i.e. it distributes the files evenly across the filesystem. However, this can have substantial impact on performance if the workload requires locality of files for performance and you're running on spinning rust. OTOH, if locality is not needed then distributing all files across all AGs evenly should give more even capacity usage. And, based on what was discussed on #xfs overnight, don't bother with the filestreams allocator - it will not solve the "full AG" problem, but it will introduce a whole bunch of new problems for you. > That system has since been recycled, but we'll keep it in mind the next time > we see it. Also, we'll have to fix the one-huge-directory problem for sure, > one way or another. 
If inode32 doesn't work for you, the easiest way is to use a small directory hash - create top level directory with ~2x AG count child directories and hash your files into those. This will distribute the files roughly evenly across all AGs in the filesystem whilst still maintaining locality within directories. It's kind of half way between inode64 and inode32.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 20+ messages in thread
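A minimal sketch of the directory-hash layout described above, with roughly 2x the AG count of top-level buckets. The mount point, the bucket naming and the use of cksum as the hash are illustrative assumptions, not anything from the original workload.

    mnt=/mnt/data        # hypothetical mount point
    agcount=$(xfs_info "$mnt" | sed -n 's/.*agcount=\([0-9]*\).*/\1/p' | head -n1)
    buckets=$((2 * agcount))

    # Pick a bucket from a cheap, stable hash of the file name, so files spread
    # roughly evenly over the AGs while each bucket keeps its own locality.
    bucket_for() {
            local name h
            name=$(basename "$1")
            h=$(printf '%s' "$name" | cksum | cut -d' ' -f1)
            echo "$mnt/bucket-$((h % buckets))"
    }

    dir=$(bucket_for "somefile.dat")    # example file name
    mkdir -p "$dir"
    touch "$dir/somefile.dat"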
* Re: xfs_extent_busy_flush vs. aio 2018-01-29 21:56 ` Dave Chinner @ 2018-01-30 8:58 ` Avi Kivity 2018-02-06 14:10 ` Avi Kivity 1 sibling, 0 replies; 20+ messages in thread From: Avi Kivity @ 2018-01-30 8:58 UTC (permalink / raw) To: Dave Chinner; +Cc: Brian Foster, linux-xfs On 01/29/2018 11:56 PM, Dave Chinner wrote: > On Mon, Jan 29, 2018 at 01:44:14PM +0200, Avi Kivity wrote: >> >> On 01/29/2018 01:35 PM, Dave Chinner wrote: >>> On Mon, Jan 29, 2018 at 11:40:27AM +0200, Avi Kivity wrote: >>>> On 01/25/2018 03:08 PM, Brian Foster wrote: >>>>> On Thu, Jan 25, 2018 at 10:50:40AM +0200, Avi Kivity wrote: >>>>>> On 01/23/2018 07:39 PM, Brian Foster wrote: >>>>>>> Yeah, could be.. perhaps the issue is that despite the large amount of >>>>>>> total free space, the free space is too fragmented to satisfy a >>>>>>> particular allocation request..? >>>>>> from to extents blocks pct >>>>>> 1 1 2702 2702 0.00 >>>>>> 2 3 690 1547 0.00 >>>>>> 4 7 115 568 0.00 >>>>>> 8 15 60 634 0.00 >>>>>> 16 31 63 1457 0.00 >>>>>> 32 63 102 4751 0.00 >>>>>> 64 127 7940 895365 0.19 >>>>>> 128 255 49680 12422100 2.67 >>>>>> 256 511 1025 417078 0.09 >>>>>> 512 1023 4170 3660771 0.79 >>>>>> 1024 2047 2168 3503054 0.75 >>>>>> 2048 4095 2567 7729442 1.66 >>>>>> 4096 8191 8688 59394413 12.76 >>>>>> 8192 16383 310 3100186 0.67 >>>>>> 16384 32767 112 2339935 0.50 >>>>>> 32768 65535 35 1381122 0.30 >>>>>> 65536 131071 8 651391 0.14 >>>>>> 131072 262143 2 344196 0.07 >>>>>> 524288 1048575 4 2909925 0.62 >>>>>> 1048576 2097151 3 3550680 0.76 >>>>>> 4194304 8388607 10 82497658 17.72 >>>>>> 8388608 16777215 10 158022653 33.94 >>>>>> 16777216 24567552 5 122778062 26.37 >>>>>> total free extents 80469 >>>>>> total free blocks 465609690 >>>>>> average free extent size 5786.2 >>>>>> >>>>>> Looks like plenty of free large extents, with most of the free space >>>>>> completely, unfragmented. >>>>>> >>>>> Indeed.. >>> You need to look at each AG, not the overall summary. You could have >>> a suboptimal AG hidden in amongst that (e.g. near ENOSPC) and it's >>> that one AG that is causing all your problems. >>> >>> There's many reasons this can happen, but the most common is the >>> working files in a directory (or subset of directories in the same >>> AG) have a combined space usage of larger than an AG .... >> That's certainly possible, even likely (one huge directory with all of the >> files). >> >> This layout is imposed on us by the compatibility gods. Is there a way to >> tell XFS to change its policy of on-ag-per-directory? > mount with inode32. That rotors files around all AGs in a round > robin fashion instead of trying to keep directory locality for a > working set. i.e. it distributes the files evenly across the > filesystem. I remember you recommending it a couple of years ago, with the qualification to only do it if we see a problem. I think we qualify now. > > However, this can have substantial impact on performance if the > workload requires locality of files for performance and you're > running on spinning rust. This is not a problem. We do have a few users on spinning disks, but they're a small minority, and our installer can avoid the inode32 option if it sees a rotational device. > OTOH, if locality is not needed then > distributing all files across all AGs evenly should give more even > capacity usage. > > And, based on what was discussed on #xfs overnight, don't bother > with the filestreams allocator - it will not solve the "full AG" > problem, but it will introduce a whole bunch of new problems for > you. 
> >> That system has since been recycled, but we'll keep it in mind the next time >> we see it. Also, we'll have to fix the one-huge-directory problem for sure, >> one way or another. > If inode32 doesn't work for you, the easiest way is to use a small > directory hash - create top level directory with ~2x AG count child > directories and hash your files into those. This will distribute the > files roughly evenly across all AGs in the filesystem whilst still > maintaining locality within directories. It's kind of half way > between inode64 and inode32.... > Yeah. Directory layout is sorta imposed on us, but if we have to break it, we will. ^ permalink raw reply [flat|nested] 20+ messages in thread
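A sketch of the installer-side check mentioned above, keyed off the kernel's rotational flag to decide whether to add inode32. The device name, mount point and baseline options are placeholders rather than what any real installer does.

    dev=sdb                       # hypothetical data disk
    opts=noatime                  # assumed baseline mount options
    if [ "$(cat /sys/block/$dev/queue/rotational)" = "0" ]; then
            opts="$opts,inode32"  # SSD/NVMe: directory locality matters less
    fi
    mount -o "$opts" "/dev/$dev" /mnt/data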
* Re: xfs_extent_busy_flush vs. aio 2018-01-29 21:56 ` Dave Chinner 2018-01-30 8:58 ` Avi Kivity @ 2018-02-06 14:10 ` Avi Kivity 2018-02-07 1:57 ` Dave Chinner 1 sibling, 1 reply; 20+ messages in thread From: Avi Kivity @ 2018-02-06 14:10 UTC (permalink / raw) To: Dave Chinner; +Cc: Brian Foster, linux-xfs On 01/29/2018 11:56 PM, Dave Chinner wrote: > On Mon, Jan 29, 2018 at 01:44:14PM +0200, Avi Kivity wrote: >> >> On 01/29/2018 01:35 PM, Dave Chinner wrote: >>> On Mon, Jan 29, 2018 at 11:40:27AM +0200, Avi Kivity wrote: >>>> On 01/25/2018 03:08 PM, Brian Foster wrote: >>>>> On Thu, Jan 25, 2018 at 10:50:40AM +0200, Avi Kivity wrote: >>>>>> On 01/23/2018 07:39 PM, Brian Foster wrote: >>>>>>> Yeah, could be.. perhaps the issue is that despite the large amount of >>>>>>> total free space, the free space is too fragmented to satisfy a >>>>>>> particular allocation request..? >>>>>> from to extents blocks pct >>>>>> 1 1 2702 2702 0.00 >>>>>> 2 3 690 1547 0.00 >>>>>> 4 7 115 568 0.00 >>>>>> 8 15 60 634 0.00 >>>>>> 16 31 63 1457 0.00 >>>>>> 32 63 102 4751 0.00 >>>>>> 64 127 7940 895365 0.19 >>>>>> 128 255 49680 12422100 2.67 >>>>>> 256 511 1025 417078 0.09 >>>>>> 512 1023 4170 3660771 0.79 >>>>>> 1024 2047 2168 3503054 0.75 >>>>>> 2048 4095 2567 7729442 1.66 >>>>>> 4096 8191 8688 59394413 12.76 >>>>>> 8192 16383 310 3100186 0.67 >>>>>> 16384 32767 112 2339935 0.50 >>>>>> 32768 65535 35 1381122 0.30 >>>>>> 65536 131071 8 651391 0.14 >>>>>> 131072 262143 2 344196 0.07 >>>>>> 524288 1048575 4 2909925 0.62 >>>>>> 1048576 2097151 3 3550680 0.76 >>>>>> 4194304 8388607 10 82497658 17.72 >>>>>> 8388608 16777215 10 158022653 33.94 >>>>>> 16777216 24567552 5 122778062 26.37 >>>>>> total free extents 80469 >>>>>> total free blocks 465609690 >>>>>> average free extent size 5786.2 >>>>>> >>>>>> Looks like plenty of free large extents, with most of the free space >>>>>> completely, unfragmented. >>>>>> >>>>> Indeed.. >>> You need to look at each AG, not the overall summary. You could have >>> a suboptimal AG hidden in amongst that (e.g. near ENOSPC) and it's >>> that one AG that is causing all your problems. >>> >>> There's many reasons this can happen, but the most common is the >>> working files in a directory (or subset of directories in the same >>> AG) have a combined space usage of larger than an AG .... >> That's certainly possible, even likely (one huge directory with all of the >> files). >> >> This layout is imposed on us by the compatibility gods. Is there a way to >> tell XFS to change its policy of on-ag-per-directory? > mount with inode32. That rotors files around all AGs in a round > robin fashion instead of trying to keep directory locality for a > working set. i.e. it distributes the files evenly across the > filesystem. http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch06s09.html says: "When 32 bit inode numbers are used on a volume larger than 1TB in size, several changes occur. A 100TB volume using 256 byte inodes mounted in the default inode32 mode has just one percent of its space available for allocating inodes. XFS will reserve the first 1TB of disk space exclusively for inodes to ensure that the imbalance is no worse than this due to file data allocations." Does this mean that a 1.1TB disk has 1TB reserved for inodes and 0.1TB left over for data? Or is it driven by the "one percent" which is mentioned above, so it would be 0.011TB? ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: xfs_extent_busy_flush vs. aio 2018-02-06 14:10 ` Avi Kivity @ 2018-02-07 1:57 ` Dave Chinner 2018-02-07 10:54 ` Avi Kivity 0 siblings, 1 reply; 20+ messages in thread From: Dave Chinner @ 2018-02-07 1:57 UTC (permalink / raw) To: Avi Kivity; +Cc: Brian Foster, linux-xfs On Tue, Feb 06, 2018 at 04:10:12PM +0200, Avi Kivity wrote: > > On 01/29/2018 11:56 PM, Dave Chinner wrote: > > On Mon, Jan 29, 2018 at 01:44:14PM +0200, Avi Kivity wrote: > > > > There's many reasons this can happen, but the most common is the > > > > working files in a directory (or subset of directories in the same > > > > AG) have a combined space usage of larger than an AG .... > > > That's certainly possible, even likely (one huge directory with all of the > > > files). > > > > > > This layout is imposed on us by the compatibility gods. Is there a way to > > > tell XFS to change its policy of on-ag-per-directory? > > mount with inode32. That rotors files around all AGs in a round > > robin fashion instead of trying to keep directory locality for a > > working set. i.e. it distributes the files evenly across the > > filesystem. > > http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch06s09.html > says: > > "When 32 bit inode numbers are used on a volume larger than 1TB in size, > several changes occur. > > A 100TB volume using 256 byte inodes mounted in the default inode32 mode has > just one percent of its space available for allocating inodes. > > XFS will reserve the first 1TB of disk space exclusively for inodes to > ensure that the imbalance is no worse than this due to file data > allocations." s/exclusively// > Does this mean that a 1.1TB disk has 1TB reserved for inodes and 0.1TB left > over for data? No, that would be silly. > Or is it driven by the "one percent" which is mentioned > above, so it would be 0.011TB? No, you're inferring behavioural rules that don't exist from a simple example. Maximum number of inodes is controlled by min(imaxpct, free space). For inode32, "free space" is what's in the first 32 bits of the inode address space. For inode64, it's global free space. To enable this, inode32 sets the AGs wholly within the first 32 bits of the inode address space to be "metadata prefered" and "inode capable". Important things to note: - "first 32 bits of inode address space" means the range of space that inode32 reserves for inodes changes according to inode size. 256 byte inodes = 1TB, 2kB inodes = 8TB. If the filesystem is smaller than this threshold, then it will silently use the inode64 allocation policy until the filesystem is grown beyond 32 bit inode address space size. - "inode capable" means inodes can be allocated in the AG - "metadata preferred" means user data will not get allocated in this AG unless all non-prefered AGs are full. So, assuming 256 byte inodes, you 1.1TB fs will have a imaxpct of ~25%, allowing a maximum of 256GB of inodes or about a billion inodes. But once you put more than 0.1TB of data into the filesystem, data will start filling up the inode capable AGs as well, and then your limit for inodes looks just like inode64 (i.e. depedent on free space). IOWs, inode32 limits where and how many inodes you can create, not how much user data you can write inode the filesystem. -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 20+ messages in thread
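A back-of-the-envelope check of the numbers above, in shell arithmetic purely for illustration. It assumes the usual inode32 mapping where the first 2^32 inode numbers cover 2^32 * inode-size bytes of filesystem address space.

    # 32-bit inode address space for a given inode size, in TiB:
    echo $(( (1 << 32) * 256  / (1 << 40) ))     # 256-byte inodes -> 1
    echo $(( (1 << 32) * 2048 / (1 << 40) ))     # 2 KiB inodes    -> 8

    # 1.1 TB filesystem with imaxpct ~25%: about 0.275 TB available for inodes,
    # which at 256 bytes per inode is on the order of a billion inodes.
    echo $(( 275 * 1000 * 1000 * 1000 / 256 ))   # ~1.07 billion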
* Re: xfs_extent_busy_flush vs. aio 2018-02-07 1:57 ` Dave Chinner @ 2018-02-07 10:54 ` Avi Kivity 2018-02-07 23:43 ` Dave Chinner 0 siblings, 1 reply; 20+ messages in thread From: Avi Kivity @ 2018-02-07 10:54 UTC (permalink / raw) To: Dave Chinner; +Cc: Brian Foster, linux-xfs On 02/07/2018 03:57 AM, Dave Chinner wrote: > On Tue, Feb 06, 2018 at 04:10:12PM +0200, Avi Kivity wrote: >> On 01/29/2018 11:56 PM, Dave Chinner wrote: >>> On Mon, Jan 29, 2018 at 01:44:14PM +0200, Avi Kivity wrote: >>>>> There's many reasons this can happen, but the most common is the >>>>> working files in a directory (or subset of directories in the same >>>>> AG) have a combined space usage of larger than an AG .... >>>> That's certainly possible, even likely (one huge directory with all of the >>>> files). >>>> >>>> This layout is imposed on us by the compatibility gods. Is there a way to >>>> tell XFS to change its policy of on-ag-per-directory? >>> mount with inode32. That rotors files around all AGs in a round >>> robin fashion instead of trying to keep directory locality for a >>> working set. i.e. it distributes the files evenly across the >>> filesystem. >> http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch06s09.html >> says: >> >> "When 32 bit inode numbers are used on a volume larger than 1TB in size, >> several changes occur. >> >> A 100TB volume using 256 byte inodes mounted in the default inode32 mode has >> just one percent of its space available for allocating inodes. >> >> XFS will reserve the first 1TB of disk space exclusively for inodes to >> ensure that the imbalance is no worse than this due to file data >> allocations." > s/exclusively// > >> Does this mean that a 1.1TB disk has 1TB reserved for inodes and 0.1TB left >> over for data? > No, that would be silly. Suggest doc changes for both. > >> Or is it driven by the "one percent" which is mentioned >> above, so it would be 0.011TB? > No, you're inferring behavioural rules that don't exist from a > simple example. > > Maximum number of inodes is controlled by min(imaxpct, free space). > For inode32, "free space" is what's in the first 32 bits of the inode > address space. For inode64, it's global free space. > > To enable this, inode32 sets the AGs wholly within the first 32 bits > of the inode address space to be "metadata prefered" and "inode > capable". > > Important things to note: > > - "first 32 bits of inode address space" means the range of > space that inode32 reserves for inodes changes according > to inode size. 256 byte inodes = 1TB, 2kB inodes = 8TB. If > the filesystem is smaller than this threshold, then it > will silently use the inode64 allocation policy until the > filesystem is grown beyond 32 bit inode address space > size. > > - "inode capable" means inodes can be allocated in the AG > > - "metadata preferred" means user data will not get > allocated in this AG unless all non-prefered AGs are full. > > > So, assuming 256 byte inodes, you 1.1TB fs will have a imaxpct of > ~25%, allowing a maximum of 256GB of inodes or about a billion > inodes. But once you put more than 0.1TB of data into the > filesystem, data will start filling up the inode capable AGs as > well, and then your limit for inodes looks just like inode64 (i.e. > depedent on free space). > > IOWs, inode32 limits where and how many inodes you can > create, not how much user data you can write inode the filesystem. > Thanks a lot for the clarifications. Looks like inode32 can be used to reduce some of our pain. 
There's a danger that when switching from inode64 to inode32 you end up with the inode32 address space already exhausted, right? Does that result in ENOSPC or what? Anyway, it can probably be fixed by stopping the load, copying files around, and moving them back. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: xfs_extent_busy_flush vs. aio 2018-02-07 10:54 ` Avi Kivity @ 2018-02-07 23:43 ` Dave Chinner 0 siblings, 0 replies; 20+ messages in thread From: Dave Chinner @ 2018-02-07 23:43 UTC (permalink / raw) To: Avi Kivity; +Cc: Brian Foster, linux-xfs On Wed, Feb 07, 2018 at 12:54:43PM +0200, Avi Kivity wrote: > On 02/07/2018 03:57 AM, Dave Chinner wrote: > >IOWs, inode32 limits where and how many inodes you can > >create, not how much user data you can write inode the filesystem. > > Thanks a lot for the clarifications. Looks like inode32 can be used > to reduce some of our pain. > > There's a danger that when switching from inode64 to inode32 you end > up with the inode32 address space already exhausted, right? Does > that result in ENOSPC or what? ENOSPC on inode allocation. > Anyway, can probably be fixed by stopping the load, copying files > around, and moving them back. Yup, assuming you're able to find the files that need to be moved in a finite period of time. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: xfs_extent_busy_flush vs. aio 2018-01-25 13:08 ` Brian Foster 2018-01-29 9:40 ` Avi Kivity @ 2018-02-02 9:48 ` Christoph Hellwig 1 sibling, 0 replies; 20+ messages in thread From: Christoph Hellwig @ 2018-02-02 9:48 UTC (permalink / raw) To: Brian Foster; +Cc: Avi Kivity, linux-xfs On Thu, Jan 25, 2018 at 08:08:31AM -0500, Brian Foster wrote: > I suppose it's possible that this was some kind of transient state, or > perhaps only a small set of AGs are affected, etc. It's also possible > this may have been improved in more recent kernels by Christoph's rework > of some of that code. In any event, this would probably require a bit > more runtime analysis to figure out where/why allocations are getting > stalled as such. I'd probably start by looking at the xfs_extent_busy_* > tracepoints (also note that if there's potentially something to be > improved on here, it's more useful to do so against current upstream). > > Or you could just move to something that supports RWF_NOWAIT.. ;) The way the XFS allocator works has had a fundamental flaw ever since we introduced the concept of busy extents, and that is that we need to lock ourselves into an AG, or sometimes even a specific range, without taking said busy extents into account. The proper fix is to separate the on-disk and in-memory data structures for free space tracking, and only release the busy extents to the in-memory one once they aren't busy anymore. Looking into this has been on my todo list for a long time, but I never got to it. ^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2018-02-07 23:43 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed):
  2018-01-23 14:57 xfs_extent_busy_flush vs. aio Avi Kivity
  2018-01-23 15:28 ` Brian Foster
  2018-01-23 15:45 ` Avi Kivity
  2018-01-23 16:11 ` Brian Foster
  2018-01-23 16:22 ` Avi Kivity
  2018-01-23 16:47 ` Brian Foster
  2018-01-23 17:00 ` Avi Kivity
  2018-01-23 17:39 ` Brian Foster
  2018-01-25 8:50 ` Avi Kivity
  2018-01-25 13:08 ` Brian Foster
  2018-01-29 9:40 ` Avi Kivity
  2018-01-29 11:35 ` Dave Chinner
  2018-01-29 11:44 ` Avi Kivity
  2018-01-29 21:56 ` Dave Chinner
  2018-01-30 8:58 ` Avi Kivity
  2018-02-06 14:10 ` Avi Kivity
  2018-02-07 1:57 ` Dave Chinner
  2018-02-07 10:54 ` Avi Kivity
  2018-02-07 23:43 ` Dave Chinner
  2018-02-02 9:48 ` Christoph Hellwig