[BUG] bcachefs fallocate btree lock contention

Linux bcachefs list
 help / color / mirror / Atom feed

* [BUG] bcachefs fallocate btree lock contention
@ 2023-07-21 13:28 Brian Foster
  2023-07-21 20:49 ` Kent Overstreet
  2023-07-21 21:02 ` Kent Overstreet
  0 siblings, 2 replies; 3+ messages in thread
From: Brian Foster @ 2023-07-21 13:28 UTC (permalink / raw)
  To: linux-bcachefs

Hi all,

When testing the recent write buffer journaling series, I reproduced
several fstests (i.e. generic/013) that seemed pretty much hung up in a
livelock in fsstress. On some further digging, it appears they were
stuck doing fallocates and were spinning heavily on transaction
restarts. I don't think these tests are stuck indefinitely, but rather
this manifests as some excessively long runtimes for tests that involve
concurrent fsstress runs. I was eventually able to reproduce the same
behavior without the write buffer patches, so it doesn't appear to be
related.

I think the issue is basically that if multiple fallocates are running
against independent inodes that might update the same extent btree node,
the __bchfs_fallocate() loop can get into a tight spin due to the lock
cycling around bch2_clamp_data_hole() contending with node updates. This
is where I see most restarts, and I've seen upwards of 100k+ restarts
and single fallocate latencies of tens of seconds. This is pretty
trivial to reproduce by just running concurrent sequential fallocates to
different files (8x or so on my test vm pretty much grinds things to a
halt) [1].

One question that comes to mind: why do we cycle locks here? Is this a
lock ordering requirement between folio locks and btree node locks?

To test the above, I ran with a quick hack to check for pagecache pages
before we decide to clamp the range during fallocate. This speeds up the
test significantly and pretty much removes the bottleneck. This only
handles the simple case and doesn't quite feel like the proper fix to
me, but since I'm low on time I threw it up on CI [2] for reference and
to get a test cycle. I'm heading on vacation for the next week+, so I
wanted to throw this up on the list so at least folks are aware of it if
any excessive test latencies are observed.

Any thoughts are appreciated in the meantime. I'll pick it up once I'm
back..

Brian

[1] Example sequential fallocate reproducer. Run against multiple files:

offset=0
while [ true ]; do
        xfs_io -fc "falloc ${offset}k 512k" $file
        offset=$((offset + 512))
done

[2] https://evilpiepirate.org/~testdashboard/ci?branch=bfoster&commit=d755bfd22fe0fabf8def3bfa0b758864538f79cd

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [BUG] bcachefs fallocate btree lock contention
  2023-07-21 13:28 [BUG] bcachefs fallocate btree lock contention Brian Foster
@ 2023-07-21 20:49 ` Kent Overstreet
  2023-07-21 21:02 ` Kent Overstreet
  1 sibling, 0 replies; 3+ messages in thread
From: Kent Overstreet @ 2023-07-21 20:49 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Fri, Jul 21, 2023 at 09:28:39AM -0400, Brian Foster wrote:
> Hi all,
> 
> When testing the recent write buffer journaling series, I reproduced
> several fstests (i.e. generic/013) that seemed pretty much hung up in a
> livelock in fsstress. On some further digging, it appears they were
> stuck doing fallocates and were spinning heavily on transaction
> restarts. I don't think these tests are stuck indefinitely, but rather
> this manifests as some excessively long runtimes for tests that involve
> concurrent fsstress runs. I was eventually able to reproduce the same
> behavior without the write buffer patches, so it doesn't appear to be
> related.
> 
> I think the issue is basically that if multiple fallocates are running
> against independent inodes that might update the same extent btree node,
> the __bchfs_fallocate() loop can get into a tight spin due to the lock
> cycling around bch2_clamp_data_hole() contending with node updates. This
> is where I see most restarts, and I've seen upwards of 100k+ restarts
> and single fallocate latencies of tens of seconds. This is pretty
> trivial to reproduce by just running concurrent sequential fallocates to
> different files (8x or so on my test vm pretty much grinds things to a
> halt) [1].
> 
> One question that comes to mind: why do we cycle locks here? Is this a
> lock ordering requirement between folio locks and btree node locks?
> 
> To test the above, I ran with a quick hack to check for pagecache pages
> before we decide to clamp the range during fallocate. This speeds up the
> test significantly and pretty much removes the bottleneck. This only
> handles the simple case and doesn't quite feel like the proper fix to
> me, but since I'm low on time I threw it up on CI [2] for reference and
> to get a test cycle. I'm heading on vacation for the next week+, so I
> wanted to throw this up on the list so at least folks are aware of it if
> any excessive test latencies are observed.
> 
> Any thoughts are appreciated in the meantime. I'll pick it up once I'm
> back..

What about adding a nonblocking flag to bch2_clamp_data_hole() that only
uses trylock for locking folios?

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [BUG] bcachefs fallocate btree lock contention
  2023-07-21 13:28 [BUG] bcachefs fallocate btree lock contention Brian Foster
  2023-07-21 20:49 ` Kent Overstreet
@ 2023-07-21 21:02 ` Kent Overstreet
  1 sibling, 0 replies; 3+ messages in thread
From: Kent Overstreet @ 2023-07-21 21:02 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Fri, Jul 21, 2023 at 09:28:39AM -0400, Brian Foster wrote:
> One question that comes to mind: why do we cycle locks here? Is this a
> lock ordering requirement between folio locks and btree node locks?

There is a lock ordering requirement, but more broadly we don't like to
hold btree locks while doing anything else that could block - we strive
to keep the lock hold time for btree node locks as low as possible, as
that's one of the biggest factors for overal tail latency.

btree node locks shouldn't be held for more than tens of microseconds -
folio lock can be held for the duration of IO, so we definitely don't
want to be blocked on that.

Recently I was doing some work to avoid ever invoking memory reclaim
while holding btree node locks - same deal. We first try memory
allocations with GFP_NOWAIT|__GFP_NOWARN, then if that fails, call
bch2_trans_unlock() and retry with GFP_KERNEL. Also means we don't have
to muck with GFP_NOFS/GFP_NOIO as much...

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2023-07-21 21:02 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-07-21 13:28 [BUG] bcachefs fallocate btree lock contention Brian Foster
2023-07-21 20:49 ` Kent Overstreet
2023-07-21 21:02 ` Kent Overstreet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox