public inbox for linux-ext4@vger.kernel.org
* Saw your commit: Use mutex_lock_io() for journal->j_checkpoint_mutex
@ 2017-02-21 20:23 Theodore Ts'o
  2017-02-21 20:45 ` Tejun Heo
  0 siblings, 1 reply; 2+ messages in thread
From: Theodore Ts'o @ 2017-02-21 20:23 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-ext4

Hi Tejun, I saw your commit 6fa7aa50b2c484: "fs/jbd2, locking/mutex,
sched/wait: Use mutex_lock_io() for journal->j_checkpoint_mutex",
which just landed in Linus's tree.  The change makes sense, but I
wanted to make a comment about this part of the commit description:

    When an ext4 fs is bogged down by a lot of metadata IOs (in the
    reported case, it was deletion of millions of files, but any massive
    amount of journal writes would do), after the journal is filled up,
    tasks which try to access the filesystem and aren't currently
    performing the journal writes end up waiting in
    __jbd2_log_wait_for_space() for journal->j_checkpoint_mutex.

If this happens, it almost certainly means that the journal is too
small.  This was something a grad student I was mentoring found
when we were benchmarking our SMR-friendly jbd2 changes.  There's a
footnote to this effect in the FAST 2017 paper[1].

[1] https://www.usenix.org/conference/fast17/technical-sessions/presentation/aghayev
    (if you want early access to the paper let me know; it's currently
    available to registered FAST 2017 attendees and will be opened up
    at the start of the FAST 2017 conference next week)

The short version is that on average, with a 5 second commit window
and a 30 second dirty writeback timeout, if you assume the worst case
of 100% of the metadata blocks already being in the buffer cache (so
they don't need to be read from disk), in 5 seconds the journal thread
could potentially spew 150MB/s * 5s == 750MB in a journal transaction.
But that data won't be written back until 30 seconds later.  So if you
are continuously deleting files for 30 seconds, the journal should
have room for at least around 4500MB worth of sequential writing.
Now, that's an extreme worst case.  In reality there will be some disk
reads, not to mention the metadata writebacks, which will be random.
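The back-of-the-envelope arithmetic above can be sketched as follows;
note that the 150MB/s sequential write rate is an assumed figure for a
modern HDD, and the commit/writeback intervals are the defaults
discussed above:

```python
# Back-of-the-envelope worst-case sizing for the ext4 journal under a
# metadata-heavy workload.  The 150 MB/s sequential throughput is an
# assumed HDD figure; the commit interval and dirty writeback timeout
# are the defaults mentioned above.

SEQ_WRITE_MB_PER_SEC = 150    # assumed HDD sequential write speed
COMMIT_INTERVAL_SEC = 5       # jbd2 commit window
WRITEBACK_TIMEOUT_SEC = 30    # dirty writeback timeout

# Worst case: every metadata block is already cached, so one commit
# window can fill a transaction at full sequential speed.
per_transaction_mb = SEQ_WRITE_MB_PER_SEC * COMMIT_INTERVAL_SEC

# Checkpointing lags by the writeback timeout, so the journal needs
# room for a full writeback interval's worth of sequential writes.
journal_mb = SEQ_WRITE_MB_PER_SEC * WRITEBACK_TIMEOUT_SEC

print(per_transaction_mb, journal_mb)   # 750 4500
```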

The bottom line is that 128MiB, which was the previous maximum journal
size, is simply way too small.  So in the latest e2fsprogs 1.43.x
release, the default has been changed so that for a sufficiently large
disk, the default journal size is 1 gig.

If you are using faster media (say, SSD or PCIe-attached flash), and
you expect workloads that are extreme with respect to huge amounts of
metadata changes, an even bigger journal might be called for.  (And
these are the workloads where the lazy journalling that we studied in
the FAST paper is helpful, even on conventional HDDs.)

Anyway, you might want to pass on to the system administrators (or
the SREs, as applicable :-) that if they are hitting this case often,
they should seriously consider increasing the size of their ext4
journal.
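For reference, one way to grow the journal on an existing filesystem
is to remove and recreate it with tune2fs; the filesystem must be
unmounted, and /dev/sdX1 below is just a placeholder device:

```shell
# Sketch: enlarge the ext4 journal on an unmounted filesystem.
# /dev/sdX1 is a placeholder; substitute your actual device.

# Drop the existing journal...
tune2fs -O ^has_journal /dev/sdX1
# ...force a full fsck, since the filesystem must be clean...
e2fsck -f /dev/sdX1
# ...then recreate the journal at 1024 MB.
tune2fs -J size=1024 /dev/sdX1
```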

						- Ted


* Re: Saw your commit: Use mutex_lock_io() for journal->j_checkpoint_mutex
  2017-02-21 20:23 Saw your commit: Use mutex_lock_io() for journal->j_checkpoint_mutex Theodore Ts'o
@ 2017-02-21 20:45 ` Tejun Heo
  0 siblings, 0 replies; 2+ messages in thread
From: Tejun Heo @ 2017-02-21 20:45 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4

Hello, Ted.

> If this happens, it almost certainly means that the journal is too
> small.  This was something a grad student I was mentoring found
> when we were benchmarking our SMR-friendly jbd2 changes.  There's a
> footnote to this effect in the FAST 2017 paper[1].
> 
> [1] https://www.usenix.org/conference/fast17/technical-sessions/presentation/aghayev
>     (if you want early access to the paper let me know; it's currently
>     available to registered FAST 2017 attendees and will be opened up
>     at the start of the FAST 2017 conference next week)
>
> The short version is that on average, with a 5 second commit window
> and a 30 second dirty writeback timeout, if you assume the worst case
> of 100% of the metadata blocks already being in the buffer cache (so
> they don't need to be read from disk), in 5 seconds the journal thread
> could potentially spew 150MB/s * 5s == 750MB in a journal transaction.
> But that data won't be written back until 30 seconds later.  So if you
> are continuously deleting files for 30 seconds, the journal should
> have room for at least around 4500MB worth of sequential writing.
> Now, that's an extreme worst case.  In reality there will be some disk
> reads, not to mention the metadata writebacks, which will be random.

I see.  Yeah, that's close to what we were seeing.  We had a
malfunctioning workload which was deleting an extremely high number
of files, locking up the filesystem and thus other things on the
host.  This was clear misbehavior on the workload's part, but
debugging it took longer than necessary because the waits weren't
accounted as iowait; hence the patch.

> The bottom line is that 128MiB, which was the previous maximum journal
> size, is simply way too small.  So in the latest e2fsprogs 1.43.x
> release, the default has been changed so that for a sufficiently large
> disk, the default journal size is 1 gig.
> 
> If you are using faster media (say, SSD or PCIe-attached flash), and
> you expect workloads that are extreme with respect to huge amounts of
> metadata changes, an even bigger journal might be called for.  (And
> these are the workloads where the lazy journalling that we studied in
> the FAST paper is helpful, even on conventional HDDs.)
> 
> Anyway, you might want to pass on to the system administrators (or
> the SREs, as applicable :-) that if they are hitting this case often,
> they should seriously consider increasing the size of their ext4
> journal.

Thanks a lot for the explanation!

-- 
tejun

