From: Chandan Babu R <chandanrlinux@gmail.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 08/45] xfs: journal IO cache flush reductions
Date: Mon, 08 Mar 2021 16:19:44 +0530
Message-ID: <87czw95393.fsf@garuda>
In-Reply-To: <20210305051143.182133-9-david@fromorbit.com>

On 05 Mar 2021 at 10:41, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
> guarantee the ordering requirements the journal has w.r.t. metadata
> writeback. The two ordering constraints are:
>
> 1. we cannot overwrite metadata in the journal until we guarantee
> that the dirty metadata has been written back in place and is
> stable.
>
> 2. we cannot write back dirty metadata until it has been written to
> the journal and guaranteed to be stable (and hence recoverable) in
> the journal.
>
> The ordering guarantees of #1 are provided by REQ_PREFLUSH. This
> causes the journal IO to issue a cache flush and wait for it to
> complete before issuing the write IO to the journal. Hence all
> completed metadata IO is guaranteed to be stable before the journal
> overwrites the old metadata.
>
> The ordering guarantees of #2 are provided by the REQ_FUA, which
> ensures the journal writes do not complete until they are on stable
> storage. Hence by the time the last journal IO in a checkpoint
> completes, we know that the entire checkpoint is on stable storage
> and we can unpin the dirty metadata and allow it to be written back.
>
> This is the mechanism by which ordering was first implemented in XFS
> way back in 2002 by commit 95d97c36e5155075ba2eb22b17562cfcc53fcf96
> ("Add support for drive write cache flushing") in the xfs-archive
> tree.
>
> A lot has changed since then, most notably we now use delayed
> logging to checkpoint the filesystem to the journal rather than
> write each individual transaction to the journal. Cache flushes on
> journal IO are necessary when individual transactions are wholly
> contained within a single iclog. However, CIL checkpoints are single
> transactions that typically span hundreds to thousands of individual
> journal writes, and so the requirements for device cache flushing
> have changed.
>
> That is, the ordering rules I state above apply to ordering of
> atomic transactions recorded in the journal, not to the journal IO
> itself. Hence we need to ensure metadata is stable before we start
> writing a new transaction to the journal (guarantee #1), and we need
> to ensure the entire transaction is stable in the journal before we
> start metadata writeback (guarantee #2).
>
> Hence we only need a REQ_PREFLUSH on the journal IO that starts a
> new journal transaction to provide #1, and it is not needed on any
> other journal IO done within the context of that journal transaction.
>
> The CIL checkpoint already issues a cache flush before it starts
> writing to the log, so we no longer need the iclog IO to issue a
> REQ_PREFLUSH for us. Hence if XLOG_START_TRANS is passed
> to xlog_write(), we no longer need to mark the first iclog in
> the log write with REQ_PREFLUSH for this case. As an added bonus,
> this ordering mechanism works for both internal and external logs,
> meaning we can remove the explicit data device cache flushes from
> the iclog write code when using external logs.
>
> Given the new ordering semantics of commit records for the CIL, we
> need iclogs containing commit records to issue a REQ_PREFLUSH. We
> also require unmount records to do this. Hence for both
> XLOG_COMMIT_TRANS and XLOG_UNMOUNT_TRANS xlog_write() calls we need
> to mark the first iclog being written with REQ_PREFLUSH.
>
> For both commit records and unmount records, we also want them
> immediately on stable storage, so we also want the iclogs that
> contain these records to be marked REQ_FUA. That means if a
> record is split across multiple iclogs, they are all marked REQ_FUA
> and not just the last one so that when the transaction is completed
> all the parts of the record are on stable storage.
>
> And for external logs, unmount records need a pre-write data device
> cache flush similar to the CIL checkpoint cache pre-flush as the
> internal iclog write code does not do this implicitly anymore.
>
> As an optimisation, when the commit record lands in the same iclog
> in which the journal transaction starts, we don't need to wait for
> anything and can simply use REQ_FUA to provide guarantee #2. This
> means that for fsync() heavy workloads, the cache flush behaviour is
> completely unchanged and there is no degradation in performance as a
> result of optimising the multi-IO transaction case.
>
> The most notable sign that there is less IO latency on my test
> machine (nvme SSDs) is that the "noiclogs" rate has dropped
> substantially. This metric indicates that the CIL push is blocking
> in xlog_get_iclog_space() waiting for iclog IO completion to occur.
> With 8 iclogs of 256kB, the rate is approximately 1 noiclog event to
> every 4 iclog writes. IOWs, every 4th call to xlog_get_iclog_space()
> is blocking waiting for log IO. With the changes in this patch, this
> drops to 1 noiclog event for every 100 iclog writes. Hence it is
> clear that log IO is completing much faster than it was previously,
> but it is also clear that for large iclog sizes, this isn't the
> performance limiting factor on this hardware.
>
> With smaller iclogs (32kB), however, there is a substantial
> difference. With the cache flush modifications, the journal is now
> running at over 4000 write IOPS, and the journal throughput is
> largely identical to the 256kB iclogs and the noiclog event rate
> stays low at about 1:50 iclog writes. The existing code tops out at
> about 2500 IOPS as the number of cache flushes dominates performance
> and latency. The noiclog event rate is about 1:4, and the
> performance variance is quite large as the journal throughput can
> fall to less than half the peak sustained rate when the cache flush
> rate prevents metadata writeback from keeping up and the log runs
> out of space and throttles reservations.
>
> As a result:
>
>            logbsize   fsmark create rate    rm -rf
> before     32kb       152851+/-5.3e+04      5m28s
> patched    32kb       221533+/-1.1e+04      5m24s
>
> before     256kb      220239+/-6.2e+03      4m58s
> patched    256kb      228286+/-9.2e+03      5m06s
>
> The rm -rf times are included because I ran them, but the
> differences are largely noise. This workload is largely metadata
> read IO latency bound and the changes to the journal cache flushing
> don't really make any noticeable difference to behaviour apart from
> a reduction in noiclog events from background CIL pushing.
>
I see that the missing preflush w.r.t previous iclogs of a multi-iclog
checkpoint transaction has been handled in this version. Hence,
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
--
chandan
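
For readers following the flag rules in the quoted commit message, here
is a minimal C sketch of the per-iclog decision it describes. It is a
standalone model, not the code from the patch: the enum names,
sketch_iclog_flags(), and its two boolean parameters are all
hypothetical, and the flag values are placeholders rather than the
kernel's REQ_PREFLUSH/REQ_FUA definitions.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the block-layer flags named in the text. */
enum {
	SKETCH_REQ_PREFLUSH = 1u << 0,	/* flush device cache before the write */
	SKETCH_REQ_FUA      = 1u << 1,	/* write is stable before it completes */
};

/* Hypothetical record types, modelled on the XLOG_*_TRANS optypes. */
enum sketch_optype {
	SKETCH_START_TRANS,
	SKETCH_COMMIT_TRANS,
	SKETCH_UNMOUNT_TRANS,
	SKETCH_OTHER,
};

/*
 * Return the flags one iclog write would carry under the rules in the
 * commit message:
 *  - start records and ordinary checkpoint data need no flags, because
 *    the CIL checkpoint has already issued a cache flush;
 *  - every iclog holding part of a commit or unmount record gets FUA;
 *  - the first iclog of such a record also gets PREFLUSH, unless it is
 *    a commit record in the same iclog as the transaction start.
 */
unsigned int
sketch_iclog_flags(enum sketch_optype optype, bool first_iclog_of_record,
		   bool same_iclog_as_start)
{
	unsigned int flags;

	if (optype != SKETCH_COMMIT_TRANS && optype != SKETCH_UNMOUNT_TRANS)
		return 0;

	flags = SKETCH_REQ_FUA;

	/* Single-iclog checkpoint: nothing earlier to wait for. */
	if (optype == SKETCH_COMMIT_TRANS && same_iclog_as_start)
		return flags;

	if (first_iclog_of_record)
		flags |= SKETCH_REQ_PREFLUSH;

	return flags;
}
```

Under these assumptions, a commit record split across two iclogs yields
PREFLUSH|FUA for the first iclog and FUA alone for the second, while a
checkpoint that fits in a single iclog gets FUA only, which matches the
fsync()-heavy case described in the commit message.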