linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Chandan Babu R <chandanrlinux@gmail.com>
To: Chandan Babu R <chandanrlinux@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>, linux-xfs@vger.kernel.org
Subject: Re: [PATCH 7/8 v2] xfs: journal IO cache flush reductions
Date: Thu, 25 Feb 2021 12:43:02 +0530	[thread overview]
Message-ID: <871rd47h8h.fsf@garuda> (raw)
In-Reply-To: <8735xk7pr2.fsf@garuda>

On 25 Feb 2021 at 09:39, Chandan Babu R wrote:
> On 23 Feb 2021 at 13:35, Dave Chinner wrote:
>> From: Dave Chinner <dchinner@redhat.com>
>>
>> Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
>> guarantee the ordering requirements the journal has w.r.t. metadata
>> writeback. THe two ordering constraints are:
>>
>> 1. we cannot overwrite metadata in the journal until we guarantee
>> that the dirty metadata has been written back in place and is
>> stable.
>>
>> 2. we cannot write back dirty metadata until it has been written to
>> the journal and guaranteed to be stable (and hence recoverable) in
>> the journal.
>>
>> The ordering guarantees of #1 are provided by REQ_PREFLUSH. This
>> causes the journal IO to issue a cache flush and wait for it to
>> complete before issuing the write IO to the journal. Hence all
>> completed metadata IO is guaranteed to be stable before the journal
>> overwrites the old metadata.
>>
>> The ordering guarantees of #2 are provided by the REQ_FUA, which
>> ensures the journal writes do not complete until they are on stable
>> storage. Hence by the time the last journal IO in a checkpoint
>> completes, we know that the entire checkpoint is on stable storage
>> and we can unpin the dirty metadata and allow it to be written back.
>>
>> This is the mechanism by which ordering was first implemented in XFS
>> way back in 2002 by this commit:
>>
>> commit 95d97c36e5155075ba2eb22b17562cfcc53fcf96
>> Author: Steve Lord <lord@sgi.com>
>> Date:   Fri May 24 14:30:21 2002 +0000
>>
>>     Add support for drive write cache flushing - should the kernel
>>     have the infrastructure
>>
>> A lot has changed since then, most notably we now use delayed
>> logging to checkpoint the filesystem to the journal rather than
>> write each individual transaction to the journal. Cache flushes on
>> journal IO are necessary when individual transactions are wholly
>> contained within a single iclog. However, CIL checkpoints are single
>> transactions that typically span hundreds to thousands of individual
>> journal writes, and so the requirements for device cache flushing
>> have changed.
>>
>> That is, the ordering rules I state above apply to ordering of
>> atomic transactions recorded in the journal, not to the journal IO
>> itself. Hence we need to ensure metadata is stable before we start
>> writing a new transaction to the journal (guarantee #1), and we need
>> to ensure the entire transaction is stable in the journal before we
>> start metadata writeback (guarantee #2).
>>
>> Hence we only need a REQ_PREFLUSH on the journal IO that starts a
>> new journal transaction to provide #1, and it is not on any other
>> journal IO done within the context of that journal transaction.
>>
>> The CIL checkpoint already issues a cache flush before it starts
>> writing to the log, so we no longer need the iclog IO to issue a
>> REQ_REFLUSH for us. Hence if XLOG_START_TRANS is passed
>> to xlog_write(), we no longer need to mark the first iclog in
>> the log write with REQ_PREFLUSH for this case.
>>
>> Given the new ordering semantics of commit records for the CIL, we
>> need iclogs containing commit to issue a REQ_PREFLUSH. We also
>> require unmount records to do this. Hence for both XLOG_COMMIT_TRANS
>> and XLOG_UNMOUNT_TRANS xlog_write() calls we need to mark
>> the first iclog being written with REQ_PREFLUSH.
>>
>> For both commit records and unmount records, we also want them
>> immediately on stable storage, so we want to also mark the iclogs
>> that contain these records to be marked REQ_FUA. That means if a
>> record is split across multiple iclogs, they are all marked REQ_FUA
>> and not just the last one so that when the transaction is completed
>> all the parts of the record are on stable storage.
>>
>> As an optimisation, when the commit record lands in the same iclog
>> as the journal transaction starts, we don't need to wait for
>> anything and can simply use REQ_FUA to provide guarantee #2.  This
>> means that for fsync() heavy workloads, the cache flush behaviour is
>> completely unchanged and there is no degradation in performance as a
>> result of optimise the multi-IO transaction case.
>>
>> The most notable sign that there is less IO latency on my test
>> machine (nvme SSDs) is that the "noiclogs" rate has dropped
>> substantially. This metric indicates that the CIL push is blocking
>> in xlog_get_iclog_space() waiting for iclog IO completion to occur.
>> With 8 iclogs of 256kB, the rate is appoximately 1 noiclog event to
>> every 4 iclog writes. IOWs, every 4th call to xlog_get_iclog_space()
>> is blocking waiting for log IO. With the changes in this patch, this
>> drops to 1 noiclog event for every 100 iclog writes. Hence it is
>> clear that log IO is completing much faster than it was previously,
>> but it is also clear that for large iclog sizes, this isn't the
>> performance limiting factor on this hardware.
>>
>> With smaller iclogs (32kB), however, there is a sustantial
>> difference. With the cache flush modifications, the journal is now
>> running at over 4000 write IOPS, and the journal throughput is
>> largely identical to the 256kB iclogs and the noiclog event rate
>> stays low at about 1:50 iclog writes. The existing code tops out at
>> about 2500 IOPS as the number of cache flushes dominate performance
>> and latency. The noiclog event rate is about 1:4, and the
>> performance variance is quite large as the journal throughput can
>> fall to less than half the peak sustained rate when the cache flush
>> rate prevents metadata writeback from keeping up and the log runs
>> out of space and throttles reservations.
>>
>> As a result:
>>
>> 	logbsize	fsmark create rate	rm -rf
>> before	32kb		152851+/-5.3e+04	5m28s
>> patched	32kb		221533+/-1.1e+04	5m24s
>>
>> before	256kb		220239+/-6.2e+03	4m58s
>> patched	256kb		228286+/-9.2e+03	5m06s
>>
>> The rm -rf times are included because I ran them, but the
>> differences are largely noise. This workload is largely metadata
>> read IO latency bound and the changes to the journal cache flushing
>> doesn't really make any noticable difference to behaviour apart from
>> a reduction in noiclog events from background CIL pushing.
>>
>> Signed-off-by: Dave Chinner <dchinner@redhat.com>
>> ---
>> Version 2:
>> - repost manually without git/guilt mangling the patch author
>> - fix bug in XLOG_ICL_NEED_FUA definition that didn't manifest as an
>>   ordering bug in generic/45[57] until testing the CIL pipelining
>>   changes much later in the series.
>>
>>  fs/xfs/xfs_log.c      | 33 +++++++++++++++++++++++----------
>>  fs/xfs/xfs_log_cil.c  |  7 ++++++-
>>  fs/xfs/xfs_log_priv.h |  4 ++++
>>  3 files changed, 33 insertions(+), 11 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
>> index 6c3fb6dcb505..08d68a6161ae 100644
>> --- a/fs/xfs/xfs_log.c
>> +++ b/fs/xfs/xfs_log.c
>> @@ -1806,8 +1806,7 @@ xlog_write_iclog(
>>  	struct xlog		*log,
>>  	struct xlog_in_core	*iclog,
>>  	uint64_t		bno,
>> -	unsigned int		count,
>> -	bool			need_flush)
>> +	unsigned int		count)
>>  {
>>  	ASSERT(bno < log->l_logBBsize);
>>
>> @@ -1845,10 +1844,12 @@ xlog_write_iclog(
>>  	 * writeback throttle from throttling log writes behind background
>>  	 * metadata writeback and causing priority inversions.
>>  	 */
>> -	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC |
>> -				REQ_IDLE | REQ_FUA;
>> -	if (need_flush)
>> +	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC | REQ_IDLE;
>> +	if (iclog->ic_flags & XLOG_ICL_NEED_FLUSH)
>>  		iclog->ic_bio.bi_opf |= REQ_PREFLUSH;
>> +	if (iclog->ic_flags & XLOG_ICL_NEED_FUA)
>> +		iclog->ic_bio.bi_opf |= REQ_FUA;
>> +	iclog->ic_flags &= ~(XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
>>
>>  	if (xlog_map_iclog_data(&iclog->ic_bio, iclog->ic_data, count)) {
>>  		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
>> @@ -1951,7 +1952,7 @@ xlog_sync(
>>  	unsigned int		roundoff;       /* roundoff to BB or stripe */
>>  	uint64_t		bno;
>>  	unsigned int		size;
>> -	bool			need_flush = true, split = false;
>> +	bool			split = false;
>>
>>  	ASSERT(atomic_read(&iclog->ic_refcnt) == 0);
>>
>> @@ -2009,13 +2010,14 @@ xlog_sync(
>>  	 * synchronously here; for an internal log we can simply use the block
>>  	 * layer state machine for preflushes.
>>  	 */
>> -	if (log->l_targ != log->l_mp->m_ddev_targp || split) {
>> +	if (log->l_targ != log->l_mp->m_ddev_targp ||
>> +	    (split && (iclog->ic_flags & XLOG_ICL_NEED_FLUSH))) {
>>  		xfs_flush_bdev(log->l_mp->m_ddev_targp->bt_bdev);
>> -		need_flush = false;
>> +		iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH;
>>  	}
>
> If a checkpoint transaction spans across 2 or more iclogs and the log is
> stored on an external device, then the above would remove XLOG_ICL_NEED_FLUSH
> flag from iclog->ic_flags causing xlog_write_iclog() to include only REQ_FUA

... would remove XLOG_ICL_NEED_FLUSH flag from *commit iclog's* ic_flags

> flag in the corresponding bio.
>
> Documentation/block/writeback_cache_control.rst seems to suggest that REQ_FUA
> guarantees only that the data associated with the bio is stable on disk before
> I/O completion is signalled. So looks like REQ_PREFLUSH is required in this
> scenario.


-- 
chandan

  reply	other threads:[~2021-02-25  7:15 UTC|newest]

Thread overview: 59+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-02-23  3:34 [PATCH v2] xfs: various log stuff Dave Chinner
2021-02-23  3:34 ` [PATCH 1/8] xfs: log stripe roundoff is a property of the log Dave Chinner
2021-02-23 10:29   ` Chandan Babu R
2021-02-24 20:14   ` Darrick J. Wong
2021-02-25  8:32   ` Christoph Hellwig
2021-03-01 15:13   ` Brian Foster
2021-02-23  3:34 ` [PATCH 2/8] xfs: separate CIL commit record IO Dave Chinner
2021-02-23 12:12   ` Chandan Babu R
2021-02-24 20:34   ` Darrick J. Wong
2021-02-24 21:44     ` Dave Chinner
2021-02-24 23:06       ` Darrick J. Wong
2021-02-25  8:34       ` Christoph Hellwig
2021-02-25 20:47         ` Dave Chinner
2021-03-01  9:09           ` Christoph Hellwig
2021-03-03  0:11             ` Dave Chinner
2021-02-26  2:48         ` Darrick J. Wong
2021-02-28 16:36           ` Brian Foster
2021-02-28 23:46             ` Dave Chinner
2021-03-01 15:33               ` Brian Foster
2021-03-01 15:19   ` Brian Foster
2021-03-03  0:41     ` Dave Chinner
2021-03-03 15:22       ` Brian Foster
2021-03-04 22:57         ` Dave Chinner
2021-03-05  0:44       ` Dave Chinner
2021-02-23  3:34 ` [PATCH 3/8] xfs: move and rename xfs_blkdev_issue_flush Dave Chinner
2021-02-23 12:57   ` Chandan Babu R
2021-02-24 20:45   ` Darrick J. Wong
2021-02-24 22:01     ` Dave Chinner
2021-02-25  8:36   ` Christoph Hellwig
2021-02-23  3:34 ` [PATCH 4/8] xfs: async blkdev cache flush Dave Chinner
2021-02-23  5:29   ` Chaitanya Kulkarni
2021-02-23 14:02   ` Chandan Babu R
2021-02-24 20:51   ` Darrick J. Wong
2021-02-23  3:34 ` [PATCH 5/8] xfs: CIL checkpoint flushes caches unconditionally Dave Chinner
2021-02-24  7:16   ` Chandan Babu R
2021-02-24 20:57   ` Darrick J. Wong
2021-02-25  8:42   ` Christoph Hellwig
2021-02-25 21:07     ` Dave Chinner
2021-02-23  3:34 ` [PATCH 6/8] xfs: remove need_start_rec parameter from xlog_write() Dave Chinner
2021-02-24  7:17   ` Chandan Babu R
2021-02-24 20:59   ` Darrick J. Wong
2021-02-25  8:49   ` Christoph Hellwig
2021-02-25 20:55     ` Dave Chinner
2021-02-23  3:34 ` [PATCH 7/8] xfs: journal IO cache flush reductions Dave Chinner
2021-02-23  8:05   ` [PATCH 7/8 v2] " Dave Chinner
2021-02-24 12:27     ` Chandan Babu R
2021-02-24 20:32       ` Dave Chinner
2021-02-24 21:13     ` Darrick J. Wong
2021-02-24 22:03       ` Dave Chinner
2021-02-25  4:09     ` Chandan Babu R
2021-02-25  7:13       ` Chandan Babu R [this message]
2021-03-01  5:44       ` Dave Chinner
2021-03-01  5:56         ` Dave Chinner
2021-02-25  8:58     ` Christoph Hellwig
2021-02-25 21:06       ` Dave Chinner
2021-03-01 19:29     ` Brian Foster
2021-02-23  3:34 ` [PATCH 8/8] xfs: Fix CIL throttle hang when CIL space used going backwards Dave Chinner
2021-02-24 21:18   ` Darrick J. Wong
2021-02-24 22:05     ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=871rd47h8h.fsf@garuda \
    --to=chandanrlinux@gmail.com \
    --cc=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).