Re: understanding xfs vs. ext4 log performance

linux-fsdevel.vger.kernel.org archive mirror

From: Alan Jenkins <alan.christopher.jenkins@gmail.com>
To: Lucas Stach <dev@lynxeye.de>, linux-xfs@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Subject: Re: understanding xfs vs. ext4 log performance
Date: Tue, 4 Jun 2019 14:46:24 +0100
Message-ID: <e3721341-2ea0-f13f-ae42-890209736eaa@gmail.com>
In-Reply-To: <7a642f570980609ccff126a78f1546265ba913e2.camel@lynxeye.de>

On 04/06/2019 10:21, Lucas Stach wrote:
> Hi all,
>
> this question is more out of curiosity and because I want to take the
> chance to learn something.
>
> At work we've stumbled over a workload that seems to hit pathological
> performance on XFS. Basically the critical part of the workload is a
> "rm -rf" of a pretty large directory tree, filled with files of mixed
> size ranging from a few KB to a few MB. The filesystem resides on quite
> slow spinning rust disks, directly attached to the host, so no
> controller with a BBU or something like that involved.
>
> We've tested the workload with both xfs and ext4, and while the numbers
> aren't completely accurate due to other factors playing into the
> runtime, performance difference between XFS and ext4 seems to be an
> order of magnitude. (Ballpark runtime XFS is 30 mins, while ext4
> handles the remove in ~3 mins).
>
> The XFS performance seems to be completely dominated by log buffer
> writes, which happen with both REQ_PREFLUSH and REQ_FUA set. It's
> pretty obvious why this kills performance on slow spinning rust.
>
> Now the thing I wonder about is why ext4 seems to get away without
> those costly flags for its log writes. At least blktrace shows almost
> zero PREFLUSH or FUA requests. Is there some fundamental difference in
> how ext4 handles its logging to avoid the need for this ordering and
> forced access, or is ext4 just living more dangerously with regard to
> reordered writes?
>
> Does XFS really require such a strong ordering on the log buffer
> writes? I don't understand enough of the XFS transaction code and
> wonder if it would be possible to do the strongly ordered writes only
> on transaction commit.
>
> Regards,
> Lucas

Your immediate question sounds like an artefact.  I think both XFS and 
ext4 flush the cache when writing to the log.  The difference I see is 
that xlog_sync() writes the log in one IO.  By contrast, 
jbd2_journal_commit_transaction() has several steps that submit IO. The 
last IO is a "commit descriptor", and that IO is strictly ordered 
(PREFLUSH+FUA).

That is, unless you have enabled `journal_async_commit` in ext4, but I
think you would know if you had.  I am not sure whether that feature is
now considered mature, but it is not compatible with the default option
`data=ordered`.  That incompatibility is still not documented, so I
think the feature is at least not widely used :-).
https://unix.stackexchange.com/questions/520379/

Maybe XFS is generating much more log IO.  Alternatively, something that 
you do not expect might be causing calls to xfs_log_force_lsn() / 
xfs_log_force().

In future, it would be helpful to include details such as the kernel 
version you tested :-).

Regards
Alan

Google pointed me to xfs_log.c.  There is only one place that submits 
IO: xlog_sync().  As you observe, this write uses PREFLUSH+FUA.  But I 
think this is the *only* time we write to the journal.

/*
 * Flush out the in-core log (iclog) to the on-disk log in an asynchronous
 * fashion.
...
	bp->b_io_length = BTOBB(count);
	bp->b_log_item = iclog;
	bp->b_flags &= ~XBF_FLUSH;
	bp->b_flags |= (XBF_ASYNC | XBF_SYNCIO | XBF_WRITE | XBF_FUA);

	/*
	 * Flush the data device before flushing the log to make sure all meta
	 * data written back from the AIL actually made it to disk before
	 * stamping the new log tail LSN into the log buffer.  For an external
	 * log we need to issue the flush explicitly, and unfortunately
	 * synchronously here; for an internal log we can simply use the block
	 * layer state machine for preflushes.
	 */
	if (log->l_mp->m_logdev_targp != log->l_mp->m_ddev_targp)
		xfs_blkdev_issue_flush(log->l_mp->m_ddev_targp);
	else
		bp->b_flags |= XBF_FLUSH;
...
	error = xlog_bdstrat(bp);
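
To make the branch at the end of that excerpt concrete, here is a minimal userspace sketch of the flag selection.  It is only a model: the `SK_*` bit values, `struct sk_log` and `sk_log_write_flags()` are made up for illustration, and only the flag names and the internal/external log distinction are taken from the excerpt.

```c
/* Illustrative values only: the real XBF_* bits are defined in the XFS sources. */
#define SK_XBF_WRITE	(1u << 0)
#define SK_XBF_ASYNC	(1u << 1)
#define SK_XBF_SYNCIO	(1u << 2)
#define SK_XBF_FUA	(1u << 3)
#define SK_XBF_FLUSH	(1u << 4)

/* Hypothetical stand-in for the two device targets compared in the excerpt. */
struct sk_log {
	int logdev;			/* device holding the log */
	int ddev;			/* device holding data and metadata */
	unsigned explicit_flushes;	/* synchronous flushes issued to ddev */
};

/* Return the flags for one log write, mirroring the if/else in the excerpt. */
static unsigned sk_log_write_flags(struct sk_log *log)
{
	unsigned flags = SK_XBF_ASYNC | SK_XBF_SYNCIO | SK_XBF_WRITE | SK_XBF_FUA;

	if (log->logdev != log->ddev)
		log->explicit_flushes++;	/* external log: flush the data device by hand */
	else
		flags |= SK_XBF_FLUSH;		/* internal log: let the block layer preflush */

	return flags;
}
```

In this model every log write carries FUA, and the data device is flushed before it in both cases.  With an internal log both end up on the same request, which would match the PREFLUSH+FUA writes you see in blktrace.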

Whereas I see at least three steps in
jbd2_journal_commit_transaction().  Step 1, write all the data to the
journal without flushes:

	while (commit_transaction->t_buffers) {

		/* Find the next buffer to be journaled... */

		...

		/* If there's no more to do, or if the descriptor is full,
		   let the IO rip! */

		if (bufs == journal->j_wbufsize ||
		    commit_transaction->t_buffers == NULL ||
		    space_left < tag_bytes + 16 + csum_size) {

			...

			for (i = 0; i < bufs; i++) {

				...

				bh->b_end_io = journal_end_buffer_io_sync;
				submit_bh(REQ_OP_WRITE, REQ_SYNC, bh);
			}

Step 2:

	err = journal_finish_inode_data_buffers(journal, commit_transaction);
	if (err) {
		printk(KERN_WARNING
			"JBD2: Detected IO errors while flushing file data "
		       "on %s\n", journal->j_devname);

Step 3, commit:

	if (!jbd2_has_feature_async_commit(journal)) {
		err = journal_submit_commit_record(journal, commit_transaction,
						&cbh, crc32_sum);
		if (err)
			__jbd2_journal_abort_hard(journal);
	}
	if (cbh)
		err = journal_wait_on_commit_record(journal, cbh);

static int journal_submit_commit_record(journal_t *journal,
					transaction_t *commit_transaction,
					struct buffer_head **cbh,
					__u32 crc32_sum)
{
...

	if (journal->j_flags & JBD2_BARRIER &&
	    !jbd2_has_feature_async_commit(journal))
		ret = submit_bh(REQ_OP_WRITE,
			REQ_SYNC | REQ_PREFLUSH | REQ_FUA, bh);
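
To illustrate why this matters, the steps above can be put into a toy tally of how many requests carry flush flags.  This is a rough sketch, not the kernel logic: `struct sk_commit_io`, `sk_jbd2_commit()` and `sk_xfs_log_writes()` are hypothetical names, descriptor blocks are ignored, and the plain write in the `else` branch is my assumption about the part of journal_submit_commit_record() that is elided above.

```c
#include <stdbool.h>

/* Requests submitted to the journal, split by whether they carry flush flags. */
struct sk_commit_io {
	unsigned plain_writes;	/* REQ_SYNC only */
	unsigned flush_writes;	/* REQ_SYNC | REQ_PREFLUSH | REQ_FUA */
};

/*
 * Model one jbd2 commit of nr_blocks journal blocks.  `barrier` and
 * `async_commit` stand in for JBD2_BARRIER and the async commit feature
 * tested in the excerpt.
 */
static struct sk_commit_io sk_jbd2_commit(unsigned nr_blocks, bool barrier,
					  bool async_commit)
{
	struct sk_commit_io io = { 0, 0 };

	/* Step 1: journal blocks go out without flush flags. */
	io.plain_writes += nr_blocks;

	/* Step 2 deals with file data and adds no journal IO in this model. */

	/* Step 3: one commit record, strictly ordered only in the default case. */
	if (barrier && !async_commit)
		io.flush_writes += 1;
	else
		io.plain_writes += 1;	/* assumption: elided branch is a plain write */

	return io;
}

/* Model XFS issuing nr_syncs log writes, each one PREFLUSH+FUA. */
static struct sk_commit_io sk_xfs_log_writes(unsigned nr_syncs)
{
	struct sk_commit_io io = { 0, nr_syncs };

	return io;
}
```

In this model a 100-block jbd2 commit costs a single flush-bearing request, while XFS pays one per log write.  So the number that matters is how often xlog_sync() runs, not how many bytes are logged.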
