Subject: Re: understanding xfs vs.
 ext4 log performance
To: Lucas Stach, linux-xfs@vger.kernel.org
References: <7a642f570980609ccff126a78f1546265ba913e2.camel@lynxeye.de>
From: Alan Jenkins
Cc: linux-fsdevel@vger.kernel.org
Date: Tue, 4 Jun 2019 14:46:24 +0100
In-Reply-To: <7a642f570980609ccff126a78f1546265ba913e2.camel@lynxeye.de>

On 04/06/2019 10:21, Lucas Stach wrote:
> Hi all,
>
> this question is more out of curiosity and because I want to take the
> chance to learn something.
>
> At work we've stumbled over a workload that seems to hit pathological
> performance on XFS. Basically the critical part of the workload is a
> "rm -rf" of a pretty large directory tree, filled with files of mixed
> size ranging from a few KB to a few MB. The filesystem resides on quite
> slow spinning rust disks, directly attached to the host, so no
> controller with a BBU or something like that involved.
>
> We've tested the workload with both xfs and ext4, and while the numbers
> aren't completely accurate due to other factors playing into the
> runtime, performance difference between XFS and ext4 seems to be an
> order of magnitude. (Ballpark runtime XFS is 30 mins, while ext4
> handles the remove in ~3 mins).
>
> The XFS performance seems to be completely dominated by log buffer
> writes, which happen with both REQ_PREFLUSH and REQ_FUA set. It's
> pretty obvious why this kills performance on slow spinning rust.
>
> Now the thing I wonder about is why ext4 seems to get away without
> those costly flags for its log writes. At least blktrace shows almost
> zero PREFLUSH or FUA requests.
> Is there some fundamental difference in
> how ext4 handles its logging to avoid the need for this ordering and
> forced access, or is ext4 just living more dangerously with regard to
> reordered writes?
>
> Does XFS really require such a strong ordering on the log buffer
> writes? I don't understand enough of the XFS transaction code and
> wonder if it would be possible to do the strongly ordered writes only
> on transaction commit.
>
> Regards,
> Lucas

Your immediate question sounds like an artefact.  I think both XFS and
ext4 flush the cache when writing to the log.  The difference I see is
that xlog_sync() writes the log in one IO.  By contrast,
jbd2_journal_commit_transaction() has several steps that submit IO.
Only the last IO, the "commit record", is strictly ordered
(PREFLUSH+FUA).

That is, unless you have enabled `journal_async_commit` in ext4, but I
think you would know if you had.  I am not sure whether that feature is
now considered mature, but it is not compatible with the default option
`data=ordered`.  This fact is still not in the documentation, so I
think it is at least not used very widely :-).

https://unix.stackexchange.com/questions/520379/

Maybe XFS is generating much more log IO.  Alternatively, something
that you do not expect might be causing calls to xfs_log_force_lsn() /
xfs_log_force().

In future, it would be helpful to include details such as the kernel
version you tested :-).

Regards
Alan


Google pointed me to xfs_log.c.  There is only one place that submits
IO: xlog_sync().  As you observe, this write uses PREFLUSH+FUA.  But I
think this is the *only* time we write to the journal.

/*
 * Flush out the in-core log (iclog) to the on-disk log in an asynchronous
 * fashion.
...
    bp->b_io_length = BTOBB(count);
    bp->b_log_item = iclog;
    bp->b_flags &= ~XBF_FLUSH;
    bp->b_flags |= (XBF_ASYNC | XBF_SYNCIO | XBF_WRITE | XBF_FUA);

    /*
     * Flush the data device before flushing the log to make sure all meta
     * data written back from the AIL actually made it to disk before
     * stamping the new log tail LSN into the log buffer.  For an external
     * log we need to issue the flush explicitly, and unfortunately
     * synchronously here; for an internal log we can simply use the block
     * layer state machine for preflushes.
     */
    if (log->l_mp->m_logdev_targp != log->l_mp->m_ddev_targp)
        xfs_blkdev_issue_flush(log->l_mp->m_ddev_targp);
    else
        bp->b_flags |= XBF_FLUSH;
...
    error = xlog_bdstrat(bp);

Whereas I see at least three steps in jbd2_journal_commit_transaction().
Step 1, write all the data to the journal without flushes:

    while (commit_transaction->t_buffers) {

        /* Find the next buffer to be journaled... */
...
        /* If there's no more to do, or if the descriptor is full,
           let the IO rip! */
        if (bufs == journal->j_wbufsize ||
            commit_transaction->t_buffers == NULL ||
            space_left < tag_bytes + 16 + csum_size) {
...
            for (i = 0; i < bufs; i++) {
...
                bh->b_end_io = journal_end_buffer_io_sync;
                submit_bh(REQ_OP_WRITE, REQ_SYNC, bh);
            }

Step 2:

    err = journal_finish_inode_data_buffers(journal, commit_transaction);
    if (err) {
        printk(KERN_WARNING
            "JBD2: Detected IO errors while flushing file data "
            "on %s\n", journal->j_devname);

Step 3, commit:

    if (!jbd2_has_feature_async_commit(journal)) {
        err = journal_submit_commit_record(journal, commit_transaction,
                                           &cbh, crc32_sum);
        if (err)
            __jbd2_journal_abort_hard(journal);
    }
    if (cbh)
        err = journal_wait_on_commit_record(journal, cbh);

static int journal_submit_commit_record(journal_t *journal,
                                        transaction_t *commit_transaction,
                                        struct buffer_head **cbh,
                                        __u32 crc32_sum)
{
...
    if (journal->j_flags & JBD2_BARRIER &&
        !jbd2_has_feature_async_commit(journal))
        ret = submit_bh(REQ_OP_WRITE,
                        REQ_SYNC | REQ_PREFLUSH | REQ_FUA, bh);
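
To make the comparison concrete, here is a toy userspace model of the two submission patterns quoted above.  It is only a sketch of my reading of the code, not kernel code: the `toy_*` names, the flag values and the buffer counts are all made up for illustration, and it measures nothing.  It simply counts how many of the submitted writes would carry PREFLUSH+FUA if every XFS log buffer write is ordered, while jbd2 sends plain REQ_SYNC writes and orders only the commit record.

```c
#include <assert.h>
#include <stdio.h>

/* Toy stand-ins for the block-layer flags.  The values are arbitrary and
 * are NOT the kernel's definitions. */
enum { TOY_SYNC = 1 << 0, TOY_PREFLUSH = 1 << 1, TOY_FUA = 1 << 2 };

struct toy_stats {
    unsigned writes;         /* total write requests submitted */
    unsigned ordered_writes; /* writes carrying both PREFLUSH and FUA */
};

/* Record one submitted write and classify it by its flags. */
void toy_submit(struct toy_stats *st, unsigned flags)
{
    assert(st != NULL);
    st->writes++;
    if ((flags & TOY_PREFLUSH) && (flags & TOY_FUA))
        st->ordered_writes++;
}

/* XFS-like model (assumption): every log buffer write is a single IO
 * that is strictly ordered, as in the xlog_sync() excerpt. */
struct toy_stats toy_xfs_log(unsigned log_buffers)
{
    struct toy_stats st = {0, 0};
    for (unsigned i = 0; i < log_buffers; i++)
        toy_submit(&st, TOY_SYNC | TOY_PREFLUSH | TOY_FUA);
    return st;
}

/* jbd2-like model (assumption): per transaction, the journal blocks go
 * out as plain sync writes and only the commit record is ordered. */
struct toy_stats toy_jbd2_commit(unsigned transactions, unsigned blocks_per_tx)
{
    struct toy_stats st = {0, 0};
    for (unsigned t = 0; t < transactions; t++) {
        for (unsigned b = 0; b < blocks_per_tx; b++)
            toy_submit(&st, TOY_SYNC);
        toy_submit(&st, TOY_SYNC | TOY_PREFLUSH | TOY_FUA);
    }
    return st;
}

/* Print the share of ordered writes for one model. */
void toy_report(const char *name, struct toy_stats st)
{
    printf("%s: %u writes, %u ordered\n", name, st.writes, st.ordered_writes);
}
```

For example, `toy_xfs_log(100)` counts 100 ordered writes out of 100, while `toy_jbd2_commit(4, 25)` counts 4 ordered writes out of 104.  If the real workloads look anything like this, the ordered writes would be a small fraction of the ext4 trace, which might be why blktrace appears to show "almost zero" of them.  How many log buffer writes XFS really issues for the same work is exactly the open question.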