Date: Thu, 25 Feb 2021 07:32:53 +1100
From: Dave Chinner
To: Chandan Babu R
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 7/8 v2] xfs: journal IO cache flush reductions
Message-ID: <20210224203253.GZ4662@dread.disaster.area>
References: <20210223033442.3267258-1-david@fromorbit.com> <20210223033442.3267258-8-david@fromorbit.com> <20210223080503.GW4662@dread.disaster.area> <87sg5lps5z.fsf@garuda>
In-Reply-To: <87sg5lps5z.fsf@garuda>
X-Mailing-List: linux-xfs@vger.kernel.org

On Wed, Feb 24, 2021 at 05:57:20PM +0530, Chandan Babu R wrote:
> On 23 Feb 2021 at 13:35, Dave Chinner wrote:
> > From: Dave Chinner
> >
> > Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
> > guarantee the ordering requirements the journal has w.r.t. metadata
> > writeback. The two ordering constraints are:
> >
> > 1. we cannot overwrite metadata in the journal until we guarantee
> >    that the dirty metadata has been written back in place and is
> >    stable.
> >
> > 2. we cannot write back dirty metadata until it has been written to
> >    the journal and guaranteed to be stable (and hence recoverable) in
> >    the journal.
> >
> > The ordering guarantees of #1 are provided by REQ_PREFLUSH. This
> > causes the journal IO to issue a cache flush and wait for it to
> > complete before issuing the write IO to the journal. Hence all
> > completed metadata IO is guaranteed to be stable before the journal
> > overwrites the old metadata.
> >
> > The ordering guarantees of #2 are provided by the REQ_FUA, which
> > ensures the journal writes do not complete until they are on stable
> > storage. Hence by the time the last journal IO in a checkpoint
> > completes, we know that the entire checkpoint is on stable storage
> > and we can unpin the dirty metadata and allow it to be written back.
> >
> > This is the mechanism by which ordering was first implemented in XFS
> > way back in 2002 by this commit:
> >
> > commit 95d97c36e5155075ba2eb22b17562cfcc53fcf96
> > Author: Steve Lord
> > Date:   Fri May 24 14:30:21 2002 +0000
> >
> >     Add support for drive write cache flushing - should the kernel
> >     have the infrastructure
> >
> > A lot has changed since then, most notably we now use delayed
> > logging to checkpoint the filesystem to the journal rather than
> > write each individual transaction to the journal. Cache flushes on
> > journal IO are necessary when individual transactions are wholly
> > contained within a single iclog. However, CIL checkpoints are single
> > transactions that typically span hundreds to thousands of individual
> > journal writes, and so the requirements for device cache flushing
> > have changed.
> >
> > That is, the ordering rules I state above apply to ordering of
> > atomic transactions recorded in the journal, not to the journal IO
> > itself. Hence we need to ensure metadata is stable before we start
> > writing a new transaction to the journal (guarantee #1), and we need
> > to ensure the entire transaction is stable in the journal before we
> > start metadata writeback (guarantee #2).
> >
> > Hence we only need a REQ_PREFLUSH on the journal IO that starts a
> > new journal transaction to provide #1, and it is not on any other
> > journal IO done within the context of that journal transaction.
> >
> > The CIL checkpoint already issues a cache flush before it starts
> > writing to the log, so we no longer need the iclog IO to issue a
> > REQ_PREFLUSH for us. Hence if XLOG_START_TRANS is passed
> > to xlog_write(), we no longer need to mark the first iclog in
> > the log write with REQ_PREFLUSH for this case.
> >
> > Given the new ordering semantics of commit records for the CIL, we
> > need iclogs containing commit to issue a REQ_PREFLUSH.
> > We also
>
> We flush the data device before writing the first iclog (containing
> XLOG_START_TRANS) to the disk. This satisfies the first ordering constraint
> listed above. Why is it required to have another REQ_PREFLUSH when writing the
> iclog containing XLOG_COMMIT_TRANS? I am guessing that it is required to
> make sure that the previous iclogs (belonging to the same checkpoint
> transaction) have indeed been written to the disk.

Yes, that is correct.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
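
The flush accounting discussed in this thread can be sketched as a small Python model. It is only an illustration under stated assumptions, not the XFS implementation: the constants PREFLUSH and FUA, the Iclog class, and the functions old_flags, new_flags, checkpoint and count_flushes are all hypothetical names made up for this sketch. The quoted text is cut off before it says how the commit iclog itself is made stable, so the sketch assumes that write also carries FUA. The cache flush the CIL checkpoint issues before writing to the log is counted as one explicit flush.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the block layer request flags. The names mirror
# REQ_PREFLUSH and REQ_FUA, but the values are arbitrary.
PREFLUSH = 1 << 0
FUA = 1 << 1


@dataclass
class Iclog:
    """One journal write within a checkpoint (hypothetical model)."""
    has_start: bool = False   # carries the checkpoint's start record
    has_commit: bool = False  # carries the checkpoint's commit record


def old_flags(iclog):
    """Original scheme: every journal IO is issued as PREFLUSH | FUA."""
    return PREFLUSH | FUA


def new_flags(iclog):
    """Reduced scheme as discussed in the thread (sketch only)."""
    flags = 0
    # The start iclog needs no PREFLUSH here: the CIL checkpoint is assumed
    # to have flushed the device cache before the first log write.
    if iclog.has_commit:
        # PREFLUSH makes the earlier iclogs of this checkpoint stable before
        # the commit record is written. FUA on the commit write is an
        # assumption of this sketch, not something the quoted text states.
        flags |= PREFLUSH | FUA
    return flags


def checkpoint(n):
    """Build a checkpoint spanning n iclogs (n >= 1)."""
    iclogs = [Iclog() for _ in range(n)]
    iclogs[0].has_start = True
    iclogs[-1].has_commit = True
    return iclogs


def count_flushes(iclogs, policy, explicit_flushes=0):
    """Count cache flushes: explicit ones plus PREFLUSH-tagged writes."""
    return explicit_flushes + sum(1 for i in iclogs if policy(i) & PREFLUSH)
```

For a checkpoint spanning 100 iclogs, count_flushes(checkpoint(100), old_flags) gives 100 in this model, while count_flushes(checkpoint(100), new_flags, explicit_flushes=1) gives 2: the flush issued before the first log write, and the PREFLUSH on the commit iclog that Chandan's question is about.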