From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-pl1-f177.google.com (mail-pl1-f177.google.com [209.85.214.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id F35EB1D8DEA for ; Wed, 15 Jan 2025 21:41:25 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.177 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736977287; cv=none; b=jwm6ZzWn6NXnI/jXNdNCZNsuu8sGykDekRAVX2Xgm3Ruw7n9OlPGNeenkv0JcdjdS+y0DzXDm4csTQH2YhAE2qlpln+saQA0fpaipjo/fZDJYUHQCHLDSyaRXGPE54eDB9JhEOw3V4KNYJ7v9665+gyrCPQ/gRLu9enzVkpa3bY= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736977287; c=relaxed/simple; bh=UnypkakFh9Sy4d5cl/pyyfhYnwQ3OSP4C3p6B5y8KRg=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=J5NWD+uwQ0IpYDTLxjp0WADF4ewSR+mDLgNJegtB7wJR59XU+pHNw3NniTyYADmtT8diprNo06LkysQokDDSlFIEVvjDVbS0UGJtVTFIv2s150T6CvuFzXb/cTUQYg5FIzTbn8FC7/4gDoNjl9T2CCg71LC75RZ2QQ0+LTmEB/I= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=fromorbit.com; spf=pass smtp.mailfrom=fromorbit.com; dkim=pass (2048-bit key) header.d=fromorbit-com.20230601.gappssmtp.com header.i=@fromorbit-com.20230601.gappssmtp.com header.b=wUSTFnKp; arc=none smtp.client-ip=209.85.214.177 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=fromorbit.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=fromorbit.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=fromorbit-com.20230601.gappssmtp.com header.i=@fromorbit-com.20230601.gappssmtp.com header.b="wUSTFnKp" Received: by mail-pl1-f177.google.com with SMTP id d9443c01a7336-216395e151bso4378405ad.0 for ; Wed, 15 Jan 2025 13:41:25 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fromorbit-com.20230601.gappssmtp.com; s=20230601; t=1736977285; x=1737582085; darn=vger.kernel.org; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=+7RqsypsFMbQh8it8gHq1oAn4Wjjku2FuWrarD2dysU=; b=wUSTFnKpwFW1j3+RLuTMqK/nurhxzk7BOy2etItpFScXGmCwXG4H4C/KvDhFm6imIX JJ+8ejav1ExoROFcbWQUKbGe01fgT4aoNzdUyDKEzQn3+373UWPUyox4UO2rVFiKVBIO iue0daOCuXOdkh/HQapxUgmywoiwOLrtXyXqLj0cD+zi9TJS/htcZDkoB6JDxULh3f/9 36hE7FoT2ZocyPJl3QIFwQpmu8dbXF57wQFnLqv5uWS9S+BbT0doMCGUs8QAu6EhjtWE wFOxj2pmfPiRU14tTsphqk5nytfN2rEJpeyi++T9vDj39Xn9n4JGMmvjODM3CEOG6GV7 6z2A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1736977285; x=1737582085; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=+7RqsypsFMbQh8it8gHq1oAn4Wjjku2FuWrarD2dysU=; b=RLC5+Nj/TeNC8dXxYXuazQPc/ZCEWxRR5A/E2QFARcMo13ijHym4+CDy4Azz9tvOOt D717trrKHes9KWHkFMyupGHSby0UUDy2BbiqdyZZBtwg4blgxXx/7PiNDu9/NKZeUNuK 08TC+0ABkgIBSkD/Mv98v17ZVNlDJ7XBQ9hiEsb+LQ4z/hEznObjX6KwyJXBp1ciXt8/ E1QkBgchPLXkIL2KBVQ4OuyoQ3wr+V1USO6v5KF50BjLRUgB+SRhasWIXT0CtzjqcB23 LAoNmOWt1K2ejAIqDjv27SQfxTZQEImB8I5KY4x5hR3PzGDHVfiKethR0C0RpYb0VFVY K0sA== X-Forwarded-Encrypted: i=1; AJvYcCX2bW+XbfXk7z+LWAI0YHFJJ1bHWczBrBzF6DbUWAskWJdHhZGzM2DxlESxobJdvBuHICgjRaAoL6DzUmM=@vger.kernel.org X-Gm-Message-State: AOJu0Yyq3iejOj1+IhmbFdxqaOubq7v0BM645e6oV9trZgJgDSG8fmNJ sSJu961NHQyf6Z+ikuRcf9/9GHiW38kiLHb/S7s1TpWAQiWOp4yKVncQ7zZy/kk= X-Gm-Gg: ASbGncuiXD3ViwfH6I6bhzYsFdtxKgGShEwLYY8AEDawFggK5EK8bhdkMGWDlN4Of42 46i7QioZ1lxuGVeA9mr40oLU5tUgZjZc7Wh998Wmy4vyJjWjo1bBaQ98Qrg5pzR0iP+mwEcA8fv kOksWloQwI6RI+tM6dHwQay03kaTogZf+EtWLWgGGP9d12sluO+CX8cs6B6Z+H49vm8MF3WSKam 0s03EX5dZXBfnAP7tNMD7zBZwQMzEpvCTRqy+V4WEfKpMifujEIeIOAKGSE/cK3+TapvP3qc5DH ypxNhAD4P1464h4QSYOUeA== X-Google-Smtp-Source: AGHT+IEF9DsGk/eqKSv91kWaf6atrZs2CIMnQO1CTfN7kM9Vq3qR1ckQCkXdRiuQgz0CIbhZuVJZJw== X-Received: by 2002:a17:902:cecf:b0:216:5cc8:44e7 with SMTP id d9443c01a7336-21bf0d31063mr66954335ad.25.1736977285271; Wed, 15 Jan 2025 13:41:25 -0800 (PST) Received: from dread.disaster.area (pa49-186-89-135.pa.vic.optusnet.com.au. [49.186.89.135]) by smtp.gmail.com with ESMTPSA id d9443c01a7336-21a9f219aefsm86190185ad.107.2025.01.15.13.41.24 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 15 Jan 2025 13:41:24 -0800 (PST) Received: from dave by dread.disaster.area with local (Exim 4.98) (envelope-from ) id 1tYB8b-00000006Kqt-449P; Thu, 16 Jan 2025 08:41:21 +1100 Date: Thu, 16 Jan 2025 08:41:21 +1100 From: Dave Chinner To: Christoph Hellwig Cc: Brian Foster , "Darrick J. Wong" , Chi Zhiling , Amir Goldstein , cem@kernel.org, linux-xfs@vger.kernel.org, linux-kernel@vger.kernel.org, Chi Zhiling , John Garry Subject: Re: [PATCH] xfs: Remove i_rwsem lock in buffered read Message-ID: References: <24b1edfc-2b78-434d-825c-89708d9589b7@163.com> <953b0499-5832-49dc-8580-436cf625db8c@163.com> <20250108173547.GI1306365@frogsfrogsfrogs> <20250113024401.GU1306365@frogsfrogsfrogs> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: On Tue, Jan 14, 2025 at 09:55:21PM -0800, Christoph Hellwig wrote: > On Mon, Jan 13, 2025 at 08:40:51AM -0500, Brian Foster wrote: > > Sorry if this is out of left field as I haven't followed the discussion > > closely, but I presumed one of the reasons Darrick and Christoph raised > > the idea of using the folio batch thing I'm playing around with on zero > > range for buffered writes would be to acquire and lock all targeted > > folios up front. If so, would that help with what you're trying to > > achieve here? (If not, nothing to see here, move along.. ;). > > I mostly thought about acquiring, as locking doesn't really have much > batching effects. That being said, no that you got the idea in my mind > here's my early morning brainfart on it: > > Let's ignore DIRECT I/O for the first step. In that case lookup / > allocation and locking all folios for write before copying data will > remove the need for i_rwsem in the read and write path. In a way that > sounds perfect, and given that btrfs already does that (although in a > very convoluted way) we know it's possible. Yes, this seems like a sane, general approach to allowing concurrent buffered writes (and reads). > But direct I/O throws a big monkey wrench here as already mentioned by > others. Now one interesting thing some file systems have done is > to serialize buffered against direct I/O, either by waiting for one > to finish, or by simply forcing buffered I/O when direct I/O would > conflict. Right. We really don't want to downgrade to buffered IO if we can help it, though. > It's easy to detect outstanding direct I/O using i_dio_count > so buffered I/O could wait for that, and downgrading to buffered I/O > (potentially using the new uncached mode from Jens) if there are any > pages on the mapping after the invalidation also sounds pretty doable. It's much harder to sanely serialise DIO against buffered writes this way, because i_dio_count only forms a submission barrier in conjunction with the i_rwsem being held exclusively. e.g. ongoing DIO would result in the buffered write being indefinitely delayed. I think the model and method that bcachefs uses is probably the best way to move forward - the "two-state exclusive shared" lock which it uses to do buffered vs direct exclusion is a simple, easy way to handle this problem. The same-state shared locking fast path is a single atomic cmpxchg operation, so it has neglible extra overhead compared to using a rwsem in the shared DIO fast path. The lock also has non-owner semantics, so DIO can take it during submission and then drop it during IO completion. This solves the problem we currently use the i_rwsem and inode_dio_{start,end/wait}() to solve (i.e. create a DIO submission barrier and waiting for all existing DIO to drain). IOWs, a two-state shared lock provides the mechanism to allow DIO to be done without holding the i_rwsem at all, as well as being able to elide two atomic operations per DIO to track in-flight DIOs. We'd get this whilst maintaining buffered/DIO coherency without adding any new overhead to the DIO path, and allow concurrent buffered reads and writes that have their atomicity defined by the batched folio locking strategy that Brian is working on... This only leaves DIO coherency issues with mmap() based IO as an issue, but that's a problem for a different day... > I don't really have time to turn this hand waving into, but maybe we > should think if it's worthwhile or if I'm missing something important. If people are OK with XFS moving to exclusive buffered or DIO submission model, then I can find some time to work on the converting the IO path locking to use a two-state shared lock in preparation for the batched folio stuff that will allow concurrent buffered writes... -Dave. -- Dave Chinner david@fromorbit.com