From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 29 Mar 2024 18:03:21 -0700
Subject: Re: [PATCH 19/26] netfs: New writeback implementation
To: David Howells, Christian Brauner, Jeff Layton, Gao
 Xiang, Dominique Martinet
Cc: Matthew Wilcox, Steve French, Marc Dionne, Paulo Alcantara,
 Shyam Prasad N, Tom Talpey, Eric Van Hensbergen, Ilya Dryomov,
 netfs@lists.linux.dev, linux-cachefs@redhat.com, linux-afs@lists.infradead.org,
 linux-cifs@vger.kernel.org, linux-nfs@vger.kernel.org, ceph-devel@vger.kernel.org,
 v9fs@lists.linux.dev, linux-erofs@lists.ozlabs.org, linux-fsdevel@vger.kernel.org,
 linux-mm@kvack.org, netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
 Latchesar Ionkov, Christian Schoenebeck
From: Vadim Fedorenko
In-Reply-To: <20240328163424.2781320-20-dhowells@redhat.com>
References: <20240328163424.2781320-1-dhowells@redhat.com>
 <20240328163424.2781320-20-dhowells@redhat.com>

On 28/03/2024 16:34, David Howells wrote:
> The current netfslib writeback implementation creates writeback requests of
> contiguous folio data and then separately tiles subrequests over the space
> twice, once for the server and once for the cache. This creates a few
> issues:
>
> (1) Every time there's a discontiguity or a change between writing to only
>     one destination or writing to both, it must create a new request.
>     This makes it harder to do vectored writes.
>
> (2) The folios don't have the writeback mark removed until the end of the
>     request - and a request could be hundreds of megabytes.
>
> (3) In future, I want to support a larger cache granularity, which will
>     require aggregation of some folios that contain unmodified data (which
>     only need to go to the cache) and some which contain modifications
>     (which need to be uploaded and stored to the cache) - but, currently,
>     these are treated as discontiguous.
>
> There's also a move to get everyone to use writeback_iter() to extract
> writable folios from the pagecache. That said, currently writeback_iter()
> has some issues that make it less than ideal:
>
> (1) there's no way to cancel the iteration, even if you find a "temporary"
>     error that means the current folio and all subsequent folios are going
>     to fail;
>
> (2) there's no way to filter the folios being written back - something
>     that will impact Ceph with its ordered snap system;
>
> (3) and if you get a folio you can't immediately deal with (say you need
>     to flush the preceding writes), you are left with a folio hanging in
>     the locked state for the duration, when really we should unlock it and
>     relock it later.
>
> In this new implementation, I use writeback_iter() to pump folios,
> progressively creating two parallel, but separate streams and cleaning up
> the finished folios as the subrequests complete. Either or both streams
> can contain gaps, and the subrequests in each stream can be of variable
> size, don't need to align with each other and don't need to align with the
> folios.
>
> Indeed, subrequests can cross folio boundaries, may cover several folios or
> a folio may be spanned by multiple subrequests, e.g.:
>
>          +---+---+-----+-----+---+----------+
> Folios:  |   |   |     |     |   |          |
>          +---+---+-----+-----+---+----------+
>
>            +------+------+     +----+----+
> Upload:    |      |      |.....|    |    |
>            +------+------+     +----+----+
>
>          +------+------+------+------+------+
> Cache:   |      |      |      |      |      |
>          +------+------+------+------+------+
>
> The progressive subrequest construction permits the algorithm to be
> preparing both the next upload to the server and the next write to the
> cache whilst the previous ones are already in progress. Throttling can be
> applied to control the rate of production of subrequests - and, in any
> case, we probably want to write them to the server in ascending order,
> particularly if the file will be extended.
>
> Content crypto can also be prepared at the same time as the subrequests and
> run asynchronously, with the prepped requests being stalled until the
> crypto catches up with them. This might also be useful for transport
> crypto, but that happens at a lower layer, so probably would be harder to
> pull off.
>
> The algorithm is split into three parts:
>
> (1) The issuer. This walks through the data, packaging it up, encrypting
>     it and creating subrequests. The part of this that generates
>     subrequests only deals with file positions and spans and so is usable
>     for DIO/unbuffered writes as well as buffered writes.
>
> (2) The collector. This asynchronously collects completed subrequests,
>     unlocks folios, frees crypto buffers and performs any retries. This
>     runs in a work queue so that the issuer can return to the caller for
>     writeback (so that the VM can have its kswapd thread back) or async
>     writes.
>
> (3) The retryer. This pauses the issuer, waits for all outstanding
>     subrequests to complete and then goes through the failed subrequests
>     to reissue them.
>     This may involve reprepping them (with cifs, the credits must be
>     renegotiated, and a subrequest may need splitting), and doing RMW for
>     content crypto if there's a conflicting change on the server.
>
> [!] Note that some of the functions are prefixed with "new_" to avoid
> clashes with existing functions. These will be renamed in a later patch
> that cuts over to the new algorithm.
>
> Signed-off-by: David Howells
> cc: Jeff Layton
> cc: Eric Van Hensbergen
> cc: Latchesar Ionkov
> cc: Dominique Martinet
> cc: Christian Schoenebeck
> cc: Marc Dionne
> cc: v9fs@lists.linux.dev
> cc: linux-afs@lists.infradead.org
> cc: netfs@lists.linux.dev
> cc: linux-fsdevel@vger.kernel.org

[..snip..]

> +/*
> + * Begin a write operation for writing through the pagecache.
> + */
> +struct netfs_io_request *new_netfs_begin_writethrough(struct kiocb *iocb, size_t len)
> +{
> +        struct netfs_io_request *wreq = NULL;
> +        struct netfs_inode *ictx = netfs_inode(file_inode(iocb->ki_filp));
> +
> +        mutex_lock(&ictx->wb_lock);
> +
> +        wreq = netfs_create_write_req(iocb->ki_filp->f_mapping, iocb->ki_filp,
> +                                      iocb->ki_pos, NETFS_WRITETHROUGH);
> +        if (IS_ERR(wreq))
> +                mutex_unlock(&ictx->wb_lock);
> +
> +        wreq->io_streams[0].avail = true;

If IS_ERR(wreq) is true, execution falls straight through and this
dereference of an error pointer is invalid; the error path needs to bail
out before this point (rough sketch at the end of this mail).

> +        trace_netfs_write(wreq, netfs_write_trace_writethrough);

I'm also not sure we still want the trace call if we took the error path.

> +        return wreq;
> +}
> +

[..snip..]
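For what it's worth, here is a rough, untested sketch of the kind of early
return I mean, based only on the hunk quoted above and assuming the callers
of new_netfs_begin_writethrough() already cope with an ERR_PTR return
(which the existing IS_ERR() check suggests they should):

struct netfs_io_request *new_netfs_begin_writethrough(struct kiocb *iocb, size_t len)
{
        struct netfs_io_request *wreq = NULL;
        struct netfs_inode *ictx = netfs_inode(file_inode(iocb->ki_filp));

        mutex_lock(&ictx->wb_lock);

        wreq = netfs_create_write_req(iocb->ki_filp->f_mapping, iocb->ki_filp,
                                      iocb->ki_pos, NETFS_WRITETHROUGH);
        if (IS_ERR(wreq)) {
                /* Drop the lock and hand the ERR_PTR back to the caller
                 * before wreq is ever dereferenced. */
                mutex_unlock(&ictx->wb_lock);
                return wreq;
        }

        wreq->io_streams[0].avail = true;
        trace_netfs_write(wreq, netfs_write_trace_writethrough);
        return wreq;
}

That also keeps the io_streams[] setup and the tracepoint on the success
path only.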