From: Josef Bacik
To: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH 2/3] btrfs: add a comment describing delalloc space reservation
Date: Mon, 3 Feb 2020 15:44:35 -0500
Message-Id: <20200203204436.517473-3-josef@toxicpanda.com>
In-Reply-To: <20200203204436.517473-1-josef@toxicpanda.com>
References: <20200203204436.517473-1-josef@toxicpanda.com>
X-Mailing-List: linux-btrfs@vger.kernel.org

delalloc space reservation is tricky because it encompasses both data and
metadata. Make it clear what each side does, the general flow of how space
is moved throughout the lifetime of a write, and what goes into the
calculations.

Signed-off-by: Josef Bacik
---
 fs/btrfs/delalloc-space.c | 90 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 90 insertions(+)

diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
index c13d8609cc99..09a9c01fc1b5 100644
--- a/fs/btrfs/delalloc-space.c
+++ b/fs/btrfs/delalloc-space.c
@@ -9,6 +9,96 @@
 #include "qgroup.h"
 #include "block-group.h"
 
+/*
+ * HOW DOES THIS WORK
+ *
+ * There are two stages to data reservations, one for data and one for metadata
+ * to handle the new extents and checksums generated by writing data.
+ *
+ *
+ * DATA RESERVATION
+ * The data reservation stuff is relatively straightforward. We want X bytes,
+ * and thus need to make sure we have X bytes free in data space in order to
+ * write that data. If there are not X bytes free, allocate data chunks until
+ * we can satisfy that reservation. If we can no longer allocate data chunks,
+ * attempt to flush space to see if we can now make the reservation. See the
+ * comment for data_flush_states to see how that flushing is accomplished.
+ *
+ * Once this space is reserved, it is added to space_info->bytes_may_use. The
+ * caller must keep track of this reservation and free it up if it is never
+ * used. With the buffered IO case this is handled via the EXTENT_DELALLOC
+ * bits on the inode's io_tree. For direct IO it's more straightforward: we
+ * take the reservation at the start of the operation, and if we write less
+ * than we reserved we free the excess.
+ *
+ * For the buffered case our reservation will take one of two paths:
+ *
+ * 1) It is allocated. In find_free_extent() we will call
+ * btrfs_add_reserved_bytes() with the size of the extent we made, along with
+ * the size that we are covering with this allocation. For non-compressed
+ * these will be the same thing, but for compressed they could be different.
+ * In any case, we increase space_info->bytes_reserved by the extent size, and
+ * reduce the space_info->bytes_may_use by the ram_bytes size. From now on
+ * the handling of this reserved space is the responsibility of the ordered
+ * extent or the cow path.
+ *
+ * 2) There is an error, and we free it. This is handled with the
+ * EXTENT_CLEAR_DATA_RESV bit when clearing EXTENT_DELALLOC on the inode's
+ * io_tree.
+ *
+ * METADATA RESERVATION
+ * The general metadata reservation lifetimes are discussed elsewhere; this
+ * will just focus on how it is used for delalloc space.
+ *
+ * There are 3 things we are keeping reservations for.
+ *
+ * 1) Updating the inode item. We hold a reservation for this inode as long
+ * as there are dirty bytes outstanding for this inode. This is because we
+ * may update the inode multiple times throughout an operation, and there is
+ * no telling when we may have to do a full cow back to that inode item. Thus
+ * we must always hold a reservation.
+ *
+ * 2) Adding an extent item. This is trickier, so a few sub-points:
+ *
+ * a) We keep track of how many extents an inode may need to create in
+ * inode->outstanding_extents. This is how many items we will have reserved
+ * for the extents for this inode.
+ *
+ * b) count_max_extents() is used to figure out how many extent items we
+ * will need based on the contiguous area we have dirtied. Thus if we are
+ * writing 4k extents but they coalesce into a very large extent, we will
+ * break this into smaller extents which means we'll need a reservation for
+ * each of those extents.
+ *
+ * c) When we set EXTENT_DELALLOC on the inode io_tree we will figure out
+ * the number of extents needed for the contiguous area we just created,
+ * and add that to inode->outstanding_extents.
+ *
+ * d) We have no idea at reservation time how this new extent fits into
+ * existing extents. We unconditionally use count_max_extents() on the
+ * reservation we are currently doing. The reservation _must_ use
+ * btrfs_delalloc_release_extents() once it has done its work to clear up
+ * these outstanding extents. This means that we will transiently have more
+ * extent reservations for this inode than we need. For example, say we
+ * have a clean inode, and we do a buffered write of 4k. The reservation
+ * code will raise outstanding_extents to 1, and then set_delalloc will
+ * increase it to 2. Then once we are finished,
+ * btrfs_delalloc_release_extents() will drop it back down to 1 again.
+ *
+ * e) Ordered extents take on the responsibility of their extent. We know
+ * that the ordered extent represents a single inode item, so it will modify
+ * ->outstanding_extents by 1, and will clear delalloc which will adjust the
+ * ->outstanding_extents by whatever value it needs to be adjusted to. Once
+ * the ordered io is finished we drop the ->outstanding_extents by 1 and if
+ * we are 0 we drop our inode item reservation as well.
+ *
+ * 3) Adding csums for the range. This is more straightforward than the
+ * extent items, as we just want to hold the number of bytes we'll need for
+ * checksums until the ordered extent is removed. If there is an error it is
+ * cleared via the EXTENT_CLEAR_META_RESV bit when clearing EXTENT_DELALLOC
+ * on the inode io_tree.
+ */
+
 int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes)
 {
 	struct btrfs_root *root = inode->root;
-- 
2.24.1
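
To make the outstanding_extents walk-through in sub-point 2d easier to follow, here is a minimal userspace sketch of the counter going 1 -> 2 -> 1 for a 4k buffered write on a clean inode. It is not kernel code and not part of the patch: every model_* name is hypothetical, the struct is a toy, and the 128 MiB cap is an assumption standing in for the maximum extent size that count_max_extents() divides by. The real helpers also adjust the metadata reservation, which this model leaves out.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: stand-in for the kernel's maximum extent size (128 MiB). */
#define MODEL_MAX_EXTENT_SIZE (128ULL * 1024 * 1024)

/* Hypothetical toy inode: only the counter the comment describes. */
struct model_inode {
	unsigned int outstanding_extents;
};

/* Number of extents needed to cover 'size' bytes, rounding up. */
static unsigned int model_count_max_extents(uint64_t size)
{
	return (unsigned int)((size + MODEL_MAX_EXTENT_SIZE - 1) /
			      MODEL_MAX_EXTENT_SIZE);
}

/* Reservation time: unconditionally count extents for this reservation. */
static void model_reserve(struct model_inode *inode, uint64_t bytes)
{
	inode->outstanding_extents += model_count_max_extents(bytes);
}

/* EXTENT_DELALLOC is set: add the extents for the new contiguous area. */
static void model_set_delalloc(struct model_inode *inode, uint64_t bytes)
{
	inode->outstanding_extents += model_count_max_extents(bytes);
}

/* The reserver is done: drop the transient count taken in model_reserve(). */
static void model_release_extents(struct model_inode *inode, uint64_t bytes)
{
	inode->outstanding_extents -= model_count_max_extents(bytes);
}

/* Walk one buffered write and record the counter after each step. */
static void model_buffered_write(struct model_inode *inode, uint64_t bytes,
				 unsigned int trace[3])
{
	model_reserve(inode, bytes);
	trace[0] = inode->outstanding_extents;

	model_set_delalloc(inode, bytes);
	trace[1] = inode->outstanding_extents;

	model_release_extents(inode, bytes);
	trace[2] = inode->outstanding_extents;
}
```

Calling model_buffered_write() on a zeroed model_inode with bytes = 4096 fills trace with 1, 2, 1, which is the sequence described in 2d. A write one byte larger than the assumed cap would count as two extents at each step.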