From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-qk1-f180.google.com (mail-qk1-f180.google.com [209.85.222.180]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 52C8518755F for ; Mon, 26 Aug 2024 14:49:50 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.222.180 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1724683792; cv=none; b=oftKp88aLzFFmHDqW2dJ90iJtHBCsg4ILA60f9MhXmKc4VHVfurzRTSEve16sHMOYrzo7vb7RUr24o1l8RVt0Q5qu2Rn9NbOJvaDwN7bUX5c7uJ3f43B7YLNBReVUg3GEbj472oxSN9+Vhvm6RYLN4pASyQHeQ/9So8LhCepBOg= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1724683792; c=relaxed/simple; bh=Xr+cw//Uhr7B46vQyK/ho7Ff738pS52+HyfM8ZUJP6c=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=RaNVpChGzLE/gO2JjAbkc+uL926XxRicaILcwdeQ5jUL0nDXqqBvv0XY/Bkx/wYL/dw26P9SijHif4HqNIvwBzhL9UeDlnl1lzYt0t3w//dYGvzmyIJFuqN99Os5hzN4uG7rDAm6KOX+P8Xy6ihbP9N4on8GHjitO4SzKp7vBwo= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=toxicpanda.com; spf=none smtp.mailfrom=toxicpanda.com; dkim=pass (2048-bit key) header.d=toxicpanda-com.20230601.gappssmtp.com header.i=@toxicpanda-com.20230601.gappssmtp.com header.b=SFXfonh6; arc=none smtp.client-ip=209.85.222.180 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=toxicpanda.com Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=toxicpanda.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=toxicpanda-com.20230601.gappssmtp.com header.i=@toxicpanda-com.20230601.gappssmtp.com header.b="SFXfonh6" Received: by mail-qk1-f180.google.com with SMTP id af79cd13be357-7a1dd2004e1so298233085a.3 for ; Mon, 26 Aug 2024 07:49:49 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=toxicpanda-com.20230601.gappssmtp.com; s=20230601; t=1724683789; x=1725288589; darn=vger.kernel.org; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=xcpcZ/ZoAkdTw2x0sl/WNy1SqcZeX1HC28pG2oEIg38=; b=SFXfonh63D3JKS6BdePy3foSgB1M7Pb81pwJDdt2BSEcm2bGTsty5JjuTFS2X2aOuJ IMpsQqRmC0yEPUwObVCp0R3ZoicS9NfltQSPN1khtGnj+sLFChiTyA8o939cD3J8iKwU 114f8omcCH6fMQkR4KTCMoWxgSy15UBWVxq7oDCGXlfT9xe8aBdeIs3mTxbXJ/ZsYGZx hOb6MUpPWBl5RLfHsGseHEIh2+ipx6K+KSyi3XOOgWvIUmogxWy3afN7wW834D/K8cfy Vj5pKTDXKq8irNqP9UfmmFxnydrU6wNXkV+w7VlfNtn85b1UVI0GsUZjl807T0p4J/ay yuVw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1724683789; x=1725288589; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=xcpcZ/ZoAkdTw2x0sl/WNy1SqcZeX1HC28pG2oEIg38=; b=QlcfMFnFC12fFJleMi4UmT6zaUuHqYOxgprUEDdXHGwqj/Et4IrGS1whuUXA9YJiCW Qx40Bgc6o12doOIhi380VBj0b3eIo3b0DiEBQWTwLOxRTz7Lf5SkNoh2VYO+FL7pvRps nqVz04M05Vp/9TrGARMJr7QhgRJErDXbk0i2frGG/mKbmQrRJVmqTSOe7MT4r7l7uS2/ vYGSmvmKw/GY9oFf5A34Hf8flSktTlaglSQzhi/jH2WTK9zNR8QH4vaLNT/QzDC116Ka m0QqtsehJUD31rp4feBsUqtgqvzNVRu1UtijCor95u6iA32xVNLAkQyycv1hk45oEx+l 1Ebw== X-Gm-Message-State: AOJu0YzBF0OtSJkCgvZ2S+9PGcWcAHz5Cg0CkZ3G+ashmv1gFlh2xc8j eYlIfMD2VhCW/nNq68z87pTwqaTcr+DPCrWwZml9FSffnxhdm9UiY5PGO9cPvY7athH3LRhG88J P X-Google-Smtp-Source: AGHT+IGhGjt05TalWyccZHDlgmfW7e2eAVZdPR8xJVSQQG+JnC5YP9dWDx3R4IkKYhvYcH/BhJpqkg== X-Received: by 2002:a05:620a:46ab:b0:7a3:78ac:827c with SMTP id af79cd13be357-7a6896e3a0bmr1147399385a.10.1724683788790; Mon, 26 Aug 2024 07:49:48 -0700 (PDT) Received: from localhost (syn-076-182-020-124.res.spectrum.com. [76.182.20.124]) by smtp.gmail.com with ESMTPSA id af79cd13be357-7a67f3471aasm462365185a.39.2024.08.26.07.49.47 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 26 Aug 2024 07:49:48 -0700 (PDT) Date: Mon, 26 Aug 2024 10:49:47 -0400 From: Josef Bacik To: Goldwyn Rodrigues Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com Subject: Re: [PATCH 3/3] btrfs: do not hold the extent lock for entire read Message-ID: <20240826144947.GF2393039@perftesting> References: <174b23dd8cf1a03d54f5ca3dc03c3f5011988d8a.1724255475.git.josef@toxicpanda.com> Precedence: bulk X-Mailing-List: linux-btrfs@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: On Sat, Aug 24, 2024 at 12:03:47PM -0400, Goldwyn Rodrigues wrote: > On 12:02 21/08, Josef Bacik wrote: > > Historically we've held the extent lock throughout the entire read. > > There's been a few reasons for this, but it's mostly just caused us > > problems. For example, this prevents us from allowing page faults > > during direct io reads, because we could deadlock. This has forced us > > to only allow 4k reads at a time for io_uring NOWAIT requests because we > > have no idea if we'll be forced to page fault and thus have to do a > > whole lot of work. > > > > On the buffered side we are protected by the page lock, as long as we're > > reading things like buffered writes, punch hole, and even direct IO to a > > certain degree will get hung up on the page lock while the page is in > > flight. > > > > On the direct side we have the dio extent lock, which acts much like the > > way the extent lock worked previously to this patch, however just for > > direct reads. This protects direct reads from concurrent direct writes, > > while we're protected from buffered writes via the inode lock. > > > > Now that we're protected in all cases, narrow the extent lock to the > > part where we're getting the extent map to submit the reads, no longer > > holding the extent lock for the entire read operation. > > > > Signed-off-by: Josef Bacik > > --- > > fs/btrfs/compression.c | 2 +- > > fs/btrfs/direct-io.c | 51 ++++++++++++------------ > > fs/btrfs/extent_io.c | 88 +++--------------------------------------- > > 3 files changed, 31 insertions(+), 110 deletions(-) > > > > diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c > > index 832ab8984c41..511f81f312af 100644 > > --- a/fs/btrfs/compression.c > > +++ b/fs/btrfs/compression.c > > @@ -521,6 +521,7 @@ static noinline int add_ra_bio_pages(struct inode *inode, > > } > > add_size = min(em->start + em->len, page_end + 1) - cur; > > free_extent_map(em); > > + unlock_extent(tree, cur, page_end, NULL); > > > > if (folio->index == end_index) { > > size_t zero_offset = offset_in_folio(folio, isize); > > @@ -534,7 +535,6 @@ static noinline int add_ra_bio_pages(struct inode *inode, > > > > if (!bio_add_folio(orig_bio, folio, add_size, > > offset_in_folio(folio, cur))) { > > - unlock_extent(tree, cur, page_end, NULL); > > folio_unlock(folio); > > folio_put(folio); > > break; > > diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c > > index 576f469cacee..66f1ce5fcfd2 100644 > > --- a/fs/btrfs/direct-io.c > > +++ b/fs/btrfs/direct-io.c > > @@ -366,7 +366,7 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start, > > int ret = 0; > > u64 len = length; > > const u64 data_alloc_len = length; > > - bool unlock_extents = false; > > + u32 unlock_bits = EXTENT_LOCKED; > > > > /* > > * We could potentially fault if we have a buffer > PAGE_SIZE, and if > > @@ -527,7 +527,6 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start, > > start, &len, flags); > > if (ret < 0) > > goto unlock_err; > > - unlock_extents = true; > > /* Recalc len in case the new em is smaller than requested */ > > len = min(len, em->len - (start - em->start)); > > if (dio_data->data_space_reserved) { > > @@ -548,23 +547,8 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start, > > release_offset, > > release_len); > > } > > - } else { > > - /* > > - * We need to unlock only the end area that we aren't using. > > - * The rest is going to be unlocked by the endio routine. > > - */ > > - lockstart = start + len; > > - if (lockstart < lockend) > > - unlock_extents = true; > > } > > > > - if (unlock_extents) > > - clear_extent_bit(&BTRFS_I(inode)->io_tree, lockstart, lockend, > > - EXTENT_LOCKED | EXTENT_DIO_LOCKED, > > - &cached_state); > > - else > > - free_extent_state(cached_state); > > - > > /* > > * Translate extent map information to iomap. > > * We trim the extents (and move the addr) even though iomap code does > > @@ -583,6 +567,23 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start, > > iomap->length = len; > > free_extent_map(em); > > > > + /* > > + * Reads will hold the EXTENT_DIO_LOCKED bit until the io is completed, > > + * writes only hold it for this part. We hold the extent lock until > > + * we're completely done with the extent map to make sure it remains > > + * valid. > > + */ > > + if (write) > > + unlock_bits |= EXTENT_DIO_LOCKED; > > + > > + clear_extent_bit(&BTRFS_I(inode)->io_tree, lockstart, lockend, > > + unlock_bits, &cached_state); > > + > > + /* We didn't use everything, unlock the dio extent for the remainder. */ > > + if (!write && (start + len) < lockend) > > + unlock_dio_extent(&BTRFS_I(inode)->io_tree, start + len, > > + lockend, NULL); > > + > > return 0; > > > > unlock_err: > > @@ -615,9 +616,8 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length, > > > > if (!write && (iomap->type == IOMAP_HOLE)) { > > /* If reading from a hole, unlock and return */ > > - clear_extent_bit(&BTRFS_I(inode)->io_tree, pos, > > - pos + length - 1, > > - EXTENT_LOCKED | EXTENT_DIO_LOCKED, NULL); > > + unlock_dio_extent(&BTRFS_I(inode)->io_tree, pos, > > + pos + length - 1, NULL); > > return 0; > > } > > > > @@ -628,10 +628,8 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length, > > btrfs_finish_ordered_extent(dio_data->ordered, NULL, > > pos, length, false); > > else > > - clear_extent_bit(&BTRFS_I(inode)->io_tree, pos, > > - pos + length - 1, > > - EXTENT_LOCKED | EXTENT_DIO_LOCKED, > > - NULL); > > + unlock_dio_extent(&BTRFS_I(inode)->io_tree, pos, > > + pos + length - 1, NULL); > > ret = -ENOTBLK; > > } > > if (write) { > > @@ -663,9 +661,8 @@ static void btrfs_dio_end_io(struct btrfs_bio *bbio) > > dip->file_offset, dip->bytes, > > !bio->bi_status); > > } else { > > - clear_extent_bit(&inode->io_tree, dip->file_offset, > > - dip->file_offset + dip->bytes - 1, > > - EXTENT_LOCKED | EXTENT_DIO_LOCKED, NULL); > > + unlock_dio_extent(&inode->io_tree, dip->file_offset, > > + dip->file_offset + dip->bytes - 1, NULL); > > } > > > > bbio->bio.bi_private = bbio->private; > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c > > index 822e2bf8bc99..3f3baec23717 100644 > > --- a/fs/btrfs/extent_io.c > > +++ b/fs/btrfs/extent_io.c > > @@ -480,75 +480,6 @@ static void end_bbio_data_write(struct btrfs_bio *bbio) > > bio_put(bio); > > } > > > > -/* > > - * Record previously processed extent range > > - * > > - * For endio_readpage_release_extent() to handle a full extent range, reducing > > - * the extent io operations. > > - */ > > -struct processed_extent { > > - struct btrfs_inode *inode; > > - /* Start of the range in @inode */ > > - u64 start; > > - /* End of the range in @inode */ > > - u64 end; > > - bool uptodate; > > -}; > > - > > -/* > > - * Try to release processed extent range > > - * > > - * May not release the extent range right now if the current range is > > - * contiguous to processed extent. > > - * > > - * Will release processed extent when any of @inode, @uptodate, the range is > > - * no longer contiguous to the processed range. > > - * > > - * Passing @inode == NULL will force processed extent to be released. > > - */ > > -static void endio_readpage_release_extent(struct processed_extent *processed, > > - struct btrfs_inode *inode, u64 start, u64 end, > > - bool uptodate) > > -{ > > - struct extent_state *cached = NULL; > > - struct extent_io_tree *tree; > > - > > - /* The first extent, initialize @processed */ > > - if (!processed->inode) > > - goto update; > > - > > - /* > > - * Contiguous to processed extent, just uptodate the end. > > - * > > - * Several things to notice: > > - * > > - * - bio can be merged as long as on-disk bytenr is contiguous > > - * This means we can have page belonging to other inodes, thus need to > > - * check if the inode still matches. > > - * - bvec can contain range beyond current page for multi-page bvec > > - * Thus we need to do processed->end + 1 >= start check > > - */ > > - if (processed->inode == inode && processed->uptodate == uptodate && > > - processed->end + 1 >= start && end >= processed->end) { > > - processed->end = end; > > - return; > > - } > > - > > - tree = &processed->inode->io_tree; > > - /* > > - * Now we don't have range contiguous to the processed range, release > > - * the processed range now. > > - */ > > - unlock_extent(tree, processed->start, processed->end, &cached); > > - > > -update: > > - /* Update processed to current range */ > > - processed->inode = inode; > > - processed->start = start; > > - processed->end = end; > > - processed->uptodate = uptodate; > > -} > > - > > static void begin_folio_read(struct btrfs_fs_info *fs_info, struct folio *folio) > > { > > ASSERT(folio_test_locked(folio)); > > @@ -575,7 +506,6 @@ static void end_bbio_data_read(struct btrfs_bio *bbio) > > { > > struct btrfs_fs_info *fs_info = bbio->fs_info; > > struct bio *bio = &bbio->bio; > > - struct processed_extent processed = { 0 }; > > struct folio_iter fi; > > const u32 sectorsize = fs_info->sectorsize; > > > > @@ -640,11 +570,7 @@ static void end_bbio_data_read(struct btrfs_bio *bbio) > > > > /* Update page status and unlock. */ > > end_folio_read(folio, uptodate, start, len); > > - endio_readpage_release_extent(&processed, BTRFS_I(inode), > > - start, end, uptodate); > > } > > - /* Release the last extent */ > > - endio_readpage_release_extent(&processed, NULL, 0, 0, false); > > bio_put(bio); > > } > > > > @@ -1019,11 +945,9 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached, > > size_t pg_offset = 0; > > size_t iosize; > > size_t blocksize = fs_info->sectorsize; > > - struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree; > > > > ret = set_folio_extent_mapped(folio); > > if (ret < 0) { > > - unlock_extent(tree, start, end, NULL); > > folio_unlock(folio); > > return ret; > > } > > @@ -1047,14 +971,12 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached, > > if (cur >= last_byte) { > > iosize = folio_size(folio) - pg_offset; > > folio_zero_range(folio, pg_offset, iosize); > > - unlock_extent(tree, cur, cur + iosize - 1, NULL); > > end_folio_read(folio, true, cur, iosize); > > break; > > } > > em = __get_extent_map(inode, folio, cur, end - cur + 1, > > em_cached); > > if (IS_ERR(em)) { > > - unlock_extent(tree, cur, end, NULL); > > end_folio_read(folio, false, cur, end + 1 - cur); > > return PTR_ERR(em); > > } > > @@ -1123,7 +1045,6 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached, > > if (block_start == EXTENT_MAP_HOLE) { > > folio_zero_range(folio, pg_offset, iosize); > > > > - unlock_extent(tree, cur, cur + iosize - 1, NULL); > > end_folio_read(folio, true, cur, iosize); > > cur = cur + iosize; > > pg_offset += iosize; > > @@ -1131,7 +1052,6 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached, > > } > > /* the get_extent function already copied into the folio */ > > if (block_start == EXTENT_MAP_INLINE) { > > - unlock_extent(tree, cur, cur + iosize - 1, NULL); > > end_folio_read(folio, true, cur, iosize); > > cur = cur + iosize; > > pg_offset += iosize; > > @@ -1161,11 +1081,13 @@ int btrfs_read_folio(struct file *file, struct folio *folio) > > u64 end = start + folio_size(folio) - 1; > > struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ }; > > struct extent_map *em_cached = NULL; > > + struct extent_state *cached = NULL; > > int ret; > > > > - btrfs_lock_and_flush_ordered_range(inode, start, end, NULL); > > + btrfs_lock_and_flush_ordered_range(inode, start, end, &cached); > > > > ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl, NULL); > > + unlock_extent(&inode->io_tree, start, end, &cached); > > free_extent_map(em_cached); > > > > /* > > @@ -2350,13 +2272,15 @@ void btrfs_readahead(struct readahead_control *rac) > > u64 start = readahead_pos(rac); > > u64 end = start + readahead_length(rac) - 1; > > struct extent_map *em_cached = NULL; > > + struct extent_state *cached = NULL; > > u64 prev_em_start = (u64)-1; > > > > - btrfs_lock_and_flush_ordered_range(inode, start, end, NULL); > > + btrfs_lock_and_flush_ordered_range(inode, start, end, &cached);; > > > > while ((folio = readahead_folio(rac)) != NULL) > > btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start); > > > > + unlock_extent(&inode->io_tree, start, end, &cached); > > if (em_cached) > > free_extent_map(em_cached); > > submit_one_bio(&bio_ctrl); > > I think you can combine the patch I have sent a couple of days back with > this one. It will reduce the scope to extent locking-unlocking still > further and limit it to fetching extent_map within > __get_extent_map(). I have tested the following patch on top of this > one, and it passes xfstests. If not, I can send it as a part of my next > series. Ok I'll integrate this and re-send. Thanks! Josef