Date: Mon, 27 Apr 2026 16:49:28 +0200 (CEST)
X-Mailing-List: linux-raid@vger.kernel.org
Subject: Re: [PATCH] md/raid1: fix bio splitting in raid1 thread to avoid recursion and deadlock
To: Abd-Alrhman Masalkhi
References: <20260427103446.300378-1-abd.masalkhi@gmail.com>
Cc: song@kernel.org, yukuai@fnnas.com, shli@fb.com, neilb@suse.com, linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org
From: Paul Menzel
In-Reply-To: <20260427103446.300378-1-abd.masalkhi@gmail.com>

Dear Abd-Alrhman,

Thank you for your patch.

On 27.04.26 at 12:34, Abd-Alrhman Masalkhi wrote:

> Splitting a bio while executing in the raid1 thread can lead to
> recursion, as task->bio_list is NULL in this context.
>
> In addition, resubmitting an md_cloned_bio after splitting may lead to
> a deadlock if the array is suspended before the md driver calls
> percpu_ref_tryget_live(&mddev->active_io) on it's path to
> pers->make_request().
>
> Avoid splitting the bio in this context and require that it is either
> read in full or not at all.
>
> This prevents recursion and avoids potential deadlocks during array
> suspension.

Do you have a reproducer?

> Fixes: 689389a06ce7 ("md/raid1: simplify handle_read_error().")
> Signed-off-by: Abd-Alrhman Masalkhi
> ---
> I sent an email about this issue two days ago, but at the time I was
> not sure whether it was a real problem or a misunderstanding on my
> part.
>
> After further analysis, it appears that this issue can occur.
>
> Apologies for the earlier confusion, and thank you for your time.
>
> Abd-Alrhman

I suggest always sharing the URL (lore.kernel.org) when referencing
another thread. If relevant, maybe even reference your message with a
Link: tag in the commit message.
> ---
>  drivers/md/raid1.c | 33 ++++++++++++++++++++++++---------
>  1 file changed, 24 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index cc9914bd15c1..14f6d6625811 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -607,7 +607,7 @@ static int choose_first_rdev(struct r1conf *conf, struct r1bio *r1_bio,
>
>  		/* choose the first disk even if it has some bad blocks. */
>  		read_len = raid1_check_read_range(rdev, this_sector, &len);
> -		if (read_len > 0) {
> +		if (read_len > 0 && (!*max_sectors || read_len == r1_bio->sectors)) {
>  			update_read_sectors(conf, disk, this_sector, read_len);
>  			*max_sectors = read_len;
>  			return disk;
> @@ -704,8 +704,13 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
>  	}
>
>  	if (bb_disk != -1) {
> -		*max_sectors = bb_read_len;
> -		update_read_sectors(conf, bb_disk, this_sector, bb_read_len);
> +		if (!*max_sectors || bb_read_len == r1_bio->sectors) {
> +			*max_sectors = bb_read_len;
> +			update_read_sectors(conf, bb_disk, this_sector,
> +					    bb_read_len);
> +		} else {
> +			bb_disk = -1;
> +		}
>  	}
>
>  	return bb_disk;
> @@ -852,8 +857,9 @@ static int choose_best_rdev(struct r1conf *conf, struct r1bio *r1_bio)
>   * disks and disks with bad blocks for now. Only pay attention to key disk
>   * choice.
>   *
> - * 3) If we've made it this far, now look for disks with bad blocks and choose
> - *    the one with most number of sectors.
> + * 3) If we've made it this far and *max_sectors is 0 (i.e., we are tolerant
> + *    of bad blocks), look for disks with bad blocks and choose the one with
> + *    the most sectors.
>   *
>   * 4) If we are all the way at the end, we have no choice but to use a disk even
>   *    if it is write mostly.
> @@ -882,11 +888,13 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio,
>  	/*
>  	 * If we are here it means we didn't find a perfectly good disk so
>  	 * now spend a bit more time trying to find one with the most good
> -	 * sectors.
> +	 * sectors. but only if we are tolerant of bad blocks.

s/but/But/

>  	 */
> -	disk = choose_bb_rdev(conf, r1_bio, max_sectors);
> -	if (disk >= 0)
> -		return disk;
> +	if (!*max_sectors) {
> +		disk = choose_bb_rdev(conf, r1_bio, max_sectors);
> +		if (disk >= 0)
> +			return disk;
> +	}
>
>  	return choose_slow_rdev(conf, r1_bio, max_sectors);
>  }
> @@ -1346,7 +1354,14 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
>  	/*
>  	 * make_request() can abort the operation when read-ahead is being
>  	 * used and no empty request is available.
> +	 *
> +	 * If we allow splitting the bio while executing in the raid1 thread,
> +	 * we may end up recursing (current->bio_list is NULL), and we might
> +	 * also deadlock if we try to suspend the array, since we are
> +	 * resubmitting an md_cloned_bio. Therefore, we must be read either

… we must read …

> +	 * all the sectors or none.
>  	 */
> +	max_sectors = r1bio_existed;

Excuse my ignorance, but I do not get why a bool is assigned to an int
representing the maximum sector value.

>  	rdisk = read_balance(conf, r1_bio, &max_sectors);
>  	if (rdisk < 0) {
>  		/* couldn't find anywhere to read from */

Kind regards,

Paul