From: Shaohua Li <shli@kernel.org>
To: NeilBrown <neilb@suse.com>
Cc: Coly Li <colyli@suse.de>, NeilBrown <neilb@suse.de>,
	linux-raid@vger.kernel.org, Shaohua Li <shli@fb.com>,
	Johannes Thumshirn <jthumshirn@suse.de>,
	Guoqing Jiang <gqjiang@suse.com>
Subject: Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
Date: Sun, 19 Feb 2017 23:04:30 -0800
Message-ID: <20170220070430.4mca7clpaw7kpj4j@kernel.org>
In-Reply-To: <87k28lshg5.fsf@notabene.neil.brown.name>

On Mon, Feb 20, 2017 at 01:51:22PM +1100, Neil Brown wrote:
> On Mon, Feb 20 2017, NeilBrown wrote:
> 
> > On Fri, Feb 17 2017, Coly Li wrote:
> >
> >> On 2017/2/16 3:04 PM, NeilBrown wrote:
> >>> I know you are going to change this as Shaohua wants the splitting to
> >>> happen in a separate function, which I agree with, but there is
> >>> something else wrong here. Calling bio_split/bio_chain repeatedly
> >>> in a loop is dangerous. It is OK for simple devices, but when one
> >>> request can wait for another request to the same device it can
> >>> deadlock. This can happen with raid1.  If a resync request calls
> >>> raise_barrier() between one request and the next, then the next has
> >>> to wait for the resync request, which has to wait for the first
> >>> request. As the first request will be stuck in the queue in 
> >>> generic_make_request(), you get a deadlock.
> >>
> >> For md raid1, may I understand the queue in generic_make_request()
> >> as bio_list_on_stack in that function? And the queue in the
> >> underlying device as data structures like plug->pending and
> >> conf->pending_bio_list?
> >
> > Yes, the queue in generic_make_request() is the bio_list_on_stack.  That
> > is the only queue I am talking about.  I'm not referring to
> > plug->pending or conf->pending_bio_list at all.
> >
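
(For reference: generic_make_request() in kernels of this era works
roughly like the sketch below -- a simplified illustration of the
bio_list_on_stack mechanism, not the exact upstream code.)

    blk_qc_t generic_make_request(struct bio *bio)
    {
        struct bio_list bio_list_on_stack;

        if (current->bio_list) {
            /* A make_request_fn() is already active on this task:
             * queue the bio for later dispatch and return. */
            bio_list_add(current->bio_list, bio);
            return BLK_QC_T_NONE;
        }

        /* Top-level caller: dispatch bios until the queue drains. */
        bio_list_init(&bio_list_on_stack);
        current->bio_list = &bio_list_on_stack;
        do {
            struct request_queue *q = bdev_get_queue(bio->bi_bdev);

            q->make_request_fn(q, bio); /* may queue more bios ... */
            bio = bio_list_pop(current->bio_list); /* ... popped here */
        } while (bio);
        current->bio_list = NULL;

        return BLK_QC_T_NONE;
    }

So any bio that a make_request_fn() submits is trapped on the on-stack
list until that make_request_fn() returns.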
> >>
> >> I still don't see the potential deadlock; let me try to explain
> >> why. Suppose a bio is split, its first part is processed by
> >> make_request_fn(), and then a resync comes and raises a barrier;
> >> there are 3 possible conditions:
> >> - the resync I/O tries to raise the barrier on the same bucket as
> >> the first regular bio. Then the resync task has to wait until the
> >> first bio drops its conf->nr_pending[idx]
> >
> > Not quite.
> > First, the resync task (in raise_barrier()) will wait for ->nr_waiting[idx]
> > to be zero.  We can assume this happens immediately.
> > Then the resync_task will increment ->barrier[idx].
> > Only then will it wait for the first bio to drop ->nr_pending[idx].
> > The processing of that first bio will have submitted bios to the
> > underlying device, and they will be in the bio_list_on_stack queue, and
> > will not be processed until raid1_make_request() completes.
> >
> > The loop in raid1_make_request() will then call make_request_fn() which
> > will call wait_barrier(), which will wait for ->barrier[idx] to be
> > zero.
> 
> Thinking more carefully about this... the 'idx' that the second bio will
> wait for will normally be different, so there won't be a deadlock after
> all.
> 
> However it is possible for hash_long() to produce the same idx for two
> consecutive barrier_units so there is still the possibility of a
> deadlock, though it isn't as likely as I thought at first.
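
For reference, the barrier bucket index in Coly's patch is computed
along these lines (a sketch based on the V3 patch; exact names may
differ):

    static inline int sector_to_idx(sector_t sector)
    {
        /* one bucket per barrier unit, spread by hash_long() */
        return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
                         BARRIER_BUCKETS_NR_BITS);
    }

Since hash_long() is not monotonic, nothing guarantees that two
consecutive barrier units map to different buckets.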

I folded the fix for the function pointer issue Neil pointed out into
Coly's original patch, and also fixed a use-after-free bug. For the
deadlock issue, I'll add the patch below; please check.

Thanks,
Shaohua

From ee9c98138bcdf8bceef384a68f49258b6b8b8c6d Mon Sep 17 00:00:00 2001
Message-Id: <ee9c98138bcdf8bceef384a68f49258b6b8b8c6d.1487573888.git.shli@fb.com>
From: Shaohua Li <shli@fb.com>
Date: Sun, 19 Feb 2017 22:18:32 -0800
Subject: [PATCH] md/raid1/10: fix potential deadlock

Neil Brown pointed out a potential deadlock in the raid10 code with
bio_split/bio_chain. The raid1 code could have the same issue, but the
recent barrier rework makes it less likely to happen. The deadlock
happens in the following sequence:

1. generic_make_request(bio): this sets current->bio_list
2. raid10_make_request() splits bio into bio1 and bio2
3. __make_request(bio1): wait_barrier() succeeds, and the bios for the
underlying disks are added to current->bio_list
4. __make_request(bio2): wait_barrier()

If raise_barrier() runs between steps 3 and 4, then because
wait_barrier() already succeeded at step 3, raise_barrier() waits for
the I/O from step 3 to complete. And because raise_barrier() sets the
barrier, step 4 waits for raise_barrier(). But the I/O from step 3
can't be dispatched, because raid10_make_request() hasn't finished yet.
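
In code, the problematic pattern was roughly the following (a
simplified sketch of the old loop, not the exact source):

    do {
        if (need_to_split) { /* crosses chunk/barrier-unit boundary */
            split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
            bio_chain(split, bio);
        } else
            split = bio;
        /* Queues bios for the underlying disks on current->bio_list;
         * they can't be dispatched until this function returns. */
        __make_request(mddev, split);
    } while (split != bio);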

The solution is to adjust the I/O ordering. Quoting Neil:
"
It is much safer to:

    if (need to split) {
        split = bio_split(bio, ...)
        bio_chain(...)
        make_request_fn(split);
        generic_make_request(bio);
    } else
        make_request_fn(mddev, bio);

This way we first process the initial section of the bio (in 'split')
which will queue some requests to the underlying devices.  These
requests will be queued in generic_make_request.
Then we queue the remainder of the bio, which will be added to the end
of the generic_make_request queue.
Then we return.
generic_make_request() will pop the lower-level device requests off the
queue and handle them first.  Then it will process the remainder
of the original bio once the first section has been fully processed.
"

Cc: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org (v3.14+, only the raid10 part)
Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid1.c  | 28 ++++++++++++++--------------
 drivers/md/raid10.c | 41 ++++++++++++++++++++---------------------
 2 files changed, 34 insertions(+), 35 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 676f72d..e55d865 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1566,21 +1566,21 @@ static void raid1_make_request(struct mddev *mddev, struct bio *bio)
 	sector_t sectors;
 
 	/* if bio exceeds barrier unit boundary, split it */
-	do {
-		sectors = align_to_barrier_unit_end(
-				bio->bi_iter.bi_sector, bio_sectors(bio));
-		if (sectors < bio_sectors(bio)) {
-			split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
-			bio_chain(split, bio);
-		} else {
-			split = bio;
-		}
+	sectors = align_to_barrier_unit_end(
+			bio->bi_iter.bi_sector, bio_sectors(bio));
+	if (sectors < bio_sectors(bio)) {
+		split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
+		bio_chain(split, bio);
+	} else {
+		split = bio;
+	}
 
-		if (bio_data_dir(split) == READ)
-			raid1_read_request(mddev, split);
-		else
-			raid1_write_request(mddev, split);
-	} while (split != bio);
+	if (bio_data_dir(split) == READ)
+		raid1_read_request(mddev, split);
+	else
+		raid1_write_request(mddev, split);
+	if (split != bio)
+		generic_make_request(bio);
 }
 
 static void raid1_status(struct seq_file *seq, struct mddev *mddev)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index a1f8e98..b495049 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1551,28 +1551,27 @@ static void raid10_make_request(struct mddev *mddev, struct bio *bio)
 		return;
 	}
 
-	do {
-
-		/*
-		 * If this request crosses a chunk boundary, we need to split
-		 * it.
-		 */
-		if (unlikely((bio->bi_iter.bi_sector & chunk_mask) +
-			     bio_sectors(bio) > chunk_sects
-			     && (conf->geo.near_copies < conf->geo.raid_disks
-				 || conf->prev.near_copies <
-				 conf->prev.raid_disks))) {
-			split = bio_split(bio, chunk_sects -
-					  (bio->bi_iter.bi_sector &
-					   (chunk_sects - 1)),
-					  GFP_NOIO, fs_bio_set);
-			bio_chain(split, bio);
-		} else {
-			split = bio;
-		}
+	/*
+	 * If this request crosses a chunk boundary, we need to split
+	 * it.
+	 */
+	if (unlikely((bio->bi_iter.bi_sector & chunk_mask) +
+		     bio_sectors(bio) > chunk_sects
+		     && (conf->geo.near_copies < conf->geo.raid_disks
+			 || conf->prev.near_copies <
+			 conf->prev.raid_disks))) {
+		split = bio_split(bio, chunk_sects -
+				  (bio->bi_iter.bi_sector &
+				   (chunk_sects - 1)),
+				  GFP_NOIO, fs_bio_set);
+		bio_chain(split, bio);
+	} else {
+		split = bio;
+	}
 
-		__make_request(mddev, split);
-	} while (split != bio);
+	__make_request(mddev, split);
+	if (split != bio)
+		generic_make_request(bio);
 
 	/* In case raid10d snuck in to freeze_array */
 	wake_up(&conf->wait_barrier);
-- 
2.9.3

