From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Coly Li <colyli@suse.de>,
	NeilBrown <neilb@suse.com>,
	Jack Wang <jinpu.wang@profitbricks.com>, Shaohua Li <shli@fb.com>
Subject: [PATCH 4.9 14/24] md/raid1/10: fix potential deadlock
Date: Fri, 24 Mar 2017 18:58:47 +0100
Message-ID: <20170324151226.133441388@linuxfoundation.org>
In-Reply-To: <20170324151225.378075203@linuxfoundation.org>

4.9-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Shaohua Li <shli@fb.com>

commit 61eb2b43b99ebdc9bc6bc83d9792257b243e7cb3 upstream.

Neil Brown pointed out a potential deadlock in the raid10 code with
bio_split/chain. The raid1 code could have the same issue, but the recent
barrier rework makes it less likely to happen. The deadlock happens in
the following sequence:

1. generic_make_request(bio), which sets current->bio_list
2. raid10_make_request splits bio into bio1 and bio2
3. __make_request(bio1): wait_barrier, add the underlying disk bio to
current->bio_list
4. __make_request(bio2): wait_barrier

If raise_barrier happens between 3 and 4, then since wait_barrier already
ran at 3, raise_barrier waits for completion of the IO from 3. And since
raise_barrier sets the barrier, 4 waits for raise_barrier. But the IO from
3 can't be dispatched because raid10_make_request() hasn't finished yet.
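
The sequence above can be sketched as a small userspace model. This is
only an illustration, not kernel code: struct toy_conf and every toy_*
function are hypothetical names invented here, and sleeping is modelled
as a boolean return value rather than an actual wait.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the barrier state; all names here are illustrative only. */
struct toy_conf {
    int  nr_pending;  /* IO that passed the barrier and has not completed */
    bool barrier;     /* set once the toy raise_barrier has been called */
    int  queued;      /* lower-level bios parked on current->bio_list */
};

/* Returns false if the caller would have to sleep behind a raised barrier. */
static bool toy_wait_barrier(struct toy_conf *c)
{
    if (c->barrier)
        return false;
    c->nr_pending++;
    return true;
}

/* Sets the barrier; the real caller would then wait for nr_pending == 0. */
static void toy_raise_barrier(struct toy_conf *c)
{
    c->barrier = true;
}

/* Dispatch everything parked on the list; completion drops nr_pending. */
static void toy_drain_bio_list(struct toy_conf *c)
{
    assert(c->nr_pending >= c->queued);
    c->nr_pending -= c->queued;
    c->queued = 0;
}

/*
 * Old ordering: both halves are handled inside one make_request call,
 * so the list cannot drain between step 3 and step 4.
 * Returns true if the model ends up stuck.
 */
static bool toy_old_order_deadlocks(bool raise_between)
{
    struct toy_conf c = { 0, false, 0 };

    /* step 3: first half passes the barrier, its child bio is only queued */
    if (!toy_wait_barrier(&c))
        return true;
    c.queued++;

    if (raise_between)
        toy_raise_barrier(&c);

    /* step 4: second half, reached before the list has been drained */
    if (!toy_wait_barrier(&c))
        return c.nr_pending > 0 && c.queued > 0; /* both sides wait forever */
    c.queued++;

    toy_drain_bio_list(&c); /* make_request returned, list gets processed */
    return c.nr_pending != 0;
}
```

In this model, toy_old_order_deadlocks(true) reports the stuck state and
toy_old_order_deadlocks(false) completes normally.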

The solution is to adjust the IO ordering. Quoting Neil:
"
It is much safer to:

    if (need to split) {
        split = bio_split(bio, ...)
        bio_chain(...)
        make_request_fn(split);
        generic_make_request(bio);
    } else
        make_request_fn(mddev, bio);

This way we first process the initial section of the bio (in 'split')
which will queue some requests to the underlying devices.  These
requests will be queued in generic_make_request.
Then we queue the remainder of the bio, which will be added to the end
of the generic_make_request queue.
Then we return.
generic_make_request() will pop the lower-level device requests off the
queue and handle them first.  Then it will process the remainder
of the original bio once the first section has been fully processed.
"
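
The ordering Neil describes can be checked with a toy FIFO model. This is
a sketch under stated assumptions: current->bio_list is modelled as a
plain array-backed FIFO, and enum toy_bio, struct toy_ctx and the toy_*
functions are hypothetical names, not kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative bio kinds: two md-level bios and two lower-level ones. */
enum toy_bio { MD_WHOLE, MD_REST, DISK_FIRST, DISK_REST };

#define TOY_MAX 8

struct toy_ctx {
    enum toy_bio list[TOY_MAX]; /* stands in for current->bio_list (FIFO) */
    size_t head, tail;
    int active;                 /* nonzero while the outer loop is running */
    enum toy_bio log[TOY_MAX];  /* order in which bios were handled */
    size_t nlog;
};

static void toy_generic_make_request(struct toy_ctx *c, enum toy_bio bio);

/* Per-bio handler: md bios queue further bios, disk bios just complete. */
static void toy_make_request(struct toy_ctx *c, enum toy_bio bio)
{
    assert(c->nlog < TOY_MAX);
    c->log[c->nlog++] = bio;
    switch (bio) {
    case MD_WHOLE:
        /* handle the front part now, then requeue the remainder */
        toy_generic_make_request(c, DISK_FIRST);
        toy_generic_make_request(c, MD_REST);
        break;
    case MD_REST:
        toy_generic_make_request(c, DISK_REST);
        break;
    default:
        break; /* lower-level device: nothing further to queue */
    }
}

static void toy_generic_make_request(struct toy_ctx *c, enum toy_bio bio)
{
    if (c->active) {            /* nested call: only append to the list */
        assert(c->tail < TOY_MAX);
        c->list[c->tail++] = bio;
        return;
    }
    c->active = 1;
    toy_make_request(c, bio);
    while (c->head < c->tail)   /* pop in FIFO order until empty */
        toy_make_request(c, c->list[c->head++]);
    c->active = 0;
}
```

Submitting MD_WHOLE to a zero-initialized struct toy_ctx handles the bios
in the order MD_WHOLE, DISK_FIRST, MD_REST, DISK_REST: the lower-level
request for the first section is handled before the remainder is looked
at again.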

Note, this only happens in the read path. In the write path, the bio is flushed to
the underlying disks either by blk flush (from schedule) or offloaded to raid1/10d.
It's queued in current->bio_list.

Cc: Coly Li <colyli@suse.de>
Suggested-by: NeilBrown <neilb@suse.com>
Reviewed-by: Jack Wang <jinpu.wang@profitbricks.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 drivers/md/raid10.c |   18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1470,7 +1470,25 @@ static void raid10_make_request(struct m
 			split = bio;
 		}
 
+		/*
+		 * If a bio is split, the first part of bio will pass
+		 * barrier but the bio is queued in current->bio_list (see
+		 * generic_make_request). If there is a raise_barrier() called
+		 * here, the second part of bio can't pass barrier. But since
+		 * the first part bio isn't dispatched to underlying disks
+		 * yet, the barrier is never released, hence raise_barrier will
+		 * always wait. We have a deadlock.
+		 * Note, this only happens in read path. For write path, the
+		 * first part of bio is dispatched in a schedule() call
+		 * (because of blk plug) or offloaded to raid10d.
+		 * Quitting from the function immediately can change the bio
+		 * order queued in bio_list and avoid the deadlock.
+		 */
 		__make_request(mddev, split);
+		if (split != bio && bio_data_dir(bio) == READ) {
+			generic_make_request(bio);
+			break;
+		}
 	} while (split != bio);
 
 	/* In case raid10d snuck in to freeze_array */
