All of lore.kernel.org
 help / color / mirror / Atom feed
From: Neil Brown <neilb@suse.de>
To: Torsten Kaiser <just.for.lkml@googlemail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>,
	Rik van Riel <riel@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	Li Shaohua <shaohua.li@intel.com>
Subject: Re: Deadlock possibly caused by too_many_isolated.
Date: Tue, 19 Oct 2010 10:11:51 +1100	[thread overview]
Message-ID: <20101019101151.57c6dd56@notabene> (raw)
In-Reply-To: <AANLkTimv_zXHdFDGa9ecgXyWmQynOKTDRPC59PZA9mvL@mail.gmail.com>

On Mon, 18 Oct 2010 12:58:17 +0200
Torsten Kaiser <just.for.lkml@googlemail.com> wrote:

> On Mon, Oct 18, 2010 at 6:14 AM, Neil Brown <neilb@suse.de> wrote:
> > Testing shows that this patch seems to work.
> > The test load (essentially kernbench) doesn't deadlock any more, though it
> > does get bogged down thrashing in swap so it doesn't make a lot more
> > progress :-)  I guess that is to be expected.
> 
> I just noticed this thread, as your mail from today pushed it up.
> 
> In your original mail you wrote: " I recently had a customer (running
> 2.6.32) report a deadlock during very intensive IO with lots of
> processes. " and " Some threads that are blocked there, hold some IO
> lock (probably in the filesystem) and are trying to allocate memory
> inside the block device (md/raid1 to be precise) which is allocating
> with GFP_NOIO and has a mempool to fall back on."
> 
> I recently had the same problem (intense IO due to swapstorm created
> by 20 gcc processes hung my system) and after initially blaming the
> workqueue changes in 2.6.36 Tejun Heo determined that my problem was
> not the workqueues getting locked up, but that it was cause by an
> exhausted mempool:
> http://marc.info/?l=linux-kernel&m=128655737012549&w=2
> 
> Instrumenting mm/mempool.c and retrying my workload showed that
> fs_bio_set from fs/bio.c looked like the mempool to blame and the code
> in drivers/md/raid1.c to be the misuser:
> http://marc.info/?l=linux-kernel&m=128671179817823&w=2
> 
> I was even able to reproduce this hang with only using a normal RAID1
> md device as swapspace and then using dd to fill a tmpfs until
> swapping was needed:
> http://marc.info/?l=linux-raid&m=128699402805191&w=2
> 
> Looking back in the history of raid1.c and bio.c I found the following
> interesting parts:
> 
>  * the change to allocate more then one bio via bio_clone() is from
> 2005, but it looks like it was OK back then, because at that point the
> fs_bio_set was allocation 256 entries
>  * in 2007 the size of the mempool was changed from 256 to only 2
> entries (5972511b77809cb7c9ccdb79b825c54921c5c546 "A single unit is
> enough, lets scale it down to 2 just to be on the safe side.")
>  * only in 2009 the comment "To make this work, callers must never
> allocate more than 1 bio at the time from this pool. Callers that need
> to allocate more than 1 bio must always submit the previously allocate
> bio for IO before attempting to allocate a new one. Failure to do so
> can cause livelocks under memory pressure." was added to bio_alloc()
> that is the base from my reasoning that raid1.c is broken. (And such a
> comment was not added to bio_clone() although both calls use the same
> mempool)
> 
> So could please look someone into raid1.c to confirm or deny that
> using multiple bio_clone() (one per drive) before submitting them
> together could also cause such deadlocks?
> 
> Thank for looking
> 
> Torsten

Yes, thanks for the report.
This is a real bug exactly as you describe.

This is how I think I will fix it, though it needs a bit of review and
testing before I can be certain.
Also I need to check raid10 etc to see if they can suffer too.

If you can test it I would really appreciate it.

Thanks,
NeilBrown



diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index d44a50f..8122dde 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -784,7 +784,6 @@ static int make_request(mddev_t *mddev, struct bio * bio)
 	int i, targets = 0, disks;
 	struct bitmap *bitmap;
 	unsigned long flags;
-	struct bio_list bl;
 	struct page **behind_pages = NULL;
 	const int rw = bio_data_dir(bio);
 	const unsigned long do_sync = (bio->bi_rw & REQ_SYNC);
@@ -892,13 +891,6 @@ static int make_request(mddev_t *mddev, struct bio * bio)
 	 * bios[x] to bio
 	 */
 	disks = conf->raid_disks;
-#if 0
-	{ static int first=1;
-	if (first) printk("First Write sector %llu disks %d\n",
-			  (unsigned long long)r1_bio->sector, disks);
-	first = 0;
-	}
-#endif
  retry_write:
 	blocked_rdev = NULL;
 	rcu_read_lock();
@@ -956,14 +948,15 @@ static int make_request(mddev_t *mddev, struct bio * bio)
 	    (behind_pages = alloc_behind_pages(bio)) != NULL)
 		set_bit(R1BIO_BehindIO, &r1_bio->state);
 
-	atomic_set(&r1_bio->remaining, 0);
+	atomic_set(&r1_bio->remaining, targets);
 	atomic_set(&r1_bio->behind_remaining, 0);
 
 	do_barriers = bio->bi_rw & REQ_HARDBARRIER;
 	if (do_barriers)
 		set_bit(R1BIO_Barrier, &r1_bio->state);
 
-	bio_list_init(&bl);
+	bitmap_startwrite(bitmap, bio->bi_sector, r1_bio->sectors,
+				test_bit(R1BIO_BehindIO, &r1_bio->state));
 	for (i = 0; i < disks; i++) {
 		struct bio *mbio;
 		if (!r1_bio->bios[i])
@@ -995,30 +988,18 @@ static int make_request(mddev_t *mddev, struct bio * bio)
 				atomic_inc(&r1_bio->behind_remaining);
 		}
 
-		atomic_inc(&r1_bio->remaining);
-
-		bio_list_add(&bl, mbio);
+		spin_lock_irqsave(&conf->device_lock, flags);
+		bio_list_add(&conf->pending_bio_list, mbio);
+		blk_plug_device(mddev->queue);
+		spin_unlock_irqrestore(&conf->device_lock, flags);
 	}
 	kfree(behind_pages); /* the behind pages are attached to the bios now */
 
-	bitmap_startwrite(bitmap, bio->bi_sector, r1_bio->sectors,
-				test_bit(R1BIO_BehindIO, &r1_bio->state));
-	spin_lock_irqsave(&conf->device_lock, flags);
-	bio_list_merge(&conf->pending_bio_list, &bl);
-	bio_list_init(&bl);
-
-	blk_plug_device(mddev->queue);
-	spin_unlock_irqrestore(&conf->device_lock, flags);
-
 	/* In case raid1d snuck into freeze_array */
 	wake_up(&conf->wait_barrier);
 
 	if (do_sync)
 		md_wakeup_thread(mddev->thread);
-#if 0
-	while ((bio = bio_list_pop(&bl)) != NULL)
-		generic_make_request(bio);
-#endif
 
 	return 0;
 }

WARNING: multiple messages have this Message-ID (diff)
From: Neil Brown <neilb@suse.de>
To: Torsten Kaiser <just.for.lkml@googlemail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>,
	Rik van Riel <riel@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	Li Shaohua <shaohua.li@intel.com>
Subject: Re: Deadlock possibly caused by too_many_isolated.
Date: Tue, 19 Oct 2010 10:11:51 +1100	[thread overview]
Message-ID: <20101019101151.57c6dd56@notabene> (raw)
In-Reply-To: <AANLkTimv_zXHdFDGa9ecgXyWmQynOKTDRPC59PZA9mvL@mail.gmail.com>

On Mon, 18 Oct 2010 12:58:17 +0200
Torsten Kaiser <just.for.lkml@googlemail.com> wrote:

> On Mon, Oct 18, 2010 at 6:14 AM, Neil Brown <neilb@suse.de> wrote:
> > Testing shows that this patch seems to work.
> > The test load (essentially kernbench) doesn't deadlock any more, though it
> > does get bogged down thrashing in swap so it doesn't make a lot more
> > progress :-)  I guess that is to be expected.
> 
> I just noticed this thread, as your mail from today pushed it up.
> 
> In your original mail you wrote: " I recently had a customer (running
> 2.6.32) report a deadlock during very intensive IO with lots of
> processes. " and " Some threads that are blocked there, hold some IO
> lock (probably in the filesystem) and are trying to allocate memory
> inside the block device (md/raid1 to be precise) which is allocating
> with GFP_NOIO and has a mempool to fall back on."
> 
> I recently had the same problem (intense IO due to swapstorm created
> by 20 gcc processes hung my system) and after initially blaming the
> workqueue changes in 2.6.36 Tejun Heo determined that my problem was
> not the workqueues getting locked up, but that it was cause by an
> exhausted mempool:
> http://marc.info/?l=linux-kernel&m=128655737012549&w=2
> 
> Instrumenting mm/mempool.c and retrying my workload showed that
> fs_bio_set from fs/bio.c looked like the mempool to blame and the code
> in drivers/md/raid1.c to be the misuser:
> http://marc.info/?l=linux-kernel&m=128671179817823&w=2
> 
> I was even able to reproduce this hang with only using a normal RAID1
> md device as swapspace and then using dd to fill a tmpfs until
> swapping was needed:
> http://marc.info/?l=linux-raid&m=128699402805191&w=2
> 
> Looking back in the history of raid1.c and bio.c I found the following
> interesting parts:
> 
>  * the change to allocate more then one bio via bio_clone() is from
> 2005, but it looks like it was OK back then, because at that point the
> fs_bio_set was allocation 256 entries
>  * in 2007 the size of the mempool was changed from 256 to only 2
> entries (5972511b77809cb7c9ccdb79b825c54921c5c546 "A single unit is
> enough, lets scale it down to 2 just to be on the safe side.")
>  * only in 2009 the comment "To make this work, callers must never
> allocate more than 1 bio at the time from this pool. Callers that need
> to allocate more than 1 bio must always submit the previously allocate
> bio for IO before attempting to allocate a new one. Failure to do so
> can cause livelocks under memory pressure." was added to bio_alloc()
> that is the base from my reasoning that raid1.c is broken. (And such a
> comment was not added to bio_clone() although both calls use the same
> mempool)
> 
> So could please look someone into raid1.c to confirm or deny that
> using multiple bio_clone() (one per drive) before submitting them
> together could also cause such deadlocks?
> 
> Thank for looking
> 
> Torsten

Yes, thanks for the report.
This is a real bug exactly as you describe.

This is how I think I will fix it, though it needs a bit of review and
testing before I can be certain.
Also I need to check raid10 etc to see if they can suffer too.

If you can test it I would really appreciate it.

Thanks,
NeilBrown



diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index d44a50f..8122dde 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -784,7 +784,6 @@ static int make_request(mddev_t *mddev, struct bio * bio)
 	int i, targets = 0, disks;
 	struct bitmap *bitmap;
 	unsigned long flags;
-	struct bio_list bl;
 	struct page **behind_pages = NULL;
 	const int rw = bio_data_dir(bio);
 	const unsigned long do_sync = (bio->bi_rw & REQ_SYNC);
@@ -892,13 +891,6 @@ static int make_request(mddev_t *mddev, struct bio * bio)
 	 * bios[x] to bio
 	 */
 	disks = conf->raid_disks;
-#if 0
-	{ static int first=1;
-	if (first) printk("First Write sector %llu disks %d\n",
-			  (unsigned long long)r1_bio->sector, disks);
-	first = 0;
-	}
-#endif
  retry_write:
 	blocked_rdev = NULL;
 	rcu_read_lock();
@@ -956,14 +948,15 @@ static int make_request(mddev_t *mddev, struct bio * bio)
 	    (behind_pages = alloc_behind_pages(bio)) != NULL)
 		set_bit(R1BIO_BehindIO, &r1_bio->state);
 
-	atomic_set(&r1_bio->remaining, 0);
+	atomic_set(&r1_bio->remaining, targets);
 	atomic_set(&r1_bio->behind_remaining, 0);
 
 	do_barriers = bio->bi_rw & REQ_HARDBARRIER;
 	if (do_barriers)
 		set_bit(R1BIO_Barrier, &r1_bio->state);
 
-	bio_list_init(&bl);
+	bitmap_startwrite(bitmap, bio->bi_sector, r1_bio->sectors,
+				test_bit(R1BIO_BehindIO, &r1_bio->state));
 	for (i = 0; i < disks; i++) {
 		struct bio *mbio;
 		if (!r1_bio->bios[i])
@@ -995,30 +988,18 @@ static int make_request(mddev_t *mddev, struct bio * bio)
 				atomic_inc(&r1_bio->behind_remaining);
 		}
 
-		atomic_inc(&r1_bio->remaining);
-
-		bio_list_add(&bl, mbio);
+		spin_lock_irqsave(&conf->device_lock, flags);
+		bio_list_add(&conf->pending_bio_list, mbio);
+		blk_plug_device(mddev->queue);
+		spin_unlock_irqrestore(&conf->device_lock, flags);
 	}
 	kfree(behind_pages); /* the behind pages are attached to the bios now */
 
-	bitmap_startwrite(bitmap, bio->bi_sector, r1_bio->sectors,
-				test_bit(R1BIO_BehindIO, &r1_bio->state));
-	spin_lock_irqsave(&conf->device_lock, flags);
-	bio_list_merge(&conf->pending_bio_list, &bl);
-	bio_list_init(&bl);
-
-	blk_plug_device(mddev->queue);
-	spin_unlock_irqrestore(&conf->device_lock, flags);
-
 	/* In case raid1d snuck into freeze_array */
 	wake_up(&conf->wait_barrier);
 
 	if (do_sync)
 		md_wakeup_thread(mddev->thread);
-#if 0
-	while ((bio = bio_list_pop(&bl)) != NULL)
-		generic_make_request(bio);
-#endif
 
 	return 0;
 }

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2010-10-18 23:12 UTC|newest]

Thread overview: 116+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-09-14 23:11 Deadlock possibly caused by too_many_isolated Neil Brown
2010-09-14 23:11 ` Neil Brown
2010-09-15  0:30 ` Rik van Riel
2010-09-15  0:30   ` Rik van Riel
2010-09-15  2:23   ` Neil Brown
2010-09-15  2:23     ` Neil Brown
2010-09-15  2:37     ` Wu Fengguang
2010-09-15  2:37       ` Wu Fengguang
2010-09-15  2:54       ` Wu Fengguang
2010-09-15  2:54         ` Wu Fengguang
2010-09-15  3:06         ` Wu Fengguang
2010-09-15  3:06           ` Wu Fengguang
2010-09-15  3:13           ` Wu Fengguang
2010-09-15  3:13             ` Wu Fengguang
2010-09-15  3:18             ` Shaohua Li
2010-09-15  3:18               ` Shaohua Li
2010-09-15  3:31               ` Wu Fengguang
2010-09-15  3:31                 ` Wu Fengguang
2010-09-15  3:17           ` Neil Brown
2010-09-15  3:17             ` Neil Brown
2010-09-15  3:47             ` Wu Fengguang
2010-09-15  3:47               ` Wu Fengguang
2010-09-15  8:28     ` Wu Fengguang
2010-09-15  8:28       ` Wu Fengguang
2010-09-15  8:44       ` Neil Brown
2010-09-15  8:44         ` Neil Brown
2010-10-18  4:14         ` Neil Brown
2010-10-18  4:14           ` Neil Brown
2010-10-18  5:04           ` KOSAKI Motohiro
2010-10-18  5:04             ` KOSAKI Motohiro
2010-10-18 10:58           ` Torsten Kaiser
2010-10-18 10:58             ` Torsten Kaiser
2010-10-18 23:11             ` Neil Brown [this message]
2010-10-18 23:11               ` Neil Brown
2010-10-19  8:43               ` Torsten Kaiser
2010-10-19  8:43                 ` Torsten Kaiser
2010-10-19 10:06                 ` Torsten Kaiser
2010-10-19 10:06                   ` Torsten Kaiser
2010-10-20  5:57                   ` Wu Fengguang
2010-10-20  5:57                     ` Wu Fengguang
2010-10-20  7:05                     ` KOSAKI Motohiro
2010-10-20  7:05                       ` KOSAKI Motohiro
2010-10-20  9:27                       ` Wu Fengguang
2010-10-20  9:27                         ` Wu Fengguang
2010-10-20 13:03                         ` Jens Axboe
2010-10-20 13:03                           ` Jens Axboe
2010-10-22  5:37                           ` Wu Fengguang
2010-10-22  5:37                             ` Wu Fengguang
2010-10-22  8:07                             ` Wu Fengguang
2010-10-22  8:07                               ` Wu Fengguang
2010-10-22  8:09                               ` Jens Axboe
2010-10-22  8:09                                 ` Jens Axboe
2010-10-24 16:52                                 ` Wu Fengguang
2010-10-24 16:52                                   ` Wu Fengguang
2010-10-25  6:40                                   ` Neil Brown
2010-10-25  6:40                                     ` Neil Brown
2010-10-25  7:26                                     ` Wu Fengguang
2010-10-25  7:26                                       ` Wu Fengguang
2010-10-20  7:25                     ` Torsten Kaiser
2010-10-20  7:25                       ` Torsten Kaiser
2010-10-20  9:01                       ` Wu Fengguang
2010-10-20  9:01                         ` Wu Fengguang
2010-10-20 10:07                         ` Torsten Kaiser
2010-10-20 10:07                           ` Torsten Kaiser
2010-10-20 14:23                       ` Minchan Kim
2010-10-20 14:23                         ` Minchan Kim
2010-10-20 15:35                         ` Torsten Kaiser
2010-10-20 15:35                           ` Torsten Kaiser
2010-10-20 23:31                           ` Minchan Kim
2010-10-20 23:31                             ` Minchan Kim
2010-10-18 16:15           ` Wu Fengguang
2010-10-18 16:15             ` Wu Fengguang
2010-10-18 21:58             ` Andrew Morton
2010-10-18 21:58               ` Andrew Morton
2010-10-18 22:31               ` Neil Brown
2010-10-18 22:31                 ` Neil Brown
2010-10-18 22:41                 ` Andrew Morton
2010-10-18 22:41                   ` Andrew Morton
2010-10-19  0:57                   ` KOSAKI Motohiro
2010-10-19  0:57                     ` KOSAKI Motohiro
2010-10-19  1:15                     ` Minchan Kim
2010-10-19  1:15                       ` Minchan Kim
2010-10-19  1:21                       ` KOSAKI Motohiro
2010-10-19  1:21                         ` KOSAKI Motohiro
2010-10-19  1:32                         ` Minchan Kim
2010-10-19  1:32                           ` Minchan Kim
2010-10-19  2:03                           ` KOSAKI Motohiro
2010-10-19  2:03                             ` KOSAKI Motohiro
2010-10-19  2:16                             ` Minchan Kim
2010-10-19  2:16                               ` Minchan Kim
2010-10-19  2:54                               ` KOSAKI Motohiro
2010-10-19  2:54                                 ` KOSAKI Motohiro
2010-10-19  2:35                       ` Wu Fengguang
2010-10-19  2:35                         ` Wu Fengguang
2010-10-19  2:52                         ` Minchan Kim
2010-10-19  2:52                           ` Minchan Kim
2010-10-19  3:05                           ` Wu Fengguang
2010-10-19  3:05                             ` Wu Fengguang
2010-10-19  3:09                             ` Minchan Kim
2010-10-19  3:09                               ` Minchan Kim
2010-10-19  3:13                               ` KOSAKI Motohiro
2010-10-19  3:13                                 ` KOSAKI Motohiro
2010-10-19  5:11                                 ` Minchan Kim
2010-10-19  5:11                                   ` Minchan Kim
2010-10-19  3:21                               ` Shaohua Li
2010-10-19  3:21                                 ` Shaohua Li
2010-10-19  7:15                                 ` Shaohua Li
2010-10-19  7:15                                   ` Shaohua Li
2010-10-19  7:34                                   ` Minchan Kim
2010-10-19  7:34                                     ` Minchan Kim
2010-10-19  2:24                   ` Wu Fengguang
2010-10-19  2:24                     ` Wu Fengguang
2010-10-19  2:37                     ` KOSAKI Motohiro
2010-10-19  2:37                       ` KOSAKI Motohiro
2010-10-19  2:37                     ` Minchan Kim
2010-10-19  2:37                       ` Minchan Kim

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20101019101151.57c6dd56@notabene \
    --to=neilb@suse.de \
    --cc=akpm@linux-foundation.org \
    --cc=fengguang.wu@intel.com \
    --cc=just.for.lkml@googlemail.com \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=kosaki.motohiro@jp.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=riel@redhat.com \
    --cc=shaohua.li@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.