2.6.39: raid1 check blocks jbd on other md more than 120 seconds

linux-raid.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* 2.6.39: raid1 check blocks jbd on other md more than 120 seconds
@ 2011-06-02  9:36 Frank van Maarseveen
  2011-06-02  9:46 ` Mathias Burén
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Frank van Maarseveen @ 2011-06-02  9:36 UTC (permalink / raw)
  To: linux-raid

The system runs FC14 with an (almost) stock 2.6.39 kernel, configured to
panic if it seems to hang. That's exactly what started to happen without
anything being logged in the normal way except over netconsole.

/proc/mdstat:
Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] 
md3 : active raid1 sda3[0] sdb3[1]
      1885338488 blocks super 1.2 [2/2] [UU]
      
md1 : active raid1 sda1[0] sdb1[1]
      33555384 blocks super 1.2 [2/2] [UU]

kernel messages:
	(/etc/cron.weekly/99-raid-check kicks in)
Jun  2 04:04:00 janus md: data-check of RAID array md3
Jun  2 04:04:00 janus md: delaying data-check of md1 until md3 has finished (they share one or more physical units)
Jun  2 04:04:00 janus md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Jun  2 04:04:00 janus md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Jun  2 04:04:00 janus md: using 128k window, over a total of 1885338488 blocks.
Jun  2 04:55:54 janus INFO: task jbd2/md1-8:1188 blocked for more than 120 seconds.
Jun  2 04:55:54 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun  2 04:55:54 janus jbd2/md1-8     D
Jun  2 04:55:54  f5739e68
Jun  2 04:55:54 janus    0  1188      2 0x00000000
Jun  2 04:55:54  f5739e78
Jun  2 04:55:54  00000046
Jun  2 04:55:54  00000002
Jun  2 04:55:54  f5739e68
Jun  2 04:55:54  00000000
Jun  2 04:55:54  01000000
Jun  2 04:55:54  f5739e08
Jun  2 04:55:54  f6124f00
Jun  2 04:55:54 janus
Jun  2 04:55:54  c1999c00
Jun  2 04:55:54  00000001
Jun  2 04:55:54  c1999680
Jun  2 04:55:54  c1999680
Jun  2 04:55:54  00000003
Jun  2 04:55:54  f6126940
Jun  2 04:55:54  f6126ac0
Jun  2 04:55:54  f6786680
Jun  2 04:55:54 janus
Jun  2 04:55:54  000045d1
Jun  2 04:55:54  00000000
Jun  2 04:55:54  00000000
Jun  2 04:55:54  57087b88
Jun  2 04:55:54  000045d1
Jun  2 04:55:54  000045d1
Jun  2 04:55:54  f6686c00
Jun  2 04:55:54  f6686c10
Jun  2 04:55:54 janus
Jun  2 04:55:54 Call Trace:
Jun  2 04:55:54  [<c103822b>] ? get_parent_ip+0xb/0x40
Jun  2 04:55:54  [<c11698a2>] jbd2_journal_commit_transaction+0x182/0x1500
Jun  2 04:55:54  [<c100249f>] ? __switch_to+0xcf/0x1c0
Jun  2 04:55:54  [<c103822b>] ? get_parent_ip+0xb/0x40
Jun  2 04:55:54  [<c162f05c>] ? schedule+0x2ac/0x730
Jun  2 04:55:54  [<c1056c90>] ? wake_up_bit+0x80/0x80
	(reboots)

-- 
Frank

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: 2.6.39: raid1 check blocks jbd on other md more than 120 seconds
  2011-06-02  9:36 2.6.39: raid1 check blocks jbd on other md more than 120 seconds Frank van Maarseveen
@ 2011-06-02  9:46 ` Mathias Burén
  2011-06-03  7:38   ` Frank van Maarseveen
  2011-06-03 12:08 ` Thomas Harold
  2011-06-07  8:24 ` 2.6.39[.1]: raid1 check blocks jbd on other md more than 120 seconds -- workaround Frank van Maarseveen
  2 siblings, 1 reply; 6+ messages in thread
From: Mathias Burén @ 2011-06-02  9:46 UTC (permalink / raw)
  To: Frank van Maarseveen; +Cc: linux-raid

On 2 June 2011 11:36, Frank van Maarseveen <frankvm@frankvm.com> wrote:
> The system runs FC14 with an (almost) stock 2.6.39 kernel, configured to
> panic if it seems to hang. That's exactly what started to happen without
> anything being logged in the normal way except over netconsole.
>
> /proc/mdstat:
> Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4]
> md3 : active raid1 sda3[0] sdb3[1]
>      1885338488 blocks super 1.2 [2/2] [UU]
>
> md1 : active raid1 sda1[0] sdb1[1]
>      33555384 blocks super 1.2 [2/2] [UU]
>
> kernel messages:
>        (/etc/cron.weekly/99-raid-check kicks in)
> Jun  2 04:04:00 janus md: data-check of RAID array md3
> Jun  2 04:04:00 janus md: delaying data-check of md1 until md3 has finished (they share one or more physical units)
> Jun  2 04:04:00 janus md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
> Jun  2 04:04:00 janus md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
> Jun  2 04:04:00 janus md: using 128k window, over a total of 1885338488 blocks.
> Jun  2 04:55:54 janus INFO: task jbd2/md1-8:1188 blocked for more than 120 seconds.
> Jun  2 04:55:54 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jun  2 04:55:54 janus jbd2/md1-8     D
> Jun  2 04:55:54  f5739e68
> Jun  2 04:55:54 janus    0  1188      2 0x00000000
> Jun  2 04:55:54  f5739e78
> Jun  2 04:55:54  00000046
> Jun  2 04:55:54  00000002
> Jun  2 04:55:54  f5739e68
> Jun  2 04:55:54  00000000
> Jun  2 04:55:54  01000000
> Jun  2 04:55:54  f5739e08
> Jun  2 04:55:54  f6124f00
> Jun  2 04:55:54 janus
> Jun  2 04:55:54  c1999c00
> Jun  2 04:55:54  00000001
> Jun  2 04:55:54  c1999680
> Jun  2 04:55:54  c1999680
> Jun  2 04:55:54  00000003
> Jun  2 04:55:54  f6126940
> Jun  2 04:55:54  f6126ac0
> Jun  2 04:55:54  f6786680
> Jun  2 04:55:54 janus
> Jun  2 04:55:54  000045d1
> Jun  2 04:55:54  00000000
> Jun  2 04:55:54  00000000
> Jun  2 04:55:54  57087b88
> Jun  2 04:55:54  000045d1
> Jun  2 04:55:54  000045d1
> Jun  2 04:55:54  f6686c00
> Jun  2 04:55:54  f6686c10
> Jun  2 04:55:54 janus
> Jun  2 04:55:54 Call Trace:
> Jun  2 04:55:54  [<c103822b>] ? get_parent_ip+0xb/0x40
> Jun  2 04:55:54  [<c11698a2>] jbd2_journal_commit_transaction+0x182/0x1500
> Jun  2 04:55:54  [<c100249f>] ? __switch_to+0xcf/0x1c0
> Jun  2 04:55:54  [<c103822b>] ? get_parent_ip+0xb/0x40
> Jun  2 04:55:54  [<c162f05c>] ? schedule+0x2ac/0x730
> Jun  2 04:55:54  [<c1056c90>] ? wake_up_bit+0x80/0x80
>        (reboots)
>
> --
> Frank
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

Same behavior if you lower this?

Jun  2 04:04:00 janus md: using maximum available idle IO bandwidth
(but not more than 200000 KB/sec) for data-check.

/M
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: 2.6.39: raid1 check blocks jbd on other md more than 120 seconds
  2011-06-02  9:46 ` Mathias Burén
@ 2011-06-03  7:38   ` Frank van Maarseveen
  0 siblings, 0 replies; 6+ messages in thread
From: Frank van Maarseveen @ 2011-06-03  7:38 UTC (permalink / raw)
  To: Mathias Burén; +Cc: linux-raid

On Thu, Jun 02, 2011 at 11:46:38AM +0200, Mathias Burén wrote:
> On 2 June 2011 11:36, Frank van Maarseveen <frankvm@frankvm.com> wrote:
> > The system runs FC14 with an (almost) stock 2.6.39 kernel, configured to
> > panic if it seems to hang. That's exactly what started to happen without
> > anything being logged in the normal way except over netconsole.
> >
> > /proc/mdstat:
> > Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4]
> > md3 : active raid1 sda3[0] sdb3[1]
> >      1885338488 blocks super 1.2 [2/2] [UU]
> >
> > md1 : active raid1 sda1[0] sdb1[1]
> >      33555384 blocks super 1.2 [2/2] [UU]
> >
> > kernel messages:
> >        (/etc/cron.weekly/99-raid-check kicks in)
> > Jun  2 04:04:00 janus md: data-check of RAID array md3
> > Jun  2 04:04:00 janus md: delaying data-check of md1 until md3 has finished (they share one or more physical units)
> > Jun  2 04:04:00 janus md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
> > Jun  2 04:04:00 janus md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
> > Jun  2 04:04:00 janus md: using 128k window, over a total of 1885338488 blocks.
> > Jun  2 04:55:54 janus INFO: task jbd2/md1-8:1188 blocked for more than 120 seconds.
> [...]
> 
> Same behavior if you lower this?
> 
> Jun  2 04:04:00 janus md: using maximum available idle IO bandwidth
> (but not more than 200000 KB/sec) for data-check.

Practical bandwidth is usually slightly more than 100MB/s at start
to approximately 60MB/s at the end of the disk. I tried setting
sync_speed_max at 70000kB/s. The problem seems to correlate with the
max. practical bandwidth because at the end of the data-check there were
a couple of hung task messages again, referring to postfix- and other
daemons this time. Timeline:

Jun  2 11:52:30 janus kernel: md: data-check of RAID array md3
Jun  2 11:52:30 janus kernel: md: using maximum available idle IO bandwidth (but not more than 70000 KB/sec) for data-check.
Jun  2 18:48:44 hung task
Jun  2 18:48:44 hung task
Jun  2 18:50:44 hung task
Jun  2 18:50:45 hung task
Jun  2 19:28:45 hung task
Jun  2 19:28:45 hung task
Jun  2 19:34:45 hung task
Jun  2 19:34:45 hung task
Jun  2 19:34:45 hung task
Jun  2 19:34:45 hung task
Jun  2 19:53:29 janus kernel: md: md3: data-check done.

Kernel has been booted with hung_task_panic=0.

-- 
Frank
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: 2.6.39: raid1 check blocks jbd on other md more than 120 seconds
  2011-06-02  9:36 2.6.39: raid1 check blocks jbd on other md more than 120 seconds Frank van Maarseveen
  2011-06-02  9:46 ` Mathias Burén
@ 2011-06-03 12:08 ` Thomas Harold
  2011-06-03 12:36   ` Frank van Maarseveen
  2011-06-07  8:24 ` 2.6.39[.1]: raid1 check blocks jbd on other md more than 120 seconds -- workaround Frank van Maarseveen
  2 siblings, 1 reply; 6+ messages in thread
From: Thomas Harold @ 2011-06-03 12:08 UTC (permalink / raw)
  To: Frank van Maarseveen, linux-raid

On 6/2/2011 5:36 AM, Frank van Maarseveen wrote:
> The system runs FC14 with an (almost) stock 2.6.39 kernel, configured to
> panic if it seems to hang. That's exactly what started to happen without
> anything being logged in the normal way except over netconsole.
>
> /proc/mdstat:
> Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4]
> md3 : active raid1 sda3[0] sdb3[1]
>        1885338488 blocks super 1.2 [2/2] [UU]
>
> md1 : active raid1 sda1[0] sdb1[1]
>        33555384 blocks super 1.2 [2/2] [UU]
>
> kernel messages:
> 	(/etc/cron.weekly/99-raid-check kicks in)
> Jun  2 04:04:00 janus md: data-check of RAID array md3
> Jun  2 04:04:00 janus md: delaying data-check of md1 until md3 has finished (they share one or more physical units)
> Jun  2 04:04:00 janus md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
> Jun  2 04:04:00 janus md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
> Jun  2 04:04:00 janus md: using 128k window, over a total of 1885338488 blocks.
> Jun  2 04:55:54 janus INFO: task jbd2/md1-8:1188 blocked for more than 120 seconds.
> Jun  2 04:55:54 "echo 0>  /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jun  2 04:55:54 janus jbd2/md1-8     D

That's a bug that you'll see in CentOS/RHEL in cases where there are 
multiple arrays to be checked, that use the same set of disks.  I first 
saw it in CentOS 5.5 (or maybe 5.6).

https://bugzilla.redhat.com/show_bug.cgi?id=573106

It's an annoying message, but the weekly raid sync runs fine.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: 2.6.39: raid1 check blocks jbd on other md more than 120 seconds
  2011-06-03 12:08 ` Thomas Harold
@ 2011-06-03 12:36   ` Frank van Maarseveen
  0 siblings, 0 replies; 6+ messages in thread
From: Frank van Maarseveen @ 2011-06-03 12:36 UTC (permalink / raw)
  To: Thomas Harold; +Cc: linux-raid

On Fri, Jun 03, 2011 at 08:08:01AM -0400, Thomas Harold wrote:
> On 6/2/2011 5:36 AM, Frank van Maarseveen wrote:
> >The system runs FC14 with an (almost) stock 2.6.39 kernel, configured to
> >panic if it seems to hang. That's exactly what started to happen without
> >anything being logged in the normal way except over netconsole.
> >
> >/proc/mdstat:
> >Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4]
> >md3 : active raid1 sda3[0] sdb3[1]
> >       1885338488 blocks super 1.2 [2/2] [UU]
> >
> >md1 : active raid1 sda1[0] sdb1[1]
> >       33555384 blocks super 1.2 [2/2] [UU]
> >
> >kernel messages:
> >	(/etc/cron.weekly/99-raid-check kicks in)
> >Jun  2 04:04:00 janus md: data-check of RAID array md3
> >Jun  2 04:04:00 janus md: delaying data-check of md1 until md3 has finished (they share one or more physical units)
> >Jun  2 04:04:00 janus md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
> >Jun  2 04:04:00 janus md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
> >Jun  2 04:04:00 janus md: using 128k window, over a total of 1885338488 blocks.
> >Jun  2 04:55:54 janus INFO: task jbd2/md1-8:1188 blocked for more than 120 seconds.
> >Jun  2 04:55:54 "echo 0>  /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >Jun  2 04:55:54 janus jbd2/md1-8     D
> 
> That's a bug that you'll see in CentOS/RHEL in cases where there are
> multiple arrays to be checked, that use the same set of disks.  I
> first saw it in CentOS 5.5 (or maybe 5.6).
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=573106
> 
> It's an annoying message, but the weekly raid sync runs fine.

According to the bugzilla report it was the resync itself which got stuck,
unlike what I am seeing where any random program may get stuck. Depending
on kernel configuration it may trigger a kernel panic. Last time:

Jun  2 18:48:44 janus kernel: INFO: task master:2705 blocked for more than 120 seconds.
Jun  2 18:48:44 janus kernel: INFO: task pickup:19276 blocked for more than 120 seconds.
Jun  2 18:50:44 janus kernel: INFO: task jbd2/md1-8:1187 blocked for more than 120 seconds.
Jun  2 18:50:45 janus kernel: INFO: task python:1890 blocked for more than 120 seconds.
Jun  2 19:28:45 janus kernel: INFO: task master:2705 blocked for more than 120 seconds.
Jun  2 19:28:45 janus kernel: INFO: task pickup:20589 blocked for more than 120 seconds.
Jun  2 19:34:45 janus kernel: INFO: task jbd2/md1-8:1187 blocked for more than 120 seconds.
Jun  2 19:34:45 janus kernel: INFO: task master:2705 blocked for more than 120 seconds.
Jun  2 19:34:45 janus kernel: INFO: task qmgr:2718 blocked for more than 120 seconds.
Jun  2 19:34:45 janus kernel: INFO: task pickup:20589 blocked for more than 120 seconds.

-- 
Frank

^ permalink raw reply	[flat|nested] 6+ messages in thread

* 2.6.39[.1]: raid1 check blocks jbd on other md more than 120 seconds -- workaround
  2011-06-02  9:36 2.6.39: raid1 check blocks jbd on other md more than 120 seconds Frank van Maarseveen
  2011-06-02  9:46 ` Mathias Burén
  2011-06-03 12:08 ` Thomas Harold
@ 2011-06-07  8:24 ` Frank van Maarseveen
  2 siblings, 0 replies; 6+ messages in thread
From: Frank van Maarseveen @ 2011-06-07  8:24 UTC (permalink / raw)
  To: linux-raid

I did some more testing and apparently the issue can be avoided by
choosing the deadline scheduler instead of cfq. Either configure the
kernel with

	CONFIG_DEFAULT_IOSCHED="deadline"

or select the deadline I/O scheduler per device. In my case:

	echo deadline >/sys/block/sda/queue/scheduler
	echo deadline >/sys/block/sdb/queue/scheduler


-- 
Frank

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2011-06-07  8:24 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-06-02  9:36 2.6.39: raid1 check blocks jbd on other md more than 120 seconds Frank van Maarseveen
2011-06-02  9:46 ` Mathias Burén
2011-06-03  7:38   ` Frank van Maarseveen
2011-06-03 12:08 ` Thomas Harold
2011-06-03 12:36   ` Frank van Maarseveen
2011-06-07  8:24 ` 2.6.39[.1]: raid1 check blocks jbd on other md more than 120 seconds -- workaround Frank van Maarseveen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).