3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x SSD)

* 3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x SSD)
@ 2013-07-21 10:26 Justin Piszcz
  2013-07-21 23:02 ` NeilBrown
  0 siblings, 1 reply; 7+ messages in thread
From: Justin Piszcz @ 2013-07-21 10:26 UTC (permalink / raw)
  To: linux-kernel, linux-raid

Hi,

When I write "repair" to sync_action on an MD RAID-1 array, the resync speed
drops and stays like this (below) for hours.

The system is then completely unresponsive to user input.  I have replaced a
failing SSD; however, after a check, mismatch_cnt seems to increase over
time.  When I run repair, the system stops responding to user input.  Has
anyone else run into this issue with a RAID-1 volume (2 x SSD) using 0.90
metadata?  Long ago I used this same configuration with two physical disks
and there was never a problem.

Even from a root shell that I had left open, writing idle has no effect and
does not stop the resync:
# echo idle > /sys/devices/virtual/block/md1/md/sync_action

Every 1.0s: cat /proc/mdstat                    Sun Jul 21 06:15:38 2013

Personalities : [raid1]
md1 : active raid1 sdc2[0] sdb2[1]
      233381376 blocks [2/2] [UU]
      [>....................]  resync =  0.0% (151616/233381376) finish=36171.5min speed=107K/sec

md0 : active raid1 sdc1[0] sdb1[1]
      1048512 blocks [2/2] [UU]

unused devices: <none>

10 minutes later:

      233381376 blocks [2/2] [UU]
      [>....................]  resync =  0.0% (151616/233381376) finish=52219.3min speed=74K/sec

The position where it hangs (151616 here) has been different each time I
watched it; it does not appear to hang at the same block each time.

Justin.
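
The stall reported above (same position, falling speed) can be detected from
two saved /proc/mdstat samples. The following is a minimal sketch, assuming
POSIX sh and awk. The helper names md_progress and md_stalled are hypothetical
and are not part of mdadm or the kernel. md_progress reads mdstat-format text
on stdin and prints the current position, the total, and the speed for one
array; md_stalled succeeds when the position is identical in both samples.

```shell
#!/bin/sh
# md_progress DEV: read mdstat-format text on stdin and print
# "<position> <total> <speed>" for DEV. Prints nothing when DEV has no
# resync/check/repair in progress. Also copes with the progress line being
# wrapped, as in the samples in this thread.
md_progress() {
    awk -v dev="$1" '
        $2 == ":" { in_dev = ($1 == dev); next }  # "md1 : active ..." starts a stanza
        NF == 0   { in_dev = 0; next }            # a blank line ends it
        in_dev {
            # The position counter looks like (151616/233381376).
            if (match($0, /\([0-9]+\/[0-9]+\)/)) {
                split(substr($0, RSTART + 1, RLENGTH - 2), p, "/")
                cur = p[1]; total = p[2]
            }
            if (match($0, /speed=[0-9]+K\/sec/))
                speed = substr($0, RSTART + 6, RLENGTH - 6)
        }
        END { if (cur != "") print cur, total, speed }
    '
}

# md_stalled DEV SAMPLE1 SAMPLE2: succeed if DEV shows the same position in
# both saved samples, i.e. no progress was made between them.
md_stalled() {
    a=$(md_progress "$1" < "$2" | cut -d' ' -f1)
    b=$(md_progress "$1" < "$3" | cut -d' ' -f1)
    [ -n "$a" ] && [ "$a" = "$b" ]
}
```

One possible use is to save /proc/mdstat to a file, wait ten minutes, save it
again, and then run md_stalled md1 with the two file names. How long to wait
between samples is a judgment call and is not taken from this thread.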

* Re: 3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x SSD)
  2013-07-21 10:26 3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x SSD) Justin Piszcz
@ 2013-07-21 23:02 ` NeilBrown
  2013-07-25 23:10   ` Justin Piszcz
  0 siblings, 1 reply; 7+ messages in thread
From: NeilBrown @ 2013-07-21 23:02 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-kernel, linux-raid

On Sun, 21 Jul 2013 06:26:55 -0400 "Justin Piszcz" <jpiszcz@lucidpixels.com>
wrote:

> Hi,
> 
> When I write "repair" to sync_action on an MD RAID-1 array, the resync speed
> drops and stays like this (below) for hours.
> 
> The system is then completely unresponsive to user input.  I have replaced a
> failing SSD; however, after a check, mismatch_cnt seems to increase over
> time.  When I run repair, the system stops responding to user input.  Has
> anyone else run into this issue with a RAID-1 volume (2 x SSD) using 0.90
> metadata?  Long ago I used this same configuration with two physical disks
> and there was never a problem.
> 
> Even from a root shell that I had left open, writing idle has no effect and
> does not stop the resync:
> # echo idle > /sys/devices/virtual/block/md1/md/sync_action
> 
> Every 1.0s: cat /proc/mdstat                    Sun Jul 21 06:15:38 2013
> 
> Personalities : [raid1]
> md1 : active raid1 sdc2[0] sdb2[1]
>       233381376 blocks [2/2] [UU]
>       [>....................]  resync =  0.0% (151616/233381376) finish=36171.5min speed=107K/sec
> 
> md0 : active raid1 sdc1[0] sdb1[1]
>       1048512 blocks [2/2] [UU]
> 
> unused devices: <none>
> 
> 10 minutes later:
> 
>       233381376 blocks [2/2] [UU]
>       [>....................]  resync =  0.0% (151616/233381376) finish=52219.3min speed=74K/sec
> 
> The position where it hangs (151616 here) has been different each time I
> watched it; it does not appear to hang at the same block each time.
> 

Hi Justin,
 this is a known bug.  Fix has been accepted into mainline for 3.11-rc2.
 Hopefully it will get into 3.10.3 (too late for 3.10.2).

NeilBrown

* RE: 3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x SSD)
  2013-07-21 23:02 ` NeilBrown
@ 2013-07-25 23:10   ` Justin Piszcz
  2013-07-26  0:35     ` NeilBrown
  0 siblings, 1 reply; 7+ messages in thread
From: Justin Piszcz @ 2013-07-25 23:10 UTC (permalink / raw)
  To: 'NeilBrown'; +Cc: linux-kernel, linux-raid



-----Original Message-----
From: NeilBrown [mailto:neilb@suse.de] 
Sent: Sunday, July 21, 2013 7:03 PM
To: Justin Piszcz
Cc: linux-kernel@vger.kernel.org; linux-raid@vger.kernel.org
Subject: Re: 3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x
SSD)

> Hi Justin,
>  this is a known bug.  Fix has been accepted into mainline for 3.11-rc2.
>  Hopefully it will get into 3.10.3 (too late for 3.10.2).

> NeilBrown


Hi Neil,

Did the fix by chance make it into 3.10.3?

The same issue occurs with 3.10.3 for me as well:

Every 1.0s: cat /proc/mdstat                    Thu Jul 25 19:09:46 2013

Personalities : [raid1]
md1 : active raid1 sdc2[0] sdb2[1]
      233381376 blocks [2/2] [UU]
      [>....................]  resync =  0.0% (151488/233381376) finish=32045.3min speed=121K/sec

md0 : active raid1 sdc1[0] sdb1[1]
      1048512 blocks [2/2] [UU]

unused devices: <none>



Justin.

* Re: 3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x SSD)
  2013-07-25 23:10   ` Justin Piszcz
@ 2013-07-26  0:35     ` NeilBrown
  2013-07-26  9:56       ` Justin Piszcz
  0 siblings, 1 reply; 7+ messages in thread
From: NeilBrown @ 2013-07-26  0:35 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-kernel, linux-raid

On Thu, 25 Jul 2013 19:10:50 -0400 "Justin Piszcz" <jpiszcz@lucidpixels.com>
wrote:

> 
> 
> -----Original Message-----
> From: NeilBrown [mailto:neilb@suse.de] 
> Sent: Sunday, July 21, 2013 7:03 PM
> To: Justin Piszcz
> Cc: linux-kernel@vger.kernel.org; linux-raid@vger.kernel.org
> Subject: Re: 3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x
> SSD)
> 
> > Hi Justin,
> >  this is a known bug.  Fix has been accepted into mainline for 3.11-rc2.
> >  Hopefully it will get into 3.10.3 (too late for 3.10.2).
> 
> > NeilBrown
> 
> 
> Hi Neil,
> 
> Did the fix by chance make it into 3.10.3?

No, it looks like it missed again.  I gather there was a large inflow of
patches for -stable in the 3.11-rc1 merge window and Greg has been processing
them in batches.  Hopefully in 3.10.4.

The relevant patch is commit 30bc9b53878a9921b02e3 in mainline.

NeilBrown



> 
> The same issue occurs with 3.10.3 for me as well:
> 
> Every 1.0s: cat /proc/mdstat                    Thu Jul 25 19:09:46 2013
> 
> Personalities : [raid1]
> md1 : active raid1 sdc2[0] sdb2[1]
>       233381376 blocks [2/2] [UU]
>       [>....................]  resync =  0.0% (151488/233381376) finish=32045.3min speed=121K/sec
> 
> md0 : active raid1 sdc1[0] sdb1[1]
>       1048512 blocks [2/2] [UU]
> 
> unused devices: <none>
> 
> 
> 
> Justin.


* RE: 3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x SSD)
  2013-07-26  0:35     ` NeilBrown
@ 2013-07-26  9:56       ` Justin Piszcz
  2013-07-29  5:56         ` NeilBrown
  0 siblings, 1 reply; 7+ messages in thread
From: Justin Piszcz @ 2013-07-26  9:56 UTC (permalink / raw)
  To: 'NeilBrown'; +Cc: linux-kernel, linux-raid



-----Original Message-----
From: NeilBrown [mailto:neilb@suse.de] 
Sent: Thursday, July 25, 2013 8:36 PM
To: Justin Piszcz
Cc: linux-kernel@vger.kernel.org; linux-raid@vger.kernel.org
Subject: Re: 3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x
SSD)

On Thu, 25 Jul 2013 19:10:50 -0400 "Justin Piszcz" <jpiszcz@lucidpixels.com>
wrote:

> Did the fix by chance make it into 3.10.3?

No, it looks like it missed again.  I gather there was a large inflow of
patches for -stable in the 3.11-rc1 merge window and Greg has been processing
them in batches.  Hopefully in 3.10.4.

The relevant patch is commit 30bc9b53878a9921b02e3 in mainline.

NeilBrown

--

Method to get the patch via git and apply it to the kernel:

$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
$ git log |grep 30bc9b53878a9921b02e3
commit 30bc9b53878a9921b02e3b5bc4283ac1c6de102a
$ git show 30bc9b53878a9921b02e3b5bc4283ac1c6de102a > /tmp/a
# patch -p1 < /tmp/a
patching file drivers/md/raid1.c
Hunk #1 succeeded at 1848 (offset -1 lines).
Hunk #2 succeeded at 1886 (offset -1 lines).
Hunk #3 succeeded at 1915 (offset -1 lines).

Rebooted and tested: success, thanks!
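
The git log | grep step above confirms that the commit exists in the cloned
tree. A related question, whether a given tag or branch already carries the
fix, can be sketched as follows. Assumptions: has_fix is a hypothetical helper
name; it relies on git merge-base --is-ancestor and on the stable-tree
convention of quoting the upstream commit hash in a backport's commit message,
so a backport that omits the hash would be missed.

```shell
#!/bin/sh
# has_fix REPO UPSTREAM_SHA REF
# Succeed if REF contains the upstream commit itself, or contains a commit
# whose message mentions the upstream hash (as stable backports usually do).
has_fix() {
    repo=$1 sha=$2 ref=$3
    # Case 1: the commit is an ancestor of REF (mainline and trees based on it).
    if git -C "$repo" merge-base --is-ancestor "$sha" "$ref" 2>/dev/null; then
        return 0
    fi
    # Case 2: a backport with a new hash that quotes the upstream hash.
    [ -n "$(git -C "$repo" log -1 --format=%H --grep="$sha" "$ref" 2>/dev/null)" ]
}
```

For example, has_fix could be called with the path of the clone, the full hash
30bc9b53878a9921b02e3b5bc4283ac1c6de102a, and a tag such as v3.10.3. The
result for any particular tag is not asserted here.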

One follow-up question:
$ cat /sys/block/md1/md/mismatch_cnt
314112
-> On a live RAID-1 (root filesystem) without swap, is it normal to have
such a high mismatch_cnt even after a repair?

First repair:
Fri Jul 26 05:30:47 EDT 2013: The meta-device /dev/md1 has mismatch_cnt 314112 sectors.
Second repair:
Fri Jul 26 05:30:47 EDT 2013: The meta-device /dev/md1 has mismatch_cnt 313600 sectors.

Should I be concerned?


Testing the patch:

Personalities : [raid1]
md1 : active raid1 sdc2[0] sdb2[1]
      233381376 blocks [2/2] [UU]
      [>....................]  check =  0.3% (838976/233381376) finish=9.2min speed=419488K/sec

md0 : active raid1 sdc1[0] sdb1[1]
      1048512 blocks [2/2] [UU]

Personalities : [raid1]
md1 : active raid1 sdc2[0] sdb2[1]
      233381376 blocks [2/2] [UU]
      [===============>.....]  check = 77.5% (180889856/233381376) finish=2.5min speed=342654K/sec

md0 : active raid1 sdc1[0] sdb1[1]
      1048512 blocks [2/2] [UU]

Personalities : [raid1]
md1 : active raid1 sdc2[0] sdb2[1]
      233381376 blocks [2/2] [UU]

md0 : active raid1 sdc1[0] sdb1[1]
      1048512 blocks [2/2] [UU]


Justin.

* Re: 3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x SSD)
  2013-07-26  9:56       ` Justin Piszcz
@ 2013-07-29  5:56         ` NeilBrown
  2013-07-29  7:33           ` Justin Piszcz
  0 siblings, 1 reply; 7+ messages in thread
From: NeilBrown @ 2013-07-29  5:56 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-kernel, linux-raid

On Fri, 26 Jul 2013 05:56:51 -0400 "Justin Piszcz" <jpiszcz@lucidpixels.com>
wrote:

> 
> 
> -----Original Message-----
> From: NeilBrown [mailto:neilb@suse.de] 
> Sent: Thursday, July 25, 2013 8:36 PM
> To: Justin Piszcz
> Cc: linux-kernel@vger.kernel.org; linux-raid@vger.kernel.org
> Subject: Re: 3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x
> SSD)
> 
> On Thu, 25 Jul 2013 19:10:50 -0400 "Justin Piszcz" <jpiszcz@lucidpixels.com>
> wrote:
> 
> > Did the fix by chance make it into 3.10.3?
> 
> No, it looks like it missed again.  I gather there was a large inflow of
> patches for -stable in the 3.11-rc1 merge window and Greg has been processing
> them in batches.  Hopefully in 3.10.4.
> 
> The relevant patch is commit 30bc9b53878a9921b02e3 in mainline.
> 
> NeilBrown
> 
> --
> 
> Method to get the patch via git and apply it to the kernel:
> 
> $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
> $ git log |grep 30bc9b53878a9921b02e3
> commit 30bc9b53878a9921b02e3b5bc4283ac1c6de102a
> $ git show 30bc9b53878a9921b02e3b5bc4283ac1c6de102a > /tmp/a
> # patch -p1 < /tmp/a
> patching file drivers/md/raid1.c
> Hunk #1 succeeded at 1848 (offset -1 lines).
> Hunk #2 succeeded at 1886 (offset -1 lines).
> Hunk #3 succeeded at 1915 (offset -1 lines).
> 
> Rebooted and tested: success, thanks!
> 
> One follow-up question:
> $ cat /sys/block/md1/md/mismatch_cnt
> 314112
> -> On a live RAID-1 (root filesystem) without swap, is it normal to have
> such a high mismatch_cnt even after a repair?
> 
> First repair:
> Fri Jul 26 05:30:47 EDT 2013: The meta-device /dev/md1 has mismatch_cnt 314112 sectors.
> Second repair:
> Fri Jul 26 05:30:47 EDT 2013: The meta-device /dev/md1 has mismatch_cnt 313600 sectors.

Those two lines have exactly the same timestamp and array name but different
mismatch counts.  That is very strange.

Did you run two consecutive 'repair's on the one array, both with the patched
kernel?  If so, and the second mismatch_cnt wasn't zero (or close to it,
maybe), then something is definitely wrong.

NeilBrown
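
The check described above, two consecutive repairs with the second expected to
report zero or close to it, can be sketched as a small helper. This is an
illustration under stated assumptions, not the script used in this thread:
md_scrub is a hypothetical name, it must run as root against a real array, and
it assumes that sync_action reads back as "idle" once the pass has completed.

```shell
#!/bin/sh
# md_scrub ACTION MDDIR [POLL_SECONDS]
# Write ACTION (check or repair) to MDDIR/sync_action, wait until the array
# reports idle again, then print the resulting mismatch_cnt.
# MDDIR is the md sysfs directory, e.g. /sys/block/md1/md.
md_scrub() {
    action=$1 mddir=$2 poll=${3:-60}
    echo "$action" > "$mddir/sync_action" || return 1
    # Poll until the requested pass has finished.
    while [ "$(cat "$mddir/sync_action")" != "idle" ]; do
        sleep "$poll"
    done
    cat "$mddir/mismatch_cnt"
}

# Example (as root), comparing two consecutive repairs:
#   first=$(md_scrub repair /sys/block/md1/md)
#   second=$(md_scrub repair /sys/block/md1/md)
#   echo "first=$first second=$second"
```

Capturing each count directly after its own pass avoids the ambiguity of two
log lines that carry the same timestamp.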


> 
> Should I be concerned?
> 
> 
> Testing the patch:
> 
> Personalities : [raid1]
> md1 : active raid1 sdc2[0] sdb2[1]
>       233381376 blocks [2/2] [UU]
>       [>....................]  check =  0.3% (838976/233381376) finish=9.2min speed=419488K/sec
> 
> md0 : active raid1 sdc1[0] sdb1[1]
>       1048512 blocks [2/2] [UU]
> 
> Personalities : [raid1]
> md1 : active raid1 sdc2[0] sdb2[1]
>       233381376 blocks [2/2] [UU]
>       [===============>.....]  check = 77.5% (180889856/233381376) finish=2.5min speed=342654K/sec
> 
> md0 : active raid1 sdc1[0] sdb1[1]
>       1048512 blocks [2/2] [UU]
> 
> Personalities : [raid1]
> md1 : active raid1 sdc2[0] sdb2[1]
>       233381376 blocks [2/2] [UU]
> 
> md0 : active raid1 sdc1[0] sdb1[1]
>       1048512 blocks [2/2] [UU]
> 
> 
> Justin.
> 


* RE: 3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x SSD)
  2013-07-29  5:56         ` NeilBrown
@ 2013-07-29  7:33           ` Justin Piszcz
  0 siblings, 0 replies; 7+ messages in thread
From: Justin Piszcz @ 2013-07-29  7:33 UTC (permalink / raw)
  To: 'NeilBrown'; +Cc: linux-kernel, linux-raid



-----Original Message-----
From: NeilBrown [mailto:neilb@suse.de] 
Sent: Monday, July 29, 2013 1:57 AM
To: Justin Piszcz
Cc: linux-kernel@vger.kernel.org; linux-raid@vger.kernel.org
Subject: Re: 3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x
SSD)

On Fri, 26 Jul 2013 05:56:51 -0400 "Justin Piszcz" <jpiszcz@lucidpixels.com>
wrote:

[..]

Further testing shows all is ok now:


Sun Nov 25 02:12:03 EST 2012: Parity check(s) running, sleeping 60 seconds...
Sun Nov 25 02:13:03 EST 2012: Parity check(s) running, sleeping 60 seconds...
Sun Nov 25 02:14:03 EST 2012: cat /sys/block/md0/md/mismatch_cnt
Sun Nov 25 02:14:03 EST 2012: 0
Sun Nov 25 02:14:03 EST 2012: cat /sys/block/md1/md/mismatch_cnt
Sun Nov 25 02:14:03 EST 2012: 0
Sun Nov 25 02:14:03 EST 2012: The meta-device /dev/md0 has no mismatched sectors.
Sun Nov 25 02:14:04 EST 2012: The meta-device /dev/md1 has no mismatched sectors.
Sun Nov 25 02:14:05 EST 2012: All devices are clean...
Sun Nov 25 02:14:05 EST 2012: cat /sys/block/md0/md/mismatch_cnt
Sun Nov 25 02:14:05 EST 2012: 0
Sun Nov 25 02:14:05 EST 2012: cat /sys/block/md1/md/mismatch_cnt
Sun Nov 25 02:14:05 EST 2012: 0

Thanks for your help.

Justin.
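
A per-array report similar to the log above can be sketched as follows. The
script that produced that log is not shown in this thread, so this is only an
approximation: md_report is a hypothetical name, and the optional argument
(the sysfs block directory) exists only so the function can be pointed at a
test directory.

```shell
#!/bin/sh
# md_report [SYSFS_BLOCK_DIR]
# Print one line per md array with its mismatch_cnt.
# Returns 0 if every array reports 0 mismatched sectors, 1 otherwise.
md_report() {
    root=${1:-/sys/block}
    dirty=0
    for f in "$root"/md*/md/mismatch_cnt; do
        [ -r "$f" ] || continue          # no arrays, or glob did not match
        dev=${f#"$root"/}                # strip the leading directory
        dev=${dev%%/*}                   # keep only the array name, e.g. md1
        cnt=$(cat "$f")
        if [ "$cnt" -eq 0 ]; then
            echo "/dev/$dev has no mismatched sectors."
        else
            echo "/dev/$dev has mismatch_cnt $cnt sectors."
            dirty=1
        fi
    done
    return "$dirty"
}
```

The non-zero return value makes the function usable from cron or a monitoring
wrapper, for example to send mail only when some array is not clean.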

Thread overview: 7+ messages
2013-07-21 10:26 3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x SSD) Justin Piszcz
2013-07-21 23:02 ` NeilBrown
2013-07-25 23:10   ` Justin Piszcz
2013-07-26  0:35     ` NeilBrown
2013-07-26  9:56       ` Justin Piszcz
2013-07-29  5:56         ` NeilBrown
2013-07-29  7:33           ` Justin Piszcz
