linux-raid.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC] super1: error handling for super-block loading
@ 2016-05-12 17:38 Gioh Kim
  2016-05-12 18:16 ` Jes Sorensen
  0 siblings, 1 reply; 4+ messages in thread
From: Gioh Kim @ 2016-05-12 17:38 UTC (permalink / raw)
  To: Jes Sorensen, linux-raid; +Cc: Elmar Gerdes, Jinpu Wang

Hi,

I found a segfault of mdadm.
mdadm[59966]: segfault at 5c ip 000000000042c1e1 sp 00007ffe52745390 
error 4 in mdadm[400000+65000]

I disassembled mdadm.

   42c170:       41 55                   push   %r13
   42c172:       49 89 f8                mov    %rdi,%r8
   42c175:       41 54                   push   %r12
   42c177:       49 89 d4                mov    %rdx,%r12
   42c17a:       55                      push   %rbp
   42c17b:       53                      push   %rbx
   42c17c:       48 89 f3                mov    %rsi,%rbx
   42c17f:       48 83 ec 08             sub    $0x8,%rsp
   42c183:       f6 c3 01                test   $0x1,%bl
   42c186:       48 8b 6f 18             mov    0x18(%rdi),%rbp
   42c18a:       44 8b 6e 1c             mov    0x1c(%rsi),%r13d
   42c18e:       48 89 f7                mov    %rsi,%rdi
   42c191:       be 88 01 00 00          mov    $0x188,%esi
   42c196:       0f 85 b8 02 00 00       jne    42c454 <socket@plt+0x29724>
   42c19c:       40 f6 c7 02             test   $0x2,%dil
   42c1a0:       0f 85 c2 02 00 00       jne    42c468 <socket@plt+0x29738>
   42c1a6:       40 f6 c7 04             test   $0x4,%dil
   42c1aa:       0f 85 ce 02 00 00       jne    42c47e <socket@plt+0x2974e>
   42c1b0:       89 f1                   mov    %esi,%ecx
   42c1b2:       31 c0                   xor    %eax,%eax
   42c1b4:       c1 e9 03                shr    $0x3,%ecx
   42c1b7:       40 f6 c6 04             test   $0x4,%sil
   42c1bb:       f3 48 ab                rep stos %rax,%es:(%rdi)
   42c1be:       74 0a                   je     42c1ca <socket@plt+0x2949a>
   42c1c0:       c7 07 00 00 00 00       movl   $0x0,(%rdi)
   42c1c6:       48 83 c7 04             add    $0x4,%rdi
   42c1ca:       40 f6 c6 02             test   $0x2,%sil
   42c1ce:       74 09                   je     42c1d9 <socket@plt+0x294a9>
   42c1d0:       66 c7 07 00 00          movw   $0x0,(%rdi)
   42c1d5:       48 83 c7 02             add    $0x2,%rdi
   42c1d9:       83 e6 01                and    $0x1,%esi
   42c1dc:       74 03                   je     42c1e1 <socket@plt+0x294b1>
   42c1de:       c6 07 00                movb   $0x0,(%rdi)
   42c1e1:       8b 45 5c                mov    0x5c(%rbp),%eax

I think this is getinfo_super1() because I rebuilt mdadm with debugging 
symbol and found exactly the same assemble code.
A NULL pointer referencing is generated by reading st->sb->raid_disks 
(%rdi=st, %rbp=sb=NULL).
I found there is no error handling if Grow_addbitmap fails to load 
super-block.

I'm not sure yet why mdadm failed to load super-block of disks.
I checked the kernel log and found I/O error from disks.
Anyway mdadm needs to handle that error case.

Please review following patch.

------------------------------------------- 8< 
-------------------------------------------------------------
 From 8cacf56b2d630c7e74bad942779ff7ed5f516d26 Mon Sep 17 00:00:00 2001
From: Gioh Kim <gi-oh.kim@profitbricks.com>
Date: Thu, 12 May 2016 19:09:45 +0200
Subject: [PATCH] super1: error handling for super-block loading

Loading super-block can fail if all sub-devices are faulty
or have I/O errors.

Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
---
  Grow.c | 7 ++++++-
  1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/Grow.c b/Grow.c
index f58c753..fa08522 100755
--- a/Grow.c
+++ b/Grow.c
@@ -389,7 +389,7 @@ int Grow_addbitmap(char *devname, int fd, struct 
context *c, struct shape *s)
      }
      if (strcmp(s->bitmap_file, "internal") == 0 ||
          strcmp(s->bitmap_file, "clustered") == 0) {
-        int rv;
+        int rv = 0;
          int d;
          int offset_setable = 0;
          struct mdinfo *mdi;
@@ -419,6 +419,7 @@ int Grow_addbitmap(char *devname, int fd, struct 
context *c, struct shape *s)
                  if (fd2 < 0)
                      continue;
                  if (st->ss->load_super(st, fd2, NULL)==0) {
+                    rv++;
                      if (st->ss->add_internal_bitmap(
                              st,
                              &s->bitmap_chunk, c->delay, s->write_behind,
@@ -435,6 +436,10 @@ int Grow_addbitmap(char *devname, int fd, struct 
context *c, struct shape *s)
                  close(fd2);
              }
          }
+        if (rv == 0) {
+            pr_err("failed to load super-block.\n");
+            return 1;
+        }
          if (offset_setable) {
              st->ss->getinfo_super(st, mdi, NULL);
              sysfs_init(mdi, fd, NULL);
-- 
2.5.0



^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [RFC] super1: error handling for super-block loading
  2016-05-12 17:38 [RFC] super1: error handling for super-block loading Gioh Kim
@ 2016-05-12 18:16 ` Jes Sorensen
  2016-05-12 19:43   ` Jes Sorensen
  0 siblings, 1 reply; 4+ messages in thread
From: Jes Sorensen @ 2016-05-12 18:16 UTC (permalink / raw)
  To: Gioh Kim; +Cc: linux-raid, Elmar Gerdes, Jinpu Wang

Gioh Kim <gi-oh.kim@profitbricks.com> writes:
> I'm not sure yet why mdadm failed to load super-block of disks.
> I checked the kernel log and found I/O error from disks.
> Anyway mdadm needs to handle that error case.
>
> Please review following patch.
>
> ------------------------------------------- 8<
> -------------------------------------------------------------
> From 8cacf56b2d630c7e74bad942779ff7ed5f516d26 Mon Sep 17 00:00:00 2001
> From: Gioh Kim <gi-oh.kim@profitbricks.com>
> Date: Thu, 12 May 2016 19:09:45 +0200
> Subject: [PATCH] super1: error handling for super-block loading
>
> Loading super-block can fail if all sub-devices are faulty
> or have I/O errors.
>
> Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
> ---
>  Grow.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/Grow.c b/Grow.c
> index f58c753..fa08522 100755
> --- a/Grow.c
> +++ b/Grow.c
> @@ -389,7 +389,7 @@ int Grow_addbitmap(char *devname, int fd, struct
> context *c, struct shape *s)
>      }
>      if (strcmp(s->bitmap_file, "internal") == 0 ||
>          strcmp(s->bitmap_file, "clustered") == 0) {
> -        int rv;
> +        int rv = 0;
>          int d;
>          int offset_setable = 0;
>          struct mdinfo *mdi;
> @@ -419,6 +419,7 @@ int Grow_addbitmap(char *devname, int fd, struct
> context *c, struct shape *s)
>                  if (fd2 < 0)
>                      continue;
>                  if (st->ss->load_super(st, fd2, NULL)==0) {
> +                    rv++;
>                      if (st->ss->add_internal_bitmap(
>                              st,
>                              &s->bitmap_chunk, c->delay, s->write_behind,
> @@ -435,6 +436,10 @@ int Grow_addbitmap(char *devname, int fd, struct
> context *c, struct shape *s)
>                  close(fd2);
>              }
>          }
> +        if (rv == 0) {
> +            pr_err("failed to load super-block.\n");
> +            return 1;
> +        }
>          if (offset_setable) {
>              st->ss->getinfo_super(st, mdi, NULL);
>              sysfs_init(mdi, fd, NULL);

This is definitly not the right way to solve this problem. Error codes
are negative, and zero should _always_ mean success. Here you suddenly
introduced a new meaning to positive values of rv.

I agree handling the error case needs to be fixed, so a better way to
solve this would be to bail out when the load_super() call fails and
stop there, the same way it does if add_internal_bitmap() fails.

Ie. make it do something like this:

	rv = st->ss->load_super(st, fd2, NULL)==0) {
	if (!rv) {
		if (st->ss->add_internal_bitmap(
....
	} else {
		pr_err("failed to load super-block.\n");
		close(fd2);
		return 1;
	}

Actually looking at that code, there's a couple of things to do to clean
it up and make it more readable.

Cheers,
Jes

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [RFC] super1: error handling for super-block loading
  2016-05-12 18:16 ` Jes Sorensen
@ 2016-05-12 19:43   ` Jes Sorensen
  2016-05-13 15:15     ` Gioh Kim
  0 siblings, 1 reply; 4+ messages in thread
From: Jes Sorensen @ 2016-05-12 19:43 UTC (permalink / raw)
  To: Gioh Kim; +Cc: linux-raid, Elmar Gerdes, Jinpu Wang

Jes Sorensen <Jes.Sorensen@redhat.com> writes:
> Gioh Kim <gi-oh.kim@profitbricks.com> writes:
> This is definitly not the right way to solve this problem. Error codes
> are negative, and zero should _always_ mean success. Here you suddenly
> introduced a new meaning to positive values of rv.
>
> I agree handling the error case needs to be fixed, so a better way to
> solve this would be to bail out when the load_super() call fails and
> stop there, the same way it does if add_internal_bitmap() fails.
>
> Ie. make it do something like this:
>
> 	rv = st->ss->load_super(st, fd2, NULL)==0) {
> 	if (!rv) {
> 		if (st->ss->add_internal_bitmap(
> ....
> 	} else {
> 		pr_err("failed to load super-block.\n");
> 		close(fd2);
> 		return 1;
> 	}
>
> Actually looking at that code, there's a couple of things to do to clean
> it up and make it more readable.

So I started looking at this, and ended up making a bunch of changes
which should both resolve the issue you encountered and also cleans up
this part of the code. I had to change add_internal_bitmap() to return 0
on success, in order to get it cleaned up. Looks like we have a pile of
inconsistencies on the 0 vs 1 as success returns ... guess we won't get
bored.

I just pushed this into git - let me know if it doesn't work for you.

Cheers,
Jes


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [RFC] super1: error handling for super-block loading
  2016-05-12 19:43   ` Jes Sorensen
@ 2016-05-13 15:15     ` Gioh Kim
  0 siblings, 0 replies; 4+ messages in thread
From: Gioh Kim @ 2016-05-13 15:15 UTC (permalink / raw)
  To: Jes Sorensen; +Cc: linux-raid, Elmar Gerdes, Jinpu Wang



On 12.05.2016 21:43, Jes Sorensen wrote:
> Jes Sorensen <Jes.Sorensen@redhat.com> writes:
>> Gioh Kim <gi-oh.kim@profitbricks.com> writes:
>> This is definitly not the right way to solve this problem. Error codes
>> are negative, and zero should _always_ mean success. Here you suddenly
>> introduced a new meaning to positive values of rv.
>>
>> I agree handling the error case needs to be fixed, so a better way to
>> solve this would be to bail out when the load_super() call fails and
>> stop there, the same way it does if add_internal_bitmap() fails.
>>
>> Ie. make it do something like this:
>>
>> 	rv = st->ss->load_super(st, fd2, NULL)==0) {
>> 	if (!rv) {
>> 		if (st->ss->add_internal_bitmap(
>> ....
>> 	} else {
>> 		pr_err("failed to load super-block.\n");
>> 		close(fd2);
>> 		return 1;
>> 	}
>>
>> Actually looking at that code, there's a couple of things to do to clean
>> it up and make it more readable.
> So I started looking at this, and ended up making a bunch of changes
> which should both resolve the issue you encountered and also cleans up
> this part of the code. I had to change add_internal_bitmap() to return 0
> on success, in order to get it cleaned up. Looks like we have a pile of
> inconsistencies on the 0 vs 1 as success returns ... guess we won't get
> bored.
>
> I just pushed this into git - let me know if it doesn't work for you.

GREATE!!

I pulled your patch and checked it works fine.
Following is how I tested.

1. I set one device as faulty.
# cat /proc/mdstat
Personalities : [raid1]
md101 : active raid1 dm-6[1] dm-1[0](F)
       16384 blocks super 1.2 [2/1] [_U]

unused devices: <none>

2. I disconnected storage.

3. recover bitmap
# mdadm --grow --bitmap=internal /dev/md101
Segmentation fault

4. new mdadm including your patch
# mdadm --grow --bitmap=internal /dev/md101
failed to load super-block.
# echo $?
1

Thank you very much.
Have a nice weekend!

-- 
Best regards,
Gioh Kim


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2016-05-13 15:15 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-05-12 17:38 [RFC] super1: error handling for super-block loading Gioh Kim
2016-05-12 18:16 ` Jes Sorensen
2016-05-12 19:43   ` Jes Sorensen
2016-05-13 15:15     ` Gioh Kim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).