All of lore.kernel.org
 help / color / mirror / Atom feed
* grub-probe crashes on a linux softraid 1 with one disk beginning to sync
@ 2008-07-28 23:05 Felix Zielcke
  2008-07-29 20:20 ` [PATCH] show an error instead of segfaulting on grub-probe -t partmap on a unsynced raid Felix Zielcke
  0 siblings, 1 reply; 8+ messages in thread
From: Felix Zielcke @ 2008-07-28 23:05 UTC (permalink / raw)
  To: grub-devel

grub-probe --device-map=/boot/grub/device.map --target=partmap --device /dev/md0
showed pc
and then segfaulted

/dev/md0 is raid1 with 2 disks but the second was just added to it, so
it's syncing
I try to trace this down and maybe provide a patch but probable someone
else here is better and faster on this then me :)
Even as .tar.bz2 the 2 files are too big for the list so you'll find them at [0]
in case you need.


#0  0x00000000004015f9 in probe_partmap (disk=0x0)
at /home/fz/grub/grub2-1.96+20080724/util/grub-probe.c:87
87	  if (disk->partition == NULL)
(gdb) bt full
#0  0x00000000004015f9 in probe_partmap (disk=0x0)
at /home/fz/grub/grub2-1.96+20080724/util/grub-probe.c:87
	name = 0x0
	underscore = 0x7f7916e9ea86 "\203=\233�-"
#1  0x00000000004018ac in probe (path=0x0, device_name=0x7fff1f399a42
"/dev/md0") at /home/fz/grub/grub2-1.96+20080724/util/grub-probe.c:182
	list = (grub_disk_memberlist_t) 0x1f341d0
	tmp = (grub_disk_memberlist_t) 0x1f341d0
	drive_name = 0x1f34080 "md0"
	grub_path = 0x0
	filebuf_via_grub = 0x0
	filebuf_via_sys = 0x0
	abstraction_type = 2
	dev = (grub_device_t) 0x1f34190
	fs = (grub_fs_t) 0x0
#2  0x0000000000401d98 in main (argc=5, argv=0x7fff1f397888)
at /home/fz/grub/grub2-1.96+20080724/util/grub-probe.c:370
	dev_map = 0x1f34010 "/boot/grub/device.map"
	argument = 0x7fff1f399a42 "/dev/md0"

[0] http://rapidshare.com/files/133187302/coredump.tar.bz2.html




^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH] show an error instead of segfaulting on grub-probe -t partmap on a unsynced raid
  2008-07-28 23:05 grub-probe crashes on a linux softraid 1 with one disk beginning to sync Felix Zielcke
@ 2008-07-29 20:20 ` Felix Zielcke
  2008-07-30 10:37   ` Robert Millan
  0 siblings, 1 reply; 8+ messages in thread
From: Felix Zielcke @ 2008-07-29 20:20 UTC (permalink / raw)
  To: The development of GRUB 2

>#0  0x00000000004015f9 in probe_partmap (disk=0x0)
>at /home/fz/grub/grub2-1.96+20080724/util/grub-probe.c:87
>87   if (disk->partition == NULL)


This is really a good way to get more into this coding and debugging C stuff in general and especially, into grub2's code.

As far as I could find out now (more with trying it out then reading the code) grub itself can handle a raid1
with one missing device or an unsynced one.
Here's a patch which print an error on "grub-probe -t partmap /" instead of segfaulting.
I have to find out why "grub-probe -t fs /" fails with one missing device but succeeds with an unsynced one.
First I thought I let it continue and just print out a warning, but there's no grub_util_warn() and probable anyway better
to reject installing grub on a not fully synced raid.

Please comment :)


Index: util/grub-probe.c
===================================================================
--- util/grub-probe.c   (Revision 1747)
+++ util/grub-probe.c   (Arbeitskopie)
@@ -179,6 +179,8 @@
        list = dev->disk->dev->memberlist (dev->disk);
       while (list)
        {
+         if (! list->disk)
+           grub_util_error ("At least one underlying device for %s is missing.",dev->disk->name);
          probe_partmap (list->disk);
          tmp = list->next;
          free (list);
 



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] show an error instead of segfaulting on grub-probe -t partmap on a unsynced raid
  2008-07-29 20:20 ` [PATCH] show an error instead of segfaulting on grub-probe -t partmap on a unsynced raid Felix Zielcke
@ 2008-07-30 10:37   ` Robert Millan
  2008-07-30 11:39     ` Felix Zielcke
  0 siblings, 1 reply; 8+ messages in thread
From: Robert Millan @ 2008-07-30 10:37 UTC (permalink / raw)
  To: The development of GRUB 2

On Tue, Jul 29, 2008 at 10:20:29PM +0200, Felix Zielcke wrote:
> >#0  0x00000000004015f9 in probe_partmap (disk=0x0)
> >at /home/fz/grub/grub2-1.96+20080724/util/grub-probe.c:87
> >87   if (disk->partition == NULL)
> 
> 
> This is really a good way to get more into this coding and debugging C 
> stuff in general and especially, into grub2's code.
> 
> As far as I could find out now (more with trying it out then reading the 
> code) grub itself can handle a raid1
> with one missing device or an unsynced one.
> Here's a patch which print an error on "grub-probe -t partmap /" instead of 
> segfaulting.
> I have to find out why "grub-probe -t fs /" fails with one missing device 
> but succeeds with an unsynced one.
> First I thought I let it continue and just print out a warning, but there's 
> no grub_util_warn() and probable anyway better
> to reject installing grub on a not fully synced raid.
> 
> Please comment :)
> 
> 
> Index: util/grub-probe.c
> ===================================================================
> --- util/grub-probe.c   (Revision 1747)
> +++ util/grub-probe.c   (Arbeitskopie)
> @@ -179,6 +179,8 @@
>        list = dev->disk->dev->memberlist (dev->disk);
>       while (list)
>        {
> +         if (! list->disk)
> +           grub_util_error ("At least one underlying device for %s is 
> missing.",dev->disk->name);

We generate that list ourselves (I think in util/raid.c), so if one device
is linked in the list but just points to NULL that's our own fault, at least
up to the util/raid.c layer.

-- 
Robert Millan

  The DRM opt-in fallacy: "Your data belongs to us. We will decide when (and
  how) you may access your data; but nobody's threatening your freedom: we
  still allow you to remove your data and not access it at all."



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] show an error instead of segfaulting on grub-probe -t partmap on a unsynced raid
  2008-07-30 10:37   ` Robert Millan
@ 2008-07-30 11:39     ` Felix Zielcke
  2008-07-30 15:11       ` Felix Zielcke
  0 siblings, 1 reply; 8+ messages in thread
From: Felix Zielcke @ 2008-07-30 11:39 UTC (permalink / raw)
  To: The development of GRUB 2

I have told this all Robert just on IRC, but it would be really nice if
others would comment on this problem too :)
I just don't know if GRUB_UTIL should ignore the missing devices just
like real GRUB or if it's a good idea to at least warn the user that
sth. is wrong with the mdraid.

Am Mittwoch, den 30.07.2008, 12:37 +0200 schrieb Robert Millan:


> We generate that list ourselves (I think in util/raid.c), so if one device
> is linked in the list but just points to NULL that's our own fault, at least
> up to the util/raid.c layer.
> 

The list is generated by grub_raid_memberlist at disk/raid.c:72
and that doestn't handle missing devices, it adds every disk to the list

  for (i = 0; i < array->total_devs; i++)
    {
      tmp = grub_malloc (sizeof (*tmp));
      tmp->disk = array->device[i];
      tmp->next = list;
      list = tmp;
    }


grub_raid_read for raid1 can handle missing devices without problems
disk/raid.c:209

        for (i = 0; i < array->total_devs; i++)
          {
            if (array->device[i])


This is from grub_raid_scan_device after hd0,1 is added
sdb1 / hd1,1 has been removed and --zero-superblock'ed

(gdb) print *array
 $3 = {number = 0, version = 0, level = 1, layout = 0, total_devs = 1, nr_devs = 1, chunk_size = 0, uuid = {1191940856, 3853190239, 
                     1276382316, 2567164211}, 
   name = 0x19e1230 "md0", disk_size = 16777216, device = {0x0, 0x19e1080, 0x0 <repeats 30 times>}, next = 0x0}






^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] show an error instead of segfaulting on grub-probe -t partmap on a unsynced raid
  2008-07-30 11:39     ` Felix Zielcke
@ 2008-07-30 15:11       ` Felix Zielcke
  2008-08-01 13:36         ` Robert Millan
  0 siblings, 1 reply; 8+ messages in thread
From: Felix Zielcke @ 2008-07-30 15:11 UTC (permalink / raw)
  To: The development of GRUB 2

Am Mittwoch, den 30.07.2008, 13:39 +0200 schrieb Felix Zielcke:

>         for (i = 0; i < array->total_devs; i++)
>           {
>             if (array->device[i])
> 
> 

Above code isn't correct either, as you can see on my mail before
in my testing case device[0] is 0x0 and device[1] is the working one.

I don't think now that on RAID 1 there's a need for a warning/error on
not fully synced mdraid

Here's a patch which fixes this, if this is ok I'll have to think about
the changelog.

Please comment.

Index: disk/raid.c
===================================================================
--- disk/raid.c	(Revision 1753)
+++ disk/raid.c	(Arbeitskopie)
@@ -75,8 +75,10 @@
   grub_disk_memberlist_t list = NULL, tmp;
   unsigned int i;
   
-  for (i = 0; i < array->total_devs; i++)
+  for (i = 0; i < GRUB_RAID_MAX_DEVICES; i++)
     {
+      if (! array->device[i])
+      	continue;
       tmp = grub_malloc (sizeof (*tmp));
       tmp->disk = array->device[i];
       tmp->next = list;
@@ -213,7 +215,7 @@
       {
 	unsigned int i = 0;
 	
-	for (i = 0; i < array->total_devs; i++)
+	for (i = 0; i < GRUB_RAID_MAX_DEVICES; i++)
 	  {
 	    if (array->device[i])
 	      {
Index: include/grub/raid.h
===================================================================
--- include/grub/raid.h	(Revision 1753)
+++ include/grub/raid.h	(Arbeitskopie)
@@ -22,6 +22,8 @@
 
 #include <grub/types.h>
 
+#define GRUB_RAID_MAX_DEVICES 32
+
 struct grub_raid_array
 {
   int number;              /* The device number, taken from md_minor so
we
@@ -37,7 +39,7 @@
   char *name;              /* That will be "md<number>". */
   grub_uint64_t disk_size; /* Size of an individual disk, in 512 byte
 			      sectors. */
-  grub_disk_t device[32];  /* Array of total_devs devices. */          
+  grub_disk_t device[GRUB_RAID_MAX_DEVICES];  /* Array of total_devs
devices. */          
   struct grub_raid_array *next;
 };





^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] show an error instead of segfaulting on grub-probe -t partmap on a unsynced raid
  2008-07-30 15:11       ` Felix Zielcke
@ 2008-08-01 13:36         ` Robert Millan
  2008-08-01 15:22           ` Felix Zielcke
  0 siblings, 1 reply; 8+ messages in thread
From: Robert Millan @ 2008-08-01 13:36 UTC (permalink / raw)
  To: The development of GRUB 2

On Wed, Jul 30, 2008 at 05:11:24PM +0200, Felix Zielcke wrote:
> Index: disk/raid.c
> ===================================================================
> --- disk/raid.c	(Revision 1753)
> +++ disk/raid.c	(Arbeitskopie)
> @@ -75,8 +75,10 @@
>    grub_disk_memberlist_t list = NULL, tmp;
>    unsigned int i;
>    
> -  for (i = 0; i < array->total_devs; i++)
> +  for (i = 0; i < GRUB_RAID_MAX_DEVICES; i++)

The existing code is confusing.  What is total_devs for?  If we need to
iterate up to 32, sounds like this variable is pointless?

If it's useless, it should be removed (but maybe it isn't!).

-- 
Robert Millan

  The DRM opt-in fallacy: "Your data belongs to us. We will decide when (and
  how) you may access your data; but nobody's threatening your freedom: we
  still allow you to remove your data and not access it at all."



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] show an error instead of segfaulting on grub-probe -t partmap on a unsynced raid
  2008-08-01 13:36         ` Robert Millan
@ 2008-08-01 15:22           ` Felix Zielcke
  2008-08-02 19:00             ` Felix Zielcke
  0 siblings, 1 reply; 8+ messages in thread
From: Felix Zielcke @ 2008-08-01 15:22 UTC (permalink / raw)
  To: The development of GRUB 2

Am Freitag, den 01.08.2008, 15:36 +0200 schrieb Robert Millan:

> The existing code is confusing.  What is total_devs for?  If we need to
> iterate up to 32, sounds like this variable is pointless?
> 
> If it's useless, it should be removed (but maybe it isn't!).
> 

total_devs is a fast check if the array is readable.

On fully synced raids the disks should be always in linear continues
order starting with 0 in the mdraid superblock.

If I added the sdb1 back to the md0 then mdadm did reassemble them to 0
and 1
Whereas when I removed it then disk 1 was the working one and disk 2 the
removed one.

So at least for raid1 it's not safe to just check disk 0 if the array
has only one device (which is totally legal on raid1)

For raid0 you always must have a fully synced raid.

On raid5 one disk can be missing, thanks to VMware and that I have enoug
time I can find it out how it works with raid5

By the way, this 32 seems to be oversafe
mdadm(8) says super 0.90 supports only 28 devices in one array

With implementing super 1.X this will be a problem because mdadm(8)
doestn't say any limit at all.




^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] show an error instead of segfaulting on grub-probe -t partmap on a unsynced raid
  2008-08-01 15:22           ` Felix Zielcke
@ 2008-08-02 19:00             ` Felix Zielcke
  0 siblings, 0 replies; 8+ messages in thread
From: Felix Zielcke @ 2008-08-02 19:00 UTC (permalink / raw)
  To: The development of GRUB 2

Now with trying if raid5 makes problem I looked further into this whole
raid stuff in grub.

Am Freitag, den 01.08.2008, 17:22 +0200 schrieb Felix Zielcke:
> Am Freitag, den 01.08.2008, 15:36 +0200 schrieb Robert Millan:
> 
> > The existing code is confusing.  What is total_devs for?  If we need to
> > iterate up to 32, sounds like this variable is pointless?
> > 
> > If it's useless, it should be removed (but maybe it isn't!).
> > 
> 
> total_devs is a fast check if the array is readable.

total_devs is the number of disks according to the superblock
nr_devs is the number grub found.

The value from the superblock is needed.
nr_devs could be removed but as I already said above it's faster with
it.
If it doestn't exist you have to iterate over all 32 devices and then
check the found number of disks with total_devs to find out if the array
is readable
(grub_is_array_readable in disk/raid.c)

And the raid5 has a check, if one disk in the array is missing and then
an error occurs while reading from a different one then stops reading
further from the whole array.

So I would suggest to just keep both, as long as you don't need the
sizeof(unsigned int) this would save.
But with the super 0.90 format both could be just an unsigned char, it
should never happen that we have more then 28 devices which 0.90
supports and the current code just assumes 32 so there'll be problems
anyway.

> ...

To make that a bit more clear:
mdadm allocates the disk number stored in superblock begining with 0
If you remove the disk with number 0 then the others still keep there
number.
If you add it back, then it gets while it's syncing the first highest
disk number which is free.
So if you make a raid1 with 2 disks then the numbers are 0 and 1
if you remove the one with has 0 and then add it back then it becomes 2 
If it's fully synced then it get's changed back to 0

And the problem in grub is that the disk number in superblock is used
for the array->device[] so there can be holes with missing disks.



I thought about changing grub_raid_scan_device so the disks get added to
a temporary array with the superblock number just as it's currently done
and then copy from that one to the used array->devices[] skipping all
invalid entrys.

This would be fine for Raid 0 and 1
But there's Raid 5 :(

On Raid 5 one disk can be missing and everything can be still read it's
just slower.
The problem is we need every disk which is in the array and not just the
working ones to calculate from which disk shall be read.

So the Raid 1 case can be fixed fine with my previous patch which
shouldn't slow down that much
All 32 devices are just checked on grub-probe
real grub stops when it found a working one which should be for the most
users always a very low number

Maybe I can find a solution for this Raid 5 problem, but I doubt.
So we should probable warn users if their /boot is on a raid 5

Raid 5 worked for me fine with qemu just using 2 of the 3 disk images,
when it was fully synced, the order of them didn't even matter.
But if I removed a disk with mdadm then grub failed with the other 2
used.




^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2008-08-02 19:00 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-07-28 23:05 grub-probe crashes on a linux softraid 1 with one disk beginning to sync Felix Zielcke
2008-07-29 20:20 ` [PATCH] show an error instead of segfaulting on grub-probe -t partmap on a unsynced raid Felix Zielcke
2008-07-30 10:37   ` Robert Millan
2008-07-30 11:39     ` Felix Zielcke
2008-07-30 15:11       ` Felix Zielcke
2008-08-01 13:36         ` Robert Millan
2008-08-01 15:22           ` Felix Zielcke
2008-08-02 19:00             ` Felix Zielcke

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.