* grub-probe crashes on a linux softraid 1 with one disk beginning to sync @ 2008-07-28 23:05 Felix Zielcke 2008-07-29 20:20 ` [PATCH] show an error instead of segfaulting on grub-probe -t partmap on a unsynced raid Felix Zielcke 0 siblings, 1 reply; 8+ messages in thread From: Felix Zielcke @ 2008-07-28 23:05 UTC (permalink / raw) To: grub-devel grub-probe --device-map=/boot/grub/device.map --target=partmap --device /dev/md0 showed pc and then segfaulted /dev/md0 is raid1 with 2 disks but the second was just added to it, so it's syncing I try to trace this down and maybe provide a patch but probable someone else here is better and faster on this then me :) Even as .tar.bz2 the 2 files are too big for the list so you'll find them at [0] in case you need. #0 0x00000000004015f9 in probe_partmap (disk=0x0) at /home/fz/grub/grub2-1.96+20080724/util/grub-probe.c:87 87 if (disk->partition == NULL) (gdb) bt full #0 0x00000000004015f9 in probe_partmap (disk=0x0) at /home/fz/grub/grub2-1.96+20080724/util/grub-probe.c:87 name = 0x0 underscore = 0x7f7916e9ea86 "\203=\233�-" #1 0x00000000004018ac in probe (path=0x0, device_name=0x7fff1f399a42 "/dev/md0") at /home/fz/grub/grub2-1.96+20080724/util/grub-probe.c:182 list = (grub_disk_memberlist_t) 0x1f341d0 tmp = (grub_disk_memberlist_t) 0x1f341d0 drive_name = 0x1f34080 "md0" grub_path = 0x0 filebuf_via_grub = 0x0 filebuf_via_sys = 0x0 abstraction_type = 2 dev = (grub_device_t) 0x1f34190 fs = (grub_fs_t) 0x0 #2 0x0000000000401d98 in main (argc=5, argv=0x7fff1f397888) at /home/fz/grub/grub2-1.96+20080724/util/grub-probe.c:370 dev_map = 0x1f34010 "/boot/grub/device.map" argument = 0x7fff1f399a42 "/dev/md0" [0] http://rapidshare.com/files/133187302/coredump.tar.bz2.html ^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH] show an error instead of segfaulting on grub-probe -t partmap on a unsynced raid 2008-07-28 23:05 grub-probe crashes on a linux softraid 1 with one disk beginning to sync Felix Zielcke @ 2008-07-29 20:20 ` Felix Zielcke 2008-07-30 10:37 ` Robert Millan 0 siblings, 1 reply; 8+ messages in thread From: Felix Zielcke @ 2008-07-29 20:20 UTC (permalink / raw) To: The development of GRUB 2 >#0 0x00000000004015f9 in probe_partmap (disk=0x0) >at /home/fz/grub/grub2-1.96+20080724/util/grub-probe.c:87 >87 if (disk->partition == NULL) This is really a good way to get more into this coding and debugging C stuff in general and especially, into grub2's code. As far as I could find out now (more with trying it out then reading the code) grub itself can handle a raid1 with one missing device or an unsynced one. Here's a patch which print an error on "grub-probe -t partmap /" instead of segfaulting. I have to find out why "grub-probe -t fs /" fails with one missing device but succeeds with an unsynced one. First I thought I let it continue and just print out a warning, but there's no grub_util_warn() and probable anyway better to reject installing grub on a not fully synced raid. Please comment :) Index: util/grub-probe.c =================================================================== --- util/grub-probe.c (Revision 1747) +++ util/grub-probe.c (Arbeitskopie) @@ -179,6 +179,8 @@ list = dev->disk->dev->memberlist (dev->disk); while (list) { + if (! list->disk) + grub_util_error ("At least one underlying device for %s is missing.",dev->disk->name); probe_partmap (list->disk); tmp = list->next; free (list); ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] show an error instead of segfaulting on grub-probe -t partmap on a unsynced raid 2008-07-29 20:20 ` [PATCH] show an error instead of segfaulting on grub-probe -t partmap on a unsynced raid Felix Zielcke @ 2008-07-30 10:37 ` Robert Millan 2008-07-30 11:39 ` Felix Zielcke 0 siblings, 1 reply; 8+ messages in thread From: Robert Millan @ 2008-07-30 10:37 UTC (permalink / raw) To: The development of GRUB 2 On Tue, Jul 29, 2008 at 10:20:29PM +0200, Felix Zielcke wrote: > >#0 0x00000000004015f9 in probe_partmap (disk=0x0) > >at /home/fz/grub/grub2-1.96+20080724/util/grub-probe.c:87 > >87 if (disk->partition == NULL) > > > This is really a good way to get more into this coding and debugging C > stuff in general and especially, into grub2's code. > > As far as I could find out now (more with trying it out then reading the > code) grub itself can handle a raid1 > with one missing device or an unsynced one. > Here's a patch which print an error on "grub-probe -t partmap /" instead of > segfaulting. > I have to find out why "grub-probe -t fs /" fails with one missing device > but succeeds with an unsynced one. > First I thought I let it continue and just print out a warning, but there's > no grub_util_warn() and probable anyway better > to reject installing grub on a not fully synced raid. > > Please comment :) > > > Index: util/grub-probe.c > =================================================================== > --- util/grub-probe.c (Revision 1747) > +++ util/grub-probe.c (Arbeitskopie) > @@ -179,6 +179,8 @@ > list = dev->disk->dev->memberlist (dev->disk); > while (list) > { > + if (! list->disk) > + grub_util_error ("At least one underlying device for %s is > missing.",dev->disk->name); We generate that list ourselves (I think in util/raid.c), so if one device is linked in the list but just points to NULL that's our own fault, at least up to the util/raid.c layer. -- Robert Millan The DRM opt-in fallacy: "Your data belongs to us. We will decide when (and how) you may access your data; but nobody's threatening your freedom: we still allow you to remove your data and not access it at all." ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] show an error instead of segfaulting on grub-probe -t partmap on a unsynced raid 2008-07-30 10:37 ` Robert Millan @ 2008-07-30 11:39 ` Felix Zielcke 2008-07-30 15:11 ` Felix Zielcke 0 siblings, 1 reply; 8+ messages in thread From: Felix Zielcke @ 2008-07-30 11:39 UTC (permalink / raw) To: The development of GRUB 2 I have told this all Robert just on IRC, but it would be really nice if others would comment on this problem too :) I just don't know if GRUB_UTIL should ignore the missing devices just like real GRUB or if it's a good idea to at least warn the user that sth. is wrong with the mdraid. Am Mittwoch, den 30.07.2008, 12:37 +0200 schrieb Robert Millan: > We generate that list ourselves (I think in util/raid.c), so if one device > is linked in the list but just points to NULL that's our own fault, at least > up to the util/raid.c layer. > The list is generated by grub_raid_memberlist at disk/raid.c:72 and that doestn't handle missing devices, it adds every disk to the list for (i = 0; i < array->total_devs; i++) { tmp = grub_malloc (sizeof (*tmp)); tmp->disk = array->device[i]; tmp->next = list; list = tmp; } grub_raid_read for raid1 can handle missing devices without problems disk/raid.c:209 for (i = 0; i < array->total_devs; i++) { if (array->device[i]) This is from grub_raid_scan_device after hd0,1 is added sdb1 / hd1,1 has been removed and --zero-superblock'ed (gdb) print *array $3 = {number = 0, version = 0, level = 1, layout = 0, total_devs = 1, nr_devs = 1, chunk_size = 0, uuid = {1191940856, 3853190239, 1276382316, 2567164211}, name = 0x19e1230 "md0", disk_size = 16777216, device = {0x0, 0x19e1080, 0x0 <repeats 30 times>}, next = 0x0} ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] show an error instead of segfaulting on grub-probe -t partmap on a unsynced raid 2008-07-30 11:39 ` Felix Zielcke @ 2008-07-30 15:11 ` Felix Zielcke 2008-08-01 13:36 ` Robert Millan 0 siblings, 1 reply; 8+ messages in thread From: Felix Zielcke @ 2008-07-30 15:11 UTC (permalink / raw) To: The development of GRUB 2 Am Mittwoch, den 30.07.2008, 13:39 +0200 schrieb Felix Zielcke: > for (i = 0; i < array->total_devs; i++) > { > if (array->device[i]) > > Above code isn't correct either, as you can see on my mail before in my testing case device[0] is 0x0 and device[1] is the working one. I don't think now that on RAID 1 there's a need for a warning/error on not fully synced mdraid Here's a patch which fixes this, if this is ok I'll have to think about the changelog. Please comment. Index: disk/raid.c =================================================================== --- disk/raid.c (Revision 1753) +++ disk/raid.c (Arbeitskopie) @@ -75,8 +75,10 @@ grub_disk_memberlist_t list = NULL, tmp; unsigned int i; - for (i = 0; i < array->total_devs; i++) + for (i = 0; i < GRUB_RAID_MAX_DEVICES; i++) { + if (! array->device[i]) + continue; tmp = grub_malloc (sizeof (*tmp)); tmp->disk = array->device[i]; tmp->next = list; @@ -213,7 +215,7 @@ { unsigned int i = 0; - for (i = 0; i < array->total_devs; i++) + for (i = 0; i < GRUB_RAID_MAX_DEVICES; i++) { if (array->device[i]) { Index: include/grub/raid.h =================================================================== --- include/grub/raid.h (Revision 1753) +++ include/grub/raid.h (Arbeitskopie) @@ -22,6 +22,8 @@ #include <grub/types.h> +#define GRUB_RAID_MAX_DEVICES 32 + struct grub_raid_array { int number; /* The device number, taken from md_minor so we @@ -37,7 +39,7 @@ char *name; /* That will be "md<number>". */ grub_uint64_t disk_size; /* Size of an individual disk, in 512 byte sectors. */ - grub_disk_t device[32]; /* Array of total_devs devices. */ + grub_disk_t device[GRUB_RAID_MAX_DEVICES]; /* Array of total_devs devices. */ struct grub_raid_array *next; }; ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] show an error instead of segfaulting on grub-probe -t partmap on a unsynced raid 2008-07-30 15:11 ` Felix Zielcke @ 2008-08-01 13:36 ` Robert Millan 2008-08-01 15:22 ` Felix Zielcke 0 siblings, 1 reply; 8+ messages in thread From: Robert Millan @ 2008-08-01 13:36 UTC (permalink / raw) To: The development of GRUB 2 On Wed, Jul 30, 2008 at 05:11:24PM +0200, Felix Zielcke wrote: > Index: disk/raid.c > =================================================================== > --- disk/raid.c (Revision 1753) > +++ disk/raid.c (Arbeitskopie) > @@ -75,8 +75,10 @@ > grub_disk_memberlist_t list = NULL, tmp; > unsigned int i; > > - for (i = 0; i < array->total_devs; i++) > + for (i = 0; i < GRUB_RAID_MAX_DEVICES; i++) The existing code is confusing. What is total_devs for? If we need to iterate up to 32, sounds like this variable is pointless? If it's useless, it should be removed (but maybe it isn't!). -- Robert Millan The DRM opt-in fallacy: "Your data belongs to us. We will decide when (and how) you may access your data; but nobody's threatening your freedom: we still allow you to remove your data and not access it at all." ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] show an error instead of segfaulting on grub-probe -t partmap on a unsynced raid 2008-08-01 13:36 ` Robert Millan @ 2008-08-01 15:22 ` Felix Zielcke 2008-08-02 19:00 ` Felix Zielcke 0 siblings, 1 reply; 8+ messages in thread From: Felix Zielcke @ 2008-08-01 15:22 UTC (permalink / raw) To: The development of GRUB 2 Am Freitag, den 01.08.2008, 15:36 +0200 schrieb Robert Millan: > The existing code is confusing. What is total_devs for? If we need to > iterate up to 32, sounds like this variable is pointless? > > If it's useless, it should be removed (but maybe it isn't!). > total_devs is a fast check if the array is readable. On fully synced raids the disks should be always in linear continues order starting with 0 in the mdraid superblock. If I added the sdb1 back to the md0 then mdadm did reassemble them to 0 and 1 Whereas when I removed it then disk 1 was the working one and disk 2 the removed one. So at least for raid1 it's not safe to just check disk 0 if the array has only one device (which is totally legal on raid1) For raid0 you always must have a fully synced raid. On raid5 one disk can be missing, thanks to VMware and that I have enoug time I can find it out how it works with raid5 By the way, this 32 seems to be oversafe mdadm(8) says super 0.90 supports only 28 devices in one array With implementing super 1.X this will be a problem because mdadm(8) doestn't say any limit at all. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] show an error instead of segfaulting on grub-probe -t partmap on a unsynced raid 2008-08-01 15:22 ` Felix Zielcke @ 2008-08-02 19:00 ` Felix Zielcke 0 siblings, 0 replies; 8+ messages in thread From: Felix Zielcke @ 2008-08-02 19:00 UTC (permalink / raw) To: The development of GRUB 2 Now with trying if raid5 makes problem I looked further into this whole raid stuff in grub. Am Freitag, den 01.08.2008, 17:22 +0200 schrieb Felix Zielcke: > Am Freitag, den 01.08.2008, 15:36 +0200 schrieb Robert Millan: > > > The existing code is confusing. What is total_devs for? If we need to > > iterate up to 32, sounds like this variable is pointless? > > > > If it's useless, it should be removed (but maybe it isn't!). > > > > total_devs is a fast check if the array is readable. total_devs is the number of disks according to the superblock nr_devs is the number grub found. The value from the superblock is needed. nr_devs could be removed but as I already said above it's faster with it. If it doestn't exist you have to iterate over all 32 devices and then check the found number of disks with total_devs to find out if the array is readable (grub_is_array_readable in disk/raid.c) And the raid5 has a check, if one disk in the array is missing and then an error occurs while reading from a different one then stops reading further from the whole array. So I would suggest to just keep both, as long as you don't need the sizeof(unsigned int) this would save. But with the super 0.90 format both could be just an unsigned char, it should never happen that we have more then 28 devices which 0.90 supports and the current code just assumes 32 so there'll be problems anyway. > ... To make that a bit more clear: mdadm allocates the disk number stored in superblock begining with 0 If you remove the disk with number 0 then the others still keep there number. If you add it back, then it gets while it's syncing the first highest disk number which is free. So if you make a raid1 with 2 disks then the numbers are 0 and 1 if you remove the one with has 0 and then add it back then it becomes 2 If it's fully synced then it get's changed back to 0 And the problem in grub is that the disk number in superblock is used for the array->device[] so there can be holes with missing disks. I thought about changing grub_raid_scan_device so the disks get added to a temporary array with the superblock number just as it's currently done and then copy from that one to the used array->devices[] skipping all invalid entrys. This would be fine for Raid 0 and 1 But there's Raid 5 :( On Raid 5 one disk can be missing and everything can be still read it's just slower. The problem is we need every disk which is in the array and not just the working ones to calculate from which disk shall be read. So the Raid 1 case can be fixed fine with my previous patch which shouldn't slow down that much All 32 devices are just checked on grub-probe real grub stops when it found a working one which should be for the most users always a very low number Maybe I can find a solution for this Raid 5 problem, but I doubt. So we should probable warn users if their /boot is on a raid 5 Raid 5 worked for me fine with qemu just using 2 of the 3 disk images, when it was fully synced, the order of them didn't even matter. But if I removed a disk with mdadm then grub failed with the other 2 used. ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2008-08-02 19:00 UTC | newest] Thread overview: 8+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-07-28 23:05 grub-probe crashes on a linux softraid 1 with one disk beginning to sync Felix Zielcke 2008-07-29 20:20 ` [PATCH] show an error instead of segfaulting on grub-probe -t partmap on a unsynced raid Felix Zielcke 2008-07-30 10:37 ` Robert Millan 2008-07-30 11:39 ` Felix Zielcke 2008-07-30 15:11 ` Felix Zielcke 2008-08-01 13:36 ` Robert Millan 2008-08-01 15:22 ` Felix Zielcke 2008-08-02 19:00 ` Felix Zielcke
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.