[sending to grub-devel@ as requested]

Robert Millan wrote:
> On Sun, May 04, 2008 at 05:01:32PM +0300, Török Edwin wrote:
>
>>>>    Device Boot      Start         End      Blocks   Id  System
>>>> /dev/sda1   *           1        1275    10241406    7  HPFS/NTFS
>>>> /dev/sda2            1276        2248     7815622+  a6  OpenBSD
>>>> /dev/sda3            2249        5289    24426832+   f  W95 Ext'd (LBA)
>>>> /dev/sda4            6080        7296     9775552+  bf  Solaris
>>>> /dev/sda5            2249        2371      987966   82  Linux swap / Solaris
>>>> /dev/sda6            2372        3587     9767488+  83  Linux
>>>> /dev/sda7            3588        3600      104391   83  Linux
>>>> /dev/sda8            3601        4863    10145016   8e  Linux LVM
>>>> /dev/sda9            4864        5228     2931831   a6  OpenBSD
>>>> /dev/sda10           5229        5289      489951   83  Linux
>>>>
>> [...]
>> grub> ls (hd0,10)
>> error: unknown device
>> grub> ls (hd0,11)
>> error: unknown device
>> grub>
>>
>
> I tried reproducing your setup, but I can't hit the same bug.  This starts to
> look really nasty.  Just spotted this:
>
> /build/buildd/grub2-1.96+20080426/partmap/pc.c:141: partition 0: flag 0x80, type 0x7, start 0x3f, len 0x1388afc
> [...]
> /build/buildd/grub2-1.96+20080426/partmap/pc.c:141: partition 0: flag 0x0, type 0x82, start 0x2270f07, len 0x1e267c
>
> for which I can't find any explanation other than memory corruption.  Also,
> due to a missing fflush() call the output is somewhat scrambled, which makes
> it harder to track (I fixed this already in upstream).
>
> Could you:
>
>   - Apply the attached patch & run grub-probe again (this time output
>     will be a bit more readable)
>

There was no patch attached; however, I ran 'cvs diff -u -D2008-04-30' and
applied the resulting patch.

I found what the problem is, and it also explains why you couldn't reproduce
it. /dev/sda9 is not a valid OpenBSD partition, and at partmap/pc.c:176 the
iteration fails with the error "invalid disk label magic 0x%x".
If I replace that return with a continue, it works.
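
To make the difference between the return and the continue concrete, here is a
minimal standalone sketch in C. It is not the code from partmap/pc.c: the
struct, the function names and the flat array standing in for the partition
table are all made up for illustration, and the magic is read from a field
rather than from disk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical definitions for illustration only; not GRUB's real ones. */
#define PART_TYPE_BSD 0xa6         /* type byte that fdisk lists as "OpenBSD" */
#define LABEL_MAGIC   0x82564557u  /* BSD disklabel magic number */

struct part_entry {
    uint8_t  type;         /* MBR partition type byte */
    uint32_t label_magic;  /* value found where the disklabel magic should be */
};

/* A BSD-typed partition is only usable if its disklabel magic matches. */
static int label_ok(const struct part_entry *p)
{
    return p->type != PART_TYPE_BSD || p->label_magic == LABEL_MAGIC;
}

/*
 * Walk the table looking for entry number `want` (0-based).
 * Returns 1 if the walk reaches it, 0 otherwise.
 * tolerant == 0: stop at the first bad label (the "return" behaviour).
 * tolerant != 0: skip the bad entry and keep scanning (the "continue" behaviour).
 */
int find_partition(const struct part_entry *entries, size_t n,
                   size_t want, int tolerant)
{
    assert(entries != NULL);

    for (size_t i = 0; i < n; i++) {
        if (!label_ok(&entries[i])) {
            if (tolerant)
                continue;  /* ignore this partition, later ones stay visible */
            return 0;      /* give up: entries after i are never examined */
        }
        if (i == want)
            return 1;
    }
    return 0;
}
```

With a table whose second-to-last entry has type 0xa6 but a wrong magic, the
strict walk never reaches the last entry, while the tolerant walk does. That
is the same pattern as sda9 hiding sda10 above.
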
The problem is that grub2 stops looking for more partitions as soon as it
encounters the invalid partition. grub 0.97 was working perfectly, so I never
noticed that the partition had the wrong type!

Also, if I change the partition type to 83 (as it should be), an unpatched
grub-probe can find that /boot is on /dev/sda10:

# grub-probe -t device /boot
/dev/sda10

I think grub2 should handle errors more gracefully: possibly mark the
partition as invalid, and keep going. grub-probe was looking for /dev/sda10,
and it shouldn't be affected by /dev/sda9 being corrupted or invalid.

Think of it this way: if a partition gets corrupted, that shouldn't prevent
the system from booting, assuming the boot and root partitions are still OK.

Compare what grub-emu says when sda9 has the wrong type:

grub> ls (hd0,10)
error: unknown device

And this is what it says when sda9 has the correct type:

grub> ls (hd0,10)
        Partition hd0,10: Filesystem type ext2, Label debian_BOOT

>   - Send it to grub-devel@gnu.org
>

Done

> ?
>
> Maybe someone there has an idea, but if it's memory corruption and we can't
> reproduce it, tracing the problem remotely isn't going to work very well.
>

It wasn't memory corruption. However, I ran valgrind and it showed some
leaks, plus a call to stat() with a NULL parameter.
The attached patch fixes some of the valgrind warnings. Some leaks still
remain; I have attached the new valgrind logs.

P.S.: grub2 seems to work now; I am able to boot with it using the text-mode
menu. The default graphics mode doesn't work, and I will open a separate bug
about that.

Best regards,
--Edwin