Old version of lilo fails to boot 2.6.23

public inbox for linux-kernel@vger.kernel.org

* Old version of lilo fails to boot 2.6.23
@ 2007-10-16  7:07 Joseph Parmelee
  2007-10-25  8:47 ` Andrew Morton
  0 siblings, 1 reply; 7+ messages in thread
From: Joseph Parmelee @ 2007-10-16  7:07 UTC
  To: linux-kernel

Greetings:

I upgraded to version 2.6.23 and had a fun time figuring out the source of
this boot failure message on my x86 system:

   This kernel requires an i<random integer>86 CPU, but only detected an
   i<smaller random integer>86 CPU.

It turns out that my version of lilo (lilo -V gives version 21) doesn't set
up the stack and data segment registers in a compatible manner before
entering the new 16-bit real mode kernel loader code.  This problem is new
to the 2.6.23 series.

Parts of the 16-bit real mode loader code are now being compiled as C code
with gcc in 32-bit mode, passing the .code16gcc directive to the assembler to
correct the stack frames to 16 bits.  This kludge won't work unless all the
16-bit segment registers are set to the same value.  Gcc only manipulates
the offset of the address and doesn't know anything about segment registers
or segment override prefixes.  My lilo was setting SS=0x8000, DS=0x9000, and
SP=0xB000 before entering the kernel loader.  This makes stack automatics
unreachable from the data segment without segment override prefixes.
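
As an illustration of the mismatch described above (a sketch for the
reader, not code from the kernel or from lilo; the function names are made
up here), real-mode addresses are formed as segment * 16 + offset, so the
same 16-bit offset reaches different memory through SS and through DS:

```c
#include <stdint.h>

/* Real-mode linear address: segment * 16 + offset. */
uint32_t linear(uint16_t seg, uint16_t off)
{
    return ((uint32_t)seg << 4) + off;
}

/*
 * Byte distance between the location a DS-relative access hits and the
 * location an SS-relative access hits when both use the same offset.
 * Zero means a near pointer to a stack variable is usable through DS.
 */
int32_t ss_ds_skew(uint16_t ss, uint16_t ds)
{
    return (int32_t)linear(ds, 0) - (int32_t)linear(ss, 0);
}
```

With the register values reported above, a local variable at SS:0xB000
sits at linear 0x8B000, while a near pointer to it dereferenced through DS
reads linear 0x9B000, which is 64 KiB away.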

I was tempted to patch the kernel code, but instead decided to try
"upgrading" lilo to grub-0.97 and found that grub works just fine.  This
also has the significant advantage that we won't need those nasty as86 and
ld86 things any more since lilo was the last package on our systems that
used them.

However, it would probably be a good idea to modify the kernel loader to
lock out interrupts and explicitly set up the stack in its assembly startup
code to ensure that the stack is located correctly above the code in the
same segment, rather than relying on the boot loader to do the right thing.
The existing setup code already ensures that the other segment registers are
equal but omits the stack segment register.  Also, because lilo (and
others?) loads the data/code segment at 0x90000, the stack pointer would
have to be set no higher than 0xA000 to avoid potential overwrites of the
EBDA.  But I believe from my look at the code that the data/code sits below
0x8000 in the segment, so this should be fine.
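
The arithmetic behind that limit can be sketched as follows.  This is
illustrative only: the function names are hypothetical, and the 0x9A000
ceiling is simply the figure implied by the paragraph above (segment base
0x90000 plus offset 0xA000).  The real EBDA base depends on the firmware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Linear address of the top of stack for a given segment and %sp. */
uint32_t stack_top_linear(uint16_t seg, uint16_t sp)
{
    return ((uint32_t)seg << 4) + sp;
}

/* True if the top of stack does not exceed the given linear ceiling. */
bool stack_within_limit(uint16_t seg, uint16_t sp, uint32_t ceiling)
{
    return stack_top_linear(seg, sp) <= ceiling;
}
```

For a segment of 0x9000, SP=0xA000 gives a stack top of exactly 0x9A000,
whereas the SP=0xB000 that lilo supplied would give 0x9B000 and exceed
that ceiling.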

If others think this is a good thing, I will test and submit a patch.

Please CC me directly as I am no longer subscribed to the list.

Best regards,

Joseph


* Re: Old version of lilo fails to boot 2.6.23
  2007-10-16  7:07 Old version of lilo fails to boot 2.6.23 Joseph Parmelee
@ 2007-10-25  8:47 ` Andrew Morton
  2007-10-25  9:08   ` H. Peter Anvin
  2007-10-25 21:53   ` H. Peter Anvin
  0 siblings, 2 replies; 7+ messages in thread
From: Andrew Morton @ 2007-10-25  8:47 UTC
  To: Joseph Parmelee; +Cc: linux-kernel, H. Peter Anvin

On Tue, 16 Oct 2007 01:07:31 -0600 (CST) Joseph Parmelee <jparmele@wildbear.com> wrote:

> 
> Greetings:
> 
> I upgraded to version 2.6.23 and had a fun time figuring out the source of
> this boot failure message on my x86 system:
> 
>    This kernel requires an i<random integer>86 CPU, but only detected an
>    i<smaller random integer>86 CPU.
> 
> It turns out that my version of lilo (lilo -V gives version 21) doesn't set
> up the stack and data segment registers in a compatible manner before
> entering the new 16-bit real mode kernel loader code.  This problem is new
> to the 2.6.23 series.

hm, one of my test boxes runs

vmm:/home/akpm> lilo -V       
LILO version 21.4-4

and I haven't had any such problems.

> Parts of the 16-bit real mode loader code are now being compiled as C code
> with gcc in 32-bit mode, passing the .code16gcc directive to the assembler to
> correct the stack frames to 16 bits.  This kludge won't work unless all the
> 16-bit segment registers are set to the same value.  Gcc only manipulates
> the offset of the address and doesn't know anything about segment registers
> or segment override prefixes.  My lilo was setting SS=0x8000, DS=0x9000, and
> SP=0xB000 before entering the kernel loader.  This makes stack automatics
> unreachable from the data segment without segment override prefixes.
> 
> I was tempted to patch the kernel code, but instead decided to try
> "upgrading" lilo to grub-0.97 and found that grub works just fine.  This
> also has the significant advantage that we won't need those nasty as86 and
> ld86 things any more since lilo was the last package on our systems that
> used them.
> 
> However, it would probably be a good idea to modify the kernel loader to
> lock out interrupts and explicitly set up the stack in its assembly startup
> code to ensure that the stack is located correctly above the code in the
> same segment, rather than relying on the boot loader to do the right thing.
> The existing setup code already ensures that the other segment registers are
> equal but omits the stack segment register.  Also, because lilo (and
> others?) loads the data/code segment at 0x90000, the stack pointer would
> have to be set no higher than 0xA000 to avoid potential overwrites of the
> EBDA.  But I believe from my look at the code that the data/code sits below
> 0x8000 in the segment, so this should be fine.
> 
> If others think this is a good thing, I will test and submit a patch.

I think this is a good thing ;)



* Re: Old version of lilo fails to boot 2.6.23
  2007-10-25  8:47 ` Andrew Morton
@ 2007-10-25  9:08   ` H. Peter Anvin
  2007-10-25 21:53   ` H. Peter Anvin
  1 sibling, 0 replies; 7+ messages in thread
From: H. Peter Anvin @ 2007-10-25  9:08 UTC
  To: Andrew Morton; +Cc: Joseph Parmelee, linux-kernel

Andrew Morton wrote:
> 
>> Parts of the 16-bit real mode loader code are now being compiled as C code
>> with gcc in 32-bit mode, passing the .code16gcc directive to the assembler to
>> correct the stack frames to 16 bits.  This kludge won't work unless all the
>> 16-bit segment registers are set to the same value.  Gcc only manipulates
>> the offset of the address and doesn't know anything about segment registers
>> or segment override prefixes.  My lilo was setting SS=0x8000, DS=0x9000, and
>> SP=0xB000 before entering the kernel loader.  This makes stack automatics
>> unreachable from the data segment without segment override prefixes.
>>
>> I was tempted to patch the kernel code, but instead decided to try
>> "upgrading" lilo to grub-0.97 and found that grub works just fine.  This
>> also has the significant advantage that we won't need those nasty as86 and
>> ld86 things any more since lilo was the last package on our systems that
>> used them.
>>
>> However, it would probably be a good idea to modify the kernel loader to
>> lock out interrupts and explicitly set up the stack in its assembly startup
>> code to ensure that the stack is located correctly above the code in the
>> same segment, rather than relying on the boot loader to do the right thing.
>> The existing setup code already ensures that the other segment registers are
>> equal but omits the stack segment register.  Also, because lilo (and
>> others?) loads the data/code segment at 0x90000, the stack pointer would
>> have to be set no higher than 0xA000 to avoid potential overwrites of the
>> EBDA.  But I believe from my look at the code that the data/code sits below
>> 0x8000 in the segment, so this should be fine.
>>
>> If others think this is a good thing, I will test and submit a patch.
> 
> I think this is a good thing ;)
> 

Not quite so fast.  The entry value of SS:SP is actually part of the 
protocol (an upper memory boundary), although for 2.01+ one could argue 
it is redundant with the heap_end field in the header.

I'm rather confused which particular LILO this would possibly be, 
especially given the oddball version number.  The boot protocol was 
pretty much formalized by Werner Almesberger, the original LILO
author, with contributions from Hans Lermen and myself.  It hasn't 
changed in this area.

If this was a LILO that someone "cleverly broke" I'd like to understand 
the nature of it, so we can work around it properly.  I see a couple of 
options:

- If protocol >= 2.01, force (e)sp to match the heap_end field of the 
setup structure.  For < 2.01, what to do?
- Pray and hope the value of SP is sane to start out with in the correct SS.
- Declare the "cleverly broken" version of LILO not so cleverly broken.

For what it's worth, in the old code, for protocol < 2.02, the boot code
would simply overwrite %ss, leaving %sp unchanged (alternative #2).  So
this configuration was always buggy.  There is a comment in the old code
(setup.S, line 655) that "after this the stack should not be used", but
we then go right into the A20 code, which does a bunch of subroutine calls.

I think at this point that if protocol >= 2.01 and CAN_USE_HEAP is set,
we should set %ss:%sp from the heap_end field; otherwise fall back to
simply setting %ss and hope that %sp is set to something sane.  I don't
like it, but I don't see any better alternative.

	-hpa



* Re: Old version of lilo fails to boot 2.6.23
  2007-10-25  8:47 ` Andrew Morton
  2007-10-25  9:08   ` H. Peter Anvin
@ 2007-10-25 21:53   ` H. Peter Anvin
  2007-10-25 22:31     ` H. Peter Anvin
  1 sibling, 1 reply; 7+ messages in thread
From: H. Peter Anvin @ 2007-10-25 21:53 UTC
  To: Joseph Parmelee; +Cc: Andrew Morton, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 97 bytes --]

[Ancient LILO boot problem]

Joseph, could you try this patch on your ancient-LILO setup?

	-hpa

[-- Attachment #2: newsetup-ancient-lilo.patch --]
[-- Type: text/x-patch, Size: 1140 bytes --]

diff --git a/arch/x86/boot/header.S b/arch/x86/boot/header.S
index 8353c81..295f9b9 100644
--- a/arch/x86/boot/header.S
+++ b/arch/x86/boot/header.S
@@ -242,11 +242,31 @@ setup2:
 	movw	%ax, %es
 	cld
 
+# Apparently some ancient versions of LILO invoked the kernel
+# with %ss != %ds, which happened to work by accident for the
+# old code.  If the CAN_USE_HEAP flag is set in loadflags, or
+# %ss != %ds, then adjust the stack pointer.
+	testb	$CAN_USE_HEAP, loadflags
+	jnz 2f
+
+	# No CAN_USE_HEAP
+	movw	%ss, %dx
+	cmpw	%ax, %dx	# %ds == %ss?
+	je	3f		# If so, assume %sp is valid
+
+	# If not, use the default value from heap_end_ptr
+	# as the %sp value -- it's the best we can do with an
+	# impossible situation.
+2:
+	movw	%ax, %ss
+	movw	heap_end_ptr, %sp
+	
 # Stack paranoia: align the stack and make sure it is good
 # for both 16- and 32-bit references.  In particular, if we
 # were meant to have been using the full 16-bit segment, the
 # caller might have set %sp to zero, which breaks %esp-based
 # references.
+3:	
 	andw	$~3, %sp	# dword align (might as well...)
 	jnz	1f
 	movw	$0xfffc, %sp	# Make sure we're not zero


* Re: Old version of lilo fails to boot 2.6.23
  2007-10-25 21:53   ` H. Peter Anvin
@ 2007-10-25 22:31     ` H. Peter Anvin
  2007-10-26 18:25       ` Joseph Parmelee
  0 siblings, 1 reply; 7+ messages in thread
From: H. Peter Anvin @ 2007-10-25 22:31 UTC
  To: Joseph Parmelee; +Cc: Andrew Morton, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 168 bytes --]

H. Peter Anvin wrote:
> [Ancient LILO boot problem]
> 
> Joseph, could you try this patch on your ancient-LILO setup?
> 

Actually, please try this one instead.

	-hpa

[-- Attachment #2: newsetup-ancient-lilo-2.patch --]
[-- Type: text/x-patch, Size: 2504 bytes --]

diff --git a/arch/x86/boot/boot.h b/arch/x86/boot/boot.h
index 5f9a2e7..887874f 100644
--- a/arch/x86/boot/boot.h
+++ b/arch/x86/boot/boot.h
@@ -17,6 +17,8 @@
 #ifndef BOOT_BOOT_H
 #define BOOT_BOOT_H
 
+#define STACK_SIZE	512	/* Minimum number of bytes for stack */
+
 #ifndef __ASSEMBLY__
 
 #include <stdarg.h>
@@ -198,8 +200,6 @@ static inline int isdigit(int ch)
 }
 
 /* Heap -- available for dynamic lists. */
-#define STACK_SIZE	512	/* Minimum number of bytes for stack */
-
 extern char _end[];
 extern char *HEAP;
 extern char *heap_end;
diff --git a/arch/x86/boot/header.S b/arch/x86/boot/header.S
index 8353c81..256098d 100644
--- a/arch/x86/boot/header.S
+++ b/arch/x86/boot/header.S
@@ -173,7 +173,7 @@ ramdisk_size:	.long	0		# its size in bytes
 bootsect_kludge:
 		.long	0		# obsolete
 
-heap_end_ptr:	.word	_end+1024	# (Header version 0x0201 or later)
+heap_end_ptr:	.word	_end+STACK_SIZE	# (Header version 0x0201 or later)
 					# space from here (exclusive) down to
 					# end of setup code can be used by setup
 					# for local heap purposes.
@@ -242,16 +242,38 @@ setup2:
 	movw	%ax, %es
 	cld
 
-# Stack paranoia: align the stack and make sure it is good
-# for both 16- and 32-bit references.  In particular, if we
-# were meant to have been using the full 16-bit segment, the
-# caller might have set %sp to zero, which breaks %esp-based
-# references.
-	andw	$~3, %sp	# dword align (might as well...)
+# Apparently some ancient versions of LILO invoked the kernel
+# with %ss != %ds, which happened to work by accident for the
+# old code.  If the CAN_USE_HEAP flag is set in loadflags, or
+# %ss != %ds, then adjust the stack pointer.
+
+	# Smallest possible stack we can tolerate
+	movw	$(_end+STACK_SIZE), %cx
+
+	testb	$CAN_USE_HEAP, loadflags
+	jnz	2f
+
+	# No CAN_USE_HEAP
+	movw	%ss, %dx
+	cmpw	%ax, %dx	# %ds == %ss?
+	movw	%sp, %dx
+	je	3f		# If so, assume %sp is valid
+
+2:
+	movw	heap_end_ptr, %dx
+
+	# Make sure the stack is at least minimum size.  Take a value
+	# of zero to mean "full segment."
+3:
+	andw	$~3, %dx	# dword align (might as well...)
 	jnz	1f
-	movw	$0xfffc, %sp	# Make sure we're not zero
-1:	movzwl	%sp, %esp	# Clear upper half of %esp
-	sti
+	movw	$0xfffc, %dx	# Make sure we're not zero
+1:	cmpw	%cx, %dx
+	jnb	4f
+	movw	%cx, %dx	# Minimum value we can possibly use
+4:	movw	%ax, %ss
+	movzwl	%dx, %esp	# Clear upper half of %esp
+	sti			# Now we should have a working stack
 
 # Check signature at end of setup
 	cmpl	$0x5a5aaa55, setup_sig
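
Read as C, the stack selection in the patch above looks roughly like the
following.  This is an illustrative model only, not kernel code: struct
entry_state, choose_sp and the min_sp field are hypothetical names, and
only CAN_USE_HEAP, heap_end_ptr, _end and STACK_SIZE come from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of what the setup2 code inspects on entry. */
struct entry_state {
    uint16_t ds;            /* %ds as set up by the boot loader (in %ax) */
    uint16_t ss;            /* %ss as left by the boot loader */
    uint16_t sp;            /* %sp as left by the boot loader */
    bool     can_use_heap;  /* CAN_USE_HEAP bit of loadflags */
    uint16_t heap_end_ptr;  /* header field */
    uint16_t min_sp;        /* stands in for _end + STACK_SIZE */
};

/* Returns the offset loaded into %esp; %ss is always reloaded from %ds. */
uint16_t choose_sp(const struct entry_state *e)
{
    uint16_t dx;

    if (!e->can_use_heap && e->ss == e->ds)
        dx = e->sp;             /* label 3: assume the loader's %sp is valid */
    else
        dx = e->heap_end_ptr;   /* label 2: take the header's value */

    dx &= (uint16_t)~3u;        /* dword align */
    if (dx == 0)
        dx = 0xfffc;            /* zero is taken to mean "full segment" */
    if (dx < e->min_sp)
        dx = e->min_sp;         /* enforce the minimum stack */
    return dx;
}
```

For the reported case (%ss=0x8000, %ds=0x9000, and assuming CAN_USE_HEAP
is clear), the model ignores the incoming %sp and falls back to
heap_end_ptr, clamped to the minimum.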


* Re: Old version of lilo fails to boot 2.6.23
  2007-10-25 22:31     ` H. Peter Anvin
@ 2007-10-26 18:25       ` Joseph Parmelee
  2007-10-26 18:37         ` H. Peter Anvin
  0 siblings, 1 reply; 7+ messages in thread
From: Joseph Parmelee @ 2007-10-26 18:25 UTC
  To: H. Peter Anvin; +Cc: Andrew Morton, linux-kernel

On Thu, 25 Oct 2007, H. Peter Anvin wrote:

> H. Peter Anvin wrote:
>> [Ancient LILO boot problem]
>> 
>> Joseph, could you try this patch on your ancient-LILO setup?
>> 
>
> Actually, please try this one instead.
>
> 	-hpa
>

This patch will work in my particular case, though it appears to violate the
rules about getting too close to the EBDA (SP=0xB000 on entry).

The boot loader is responsible for loading the kernel loader at a suitable
location in low memory, but I don't understand why the boot loader should be
involved in setting the stack at all.  If we explicitly allocate the stack
as part of the .data segment, why not just play it safe and in all cases
fully set up the stack in header.S?  This ensures that the stack pointer is
not zero, is as low as possible to stay out of the EBDA, and that ss=ds;
quite irrespective of what the boot loader does.

What am I missing?

Regards,

Joseph

Please CC me directly as I am no longer subscribed to the list.


* Re: Old version of lilo fails to boot 2.6.23
  2007-10-26 18:25       ` Joseph Parmelee
@ 2007-10-26 18:37         ` H. Peter Anvin
  0 siblings, 0 replies; 7+ messages in thread
From: H. Peter Anvin @ 2007-10-26 18:37 UTC
  To: Joseph Parmelee; +Cc: Andrew Morton, linux-kernel

Joseph Parmelee wrote:
> 
> This patch will work in my particular case, though it appears to violate 
> the rules about getting too close to the EBDA (SP=0xB000 on entry).
> 
> The boot loader is responsible for loading the kernel loader at a suitable
> location in low memory, but I don't understand why the boot loader 
> should be
> involved in setting the stack at all.  If we explicitly allocate the stack
> as part of the .data segment, why not just play it safe and in all cases
> fully set up the stack in header.S?  This ensures that the stack pointer is
> not zero, is as low as possible to stay out of the EBDA, and that ss=ds;
> quite irrespective of what the boot loader does.
> 
> What am I missing?
> 

What you're missing is that "just loading into a suitable location in 
low memory" isn't a sufficient condition.  This is something that one 
finds out very quickly trying to do boot loader work.

Heap and stack control the amount of functionality that is available, 
and therefore the protocol allows them to be dynamic.

Anyway, the final version of the patch that I sent you privately uses 
this logic:

- If heap size is properly reported, use it.
- Otherwise, if %ss == %ds, then use the stack pointer as entered.
- Otherwise, use the minimum stack.

This seems like a fairly reasonable compromise, especially since 
anything even remotely modern will be handled by the first clause.

	-hpa
