grub-devel.gnu.org archive mirror
* HP root-cause analysis for GRUB "Red screen of death" on DL120/DL360 G7 servers
@ 2011-12-08 19:39 Iain Barker
  2011-12-08 20:16 ` Vladimir 'φ-coder/phcoder' Serbinenko
  0 siblings, 1 reply; 5+ messages in thread
From: Iain Barker @ 2011-12-08 19:39 UTC (permalink / raw)
  To: bug-grub@gnu.org; +Cc: grub-devel@gnu.org

I am posting the following information with permission from HP support, in the hope that it may be useful for future GRUB developer reference.

Summary:
When using GRUB to chain-load from one device to another device (e.g. USB to HDD), the HP BIOS used in DL120/DL360 and other G7 servers reports "Illegal Opcode" and a red crashdump screen.  This failure did not occur on previous generation (G6) servers of the same models, which used AMI/Phoenix BIOS.

References:
Acme Packet opened HP support case 4635415916 for additional clarification in reference to the public HP customer advisory number c02695572

http://bizsupport1.austin.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c02695572&lang=en&cc=us&taskId=101&prodSeriesId=4091408&prodTypeId=15351

Root cause analysis:
HP level3 engineering identified the root cause as follows:

_start_quoted_text_

HP Level-3 engineering has found that the HP BIOS on the DL120 G7 is not causing the red screen. GRUB installs its own INT13 handler in the interrupt vector table, so it intercepts all INT13 calls. Some time after that, GRUB performs a memory copy operation that overwrites the data at the address where GRUB stores the INT13 handler code. As a result, on the next INT13 call in GRUB, the interrupt handler is no longer there, so the processor starts executing whatever data overwrote the INT13 handler code.
 
Here is how the red screen happens: when the processor executes an illegal instruction (as when it tries to execute whatever is in the overwritten INT13 handler), the processor raises an interrupt, which the BIOS handles by printing the red screen with the register dump and the message. So our BIOS just prints the red screen, but the cause of the red screen is GRUB.
 
The specific scenario which leads to this is identified as follows:

1) GRUB installs its own INT13 handler.
2) Near the end of the chain-loading process, GRUB loads an image of the Linux kernel into memory, which wipes out its INT13 handler.
3) Right before GRUB transfers control to the kernel to boot, GRUB calls a function to turn off the floppy drive.
4) The floppy code then makes an INT13 call to the handler, which has been overwritten by the kernel, and this results in the red screen.
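
The four steps above can be modelled in a few lines. The following is only a toy simulation in Python of a 1 MiB real-mode address space, not GRUB or BIOS code: the handler location (0x9000:0000), the handler bytes, and the image load address are made-up values chosen so that the ranges overlap. The only fixed fact it relies on is that the INT 13h vector lives at 0x13 * 4 in the interrupt vector table, stored as offset then segment.

```python
import struct

MEM_SIZE = 1 << 20        # 1 MiB of real-mode address space
INT13_VECTOR = 0x13 * 4   # each IVT entry is 4 bytes: offset word, then segment word


def install_handler(mem, segment, offset, code):
    """Copy handler code into memory and point the INT 13h vector at it."""
    linear = (segment << 4) + offset
    mem[linear:linear + len(code)] = code
    struct.pack_into("<HH", mem, INT13_VECTOR, offset, segment)
    return linear


def handler_address(mem):
    """Return the linear address the INT 13h vector currently points to."""
    offset, segment = struct.unpack_from("<HH", mem, INT13_VECTOR)
    return (segment << 4) + offset


def load_image(mem, dest, image):
    """Copy an image into memory without checking what is already there."""
    mem[dest:dest + len(image)] = image


def handler_intact(mem, code):
    """True if the bytes behind the INT 13h vector are still the handler."""
    addr = handler_address(mem)
    return bytes(mem[addr:addr + len(code)]) == code


mem = bytearray(MEM_SIZE)
handler = b"\xfa\x9c\xcf" * 8                  # placeholder bytes, not real handler code

install_handler(mem, 0x9000, 0x0000, handler)  # step 1 (hypothetical location)
before = handler_intact(mem, handler)

load_image(mem, 0x8F000, b"\xff" * 0x2000)     # step 2: image covers 0x8F000-0x90FFF
after = handler_intact(mem, handler)           # steps 3-4 would now jump into image data

print(before, after)
```

The vector itself is untouched by the copy, which is why the next INT 13h call still jumps to the old address and lands in whatever the image left there.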
 
The problem seems to be that GRUB made assumptions about the memory layout in our system that are not accurate. HP systems that use HP-developed BIOSes instead of outsourced (AMI) BIOSes use more of a memory area called the EBDA than a typical system does. As a result, GRUB assumes there is memory it can safely use instead of properly calculating an area of safe memory. That is probably why GRUB worked on the other systems and fails on G7.

_end quoted text_
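
For reference, the EBDA point in the quoted analysis can be illustrated with a rough sketch of how the top of usable conventional memory can be derived from the BIOS Data Area rather than assumed. This is a Python model, not GRUB's actual code: the two example layouts (a 1 KiB EBDA and a 64 KiB EBDA) and the 0x9F000 buffer address are invented for illustration and are not measured values from any HP system. The BDA offsets used are the standard ones: the word at 0x40E holds the EBDA segment, and the word at 0x413 holds the conventional memory size in KiB.

```python
import struct

BDA_EBDA_SEGMENT = 0x40E   # word: real-mode segment of the EBDA (0 if none)
BDA_BASE_MEM_KIB = 0x413   # word: conventional memory size in KiB


def make_memory(base_kib, ebda_segment):
    """Build a 1 MiB memory image with the two BDA fields filled in."""
    mem = bytearray(1 << 20)
    struct.pack_into("<H", mem, BDA_BASE_MEM_KIB, base_kib)
    struct.pack_into("<H", mem, BDA_EBDA_SEGMENT, ebda_segment)
    return mem


def conventional_memory_top(mem):
    """Lowest address that must not be touched, as reported by the BDA."""
    (base_kib,) = struct.unpack_from("<H", mem, BDA_BASE_MEM_KIB)
    (ebda_segment,) = struct.unpack_from("<H", mem, BDA_EBDA_SEGMENT)
    top = base_kib * 1024
    if ebda_segment:
        top = min(top, ebda_segment << 4)
    return top


def range_is_below_ebda(mem, start, length):
    """True if [start, start + length) ends at or below the reported top."""
    return 0 <= start and 0 <= length and start + length <= conventional_memory_top(mem)


small_ebda = make_memory(base_kib=639, ebda_segment=0x9FC0)  # hypothetical 1 KiB EBDA
large_ebda = make_memory(base_kib=576, ebda_segment=0x9000)  # hypothetical 64 KiB EBDA

# A buffer hard-coded at 0x9F000 fits under the first layout but not the second.
print(hex(conventional_memory_top(small_ebda)), range_is_below_ebda(small_ebda, 0x9F000, 0x400))
print(hex(conventional_memory_top(large_ebda)), range_is_below_ebda(large_ebda, 0x9F000, 0x400))
```

The check covers only the upper bound; a real loader would also have to avoid its own code, the stack, and other low-memory structures.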

Regards,
Iain Barker - Platform Engineering, Acme Packet.
[yoshac@member.fsf.org]



* HP root-cause analysis for GRUB "Red screen of death" on DL120/DL360 G7 servers
@ 2011-12-08 15:36 Iain Barker
  0 siblings, 0 replies; 5+ messages in thread
From: Iain Barker @ 2011-12-08 15:36 UTC (permalink / raw)
  To: bug-grub@gnu.org; +Cc: grub-devel@gnu.org

I am posting the following information with permission from HP support, in the hope that it may be useful for future GRUB developer reference.
Please note that I do not subscribe to the GRUB mailing list, so cc: me directly if any reply is required.

Summary:
When using GRUB to chain-load from one device to another device, the HP BIOS used in current DL120/DL360 (G7) servers reports "Illegal Opcode" and a red crashdump screen.  This failure did not occur on previous-generation (G6) servers of the same models, which used AMI/Phoenix BIOS.

References:
HP support case 4635415916, opened for additional clarification in reference to HP customer advisory number c02695572

http://bizsupport1.austin.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c02695572&lang=en&cc=us&taskId=101&prodSeriesId=4091408&prodTypeId=15351

Root cause analysis:
HP level3 engineering identified the root cause as follows:

_start_quoted_text_

HP Level-3 engineering has found that the HP BIOS on the DL120 G7 is not causing the red screen. GRUB installs its own INT13 handler in the interrupt vector table, so it intercepts all INT13 calls. Some time after that, GRUB performs a memory copy operation that overwrites the data at the address where GRUB stores the INT13 handler code. As a result, on the next INT13 call in GRUB, the interrupt handler is no longer there, so the processor starts executing whatever data overwrote the INT13 handler code.
 
Here is how the red screen happens: when the processor executes an illegal instruction (as when it tries to execute whatever is in the overwritten INT13 handler), the processor raises an interrupt, which the BIOS handles by printing the red screen with the register dump and the message. So our BIOS just prints the red screen, but the cause of the red screen is GRUB.
 
The specific scenario which leads to this is identified as follows:

1) GRUB installs its own INT13 handler.
2) Near the end of the chain-loading process, GRUB loads an image of the Linux kernel into memory, which wipes out its INT13 handler.
3) Right before GRUB transfers control to the kernel to boot, GRUB calls a function to turn off the floppy drive.
4) The floppy code then makes an INT13 call to the handler, which has been overwritten by the kernel, and this results in the red screen.
 
The problem seems to be that GRUB made assumptions about the memory layout in our system that are not accurate. HP systems that use HP-developed BIOSes instead of outsourced (AMI) BIOSes use more of a memory area called the EBDA than a typical system does. As a result, GRUB assumes there is memory it can safely use instead of properly calculating an area of safe memory. That is probably why GRUB worked on the other systems and fails on G7.

_end quoted text_

Regards,
Iain Barker - Platform Engineering, Acme Packet.
yoshac@member.fsf.org




