public inbox for linux-ia64@vger.kernel.org
* BUG 2.6.7 hangs on boot (rx2600)
@ 2004-06-22  6:15 Grant Grundler
  2004-06-22 13:50 ` Jesse Barnes
                   ` (38 more replies)
  0 siblings, 39 replies; 40+ messages in thread
From: Grant Grundler @ 2004-06-22  6:15 UTC (permalink / raw)
  To: linux-ia64

Hi,
just a bug report.

I tried building a 2.6.7 kernel using my "standard" .config
that worked for 2.6.6. The only source code changes were
to include openib.org infiniband driver in the tree as modules.

Last output was:
	Console: colour VGA+ 80x25

The next line for 2.6.6 kernel would be:
	Memory: 6225280k/6263872k available (6560k code, 38224k reserved, 3069k data, 208k init)


After 2 minutes I still couldn't ping the box and reset it.
I should have "INIT" the box to see where it hung but wasn't thinking.

I wonder if this might be related to KALLSYMS bugs recently discussed.
Trying with CONFIG_KALLSYMS and CONFIG_KALLSYMS_ALL off next.

hth,
grant

EFI Boot Manager ver 1.10 [14.61]  Firmware ver 2.21 [4334]

Please select a boot option

    Debian GNU/Linux                                                
    EFI Shell [Built-in]                                            
    Boot Option Maintenance Menu                                    
    System Configuration Menu                                       


    Use ^ and v to change option(s). Use Enter to select an option
Loading.: Debian GNU/Linux                                          
Starting: Debian GNU/Linux
ELILO

ELILO boot: 
Linux266 Linux267   Linux265   Linux2424   (or a kernel file name: [[dev_name:/]path/]kernel_image cmdline options)

ELILO boot: Linux267
Uncompressing Linux... done
Linux version 2.6.7 (grundler@gsyprf3.external.hp.com) (gcc version 3.3.4 (Debian)) #1 SMP Mon Jun 21 22:36:17 PDT 2004
EFI v1.10 by HP: SALsystab=0x3fb38000 ACPI 2.0=0x3fb2e000 SMBIOS=0x3fb3a000 HCDP=0x3fb2c000
ACPI: RSDP (v002     HP                                    ) @ 0x000000003fb2e000
ACPI: XSDT (v001     HP   zx6000 0x00000000 HP 0x00000000) @ 0x000000003fb2e02c
ACPI: FADT (v003     HP   zx6000 0x00000000 HP 0x00000000) @ 0x000000003fb369e0
ACPI: SPCR (v001     HP   zx6000 0x00000000 HP 0x00000000) @ 0x000000003fb36b18
ACPI: DBGP (v001     HP   zx6000 0x00000000 HP 0x00000000) @ 0x000000003fb36b68
ACPI: MADT (v001     HP   zx6000 0x00000000 HP 0x00000000) @ 0x000000003fb36c28
ACPI: SPMI (v004     HP   zx6000 0x00000000 HP 0x00000000) @ 0x0003fb36ba0
ACPI: CPEP (v001     HP   zx6000 0x00000000 HP 0x00000000) @ 0x000000003fb36bf0
ACPI: SSDT (v001     HP   zx6000 0x006 INTL 0x02012044) @ 0x000000003fb33870
ACPI: SSDT (v001     HP   zx6000 0x00000006 INTL 0x02012044) @ 0x000000003fb33a50
ACPIT (v001     HP   zx6000 0x00000006 INTL 0x02012044) @ 0x000000003fb33da0
ACPI: SSDT (v001     HP   zx6000 0x00000006 INTL 0x020) @ 0x000000003fb347c0
ACPI: SSDT (v001     HP   zx6000 0x00000006 INTL 0x02012044) @ 0x000000003fb351e0
ACPI: SSDT (v001     zx6000 0x00000006 INTL 0x02012044) @ 0x000000003fb35c00
ACPI: SSDT (v001     HP   zx6000 0x00000006 INTL 0x02012044) @ 0x00000036620
ACPI: SSDT (v001     HP   zx6000 0x00000006 INTL 0x02012044) @ 0x000000003fb36800
ACPI: SSDT (v001     HP   zx6000 0x000 INTL 0x02012044) @ 0x000000003fb368f0
ACPI: DSDT (v001     HP   zx6000 0x00000007 INTL 0x02012044) @ 0x0000000000000000
efi.top: ignoring 4KB of memory at 0x0 due to granule hole at 0x0
efi.trim_top: ignoring 636KB of memory at 0x1000 due to granule ho 0x0
efi.trim_bottom: ignoring 15360KB of memory at 0x100000 due to granule hole at 0x0
SAL 3.1: HP version 2.21
SAL Platformures: None
SAL: AP wakeup using external interrupt vector 0xff
ACPI: Local APIC address 0xc0000000fee00000
ACPI: LAPIC_ADDR_Oddress[00000000fee00000])
ACPI: LSAPIC (acpi_id[0x00] lsapic_id[0x00] lsapic_eid[0x00] enabled)
CPU 0 (0x0000) enabled (BSP)
: LSAPIC (acpi_id[0x01] lsapic_id[0x01] lsapic_eid[0x00] enabled)
CPU 1 (0x0100) enabled
ACPI: IOSAPIC (id[0x0] global_irq_bas0] address[00000000fed20800])
ACPI: IOSAPIC (id[0x1] global_irq_base[0x1b] address[00000000fed22800])
ACPI: IOSAPIC (id[0x2] g_irq_base[0x26] address[00000000fed24800])
ACPI: IOSAPIC (id[0x3] global_irq_base[0x31] address[00000000fed26800])
ACPI: IOSAPd[0x4] global_irq_base[0x3c] address[00000000fed28800])
ACPI: IOSAPIC (id[0x6] global_irq_base[0x47] address[00000000fed2c800])PI: IOSAPIC (id[0x7] global_irq_base[0x52] address[00000000fed2e800])
GSI 0x24(low,level) -> CPU 0x0000 vector 48
2 CPUs avail 2 CPUs total
GSI 0x52(low,level) -> CPU 0x0000 vector 49
MCA related initialization done
On node 0 totalpages: 391492
  DMA: 391492 pages, LIFO batch:4
  Normal zone: 0 pages, LIFO batch:1
  HighMem zone: 0 pages, LIFO batch:1
Virtual mem_map start0xa0007fffc7200000
Built 1 zonelists
Kernel command line: BOOT_IMAGE=scsi0:/EFI/debian/boot/vmlinuz-2.6.7 root=/dev/sdb3 consoyS0 ro
PID hash table entries: 4096 (order 12: 65536 bytes)
CPU 0: base freq=200.000MHz, ITC ratio=10/2, ITC freq=1000.000MHz+ppm
Console: colour VGA+ 80x25
<End Of Output>

* Re: BUG 2.6.7 hangs on boot (rx2600)
From: Jesse Barnes @ 2004-06-22 13:50 UTC (permalink / raw)
  To: linux-ia64

On Tuesday, June 22, 2004 2:15 am, Grant Grundler wrote:
> Kernel command line: BOOT_IMAGE=scsi0:/EFI/debian/boot/vmlinuz-2.6.7
> root=/dev/sdb3 consoyS0 ro PID hash table entries: 4096 (order 12: 65536
> bytes)
> CPU 0: base freq=200.000MHz, ITC ratio=10/2, ITC freq=1000.000MHz+ppm
> Console: colour VGA+ 80x25

Did you have any MCA records after you rebooted?  This is about the time it 
should have chosen a default console device and started printing to it, so if 
it tried using the VGA console and you didn't have one, it may have just 
fallen over.  The MCA record would be pretty clear about that though...

Jesse

* Re: BUG 2.6.7 hangs on boot (rx2600)
From: Grant Grundler @ 2004-06-22 14:51 UTC (permalink / raw)
  To: linux-ia64

On Tue, Jun 22, 2004 at 09:50:10AM -0400, Jesse Barnes wrote:
> Did you have any MCA records after you rebooted?

Sorry - I forgot to mention that. I did check and there was none.

> This is about the time it should have chosen a default console device
> and started printing to it, so if it tried using the VGA console and
> you didn't have one, it may have just fallen over.
> The MCA record would be pretty clear about that though...

I was thinking it just picked a different serial port for ttyS0 output.
That's why I waited a couple of minutes to see if I could
ping/ssh to the box. But after two minutes it still didn't
respond and should have by then.

The system does have a VGA output but nothing is connected.
USB keyboard/mouse is not connected either.

However, you just gave me the idea to enable CONFIG_IA64_EARLY_PRINTK.
Trying that now.

thanks,
grant

> 
> Jesse

* Re: BUG 2.6.7 hangs on boot (rx2600)
From: Bjorn Helgaas @ 2004-06-22 15:59 UTC (permalink / raw)
  To: linux-ia64

On Tuesday 22 June 2004 12:15 am, Grant Grundler wrote:
> I tried building a 2.6.7 kernel using my "standard" .config
> that worked for 2.6.6. The only source code changes were
> to include openib.org infiniband driver in the tree as modules.
> 
> Last output was:
> 	Console: colour VGA+ 80x25

David saw this late last week on a 2.6.7 kernel compiled for UP,
and I reproduced it yesterday using the current linux-ia64-2.5
BK bits.  I'm using a zx6000 with serial console (VGA also
present, but unused).  No MCA records.  I collected info from
an INIT, but haven't had a chance to look very far.  Something
in the error records pointed to efi_memmap_walk(), so I added
a printk there, and the problem disappeared.

David mentioned the linker in Debian/testing as a possibility.
I'm using ld 2.14.90.0.7 from Debian binutils 2.14.90.0.7-3.

* Re: BUG 2.6.7 hangs on boot (rx2600)
From: Grant Grundler @ 2004-06-22 21:16 UTC (permalink / raw)
  To: linux-ia64

On Tue, Jun 22, 2004 at 09:59:11AM -0600, Bjorn Helgaas wrote:
> David saw this late last week on a 2.6.7 kernel compiled for UP,

This is for SMP

> and I reproduced it yesterday using the current linux-ia64-2.5
> BK bits.  I'm using a zx6000 with serial console (VGA also
> present, but unused).  No MCA records.  I collected info from
> an INIT, but haven't had a chance to look very far.  Something
> in the error records pointed to efi_memmap_walk(), so I added
> a printk there, and the problem disappeared.

*ugh*. 

> David mentioned the linker in Debian/testing as a possibility.
> I'm using ld 2.14.90.0.7 from Debian binutils 2.14.90.0.7-3.

Debian/testing is currently:
ii  binutils       2.14.90.0.7-8  The GNU assembler, linker and binary utiliti

I suppose that's a possibility. I have a TOC dump on the machine
and just need to "harvest" it, then see where it's spinning.
Probably get to that tomorrow.

thanks,
grant

* Re: BUG 2.6.7 hangs on boot (rx2600)
From: Bjorn Helgaas @ 2004-06-22 21:23 UTC (permalink / raw)
  To: linux-ia64

On Tuesday 22 June 2004 3:16 pm, Grant Grundler wrote:
> On Tue, Jun 22, 2004 at 09:59:11AM -0600, Bjorn Helgaas wrote:
> > David saw this late last week on a 2.6.7 kernel compiled for UP,
> 
> This is for SMP

Yeah, I saw it with slightly different "console=" parameters than
David did.  So I think it's just a heisenbug and he happened to
trip over it with a UP kernel.

> I suppose that's a possibility. I have a TOC dump on the machine
> and just need to "harvest" it, then see where it's spinning.

Have you used salinfo_decode?  I think there's a debian package
for it now ("salinfo").

* Re: BUG 2.6.7 hangs on boot (rx2600)
From: Grant Grundler @ 2004-06-22 22:28 UTC (permalink / raw)
  To: linux-ia64

On Tue, Jun 22, 2004 at 03:23:12PM -0600, Bjorn Helgaas wrote:
> Yeah, I saw it with slightly different "console=" parameters than
> David did.  So I think it's just a heisenbug and he happened to
> trip over it with a UP kernel.

ok

> Have you used salinfo_decode?  I think there's a debian package
> for it now ("salinfo").

no...too cool.

gsyprf3:~# apt-get install salinfo
Reading Package Lists... Done
Building Dependency Tree... Done
The following NEW packages will be installed:
  salinfo
0 upgraded, 1 newly installed, 0 to remove and 7 not upgraded.
Need to get 29.5kB of archives.
After unpacking 197kB of additional disk space will be used.
Get:1 http://mirrors.kernel.org testing/main salinfo 0.5-1 [29.5kB]
Fetched 29.5kB in 2s (14.3kB/s)
Selecting previously deselected package salinfo.
(Reading database ... 63823 files and directories currently installed.)
Unpacking salinfo (from .../salinfo_0.5-1_ia64.deb) ...
Setting up salinfo (0.5-1) ...
Starting salinfo decode daemons: salinfo.

gsyprf3:~# man salinfo 
No manual entry for salinfo
gsyprf3:~# 

ah it's "man salinfo_decode".

gsyprf3:/var/log/salinfo/decoded# ls
2004-06-22-06:44:21-cpu0-init.0  2004-06-22-06:44:21-cpu1-init.0  old

bits of cpu0-init.0 data:
...
  Processor static data:
    xip  : 0xa00000010005b330  xfs  : 0x8000000000000000
    xpsr : 0x00001010084a6010
...
    iip  : 0xa000000100005400  iipa : 0xa00000010005b320
    ipsr : 0x0000001008000010
...
        b0 : 0xa000000100013cb0 0x000000003f7e4620 0x0000000000000000 0x0000000000000000
        b4 : 0x0000000000000000 0x0000000000000000 0xe00000003fa43280 0xa00000010005b2a0

Reading the corresponding System.map-2.6.7:
...
a000000100005400 t general_exception
...
a00000010005b2a0 t count_reserved_pages
a00000010005b380 T mem_init
a00000010005b8c0 T ia64_set_rbs_bot
a00000010005b920 t mapped_kernel_page_is_present
a00000010005ba40 T ia64_do_page_fault


I wasn't 100% certain I was looking at the right System.map file.
But both seem to be identical for the above symbols.
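
The address-to-symbol lookup done by hand above can be sketched in a few
lines of Python. This is only an illustration: the helper names
(load_symbols, lookup) are made up here, and the excerpt holds just the
System.map lines quoted above, so an address that falls in a gap of the
excerpt would resolve to the wrong symbol unless the full System.map is
loaded.

```python
import bisect

# Excerpt of System.map-2.6.7 as quoted above; "..." marks elided lines.
SYSTEM_MAP_EXCERPT = """\
a000000100005400 t general_exception
...
a00000010005b2a0 t count_reserved_pages
a00000010005b380 T mem_init
a00000010005b8c0 T ia64_set_rbs_bot
a00000010005b920 t mapped_kernel_page_is_present
a00000010005ba40 T ia64_do_page_fault
""".splitlines()


def load_symbols(lines):
    """Parse 'address type name' lines into a sorted list of (address, name)."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip elisions and blank lines
        symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def lookup(symbols, address):
    """Return (name, offset) of the last symbol starting at or below address."""
    starts = [start for start, _ in symbols]
    index = bisect.bisect_right(starts, address) - 1
    if index < 0:
        return None  # address lies below the first known symbol
    start, name = symbols[index]
    return name, address - start


if __name__ == "__main__":
    symbols = load_symbols(SYSTEM_MAP_EXCERPT)
    # Register values taken from the cpu0-init.0 record above.
    for reg, value in [("xip", 0xa00000010005b330),
                       ("iip", 0xa000000100005400),
                       ("iipa", 0xa00000010005b320)]:
        name, offset = lookup(symbols, value)
        print(f"{reg:5s} {value:#018x} -> {name}+{offset:#x}")
```

With the values from the record, xip and iipa land inside
count_reserved_pages and iip is the entry of general_exception.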

I've made both System.map files (with and w/o EARLY_PRINTK)
and salinfo records available on
	ftp://gsyprf3.external.hp.com/pub/2.6.7/

hth,
grant

* Re: BUG 2.6.7 hangs on boot (rx2600)
From: Grant Grundler @ 2004-06-22 22:30 UTC (permalink / raw)
  To: linux-ia64

On Tue, Jun 22, 2004 at 07:51:23AM -0700, Grant Grundler wrote:
> However, you just gave me the idea to enable CONFIG_IA64_EARLY_PRINTK.
> Trying that now.

That didn't matter. Still "hangs".

thanks,
grant

* Re: BUG 2.6.7 hangs on boot (rx2600)
From: Arun Sharma @ 2004-06-22 22:38 UTC (permalink / raw)
  To: linux-ia64

On 6/22/2004 2:23 PM, Bjorn Helgaas wrote:

> On Tuesday 22 June 2004 3:16 pm, Grant Grundler wrote:
>> On Tue, Jun 22, 2004 at 09:59:11AM -0600, Bjorn Helgaas wrote:
>> > David saw this late last week on a 2.6.7 kernel compiled for UP,
>> 
>> This is for SMP
> 
> Yeah, I saw it with slightly different "console=" parameters than
> David did.  So I think it's just a heisenbug and he happened to
> trip over it with a UP kernel.

I saw it on a 4 way Tiger as well. The kernel was compiled using the RHEL3 toolchain.
Saw the machine sitting in dispatch_to_fault_handler: with the IIP pointing to count_reserved_pages(). I think IFA was 0.

Didn't get a chance to debug this further.

	-Arun


* RE: BUG 2.6.7 hangs on boot (rx2600)
From: Tian, Kevin @ 2004-06-23 14:26 UTC (permalink / raw)
  To: linux-ia64

On 6/23/2004 6:38AM, Sharma Arun wrote:
>I saw it on a 4 way Tiger as well. The kernel was compiled using the RHEL3
>toolchain.
>Saw the machine sitting in dispatch_to_fault_handler: with the IIP pointing to
>count_reserved_pages(). I think IFA was 0.
>
>Didn't get a chance to debug this further.
>
>	-Arun

I also saw such a hang on Tiger, when running rpmbuild on SLES9-RC2. I
suspect that the recent patch which moves init_task to region 5 may break
something there.

Env:
Tiger 4 with 4 Madison
SLES9-RC2 (linux-2.6.5-7.79)

Description:
Before rpmbuild, I applied the partial-page patch derived from David's tree,
which happened to contain another patch to move init_task to region 5.
Then when I booted the new kernel created by rpmbuild (so with the default
config file), Tiger hung immediately.

Excluding the patch that moves init_task made the problem go away.
======
The kernel just stopped at general_exception(), with IIP pointing to
count_reserved_pages():
IIP: 0xa0000001000594a0 (In count_reserved_pages)
ISR: 0x00000028400000030
IFA: 0xe00000004fefe6018

Then by single step:
general_exception() -> dispatch_to_default_handler() -> SAVE_MIN_WITH_COVER_R19:
	MINSTATE_GET_CURRENT(r16);	//r16 = 0xa00000010070c000 (&init_task)
	...
	adds r16=IA64_TASK_THREAD_ON_USTACK_OFFSET,r16;	\
	;;	\
	ld1 r17=[r16];	<------ Just at this point, ITP also hangs and another fault seems to be happening.

I'm suspecting line in ia64_switch_to:
	/*
	* If we've already mapped this task's page, we can skip doing it again.
	*/
(p6)	cmp.eq p7,p6=r27,r27 <----- Should here cmp.eq.unc be used instead? No time to test it now...

Thanks,
Kevin

* Re: BUG 2.6.7 hangs on boot (rx2600)
From: Jesse Barnes @ 2004-06-23 17:03 UTC (permalink / raw)
  To: linux-ia64

[-- Attachment #1: Type: text/plain, Size: 636 bytes --]

On Wednesday, June 23, 2004 10:26 am, Tian, Kevin wrote:
> I'm suspecting line in ia64_switch_to:
> 	/*
> 	* If we've already mapped this task's page, we can skip doing it again.
> 	*/
> (p6)	cmp.eq p7,p6=r27,r27 <----- Should here cmp.eq.unc be used instead? No time to test it now...

Thanks a lot Kevin, that worked great!  Here's the patch for people who want 
something a little easier to apply.  Too bad the init_task move patch was 
mixed up with some ia32 stuff or it would have been easier to revert this 
change for testing.

Clear both p7 and p6 predicates in the check for task struct mapping in 
ia64_switch_to.

Jesse

[-- Attachment #2: init-task-move-fix.patch --]
[-- Type: text/x-diff, Size: 377 bytes --]

===== arch/ia64/kernel/entry.S 1.61 vs edited =====
--- 1.61/arch/ia64/kernel/entry.S	2004-06-16 21:09:33 -04:00
+++ edited/arch/ia64/kernel/entry.S	2004-06-23 12:47:37 -04:00
@@ -191,7 +191,7 @@
 	/*
 	 * If we've already mapped this task's page, we can skip doing it again.
 	 */
-(p6)	cmp.eq p7,p6=r26,r27
+(p6)	cmp.eq.unc p7,p6=r26,r27
 (p6)	br.cond.dpnt .map
 	;;
 .done:

* Re: BUG 2.6.7 hangs on boot (rx2600)
From: Bjorn Helgaas @ 2004-06-23 22:50 UTC (permalink / raw)
  To: linux-ia64

On Wednesday 23 June 2004 8:26 am, Tian, Kevin wrote:
> I'm suspecting line in ia64_switch_to:
> 	/*
> 	* If we've already mapped this task's page, we can skip doing it again.
> 	*/
> (p6)	cmp.eq p7,p6=r27,r27 <----- Should here cmp.eq.unc be used instead? No time to test it now...

Your change evidently solves the problem, but I don't understand
how.  Can you enlighten me?  Here's the essence of the code:

	        cmp.eq p7,p6=r25,in0
	(p6)    cmp.eq p7,p6=r26,r27
	(p6)    br.cond.dpnt .map

As I understand it, adding ".unc" to the second cmp instruction
should only make a difference when p6=0.  In that case, after
the old cmp (no ".unc") we'd have p6=0 and p7=1, while after
the new cmp (with ".unc") we'd have p6=0 and p7=0.  But we
never test p7, so I don't see what difference it makes.
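
The predicate reasoning above can be checked with a small model. This is
a sketch in Python, not kernel code: cmp_eq and switch_to_predicates are
made-up names, and the model only encodes the documented behaviour that a
plain predicated cmp leaves its targets untouched when the qualifying
predicate is 0, while cmp.unc clears both targets in that case.

```python
def cmp_eq(preds, qp, p_true, p_false, equal, unc=False):
    """Model of '(qp) cmp.eq[.unc] p_true,p_false=a,b' where equal means a == b."""
    preds = dict(preds)  # work on a copy so callers can compare before/after
    if preds[qp]:
        # Qualifying predicate is 1: normal compare, targets are complementary.
        preds[p_true] = equal
        preds[p_false] = not equal
    elif unc:
        # Qualifying predicate is 0 and .unc given: both targets are cleared.
        preds[p_true] = False
        preds[p_false] = False
    # Qualifying predicate is 0 without .unc: targets keep their old values.
    return preds


def switch_to_predicates(first_equal, second_equal, unc):
    """Run the two compares from ia64_switch_to and return the predicate file."""
    preds = {"p0": True, "p6": False, "p7": False}  # p0 is hardwired to 1
    preds = cmp_eq(preds, "p0", "p7", "p6", first_equal)             # cmp.eq p7,p6=r25,in0
    preds = cmp_eq(preds, "p6", "p7", "p6", second_equal, unc=unc)   # (p6) cmp.eq[.unc] p7,p6=r26,r27
    return preds


if __name__ == "__main__":
    for first_equal in (True, False):
        for second_equal in (True, False):
            plain = switch_to_predicates(first_equal, second_equal, unc=False)
            with_unc = switch_to_predicates(first_equal, second_equal, unc=True)
            print(first_equal, second_equal,
                  "plain p6/p7:", plain["p6"], plain["p7"],
                  "unc p6/p7:", with_unc["p6"], with_unc["p7"])
```

In this model p6, the only predicate the following branch tests, comes
out the same with and without ".unc" in all four cases; only p7 differs,
and only when the first compare already cleared p6.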

* RE: BUG 2.6.7 hangs on boot (rx2600)
From: Tian, Kevin @ 2004-06-24  2:57 UTC (permalink / raw)
  To: linux-ia64

On 6/24/2004 6:51AM, Bjorn Helgaas wrote:
>> (p6)	cmp.eq p7,p6=r27,r27 <----- Should here cmp.eq.unc be used instead? No time to test it now...
>
>Your change evidently solves the problem, but I don't understand
>how.  Can you enlighten me?  Here's the essence of the code:
>
>	        cmp.eq p7,p6=r25,in0
>	(p6)    cmp.eq p7,p6=r26,r27
>	(p6)    br.cond.dpnt .map
>
>As I understand it, adding ".unc" to the second cmp instruction
>should only make a difference when p6=0.  In that case, after
>the old cmp (no ".unc") we'd have p6=0 and p7=1, while after
>the new cmp (with ".unc") we'd have p6=0 and p7=0.  But we
>never test p7, so I don't see what difference it makes.

Ah, I agree with you on this point. So maybe this change just affects
something else indirectly. I'm looking into it again... Does anyone have
a quick answer? :)

Thanks,
Kevin

* RE: BUG 2.6.7 hangs on boot (rx2600)
From: Chen, Kenneth W @ 2004-06-25  0:36 UTC (permalink / raw)
  To: linux-ia64

>>>> Bjorn Helgaas wrote on Wednesday, June 23, 2004 3:51 PM
> On Wednesday 23 June 2004 8:26 am, Tian, Kevin wrote:
> > I'm suspecting line in ia64_switch_to:
> > 	/*
> > 	* If we've already mapped this task's page, we can skip doing it again.
> > 	*/
> > (p6)	cmp.eq p7,p6=r27,r27 <----- Should here cmp.eq.unc be used instead? No time to test it now...
>
> Your change evidently solves the problem, but I don't understand
> how.  Can you enlighten me?  Here's the essence of the code:
>


This is called black magic and pure coincidence. Welcome to the world
of randomness.  If I boot that "unc" kernel frequently enough, it will
hang eventually.  Without "unc" it also has 30/70 fail/pass rate.

The regression is coming from moving init_task from region 7 to region
5.  The hang was a nested fault with no valid dtlb mapping for the init
task's stack.  The problem was from physical mode efi call.  efi_call_phys
does: ia64_switch_mode_phys, call the function, then ia64_switch_mode_virt.
The ia64_switch_mode_virt now needs to special case the init task to put
sp and ar.bspstore into region5 instead of region7.  I have a quick patch
that fixes the hang.  Let me polish it a bit more and then post.
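
To make the region 5 versus region 7 distinction concrete, here is a
small Python sketch of the address arithmetic. It is not the kernel
code: the function names are made up, and the values of KERNEL_START and
KERNEL_TR_PAGE_SHIFT are assumptions for illustration (the System.map
addresses earlier in the thread are consistent with that KERNEL_START).
The region number is the top three bits of a virtual address.

```python
REGION_SHIFT = 61                    # bits 61..63 of an address select the region
KERNEL_START = 0xa000000100000000    # assumed region 5 kernel base
KERNEL_TR_PAGE_SHIFT = 26            # assumed: 64MB kernel translation register page


def region(addr):
    """Return the region number (0..7) of a 64-bit virtual address."""
    return addr >> REGION_SHIFT


def to_region7(addr):
    """Set bits 61..63 to all ones, i.e. produce a region 7 address."""
    return addr | (0x7 << REGION_SHIFT)


def to_kernel_region5(addr):
    """Keep the offset inside the kernel TR page and rebase it on KERNEL_START."""
    offset = addr & ((1 << KERNEL_TR_PAGE_SHIFT) - 1)
    return offset | KERNEL_START


if __name__ == "__main__":
    # Hypothetical physical address of init_task, chosen only for illustration.
    phys = 0x0470c000
    for label, value in [("region 7", to_region7(phys)),
                         ("region 5", to_kernel_region5(phys))]:
        print(f"{label}: {value:#018x} (region {region(value)})")
```

A stack pointer converted the region 7 way only works if a region 7 dtlb
mapping covers it, which is exactly what the init task no longer has
once it lives in region 5.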

Oh yeah, baby, the first two hunks in head.S are just plain wrong in this
patch: http://www.gelato.unsw.edu.au/linux-ia64/0406/10047.html.  Let me
work on that too .....

- Ken



* RE: BUG 2.6.7 hangs on boot (rx2600)
From: Chen, Kenneth W @ 2004-06-25 16:31 UTC (permalink / raw)
  To: linux-ia64

>>>>> Chen, Kenneth W wrote on Thursday, June 24, 2004 5:37 PM
> The regression is coming from moving init_task from region 7 to region
> 5.  The hang was a nested fault with no valid dtlb mapping for the init
> task's stack.  The problem was from physical mode efi call.  efi_call_phys
> does: ia64_switch_mode_phys, call the function, then ia64_switch_mode_virt.
> The ia64_switch_mode_virt now needs to special case the init task to put
> sp and ar.bspstore into region5 instead of region7.  I have a quick patch
> that fixes the hang.  Let me polish it a bit more and then post.
>
> Oh yeah, baby, the first two hunks in head.S are just plain wrong in this
> patch: http://www.gelato.unsw.edu.au/linux-ia64/0406/10047.html.  Let me
> work on that too .....

Regarding rev 1.24 in head.S:
http://lia64.bkbits.net:8080/linux-ia64-2.5/diffs/arch/ia64/kernel/head.S@1.24?nav=index.html|src/.|src/arch|src/arch/ia64|src/arch/ia64/kernel|hist/arch/ia64/kernel/head.S

The relocation of r16 is incorrect.  For BP, we are not installing any
region 7 TLB mapping.  But this patch will put a valid kernel granule
index into kr(stack).  Equivalently, it lies to the rest of the kernel
that it installed an entry at the index where the kernel image is located.
If the first task coming out of this init_task happens to have its task
struct in that very same granule, the stack will not be mapped by any DTLB.
Then bad things happen, like a random hang because of a nested fault.

The first two hunks in the following patch reverse that relocation. The
next two hunks fix the random hang observed when moving init_task from
region 7 to region 5.  As explained earlier, BP did a call to efi_call_phys
which switches to physical mode and then back to virtual.  When going
back to virtual, it converts ar.bspstore and sp to region 7 address.
After that, any heavy weight fault will lead into nested fault because
there is no region 7 dtlb mapping for its stack.  The fix is to special
case ia64_switch_mode_virt to compute ar.bspstore and sp's virtual
addresses into region 5.

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>


Tested on Intel tiger machine with several hundred consecutive boots.



--- 1.24/arch/ia64/kernel/head.S	Wed Jun 16 18:09:33 2004
+++ edited/arch/ia64/kernel/head.S	Fri Jun 25 08:26:19 2004
@@ -154,10 +154,9 @@ start_ap:
 #endif
 	;;
 	tpa r3=r2		// r3 = phys addr of task struct
+	mov r16=-1
+(isBP)	br.cond.dpnt .load_current // no need to map region 5 init_task
 	;;
-	shr.u r16=r3,IA64_GRANULE_SHIFT
-(isBP)	br.cond.dpnt .load_current // BP stack is on region 5 --- no need to map it
-
 	// load mapping for stack (virtaddr in r2, physaddr in r3)
 	rsm psr.ic
 	movl r17=PAGE_KERNEL
@@ -169,6 +168,7 @@ start_ap:
 	dep r2=-1,r3,61,3	// IMVA of task
 	;;
 	mov r17=rr[r2]
+	shr.u r16=r3,IA64_GRANULE_SHIFT
 	;;
 	dep r17=0,r17,8,24
 	;;
@@ -766,7 +766,9 @@ GLOBAL_ENTRY(ia64_switch_mode_virt)
 	flushrs				// must be first insn in group
 	srlz.i
  }
+	movl r19=init_task
 	;;
+	cmp.eq p7,p6=r19,r13		// special case init_task
 	mov cr.ipsr=r16			// set new PSR
 	add r3=1f-ia64_switch_mode_virt,r15

@@ -781,11 +783,15 @@ GLOBAL_ENTRY(ia64_switch_mode_virt)
 	movl r18=KERNEL_START
 	dep r3=0,r3,KERNEL_TR_PAGE_SHIFT,64-KERNEL_TR_PAGE_SHIFT
 	dep r14=0,r14,KERNEL_TR_PAGE_SHIFT,64-KERNEL_TR_PAGE_SHIFT
-	dep r17=-1,r17,61,3
-	dep sp=-1,sp,61,3
+(p6)	dep r17=-1,r17,61,3
+(p6)	dep sp=-1,sp,61,3
+(p7)	dep r17=0,r17,KERNEL_TR_PAGE_SHIFT,64-KERNEL_TR_PAGE_SHIFT
+(p7)	dep sp =0, sp,KERNEL_TR_PAGE_SHIFT,64-KERNEL_TR_PAGE_SHIFT
 	;;
 	or r3=r3,r18
 	or r14=r14,r18
+(p7)	or r17=r17,r18
+(p7)	or sp=sp,r18
 	;;

 	mov r18=ar.rnat			// save ar.rnat




* RE: BUG 2.6.7 hangs on boot (rx2600)
From: David Mosberger @ 2004-06-26  5:29 UTC (permalink / raw)
  To: linux-ia64

>>>>> On Fri, 25 Jun 2004 09:31:15 -0700, "Chen, Kenneth W" <kenneth.w.chen@intel.com> said:

  Ken> Regarding rev 1.24 in head.S:
  Ken> http://lia64.bkbits.net:8080/linux-ia64-2.5/diffs/arch/ia64/kernel/head.S@1.24?nav=index.html|src/.|src/arch|src/arch/ia64|src/arch/ia64/kernel|hist/arch/ia64/kernel/head.S

  Ken> The relocation of r16 is incorrect.  For BP, we are not installing any
  Ken> region 7 TLB mapping.

Indeed!  I must be missing something though: with your patch, r13 will
be initialized to the region 7 address again, which defeats the
purpose of the original patch.  I think the initialization of r13
needs to be conditional on whether we're dealing with init_task or
anything else.

  Ken> As explained earlier, BP did a call to efi_call_phys which
  Ken> switches to physical mode and then back to virtual.  When going
  Ken> back to virtual, it converts ar.bspstore and sp to region 7
  Ken> address.  After that, any heavy weight fault will lead into
  Ken> nested fault because there is no region 7 dtlb mapping for its
  Ken> stack.  The fix is to special case ia64_switch_mode_virt to
  Ken> compute ar.bspstore and sp's virtual addresses into region 5.

True, but it's really ugly to add more special cases.  Wouldn't it be
better to explicitly pass the sp/bsp that need to be restored?
(Caveat: can't use the normal calling conventions there; perhaps r17
and r18 could be used?)

	--david

* RE: BUG 2.6.7 hangs on boot (rx2600)
From: Chen, Kenneth W @ 2004-06-26  5:48 UTC (permalink / raw)
  To: linux-ia64

David Mosberger wrote on Friday, June 25, 2004 10:29 PM
>
>   Ken> The relocation of r16 is incorrect.  For BP, we are not installing any
>   Ken> region 7 TLB mapping.
>
> Indeed!  I must be missing something though: with your patch, r13 will
> be initialized to the region 7 address again, which defeats the
> purpose of the original patch.  I think the initialization of r13
> needs to be conditional on whether we're dealing with init_task or
> anything else.

Not sure what you mean.  The first two hunks are trying to revert the
change in rev 1.24 and r16 initialization, which gets put into kr(stack)
later. I'm not touching r13.

>   Ken> As explained earlier, BP did a call to efi_call_phys which
>   Ken> switches to physical mode and then back to virtual.  When going
>   Ken> back to virtual, it converts ar.bspstore and sp to region 7
>   Ken> address.  After that, any heavy weight fault will lead into
>   Ken> nested fault because there is no region 7 dtlb mapping for its
>   Ken> stack.  The fix is to special case ia64_switch_mode_virt to
>   Ken> compute ar.bspstore and sp's virtual addresses into region 5.
>
> True, but it's really ugly to add more special cases.  Wouldn't it be
> better to explicitly pass the sp/bsp that need to be restored?
> (Caveat: can't use the normal calling conventions there; perhaps r17
> and r18 could be used?)

Yeah, but we have to update all the call sites, current efi_call_phys
and two other PAL static/stacked calls.
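
A minimal C sketch of why the virt -> phys -> virt round trip lands in
region 7. The helper names, the linear fake_tpa() model, and the
KERNEL_START value are assumptions made for illustration; only the
behaviour of "dep rX=-1,rX,61,3" is taken from the code under discussion.

```c
#include <stdint.h>

/* All names below are illustrative; none of them are kernel APIs. */
#define REGION_SHIFT   61
#define REGION_MASK    (UINT64_C(7) << REGION_SHIFT)
/* Assumed base of the region 5 kernel mapping. */
#define KERNEL_START   UINT64_C(0xa000000100000000)

/* Region number: the top three bits of a virtual address. */
static unsigned region_of(uint64_t vaddr)
{
	return (unsigned)(vaddr >> REGION_SHIFT);
}

/*
 * Stand-in for "tpa" on a region 5 kernel address, assuming the kernel
 * image is mapped linearly from KERNEL_START to load_phys.
 */
static uint64_t fake_tpa(uint64_t vaddr, uint64_t load_phys)
{
	return vaddr - KERNEL_START + load_phys;
}

/* What "dep rX=-1,rX,61,3" does: force the region bits to 7. */
static uint64_t phys_to_region7(uint64_t paddr)
{
	return paddr | REGION_MASK;
}
```

With sp = KERNEL_START + 0x7ff0 and a load address of 0x4000000,
fake_tpa() gives 0x4007ff0 and phys_to_region7() gives
0xe000000004007ff0, which is a region 7 address rather than the original
region 5 one.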



^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: BUG 2.6.7 hangs on boot (rx2600)
  2004-06-22  6:15 BUG 2.6.7 hangs on boot (rx2600) Grant Grundler
                   ` (15 preceding siblings ...)
  2004-06-26  5:48 ` Chen, Kenneth W
@ 2004-06-26  5:55 ` David Mosberger
  2004-06-29 15:09 ` Chen, Kenneth W
                   ` (21 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: David Mosberger @ 2004-06-26  5:55 UTC (permalink / raw)
  To: linux-ia64

>>>>> On Fri, 25 Jun 2004 22:48:26 -0700, "Chen, Kenneth W" <kenneth.w.chen@intel.com> said:

  Ken> David Mosberger wrote on Friday, June 25, 2004 10:29 PM

  Ken> The relocation of r16 is incorrect.  For BP, we are not installing any
  Ken> region 7 TLB mapping.

  >> Indeed!  I must be missing something though: with your patch, r13 will
  >> be initialized to the region 7 address again, which defeats the
  >> purpose of the original patch.  I think the initialization of r13
  >> needs to be conditional on whether we're dealing with init_task or
  >> anything else.

  Ken> Not sure what you mean.  The first two hunks are trying to revert the
  Ken> change in rev 1.24 and r16 initialization, which gets put into kr(stack)
  Ken> later. I'm not touching r13.

Ah, yes, my bad (I guess I really should catch up on sleep first...).

  >> True, but it's really ugly to add more special cases.  Wouldn't it be
  >> better to explicitly pass the sp/bsp that need to be restored?
  >> (Caveat: can't use the normal calling conventions there; perhaps r17
  >> and r18 could be used?)

  Ken> Yeah, but we have to update all the call sites, current efi_call_phys
  Ken> and two other PAL static/stacked calls.

True, but I think there are only 3 call-sites.  If it turns out to be
_really_ ugly we can reconsider, but I think it might be a better
choice in the long run.

	--david

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: BUG 2.6.7 hangs on boot (rx2600)
  2004-06-22  6:15 BUG 2.6.7 hangs on boot (rx2600) Grant Grundler
                   ` (16 preceding siblings ...)
  2004-06-26  5:55 ` David Mosberger
@ 2004-06-29 15:09 ` Chen, Kenneth W
  2004-06-29 15:34 ` Chen, Kenneth W
                   ` (20 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: Chen, Kenneth W @ 2004-06-29 15:09 UTC (permalink / raw)
  To: linux-ia64

David Mosberger wrote on Friday, June 25, 2004 10:55 PM
>  >> True, but it's really ugly to add more special cases.  Wouldn't it be
>  >> better to explicitly pass the sp/bsp that need to be restored?
>  >> (Caveat: can't use the normal calling conventions there; perhaps r17
>  >> and r18 could be used?)
>
>  Ken> Yeah, but we have to update all the call sites, current efi_call_phys
>  Ken> and two other PAL static/stacked calls.
>
> True, but I think there are only 3 call-sites.  If it turns out to be
> _really_ ugly we can reconsider, but I think it might be a better
> choice in the long run.


How does this patch look?  It is a bit big. But what it does is really
simple: change 3 call sites to save/restore virtual address of sp and
ar.bsp/ar.bspstore.


===== arch/ia64/kernel/efi_stub.S 1.5 vs edited =====
--- 1.5/arch/ia64/kernel/efi_stub.S	Thu May 15 04:45:02 2003
+++ edited/arch/ia64/kernel/efi_stub.S	Mon Jun 28 21:55:00 2004
@@ -44,7 +44,7 @@

 GLOBAL_ENTRY(efi_call_phys)
 	.prologue ASM_UNW_PRLG_RP|ASM_UNW_PRLG_PFS, ASM_UNW_PRLG_GRSAVE(8)
-	alloc loc1=ar.pfs,8,5,7,0
+	alloc loc1=ar.pfs,8,7,7,0
 	ld8 r2=[in0],8			// load EFI function's entry point
 	mov loc0=rp
 	.body
@@ -70,9 +70,13 @@
 	mov out3=in4
 	mov out5=in6
 	mov out6=in7
+	mov loc5=r19
+	mov loc6=r20
 	br.call.sptk.many rp=b6		// call the EFI function
 .ret1:	mov ar.rsc=0			// put RSE in enforced lazy, LE mode
 	mov r16=loc3
+	mov r19=loc5
+	mov r20=loc6
 	br.call.sptk.many rp=ia64_switch_mode_virt // return to virtual mode
 .ret2:	mov ar.rsc=loc4			// restore RSE configuration
 	mov ar.pfs=loc1
===== arch/ia64/kernel/head.S 1.24 vs edited =====
--- 1.24/arch/ia64/kernel/head.S	Wed Jun 16 18:09:33 2004
+++ edited/arch/ia64/kernel/head.S	Mon Jun 28 21:55:01 2004
@@ -706,6 +706,9 @@
  *
  * Inputs:
  *	r16 = new psr to establish
+ * Output:
+ *	r19 = old virtual address of ar.bsp
+ *	r20 = old virtual address of sp
  *
  * Note: RSE must already be in enforced lazy mode
  */
@@ -724,12 +727,13 @@
 	mov cr.ipsr=r16			// set new PSR
 	add r3=1f-ia64_switch_mode_phys,r15

-	mov r17=ar.bsp
+	mov r19=ar.bsp
+	mov r20=sp
 	mov r14=rp			// get return address into a general register
 	;;

 	// going to physical mode, use tpa to translate virt->phys
-	tpa r17=r17
+	tpa r17=r19
 	tpa r3=r3
 	tpa sp=sp
 	tpa r14=r14
@@ -752,6 +756,8 @@
  *
  * Inputs:
  *	r16 = new psr to establish
+ *	r19 = new bspstore to establish
+ *	r20 = new sp to establish
  *
  * Note: RSE must already be in enforced lazy mode
  */
@@ -770,7 +776,6 @@
 	mov cr.ipsr=r16			// set new PSR
 	add r3=1f-ia64_switch_mode_virt,r15

-	mov r17=ar.bsp
 	mov r14=rp			// get return address into a general register
 	;;

@@ -781,15 +786,14 @@
 	movl r18=KERNEL_START
 	dep r3=0,r3,KERNEL_TR_PAGE_SHIFT,64-KERNEL_TR_PAGE_SHIFT
 	dep r14=0,r14,KERNEL_TR_PAGE_SHIFT,64-KERNEL_TR_PAGE_SHIFT
-	dep r17=-1,r17,61,3
-	dep sp=-1,sp,61,3
+	mov sp=r20
 	;;
 	or r3=r3,r18
 	or r14=r14,r18
 	;;

 	mov r18=ar.rnat			// save ar.rnat
-	mov ar.bspstore=r17		// this steps on ar.rnat
+	mov ar.bspstore=r19		// this steps on ar.rnat
 	mov cr.iip=r3
 	mov cr.ifs=r0
 	;;
===== arch/ia64/kernel/pal.S 1.7 vs edited =====
--- 1.7/arch/ia64/kernel/pal.S	Thu May 15 04:45:02 2003
+++ edited/arch/ia64/kernel/pal.S	Mon Jun 28 21:55:03 2004
@@ -176,10 +176,14 @@
 	andcm r16=loc3,r16		// removes bits to clear from psr
 	br.call.sptk.many rp=ia64_switch_mode_phys
 .ret1:	mov rp = r8			// install return address (physical)
+	mov loc5 = r19
+	mov loc6 = r20
 	br.cond.sptk.many b7
 1:
 	mov ar.rsc=0			// put RSE in enforced lazy, LE mode
 	mov r16=loc3			// r16= original psr
+	mov r19=loc5
+	mov r20=loc6
 	br.call.sptk.many rp=ia64_switch_mode_virt // return to virtual mode
 .ret2:
 	mov psr.l = loc3		// restore init PSR
@@ -201,7 +205,7 @@
  */
 GLOBAL_ENTRY(ia64_pal_call_phys_stacked)
 	.prologue ASM_UNW_PRLG_RP|ASM_UNW_PRLG_PFS, ASM_UNW_PRLG_GRSAVE(5)
-	alloc	loc1 = ar.pfs,5,5,86,0
+	alloc	loc1 = ar.pfs,5,7,4,0
 	movl	loc2 = pal_entry_point
 1:	{
 	  mov r28  = in0		// copy procedure index
@@ -230,10 +234,14 @@
 	andcm r16=loc3,r16		// removes bits to clear from psr
 	br.call.sptk.many rp=ia64_switch_mode_phys
 .ret6:
+	mov loc5 = r19
+	mov loc6 = r20
 	br.call.sptk.many rp=b7		// now make the call
 .ret7:
 	mov ar.rsc=0			// put RSE in enforced lazy, LE mode
 	mov r16=loc3			// r16= original psr
+	mov r19=loc5
+	mov r20=loc6
 	br.call.sptk.many rp=ia64_switch_mode_virt	// return to virtual mode

 .ret8:	mov psr.l  = loc3		// restore init PSR



^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: BUG 2.6.7 hangs on boot (rx2600)
  2004-06-22  6:15 BUG 2.6.7 hangs on boot (rx2600) Grant Grundler
                   ` (17 preceding siblings ...)
  2004-06-29 15:09 ` Chen, Kenneth W
@ 2004-06-29 15:34 ` Chen, Kenneth W
  2004-06-29 17:32 ` Jesse Barnes
                   ` (19 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: Chen, Kenneth W @ 2004-06-29 15:34 UTC (permalink / raw)
  To: linux-ia64

Chen, Kenneth W wrote on Tuesday, June 29, 2004 8:09 AM
> David Mosberger wrote on Friday, June 25, 2004 10:55 PM
> >  >> True, but it's really ugly to add more special cases.  Wouldn't it be
> >  >> better to explicitly pass the sp/bsp that need to be restored?
> >  >> (Caveat: can't use the normal calling conventions there; perhaps r17
> >  >> and r18 could be used?)
> >
> >  Ken> Yeah, but we have to update all the call sites, current efi_call_phys
> >  Ken> and two other PAL static/stacked calls.
> >
> > True, but I think there are only 3 call-sites.  If it turns out to be
> > _really_ ugly we can reconsider, but I think it might be a better
> > choice in the long run.
>
> How does this patch look?  It is a bit big. But what it does is really
> simple: change 3 call sites to save/restore virtual address of sp and
> ar.bsp/ar.bspstore.


To follow up the other bug in head.S, here is the fix.

---------
For BP, we are not installing any region 7 DTLB mapping for init_task.
However, kr(stack) is being initialized to the legal granule in which
the kernel resides.  If the first task switched to from init_task
happens to have its task struct in that very same granule, its stack
will not be mapped by any DTLB.  This patch properly initializes
kr(stack) for BP.

===== arch/ia64/kernel/head.S 1.25 vs edited =====
--- 1.25/arch/ia64/kernel/head.S	Mon Jun 28 22:07:49 2004
+++ edited/arch/ia64/kernel/head.S	Mon Jun 28 22:11:32 2004
@@ -154,8 +154,7 @@
 #endif
 	;;
 	tpa r3=r2		// r3 = phys addr of task struct
-	;;
-	shr.u r16=r3,IA64_GRANULE_SHIFT
+	mov r16=-1
 (isBP)	br.cond.dpnt .load_current // BP stack is on region 5 --- no need to map it

 	// load mapping for stack (virtaddr in r2, physaddr in r3)
@@ -169,6 +168,7 @@
 	dep r2=-1,r3,61,3	// IMVA of task
 	;;
 	mov r17=rr[r2]
+	shr.u r16=r3,IA64_GRANULE_SHIFT
 	;;
 	dep r17=0,r17,8,24
 	;;
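
A hedged C sketch of the check this patch is protecting. It assumes 16MB
granules (IA64_GRANULE_SHIFT is a config choice in the real kernel), and
the function names are made up for illustration; the comparison itself
mirrors the "cmp.eq p7,p6=r26,r27" test in ia64_switch_to.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed 16MB granules; the real IA64_GRANULE_SHIFT is configurable. */
#define GRANULE_SHIFT 24

/* Granule number of a physical address ("shr.u r16=r3,IA64_GRANULE_SHIFT"). */
static uint64_t granule_of(uint64_t paddr)
{
	return paddr >> GRANULE_SHIFT;
}

/*
 * The stack mapping is skipped when the next task's granule equals the
 * value cached in kr(stack), so that cached value must only ever name a
 * granule that really is mapped.
 */
static bool needs_stack_mapping(uint64_t kr_stack, uint64_t next_task_paddr)
{
	return granule_of(next_task_paddr) != kr_stack;
}
```

If kr(stack) holds init_task's granule even though nothing was mapped for
it, a next task in the same granule is wrongly treated as already mapped.
Initializing kr(stack) to -1 ("mov r16=-1") cannot match any real granule
number, so the first switch away from init_task always installs a mapping.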



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: BUG 2.6.7 hangs on boot (rx2600)
  2004-06-22  6:15 BUG 2.6.7 hangs on boot (rx2600) Grant Grundler
                   ` (18 preceding siblings ...)
  2004-06-29 15:34 ` Chen, Kenneth W
@ 2004-06-29 17:32 ` Jesse Barnes
  2004-06-29 17:40 ` Chen, Kenneth W
                   ` (18 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: Jesse Barnes @ 2004-06-29 17:32 UTC (permalink / raw)
  To: linux-ia64

[-- Attachment #1: Type: text/plain, Size: 1762 bytes --]

On Tuesday, June 29, 2004 8:34 am, Chen, Kenneth W wrote:
> To follow up the other bug in head.S, here is the fix.
>
> ---------
> For BP, we are not installing any region 7 DTLB mapping for init_task.
> However, kr(stack) is being initialized to the legal granule in which
> the kernel resides.  If the first task switched to from init_task
> happens to have its task struct in that very same granule, its stack
> will not be mapped by any DTLB.  This patch properly initializes
> kr(stack) for BP.

I tried both of these on a machine that doesn't have memory at the stock 
kernel load address, and it failed very early on.  However, it works with the 
attached patch.

Linux version 2.6.7 (jbarnes@tomahawk.engr.sgi.com) (gcc version 3.2.3 20030502 (Red Hat Linux 3.2.3-24)) #3 SMP Tue Jun 29 10:01:08 PDT 2004
EFI v1.02 by SGI: SALsystab=0x230047e5150 ACPI 2.0=0x230047e5920
ACPI: RSDP (v002    SGI                                    ) @ 0x00000230047e5920
ACPI: XSDT (v001    SGI  XSDTSN2 0x00010001  0x00000001) @ 0x00000230047e5960
ACPI: MADT (v001    SGI  APICSN2 0x00010001  0x00000001) @ 0x00000230047e59c0
ACPI: SRAT (v001    SGI  SRATSN2 0x00010001  0x00000001) @ 0x00000230047e5a30
ACPI: SLIT (v001    SGI  SLITSN2 0x00010001  0x00000001) @ 0x00000230047e5b00
ACPI: FADT (v003    SGI  FACPSN2 0x00030001  0x00000001) @ 0x00000230047e5c00
ACPI: DSDT (v001    SGI  DSDTSN2 0x00010001  0x00000001) @ 0x00000230047e5bc0
ACPI: DSDT (v001    SGI  DSDTSN2 0x00010001  0x00000001) @ 0x0000000000000000
ACPI: SRAT revision 0
ACPI: SLIT localities 6x6
Number of logical nodes in system = 2
Number of memory chunks in system = 2
SAL 2.9: SGI SN2 version 3.32
SAL Platform features: ITC_Drift
SAL: AP wakeup using external interrupt vector 0x12

Jesse

[-- Attachment #2: init-task-region-5-revert.patch --]
[-- Type: text/plain, Size: 1559 bytes --]

===== arch/ia64/kernel/entry.S 1.61 vs edited =====
--- 1.61/arch/ia64/kernel/entry.S	Wed Jun 16 18:09:33 2004
+++ edited/arch/ia64/kernel/entry.S	Thu Jun 24 12:12:01 2004
@@ -179,19 +179,17 @@
 	.body
 
 	adds r22=IA64_TASK_THREAD_KSP_OFFSET,r13
-	movl r25=init_task
 	mov r27=IA64_KR(CURRENT_STACK)
-	adds r21=IA64_TASK_THREAD_KSP_OFFSET,in0
 	dep r20=0,in0,61,3		// physical address of "current"
 	;;
 	st8 [r22]=sp			// save kernel stack pointer of old task
 	shr.u r26=r20,IA64_GRANULE_SHIFT
-	cmp.eq p7,p6=r25,in0
+	adds r21=IA64_TASK_THREAD_KSP_OFFSET,in0
 	;;
 	/*
 	 * If we've already mapped this task's page, we can skip doing it again.
 	 */
-(p6)	cmp.eq p7,p6=r26,r27
+	cmp.eq p7,p6=r26,r27
 (p6)	br.cond.dpnt .map
 	;;
 .done:
===== arch/ia64/kernel/head.S 1.24 vs edited =====
--- 1.24/arch/ia64/kernel/head.S	Wed Jun 16 18:09:33 2004
+++ edited/arch/ia64/kernel/head.S	Thu Jun 24 12:12:02 2004
@@ -154,10 +154,6 @@
 #endif
 	;;
 	tpa r3=r2		// r3 == phys addr of task struct
-	;;
-	shr.u r16=r3,IA64_GRANULE_SHIFT
-(isBP)	br.cond.dpnt .load_current // BP stack is on region 5 --- no need to map it
-
 	// load mapping for stack (virtaddr in r2, physaddr in r3)
 	rsm psr.ic
 	movl r17=PAGE_KERNEL
@@ -169,6 +165,7 @@
 	dep r2=-1,r3,61,3	// IMVA of task
 	;;
 	mov r17=rr[r2]
+	shr.u r16=r3,IA64_GRANULE_SHIFT
 	;;
 	dep r17=0,r17,8,24
 	;;
@@ -183,7 +180,6 @@
 	srlz.d
   	;;
 
-.load_current:
 	// load the "current" pointer (r13) and ar.k6 with the current task
 	mov IA64_KR(CURRENT)=r2		// virtual address
 	mov IA64_KR(CURRENT_STACK)=r16

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: BUG 2.6.7 hangs on boot (rx2600)
  2004-06-22  6:15 BUG 2.6.7 hangs on boot (rx2600) Grant Grundler
                   ` (19 preceding siblings ...)
  2004-06-29 17:32 ` Jesse Barnes
@ 2004-06-29 17:40 ` Chen, Kenneth W
  2004-06-29 17:45 ` Jesse Barnes
                   ` (17 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: Chen, Kenneth W @ 2004-06-29 17:40 UTC (permalink / raw)
  To: linux-ia64

Jesse Barnes wrote on Tuesday, June 29, 2004 10:32 AM
> On Tuesday, June 29, 2004 8:34 am, Chen, Kenneth W wrote:
> > To follow up the other bug in head.S, here is the fix.
> >
> > ---------
> > For BP, we are not installing any region 7 DTLB mapping for init_task.
> > However, kr(stack) is being initialized to the legal granule in which
> > the kernel resides.  If the first task switched to from init_task
> > happens to have its task struct in that very same granule, its stack
> > will not be mapped by any DTLB.  This patch properly initializes
> > kr(stack) for BP.
>
> I tried both of these on a machine that doesn't have memory at the stock
> kernel load address, and it failed very early on.  However, it works with the
> attached patch.

Let me confirm what I understand:
David's bk tree doesn't boot.
David's bk tree + 2 patches I posted this morning doesn't boot.

Is that correct?



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: BUG 2.6.7 hangs on boot (rx2600)
  2004-06-22  6:15 BUG 2.6.7 hangs on boot (rx2600) Grant Grundler
                   ` (20 preceding siblings ...)
  2004-06-29 17:40 ` Chen, Kenneth W
@ 2004-06-29 17:45 ` Jesse Barnes
  2004-06-29 18:03 ` Chen, Kenneth W
                   ` (16 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: Jesse Barnes @ 2004-06-29 17:45 UTC (permalink / raw)
  To: linux-ia64

On Tuesday, June 29, 2004 10:40 am, Chen, Kenneth W wrote:
> Jesse Barnes wrote on Tuesday, June 29, 2004 10:32 AM
>
> > On Tuesday, June 29, 2004 8:34 am, Chen, Kenneth W wrote:
> > > To follow up the other bug in head.S, here is the fix.
> > >
> > > ---------
> > > For BP, we are not installing any region 7 DTLB mapping for init_task.
> > > However, kr(stack) is being initialized to the legal granule in which
> > > the kernel resides.  If the first task switched to from init_task
> > > happens to have its task struct in that very same granule, its stack
> > > will not be mapped by any DTLB.  This patch properly initializes
> > > kr(stack) for BP.
> >
> > I tried both of these on a machine that doesn't have memory at the stock
> > kernel load address, and it failed very early on.  However, it works with
> > the attached patch.
>
> Let me confirm what I understand:
> David's bk tree doesn't boot.
> David's bk tree + 2 patches I posted this morning doesn't boot.
>
> Is that correct?

Correct.  David's tree hangs at a later point though, after printing 
"Console: ... 80x25".

Jesse


^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: BUG 2.6.7 hangs on boot (rx2600)
  2004-06-22  6:15 BUG 2.6.7 hangs on boot (rx2600) Grant Grundler
                   ` (21 preceding siblings ...)
  2004-06-29 17:45 ` Jesse Barnes
@ 2004-06-29 18:03 ` Chen, Kenneth W
  2004-06-29 18:13 ` Jesse Barnes
                   ` (15 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: Chen, Kenneth W @ 2004-06-29 18:03 UTC (permalink / raw)
  To: linux-ia64

Jesse Barnes wrote on Tuesday, June 29, 2004 10:46 AM
> > >
> > > I tried both of these on a machine that doesn't have memory at the stock
> > > kernel load address, and it failed very early on.  However, it works with
> > > the attached patch.
> >
> > Let me confirm what I understand:
> > David's bk tree doesn't boot.
> > David's bk tree + 2 patches I posted this morning doesn't boot.
> >
> > Is that correct?
>
> Correct.  David's tree hangs at a later point though, after printing
> "Console: ... 80x25".


I presume David's bk tree plus this patch also hung on your machine?
http://www.gelato.unsw.edu.au/linux-ia64/0406/10162.html

- Ken



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: BUG 2.6.7 hangs on boot (rx2600)
  2004-06-22  6:15 BUG 2.6.7 hangs on boot (rx2600) Grant Grundler
                   ` (22 preceding siblings ...)
  2004-06-29 18:03 ` Chen, Kenneth W
@ 2004-06-29 18:13 ` Jesse Barnes
  2004-06-29 18:19 ` Chen, Kenneth W
                   ` (14 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: Jesse Barnes @ 2004-06-29 18:13 UTC (permalink / raw)
  To: linux-ia64

On Tuesday, June 29, 2004 11:03 am, Chen, Kenneth W wrote:
> Jesse Barnes wrote on Tuesday, June 29, 2004 10:46 AM
>
> > > > I tried both of these on a machine that doesn't have memory at the
> > > > stock kernel load address, and it failed very early on.  However, it
> > > > works with the attached patch.
> > >
> > > Let me confirm what I understand:
> > > David's bk tree doesn't boot.
> > > David's bk tree + 2 patches I posted this morning doesn't boot.
> > >
> > > Is that correct?
> >
> > Correct.  David's tree hangs at a later point though, after printing
> > "Console: ... 80x25".
>
> I presume David's bk tree plus this patch also hung on your machine?
> http://www.gelato.unsw.edu.au/linux-ia64/0406/10162.html

Yep.  It gets as far as David's tree plus your patches from this morning.

Jesse

Linux version 2.6.7 (jbarnes@tomahawk.engr.sgi.com) (gcc version 3.3.2) #2 SMP Tue Jun 29 11:10:12 PDT 2004
EFI v1.02 by SGI: SALsystab=0x230047e5150 ACPI 2.0=0x230047e5920
ACPI: RSDP (v002    SGI                                    ) @ 0x00000230047e5920
ACPI: XSDT (v001    SGI  XSDTSN2 0x00010001  0x00000001) @ 0x00000230047e5960
ACPI: MADT (v001    SGI  APICSN2 0x00010001  0x00000001) @ 0x00000230047e59c0
ACPI: SRAT (v001    SGI  SRATSN2 0x00010001  0x00000001) @ 0x00000230047e5a30
ACPI: SLIT (v001    SGI  SLITSN2 0x00010001  0x00000001) @ 0x00000230047e5b00
ACPI: FADT (v003    SGI  FACPSN2 0x00030001  0x00000001) @ 0x00000230047e5c00
ACPI: DSDT (v001    SGI  DSDTSN2 0x00010001  0x00000001) @ 0x00000230047e5bc0
ACPI: DSDT (v001    SGI  DSDTSN2 0x00010001  0x00000001) @ 0x0000000000000000
ACPI: SRAT revision 0
ACPI: SLIT localities 6x6
Number of logical nodes in system = 2
Number of memory chunks in system = 2
SAL 2.9: SGI SN2 version 3.32
SAL Platform features: ITC_Drift
SAL: AP wakeup using external interrupt vector 0x12

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: BUG 2.6.7 hangs on boot (rx2600)
  2004-06-22  6:15 BUG 2.6.7 hangs on boot (rx2600) Grant Grundler
                   ` (23 preceding siblings ...)
  2004-06-29 18:13 ` Jesse Barnes
@ 2004-06-29 18:19 ` Chen, Kenneth W
  2004-06-29 21:19 ` David Mosberger
                   ` (13 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: Chen, Kenneth W @ 2004-06-29 18:19 UTC (permalink / raw)
  To: linux-ia64

Jesse Barnes wrote on Tuesday, June 29, 2004 11:14 AM
> > > > > I tried both of these on a machine that doesn't have memory at the
> > > > > stock kernel load address, and it failed very early on.  However, it
> > > > > works with the attached patch.
> > > >
> > > > Let me confirm what I understand:
> > > > David's bk tree doesn't boot.
> > > > David's bk tree + 2 patches I posted this morning doesn't boot.
> > > >
> > > > Is that correct?
> > >
> > > Correct.  David's tree hangs at a later point though, after printing
> > > "Console: ... 80x25".
> >
> > I presume David's bk tree plus this patch also hung on your machine?
> > http://www.gelato.unsw.edu.au/linux-ia64/0406/10162.html
>
> Yep.  It gets as far as David's tree plus your patches from this morning.

Looks like there are more places where kernel does "virt -> phys -> virt".
And conversion from phys to virt is setting 3 msb to 1.

- Ken



^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: BUG 2.6.7 hangs on boot (rx2600)
  2004-06-22  6:15 BUG 2.6.7 hangs on boot (rx2600) Grant Grundler
                   ` (24 preceding siblings ...)
  2004-06-29 18:19 ` Chen, Kenneth W
@ 2004-06-29 21:19 ` David Mosberger
  2004-06-29 23:18 ` David Mosberger
                   ` (12 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: David Mosberger @ 2004-06-29 21:19 UTC (permalink / raw)
  To: linux-ia64

>>>>> On Tue, 29 Jun 2004 08:09:00 -0700, "Chen, Kenneth W" <kenneth.w.chen@intel.com> said:

  >> True, but I think there are only 3 call-sites.  If it turns out to be
  >> _really_ ugly we can reconsider, but I think it might be a better
  >> choice in the long run.


  Ken> How does this patch look?  It is a bit big. But what it does is really
  Ken> simple: change 3 call sites to save/restore virtual address of sp and
  Ken> ar.bsp/ar.bspstore.

Looks fine to me.

Thanks,

	--david

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: BUG 2.6.7 hangs on boot (rx2600)
  2004-06-22  6:15 BUG 2.6.7 hangs on boot (rx2600) Grant Grundler
                   ` (25 preceding siblings ...)
  2004-06-29 21:19 ` David Mosberger
@ 2004-06-29 23:18 ` David Mosberger
  2004-06-30 16:17 ` Jesse Barnes
                   ` (11 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: David Mosberger @ 2004-06-29 23:18 UTC (permalink / raw)
  To: linux-ia64

>>>>> On Tue, 29 Jun 2004 11:19:34 -0700, "Chen, Kenneth W" <kenneth.w.chen@intel.com> said:

  Ken> Looks like there are more places where kernel does "virt ->
  Ken> phys -> virt".  And conversion from phys to virt is setting 3
  Ken> msb to 1.

Your fixes work fine on the machines I have tried so far (variety of
Itanium 2 boxen and a Big Sur Merced box).

Jesse, do you know if unapplying the "move current to region 5" patch
fixes your boot-problem?  For convenience, I attached the (original)
patch below.

	--david

# arch/ia64/kernel/head.S
#   2004/06/16 18:09:33-07:00 davidm@tiger.hpl.hp.com +5 -1
#   (_start): Initialize "current" pointer for init-task to be in
#   	region 5, not 7.
# 
# arch/ia64/kernel/entry.S
#   2004/06/16 18:09:33-07:00 davidm@tiger.hpl.hp.com +4 -2
#   (ia64_switch_to): Don't try to map "current"-pointers which are
#   	inside region 5.
# 
diff -Nru a/arch/ia64/kernel/entry.S b/arch/ia64/kernel/entry.S
--- a/arch/ia64/kernel/entry.S	Tue Jun 29 16:11:13 2004
+++ b/arch/ia64/kernel/entry.S	Tue Jun 29 16:11:13 2004
@@ -179,17 +179,19 @@
 	.body
 
 	adds r22=IA64_TASK_THREAD_KSP_OFFSET,r13
+	movl r25=init_task
 	mov r27=IA64_KR(CURRENT_STACK)
+	adds r21=IA64_TASK_THREAD_KSP_OFFSET,in0
 	dep r20=0,in0,61,3		// physical address of "current"
 	;;
 	st8 [r22]=sp			// save kernel stack pointer of old task
 	shr.u r26=r20,IA64_GRANULE_SHIFT
-	adds r21=IA64_TASK_THREAD_KSP_OFFSET,in0
+	cmp.eq p7,p6=r25,in0
 	;;
 	/*
 	 * If we've already mapped this task's page, we can skip doing it again.
 	 */
-	cmp.eq p7,p6=r26,r27
+(p6)	cmp.eq p7,p6=r26,r27
 (p6)	br.cond.dpnt .map
 	;;
 .done:
diff -Nru a/arch/ia64/kernel/head.S b/arch/ia64/kernel/head.S
--- a/arch/ia64/kernel/head.S	Tue Jun 29 16:11:13 2004
+++ b/arch/ia64/kernel/head.S	Tue Jun 29 16:11:13 2004
@@ -154,6 +154,10 @@
 #endif
 	;;
 	tpa r3=r2		// r3 = phys addr of task struct
+	;;
+	shr.u r16=r3,IA64_GRANULE_SHIFT
+(isBP)	br.cond.dpnt .load_current // BP stack is on region 5 --- no need to map it
+
 	// load mapping for stack (virtaddr in r2, physaddr in r3)
 	rsm psr.ic
 	movl r17=PAGE_KERNEL
@@ -165,7 +169,6 @@
 	dep r2=-1,r3,61,3	// IMVA of task
 	;;
 	mov r17=rr[r2]
-	shr.u r16=r3,IA64_GRANULE_SHIFT
 	;;
 	dep r17=0,r17,8,24
 	;;
@@ -180,6 +183,7 @@
 	srlz.d
   	;;
 
+.load_current:
 	// load the "current" pointer (r13) and ar.k6 with the current task
 	mov IA64_KR(CURRENT)=r2		// virtual address
 	mov IA64_KR(CURRENT_STACK)=r16
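
For readers less used to IA-64 predication, the entry.S hunk above can be
read roughly as the following C. This is only a sketch: struct task, the
function name, and the 16MB granule size are assumptions, not kernel
definitions.

```c
#include <stdbool.h>
#include <stdint.h>

#define GRANULE_SHIFT 24	/* assumed; configurable in the real kernel */

struct task { int dummy; };	/* hypothetical stand-in for task_struct */

static bool must_map_stack(const struct task *next,
			   const struct task *init_task,
			   uint64_t next_paddr, uint64_t kr_stack)
{
	/* cmp.eq p7,p6=r25,in0: p6 is set only when next is not init_task. */
	bool p6 = (next != init_task);

	/* (p6) cmp.eq p7,p6=r26,r27: runs only under p6, keeps p6 on mismatch. */
	if (p6)
		p6 = ((next_paddr >> GRANULE_SHIFT) != kr_stack);

	/* (p6) br.cond.dpnt .map */
	return p6;
}
```

So switching to init_task never takes the .map path, and any other task
takes it only when its granule differs from the one cached in kr(stack).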

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: BUG 2.6.7 hangs on boot (rx2600)
  2004-06-22  6:15 BUG 2.6.7 hangs on boot (rx2600) Grant Grundler
                   ` (26 preceding siblings ...)
  2004-06-29 23:18 ` David Mosberger
@ 2004-06-30 16:17 ` Jesse Barnes
  2004-06-30 18:11 ` Jesse Barnes
                   ` (10 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: Jesse Barnes @ 2004-06-30 16:17 UTC (permalink / raw)
  To: linux-ia64

On Tuesday, June 29, 2004 4:18 pm, David Mosberger wrote:
> >>>>> On Tue, 29 Jun 2004 11:19:34 -0700, "Chen, Kenneth W"
> >>>>> <kenneth.w.chen@intel.com> said:
>
>   Ken> Looks like there are more places where kernel does "virt ->
>   Ken> phys -> virt".  And conversion from phys to virt is setting 3
>   Ken> msb to 1.
>
> Your fixes work fine on the machines I have tried so far (variety of
> Itanium 2 boxen and a Big Sur Merced box).
>
> Jesse, do you know if unapplying the "move current to region 5" patch
> fixes your boot-problem?  For convenience, I attached the (original)
> patch below.

Yes, reverting that patch does fix the hang I'm seeing.

Jesse

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: BUG 2.6.7 hangs on boot (rx2600)
  2004-06-22  6:15 BUG 2.6.7 hangs on boot (rx2600) Grant Grundler
                   ` (27 preceding siblings ...)
  2004-06-30 16:17 ` Jesse Barnes
@ 2004-06-30 18:11 ` Jesse Barnes
  2004-07-06 23:43 ` David Mosberger
                   ` (9 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: Jesse Barnes @ 2004-06-30 18:11 UTC (permalink / raw)
  To: linux-ia64

On Tuesday, June 29, 2004 11:19 am, Chen, Kenneth W wrote:
> Jesse Barnes wrote on Tuesday, June 29, 2004 11:14 AM
>
> > > > > > I tried both of these on a machine that doesn't have memory at
> > > > > > the stock kernel load address, and it failed very early on. 
> > > > > > However, it works with the attached patch.
> > > > >
> > > > > Let me confirm what I understand:
> > > > > David's bk tree doesn't boot.
> > > > > David's bk tree + 2 patches I posted this morning doesn't boot.
> > > > >
> > > > > Is that correct?
> > > >
> > > > Correct.  David's tree hangs at a later point though, after printing
> > > > "Console: ... 80x25".
> > >
> > > I presume David's bk tree plus this patch also hung on your machine?
> > > http://www.gelato.unsw.edu.au/linux-ia64/0406/10162.html
> >
> > Yep.  It gets as far as David's tree plus your patches from this morning.
>
> Looks like there are more places where kernel does "virt -> phys -> virt".
> And conversion from phys to virt is setting 3 msb to 1.

Yep, that looks like a problem.  The kernel hangs right after the 
local_flush_tlb_all in ia64_tlb_init, and if I comment it out I get a "Unable 
to handle kernel paging request at virtual address a000003004289f70" which, 
on sn2, looks like a virt -> phys -> virt conversion (i.e. 0x3004289f70 is a 
valid physical address).

Jesse
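
A small C sketch of how the faulting address above decomposes. The struct
and function names are hypothetical; the only fact relied on is that the
top three bits of an IA-64 virtual address select the region.

```c
#include <stdint.h>

/* Hypothetical helper type, not a kernel structure. */
struct ia64_vaddr_parts {
	unsigned region;	/* bits 63..61 */
	uint64_t offset;	/* bits 60..0 */
};

/* Split a virtual address into its region number and in-region offset. */
static struct ia64_vaddr_parts split_vaddr(uint64_t vaddr)
{
	struct ia64_vaddr_parts p;

	p.region = (unsigned)(vaddr >> 61);
	p.offset = vaddr & ((UINT64_C(1) << 61) - 1);
	return p;
}
```

For 0xa000003004289f70 this yields region 5 and offset 0x3004289f70, i.e.
the physical address mentioned above with a region 5 prefix attached.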

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: BUG 2.6.7 hangs on boot (rx2600)
  2004-06-22  6:15 BUG 2.6.7 hangs on boot (rx2600) Grant Grundler
                   ` (28 preceding siblings ...)
  2004-06-30 18:11 ` Jesse Barnes
@ 2004-07-06 23:43 ` David Mosberger
  2004-07-06 23:45 ` David Mosberger
                   ` (8 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: David Mosberger @ 2004-07-06 23:43 UTC (permalink / raw)
  To: linux-ia64

>>>>> On Wed, 30 Jun 2004 09:17:18 -0700, Jesse Barnes <jbarnes@engr.sgi.com> said:

  Jesse> On Tuesday, June 29, 2004 4:18 pm, David Mosberger wrote:
  >> >>>>> On Tue, 29 Jun 2004 11:19:34 -0700, "Chen, Kenneth W"
  >> >>>>> <kenneth.w.chen@intel.com> said:

  Ken> Looks like there are more places where kernel does "virt ->
  Ken> phys -> virt".  And conversion from phys to virt is setting 3
  Ken> msb to 1.

  >> Your fixes work fine on the machines I have tried so far (variety of
  >> Itanium 2 boxen and a Big Sur Merced box).

  >> Jesse, do you know if unapplying the "move current to region 5" patch
  >> fixes your boot-problem?  For convenience, I attached the (original)
  >> patch below.

  Jesse> Yes, reverting that patch does fix the hang I'm seeing.

A quick grep shows that MINSTATE_END_SAVE_MIN_PHYS would also convert
the region 5 "current" into a region 7 address.  An MCA at the right
time might cause problems, I think.

	--david

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: BUG 2.6.7 hangs on boot (rx2600)
  2004-06-22  6:15 BUG 2.6.7 hangs on boot (rx2600) Grant Grundler
                   ` (29 preceding siblings ...)
  2004-07-06 23:43 ` David Mosberger
@ 2004-07-06 23:45 ` David Mosberger
  2004-07-07 16:20 ` Jesse Barnes
                   ` (7 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: David Mosberger @ 2004-07-06 23:45 UTC (permalink / raw)
  To: linux-ia64

>>>>> On Wed, 30 Jun 2004 11:11:04 -0700, Jesse Barnes <jbarnes@engr.sgi.com> said:

  >> Looks like there are more places where kernel does "virt -> phys -> virt".
  >> And conversion from phys to virt is setting 3 msb to 1.

  Jesse> Yep, that looks like a problem.  The kernel hangs right after
  Jesse> the local_flush_tlb_all in ia64_tlb_init, and if I comment it
  Jesse> out I get a "Unable to handle kernel paging request at
  Jesse> virtual address a000003004289f70" which, on sn2, looks like a
  Jesse> virt -> phys -> virt conversion (i.e. 0x3004289f70 is a valid
  Jesse> physical address).

I'm not sure this is related.  local_flush_tlb_all() simply flushes
the entire TLB; it doesn't touch the "current" pointer in any shape or
fashion.  Not doing that flush will leave the kernel with stale TLB
entries, so all bets are off.

	--david

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: BUG 2.6.7 hangs on boot (rx2600)
  2004-06-22  6:15 BUG 2.6.7 hangs on boot (rx2600) Grant Grundler
                   ` (30 preceding siblings ...)
  2004-07-06 23:45 ` David Mosberger
@ 2004-07-07 16:20 ` Jesse Barnes
  2004-07-07 23:56 ` Jesse Barnes
                   ` (6 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: Jesse Barnes @ 2004-07-07 16:20 UTC (permalink / raw)
  To: linux-ia64

On Tuesday, July 6, 2004 4:45 pm, David Mosberger wrote:
> >>>>> On Wed, 30 Jun 2004 11:11:04 -0700, Jesse Barnes
> >>>>> <jbarnes@engr.sgi.com> said:
>   >>
>   >> Looks like there are more places where kernel does "virt -> phys ->
>   >> virt". And conversion from phys to virt is setting 3 msb to 1.
>
>   Jesse> Yep, that looks like a problem.  The kernel hangs right after
>   Jesse> the local_flush_tlb_all in ia64_tlb_init, and if I comment it
>   Jesse> out I get a "Unable to handle kernel paging request at
>   Jesse> virtual address a000003004289f70" which, on sn2, looks like a
>   Jesse> virt -> phys -> virt conversion (i.e. 0x3004289f70 is a valid
>   Jesse> physical address).
>
> I'm not sure this is related.  local_flush_tlb_all() simply flushes
> the entire TLB; it doesn't touch the "current" pointer in any shape or
> fashion.  Not doing that flush will leave the kernel with stale TLB
> entries, so all bets are off.

Yeah, this is happening later, so it's probably a separate issue (and maybe 
not an issue at all).  Nonetheless, we still see a hang at boot with the 
latest BK tree.

Jesse

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: BUG 2.6.7 hangs on boot (rx2600)
  2004-06-22  6:15 BUG 2.6.7 hangs on boot (rx2600) Grant Grundler
                   ` (31 preceding siblings ...)
  2004-07-07 16:20 ` Jesse Barnes
@ 2004-07-07 23:56 ` Jesse Barnes
  2004-07-08 18:13 ` Jesse Barnes
                   ` (5 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: Jesse Barnes @ 2004-07-07 23:56 UTC (permalink / raw)
  To: linux-ia64

On Tuesday, July 6, 2004 4:43 pm, David Mosberger wrote:
> >>>>> On Wed, 30 Jun 2004 09:17:18 -0700, Jesse Barnes
> >>>>> <jbarnes@engr.sgi.com> said:
>
>   Jesse> On Tuesday, June 29, 2004 4:18 pm, David Mosberger wrote:
>   >> >>>>> On Tue, 29 Jun 2004 11:19:34 -0700, "Chen, Kenneth W"
>   >> >>>>> <kenneth.w.chen@intel.com> said:
>
>   Ken> Looks like there are more places where kernel does "virt ->
>   Ken> phys -> virt".  And conversion from phys to virt is setting 3
>   Ken> msb to 1.
>
>   >> Your fixes work fine on the machines I have tried so far (variety of
>   >> Itanium 2 boxen and a Big Sur Merced box).
>   >>
>   >> Jesse, do you know if unapplying the "move current to region 5" patch
>   >> fixes your boot-problem?  For convenience, I attached the (original)
>   >> patch below.
>
>   Jesse> Yes, reverting that patch does fix the hang I'm seeing.
>
> A quick grep shows that MINSTATE_END_SAVE_MIN_PHYS would also convert
> the region 5 "current" into a region 7 address.  An MCA at the right
> time might cause problems, I think.

If you hack your elilo to load the kernel higher in memory do you see the same 
thing?

Jesse

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: BUG 2.6.7 hangs on boot (rx2600)
  2004-06-22  6:15 BUG 2.6.7 hangs on boot (rx2600) Grant Grundler
                   ` (32 preceding siblings ...)
  2004-07-07 23:56 ` Jesse Barnes
@ 2004-07-08 18:13 ` Jesse Barnes
  2004-07-08 18:31 ` Chen, Kenneth W
                   ` (4 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: Jesse Barnes @ 2004-07-08 18:13 UTC (permalink / raw)
  To: linux-ia64

On Tuesday, June 29, 2004 4:18 pm, David Mosberger wrote:
> >>>>> On Tue, 29 Jun 2004 11:19:34 -0700, "Chen, Kenneth W"
> >>>>> <kenneth.w.chen@intel.com> said:
>
>   Ken> Looks like there are more places where kernel does "virt ->
>   Ken> phys -> virt".  And conversion from phys to virt is setting 3
>   Ken> msb to 1.
>
> Your fixes work fine on the machines I have tried so far (variety of
> Itanium 2 boxen and a Big Sur Merced box).
>
> Jesse, do you know if unapplying the "move current to region 5" patch
> fixes your boot-problem?  For convenience, I attached the (original)
> patch below.

Applying this small patch to head.S gets me as far as Grant's original report.

-(isBP) br.cond.dpnt .load_current
+//(isBP)       br.cond.dpnt .load_current

This lets me get to:

PID hash table entries: 4096 (order 12: 65536 bytes)
CPU 0: base freq=200.000MHz, ITC ratio=10/2, ITC freq=1000.000MHz+/--1ppm
Console: colour dummy device 80x25

And like I've already mentioned, if I revert the whole thing the kernel boots 
fine.  Ugg, I'm really starting to dislike the move of the init_task...

Jesse

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: BUG 2.6.7 hangs on boot (rx2600)
  2004-06-22  6:15 BUG 2.6.7 hangs on boot (rx2600) Grant Grundler
                   ` (33 preceding siblings ...)
  2004-07-08 18:13 ` Jesse Barnes
@ 2004-07-08 18:31 ` Chen, Kenneth W
  2004-07-08 18:39 ` Jesse Barnes
                   ` (3 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: Chen, Kenneth W @ 2004-07-08 18:31 UTC (permalink / raw)
  To: linux-ia64

>>>> Jesse Barnes wrote on Thursday, July 08, 2004 11:13 AM
> >
> > Jesse, do you know if unapplying the "move current to region 5" patch
> > fixes your boot-problem?  For convenience, I attached the (original)
> > patch below.
>
> Applying this small patch to head.S gets me as far as Grant's original report.
>
> -(isBP) br.cond.dpnt .load_current
> +//(isBP)       br.cond.dpnt .load_current
>
> This lets me get to:
>
> PID hash table entries: 4096 (order 12: 65536 bytes)
> CPU 0: base freq=200.000MHz, ITC ratio=10/2, ITC freq=1000.000MHz+/--1ppm
> Console: colour dummy device 80x25
>
> And like I've already mentioned, if I revert the whole thing the kernel boots
> fine.  Ugg, I'm really starting to dislike the move of the init_task...

It is something related to stack pointer (r12).  If sp stays in region 7
(providing there is a corresponding dtr mapping), kernel boots fine.  David
mentioned that MCA code also has tpa thingy in it that needs to be converted.
One other thing I'm getting frustrated with is that I keep receiving an MCA
while doing a ptc.e (only on SGI Altix).  I have no idea why.

- Ken



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: BUG 2.6.7 hangs on boot (rx2600)
  2004-06-22  6:15 BUG 2.6.7 hangs on boot (rx2600) Grant Grundler
                   ` (34 preceding siblings ...)
  2004-07-08 18:31 ` Chen, Kenneth W
@ 2004-07-08 18:39 ` Jesse Barnes
  2004-07-08 18:43 ` David Mosberger
                   ` (2 subsequent siblings)
  38 siblings, 0 replies; 40+ messages in thread
From: Jesse Barnes @ 2004-07-08 18:39 UTC (permalink / raw)
  To: linux-ia64

On Thursday, July 8, 2004 11:31 am, Chen, Kenneth W wrote:
> >>>> Jesse Barnes wrote on Thursday, July 08, 2004 11:13 AM
> > >
> > > Jesse, do you know if unapplying the "move current to region 5" patch
> > > fixes your boot-problem?  For convenience, I attached the (original)
> > > patch below.
> >
> > Applying this small patch to head.S gets me as far as Grant's original
> > report.
> >
> > -(isBP) br.cond.dpnt .load_current
> > +//(isBP)       br.cond.dpnt .load_current
> >
> > This lets me get to:
> >
> > PID hash table entries: 4096 (order 12: 65536 bytes)
> > CPU 0: base freq=200.000MHz, ITC ratio=10/2, ITC freq=1000.000MHz+/--1ppm
> > Console: colour dummy device 80x25
> >
> > And like I've already mentioned, if I revert the whole thing the kernel
> > boots fine.  Ugg, I'm really starting to dislike the move of the
> > init_task...
>
> It is something related to stack pointer (r12).  If sp stays in region 7
> (providing there is a corresponding dtr mapping), kernel boots fine.  David
> mentioned that MCA code also has tpa thingy in it that needs to be
> converted. One other thing I'm getting frustrated with is that I keep
> receiving an MCA while doing a ptc.e (only on SGI Altix).  I have no idea why.

I get different behavior depending on whether I'm on a partitioned system or 
not: either a machine check caused by a null pointer dereference in the 
unwind code on a non-partitioned machine, or a hang on a partitioned machine.
  
The difference could be in leftover tc entries from the PROM.  They'll get 
purged when the first ptc.e happens, causing a fault on the very next 
instruction or data reference following the purge.  Just a guess.

Jesse

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: BUG 2.6.7 hangs on boot (rx2600)
  2004-06-22  6:15 BUG 2.6.7 hangs on boot (rx2600) Grant Grundler
                   ` (35 preceding siblings ...)
  2004-07-08 18:39 ` Jesse Barnes
@ 2004-07-08 18:43 ` David Mosberger
  2004-07-08 18:46 ` Jesse Barnes
  2004-07-12 17:59 ` Jesse Barnes
  38 siblings, 0 replies; 40+ messages in thread
From: David Mosberger @ 2004-07-08 18:43 UTC (permalink / raw)
  To: linux-ia64

>>>>> On Thu, 8 Jul 2004 11:13:20 -0700, Jesse Barnes <jbarnes@engr.sgi.com> said:

  Jesse> And like I've already mentioned, if I revert the whole thing
  Jesse> the kernel boots fine.  Ugg, I'm really starting to dislike
  Jesse> the move of the init_task...

I suppose you would rather have me undo the virtual mapping of the
kernel?  That's the only "easy" solution I can think of, but that
would be a definite step back in other ways.

	--david

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: BUG 2.6.7 hangs on boot (rx2600)
  2004-06-22  6:15 BUG 2.6.7 hangs on boot (rx2600) Grant Grundler
                   ` (36 preceding siblings ...)
  2004-07-08 18:43 ` David Mosberger
@ 2004-07-08 18:46 ` Jesse Barnes
  2004-07-12 17:59 ` Jesse Barnes
  38 siblings, 0 replies; 40+ messages in thread
From: Jesse Barnes @ 2004-07-08 18:46 UTC (permalink / raw)
  To: linux-ia64

On Thursday, July 8, 2004 11:43 am, David Mosberger wrote:
> >>>>> On Thu, 8 Jul 2004 11:13:20 -0700, Jesse Barnes
> >>>>> <jbarnes@engr.sgi.com> said:
>
>   Jesse> And like I've already mentioned, if I revert the whole thing
>   Jesse> the kernel boots fine.  Ugg, I'm really starting to dislike
>   Jesse> the move of the init_task...
>
> I suppose you would rather have me undo the virtual mapping of the
> kernel?  That's the only "easy" solution I can think of, but that
> would be a definite step back in other ways.

Yeah, that would be much worse.  This is just an ugly problem, that's all.

Jesse

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: BUG 2.6.7 hangs on boot (rx2600)
  2004-06-22  6:15 BUG 2.6.7 hangs on boot (rx2600) Grant Grundler
                   ` (37 preceding siblings ...)
  2004-07-08 18:46 ` Jesse Barnes
@ 2004-07-12 17:59 ` Jesse Barnes
  38 siblings, 0 replies; 40+ messages in thread
From: Jesse Barnes @ 2004-07-12 17:59 UTC (permalink / raw)
  To: linux-ia64

On Thursday, July 8, 2004 11:31 am, Chen, Kenneth W wrote:
> It is something related to stack pointer (r12).  If sp stays in region 7
> (providing there is a corresponding dtr mapping), kernel boots fine.

That's what I'm seeing too.  Any ideas where to look for the problem?  I'm 
happy to test any patches you might have :)  I'm not really sure what's going 
on, but it looks like it might be:


  _start -> start_kernel -> setup_arch -> ia64_mmu_init -> ptc.e
    -> page fault -> die

which would indicate (given that we know that the ptc.e arguments are valid) 
that we're purging a TC entry that keeps us going up to that point, and that 
we're not using the region 5 mapping like we should.  Does that sound right 
at all?
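
To make that guess concrete, here is a toy model of the sequence.  It is
only a sketch: it assumes that pinned TR entries survive the purge while
TC entries do not, and the ToyTLB class and the page labels are made up
for illustration, not taken from the kernel.

```python
class ToyTLB:
    """Toy TLB: pinned TR entries plus a purgeable translation cache (TC)."""

    def __init__(self):
        self.tr = set()   # pinned translation registers (assumed to survive a purge)
        self.tc = set()   # translation cache entries (dropped by a purge)

    def pin(self, page):
        self.tr.add(page)

    def cache(self, page):
        self.tc.add(page)

    def purge_all(self):
        # Stands in for the ptc.e loop: every TC entry goes away.
        self.tc.clear()

    def lookup(self, page):
        return page in self.tr or page in self.tc


tlb = ToyTLB()
tlb.pin("region5:kernel")        # hypothetical pinned kernel mapping
tlb.cache("region7:stack")       # hypothetical leftover entry, e.g. from firmware

hit_before = tlb.lookup("region7:stack")   # the leftover TC entry covers the access
tlb.purge_all()
hit_after = tlb.lookup("region7:stack")    # nothing covers it any more
kernel_still_mapped = tlb.lookup("region5:kernel")

print(hit_before, hit_after, kernel_still_mapped)
```

In this model the region 7 access only works until the purge, which
would match a fault on the first reference after the ptc.e if something
is still using a region 7 address at that point.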

> David 
> mentioned that MCA code also has tpa thingy in it that needs to be
> converted.

But that's not the cause of the problem we're seeing, right?

> One other thing I'm getting frustrated with is that I keep
> receiving an MCA while doing a ptc.e (only on SGI Altix).  I have no idea why.

You don't think this is a symptom of the region 7 vs. region 5 problem?  It 
looks to me like we're doing the ptc.e with the same arguments in either 
case...

Thanks,
Jesse

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2004-07-12 17:59 UTC | newest]

Thread overview: 40+ messages
2004-06-22  6:15 BUG 2.6.7 hangs on boot (rx2600) Grant Grundler
2004-06-22 13:50 ` Jesse Barnes
2004-06-22 14:51 ` Grant Grundler
2004-06-22 15:59 ` Bjorn Helgaas
2004-06-22 21:16 ` Grant Grundler
2004-06-22 21:23 ` Bjorn Helgaas
2004-06-22 22:28 ` Grant Grundler
2004-06-22 22:30 ` Grant Grundler
2004-06-22 22:38 ` Arun Sharma
2004-06-23 14:26 ` Tian, Kevin
2004-06-23 17:03 ` Jesse Barnes
2004-06-23 22:50 ` Bjorn Helgaas
2004-06-24  2:57 ` Tian, Kevin
2004-06-25  0:36 ` Chen, Kenneth W
2004-06-25 16:31 ` Chen, Kenneth W
2004-06-26  5:29 ` David Mosberger
2004-06-26  5:48 ` Chen, Kenneth W
2004-06-26  5:55 ` David Mosberger
2004-06-29 15:09 ` Chen, Kenneth W
2004-06-29 15:34 ` Chen, Kenneth W
2004-06-29 17:32 ` Jesse Barnes
2004-06-29 17:40 ` Chen, Kenneth W
2004-06-29 17:45 ` Jesse Barnes
2004-06-29 18:03 ` Chen, Kenneth W
2004-06-29 18:13 ` Jesse Barnes
2004-06-29 18:19 ` Chen, Kenneth W
2004-06-29 21:19 ` David Mosberger
2004-06-29 23:18 ` David Mosberger
2004-06-30 16:17 ` Jesse Barnes
2004-06-30 18:11 ` Jesse Barnes
2004-07-06 23:43 ` David Mosberger
2004-07-06 23:45 ` David Mosberger
2004-07-07 16:20 ` Jesse Barnes
2004-07-07 23:56 ` Jesse Barnes
2004-07-08 18:13 ` Jesse Barnes
2004-07-08 18:31 ` Chen, Kenneth W
2004-07-08 18:39 ` Jesse Barnes
2004-07-08 18:43 ` David Mosberger
2004-07-08 18:46 ` Jesse Barnes
2004-07-12 17:59 ` Jesse Barnes
