* [parisc-linux] rp2470 hang...getting closer
@ 2002-10-13 4:40 Grant Grundler
2002-10-13 13:56 ` Thibaut VARENE
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Grant Grundler @ 2002-10-13 4:40 UTC (permalink / raw)
To: parisc-linux
Hi friends of rp2470,
I'm getting closer to figuring out why rp2470 (a500-6x) hangs at boot time.
Here's the sequence I see so far:
o scsi_register_host() acquires io_request_lock (tpnt->use_new_eh_code is true)
o scsi_register_host() calls tpnt->detect(tpnt)
o detect() points to sym53c8xx_detect()
o sym53c8xx_detect() calls sym_attach()
o sym_attach() initializes s.timer to point at sym53c8xx_timer but
directly calls sym_timer() to kick off the self-arming timer.
timer will pop in 0.5 seconds.
o other interfaces are detected/initialized.
o timer_interrupt() calls timer_bh() and invokes sym53c8xx_timer().
o sym53c8xx_timer() attempts to reacquire the io_request_lock. checkmate.
The problem is we shouldn't see *any* interrupts, not even timers.
scsi_register_host() used spin_lock_irqsave() to acquire io_request_lock.
Either someone is clobbering the irqsave or it's being ignored.
That's the part that still needs to be worked out.
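A minimal userspace sketch of the sequence above, not the kernel code. It assumes a toy non-recursive lock and a toy interrupt-mask flag; every name in it (toy_cpu, boot_sequence_deadlocks, and so on) is made up for illustration. It only shows that the timer handler can spin on the lock if the mask set by the irqsave is not honored.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a non-recursive spinlock plus the local interrupt mask. */
struct toy_cpu {
    bool lock_held;     /* stands in for io_request_lock */
    bool irqs_masked;   /* stands in for the PSW I-bit being off */
};

/* Models spin_lock_irqsave(): mask interrupts, then take the lock. */
static void toy_lock_irqsave(struct toy_cpu *cpu)
{
    assert(!cpu->lock_held);
    cpu->irqs_masked = true;
    cpu->lock_held = true;
}

/*
 * Models the timer tick arriving during detect(). Returns true if the
 * handler would spin forever on a lock its own CPU already holds.
 */
static bool timer_tick_deadlocks(const struct toy_cpu *cpu)
{
    if (cpu->irqs_masked)
        return false;          /* tick stays pending, handler never runs */
    return cpu->lock_held;     /* handler runs and tries to re-take lock */
}

/*
 * Replays the boot sequence. mask_clobbered models "someone is
 * clobbering the irqsave or it's being ignored".
 */
static bool boot_sequence_deadlocks(bool mask_clobbered)
{
    struct toy_cpu cpu = { false, false };

    toy_lock_irqsave(&cpu);              /* scsi_register_host() */
    if (mask_clobbered)
        cpu.irqs_masked = false;         /* something re-enabled interrupts */
    return timer_tick_deadlocks(&cpu);   /* timer pops 0.5 s later */
}
```

With the mask intact, boot_sequence_deadlocks(false) reports no deadlock; only boot_sequence_deadlocks(true) does, which is why the interesting question is who re-enables interrupts.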
[ Does sym2 driver have a lock-related bug?
scsi_register_host() only acquires io_request_lock if
tpnt->use_new_eh_code is true. I wonder why sym2 driver advertises
"New EH Code" but then uses io_request_lock as well. ]
My A500-4X boots and has 4 built-in SCSI ports (1 53c896 and 1 53c876).
The A500-6X which hangs has an additional add-on 53c896.
I suspect the two extra SCSI ports make the difference.
I've posted a patch on
ftp://dsl2.external.hp.com/patches/rp2470-smp-hang.diff
Note it *removes* io_request_lock completely from sym2 driver.
The rp2470 boots with this patch though it doesn't fix the root cause.
The appended console output uses this patch *except* it calls
sym_timer() instead of add_timer() to trace original behavior.
Using a 10-second delay for the first timer pop also "works" (it boots)
with io_request_lock intact in the sym2 driver.
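A rough check of why the delay matters, using the "time" values that sym_attach prints in the appended console output (0x2e06 for sym0, 0x2e91 for sym5). Two assumptions that the output itself does not confirm: that those values are timer ticks (jiffies) and that there are 100 ticks per second. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define ASSUMED_HZ 100UL   /* assumption: 100 timer ticks per second */

/* "time" values printed for the first and last sym_attach() call. */
#define FIRST_ATTACH_TICKS 0x2e06UL
#define LAST_ATTACH_TICKS  0x2e91UL

/* Ticks between the first timer being armed and the last attach. */
static unsigned long detect_window_ticks(void)
{
    return LAST_ATTACH_TICKS - FIRST_ATTACH_TICKS;
}

/* True if a timer armed at the first attach pops before the last attach. */
static bool timer_pops_during_detect(unsigned long delay_ms)
{
    assert(delay_ms <= 1000000UL);   /* keep the multiplication in range */
    unsigned long delay_ticks = delay_ms * ASSUMED_HZ / 1000UL;
    return delay_ticks < detect_window_ticks();
}
```

Under those assumptions the window is 139 ticks (about 1.39 seconds), so a 0.5-second timer (50 ticks) pops while the lock is still held and a 10-second timer (1000 ticks) does not.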
I'll look at this some more if no one has ideas on why timer_bh()
is invoked (besides timer_interrupt).
grant
[ "ret 270725880" == scsi_register_module+50 == caller of io_r*_lock owner ]
grundler@gsyprf11:~$ dmesg
Linux version 2.4.19-pa21 (grundler@gsyprf11.external.hp.com) (gcc version 3.0) #17 SMP Sat Oct 12 18:02:20 PDT 2002
FP[0] enabled: Rev 1 Model 19
The 64-bit Kernel has started...
Initialized PDC Console for debugging.
Determining PDC firmware type: 64 bit PAT.
model 00005e20 00000491 00000000 00000001 73a33d02 100000f0 00000008 000000b2 000000b2
vers 00000203
CPUID vers 19 rev 8 (0x00000268)
capabilities 0x5
model 9000/800/A500-6X
Total Memory: 1024 Mb
pagetable_init
On node 0 totalpages: 262144
zone(0): 262144 pages.
zone(1): 0 pages.
zone(2): 0 pages.
Kernel command line: root=/dev/sdb3 HOME=/ console=ttyS0 TERM=linux palo_kernel=3/boot/vmlinux-2.4.19-pa21
Initialized PDC Console for debugging.
Calibrating delay loop... 1297.61 BogoMIPS
Memory: 1019504k available
Dentry cache hash table entries: 131072 (order: 9, 2097152 bytes)
Inode cache hash table entries: 65536 (order: 8, 1048576 bytes)
Mount-cache hash table entries: 16384 (order: 6, 262144 bytes)
Buffer-cache hash table entries: 65536 (order: 7, 524288 bytes)
Page-cache hash table entries: 262144 (order: 9, 2097152 bytes)
Searching for devices...
Found devices:
1. Crescendo 650 W2 (0) at 0xfffffffffffa0000 [160], versions 0x5e2, 0x0, 0x4
2. Crescendo 650 W2 (0) at 0xfffffffffffa2000 [162], versions 0x5e2, 0x0, 0x4
3. Astro BC Runway Port (12) at 0xfffffffffed00000 [0], versions 0x582, 0x0, 0xb
4. Elroy PCI Bridge (13) at 0xfffffffffed30000 [0/0], versions 0x782, 0x0, 0xa
5. Elroy PCI Bridge (13) at 0xfffffffffed34000 [0/2], versions 0x782, 0x0, 0xa
6. Elroy PCI Bridge (13) at 0xfffffffffed38000 [0/4], versions 0x782, 0x0, 0xa
7. Elroy PCI Bridge (13) at 0xfffffffffed3c000 [0/6], versions 0x782, 0x0, 0xa
8. Memory (1) at 0xfffffffffed08000 [8], versions 0x9b, 0x0, 0x9
CPU(s): 2 x PA8700 (PCX-W2) at 650.000000 MHz
SBA found Astro 2.1 at 0xfffffffffed00000
lba version TR4.0 (0x5) found at 0xfffffffffed30000
lba range[2] : ignoring GMMIO (0xfffffff804000000)
lba version TR4.0 (0x5) found at 0xfffffffffed34000
lba range[2] : ignoring GMMIO (0xfffffff904000000)
lba version TR4.0 (0x5) found at 0xfffffffffed38000
lba range[2] : ignoring GMMIO (0xfffffffa04000000)
lba version TR4.0 (0x5) found at 0xfffffffffed3c000
lba range[2] : ignoring GMMIO (0xfffffffb04000000)
POSIX conformance testing by UNIFIX
SMP: bootstrap CPU ID is 0
FP[1] enabled: Rev 1 Model 19
SMP: Total 2 of 2 processors activated (2595.23 BogoMIPS noticed).
Waiting on wait_init_idle (map = 0x2)
All processors have done init_idle
Linux NET4.0 for Linux 2.4
Based upon Swansea University Computer Society NET3.039
Initializing RT netlink socket
Soft power switch support not available.
Performance monitoring counters enabled for Crescendo 650 W2
Starting kswapd
Journalled Block Device driver loaded
pty: 256 Unix98 ptys configured
Serial driver version 5.05c (2001-07-08) with MANY_PORTS SHARE_IRQ SERIAL_PCI enabled
Redundant entry in serial pci_table. Please send the output of
lspci -vv, this message (103c,1048,103c,104b)
and the manufacturer and name of serial board or modem board
to serial-pci-info@lists.sourceforge.net.
ttyS00 at iomem 0xfffffffff8000000 (irq = 132) is a 16550A
ttyS01 at iomem 0xfffffffff8000008 (irq = 132) is a 16550A
ttyS02 at iomem 0xfffffffff8000010 (irq = 132) is a 16550A
ttyS03 at iomem 0xfffffffff8000038 (irq = 132) is a 16550A
Generic RTC Driver v1.02 05/27/1999 Sam Creasey (sammy@oh.verio.com)
RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize
loop: loaded (max 8 devices)
Linux Tulip driver version 0.9.15-pre11 (May 11, 2002)
tulip0: no phy info, aborting mtable build
tulip0: MII transceiver #1 config 1000 status 782d advertising 0061.
eth0: Digital DS21143 Tulip rev 65 at 0x80, 00:30:6E:26:61:A3, IRQ 128.
eth1: Digital DS21143 Tulip rev 65 at 0x30000, 00:10:83:F6:5D:F6, IRQ 320.
SCSI subsystem driver Revision: 1.00
scsi_register_host: acquiring iorl
sym.0.2.0: setting PCI_COMMAND_MASTER...
sym.0.2.1: setting PCI_COMMAND_MASTER...
sym.0.1.0: setting PCI_COMMAND_MASTER...
sym.32.0.0: setting PCI_COMMAND_MASTER...
sym.32.0.1: setting PCI_COMMAND_MASTER...
sym0: <875> rev 0x37 on pci bus 0 device 2 function 0 irq 130
sym0: No NVRAM, ID 7, Fast-20, SE, parity checking
sym0: SCSI BUS has been reset.
sym_attach: settle flag 1 time 0x2e06 iorl 0
sym1: <875> rev 0x37 on pci bus 0 device 2 function 1 irq 131
sym1: No NVRAM, ID 7, Fast-20, SE, parity checking
sym1: SCSI BUS has been reset.
sym_attach: settle flag 1 time 0x2e1d iorl 0
sym2: <896> rev 0x7 on pci bus 0 device 1 function 0 irq 129
sym2: No NVRAM, ID 7, Fast-40, LVD, parity checking
sym2: SCSI BUS has been reset.
sym_attach: settle flag 1 time 0x2e2c iorl 0
sym3: <896> rev 0x7 on pci bus 0 device 1 function 1 irq 130
sym53c8xx_timer(): iorl cpu 0 ret 270725880 line 1895
sym3: No NVRAM, ID 7, Fast-40, SE, parity checking
sym3: SCSI BUS has been reset.
sym_attach: settle flag 1 time 0x2e3c iorl 0
sym3: SCSI BUS mode change from SE to SE.
sym3: SCSI BUS has been reset.
sym53c8xx_timer(): iorl cpu 0 ret 270725880 line 1895
sym4: <896> rev 0x1 on pci bus 32 device 0 function 0 irq 256
sym53c8xx_timer(): iorl cpu 0 ret 270725880 line 1895
sym4: No NVRAM, ID 7, Fast-40, LVD, parity checking
sym4: SCSI BUS has been reset.
sym_attach: settle flag 1 time 0x2e62 iorl 0
sym53c8xx_timer(): iorl cpu 0 ret 270725880 line 1895
sym53c8xx_timer(): iorl cpu 0 ret 270725880 line 1895
sym53c8xx_timer(): iorl cpu 0 ret 270725880 line 1895
sym5: <896> rev 0x1 on pci bus 32 device 0 function 1 irq 257
sym5: No NVRAM, ID 7, Fast-40, LVD, parity checking
sym5: SCSI BUS has been reset.
sym_attach: settle flag 1 time 0x2e91 iorl 0
sym53c8xx_timer(): iorl cpu 0 ret 270725880 line 1895
sym53c8xx_timer(): iorl cpu 0 ret 270725880 line 1895
scsi_register_host: released iorl
scsi0 : sym-2.1.17a
scsi1 : sym-2.1.17a
scsi2 : sym-2.1.17a
scsi3 : sym-2.1.17a
scsi4 : sym-2.1.17a
scsi5 : sym-2.1.17a
Vendor: HP 73.4G Model: ST373405LC Rev: HP03
Type: Direct-Access ANSI SCSI revision: 02
sym1:15:0: tagged command queuing enabled, command queue depth 8.
Vendor: SEAGATE Model: ST39173LC Rev: 6381
Type: Direct-Access ANSI SCSI revision: 02
sym3:15:0: tagged command queuing enabled, command queue depth 8.
Vendor: HP Model: D5989B Rev: 1.02
Type: Processor ANSI SCSI revision: 02
Vendor: SEAGATE Model: ST34573WC Rev: HP05
Type: Direct-Access ANSI SCSI revision: 02
sym4:11: wide asynchronous.
sym4:11:0:M_REJECT to send for : 1-3-1-c-f.
Vendor: HP Model: 9.10GB B 80-1205 Rev:
Type: Direct-Access ANSI SCSI revision: 02
sym4:12: wide asynchronous.
sym4:12:0:M_REJECT to send for : 1-3-1-c-f.
Vendor: HP Model: 9.10GB B 80-1205 Rev:
Type: Direct-Access ANSI SCSI revision: 02
sym4:13: wide asynchronous.
sym4:13:0:M_REJECT to send for : 1-3-1-c-f.
Vendor: HP Model: 9.10GB B 80-1205 Rev:
Type: Direct-Access ANSI SCSI revision: 02
sym4:14: wide asynchronous.
sym4:14:0:M_REJECT to send for : 1-3-1-c-f.
Vendor: HP Model: 9.10GB B 80-1205 Rev:
Type: Direct-Access ANSI SCSI revision: 02
sym4:15: wide asynchronous.
sym4:15:0:M_REJECT to send for : 1-3-1-c-f.
Vendor: HP Model: 9.10GB B 80-1205 Rev:
Type: Direct-Access ANSI SCSI revision: 02
sym4:10:0: tagged command queuing enabled, command queue depth 8.
sym4:11:0: tagged command queuing enabled, command queue depth 8.
sym4:12:0: tagged command queuing enabled, command queue depth 8.
sym4:13:0: tagged command queuing enabled, command queue depth 8.
sym4:14:0: tagged command queuing enabled, command queue depth 8.
sym4:15:0: tagged command queuing enabled, command queue depth 8.
Attached scsi disk sda at scsi1, channel 0, id 15, lun 0
Attached scsi disk sdb at scsi3, channel 0, id 15, lun 0
Attached scsi disk sdc at scsi4, channel 0, id 10, lun 0
Attached scsi disk sdd at scsi4, channel 0, id 11, lun 0
Attached scsi disk sde at scsi4, channel 0, id 12, lun 0
Attached scsi disk sdf at scsi4, channel 0, id 13, lun 0
Attached scsi disk sdg at scsi4, channel 0, id 14, lun 0
Attached scsi disk sdh at scsi4, channel 0, id 15, lun 0
sym1:15: FAST-20 WIDE SCSI 40.0 MB/s ST (50.0 ns, offset 16)
SCSI device sda: 143374738 512-byte hdwr sectors (73408 MB)
Partition check:
sda: unknown partition table
sym3:15: FAST-20 WIDE SCSI 40.0 MB/s ST (50.0 ns, offset 15)
SCSI device sdb: 17781521 512-byte hdwr sectors (9104 MB)
sdb: sdb1 sdb2 sdb3 sdb4
sym4:10: wide asynchronous.
SCSI device sdc: 8388314 512-byte hdwr sectors (4295 MB)
sdc: unknown partition table
SCSI device sdd: 17773524 512-byte hdwr sectors (9100 MB)
sdd: unknown partition table
SCSI device sde: 17773524 512-byte hdwr sectors (9100 MB)
sde: unknown partition table
SCSI device sdf: 17773524 512-byte hdwr sectors (9100 MB)
sdf: sdf1 sdf2
SCSI device sdg: 17773524 512-byte hdwr sectors (9100 MB)
sdg:
SCSI device sdh: 17773524 512-byte hdwr sectors (9100 MB)
sdh: unknown partition table
Attached scsi generic sg2 at scsi4, channel 0, id 5, lun 0, type 3
md: linear personality registered as nr 1
md: raid0 personality registered as nr 2
md: raid1 personality registered as nr 3
md: raid5 personality registered as nr 4
raid5: measuring checksumming speed
8regs : 2983.600 MB/sec
8regs_prefetch: 2400.000 MB/sec
32regs : 2430.800 MB/sec
32regs_prefetch: 2285.600 MB/sec
raid5: using function: 8regs (2983.600 MB/sec)
md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
NET4: Linux TCP/IP 1.0 for NET4.0
IP Protocols: ICMP, UDP, TCP
IP: routing cache hash table of 1024 buckets, 64Kbytes
TCP: Hash tables configured (established 32768 bind 43690)
ip_conntrack (3982 buckets, 31856 max)
ip_tables: (C) 2000-2002 Netfilter core team
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
kjournald starting. Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
VFS: Mounted root (ext3 filesystem) readonly.
Freeing unused kernel memory: 224k freed
Adding Swap: 525288k swap-space (priority -1)
EXT3 FS 2.4-0.9.17, 10 Jan 2002 on sd(8,19), internal journal
LVM version 1.0.3(19/02/2002) module loaded
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.17, 10 Jan 2002 on sd(8,20), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.17, 10 Jan 2002 on sd(8,32), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.17, 10 Jan 2002 on sd(8,48), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.17, 10 Jan 2002 on sd(8,64), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.17, 10 Jan 2002 on sd(8,81), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.17, 10 Jan 2002 on sd(8,82), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.17, 10 Jan 2002 on sd(8,96), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
eth0: Setting full-duplex based on MII#1 link partner capability of 41e1.
sym53c8xx_timer(): iorl cpu 1 ret 270760060 line 700
sym53c8xx_timer(): iorl cpu 1 ret 270760060 line 700
grundler@gsyprf11:~$
* Re: [parisc-linux] rp2470 hang...getting closer
2002-10-13 4:40 [parisc-linux] rp2470 hang...getting closer Grant Grundler
@ 2002-10-13 13:56 ` Thibaut VARENE
2002-10-21 0:57 ` Grant Grundler
2002-10-21 14:59 ` Matthew Wilcox
2 siblings, 0 replies; 10+ messages in thread
From: Thibaut VARENE @ 2002-10-13 13:56 UTC (permalink / raw)
To: parisc-linux; +Cc: grundler
On Sat, 12 Oct 2002 22:40:33 -0600 (MDT)
"Grant Grundler" <grundler@dsl2.external.hp.com> wrote:
> My A500-4X boots and has 4 built-in SCSI ports (1 53c896 and 1 53c876).
> The A500-6X which hangs has an additional add-on 53c896.
> I suspect the two extra SCSI ports make the difference.
I don't think this is the only reason:
We have an A500-5X (4 built-in SCSI ports, 1 875 and 1 896) which hangs intermittently, from one boot to the next, when booting with the SYM2 driver, and hangs at every boot _when SYM2 debug options are set to maximum verbosity_.
I also remember we had some trouble with SYM2 on our J5k (2 ports: 896), though I have not checked this box with SYM2 since 2.4.18-pa21.
We're investigating that problem and will keep the mailing list informed.
HTH,
Thibaut VARENE
The PA/Linux ESIEE Team
http://pateam.esiee.fr/
* Re: [parisc-linux] rp2470 hang...getting closer
2002-10-13 4:40 [parisc-linux] rp2470 hang...getting closer Grant Grundler
2002-10-13 13:56 ` Thibaut VARENE
@ 2002-10-21 0:57 ` Grant Grundler
2002-10-21 2:21 ` Matthew Wilcox
2002-10-21 14:59 ` Matthew Wilcox
2 siblings, 1 reply; 10+ messages in thread
From: Grant Grundler @ 2002-10-21 0:57 UTC (permalink / raw)
To: parisc-linux
Grant Grundler wrote:
> I'm getting closer to figuring out why rp2470 (a500-6x) hangs at boot time.
I've had some ideas to think about/work on.
> Here's the sequence I see so far:
> o scsi_register_host() acquires io_request_lock (tpnt->use_new_eh_code is true)
> o scsi_register_host() calls tpnt->detect(tpnt)
> o detect() points to sym53c8xx_detect()
> o sym53c8xx_detect() calls sym_attach()
> o sym_attach() initializes s.timer to point at sym53c8xx_timer but
> directly calls sym_timer() to kick off the self-arming timer.
> timer will pop in 0.5 seconds.
sym_attach() also calls request_irq().
request_irq() *enables* the IRQ for that line.
I suspect this might unmask the timer interrupt as well.
I'll add some debug code and test this out.
And after looking at arch/parisc/kernel/irq.c, I think we have a
race condition in our cpu_irq_ops, i.e. the eiem value read could be
stale if we take an interrupt at the wrong moment. We need to
save_flags()/local_irq_disable()/restore_flags() around touching the eiem.
If someone seconds that opinion, I'll add/test that.
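To illustrate the kind of race I mean, here is a userspace sketch, not the code in irq.c. The "register" is a plain variable, the interrupt is a callback invoked between the read and the write, and all the toy_* names are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the per-CPU EIEM register and the PSW I-bit. */
static unsigned long toy_eiem;
static bool toy_irqs_enabled = true;

typedef void (*toy_irq_handler)(void);

/* Delivers the simulated interrupt only if interrupts are enabled. */
static void toy_maybe_interrupt(toy_irq_handler handler)
{
    if (toy_irqs_enabled && handler != NULL)
        handler();
}

/* Interrupt handler that itself unmasks bit 0 in the EIEM. */
static void toy_handler_sets_bit0(void)
{
    toy_eiem |= 1UL << 0;
}

/* Unprotected read-modify-write: an interrupt can land in the middle. */
static void toy_enable_bit_racy(unsigned bit, toy_irq_handler handler)
{
    assert(bit < 8 * sizeof(toy_eiem));
    unsigned long val = toy_eiem;       /* read */
    toy_maybe_interrupt(handler);       /* interrupt arrives here */
    toy_eiem = val | (1UL << bit);      /* write back a stale value */
}

/* Same update with interrupts disabled around it, then restored. */
static void toy_enable_bit_safe(unsigned bit, toy_irq_handler handler)
{
    bool saved = toy_irqs_enabled;      /* save_flags() */
    toy_irqs_enabled = false;           /* local_irq_disable() */
    toy_enable_bit_racy(bit, handler);  /* handler cannot run now */
    toy_irqs_enabled = saved;           /* restore_flags() */
    toy_maybe_interrupt(handler);       /* pending interrupt runs after */
}
```

Starting from toy_eiem == 0 and enabling bit 3, the racy version ends with only bit 3 set (the handler's bit 0 is lost), while the protected version ends with both bits set.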
Lastly, use of IPI to set_eiem() on all CPUs can probably go away.
In 2.5, I was under the impression we no longer require globally
disabling of interrupts - only on the local CPU. For both 2.4
and 2.5, parisc only needs to mask/unmask the EIEM bit on the CPU
that is the target of that IRQ, not all CPUs. ie if the IPI
is needed, it should just target the same CPU which will handle the
specific external intr.
> o other interfaces are detected/initialized.
> o timer_interrupt() calls timer_bh() and invokes sym53c8xx_timer().
> o sym53c8xx_timer() attempts to reacquire the io_request_lock. checkmate.
grant
* Re: [parisc-linux] rp2470 hang...getting closer
2002-10-21 0:57 ` Grant Grundler
@ 2002-10-21 2:21 ` Matthew Wilcox
2002-10-21 3:33 ` Grant Grundler
0 siblings, 1 reply; 10+ messages in thread
From: Matthew Wilcox @ 2002-10-21 2:21 UTC (permalink / raw)
To: Grant Grundler; +Cc: parisc-linux
On Sun, Oct 20, 2002 at 06:57:16PM -0600, Grant Grundler wrote:
> Lastly, use of IPI to set_eiem() on all CPUs can probably go away.
> In 2.5, I was under the impression we no longer require globally
> disabling of interrupts - only on the local CPU. For both 2.4
> and 2.5, parisc only needs to mask/unmask the EIEM bit on the CPU
> that is the target of that IRQ, not all CPUs. ie if the IPI
> is needed, it should just target the same CPU which will handle the
> specific external intr.
hmm.. not sure about disable_irq() -- certainly sti() is gone and __sti()
is local_irq_enable() in 2.5; but i think enable/disable_irq still act
globally. i think this is because it's supposed to go and enable/disable
delivery of interrupts in the (io)(s)(a)pic, rather than playing with
the cpu interrupt masks.
--
Revolutions do not require corporate support.
* Re: [parisc-linux] rp2470 hang...getting closer
2002-10-21 2:21 ` Matthew Wilcox
@ 2002-10-21 3:33 ` Grant Grundler
0 siblings, 0 replies; 10+ messages in thread
From: Grant Grundler @ 2002-10-21 3:33 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: parisc-linux
Matthew Wilcox wrote:
> hmm.. not sure about disable_irq() -- certainly sti() is gone and __sti()
> is local_irq_enable() in 2.5; but i think enable/disable_irq still act
> globally. i think this is because it's supposed to go and enable/disable
> delivery of interrupts in the (io)(s)(a)pic, rather than playing with
> the cpu interrupt masks.
Normally, disable_irq() will only result in the IRQ being disabled
at the PIC, i.e. when a regular PCI driver calls disable_irq().
But if an HP device (e.g. Dino/IOSAPIC) calls disable_irq(), the IRQ
is in the CPU region and we muck with EIEM.
grant
* Re: [parisc-linux] rp2470 hang...getting closer
2002-10-13 4:40 [parisc-linux] rp2470 hang...getting closer Grant Grundler
2002-10-13 13:56 ` Thibaut VARENE
2002-10-21 0:57 ` Grant Grundler
@ 2002-10-21 14:59 ` Matthew Wilcox
2002-10-21 15:26 ` Matthew Wilcox
2 siblings, 1 reply; 10+ messages in thread
From: Matthew Wilcox @ 2002-10-21 14:59 UTC (permalink / raw)
To: Grant Grundler; +Cc: parisc-linux
On Sat, Oct 12, 2002 at 10:40:33PM -0600, Grant Grundler wrote:
> The problem is we shouldn't see *any* interrupts, not even timers.
> scsi_register_host() used spin_lock_irqsave() to acquire io_request_lock.
> Either someone is clobbering the irqsave or it's being ignored.
> That's the part the still needs to be worked out.
I wonder if this is interesting:
void handle_interruption(int code, struct pt_regs *regs)
{
        unsigned long fault_address = 0;
        unsigned long fault_space = 0;
        struct siginfo si;

        if (code == 1)
                pdc_console_restart(); /* switch back to pdc if HPMC */
        else
                sti();
so if we take, say, a page fault, we reenable interrupts on all CPUs.
Who the hell put this code in there?!
Thanks to Thibaut for finding this particular excrescence during
CONFIG_SMP frobbing on 2.5. I'm not sure what this should be changed to.
local_irq_enable? Do we want interrupts enabled at this point?
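A toy two-CPU model of the difference, assuming (as described in this thread) that sti() on an SMP kernel ends up re-enabling interrupts on every CPU, while local_irq_enable() touches only the calling CPU. The toy_* names are invented and nothing here is real kernel code.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS 2

/* One simulated PSW I-bit per CPU; true means interrupts enabled. */
static bool toy_ibit[TOY_NR_CPUS];

/* Models sti() as this thread describes it on SMP: all CPUs enabled. */
static void toy_global_sti(void)
{
    for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
        toy_ibit[cpu] = true;
}

/* Models local_irq_enable(): touches only the calling CPU. */
static void toy_local_irq_enable(int cpu)
{
    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    toy_ibit[cpu] = true;
}

/*
 * CPU 0 holds a lock taken with spin_lock_irqsave() (I-bit off) while
 * CPU 1 takes a fault. Returns whether CPU 0 is still masked afterwards.
 */
static bool toy_cpu0_still_masked(bool fault_path_uses_global_sti)
{
    toy_ibit[0] = false;    /* CPU 0: inside spin_lock_irqsave() */
    toy_ibit[1] = false;    /* CPU 1: enters the fault handler */

    if (fault_path_uses_global_sti)
        toy_global_sti();
    else
        toy_local_irq_enable(1);

    return !toy_ibit[0];
}
```

In this model toy_cpu0_still_masked(true) is false: the fault on CPU 1 strips the protection that CPU 0 believed it had, which would match the symptom Grant describes.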
--
Revolutions do not require corporate support.
* Re: [parisc-linux] rp2470 hang...getting closer
2002-10-21 14:59 ` Matthew Wilcox
@ 2002-10-21 15:26 ` Matthew Wilcox
2002-10-21 21:58 ` Grant Grundler
0 siblings, 1 reply; 10+ messages in thread
From: Matthew Wilcox @ 2002-10-21 15:26 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: Grant Grundler, parisc-linux
On Mon, Oct 21, 2002 at 03:59:32PM +0100, Matthew Wilcox wrote:
> so if we take, say, a page fault, we reenable interrupts on all CPUs.
> Who the hell put this code in there?!
>
> Thanks to Thibaut for finding this particular excrescence during
> CONFIG_SMP frobbing on 2.5. I'm not sure what this should be changed to.
> local_irq_enable? Do we want interrupts enabled at this point?
ok, going back through the CVS logs finds the answer:
http://cvs.parisc-linux.org/obsolete/linux-2.3/arch/parisc/kernel/traps.c#rev1.9
Revision 1.9, Tue Feb 15 17:02:03 2000 UTC (2 years, 8 months ago) by jsm
Branch: MAIN
Changes since 1.8: +11 -2 lines
Log message: Enable interrupts in fault path. Trap kernel space faults.
So I think we want to change the sti() call to local_irq_enable().
--
Revolutions do not require corporate support.
* Re: [parisc-linux] rp2470 hang...getting closer
2002-10-21 15:26 ` Matthew Wilcox
@ 2002-10-21 21:58 ` Grant Grundler
0 siblings, 0 replies; 10+ messages in thread
From: Grant Grundler @ 2002-10-21 21:58 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: parisc-linux
Matthew Wilcox wrote:
> So I think we want to change the sti() call to local_irq_enable().
I think we want to remove it altogether. Since I don't have a PA2.0
book handy, PA1.1 Arch and Instruction Set Manual, page 5-141 says:
"... Execution of an RFIR instruction when any of the
PSW Q, I, or R bits are ones is an undefined operation."
When "handle_interruption()" is done, execution will return to entry.S
and execute "rfir". I expect RFIR to put the I-bit back the way it was.
And anyone decoding state/insns in the trap/fault handler should read
the next page (5-142):
"Because this sequence restores the state of the execution pipeline,
it is possible for software to place the processor in states which
could not result from the execution of any sequence of insns
not involving interrupts."
Anyway, I tried both current CVS and removing local_irq_enable().
Both "hung" in sym2 disk search. :^(
Still looking.
thanks,
grant
* Re: [parisc-linux] rp2470 hang...getting closer
@ 2002-10-22 0:41 John Marvin
2002-10-23 1:19 ` Grant Grundler
0 siblings, 1 reply; 10+ messages in thread
From: John Marvin @ 2002-10-22 0:41 UTC (permalink / raw)
To: parisc-linux
> ok, going back through the CVS logs finds the answer:
>
> http://cvs.parisc-linux.org/obsolete/linux-2.3/arch/parisc/kernel/traps.c#rev1.9
>
> Revision 1.9, Tue Feb 15 17:02:03 2000 UTC (2 years, 8 months ago) by jsm
> Branch: MAIN
> Changes since 1.8: +11 -2 lines
> Diff to previous 1.8
>
> Enable interrupts in fault path. Trap kernel space faults.
>
> So I think we want to change the sti() call to local_irq_enable().
>
Oooh, enabling interrupts on all processors is very bad. Bad jsm, bad
jsm. I tried to go back and figure out what was going on in my head back
then (2 years, 8 months ago). Had to rummage around in the CVS attic to
find interruption.S for changes made at the same time. At the time, user
processes were not being rescheduled properly (we had just turned on user
space), and they were running with I bit off. I don't really remember why
I thought enabling interrupts here was a good idea. Probably had
something to do with not wanting to handle faults with the I bit off. But
that is bogus too (see below). One possible reason is that we didn't
have exception table support yet, and were probably taking kernel space
faults for bogus user addresses in the user copy routines.
Also, at the time we had not begun the SMP support, and the CONFIG_SMP
part of system.h was not filled in, and therefore I just saw the UP
definition of sti. I probably couldn't imagine that the CONFIG_SMP
support would have sti enabling interrupts on all processors. Yes, I know
now that that is the solution that Linux has chosen for supporting broken
(non SMP safe) drivers on SMP kernels ...
> Matthew Wilcox wrote:
> > So I think we want to change the sti() call to local_irq_enable().
>
> I think we want to remove it all together. Since I don't have a PA2.0
> book handy, PA1.1 Arch and Instruction Set Manual, page 5-141 says:
> "... Execution of an RFIR instruction when any of the
> PSW Q, I, or R bits are ones is an undefined operation."
>
> When "handle_interruption()" is done, execution will return to entry.S
> and execute "rfir". I expect RFIR to put I-bit back the way it was.
Well, I too think we shouldn't be calling local_irq_enable(), but NOT for
the reason above. We would have had more severe problems if we called
rfir with I bit on (or Q bit for that matter). Upon return from
handle_interruption we turn on the I bit, but then later we disable I and
Q before calling rfir in the return path from handle_interruption, so this
is not an issue.
Anyway, it would be bad to handle page faults with the I bit off. But the
only time we should be coming into handle_interruption with the I bit off
should be during a kernel fault. And we shouldn't be processing a user
space page fault in that case.
Note that we are currently turning the I bit on when we return from
handle_interruption, so removing the local_irq_enable() is not
going to actually change any behaviour. I just don't think it is
necessary, and the enablement upon return may be wrong also. I believe
that the reason I put the original sti in (thinking it was a local cpu I
bit enable only) was based on a wrong understanding of the problem, and
that it was probably fixed the right way later on.
Now, I've also spent some time looking at the I bit enablement at the head
of intr_return in entry.S. I don't think that is correct either. But
just removing it is not the correct answer. I know at one time I was
under the impression that calling schedule() with I bit off was a bad
thing. But that is wrong. Other architectures explicitly do this before
checking the RESCHED bit, and it is OK to call schedule with I bit off,
since schedule unconditionally turns it off without saving the I bit state
near the front of the routine with a spin_lock_irq() anyway.
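A small sketch of the point about spin_lock_irq() not saving the I bit, modeled in userspace with a single boolean standing in for the I bit. The toy_* helpers are invented; they only mimic the save/no-save behaviour of the two lock flavours and do no actual locking.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated PSW I-bit for one CPU; true means interrupts enabled. */
static bool toy_irqs_on;

/* Models spin_lock_irq()/spin_unlock_irq(): no state is saved. */
static void toy_lock_irq(void)   { toy_irqs_on = false; }
static void toy_unlock_irq(void) { toy_irqs_on = true; }

/* Models spin_lock_irqsave()/spin_unlock_irqrestore(). */
static bool toy_lock_irqsave(void)
{
    bool flags = toy_irqs_on;
    toy_irqs_on = false;
    return flags;
}

static void toy_unlock_irqrestore(bool flags)
{
    toy_irqs_on = flags;
}

/* Runs a critical section and reports the I-bit state afterwards. */
static bool toy_ibit_after_section(bool on_at_entry, bool use_irqsave)
{
    toy_irqs_on = on_at_entry;
    if (use_irqsave) {
        bool flags = toy_lock_irqsave();
        assert(!toy_irqs_on);           /* masked inside the section */
        toy_unlock_irqrestore(flags);
    } else {
        toy_lock_irq();
        assert(!toy_irqs_on);           /* masked inside the section */
        toy_unlock_irq();
    }
    return toy_irqs_on;
}
```

Entering with the I bit off, the irq flavour leaves it on afterwards, whereas the irqsave flavour leaves it off. That is the sense in which the caller's I-bit state does not matter to a routine that uses spin_lock_irq().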
So there MAY be a race condition bug with our checking for rescheds and
signals pending, but it needs more thought (we also check software
interrupts in the same path, need to make sure we do the right thing for
all of that).
Note, I don't think any of this is going to explain the problem specific
to the rp2470.
But, the original bad sti call may very well explain the apt-get SMP bug
(apt-get was causing an unaligned fault, which by itself shouldn't have
been a problem).
John
P.S. I'll try to do a more thorough investigation of the I bit handling
in intr_return tonight.
* Re: [parisc-linux] rp2470 hang...getting closer
2002-10-22 0:41 John Marvin
@ 2002-10-23 1:19 ` Grant Grundler
0 siblings, 0 replies; 10+ messages in thread
From: Grant Grundler @ 2002-10-23 1:19 UTC (permalink / raw)
To: John Marvin; +Cc: parisc-linux
John Marvin wrote:
...
> Upon return from
> handle_interruption we turn on the I bit, but then later we disable I and
> Q before calling rfir in the return path from handle_interruption, so this
> is not an issue.
Ah ok - my bad. I should have checked.
When doing any bottom half stuff, I think it should be ok for
I-bit to be enabled.
> Note that we are currently turning the I bit on when we return from
> handle_interruption, so removing the local_irq_enable() is not
> going to actually change any behaviour.
hmmm. It changes the timing a bit but basically you are right.
I'm worried about the nesting of traps/interrupts in this case.
My gut feeling says we never want to enable I-bit for handling faults/traps.
Maybe to enable console output in a kernel_die() code path.
The obvious deadlock sequence to me is:
spin_lock_irqsave() -> trap -> interrupt -> spin_lock()
or
spin_lock_irqsave() -> trap -> interrupt/bottom half -> spin_lock()
I use "interrupt" to mean "External Interrupts" as defined by parisc arch.
> I just don't think it is necessary, and the enablement upon return may
> be wrong also.
yeah. I suspect it is.
...
> Now, I've also spent some time looking at the I bit enablement at the head
> of intr_return in entry.S. I don't think that is correct either. But
> just removing it is not the correct answer.
Please let me know if (a) you are chasing this down or (b) you
already found the correct answer. I'm very tempted to just try
removing it and see what breaks. I may not be able to debug the
resulting mess though...
...
> Note, I don't think any of this is going to explain the problem specific
> to the rp2470.
While spin_lock_irqsave() is blocking interrupts, we must not take any
external interrupt. If we do, the result is a deadlock.
I'm suspecting that's also what was causing the deadlock on xtime_lock.
I added code to set EIEM to 0 in do_cpu_irq_mask() to avoid that deadlock.
I can't do the same for sym2 driver.
thanks,
grant