* IP27: Random hard locks after ~16hrs uptime
From: Joshua Kinard @ 2015-02-08 2:58 UTC
To: Linux MIPS List

I've had my Onyx2 running quite a bit lately doing compile runs, and it seems
that after about 16 hours, there's a random possibility that the machine just
completely stops. No errors printed anywhere, serial becomes completely
unresponsive. I have to issue a 'rst' from the MSC to bring it back up again.

It's currently got dual IP31 R14000 node boards (500MHz), and for the most
part, runs great (I'll regret the electric bill later...). Clearly a bug,
but I am not sure where to start debugging on this platform to find it, since
I can't trigger it manually. Even tried an NMI, since this machine has an NMI
handler in the kernel, but all that does is reset the machine.

Already ran an extensive memory test from the PROM and had no issues with that.
Haven't tried running any of the more thorough hardware tests from IRIX,
though.

Ideas?

--
Joshua Kinard
Gentoo/MIPS
kumba@gentoo.org
4096R/D25D95E3 2011-03-28

"The past tempts us, the present confuses us, the future frightens us. And our
lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic
* Re: IP27: Random hard locks after ~16hrs uptime
From: Maciej W. Rozycki @ 2015-02-08 12:06 UTC
To: Joshua Kinard; +Cc: Linux MIPS List

On Sat, 7 Feb 2015, Joshua Kinard wrote:

> I've had my Onyx2 running quite a bit lately doing compile runs, and it seems
> that after about ~16 hours, there's a random possibility that the machine just
> completely stops. No errors printed anywhere, serial becomes completely
> unresponsive. I have to issue a 'rst' from the MSC to bring it back up again.

 If the time spent up is always similar, then one possibility is a counter
wraparound or suchlike that is not handled correctly (i.e. the carry from
the topmost bit is not taken into account), causing a kernel deadlock.

> It's currently got dual IP31 R14000 node boards (500MHz), and for the most
> part, runs great (I'll regret the electric bill later...). Clearly a bug,
> though, but I am not sure where to start debugging on this platform to find
> this bug, since I can't trigger it manually. Even tried an NMI interrupt,
> since this machine has an NMI handler in the kernel, but all that does is reset
> the machine.

 The NMI exception is routed to the same vector as reset, so firmware would
have to tell them apart (using the CP0.Status.NMI bit) and then call a
supplied handler. Perhaps there's a way to register such a handler with
the firmware -- does the kernel do it? You could then use the handler to
examine the kernel state and perhaps dump it somehow.

 On MIPS processors an NMI or even a reset event does not clobber any
registers except for the CP0 ErrorEPC register, where the PC at the time
the event happened is stored, some bits in the CP0 Status register (ERL,
BEV, etc.), and of course the PC. So alternatively, does the firmware have
a way to dump registers on reset or NMI somehow?

 For example R4k DECstations dump registers automatically when the reset
button is pressed at a time when the machine operates normally (a power-up
reset can be told apart by the state of the CP0.Status.SR bit).

  Maciej
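The wraparound hypothesis above lends itself to a quick back-of-the-envelope check. What follows is only a minimal sketch and assumes nothing about the actual IP27 timers: `wrap_period_hours` and `wrapsafe_after` are hypothetical helper names, and the widths and tick rates in the loop are made-up candidates. The second helper illustrates the masked-difference idiom that a wrap-safe comparison relies on, as opposed to a plain `a > b`.

```python
def wrap_period_hours(bits: int, hz: float) -> float:
    """Hours until a free-running counter of `bits` width overflows at `hz` ticks/s."""
    return (2 ** bits) / hz / 3600.0


def wrapsafe_after(a: int, b: int, bits: int = 32) -> bool:
    """True if counter sample `a` is later than `b`, tolerating one wraparound.

    Only valid while the two samples are less than half the counter range
    apart; that is the usual limitation of the masked-difference trick.
    """
    mask = (1 << bits) - 1
    diff = (a - b) & mask
    return diff != 0 and diff < (1 << (bits - 1))


if __name__ == "__main__":
    # Hypothetical candidates: (width in bits, tick rate in Hz).
    for bits, hz in [(32, 1_000), (32, 100_000), (32, 1_000_000)]:
        print(f"{bits}-bit @ {hz} Hz wraps after {wrap_period_hours(bits, hz):.2f} h")

    # Two samples taken just before and just after a 32-bit wrap.
    before, after = 0xFFFF_FFF0, 0x0000_0010
    print("naive compare:", after > before)
    print("wrap-safe compare:", wrapsafe_after(after, before))
```

A counter whose computed period lands near the observed uptime would be a candidate worth auditing for comparisons of the naive kind.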
* Re: IP27: Random hard locks after ~16hrs uptime
From: Joshua Kinard @ 2015-02-09 0:53 UTC
To: Maciej W. Rozycki; +Cc: Linux MIPS List

On 02/08/2015 07:06, Maciej W. Rozycki wrote:
> On Sat, 7 Feb 2015, Joshua Kinard wrote:
>
>> I've had my Onyx2 running quite a bit lately doing compile runs, and it seems
>> that after about ~16 hours, there's a random possibility that the machine just
>> completely stops. No errors printed anywhere, serial becomes completely
>> unresponsive. I have to issue a 'rst' from the MSC to bring it back up again.
>
> If the time spent up is always similar, then one possibility is a counter
> wraparound or suchlike that is not handled correctly (i.e. the carry from
> the topmost bit is not taken into account), causing a kernel deadlock.

I believe I've pinned the problem down to the block I/O layer and points
beneath, such as SCSI core, qla1280, etc. I am using an out-of-tree patch to
add the BFQ I/O scheduler in, so that may also be a cause to consider.

I had a very similar hardlock on the Octane, too, when I upgraded the RAM to
3.5GB the other day, but going back to 2GB solves the problem there. (Octane
is, for all intents and purposes, a single-node Origin system w/ graphics
options, HEART instead of HUB, and a much more simplified PROM.) Both use the
same SCSI chip, a QLogic ISP1040B, and thus the same driver, qla1280.o. The
difference with the Octane is I can reproduce the hardlock on demand by
untarring a large tarball (a Gentoo stage3, to be exact). On the Onyx2, which
has 8GB of RAM, the lock seems more random.

I'll have to reconfigure the Octane later on with 3.5GB of RAM again, and test
BFQ, CFQ, and Deadline to see if the hardlock happens with all three. I
know BFQ is largely derived from the CFQ code, so if the system remains stable
with Deadline, but not CFQ or BFQ, then I know the subsystem. Then, if it only
happens on BFQ, I'll go pester their upstream for debugging advice.

I thought it might've been filesystem related, but because the Octane is XFS
and the Onyx2 is Ext4, that eliminates that subsystem from consideration (I
hope).

On the Onyx2, I don't think I can trigger it on-demand, but I may have found a
way by running e4defrag on my large /usr partition. So if I can pin a cause
down on the Octane, I might be able to test for that same cause on the Onyx2 as
well. Provided it doesn't eat my filesystem...

Currently trying a 3.19-rc7 kernel out to see if the effects are any different.
I also switched to compiling packages in a RAM filesystem for now.

>> It's currently got dual IP31 R14000 node boards (500MHz), and for the most
>> part, runs great (I'll regret the electric bill later...). Clearly a bug,
>> though, but I am not sure where to start debugging on this platform to find
>> this bug, since I can't trigger it manually. Even tried an NMI interrupt,
>> since this machine has an NMI handler in the kernel, but all that does is reset
>> the machine.
>
> The NMI exception is routed to the same vector reset is, firmware would
> have to tell them apart (with the use of the CP0.Status.NMI bit) and then
> call a handler supplied. Perhaps there's a way to register such a handler
> with the firmware -- does the kernel do it? You could then use the
> handler to examine the kernel state and perhaps dump it somehow.
>
> On MIPS processors an NMI or even a reset event does not clobber any
> registers except from the CP0 ErrorEPC register, where the PC at the time
> the event happened is stored, some bits in the CP0 Status register (ERL,
> BEV, etc.), and of course the PC. So alternatively does the firmware have
> a way to dump registers on reset or NMI then somehow?
>
> For example R4k DECstations dump registers automatically, when the reset
> button is pressed at a time when the machine operates normally (a power-up
> reset can be told apart by the state of the CP0.Status.SR bit).

I only mentioned the NMI bit because IP27 does have an NMI handler in it, and I
can trigger it to dump some useful debugging information under normal
circumstances prior to the hardware reset. But in this case, the kernel is so
dead at this point that not even the NMI handler is executing. I suspect it's
either a total hardware lockup at some level or something gets stuck in the CPU
so thoroughly that the CPU stops processing all interrupts.

Actually, on one use of 'nmi' from the MSC, something didn't get cleared right
in memory, so the booting of the PROM actually crashed and the Onyx2 dropped
into the POD debugger. I was kinda hoping NMI would put me into the POD
debugger without clearing any memory banks, but in this instance, half of the
banks were cleared before the PROM crashed. From POD, I can inspect memory
addresses (if I know where to look), but with half the banks cleared, there
really wasn't a point by then.
* Re: IP27: Random hard locks after ~16hrs uptime
From: Joshua Kinard @ 2015-02-09 8:50 UTC
To: Maciej W. Rozycki; +Cc: Linux MIPS List

On 02/08/2015 19:53, Joshua Kinard wrote:

[snip]

> I'll have to reconfigure the Octane later on with 3.5GB of RAM again, but test
> BFQ, CFQ, and Deadline out to see if the hardlock happens with all three. I
> know BFQ is largely derived from the CFQ code, so if the system remains stable
> with Deadline, but not CFQ or BFQ, then I know the subsystem. Then, if it only
> happens on BFQ, I'll go pester their upstream for debugging advice.

For the Octane, it looks like it's something with the scheduler. If I use
"No-op", the machine can unpack a stage3 just fine. If I use Deadline or CFQ,
it dies. I did get several oopses under both, but they're not specific to
Octane or MIPS code, and the Oops output doesn't always trigger with each
attempt. Several were actually "Reserved instruction in kernel code", but the
failing instruction was an "sw", which should work fine. The other weird one
was "do_cpu invoked from kernel context!" -- new to me. Saved all of them in
case anyone is interested.

I'll have to test this on the Onyx2 later to see if similar results
happen there. That way, I'll know that I am chasing the same bug down.
* Re: IP27: BUG() in mm/vmscan.c, isolate_lru_pages [was: Random hard locks after ~16hrs uptime]
From: Joshua Kinard @ 2015-02-09 14:33 UTC
To: Maciej W. Rozycki; +Cc: Linux MIPS List

On 02/09/2015 03:50, Joshua Kinard wrote:

[snip]

> For the Octane, it looks like it's something with the scheduler. If I use
> "No-op", the machine can unpack a stage3 just fine. If I use Deadline or CFQ,
> it dies. I did get several oopses under both, but they're not specific to
> Octane or MIPS code, and the Oops output doesn't always trigger with each
> attempt. Several were actually "Reserved instruction in kernel code", but the
> failing instruction was an "sw", which should work fine. Other weird one was
> "do_cpu invoked from kernel context!" -- new to me. Saved all of them in case
> anyone is interested in it.
>
> I'll have to test this on the Onyx2 later to see if similar results
> happen there. That way, I'll know that I am chasing the same bug down.

No, I think I am looking at two different bugs here. They just happen to share
similar triggers and exhibit similar symptoms.

On the Octane, the bug happens when I use two high-density memory modules
(very rare). I'll have to find my IRIX disk and run diags on the modules to
eliminate them being plain faulty. I can still crash the machine with just
those two installed... oddly enough, though, only when using a scheduler other
than no-op. Have to dig further into that...

For the Onyx2, I was able to use bonnie++ to trigger the oops, and it seems
I've tripped up a BUG() in mm/vmscan.c:

command line:
# bonnie++ -d /usr/space/bonnie/ -s 16g -f -b -u root

[ 741.690000] Kernel bug detected[#1]:
[ 741.690000] CPU: 2 PID: 42 Comm: kswapd1 Not tainted 3.19.0-rc7-mipsgit-20150207 #7
[ 741.690000] task: a8000000ffff0008 ti: a8000000ff524000 task.ti: a8000000ff524000
[ 741.690000] $ 0 : 0000000000000000 ffffffff94001ce0 0000000000000000 0000000000000000
[ 741.690000] $ 4 : a800000002957d40 0000000000000000 a8000000ff527b00 a8000000ff527b38
[ 741.690000] $ 8 : 0000000000000000 0000000000000020 000000000000039b 0000000000000003
[ 741.690000] $12 : fffffffffffffff8 0000000000100000 0000000000000000 0000000000000000
[ 741.690000] $16 : 0000000000000000 0000000000000020 0000000000000000 a8000000ff527b00
[ 741.690000] $20 : fffffffffffffff0 a8000000ffc9d800 0000000000000002 a800000002957d60
[ 741.690000] $24 : 0000000000000000 a800000000083140
[ 741.690000] $28 : a8000000ff524000 a8000000ff527a90 a8000000ffc9d820 a8000000000f68d0
[ 741.690000] Hi : 0000000000000000
[ 741.690000] Lo : 0000000000000000
[ 741.690000] epc : a8000000000f69b0 isolate_lru_pages.isra.27+0x170/0x188
[ 741.690000] Not tainted
[ 741.690000] ra : a8000000000f68d0 isolate_lru_pages.isra.27+0x90/0x188
[ 741.690000] Status: 94001ce2 KX SX UX KERNEL EXL
[ 741.690000] Cause : 00008024
[ 741.690000] PrId : 00000f14 (R14000)
[ 741.690000] Process kswapd1 (pid: 42, threadinfo=a8000000ff524000, task=a8000000ffff0008, tls=0000000000000000)
[ 741.690000] Stack : 0000000000000000 a8000000ff527b38 a800000100025000 a8000000ff527d30
          0000000000000002 a8000000ffc9d800 0000000000000000 0000000000000020
          0000000000000001 a800000100025600 0000000000000000 a8000000000f8868
          a8000000ff527b10 0000000000000000 a8000000ff527b00 a8000000ff527b00
          0000000000000000 0000000000000000 0000000000000000 0000000000000000
          0000000000000000 a8000000050332f0 0000000000000001 a8000000ffeb1760
          a8000000ffc9d800 a8000000ff527d30 a8000000ff527c08 0000000000000002
          0000000000000001 0000000000000020 0000000000000003 0000000000000004
          0000000000000000 a8000000000f8ff8 a8000000ff527ba0 a8000000ff527ba0
          a8000000ff527bb0 a8000000ff527bb0 a8000000ff527bc0 a8000000ff527bc0
          ...
[ 741.690000] Call Trace:
[ 741.690000] [<a8000000000f69b0>] isolate_lru_pages.isra.27+0x170/0x188
[ 741.690000] [<a8000000000f8868>] shrink_inactive_list+0xf0/0x638
[ 741.690000] [<a8000000000f8ff8>] shrink_lruvec+0x248/0x718
[ 744.230000] [<a8000000000f956c>] shrink_zone+0xa4/0x2c8
[ 744.230000] [<a8000000000f9e54>] kswapd+0x6c4/0xab0
[ 744.230000] [<a80000000006f16c>] kthread+0x10c/0x128
[ 744.230000] [<a800000000025b88>] ret_from_kernel_thread+0x14/0x1c
[ 744.230000]
[ 744.230000]
Code: 0803da48 ffd70000 00000000 <000c000d> 00000000 0000802d 0803da4f ffa00000 00000000
[ 744.660000] ---[ end trace 2796f87304e1e281 ]---
[ 744.660000] Fatal exception: panic in 5 seconds
[ 744.660000] sched: RT throttling activated
[ 749.670000] Kernel panic - not syncing: Fatal exception
[ 749.730000] ---[ end Kernel panic - not syncing: Fatal exception

'isolate_lru_pages.isra.27' seems to be where GCC optimized 'isolate_lru_pages'
with -fipa-sra:

a8000000000f6840 <isolate_lru_pages.isra.27>:
 * @mode:       One of the LRU isolation modes
 * @lru:        LRU list id for isolating
 *
 * returns how many pages were moved onto *@dst.
 */
static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
a8000000000f6840:       67bdffa0        daddiu  sp,sp,-96
a8000000000f6844:       ffb60040        sd      s6,64(sp)
a8000000000f6848:       0120b02d        move    s6,a5
                struct lruvec *lruvec, struct list_head *dst,
                unsigned long *nr_scanned, struct scan_control *sc,
                isolate_mode_t mode, enum lru_list lru)
{

[snip]

                switch (__isolate_lru_page(page, mode)) {
a8000000000f68c8:       0c03d9ba        jal     a8000000000f66e8 <__isolate_lru_page>
a8000000000f68cc:       0240282d        move    a1,s2
a8000000000f68d0:       1054002b        beq     v0,s4,a8000000000f6980 <isolate_lru_pages.isra.27+0x140>
a8000000000f68d4:       00000000        nop
a8000000000f68d8:       14400035        bnez    v0,a8000000000f69b0 <isolate_lru_pages.isra.27+0x170>
^^^^^^^^^^
+0x98, the "default" case in this switch block ($ra, +0x90, is the beq
above, i.e. the return from __isolate_lru_page).

[snip]

static inline void __noreturn BUG(void)
{
        __asm__ __volatile__("break %0" : : "i" (BRK_BUG));
a8000000000f69b0:       000c000d        break   0xc
^^^^^^^^^^
+0x170, or $epc, BUG()

So I did trip up a kernel bug. But I have no idea *why* it tripped. All 8GB
of memory in the Onyx2 should be fine -- I ran heavy diags on it from PROM (not
from IRIX though) and all of the memory passed.

It seems that by rebooting the machine after every compile run and using a
ramdisk, I can avoid triggering the flaw for now. At least, I hope it's the
same flaw. The hardlocks never dumped oops information. But the triggers seem
the same, so I am hoping this is it.

Ideas?
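The fields of the oops above can be cross-checked mechanically. This is a rough sketch only: the helper names are hypothetical, and the bit positions are assumptions taken from the conventional R4000/R10000-style CP0 layout (ExcCode in Cause bits 6:2; Status IE=0, EXL=1, ERL=2, KSU=4:3, UX=5, SX=6, KX=7, NMI=19, SR=20, BEV=22), so they should be verified against the processor manual. The position of the `break` code in bits 25:16 is likewise inferred from how objdump printed `break 0xc` for the word 000c000d.

```python
# Assumed bit positions; check them against the CPU manual before relying on them.
STATUS_FLAGS = {
    "IE": 0, "EXL": 1, "ERL": 2, "UX": 5, "SX": 6, "KX": 7,
    "NMI": 19, "SR": 20, "BEV": 22,
}
KSU_MODES = {0: "KERNEL", 1: "SUPERVISOR", 2: "USER", 3: "RESERVED"}


def symbol_offset(addr: int, func_start: int) -> int:
    """Offset of `addr` from the start of its function, as in 'sym+0x170'."""
    return addr - func_start


def cause_exccode(cause: int) -> int:
    """Extract the exception code (assumed to be Cause bits 6:2)."""
    return (cause >> 2) & 0x1F


def decode_status(value: int) -> dict:
    """Split a Status word into set flag names and the KSU operating mode."""
    flags = sorted(n for n, bit in STATUS_FLAGS.items() if value & (1 << bit))
    return {"flags": flags, "mode": KSU_MODES[(value >> 3) & 0x3]}


def break_code(insn: int):
    """Return the code of a BREAK instruction, or None for any other word."""
    if (insn >> 26) != 0 or (insn & 0x3F) != 0x0D:
        return None  # not SPECIAL/BREAK
    return (insn >> 16) & 0x3FF


if __name__ == "__main__":
    func_start = 0xA8000000000F6840  # isolate_lru_pages.isra.27 in the objdump
    print(hex(symbol_offset(0xA8000000000F69B0, func_start)))  # epc
    print(hex(symbol_offset(0xA8000000000F68D0, func_start)))  # ra
    print(cause_exccode(0x00008024))
    print(decode_status(0x94001CE2))
    print(break_code(0x000C000D))
```

Under those assumptions the epc works out to +0x170 and the ra to +0x90, matching what the kernel printed; ExcCode 9 is the breakpoint exception, which fits BUG() being implemented as a `break`; and the Status word decodes to the same "KX SX UX KERNEL EXL" shown in the dump.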
* Re: IP27: BUG() in mm/vmscan.c, isolate_lru_pages
From: Joshua Kinard @ 2015-02-21 22:26 UTC
To: linux-mips

On 02/09/2015 09:33, Joshua Kinard wrote:
> For the Onyx2, I was able to use bonnie++ to trigger the oops, and it seems
> I've tripped up a BUG() in mm/vmscan.c:
>
> command line:
> # bonnie++ -d /usr/space/bonnie/ -s 16g -f -b -u root
>
> [ 741.690000] Kernel bug detected[#1]:
> [ 741.690000] CPU: 2 PID: 42 Comm: kswapd1 Not tainted 3.19.0-rc7-mipsgit-20150207 #7
>
> [snip]

Got another one today on the Onyx2. Managed to get >2 days of uptime, though.

[255349.410000] Kernel bug detected[#1]:
[255349.450000] CPU: 2 PID: 42 Comm: kswapd1 Not tainted 3.19.0-mipsgit-20150215 #2
[255349.470000] task: a8000000ffff0008 ti: a8000000ff524000 task.ti: a8000000ff524000
[255349.470000] $ 0 : 0000000000000000 ffffffffffffffdf 0000000000000001 000000000085b218
[255349.470000] $ 4 : a800000004e53000 0000000000000000 0000000000002587 a8000000ff527c98
[255349.470000] $ 8 : 0000000000000000 0000000000000010 fffffffffffffffc a800000005030fa0
[255349.470000] $12 : a8000000ff527fe0 0000000000001c00 0000000000000020 a8000000006d05b8
[255349.470000] $16 : 0000000000000001 0000000000000020 0000000000000000 a8000000ff527c80
[255349.470000] $20 : fffffffffffffff0 a8000000ffc9d800 0000000000000001 a800000004e53020
[255349.470000] $24 : 0000000000000005 0000000000000001
[255349.470000] $28 : a8000000ff524000 a8000000ff527c00 a8000000ffc9d810 a8000000000f6840
[255349.470000] Hi : 0000000000000000
[255349.470000] Lo : 0000000000000042
[255349.470000] epc : a8000000000f6920 isolate_lru_pages.isra.27+0x170/0x188
[255349.470000] Not tainted
[255349.470000] ra : a8000000000f6840 isolate_lru_pages.isra.27+0x90/0x188
[255349.470000] Status: 94001ce2 KX SX UX KERNEL EXL
[255349.470000] Cause : 00008024
[255349.470000] PrId : 00000f14 (R14000)
[255349.470000] Process kswapd1 (pid: 42, threadinfo=a8000000ff524000, task=a8000000ffff0008, tls=0000000000000000)
[255349.470000] Stack : 0000000000000001 a8000000ff527c98 a8000000ff527c80 a8000000ff527d30
          0000000000000000 0000000000000020 ffffffffffffffff 000000000000000c
          a800000100025000 a8000000ffc9d800 a8000000ffc9f800 a8000000000f70dc
          a8000000ff527c60 a8000000ff527c60 a8000000ff527c70 a8000000ff527c70
          a800000004e508e0 a800000004e508e0 a800000005030f80 a800000000869de0
          a8000000ffc9d800 0000000000000001 0000000000000000 a8000000ffc9f800
          a800000100025600 a8000000001466c8 00000000ffff8b73 a8000000ffdc0000
          0000000000000000 a800000100025000 0000000000000000 a800000100025000
          ffffffffffffffff 000000000000000c a800000000940000 a8000000ffc9d800
          a8000000ffc9f800 a8000000000f9b00 0000000000000000 000000d000000000
          ...
[255349.470000] Call Trace:
[255349.470000] [<a8000000000f6920>] isolate_lru_pages.isra.27+0x170/0x188
[255349.470000] [<a8000000000f70dc>] shrink_active_list+0xcc/0x478
[255349.470000] [<a8000000000f9b00>] kswapd+0x400/0xab0
[255349.470000] [<a80000000006f17c>] kthread+0x10c/0x128
[255349.470000] [<a800000000025b88>] ret_from_kernel_thread+0x14/0x1c
[255349.470000]
[255349.470000]
Code: 0803da24 ffd70000 00000000 <000c000d> 00000000 0000802d 0803da2b ffa00000 00000000
[255352.230000] sched: RT throttling activated
[255352.230000] ---[ end trace 4b44ace0cfc15c3d ]---
[255352.280000] Fatal exception: panic in 5 seconds
[255357.290000] Kernel panic - not syncing: Fatal exception
[255357.350000] ---[ end Kernel panic - not syncing: Fatal exception

Interesting to note, though: it's the same process (kswapd1), same PID (42 --
HHGTTG?), and same CPU (#2) as the oops I forced via bonnie++ almost two weeks
ago. For a memory-related oops, that seems a bit uncanny.

That BUG() has been there for the longest time. I managed to track down the
initial commit, 21eac81f252f "[PATCH] Swap Migration V5: LRU operations", which
says this:

"Swap migration allows the moving of the physical location of pages between
nodes in a numa system while the process is running. This means that the
virtual addresses that the process sees do not change. However, the system
rearranges the physical location of those pages."

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/mm/vmscan.c?id=21eac81f252fe31c3cf64b805a1e8652192f3a3b

NUMA system and signed off by an SGI person back then... So I turned on
CONFIG_NUMA, against the advice of the help entry (2-node Onyx2), and the
problem appears to go away.

Is CONFIG_NUMA a problem for a 2-node IP27 machine? These machines basically
invented the concept. I'm not familiar with how enabling that option changes
things to apparently avoid tripping the BUG() up.

--J
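When comparing the two kernels (with and without CONFIG_NUMA), it can help to confirm what each build was actually configured with. The following is only a sketch: `kconfig_value` is a hypothetical helper, and it assumes the standard Kconfig `.config` text format (`CONFIG_FOO=y` for set symbols, `# CONFIG_FOO is not set` for disabled ones). Where the text comes from (the build tree's `.config`, or an exported copy if the kernel provides one) is left to the caller.

```python
import re


def kconfig_value(config_text: str, symbol: str):
    """Look up `symbol` in Kconfig .config text.

    Returns the assigned value ('y', 'm', a number or string) if set,
    'n' if the symbol is explicitly marked as not set, or None if absent.
    """
    set_re = re.compile(rf"^{re.escape(symbol)}=(.*)$", re.MULTILINE)
    unset_re = re.compile(rf"^# {re.escape(symbol)} is not set$", re.MULTILINE)
    match = set_re.search(config_text)
    if match:
        return match.group(1)
    if unset_re.search(config_text):
        return "n"
    return None


if __name__ == "__main__":
    # Made-up sample fragment, not taken from the kernels in this thread.
    sample = "CONFIG_SMP=y\n# CONFIG_NUMA is not set\nCONFIG_NR_CPUS=4\n"
    for sym in ("CONFIG_SMP", "CONFIG_NUMA", "CONFIG_NR_CPUS"):
        print(sym, "=", kconfig_value(sample, sym))
```

Running it over the configs of the crashing and non-crashing builds would show whether CONFIG_NUMA is the only relevant difference between them.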