* ARM11MPcore: tlb_ops_need_broadcast causes deadlock @ 2012-03-22 12:24 EXTERNAL Waechtler Peter (Fa. TCP, CM-AI/PJ-CF31) 2012-03-23 17:30 ` Will Deacon 0 siblings, 1 reply; 12+ messages in thread From: EXTERNAL Waechtler Peter (Fa. TCP, CM-AI/PJ-CF31) @ 2012-03-22 12:24 UTC (permalink / raw) To: linux-arm-kernel Here we are again with another issue on ARM11mpcore (2 cores for Linux): In relatively rare circumstances the system soft-locks up: cpuA cpuB kswapd searches for pages to reclaim via shrink_zone page_referenced page_referenced_one page_check_address(&ptl) <- ptl gets locked! ptep_clear_flush_young_notify jump to the "innocent" page IRQS OFF do_DataAbort-> handle_mm_fault handle_pte_fault (inlined) ptl = pte_lockptr(mm, pmd); spin_lock(ptl); flush_tlb_page tlb_ops_need_broadcast on_each_cpu_mask(ipi_flush_tlb_page, with WAIT) csd_lock_wait() DEADLOCK, IPI on cpuB does not finish because IRQs are OFF pte_unmap_unlock(pte, ptl); And here is some explanation: Every then and now pages are marked inaccessible in the hardware PTE (page table entry) so that the VM subsystem can check if the page is accessed at all. If it's frequently accessed it will become a "young" page. On memory pressure "old" pages will be the first to get evicted. The kswapd kernel thread goes through a list of pages to check if they were accessed in a given interval and mark our target page as young. The cpuB executes some user code hitting that page and because the PTE is marked "inaccessible", so that the attempt can be stored, it results in a page fault. Unluckily the kswapd calls tlb_flush and that is configured to inform all cpus about that change via IPIs. cpuB is in an user abort handler (__dabt_usr) and the disaster takes its course: For checking if it's a thumb instruction that caused the fault the abort handler accesses the page resulting into another fault - but now entering svc abort handler (__dabt_svc) and that turns off interrupts! That leads to cpuA waiting in csd_lock_wait for the IPI to signal its end of execution (via csd->flags) but that does not happen because IRQs are off on cpuB that is stuck in the page fault handler spinning to get the lock for the mm->page_table_lock but this is still held on cpuA waiting for the IPIs to finish. possible solutions: a) do not wait for that particular IPI since the mapping does not change (just the access bits) b) open code the ptep_set_access_flags() and change the sequence that the IPI is called without holding the page_table_lock anymore This shows up on CPUs where tlb_ops_need_broadcast() returns true. Input welcome how to resolve this issue. regards Peter ^ permalink raw reply [flat|nested] 12+ messages in thread
* ARM11MPcore: tlb_ops_need_broadcast causes deadlock 2012-03-22 12:24 ARM11MPcore: tlb_ops_need_broadcast causes deadlock EXTERNAL Waechtler Peter (Fa. TCP, CM-AI/PJ-CF31) @ 2012-03-23 17:30 ` Will Deacon 2012-03-25 12:08 ` Peter Waechtler [not found] ` <274124B9C6907D4B8CE985903EAA19E91B2D5798D9@SI-MBX06.de.bosch.com> 0 siblings, 2 replies; 12+ messages in thread From: Will Deacon @ 2012-03-23 17:30 UTC (permalink / raw) To: linux-arm-kernel Hi Peter, Thanks for the detailed explanation. However, there's one bit I'm not sure that I follow: On Thu, Mar 22, 2012 at 12:24:47PM +0000, EXTERNAL Waechtler Peter (Fa. TCP, CM-AI/PJ-CF31) wrote: > The cpuB executes some user code hitting that page and because the PTE > is marked "inaccessible", so that the attempt can be stored, it results > in a page fault. Shouldn't this trigger a prefetch abort rather than a data abort? > Unluckily the kswapd calls tlb_flush and that is configured to inform all > cpus about that change via IPIs. cpuB is in an user abort handler (__dabt_usr) > and the disaster takes its course: > > For checking if it's a thumb instruction that caused the fault the abort handler > accesses the page resulting into another fault - but now entering svc abort handler > (__dabt_svc) and that turns off interrupts! This is where I'm confused - why are we in the data abort handler due to an I-side fault? Thanks, Will ^ permalink raw reply [flat|nested] 12+ messages in thread
* ARM11MPcore: tlb_ops_need_broadcast causes deadlock 2012-03-23 17:30 ` Will Deacon @ 2012-03-25 12:08 ` Peter Waechtler 2012-03-25 13:09 ` Russell King - ARM Linux [not found] ` <274124B9C6907D4B8CE985903EAA19E91B2D5798D9@SI-MBX06.de.bosch.com> 1 sibling, 1 reply; 12+ messages in thread From: Peter Waechtler @ 2012-03-25 12:08 UTC (permalink / raw) To: linux-arm-kernel Will Deacon <will.deacon <at> arm.com> writes: > > > The cpuB executes some user code hitting that page and because the PTE > > is marked "inaccessible", so that the attempt can be stored, it results > > in a page fault. > > Shouldn't this trigger a prefetch abort rather than a data abort? > Yes, I think the abort part is not completely understood yet. So it's a data abort, some code referenced a page resulting in a dabrt_usr. Then the abort handler tries to read the instruction and gets another dabrt. That would mean that the code page was marked as inaccessible in the mean time. Now how could that happen? Probably on the other cpu? Weird stuff. But Will, is that tlb_flush necessary at all? The ARM has only 3 permission bits in the page table (APX and AP0 and AP1). The young/accessed bit is done via software. Peter ^ permalink raw reply [flat|nested] 12+ messages in thread
* ARM11MPcore: tlb_ops_need_broadcast causes deadlock 2012-03-25 12:08 ` Peter Waechtler @ 2012-03-25 13:09 ` Russell King - ARM Linux 2012-03-25 18:22 ` Peter Waechtler 0 siblings, 1 reply; 12+ messages in thread From: Russell King - ARM Linux @ 2012-03-25 13:09 UTC (permalink / raw) To: linux-arm-kernel On Sun, Mar 25, 2012 at 12:08:47PM +0000, Peter Waechtler wrote: > But Will, is that tlb_flush necessary at all? The ARM has only 3 permission > bits in the page table (APX and AP0 and AP1). The young/accessed bit is done > via software. Yes it most definitely is, because setting a page to be young means we must receive a subsequent fault to make it 'old' again. This means we must set the page to be inaccessible to get that fault, and flush the TLBs across all CPUs so that any CPU accessing that page receives a fault. ^ permalink raw reply [flat|nested] 12+ messages in thread
* ARM11MPcore: tlb_ops_need_broadcast causes deadlock 2012-03-25 13:09 ` Russell King - ARM Linux @ 2012-03-25 18:22 ` Peter Waechtler 2012-03-25 19:15 ` Russell King - ARM Linux 0 siblings, 1 reply; 12+ messages in thread From: Peter Waechtler @ 2012-03-25 18:22 UTC (permalink / raw) To: linux-arm-kernel On 25.03.2012 15:09, Russell King - ARM Linux wrote: > On Sun, Mar 25, 2012 at 12:08:47PM +0000, Peter Waechtler wrote: >> But Will, is that tlb_flush necessary at all? The ARM has only 3 permission >> bits in the page table (APX and AP0 and AP1). The young/accessed bit is done >> via software. > Yes it most definitely is, because setting a page to be young means we > must receive a subsequent fault to make it 'old' again. This means we > must set the page to be inaccessible to get that fault, and flush the > TLBs across all CPUs so that any CPU accessing that page receives a > fault. Ok I see, it's also not the "right or perfect" fix. But the worst thing that can happen is: young page: causes no page fault anymore and stays longer young than kswapd wants in case a TLB has a stale entry but that access would have marked it young again - no big deal? old page: causes a page fault so that it can be made young, a stale TLB would cause still a page fault - but in that path the tlb_flush still happens From my point of view: I definitively prefer to avoid the deadlock ;) I'm afraid that I missed something? I hope not :) Peter ^ permalink raw reply [flat|nested] 12+ messages in thread
* ARM11MPcore: tlb_ops_need_broadcast causes deadlock 2012-03-25 18:22 ` Peter Waechtler @ 2012-03-25 19:15 ` Russell King - ARM Linux 2012-03-25 20:22 ` Peter Waechtler 0 siblings, 1 reply; 12+ messages in thread From: Russell King - ARM Linux @ 2012-03-25 19:15 UTC (permalink / raw) To: linux-arm-kernel On Sun, Mar 25, 2012 at 08:22:05PM +0200, Peter Waechtler wrote: > On 25.03.2012 15:09, Russell King - ARM Linux wrote: >> On Sun, Mar 25, 2012 at 12:08:47PM +0000, Peter Waechtler wrote: >>> But Will, is that tlb_flush necessary at all? The ARM has only 3 permission >>> bits in the page table (APX and AP0 and AP1). The young/accessed bit is done >>> via software. >> Yes it most definitely is, because setting a page to be young means we >> must receive a subsequent fault to make it 'old' again. This means we >> must set the page to be inaccessible to get that fault, and flush the >> TLBs across all CPUs so that any CPU accessing that page receives a >> fault. > Ok I see, it's also not the "right or perfect" fix. It's not a fix or anything, it's required behaviour - otherwise we could end up throwing out pages from the system which are actually 'hot' because they've stayed in the TLB and we haven't received a fault to make them young again. Moreover, what about the case where we actually remove the page? Aren't we also holding the pte lock there? So I don't think there's an obvious solution to your deadlock. I think the real question is - in your example - why are you touching a userspace page with IRQs off _and_ expecting the fault to be fixed up? You never really explained what CPU B was doing. ^ permalink raw reply [flat|nested] 12+ messages in thread
* ARM11MPcore: tlb_ops_need_broadcast causes deadlock 2012-03-25 19:15 ` Russell King - ARM Linux @ 2012-03-25 20:22 ` Peter Waechtler 2012-03-25 21:55 ` Russell King - ARM Linux 0 siblings, 1 reply; 12+ messages in thread From: Peter Waechtler @ 2012-03-25 20:22 UTC (permalink / raw) To: linux-arm-kernel On 25.03.2012 21:15, Russell King - ARM Linux wrote: > On Sun, Mar 25, 2012 at 08:22:05PM +0200, Peter Waechtler wrote: >> On 25.03.2012 15:09, Russell King - ARM Linux wrote: >>> On Sun, Mar 25, 2012 at 12:08:47PM +0000, Peter Waechtler wrote: >>>> But Will, is that tlb_flush necessary at all? The ARM has only 3 permission >>>> bits in the page table (APX and AP0 and AP1). The young/accessed bit is done >>>> via software. >>> Yes it most definitely is, because setting a page to be young means we >>> must receive a subsequent fault to make it 'old' again. This means we >>> must set the page to be inaccessible to get that fault, and flush the >>> TLBs across all CPUs so that any CPU accessing that page receives a >>> fault. >> Ok I see, it's also not the "right or perfect" fix. > It's not a fix or anything, it's required behaviour - otherwise we could > end up throwing out pages from the system which are actually 'hot' because > they've stayed in the TLB and we haven't received a fault to make them > young again. I'm arguing solely on kswapd making a young page old. So it can't be a hot page. But yes in theory it's possible that it just become hot on another cpu... And again I don't understand the abort handler: why do we get a page fault on a young page then? grrh > Moreover, what about the case where we actually remove the page? I don't claim that this is the only way to deadlock - but this is the case we encounter. > Aren't we also holding the pte lock there? So I don't think there's an > obvious solution to your deadlock. > > I think the real question is - in your example - why are you touching > a userspace page with IRQs off _and_ expecting the fault to be fixed up? > You never really explained what CPU B was doing. It was running some user space program. It was not in the kernel. I will post the jtag probe screenshots tomorrow. Peter ^ permalink raw reply [flat|nested] 12+ messages in thread
* ARM11MPcore: tlb_ops_need_broadcast causes deadlock 2012-03-25 20:22 ` Peter Waechtler @ 2012-03-25 21:55 ` Russell King - ARM Linux 2012-03-26 15:20 ` Will Deacon 0 siblings, 1 reply; 12+ messages in thread From: Russell King - ARM Linux @ 2012-03-25 21:55 UTC (permalink / raw) To: linux-arm-kernel On Sun, Mar 25, 2012 at 10:22:49PM +0200, Peter Waechtler wrote: > On 25.03.2012 21:15, Russell King - ARM Linux wrote: >> On Sun, Mar 25, 2012 at 08:22:05PM +0200, Peter Waechtler wrote: >>> On 25.03.2012 15:09, Russell King - ARM Linux wrote: >>>> On Sun, Mar 25, 2012 at 12:08:47PM +0000, Peter Waechtler wrote: >>>>> But Will, is that tlb_flush necessary at all? The ARM has only 3 permission >>>>> bits in the page table (APX and AP0 and AP1). The young/accessed bit is done >>>>> via software. >>>> Yes it most definitely is, because setting a page to be young means we >>>> must receive a subsequent fault to make it 'old' again. This means we >>>> must set the page to be inaccessible to get that fault, and flush the >>>> TLBs across all CPUs so that any CPU accessing that page receives a >>>> fault. >>> Ok I see, it's also not the "right or perfect" fix. >> It's not a fix or anything, it's required behaviour - otherwise we could >> end up throwing out pages from the system which are actually 'hot' because >> they've stayed in the TLB and we haven't received a fault to make them >> young again. > > I'm arguing solely on kswapd making a young page old. So it can't be a > hot page. No, it can be a hot page because the way it finds out that it's a hot page is when it has to make the page repeatedly young, having first made it old. There's no other way the system can know what pages are being accessed by userspace. > But yes in theory it's possible that it just become hot on another cpu... > > And again I don't understand the abort handler: why do we get a page > fault on > a young page then? grrh Permissions? Userspace trying to write to the page when it isn't marked writable and dirty? >> Moreover, what about the case where we actually remove the page? > I don't claim that this is the only way to deadlock - but this is the > case we encounter. No, but you're arguing that we drop the TLB flush for your specific case. I'm telling you that's pointless if there's other cases as well which we'll deadlock. But that's neither here nor there because you haven't fully explained what the problem is yet... >> Aren't we also holding the pte lock there? So I don't think there's an >> obvious solution to your deadlock. >> >> I think the real question is - in your example - why are you touching >> a userspace page with IRQs off _and_ expecting the fault to be fixed up? >> You never really explained what CPU B was doing. > It was running some user space program. It was not in the kernel. > I will post the jtag probe screenshots tomorrow. Why do we need silly screen shots, why can't you explain it, or paste the output? I'm not going to bother looking at GIFs, PNGs or jpegs because my mail reader is text only. Moreover, user space programs can't disable interrupts. So you should not be receiving a data abort from userspace with interrupts disabled. Yes, when the CPU enters the data abort handler, it will disable interrupts, but we re-enable them before processing the abort. ^ permalink raw reply [flat|nested] 12+ messages in thread
* ARM11MPcore: tlb_ops_need_broadcast causes deadlock 2012-03-25 21:55 ` Russell King - ARM Linux @ 2012-03-26 15:20 ` Will Deacon 0 siblings, 0 replies; 12+ messages in thread From: Will Deacon @ 2012-03-26 15:20 UTC (permalink / raw) To: linux-arm-kernel Hmm, I somehow got dropped from this, but I'll pick it back up here. On Sun, Mar 25, 2012 at 10:55:35PM +0100, Russell King - ARM Linux wrote: > On Sun, Mar 25, 2012 at 10:22:49PM +0200, Peter Waechtler wrote: > > And again I don't understand the abort handler: why do we get a page > > fault on > > a young page then? grrh > > Permissions? Userspace trying to write to the page when it isn't marked > writable and dirty? > > >> Moreover, what about the case where we actually remove the page? > > I don't claim that this is the only way to deadlock - but this is the > > case we encounter. > > No, but you're arguing that we drop the TLB flush for your specific case. > I'm telling you that's pointless if there's other cases as well which > we'll deadlock. > > But that's neither here nor there because you haven't fully explained > what the problem is yet.. Yes, I'm inclined to agree with Russell here. Things don't add up and without further information there's not a lot we can do. Peter - are you able to reproduce and investigate this problem or was it a one-off observation? If you can figure out what really goes on inside CPU B in your example, then we may be able to look into this further. A good first step might be to work out what triggers the initial data abort and look at the state of the world at that point. Will ^ permalink raw reply [flat|nested] 12+ messages in thread
[parent not found: <274124B9C6907D4B8CE985903EAA19E91B2D5798D9@SI-MBX06.de.bosch.com>]
* ARM11MPcore: tlb_ops_need_broadcast causes deadlock [not found] ` <274124B9C6907D4B8CE985903EAA19E91B2D5798D9@SI-MBX06.de.bosch.com> @ 2012-03-27 13:32 ` Will Deacon 2012-03-27 17:41 ` George G. Davis 0 siblings, 1 reply; 12+ messages in thread From: Will Deacon @ 2012-03-27 13:32 UTC (permalink / raw) To: linux-arm-kernel Peter, On Mon, Mar 26, 2012 at 05:10:45PM +0100, EXTERNAL Waechtler Peter (Fa. TCP, CM-AI/PJ-CF31) wrote: > Probably just an "expected deadlock" as mentioned in the comment > of the v6_early_abort macro: > > * Purpose : obtain information about current aborted instruction. > * Note: we read user space. This means we might cause a data > * abort here if the I-TLB and D-TLB aren't seeing the same > * picture. Unfortunately, this does happen. We live with it. I don't see this referring to an expected deadlock. > For now the errata workarounds are removed for the 11MPcore > like proposed in this thread to avoid faulting with IRQs turned off: > > http://lists.infradead.org/pipermail/linux-arm-kernel/2011-February/041869.html > > But there it looked like an optimization, but it wasn't. I have a theory about what goes on: Say we have a valid (i.e. non-faulting) page which contains a load instruction that will fault. A CPU executes this load and takes a data abort but at the same time another CPU marks the page being executed as old. So when the original CPU tries to load the faulting instruction in do_thumb_abort, we take a second data abort (assumedly because we don't have a D-side TLB entry for the text page, so we immediately see that it is old) and, because interrupts were not yet re-enabled in the first fault, they are not enabled in the nested fault either. At this point, the faulting CPU will be unable to get the lock on the page, since the other guy has it and is waiting for the TLB broadcast to complete. Given that interrupts are disabled on the faulting CPU, everything locks up. Possible solutions: (1) Enable interrupts if they are enabled in the faulting context before loading instructions on the dabt path. (2) Use the FSR to determine whather a fault is due to a read or a write on ARMv6 - only load and disassemble the instruction on 1136 CPUs affected by erratum #325103 (which aren't SMP, so cannot hit the problem above). The latter is probably best. Please can you try the patch below? I've checked that it does the right thing on an r0p1 1136 core using a simple fork/swp program to trigger a CoW. Will diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index dfb0312..dedb885 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -1163,6 +1163,15 @@ if !MMU source "arch/arm/Kconfig-nommu" endif +config ARM_ERRATA_326103 + bool "ARM errata: FSR write bit incorrect on a SWP to read-only memory" + depends on CPU_V6 + help + Executing a SWP instruction to read-only memory does not set bit 11 + of the FSR on the ARM 1136 prior to r1p0. This causes the kernel to + treat the access as a read, preventing a COW from occurring and + causing the faulting task to livelock. + config ARM_ERRATA_411920 bool "ARM errata: Invalidation of the Instruction Cache operation can fail" depends on CPU_V6 || CPU_V6K diff --git a/arch/arm/mm/abort-ev6.S b/arch/arm/mm/abort-ev6.S index ff1f7cc..8074199 100644 --- a/arch/arm/mm/abort-ev6.S +++ b/arch/arm/mm/abort-ev6.S @@ -26,18 +26,23 @@ ENTRY(v6_early_abort) mrc p15, 0, r1, c5, c0, 0 @ get FSR mrc p15, 0, r0, c6, c0, 0 @ get FAR /* - * Faulty SWP instruction on 1136 doesn't set bit 11 in DFSR (erratum 326103). - * The test below covers all the write situations, including Java bytecodes + * Faulty SWP instruction on 1136 doesn't set bit 11 in DFSR. */ - bic r1, r1, #1 << 11 @ clear bit 11 of FSR +#ifdef CONFIG_ARM_ERRATA_326103 + ldr ip, =0x4107b36 + mrc p15, 0, r3, c0, c0, 0 @ get processor id + teq ip, r3, lsr #4 @ r0 ARM1136? + bne do_DataAbort tst r5, #PSR_J_BIT @ Java? + tsteq r5, #PSR_T_BIT @ Thumb? bne do_DataAbort - do_thumb_abort fsr=r1, pc=r4, psr=r5, tmp=r3 - ldreq r3, [r4] @ read aborted ARM instruction + bic r1, r1, #1 << 11 @ clear bit 11 of FSR + ldr r3, [r4] @ read aborted ARM instruction #ifdef CONFIG_CPU_ENDIAN_BE8 - reveq r3, r3 + rev r3, r3 #endif do_ldrd_abort tmp=ip, insn=r3 tst r3, #1 << 20 @ L = 0 -> write orreq r1, r1, #1 << 11 @ yes. +#endif b do_DataAbort ^ permalink raw reply related [flat|nested] 12+ messages in thread
* ARM11MPcore: tlb_ops_need_broadcast causes deadlock 2012-03-27 13:32 ` Will Deacon @ 2012-03-27 17:41 ` George G. Davis 2012-03-28 8:56 ` Will Deacon 0 siblings, 1 reply; 12+ messages in thread From: George G. Davis @ 2012-03-27 17:41 UTC (permalink / raw) To: linux-arm-kernel Hello, On Tue, Mar 27, 2012 at 02:32:26PM +0100, Will Deacon wrote: > Peter, > > On Mon, Mar 26, 2012 at 05:10:45PM +0100, EXTERNAL Waechtler Peter (Fa. TCP, CM-AI/PJ-CF31) wrote: > > Probably just an "expected deadlock" as mentioned in the comment > > of the v6_early_abort macro: > > > > * Purpose : obtain information about current aborted instruction. > > * Note: we read user space. This means we might cause a data > > * abort here if the I-TLB and D-TLB aren't seeing the same > > * picture. Unfortunately, this does happen. We live with it. > > I don't see this referring to an expected deadlock. > > > For now the errata workarounds are removed for the 11MPcore > > like proposed in this thread to avoid faulting with IRQs turned off: > > > > http://lists.infradead.org/pipermail/linux-arm-kernel/2011-February/041869.html > > > > But there it looked like an optimization, but it wasn't. > > I have a theory about what goes on: > > Say we have a valid (i.e. non-faulting) page which contains a load instruction > that will fault. A CPU executes this load and takes a data abort but at the > same time another CPU marks the page being executed as old. So when the > original CPU tries to load the faulting instruction in do_thumb_abort, we take > a second data abort (assumedly because we don't have a D-side TLB entry for the > text page, so we immediately see that it is old) and, because interrupts were > not yet re-enabled in the first fault, they are not enabled in the nested fault > either. This is precisely what happened here. The only difference is that the traces I've reviewed faulted at "not_thumb:" while attempting to read the userspace ARM instruction which lead to the (second) data abort with interrupts disabled. > At this point, the faulting CPU will be unable to get the lock on the page, > since the other guy has it and is waiting for the TLB broadcast to complete. > Given that interrupts are disabled on the faulting CPU, everything locks up. Right again. > Possible solutions: > > (1) Enable interrupts if they are enabled in the faulting context before > loading instructions on the dabt path. > > (2) Use the FSR to determine whather a fault is due to a read or a write on > ARMv6 - only load and disassemble the instruction on 1136 CPUs affected > by erratum #325103 (which aren't SMP, so cannot hit the problem above). We submitted a change similar to (2) above to the ARM Linux kernel mailing list for RFC [1] over a year ago. That change [1] is similar to your change below. > The latter is probably best. Please can you try the patch below? I've > checked that it does the right thing on an r0p1 1136 core using a simple > fork/swp program to trigger a CoW. I tested your patch but only on a CPU_V6K based SMP machine. In this case, ARM_ERRATA_326103 depends on CPU_V6, so is left disabled, renderring this patch functionally equivaltent to [1] below. > Will > > > diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig > index dfb0312..dedb885 100644 > --- a/arch/arm/Kconfig > +++ b/arch/arm/Kconfig > @@ -1163,6 +1163,15 @@ if !MMU > source "arch/arm/Kconfig-nommu" > endif > > +config ARM_ERRATA_326103 > + bool "ARM errata: FSR write bit incorrect on a SWP to read-only memory" > + depends on CPU_V6 > + help > + Executing a SWP instruction to read-only memory does not set bit 11 > + of the FSR on the ARM 1136 prior to r1p0. This causes the kernel to > + treat the access as a read, preventing a COW from occurring and > + causing the faulting task to livelock. > + > config ARM_ERRATA_411920 > bool "ARM errata: Invalidation of the Instruction Cache operation can fail" > depends on CPU_V6 || CPU_V6K > diff --git a/arch/arm/mm/abort-ev6.S b/arch/arm/mm/abort-ev6.S > index ff1f7cc..8074199 100644 > --- a/arch/arm/mm/abort-ev6.S > +++ b/arch/arm/mm/abort-ev6.S > @@ -26,18 +26,23 @@ ENTRY(v6_early_abort) > mrc p15, 0, r1, c5, c0, 0 @ get FSR > mrc p15, 0, r0, c6, c0, 0 @ get FAR > /* > - * Faulty SWP instruction on 1136 doesn't set bit 11 in DFSR (erratum 326103). > - * The test below covers all the write situations, including Java bytecodes > + * Faulty SWP instruction on 1136 doesn't set bit 11 in DFSR. > */ > - bic r1, r1, #1 << 11 @ clear bit 11 of FSR > +#ifdef CONFIG_ARM_ERRATA_326103 > + ldr ip, =0x4107b36 > + mrc p15, 0, r3, c0, c0, 0 @ get processor id > + teq ip, r3, lsr #4 @ r0 ARM1136? > + bne do_DataAbort > tst r5, #PSR_J_BIT @ Java? > + tsteq r5, #PSR_T_BIT @ Thumb? > bne do_DataAbort > - do_thumb_abort fsr=r1, pc=r4, psr=r5, tmp=r3 > - ldreq r3, [r4] @ read aborted ARM instruction > + bic r1, r1, #1 << 11 @ clear bit 11 of FSR > + ldr r3, [r4] @ read aborted ARM instruction > #ifdef CONFIG_CPU_ENDIAN_BE8 > - reveq r3, r3 > + rev r3, r3 > #endif > do_ldrd_abort tmp=ip, insn=r3 > tst r3, #1 << 20 @ L = 0 -> write > orreq r1, r1, #1 << 11 @ yes. > +#endif > b do_DataAbort FYI/FWIW, your patch above suffered whitespace damage. Thanks! -- Regards, George [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2011-February/041733.html ^ permalink raw reply [flat|nested] 12+ messages in thread
* ARM11MPcore: tlb_ops_need_broadcast causes deadlock 2012-03-27 17:41 ` George G. Davis @ 2012-03-28 8:56 ` Will Deacon 0 siblings, 0 replies; 12+ messages in thread From: Will Deacon @ 2012-03-28 8:56 UTC (permalink / raw) To: linux-arm-kernel On Tue, Mar 27, 2012 at 06:41:52PM +0100, George G. Davis wrote: > On Tue, Mar 27, 2012 at 02:32:26PM +0100, Will Deacon wrote: > > I have a theory about what goes on: > > > > Say we have a valid (i.e. non-faulting) page which contains a load instruction > > that will fault. A CPU executes this load and takes a data abort but at the > > same time another CPU marks the page being executed as old. So when the > > original CPU tries to load the faulting instruction in do_thumb_abort, we take > > a second data abort (assumedly because we don't have a D-side TLB entry for the > > text page, so we immediately see that it is old) and, because interrupts were > > not yet re-enabled in the first fault, they are not enabled in the nested fault > > either. > > This is precisely what happened here. The only difference is that the traces > I've reviewed faulted at "not_thumb:" while attempting to read the userspace > ARM instruction which lead to the (second) data abort with interrupts disabled. Right, I think that's the same problem though. > > Possible solutions: > > > > (1) Enable interrupts if they are enabled in the faulting context before > > loading instructions on the dabt path. > > > > (2) Use the FSR to determine whather a fault is due to a read or a write on > > ARMv6 - only load and disassemble the instruction on 1136 CPUs affected > > by erratum #325103 (which aren't SMP, so cannot hit the problem above). > > We submitted a change similar to (2) above to the ARM Linux kernel mailing > list for RFC [1] over a year ago. That change [1] is similar to your change > below. Apologies, I missed that. Are you happy for me to continue with my change below? I'd really like it if Peter could confirm it fixes his problem. > > The latter is probably best. Please can you try the patch below? I've > > checked that it does the right thing on an r0p1 1136 core using a simple > > fork/swp program to trigger a CoW. > > I tested your patch but only on a CPU_V6K based SMP machine. In this > case, ARM_ERRATA_326103 depends on CPU_V6, so is left disabled, renderring > this patch functionally equivaltent to [1] below. Thanks George. Do you have a testcase for reliably reproducing the deadlock without this patch applied? > FYI/FWIW, your patch above suffered whitespace damage. That'll be the sorry excuse for an email system that I'm forced to use. Perhaps the list archive has a better version: http://lists.arm.linux.org.uk/lurker/attach/1 at 20120327.133226.639a8b79.attach Will ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2012-03-28 8:56 UTC | newest] Thread overview: 12+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2012-03-22 12:24 ARM11MPcore: tlb_ops_need_broadcast causes deadlock EXTERNAL Waechtler Peter (Fa. TCP, CM-AI/PJ-CF31) 2012-03-23 17:30 ` Will Deacon 2012-03-25 12:08 ` Peter Waechtler 2012-03-25 13:09 ` Russell King - ARM Linux 2012-03-25 18:22 ` Peter Waechtler 2012-03-25 19:15 ` Russell King - ARM Linux 2012-03-25 20:22 ` Peter Waechtler 2012-03-25 21:55 ` Russell King - ARM Linux 2012-03-26 15:20 ` Will Deacon [not found] ` <274124B9C6907D4B8CE985903EAA19E91B2D5798D9@SI-MBX06.de.bosch.com> 2012-03-27 13:32 ` Will Deacon 2012-03-27 17:41 ` George G. Davis 2012-03-28 8:56 ` Will Deacon
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).