Bringing up sequence for non-boot CPU fails

xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed

* Bringing up sequence for non-boot CPU fails
@ 2014-02-18 11:44 Oleksandr Tyshchenko
  2014-02-18 12:14 ` Ian Campbell
  0 siblings, 1 reply; 17+ messages in thread
From: Oleksandr Tyshchenko @ 2014-02-18 11:44 UTC (permalink / raw)
  To: xen-devel

Hi, all.
I am working with DRA7 (OMAP5) platform.

I rarely see that Hypervisor hangs during bringing up CPU1.
It is not a deadlock, XEN console is working, but system is unusable(

We have next situation.
The CPU0 can not exit from busy loop in __cpu_up func (smpboot.c),
he is waiting CPU1 to become online but CPU1 stucks during bring up sequence.
The last print which I see is "- Turning on paging -".
It seems, it happens during enable MMU/Cache or immediately after it.
But anyway,
instruction "mov pc, r1" does not result to transition to the label "paging".

(XEN) Latest ChangeSet: Mon Feb 17 13:43:16 2014 +0200 git:c30f01f
(XEN) Processor: 412fc0f2: "ARM Limited", variant: 0x2, part 0xc0f, rev 0x2
(XEN) 32-bit Execution:
(XEN)   Processor Features: 00001131:00011011
(XEN)     Instruction Sets: AArch32 Thumb Thumb-2 ThumbEE Jazelle
(XEN)     Extensions: GenericTimer Security
(XEN)   Debug Features: 02010555
(XEN)   Auxiliary Features: 00000000
(XEN)   Memory Model Features: 10201105 20000000 01240000 02102211
(XEN)  ISA Features: 02101110 13112111 21232041 11112131 10011142 00000000
(XEN) Platform: TI DRA7
(XEN) /psci method must be smc, but is: "hvc"
(XEN) Set AuxCoreBoot1 to 00000000dec0004c (0020004c)
(XEN) Set AuxCoreBoot0 to 0x20
(XEN) Generic Timer IRQ: phys=30 hyp=26 virt=27
(XEN) Using generic timer at 6144 KHz
(XEN) GIC initialization:
(XEN)         gic_dist_addr=0000000048211000
(XEN)         gic_cpu_addr=0000000048212000
(XEN)         gic_hyp_addr=0000000048214000
(XEN)         gic_vcpu_addr=0000000048216000
(XEN)         gic_maintenance_irq=25
(XEN) GIC: 192 lines, 2 cpus, secure (IID 0000043b).
(XEN) Using scheduler: SMP Credit Scheduler (credit)
(XEN) Allocated console ring of 16 KiB.
(XEN) VFP implementer 0x41 architecture 4 part 0x30 variant 0xf rev 0x0
(XEN) Bringing up CPU1
- CPU 00000001 booting -
- NOT HYP, setting it ... -
- Xen starting in Hyp mode -
- Setting up control registers -
- Turning on paging -

Currently I am using XEN 4.4 RC3, but this issue is reproduced since XEN 4.3.
We do not have any patches related to bring up sequence on top of 4.4
RC3 except one:
So long as our bootloader does not set HYP mode we need to keep this
hack in Hypervisor.

diff --git a/xen/arch/arm/arm32/head.S b/xen/arch/arm/arm32/head.S
index 96230ac..6fa1e3d 100644
--- a/xen/arch/arm/arm32/head.S
+++ b/xen/arch/arm/arm32/head.S
@@ -174,6 +174,15 @@ common_start:
         teq   r0, #0x1a              /* Hyp Mode? */
         beq   hyp

+        PRINT("- NOT HYP, setting it ... -\r\n")
+        bl    enter_hyp_mode
+
+        /* Check that set HYP mode was succesfull */
+        mrs   r0, cpsr
+        and   r0, r0, #0x1f          /* Mode is in the low 5 bits of CPSR */
+        teq   r0, #0x1a              /* Hyp Mode? */
+        beq   hyp
+
         /* OK, we're boned. */
         PRINT("- Xen must be entered in NS Hyp mode -\r\n" \
               "- Please update the bootloader -\r\n")
@@ -547,6 +556,24 @@ putn:   mov   pc, lr

 #endif /* !EARLY_PRINTK */

+GLOBAL(enter_hyp_mode)
+enter_hyp_mode:
+        adr   r0, save
+        stmea r0, {r4-r13,lr}
+        ldr   r12, =0x102
+        adr   r0, hyp_return
+        dsb
+        isb
+        dmb
+        smc   #0
+hyp_return:
+        adr   r0, save
+        ldmfd r0, {r4-r13,pc}
+save:
+        .rept 11
+        .word 0
+        .endr
+
 /*
  * Local variables:
  * mode: ASM


Please, could anyone give me advice about this issue? And what do you
think about solution to exit from busy loop by timeout and restart
CPU1 bringing up sequence in this case?

Thank you.

-- 

Oleksandr Tyshchenko | Embedded Developer
GlobalLogic
www.globallogic.com

http://www.globallogic.com/email_disclaimer.txt

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: Bringing up sequence for non-boot CPU fails
  2014-02-18 11:44 Bringing up sequence for non-boot CPU fails Oleksandr Tyshchenko
@ 2014-02-18 12:14 ` Ian Campbell
  2014-02-18 12:36   ` Andrii Anisov
  0 siblings, 1 reply; 17+ messages in thread
From: Ian Campbell @ 2014-02-18 12:14 UTC (permalink / raw)
  To: Oleksandr Tyshchenko; +Cc: xen-devel

On Tue, 2014-02-18 at 13:44 +0200, Oleksandr Tyshchenko wrote:

> +GLOBAL(enter_hyp_mode)
> +enter_hyp_mode:
> +        adr   r0, save
> +        stmea r0, {r4-r13,lr}
> +        ldr   r12, =0x102
> +        adr   r0, hyp_return
> +        dsb
> +        isb
> +        dmb
> +        smc   #0

Who/what implements this handler?

> +hyp_return:
> +        adr   r0, save
> +        ldmfd r0, {r4-r13,pc}
> +save:
> +        .rept 11
> +        .word 0
> +        .endr
> +
>  /*
>   * Local variables:
>   * mode: ASM
> 
> 
> Please, could anyone give me advice about this issue?

Do you have any hardware debugging tools which could give some insight?

Usually these things are down to either missing cache flushes or
barriers, but tracking them down has historically been a total pain.

>  And what do you
> think about solution to exit from busy loop by timeout and restart
> CPU1 bringing up sequence in this case?

A timeout isn't a bad idea, although I would not be inclined to try
again with the CPU in an indeterminate state.

Either we should carry on without it or we should panic (which is more
obvious than a hang). I think carrying on would be surprising (you'd
only get half the processing power you expected).

Ian.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Bringing up sequence for non-boot CPU fails
  2014-02-18 12:14 ` Ian Campbell
@ 2014-02-18 12:36   ` Andrii Anisov
  2014-02-18 12:53     ` Ian Campbell
  0 siblings, 1 reply; 17+ messages in thread
From: Andrii Anisov @ 2014-02-18 12:36 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Oleksandr Tyshchenko, xen-devel@lists.xen.org


[-- Attachment #1.1: Type: text/plain, Size: 3131 bytes --]

>
> > +GLOBAL(enter_hyp_mode)
> > +enter_hyp_mode:
> > +        adr   r0, save
> > +        stmea r0, {r4-r13,lr}
> > +        ldr   r12, =0x102
> > +        adr   r0, hyp_return
> > +        dsb
> > +        isb
> > +        dmb
> > +        smc   #0
>
> Who/what implements this handler?
>

Ian, this handler is implemented by ROM code, and this is the common OMAP
sequence to switch to HYP mode. On our side we decided to leave switch to
hyp in XEN for now.

Do you have any hardware debugging tools which could give some insight?
>

Yep, we have one (TI's Code Composer Studio with STM560v2 JTAG) but it has
no proper HYP mode debug support yet, TI says it will have in 6 months or
so :( So we the only thing we can do with it is stop CPU at some moment and
see some registers, no breakpoints or stepping.

What we discovered yet is that the last command executed by CPU1 before
hang is

mcr   CP32(r0, HSCTLR)       /* now paging is enabled */

After this PC contains 0x00000004, CPSR.M is b11010 what is HYP mode, not
abort.
It looks like we have broken MMU translation.

Usually these things are down to either missing cache flushes or barriers,
> but tracking them down has historically been a total pain.
>
I suspected missing flushes during CPU1 MMU tables preparation but that
code looks correct, I do not see any issues there.

Andrii Anisov | Software Engineer
GlobalLogic
Kyiv, 03038, Protasov Business Park, M.Grinchenka, 2/1
P +38.044.492.9695x3664  M +380505738852  S andriyanisov
www.globallogic.com
<http://www.globallogic.com/>
http://www.globallogic.com/email_disclaimer.txt


On Tue, Feb 18, 2014 at 2:14 PM, Ian Campbell <Ian.Campbell@citrix.com>wrote:

> On Tue, 2014-02-18 at 13:44 +0200, Oleksandr Tyshchenko wrote:
>
> > +GLOBAL(enter_hyp_mode)
> > +enter_hyp_mode:
> > +        adr   r0, save
> > +        stmea r0, {r4-r13,lr}
> > +        ldr   r12, =0x102
> > +        adr   r0, hyp_return
> > +        dsb
> > +        isb
> > +        dmb
> > +        smc   #0
>
> Who/what implements this handler?
>
> > +hyp_return:
> > +        adr   r0, save
> > +        ldmfd r0, {r4-r13,pc}
> > +save:
> > +        .rept 11
> > +        .word 0
> > +        .endr
> > +
> >  /*
> >   * Local variables:
> >   * mode: ASM
> >
> >
> > Please, could anyone give me advice about this issue?
>
> Do you have any hardware debugging tools which could give some insight?
>
> Usually these things are down to either missing cache flushes or
> barriers, but tracking them down has historically been a total pain.
>
> >  And what do you
> > think about solution to exit from busy loop by timeout and restart
> > CPU1 bringing up sequence in this case?
>
> A timeout isn't a bad idea, although I would not be inclined to try
> again with the CPU in an indeterminate state.
>
> Either we should carry on without it or we should panic (which is more
> obvious than a hang). I think carrying on would be surprising (you'd
> only get half the processing power you expected).
>
> Ian.
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>

[-- Attachment #1.2: Type: text/html, Size: 7216 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Bringing up sequence for non-boot CPU fails
  2014-02-18 12:36   ` Andrii Anisov
@ 2014-02-18 12:53     ` Ian Campbell
  2014-02-18 13:23       ` Andrii Anisov
  0 siblings, 1 reply; 17+ messages in thread
From: Ian Campbell @ 2014-02-18 12:53 UTC (permalink / raw)
  To: Andrii Anisov; +Cc: Oleksandr Tyshchenko, xen-devel@lists.xen.org

On Tue, 2014-02-18 at 14:36 +0200, Andrii Anisov wrote:
>         > +GLOBAL(enter_hyp_mode)
>         > +enter_hyp_mode:
>         > +        adr   r0, save
>         > +        stmea r0, {r4-r13,lr}
>         > +        ldr   r12, =0x102
>         > +        adr   r0, hyp_return
>         > +        dsb
>         > +        isb
>         > +        dmb
>         > +        smc   #0
>         
>         
>         Who/what implements this handler?
> 
> 
> Ian, this handler is implemented by ROM code, and this is the common
> OMAP sequence to switch to HYP mode. On our side we decided to leave
> switch to hyp in XEN for now.

OK, fair enough. I was wonder if maybe it left some cache lines dirty or
leaving caches enabled or something? It might be worth adding a full
cacheflush (i.e. the loop over set/way stuff).

>         Do you have any hardware debugging tools which could give some
>         insight?
>  
> Yep, we have one (TI's Code Composer Studio with STM560v2 JTAG) but it
> has no proper HYP mode debug support yet, TI says it will have in 6
> months or so :( So we the only thing we can do with it is stop CPU at
> some moment and see some registers, no breakpoints or stepping.

Hrm, I guess that might be sufficient to gain some insight.

> What we discovered yet is that the last command executed by CPU1
> before hang is 
>         mcr   CP32(r0, HSCTLR)       /* now paging is enabled */
> After this PC contains 0x00000004, CPSR.M is b11010 what is HYP mode,
> not abort.

I think this is correct for a trap taken from HYP mode -- you jump to
the corresponding vector but there is no actual mode change (since ABT
mode is PL1 and HYP mode is PL2 that would mean you dropped a privilege
level).

The patch at
http://lists.xen.org/archives/html/xen-devel/2013-09/msg00886.html might
help confirm this.

> It looks like we have broken MMU translation.

What do the fault status and fault address registers say?

Offset 0x4 is undefined instruction not prefetch abort which suggests
that there is at least some mapping present, but apparently not the
expected one so the instruction is invalid.

Is the debugger able to tell you what bytes it read instead of the real
instruction?

>         Usually these things are down to either missing cache flushes
>         or barriers, but tracking them down has historically been a
>         total pain.
> I suspected missing flushes during CPU1 MMU tables preparation but
> that code looks correct, I do not see any issues there.

Right, that's why I was wondering about firmware leaving dirty cache
lines around.

On the Cortex-A15 we (actually, firmware) need to set a bit in the ACTLR
to enable cache coherency etc -- I suppose it is worth checking that the
OMAP doesn't have anything similar but I expect that this would be
handled in the firmware already.

Ian.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Bringing up sequence for non-boot CPU fails
  2014-02-18 12:53     ` Ian Campbell
@ 2014-02-18 13:23       ` Andrii Anisov
  2014-02-18 13:34         ` Ian Campbell
  0 siblings, 1 reply; 17+ messages in thread
From: Andrii Anisov @ 2014-02-18 13:23 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Oleksandr Tyshchenko, xen-devel@lists.xen.org


[-- Attachment #1.1: Type: text/plain, Size: 3527 bytes --]

Ian,

First of all, unfortunately none of our team are so strong in ARM processor
low levels (yet).
We will check you suggestions with full cache flush and early trap debug
patch.

I've checked right now, the PC is 0xc (prefetch abort).

Andrii Anisov | Software Engineer
GlobalLogic
Kyiv, 03038, Protasov Business Park, M.Grinchenka, 2/1
P +38.044.492.9695x3664  M +380505738852  S andriyanisov
www.globallogic.com
<http://www.globallogic.com/>
http://www.globallogic.com/email_disclaimer.txt


On Tue, Feb 18, 2014 at 2:53 PM, Ian Campbell <Ian.Campbell@citrix.com>wrote:

> On Tue, 2014-02-18 at 14:36 +0200, Andrii Anisov wrote:
> >         > +GLOBAL(enter_hyp_mode)
> >         > +enter_hyp_mode:
> >         > +        adr   r0, save
> >         > +        stmea r0, {r4-r13,lr}
> >         > +        ldr   r12, =0x102
> >         > +        adr   r0, hyp_return
> >         > +        dsb
> >         > +        isb
> >         > +        dmb
> >         > +        smc   #0
> >
> >
> >         Who/what implements this handler?
> >
> >
> > Ian, this handler is implemented by ROM code, and this is the common
> > OMAP sequence to switch to HYP mode. On our side we decided to leave
> > switch to hyp in XEN for now.
>
> OK, fair enough. I was wonder if maybe it left some cache lines dirty or
> leaving caches enabled or something? It might be worth adding a full
> cacheflush (i.e. the loop over set/way stuff).
>
> >         Do you have any hardware debugging tools which could give some
> >         insight?
> >
> > Yep, we have one (TI's Code Composer Studio with STM560v2 JTAG) but it
> > has no proper HYP mode debug support yet, TI says it will have in 6
> > months or so :( So we the only thing we can do with it is stop CPU at
> > some moment and see some registers, no breakpoints or stepping.
>
> Hrm, I guess that might be sufficient to gain some insight.
>
> > What we discovered yet is that the last command executed by CPU1
> > before hang is
> >         mcr   CP32(r0, HSCTLR)       /* now paging is enabled */
> > After this PC contains 0x00000004, CPSR.M is b11010 what is HYP mode,
> > not abort.
>
> I think this is correct for a trap taken from HYP mode -- you jump to
> the corresponding vector but there is no actual mode change (since ABT
> mode is PL1 and HYP mode is PL2 that would mean you dropped a privilege
> level).
>
> The patch at
> http://lists.xen.org/archives/html/xen-devel/2013-09/msg00886.html might
> help confirm this.
>
> > It looks like we have broken MMU translation.
>
> What do the fault status and fault address registers say?
>
> Offset 0x4 is undefined instruction not prefetch abort which suggests
> that there is at least some mapping present, but apparently not the
> expected one so the instruction is invalid.
>
> Is the debugger able to tell you what bytes it read instead of the real
> instruction?
>
> >         Usually these things are down to either missing cache flushes
> >         or barriers, but tracking them down has historically been a
> >         total pain.
> > I suspected missing flushes during CPU1 MMU tables preparation but
> > that code looks correct, I do not see any issues there.
>
> Right, that's why I was wondering about firmware leaving dirty cache
> lines around.
>
> On the Cortex-A15 we (actually, firmware) need to set a bit in the ACTLR
> to enable cache coherency etc -- I suppose it is worth checking that the
> OMAP doesn't have anything similar but I expect that this would be
> handled in the firmware already.
>
> Ian.
>
>
>

[-- Attachment #1.2: Type: text/html, Size: 6758 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Bringing up sequence for non-boot CPU fails
  2014-02-18 13:23       ` Andrii Anisov
@ 2014-02-18 13:34         ` Ian Campbell
  2014-02-18 13:37           ` Andrii Anisov
  0 siblings, 1 reply; 17+ messages in thread
From: Ian Campbell @ 2014-02-18 13:34 UTC (permalink / raw)
  To: Andrii Anisov; +Cc: Oleksandr Tyshchenko, xen-devel@lists.xen.org

On Tue, 2014-02-18 at 15:23 +0200, Andrii Anisov wrote:
> Ian,
> 
> 
> First of all, unfortunately none of our team are so strong in ARM
> processor low levels (yet).
> We will check you suggestions with full cache flush and early trap
> debug patch.

Thanks.

> I've checked right now, the PC is 0xc (prefetch abort).

Good, that makes more sense!

Ian.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Bringing up sequence for non-boot CPU fails
  2014-02-18 13:34         ` Ian Campbell
@ 2014-02-18 13:37           ` Andrii Anisov
  2014-02-18 15:33             ` Ian Campbell
  0 siblings, 1 reply; 17+ messages in thread
From: Andrii Anisov @ 2014-02-18 13:37 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Oleksandr Tyshchenko, xen-devel@lists.xen.org


[-- Attachment #1.1: Type: text/plain, Size: 477 bytes --]

>
> > I've checked right now, the PC is 0xc (prefetch abort).
>
> Good, that makes more sense!
>

Any additional suggestions, by any chance?
As I understand your early trap debug patch would not give a lot of help
here.

Andrii Anisov | Software Engineer
GlobalLogic
Kyiv, 03038, Protasov Business Park, M.Grinchenka, 2/1
P +38.044.492.9695x3664  M +380505738852  S andriyanisov
www.globallogic.com
<http://www.globallogic.com/>
http://www.globallogic.com/email_disclaimer.txt

[-- Attachment #1.2: Type: text/html, Size: 2922 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Bringing up sequence for non-boot CPU fails
  2014-02-18 13:37           ` Andrii Anisov
@ 2014-02-18 15:33             ` Ian Campbell
  2014-02-18 18:17               ` Oleksandr Tyshchenko
  0 siblings, 1 reply; 17+ messages in thread
From: Ian Campbell @ 2014-02-18 15:33 UTC (permalink / raw)
  To: Andrii Anisov; +Cc: Oleksandr Tyshchenko, xen-devel@lists.xen.org

On Tue, 2014-02-18 at 15:37 +0200, Andrii Anisov wrote:
>         > I've checked right now, the PC is 0xc (prefetch abort).
>         
>         
>         Good, that makes more sense!
> 
> 
> Any additional suggestions, by any chance?

I'm afraid nothing is coming to my mind. :-(

> As I understand your early trap debug patch would not give a lot of
> help here.

Not terribly -- but I suppose it would allow you to add logging of the
fault address|status registers etc, although if your debugger can handle
reading those then that is of little extra value.

Ian.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Bringing up sequence for non-boot CPU fails
  2014-02-18 15:33             ` Ian Campbell
@ 2014-02-18 18:17               ` Oleksandr Tyshchenko
  2014-02-18 18:33                 ` Julien Grall
  2014-02-19  9:04                 ` Ian Campbell
  0 siblings, 2 replies; 17+ messages in thread
From: Oleksandr Tyshchenko @ 2014-02-18 18:17 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Andrii Anisov, xen-devel@lists.xen.org

Ian,
I have checked your suggestion with full cache flush.
For this purposes I have used ARMV7 specific function from our U-Boot.
This function performs clean and invalidation of the entire data cache
at all levels.
I call it in __cpu_up() before arch_cpu_up() calling.

It seems, the issue doesn't reproduced.
Now we need to find missing cache flushes).

Thank you.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Bringing up sequence for non-boot CPU fails
  2014-02-18 18:17               ` Oleksandr Tyshchenko
@ 2014-02-18 18:33                 ` Julien Grall
  2014-02-18 18:40                   ` Oleksandr Tyshchenko
  2014-02-19  9:15                   ` Ian Campbell
  2014-02-19  9:04                 ` Ian Campbell
  1 sibling, 2 replies; 17+ messages in thread
From: Julien Grall @ 2014-02-18 18:33 UTC (permalink / raw)
  To: Oleksandr Tyshchenko; +Cc: Ian Campbell, Andrii Anisov, xen-devel@lists.xen.org

On 02/18/2014 06:17 PM, Oleksandr Tyshchenko wrote:
> Ian,

Hello Oleksandr,

> I have checked your suggestion with full cache flush.
> For this purposes I have used ARMV7 specific function from our U-Boot.
> This function performs clean and invalidation of the entire data cache
> at all levels.

Did you try to only clean the cache? When page table for the secondary
CPU is created Xen only clean the cache for the specific range.
I suspect it's not enough and we need to invalidate.

It should be easy to try with this small patch:

diff --git a/xen/include/asm-arm/page.h b/xen/include/asm-arm/page.h
index e00be9e..5a8aba2 100644
--- a/xen/include/asm-arm/page.h
+++ b/xen/include/asm-arm/page.h
@@ -234,7 +234,7 @@ static inline void clean_xen_dcache_va_range(void
*p, unsigned long size)
     void *end;
     dsb();           /* So the CPU issues all writes to the range */
     for ( end = p + size; p < end; p += cacheline_bytes )
-        asm volatile (__clean_xen_dcache_one(0) : : "r" (p));
+        asm volatile (__clean_and_invalidate_xen_dcache_one(0) : : "r"
(p));
     dsb();           /* So we know the flushes happen before continuing */
 }

Regards,

-- 
Julien Grall

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: Bringing up sequence for non-boot CPU fails
  2014-02-18 18:33                 ` Julien Grall
@ 2014-02-18 18:40                   ` Oleksandr Tyshchenko
  2014-02-19  9:15                   ` Ian Campbell
  1 sibling, 0 replies; 17+ messages in thread
From: Oleksandr Tyshchenko @ 2014-02-18 18:40 UTC (permalink / raw)
  To: Julien Grall; +Cc: Ian Campbell, Andrii Anisov, xen-devel@lists.xen.org

On Tue, Feb 18, 2014 at 8:33 PM, Julien Grall <julien.grall@citrix.com> wrote:
> On 02/18/2014 06:17 PM, Oleksandr Tyshchenko wrote:
>> Ian,
>
> Hello Oleksandr,
>
>> I have checked your suggestion with full cache flush.
>> For this purposes I have used ARMV7 specific function from our U-Boot.
>> This function performs clean and invalidation of the entire data cache
>> at all levels.
>
> Did you try to only clean the cache? When page table for the secondary
> CPU is created Xen only clean the cache for the specific range.
> I suspect it's not enough and we need to invalidate.
I tried to invalidate too. Because this function performs a clean &
invalidation.
>
> It should be easy to try with this small patch:
>
> diff --git a/xen/include/asm-arm/page.h b/xen/include/asm-arm/page.h
> index e00be9e..5a8aba2 100644
> --- a/xen/include/asm-arm/page.h
> +++ b/xen/include/asm-arm/page.h
> @@ -234,7 +234,7 @@ static inline void clean_xen_dcache_va_range(void
> *p, unsigned long size)
>      void *end;
>      dsb();           /* So the CPU issues all writes to the range */
>      for ( end = p + size; p < end; p += cacheline_bytes )
> -        asm volatile (__clean_xen_dcache_one(0) : : "r" (p));
> +        asm volatile (__clean_and_invalidate_xen_dcache_one(0) : : "r"
> (p));
>      dsb();           /* So we know the flushes happen before continuing */
>  }
I will try.
>
> Regards,
>
> --
> Julien Grall

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Bringing up sequence for non-boot CPU fails
  2014-02-18 18:17               ` Oleksandr Tyshchenko
  2014-02-18 18:33                 ` Julien Grall
@ 2014-02-19  9:04                 ` Ian Campbell
  1 sibling, 0 replies; 17+ messages in thread
From: Ian Campbell @ 2014-02-19  9:04 UTC (permalink / raw)
  To: Oleksandr Tyshchenko; +Cc: Andrii Anisov, xen-devel@lists.xen.org

On Tue, 2014-02-18 at 20:17 +0200, Oleksandr Tyshchenko wrote:
> Ian,
> I have checked your suggestion with full cache flush.
> For this purposes I have used ARMV7 specific function from our U-Boot.
> This function performs clean and invalidation of the entire data cache
> at all levels.
> I call it in __cpu_up() before arch_cpu_up() calling.

> It seems, the issue doesn't reproduced.

OK good, thanks for doing the experiment.

> Now we need to find missing cache flushes).

One approach would be to move the flush earlier and earlier, essentially
bisecting the boot sequence.

Ian.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Bringing up sequence for non-boot CPU fails
  2014-02-18 18:33                 ` Julien Grall
  2014-02-18 18:40                   ` Oleksandr Tyshchenko
@ 2014-02-19  9:15                   ` Ian Campbell
  2014-02-19 12:35                     ` Andrii Anisov
  1 sibling, 1 reply; 17+ messages in thread
From: Ian Campbell @ 2014-02-19  9:15 UTC (permalink / raw)
  To: Julien Grall; +Cc: Oleksandr Tyshchenko, Andrii Anisov, xen-devel@lists.xen.org

On Tue, 2014-02-18 at 18:33 +0000, Julien Grall wrote:
> On 02/18/2014 06:17 PM, Oleksandr Tyshchenko wrote:
> > Ian,
> 
> Hello Oleksandr,
> 
> > I have checked your suggestion with full cache flush.
> > For this purposes I have used ARMV7 specific function from our U-Boot.
> > This function performs clean and invalidation of the entire data cache
> > at all levels.
> 
> Did you try to only clean the cache? When page table for the secondary
> CPU is created Xen only clean the cache for the specific range.
> I suspect it's not enough and we need to invalidate.

I think the error is occurring when paging is first enabled with HTTBR
set to boot_pgtable, which the secondary CPU is setting up itself during
head.S. The switch to the secondaries own PTs, which are setup by the
boot processor for it, happens a bit later. Other than the boot_pgtable
setup that the CPUs do themselves I don't think anything is touching the
boot pts with caches turned on.

So long as the data is not modified between the clean and enabling
MMU/caches the invalidate shouldn't really matter -- any data which
magically appears when you enable the caches should be in sync with the
underlying RAM I think.

BTW, it would be worth just confirming that caches are not enabled
either upon entry to Xen or after the magic smc #0.

> It should be easy to try with this small patch:
> 
> diff --git a/xen/include/asm-arm/page.h b/xen/include/asm-arm/page.h
> index e00be9e..5a8aba2 100644
> --- a/xen/include/asm-arm/page.h
> +++ b/xen/include/asm-arm/page.h
> @@ -234,7 +234,7 @@ static inline void clean_xen_dcache_va_range(void
> *p, unsigned long size)
>      void *end;
>      dsb();           /* So the CPU issues all writes to the range */
>      for ( end = p + size; p < end; p += cacheline_bytes )
> -        asm volatile (__clean_xen_dcache_one(0) : : "r" (p));
> +        asm volatile (__clean_and_invalidate_xen_dcache_one(0) : : "r"
> (p));
>      dsb();           /* So we know the flushes happen before continuing */
>  }
> 
> Regards,
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Bringing up sequence for non-boot CPU fails
  2014-02-19  9:15                   ` Ian Campbell
@ 2014-02-19 12:35                     ` Andrii Anisov
  2014-03-01 15:58                       ` Julien Grall
  0 siblings, 1 reply; 17+ messages in thread
From: Andrii Anisov @ 2014-02-19 12:35 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Julien Grall, Oleksandr Tyshchenko, xen-devel@lists.xen.org


[-- Attachment #1.1: Type: text/plain, Size: 458 bytes --]

Hello Ian,

The latest update is: Oleksandr tried solution suggested by Julien and the
issue seems to be not reproducible.

p.s. He is on vacations next 1,5 week, and will be back to this issue after.

Andrii Anisov | Software Engineer
GlobalLogic
Kyiv, 03038, Protasov Business Park, M.Grinchenka, 2/1
P +38.044.492.9695x3664  M +380505738852  S andriyanisov
www.globallogic.com
<http://www.globallogic.com/>
http://www.globallogic.com/email_disclaimer.txt

[-- Attachment #1.2: Type: text/html, Size: 2783 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Bringing up sequence for non-boot CPU fails
  2014-02-19 12:35                     ` Andrii Anisov
@ 2014-03-01 15:58                       ` Julien Grall
  2014-03-04 16:50                         ` Oleksandr Tyshchenko
  0 siblings, 1 reply; 17+ messages in thread
From: Julien Grall @ 2014-03-01 15:58 UTC (permalink / raw)
  To: Andrii Anisov, Oleksandr Tyshchenko
  Cc: Julien Grall, Ian Campbell, xen-devel@lists.xen.org

Hello Andrii and Oleksandr,

Sorry for the late answer.

On 19/02/14 20:35, Andrii Anisov wrote:
> The latest update is: Oleksandr tried solution suggested by Julien and
> the issue seems to be not reproducible.

Thanks for the update. I talked with Ian about this issue, the problem 
may come from setup_pagetables (arch/arm/mm.c).
The function is zeros boot_pgtable, boot_first and boot_second, then 
just clean the cache for theses ranges.
But ... when the secondary CPU sets up theses variables, the cache is 
disabled. When the cache is enabled (just before we enable paging), 
cache may contain data that was not reached the memory, therefore it 
will shadow the real data.

Can you try these 2 small tests (separately):
   - Flush all the cache before boot_pgtable is zeroed (arch/arm/mm.c:486)
   - Flush all the cache after clean_xen_dcache(boot_second) 
(arch/arm/mm.c:493)

If the former test is failing and not the latter, then we have found the 
issue :).

I think the 3 clean_xen_dcache should be replaced by a clean and 
invalidate dcache.

Regards,

-- 
Julien Grall

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Bringing up sequence for non-boot CPU fails
  2014-03-01 15:58                       ` Julien Grall
@ 2014-03-04 16:50                         ` Oleksandr Tyshchenko
  2014-03-05  1:45                           ` Ian Campbell
  0 siblings, 1 reply; 17+ messages in thread
From: Oleksandr Tyshchenko @ 2014-03-04 16:50 UTC (permalink / raw)
  To: Julien Grall
  Cc: Julien Grall, Ian Campbell, Andrii Anisov,
	xen-devel@lists.xen.org

Hello Julien and Ian.

Sorry for the late response.

> Can you try these 2 small tests (separately):
>   - Flush all the cache before boot_pgtable is zeroed (arch/arm/mm.c:486)
>   - Flush all the cache after clean_xen_dcache(boot_second)
> (arch/arm/mm.c:493)
>
> If the former test is failing and not the latter, then we have found the
> issue :).

First test failed.
Second test passed.
it seems, you have found the issue :).

>
> I think the 3 clean_xen_dcache should be replaced by a clean and invalidate
> dcache.

I have checked. It works!

Thank you very much.

How it should be properly made for mainline? Shall it be a separate
macro clean_and_invalidate_xen_dcache()
(and function clean_and_invalidate_xen_dcache_va_range()) which will
be called three times
(only for boot pagetables) from setup_pagetables()?

-- 

Oleksandr Tyshchenko | Embedded Developer
GlobalLogic
www.globallogic.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Bringing up sequence for non-boot CPU fails
  2014-03-04 16:50                         ` Oleksandr Tyshchenko
@ 2014-03-05  1:45                           ` Ian Campbell
  0 siblings, 0 replies; 17+ messages in thread
From: Ian Campbell @ 2014-03-05  1:45 UTC (permalink / raw)
  To: Oleksandr Tyshchenko
  Cc: Julien Grall, Julien Grall, Andrii Anisov,
	xen-devel@lists.xen.org

On Tue, 2014-03-04 at 18:50 +0200, Oleksandr Tyshchenko wrote:
> >
> > I think the 3 clean_xen_dcache should be replaced by a clean and invalidate
> > dcache.
> 
> I have checked. It works!

Excellent. Thanks for testing.

> 
> Thank you very much.
> 
> How it should be properly made for mainline? Shall it be a separate
> macro clean_and_invalidate_xen_dcache()
> (and function clean_and_invalidate_xen_dcache_va_range()) which will
> be called three times
> (only for boot pagetables) from setup_pagetables()?

I think we should duplicate clean_xen_dcache as
clean_and_invalidate_xen_dcache (with the obvious difference) and use it
in these three places.

If your cpp-fu is up to it you could perhaps even manage to turn
clean_xen_dcache into a common helper __maintain_xen_dcache used by both
clean_xen_dcache and clean_and_invalidate_xen_cache, passing either
clean or clean_and_invalidate as an argument and doing some cpp token
pasting to make the final call.

Ian.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2014-03-05  1:45 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-02-18 11:44 Bringing up sequence for non-boot CPU fails Oleksandr Tyshchenko
2014-02-18 12:14 ` Ian Campbell
2014-02-18 12:36   ` Andrii Anisov
2014-02-18 12:53     ` Ian Campbell
2014-02-18 13:23       ` Andrii Anisov
2014-02-18 13:34         ` Ian Campbell
2014-02-18 13:37           ` Andrii Anisov
2014-02-18 15:33             ` Ian Campbell
2014-02-18 18:17               ` Oleksandr Tyshchenko
2014-02-18 18:33                 ` Julien Grall
2014-02-18 18:40                   ` Oleksandr Tyshchenko
2014-02-19  9:15                   ` Ian Campbell
2014-02-19 12:35                     ` Andrii Anisov
2014-03-01 15:58                       ` Julien Grall
2014-03-04 16:50                         ` Oleksandr Tyshchenko
2014-03-05  1:45                           ` Ian Campbell
2014-02-19  9:04                 ` Ian Campbell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).