* Re: Unstable Kernel behavior on an ARM based board
[not found] ` <CA+_ZnZTCtghS+oUdXGW+h6SVD6nMPfsyDAvOwEugrhx6NJ3yFg@mail.gmail.com>
@ 2019-03-04 14:25 ` Thierry Reding
2019-03-04 15:51 ` Embedded Engineer
2019-03-05 10:01 ` Embedded Engineer
0 siblings, 2 replies; 22+ messages in thread
From: Thierry Reding @ 2019-03-04 14:25 UTC (permalink / raw)
To: Embedded Engineer
Cc: linux-tegra, Vladimir Murzin, linux-arm-kernel, Jon Hunter
[-- Attachment #1.1: Type: text/plain, Size: 2327 bytes --]
On Mon, Mar 04, 2019 at 05:25:28PM +0500, Embedded Engineer wrote:
> On Mon, Mar 4, 2019 at 3:26 PM Vladimir Murzin <vladimir.murzin@arm.com> wrote:
> >
> > You can try in-kernel memtest:
> >
> > - CONFIG_MEMTEST=y
> > - pass memtest in kernel's command line
> >
>
> Thanks Vladimir, I tried running mtest as suggested by Clemens in
> u-boot and memtest in kernel as suggested by you. Neither test
> showed any errors; however, the board sometimes hangs at "Starting kernel
> ...". The following logs were obtained when it booted but ended in a
> crash:
>
> https://pastebin.com/sZZjUcbh
Other than the memory corruption issue this looks like a fairly regular
boot. It's not clear whether the crash of your /sbin/init is related to
any memory issues. The earlier boot log that you had posted showed that
it was failing to mount the root filesystem and dropped you to a
maintenance shell, so that could be an indication that something isn't
right about the root filesystem. Or it could indicate that something is
wrong when loading files from the root filesystem.
The earlier log showed EMEM address decode errors, which are odd because
the addresses clearly lie in regions that should be system memory. EMEM
address decode errors usually only happen if the memory controller thinks
you are trying to access memory outside of system memory.
The good news is that I think you're pretty close. The memory corruption
is somewhat worrying, but at the same time it's unlikely that you'd get
as far as you do if your memory timings are completely off. However, I
think we need to gather more information to narrow down what's going
wrong.
All of the memory related configuration is part of a file called the
BCT. I think if you could provide that it would be very useful to have.
Also, it looks like you're using the Jetson TK1 device tree to boot, so
can I assume you haven't modified it at all?
Other bits of information that would be good to know are how you are
generating the BCT and your boot images, what exactly you do to flash
the board and which release of L4T you use.
Perhaps also try to run a recent linux-next just to exclude any issues
that may have been part of the 4.8.0-rc7 that you tested.
Also adding Jon and linux-tegra for a broader audience.
Thanks,
Thierry
[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
[-- Attachment #2: Type: text/plain, Size: 176 bytes --]
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Unstable Kernel behavior on an ARM based board
2019-03-04 14:25 ` Unstable Kernel behavior on an ARM based board Thierry Reding
@ 2019-03-04 15:51 ` Embedded Engineer
2019-03-05 10:01 ` Embedded Engineer
1 sibling, 0 replies; 22+ messages in thread
From: Embedded Engineer @ 2019-03-04 15:51 UTC (permalink / raw)
To: Thierry Reding; +Cc: linux-tegra, Vladimir Murzin, linux-arm-kernel, Jon Hunter
Thanks a lot Thierry for considering this thread.
On Mon, Mar 4, 2019 at 7:25 PM Thierry Reding <thierry.reding@gmail.com> wrote:
>
> Or it could indicate that something is
> wrong when loading files from the root filesystem.
When I used the downstream kernel and L4T filesystem, there was no
problem regarding filesystem mounting.
> All of the memory related configuration is part of a file called the
> BCT. I think if you could provide that it would be very useful to have.
Please find the link to our BCT:
https://drive.google.com/open?id=1Az4nDIImCm14cnDSfHeBPlQYlYijGMrS
> Also, it looks like you're using the Jetson TK1 device tree to boot, so
> can I assume you haven't modified it at all?
For the downstream kernel I modified the dtb by generating a new pinmux
using Nvidia's dts generation tool, but for the upstream kernel I haven't
modified any dts.
> Other bits of information that would be good to know are how you are
> generating the BCT and your boot images, what exactly you do to flash
> the board and which release of L4T you use.
We run the Shmoo memory characterization tool and get a cfg file from it.
Then we convert that cfg to a BCT (using the mkbct command, I guess).
We were never able to flash the board using the nvflash/flash.sh utility,
so instead:
1. We build and flash u-boot & BCT using tegra-uboot-flasher.
2. We build the kernel separately with make, using the sources available
on the Nvidia download center.
3. We use apply_binaries.sh to copy the Tegra-related files to the sample
file system downloaded from the Nvidia download center.
4. We mount the eMMC/SD card on our Linux host using u-boot's ums
command, and copy the whole filesystem, kernel and DTB to it.
We are using L4T R21.7.
> Perhaps also try to run a recent linux-next just to exclude any issues
> that may have been part of the 4.8.0-rc7 that you tested.
OK, I will build a kernel from linux-next and report back here.
Thanks again.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Unstable Kernel behavior on an ARM based board
2019-03-04 14:25 ` Unstable Kernel behavior on an ARM based board Thierry Reding
2019-03-04 15:51 ` Embedded Engineer
@ 2019-03-05 10:01 ` Embedded Engineer
2019-03-05 10:07 ` Russell King - ARM Linux admin
2019-03-05 10:32 ` Thierry Reding
1 sibling, 2 replies; 22+ messages in thread
From: Embedded Engineer @ 2019-03-05 10:01 UTC (permalink / raw)
To: Thierry Reding
Cc: linux-tegra, Andrew Lunn, Vladimir Murzin, linux-arm-kernel,
Jon Hunter
On Mon, Mar 4, 2019 at 7:25 PM Thierry Reding <thierry.reding@gmail.com> wrote:
> Perhaps also try to run a recent linux-next just to exclude any issues
> that may have been part of the 4.8.0-rc7 that you tested.
Thierry, I have disabled the caches as per Andrew's suggestion by calling
dcache_disable() and icache_disable() just before kernel_entry() in the
u-boot source. I have also built the linux-next kernel and tested it by
booting from a microSD card, but it does not get as far as the login
console and hangs midway. Please have a look at the kernel logs at the
link below:
https://pastebin.com/ByuaLxTt
P.S.: If I replace the zImage and DTB on the same microSD card with the
downstream ones, it successfully takes me to the login console (although
it has the hanging issues I mentioned in previous posts).
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Unstable Kernel behavior on an ARM based board
2019-03-05 10:01 ` Embedded Engineer
@ 2019-03-05 10:07 ` Russell King - ARM Linux admin
2019-03-05 10:29 ` Embedded Engineer
2019-03-05 10:32 ` Thierry Reding
1 sibling, 1 reply; 22+ messages in thread
From: Russell King - ARM Linux admin @ 2019-03-05 10:07 UTC (permalink / raw)
To: Embedded Engineer
Cc: Andrew Lunn, Vladimir Murzin, Jon Hunter, Thierry Reding,
linux-tegra, linux-arm-kernel
On Tue, Mar 05, 2019 at 03:01:35PM +0500, Embedded Engineer wrote:
> On Mon, Mar 4, 2019 at 7:25 PM Thierry Reding <thierry.reding@gmail.com> wrote:
> > Perhaps also try to run a recent linux-next just to exclude any issues
> > that may have been part of the 4.8.0-rc7 that you tested.
>
> Thierry I have disabled cache as per Andrew's suggestion by calling
> dcache_disable() and icache_disable() just before kernel_entry() in
> u-boot source. I have also built the linux-next kernel and tested by
> booting from microSD card but it is not going up to login console and
> hangs midway. Please have a look at kernel logs in below link:
>
> https://pastebin.com/ByuaLxTt
Please apply this patch so we can see the (ptrval) values. Thanks.
8<===
From: Russell King <rmk+kernel@armlinux.org.uk>
Subject: [PATCH] lib: make vsprintf print pointers without munging
Printing pointers is useful for debugging, disable this so I can debug
the kernel.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
---
lib/vsprintf.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 37a54a6dd594..c2ae4075c786 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -687,9 +687,9 @@ early_initcall(initialize_ptr_random);
static char *ptr_to_id(char *buf, char *end, const void *ptr,
struct printf_spec spec)
{
- const char *str = sizeof(ptr) == 8 ? "(____ptrval____)" : "(ptrval)";
unsigned long hashval;
+#if 0
/* When debugging early boot use non-cryptographically secure hash. */
if (unlikely(debug_boot_weak_hash)) {
hashval = hash_long((unsigned long)ptr, 32);
@@ -697,6 +697,7 @@ static char *ptr_to_id(char *buf, char *end, const void *ptr,
}
if (static_branch_unlikely(&not_filled_random_ptr_key)) {
+ const char *str = sizeof(ptr) == 8 ? "(____ptrval____)" : "(ptrval)";
spec.field_width = 2 * sizeof(ptr);
/* string length must be less than default_width */
return string(buf, end, str, spec);
@@ -712,6 +713,9 @@ static char *ptr_to_id(char *buf, char *end, const void *ptr,
#else
hashval = (unsigned long)siphash_1u32((u32)ptr, &ptr_key);
#endif
+#else
+ hashval = (unsigned long)ptr;
+#endif
return pointer_string(buf, end, (const void *)hashval, spec);
}
--
2.7.4
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: Unstable Kernel behavior on an ARM based board
2019-03-05 10:07 ` Russell King - ARM Linux admin
@ 2019-03-05 10:29 ` Embedded Engineer
2019-03-05 11:20 ` Thierry Reding
2019-03-05 11:22 ` Russell King - ARM Linux admin
0 siblings, 2 replies; 22+ messages in thread
From: Embedded Engineer @ 2019-03-05 10:29 UTC (permalink / raw)
To: Russell King - ARM Linux admin
Cc: Andrew Lunn, Vladimir Murzin, Jon Hunter, Thierry Reding,
linux-tegra, linux-arm-kernel
On Tue, Mar 5, 2019 at 3:07 PM Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:
>
> Please apply this patch so we can see the (ptrval) values. Thanks.
Please find below logs after applying patch:
https://pastebin.com/6TaBxPX5
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Unstable Kernel behavior on an ARM based board
2019-03-05 10:01 ` Embedded Engineer
2019-03-05 10:07 ` Russell King - ARM Linux admin
@ 2019-03-05 10:32 ` Thierry Reding
2019-03-05 11:05 ` Embedded Engineer
1 sibling, 1 reply; 22+ messages in thread
From: Thierry Reding @ 2019-03-05 10:32 UTC (permalink / raw)
To: Embedded Engineer
Cc: linux-tegra, Andrew Lunn, Vladimir Murzin, linux-arm-kernel,
Jon Hunter
On Tue, Mar 05, 2019 at 03:01:35PM +0500, Embedded Engineer wrote:
> On Mon, Mar 4, 2019 at 7:25 PM Thierry Reding <thierry.reding@gmail.com> wrote:
> > Perhaps also try to run a recent linux-next just to exclude any issues
> > that may have been part of the 4.8.0-rc7 that you tested.
>
> Thierry I have disabled cache as per Andrew's suggestion by calling
> dcache_disable() and icache_disable() just before kernel_entry() in
> u-boot source. I have also built the linux-next kernel and tested by
> booting from microSD card but it is not going up to login console and
> hangs midway. Please have a look at kernel logs in below link:
>
> https://pastebin.com/ByuaLxTt
Okay, looks fairly normal so far, except for the corrupted data. That's
definitely not normal and I think we need to fix that first, otherwise
we can't really be certain what's going on later.
Besides the memory timings in the BCT, another thing that comes to mind
as a possible cause of memory corruption is the power supplies. Are you
sure they're all correctly configured and enabled as required? It might
be worth looking at all of them and marking them "regulator-always-on"
just to make sure an essential one isn't disabled inadvertently during
boot. That said, the corruption happens long before unused regulators are
disabled, so this is unlikely to be the cause here. But it's perhaps best
to check anyway, just in case.
> P.S: If I replace the zImage and DTB on the same microSD card with the
> downstream ones, it successfully takes me to login console (although it
> has hanging issues as I mentioned in previous posts)
Do the upstream kernel and DTB boot reliably, even if they don't get
you to a login prompt? Or do they also behave erratically like the
downstream kernel and DTB that you have?
Thierry
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Unstable Kernel behavior on an ARM based board
2019-03-05 10:32 ` Thierry Reding
@ 2019-03-05 11:05 ` Embedded Engineer
2019-03-05 11:36 ` Thierry Reding
0 siblings, 1 reply; 22+ messages in thread
From: Embedded Engineer @ 2019-03-05 11:05 UTC (permalink / raw)
To: Thierry Reding
Cc: linux-tegra, Andrew Lunn, Vladimir Murzin, linux-arm-kernel,
Jon Hunter
On Tue, Mar 5, 2019 at 3:32 PM Thierry Reding <thierry.reding@gmail.com> wrote:
>
> One thing besides memory timings in BCT that comes to mind that could be
> causing memory corruption are power supplies. Are you sure they're all
> correctly configured and enabled as required?
On the hardware side this part is 100% the same as the Jetson TK1. And in
the device tree, the node 'vdd_1v35_lp0: sd2' already has the
'regulator-always-on' property. We also once used an oscilloscope to
check whether the power drops or fluctuates during operation, but found
that the DDR chips were getting stable power.
> Does the upstream kernel and DTB boot reliably, even if it doesn't get
> you to a login prompt? Or does it also behave erratically like the
> downstream kernel and DTB that you have?
2 out of 10 times it behaved erratically, i.e. one time it got stuck at
'Starting kernel ...' and the other time it got stuck after the following
prints:
Starting kernel ...
[ 0.000000] Booting Linux on physical CPU 0x0
[ 0.000000] Linux version 5.0.0-rc8-next-20190304-dirty
(teresol@ubuntu) (gcc version 6.1.1 20160711 (Linaro GCC 6.1-2016.08))
#2 SMP PREEMPT Tue Mar 5 02:15:14 PST 2019
[ 0.000000] CPU: ARMv7 Processor [413fc0f3] revision 3 (ARMv7), cr=10c5387d
[ 0.000000] CPU: div instructions available: patching division code
[ 0.000000] CPU: PIPT / VIPT nonaliasing data cache, PIPT instruction cache
[ 0.000000] OF: fdt: Machine model: NVIDIA Tegra124 Jetson TK1
[ 0.000000] earlycon: uart0 at MMIO 0x70006300 (options '115200n8')
[ 0.000000] printk: bootconsole [uart0] enabled
[ 0.000000] Memory policy: Data cache writealloc
[ 0.000000] cma: Reserved 64 MiB at 0xac000000
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Unstable Kernel behavior on an ARM based board
2019-03-05 10:29 ` Embedded Engineer
@ 2019-03-05 11:20 ` Thierry Reding
2019-03-05 11:22 ` Russell King - ARM Linux admin
1 sibling, 0 replies; 22+ messages in thread
From: Thierry Reding @ 2019-03-05 11:20 UTC (permalink / raw)
To: Embedded Engineer
Cc: Andrew Lunn, Vladimir Murzin, Russell King - ARM Linux admin,
Jon Hunter, linux-tegra, linux-arm-kernel
On Tue, Mar 05, 2019 at 03:29:26PM +0500, Embedded Engineer wrote:
> On Tue, Mar 5, 2019 at 3:07 PM Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
> >
> > Please apply this patch so we can see the (ptrval) values. Thanks.
>
> Please find below logs after applying patch:
>
> https://pastebin.com/6TaBxPX5
Hm... so it looks like what you're getting here is the error spew from the
DMA pool debug code in mm/dmapool.c. The way I understand it, that code
initializes the memory of each page allocated for the pool with
POOL_POISON_FREED (0xa7) (see pool_alloc_page()) and then, upon adding
the page to the pool list, it'll store the offset in the page->offset
field and check the contents of the page.
The contents of the page then don't match the expected poison. The dump
of the corrupted memory is somewhat confusing because the values that
don't match the poison are actually expected, at least partially. From
my reading of the DMA pool code, the first four bytes store the offset
of the DMA block into the physical memory page. However, given the size
of the hexdump, it looks like the pool was allocated with a block size
of 64 bytes, which matches the code in drivers/usb/chipidea/udc.c that
allocates the "ci_hw_qh" pool.
What's strange here, though, is that the offset that's stored to the
first four bytes of a block seems to actually be stored twice per block.
The first offset seems to be correct, since it's apparently used to find
the offset of the next block to allocate. If you look at the first
corrupted hexdump:
[ 1.327553] tegra-udc 7d000000.usb: dma_pool_alloc ci_hw_qh, ec056080 (corrupted)
[ 1.335058] 00000000: c0 00 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
[ 1.343077] 00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
[ 1.351095] 00000020: 80 00 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
[ 1.359113] 00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
This is the entry for the block at offset 0x00000080 and the offset for
the next block is 0x000000c0, which is exactly 64 bytes after the
current block. However, if you then look at the second offset that's
stored at offset 0x00000020 in the block, it's 0x00000080, which does
match the offset of the current block, but I think that may just be
coincidence. The same coincidence happens for the second corrupted
block:
[ 1.367210] tegra-udc 7d000000.usb: dma_pool_alloc ci_hw_qh, ec056140 (corrupted)
[ 1.374709] 00000000: 80 01 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
[ 1.382727] 00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
[ 1.390744] 00000020: 40 01 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 @...............
[ 1.398760] 00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
But not for the third:
[ 1.406965] tegra-udc 7d000000.usb: dma_pool_alloc ci_hw_qh, ec0561c0 (corrupted)
[ 1.414466] 00000000: 00 02 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
[ 1.422483] 00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
[ 1.430502] 00000020: 40 03 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 @...............
[ 1.438519] 00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
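For anyone who wants to check these dumps mechanically, here is a minimal userspace sketch (not kernel code). It assumes a 64-byte block as inferred from the hexdump length and a little-endian layout, skips the first four bytes that hold the next-free offset, and reports the first byte that is not the 0xa7 poison. The helper names le32() and first_corruption() are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define POISON     0xa7  /* value of POOL_POISON_FREED seen in the dumps */
#define BLOCK_SIZE 64    /* assumption: block size inferred from the dump length */

/* Decode a little-endian 32-bit word from a byte buffer. */
static uint32_t le32(const uint8_t *p)
{
    assert(p != NULL);
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Return the offset of the first non-poison byte after the 4-byte
 * next-offset header, or -1 if the rest of the block is clean.
 */
static int first_corruption(const uint8_t *block)
{
    assert(block != NULL);
    for (int i = 4; i < BLOCK_SIZE; i++)
        if (block[i] != POISON)
            return i;
    return -1;
}
```

Fed the 64 bytes of the first dump above, first_corruption() returns 0x20 and le32() at that position decodes to 0x00000080, while le32() on the header gives the next-free offset 0xc0.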
The fact that we see the offset stored at offset 0x20 in each block
makes me think there's perhaps some sort of aliasing happening here. But
I'm not sure how the system would even boot this far if aliasing was
really the problem. Things should be falling apart much sooner if that's
really what's going on here.
However, this sort of aliasing is not something that your typical memory
test will catch, so it could explain why they aren't reporting any
errors.
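To illustrate why a uniform-pattern test can miss aliasing, here is a small simulation sketch. It is purely a model: struct sim_mem, cell(), pattern_test() and address_test() are hypothetical names, and the "ignored address bit" is only an assumed stand-in for whatever the real hardware might be doing. The pattern test writes one value everywhere, whereas the address-in-address test stores each word's own index so that two addresses landing on the same cell become visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WORDS 1024  /* size of the simulated memory, in 32-bit words */

/* Simulated memory; ignored_bits models address bits the hardware drops. */
struct sim_mem {
    uint32_t cells[WORDS];
    size_t ignored_bits;  /* 0 means a healthy memory */
};

/* Map a word index to its backing cell, dropping the ignored bits. */
static uint32_t *cell(struct sim_mem *m, size_t idx)
{
    assert(idx < WORDS);
    return &m->cells[idx & ~m->ignored_bits];
}

/* Write the same pattern everywhere, read it back, count mismatches. */
static size_t pattern_test(struct sim_mem *m, uint32_t pattern)
{
    size_t errors = 0;
    for (size_t i = 0; i < WORDS; i++)
        *cell(m, i) = pattern;
    for (size_t i = 0; i < WORDS; i++)
        if (*cell(m, i) != pattern)
            errors++;
    return errors;
}

/* Store each word's own index, read it back, count mismatches. */
static size_t address_test(struct sim_mem *m)
{
    size_t errors = 0;
    for (size_t i = 0; i < WORDS; i++)
        *cell(m, i) = (uint32_t)i;
    for (size_t i = 0; i < WORDS; i++)
        if (*cell(m, i) != (uint32_t)i)
            errors++;
    return errors;
}
```

With ignored_bits set to 0x8 (word-index bit 3, which corresponds to a 0x20 byte stride), pattern_test() still reports zero errors while address_test() flags half of the words. Whether the memory tests already run on this board include such a check is not something this sketch can tell.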
Thierry
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Unstable Kernel behavior on an ARM based board
2019-03-05 10:29 ` Embedded Engineer
2019-03-05 11:20 ` Thierry Reding
@ 2019-03-05 11:22 ` Russell King - ARM Linux admin
2019-03-05 11:57 ` Thierry Reding
1 sibling, 1 reply; 22+ messages in thread
From: Russell King - ARM Linux admin @ 2019-03-05 11:22 UTC (permalink / raw)
To: Embedded Engineer
Cc: Andrew Lunn, Vladimir Murzin, Jon Hunter, Thierry Reding,
linux-tegra, linux-arm-kernel
On Tue, Mar 05, 2019 at 03:29:26PM +0500, Embedded Engineer wrote:
> On Tue, Mar 5, 2019 at 3:07 PM Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
> >
> > Please apply this patch so we can see the (ptrval) values. Thanks.
>
> Please find below logs after applying patch:
>
> https://pastebin.com/6TaBxPX5
So we have a pattern here:
tegra-udc 7d000000.usb: dma_pool_alloc ci_hw_qh, ec056080 (corrupted)
00000000: c0 00 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
00000020: 80 00 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
tegra-udc 7d000000.usb: dma_pool_alloc ci_hw_qh, ec056140 (corrupted)
00000000: 80 01 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
00000020: 40 01 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 @...............
00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
tegra-udc 7d000000.usb: dma_pool_alloc ci_hw_qh, ec0561c0 (corrupted)
00000000: 00 02 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
00000020: 40 03 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 @...............
00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
tegra-udc 7d000000.usb: dma_pool_alloc ci_hw_qh, ec056200 (corrupted)
00000000: 40 02 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 @...............
00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
00000020: 40 05 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 @...............
00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
and so it goes on.
The first four bytes are the offset to the next free block of memory in
this page, so can be ignored. The remainder of the bytes should all be
0xa7, but in each of these the word at offset 32 is corrupted with what
looks to be a similar offset.
We dump 0x40 bytes, which, reading the code, means the pool's object size
is 0x40 bytes. The table below lists the object offset, the next offset,
and the corruption at offset 32. Corruption1 is from your latest log;
corruption2 is derived from your previous log, using the next pointer
to tie the two together:
object offset next corruption1 corruption2
0x0080 0x00c0 0x00000080 0x00000080
0x0140 0x0180 0x00000140 0x00000100
0x01c0 0x0200 0x00000340 0x000001c0
0x0200 0x0240 0x00000540 0x000001c0
0x0280 0x02c0 0x00000340 0x00000300
0x0340 0x0380 0x00000540 0x00000140
0x03c0 0x0400 0x00000540 0x00000300
0x0400 0x0440 0x000003c0 0x00000140
0x0480 0x04c0 0x00000540 0x000003c0
0x0540 0x0580 0x00000480 0x00000540
0x05c0 0x0600 0x000005c0 0x000005c0
0x0600 0x0640 0x00000500 0x000005c0
0x0680 0x06c0 0x00000740 0x00000680
?????? 0x0780 0x00000740
0x07c0 0x0800 0x000007c0 0x00000700
The corruption looks very much like offset values, except they do not
seem to follow any rhyme or reason. They also appear to be different
on each boot.
Given that the sequence here when a pool allocation occurs is:
1. allocate DMA coherent page
2. memset entire page with 0xa7
3. write next offsets
4. initialise 'offset' to zero (offset of first free object)
5. add page to pool's list of pages
6. allocate first object, updating offset to the next free offset read
from the first word of the object.
then when the next allocation request comes along, we allocate the
next object in the same way as step 6. At the point of allocating the
third object, we find that there is corruption in the third object at
0x20 bytes into it - or 0xa0 bytes into the page.
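The sequence above can be modelled in userspace with a minimal sketch. This is not the mm/dmapool.c code: malloc() stands in for the DMA coherent page, the 4096-byte page size is an assumption, boundary handling is left out, and struct sim_page, sim_page_init() and sim_alloc() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_BYTES 4096  /* assumed page size */
#define OBJ_SIZE   64    /* object size seen in the dumps */
#define POISON     0xa7  /* value of POOL_POISON_FREED */

struct sim_page {
    uint8_t *vaddr;       /* backing memory (malloc stands in for the DMA page) */
    unsigned int offset;  /* offset of the first free object */
};

/* Steps 1-4: allocate, poison, chain the free objects, reset 'offset'. */
static int sim_page_init(struct sim_page *pg)
{
    pg->vaddr = malloc(PAGE_BYTES);
    if (!pg->vaddr)
        return -1;
    memset(pg->vaddr, POISON, PAGE_BYTES);
    for (unsigned int off = 0; off < PAGE_BYTES; off += OBJ_SIZE) {
        uint32_t next = off + OBJ_SIZE;  /* first word points at the next object */
        memcpy(pg->vaddr + off, &next, sizeof(next));
    }
    pg->offset = 0;
    return 0;
}

/*
 * Step 6: hand out the object at 'offset' and follow its first word to the
 * next free object. *corrupt is set to 1 if any byte after the first word
 * is not poison, which is the condition the "(corrupted)" message reports.
 */
static uint8_t *sim_alloc(struct sim_page *pg, int *corrupt)
{
    uint8_t *obj;
    uint32_t next;

    if (pg->offset >= PAGE_BYTES)
        return NULL;  /* page exhausted */
    obj = pg->vaddr + pg->offset;
    memcpy(&next, obj, sizeof(next));
    pg->offset = next;
    *corrupt = 0;
    for (unsigned int i = sizeof(next); i < OBJ_SIZE; i++)
        if (obj[i] != POISON)
            *corrupt = 1;
    return obj;
}
```

In this model, a stray write to byte 0xa0 of the page between the second and third sim_alloc() calls makes the third call report corruption, while the next-free chain (0x80, then 0xc0) stays intact, which is the shape of what the log shows.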
Now, what does the driver that's allocating these do with them? That
is done via init_eps() in drivers/usb/chipidea/udc.c, which doesn't do
anything with the allocated memory. This is the only place that the
driver allocates from this DMA pool, which is done in a loop, so we
know that the objects allocated from this pool will be in relatively
quick succession.
So this does not make sense.
I really doubt that there is anything wrong with the kernel - this USB
driver is used on other SoCs (such as iMX6) and does not exhibit this
problem - and it also works on the Tegra TK1 platform.
You are definitely seeing memory corruption here - but given what the
above looks like, I'd put forward another possible scenario - maybe
u-boot or something else is leaving a USB controller or some other DMA
agent active, which is writing over memory while the kernel is trying
to boot, resulting in memory corruption.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Unstable Kernel behavior on an ARM based board
2019-03-05 11:05 ` Embedded Engineer
@ 2019-03-05 11:36 ` Thierry Reding
0 siblings, 0 replies; 22+ messages in thread
From: Thierry Reding @ 2019-03-05 11:36 UTC (permalink / raw)
To: Embedded Engineer
Cc: linux-tegra, Andrew Lunn, Vladimir Murzin, linux-arm-kernel,
Jon Hunter
On Tue, Mar 05, 2019 at 04:05:14PM +0500, Embedded Engineer wrote:
> On Tue, Mar 5, 2019 at 3:32 PM Thierry Reding <thierry.reding@gmail.com> wrote:
> >
> > One thing besides memory timings in BCT that comes to mind that could be
> > causing memory corruption are power supplies. Are you sure they're all
> > correctly configured and enabled as required?
>
> This part is 100% same as the Jetson TK1 on hardware end. And in
> device tree, the node 'vdd_1v35_lp0: sd2' has already
> 'regulator-always-on' property. We also tried once by using
> oscilloscope to check if the power drops/fluctuates during operation
> but noticed that DDR chips were getting stable power.
Okay, sounds like that's not relevant here, then.
> > Does the upstream kernel and DTB boot reliably, even if it doesn't get
> > you to a login prompt? Or does it also behave erratically like the
> > downstream kernel and DTB that you have?
>
> 2 out of 10 times it behaved erratically, i.e. one time it stuck at
> 'Starting kernel ...' and the other time it stuck after following
> prints:
>
> Starting kernel ...
>
> [ 0.000000] Booting Linux on physical CPU 0x0
> [ 0.000000] Linux version 5.0.0-rc8-next-20190304-dirty
> (teresol@ubuntu) (gcc version 6.1.1 20160711 (Linaro GCC 6.1-2016.08))
> #2 SMP PREEMPT Tue Mar 5 02:15:14 PST 2019
> [ 0.000000] CPU: ARMv7 Processor [413fc0f3] revision 3 (ARMv7), cr=10c5387d
> [ 0.000000] CPU: div instructions available: patching division code
> [ 0.000000] CPU: PIPT / VIPT nonaliasing data cache, PIPT instruction cache
> [ 0.000000] OF: fdt: Machine model: NVIDIA Tegra124 Jetson TK1
> [ 0.000000] earlycon: uart0 at MMIO 0x70006300 (options '115200n8')
> [ 0.000000] printk: bootconsole [uart0] enabled
> [ 0.000000] Memory policy: Data cache writealloc
> [ 0.000000] cma: Reserved 64 MiB at 0xac000000
Okay, this could corroborate the aliasing hypothesis. If aliasing is
really the problem, it would most likely indicate an issue in the BCT
that happened as part of shmooing. I'm not very familiar with the tests
run as part of the Shmoo suite, but I would've hoped that it contains
tests for aliasing.
Thierry
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Unstable Kernel behavior on an ARM based board
2019-03-05 11:22 ` Russell King - ARM Linux admin
@ 2019-03-05 11:57 ` Thierry Reding
2019-03-05 13:16 ` Embedded Engineer
0 siblings, 1 reply; 22+ messages in thread
From: Thierry Reding @ 2019-03-05 11:57 UTC (permalink / raw)
To: Russell King - ARM Linux admin
Cc: Embedded Engineer, Vladimir Murzin, Andrew Lunn, Jon Hunter,
linux-tegra, linux-arm-kernel
On Tue, Mar 05, 2019 at 11:22:26AM +0000, Russell King - ARM Linux admin wrote:
> On Tue, Mar 05, 2019 at 03:29:26PM +0500, Embedded Engineer wrote:
> > On Tue, Mar 5, 2019 at 3:07 PM Russell King - ARM Linux admin
> > <linux@armlinux.org.uk> wrote:
> > >
> > > Please apply this patch so we can see the (ptrval) values. Thanks.
> >
> > Please find below logs after applying patch:
> >
> > https://pastebin.com/6TaBxPX5
>
> So we have a pattern here:
>
> tegra-udc 7d000000.usb: dma_pool_alloc ci_hw_qh, ec056080 (corrupted)
> 00000000: c0 00 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
> 00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
> 00000020: 80 00 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
> 00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
> tegra-udc 7d000000.usb: dma_pool_alloc ci_hw_qh, ec056140 (corrupted)
> 00000000: 80 01 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
> 00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
> 00000020: 40 01 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 @...............
> 00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
> tegra-udc 7d000000.usb: dma_pool_alloc ci_hw_qh, ec0561c0 (corrupted)
> 00000000: 00 02 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
> 00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
> 00000020: 40 03 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 @...............
> 00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
> tegra-udc 7d000000.usb: dma_pool_alloc ci_hw_qh, ec056200 (corrupted)
> 00000000: 40 02 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 @...............
> 00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
> 00000020: 40 05 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 @...............
> 00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
>
> and so it goes on.
>
> The first four bytes are the offset to the next free block of memory in
> this page, so can be ignored. The remainder of the bytes should all be
> 0xa7, but every word at offset 32 into these is corrupted with what
> looks to be a similar offset.
>
> We dump 0x40 bytes, which, reading the code makes the pool size 0x40
> bytes in size. Tabulating the object offset, the next offset, and
> the corruption at offset 32. Corruption1 is from your latest log,
> corruption2 is derived from your previous log using the next pointer
> to tie up between the two:
>
> object offset next corruption1 corruption2
> 0x0080 0x00c0 0x00000080 0x00000080
> 0x0140 0x0180 0x00000140 0x00000100
> 0x01c0 0x0200 0x00000340 0x000001c0
> 0x0200 0x0240 0x00000540 0x000001c0
> 0x0280 0x02c0 0x00000340 0x00000300
> 0x0340 0x0380 0x00000540 0x00000140
> 0x03c0 0x0400 0x00000540 0x00000300
> 0x0400 0x0440 0x000003c0 0x00000140
> 0x0480 0x04c0 0x00000540 0x000003c0
> 0x0540 0x0580 0x00000480 0x00000540
> 0x05c0 0x0600 0x000005c0 0x000005c0
> 0x0600 0x0640 0x00000500 0x000005c0
> 0x0680 0x06c0 0x00000740 0x00000680
> ?????? 0x0780 0x00000740
> 0x07c0 0x0800 0x000007c0 0x00000700
>
> The corruption looks very much like offset values, except they do not
> seem to follow any rhyme or reason. They also appear to be different
> on each boot.
>
> Given that the sequence here when a pool allocation occurs is:
>
> 1. allocate DMA coherent page
> 2. memset entire page with 0xa7
> 3. write next offsets
> 4. initialise 'offset' to zero (offset of first free object)
> 5. add page to pools list of pages
> 6. allocate first object, updating offset to the next free offset read
> from the first word of the object.
>
> then when the next allocation request comes along, we allocate the
> next object in the same way as step 6. At the point of allocating the
> third object, we find that there is corruption in the third object at
> 0x20 bytes into it - or 0xa0 bytes into the page.
>
> Now, what does the driver that's allocating these do with them? That
> is done via init_eps() in drivers/usb/chipidea/udc.c, which doesn't do
> anything with the allocated memory. This is the only place that the
> driver allocates from this DMA pool, which is done in a loop, so we
> know that the objects allocated from this pool will be in relatively
> quick succession.
>
> So this does not make sense.
>
> I really doubt that there is anything wrong with the kernel - this USB
> driver is used on other SoCs (such as iMX6) and does not exhibit this
> problem - it also works on the Tegra TK1 platform as well.
>
> You are definitely seeing memory corruption here - but given what the
> above looks like, I'd put forward another possible scenario - maybe
> u-boot or something else is leaving a USB controller or some other DMA
> agent active, which is writing over memory while the kernel is trying
> to boot, resulting in memory corruption.
That had occurred to me as well. The kernel command line contains a
couple of memory regions that I think our downstream kernel parses and
uses to reserve memory (redacted here for readability):
console=ttyS0,115200n8
console=tty1
no_console_suspend=1
lp0_vec=2064@0xf46ff000
mem=2015M@2048M
memtype=255
ddr_die=2048M@2048M
section=256M
pmuboard=0x0177:0x0000:0x02:0x43:0x00
tsec=32M@3913M
otf_key=c75e5bb91eb3bd947560357b64422f85
usbcore.old_scheme_first=1
core_edp_mv=1150
core_edp_ma=4000
tegraid=40.1.1.0.0
debug_uartport=lsport,3
power_supply=Adapter
audio_codec=rt5640
modem_id=0
android.kerneltype=normal
fbcon=map:1
commchip_id=0
usb_port_owner_info=0
lane_owner_info=6
emc_max_dvfs=0
touch_id=0@0
board_info=0x0177:0x0000:0x02:0x43:0x00
net.ifnames=0
root=/dev/mmcblk1p1
rw
rootwait
tegraboot=sdmmc
gpt
maxcpus=0
pci=noaer
Two things stand out here:
mem=2015M@2048M
tsec=32M@3913M
So it looks like there are two carveout regions that the kernel isn't
supposed to touch and presumably somebody else could be using them. If
there's overlap between them and the DMA memory used by the DMA pool,
that could perhaps explain what's going on here.
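For reference, the two size@base values can be turned into byte ranges with a few lines of C. This is only a sketch: parse_region_mib() is a hypothetical helper written for this illustration (it is not how the downstream kernel parses these options), and it assumes both numbers carry an M suffix meaning MiB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Half-open physical address range [start, end). */
struct region {
	uint64_t start;
	uint64_t end;
};

/*
 * Parse "<size>M@<base>M" (e.g. "32M@3913M"), both values in MiB.
 * Hypothetical helper for illustration only.
 * Returns 0 on success, -1 if the string does not have that shape.
 */
static int parse_region_mib(const char *s, struct region *r)
{
	char *end;
	unsigned long long size, base;

	assert(s != NULL && r != NULL);

	size = strtoull(s, &end, 10);
	if (end == s || end[0] != 'M' || end[1] != '@')
		return -1;

	s = end + 2;
	base = strtoull(s, &end, 10);
	if (end == s || end[0] != 'M' || end[1] != '\0')
		return -1;

	/* 1 MiB is 1 << 20 bytes. */
	r->start = (uint64_t)base << 20;
	r->end = (uint64_t)(base + size) << 20;
	return 0;
}
```

With the values from the command line above, tsec=32M@3913M comes out as [0xf4900000, 0xf6900000) and mem=2015M@2048M as [0x80000000, 0xfdf00000).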
Can you try the following patch and send the boot log again?
Thanks,
Thierry
--- >8 ---
diff --git a/mm/dmapool.c b/mm/dmapool.c
index 76a160083506..6343d74cb963 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -361,11 +361,11 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
 				continue;
 			if (pool->dev)
 				dev_err(pool->dev,
-					"dma_pool_alloc %s, %p (corrupted)\n",
-					pool->name, retval);
+					"dma_pool_alloc %s, %px/%pad (corrupted)\n",
+					pool->name, retval, handle);
 			else
-				pr_err("dma_pool_alloc %s, %p (corrupted)\n",
-					pool->name, retval);
+				pr_err("dma_pool_alloc %s, %px/%pad (corrupted)\n",
+					pool->name, retval, handle);
 
 			/*
 			 * Dump the first 4 bytes even if they are not
* Re: Unstable Kernel behavior on an ARM based board
2019-03-05 11:57 ` Thierry Reding
@ 2019-03-05 13:16 ` Embedded Engineer
2019-03-05 13:23 ` Russell King - ARM Linux admin
0 siblings, 1 reply; 22+ messages in thread
From: Embedded Engineer @ 2019-03-05 13:16 UTC (permalink / raw)
To: Thierry Reding
Cc: Andrew Lunn, Vladimir Murzin, Russell King - ARM Linux admin,
Jon Hunter, linux-tegra, linux-arm-kernel
That was quite an in-depth analysis that you shared, and it took me some time
to get my head around it :)
On Tue, Mar 5, 2019 at 4:57 PM Thierry Reding <thierry.reding@gmail.com> wrote:
>
> Can you try the following patch and send the boot log again?
Please check the following logs after applying your patch:
https://pastebin.com/hGGKZcLU
Sorry to add to your confusion, but now the board is getting stuck
once in a while at the following:
U-Boot SPL 2014.10-rc2 (Mar 05 2019 - 14:29:35)
U-Boot 2014.10-rc2 (Mar 05 2019 - 14:29:35)
TEGRA124
Board: NVIDIA Jetson TK1
DRAM:
* Re: Unstable Kernel behavior on an ARM based board
2019-03-05 13:16 ` Embedded Engineer
@ 2019-03-05 13:23 ` Russell King - ARM Linux admin
2019-03-05 13:32 ` Embedded Engineer
0 siblings, 1 reply; 22+ messages in thread
From: Russell King - ARM Linux admin @ 2019-03-05 13:23 UTC (permalink / raw)
To: Embedded Engineer
Cc: Andrew Lunn, Vladimir Murzin, Jon Hunter, Thierry Reding,
linux-tegra, linux-arm-kernel
On Tue, Mar 05, 2019 at 06:16:38PM +0500, Embedded Engineer wrote:
> That was quite an in-depth analysis that you shared, and it took me some time
> to get my head around it :)
>
> On Tue, Mar 5, 2019 at 4:57 PM Thierry Reding <thierry.reding@gmail.com> wrote:
> >
> > Can you try the following patch and send the boot log again?
>
> Please check the following logs after applying your patch:
>
> https://pastebin.com/hGGKZcLU
So they're at 0xec056XXX virtual, 0xac056XXX physical, which is about
704MiB into system memory, and nowhere near either of the two regions
that Thierry identified.
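The arithmetic behind that, as a sketch: it assumes the buffer is reached through the kernel's linear mapping, with PAGE_OFFSET at 0xc0000000 and RAM starting at 0x80000000 (the 2048M base from mem=). Neither constant is stated in the thread; they are inferred from the 0x40000000 difference between the two addresses, and the helper names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 32-bit ARM kernel with the linear map starting at 3 GiB. */
#define PAGE_OFFSET 0xc0000000u
/* Assumption: RAM starts at 2 GiB, taken from mem=2015M@2048M. */
#define PHYS_OFFSET 0x80000000u

/* Translate a linear-map virtual address to a physical address. */
static uint32_t lowmem_virt_to_phys(uint32_t vaddr)
{
	assert(vaddr >= PAGE_OFFSET);
	return vaddr - PAGE_OFFSET + PHYS_OFFSET;
}

/* Whole MiB between the start of RAM and a physical address. */
static uint32_t mib_into_ram(uint32_t paddr)
{
	assert(paddr >= PHYS_OFFSET);
	return (paddr - PHYS_OFFSET) >> 20;
}

/* True if addr lies in the half-open range [start, end). */
static int in_range(uint32_t addr, uint32_t start, uint32_t end)
{
	return addr >= start && addr < end;
}
```

Under those assumptions 0xec056200 maps to 0xac056200, which is 704 MiB above the start of RAM, while the tsec region (32M@3913M, i.e. 0xf4900000 to 0xf6900000) starts 1865 MiB above it.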
> Sorry to add to your confusion, but now the board is getting stuck
> once in a while at the following:
>
> U-Boot SPL 2014.10-rc2 (Mar 05 2019 - 14:29:35)
>
> U-Boot 2014.10-rc2 (Mar 05 2019 - 14:29:35)
>
> TEGRA124
> Board: NVIDIA Jetson TK1
> DRAM:
Is there no later u-boot you can use to rule that out?
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up
* Re: Unstable Kernel behavior on an ARM based board
2019-03-05 13:23 ` Russell King - ARM Linux admin
@ 2019-03-05 13:32 ` Embedded Engineer
2019-03-05 14:23 ` Russell King - ARM Linux admin
0 siblings, 1 reply; 22+ messages in thread
From: Embedded Engineer @ 2019-03-05 13:32 UTC (permalink / raw)
To: Russell King - ARM Linux admin
Cc: Andrew Lunn, Vladimir Murzin, Jon Hunter, Thierry Reding,
linux-tegra, linux-arm-kernel
On Tue, Mar 5, 2019 at 6:23 PM Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:
>
> Is there no later u-boot you can use to rule that out?
This u-boot was working just fine with our board, so I didn't try
updating it to a newer version. Also, I think the downstream u-boot has
different text base addresses than the mainline one, so I didn't put
any effort into that.
* Re: Unstable Kernel behavior on an ARM based board
2019-03-05 13:32 ` Embedded Engineer
@ 2019-03-05 14:23 ` Russell King - ARM Linux admin
2019-03-05 14:57 ` Embedded Engineer
0 siblings, 1 reply; 22+ messages in thread
From: Russell King - ARM Linux admin @ 2019-03-05 14:23 UTC (permalink / raw)
To: Embedded Engineer
Cc: Andrew Lunn, Vladimir Murzin, Jon Hunter, Thierry Reding,
linux-tegra, linux-arm-kernel
On Tue, Mar 05, 2019 at 06:32:19PM +0500, Embedded Engineer wrote:
> On Tue, Mar 5, 2019 at 6:23 PM Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
> >
> > Is there no later u-boot you can use to rule that out?
>
> This u-boot was working just fine with our board, so I didn't try
> updating it to a newer version. Also, I think the downstream u-boot has
> different text base addresses than the mainline one, so I didn't put
> any effort into that.
As it is also suffering from the "hanging" issue, it seems that the
problem is not specific to the kernel.
It leaves only a few possible causes:
1. The board firmware (including u-boot) is enabling some DMA that is
causing corruption of some RAM.
2. You really do have an issue between the CPU and RAM causing
random-ish data corruption.
It may be worth getting mm/dmapool.c to print the hexdump a number of
times to see whether the data read from the corrupted region changes.
Around line 372, there is a call to print_hex_dump(). Just replicate
that a number of times.
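The same idea in standalone form, as a sketch rather than a kernel patch: read the suspect bytes several times and see whether any pass differs from the first. count_changed_passes() and DUMP_LEN are names invented here; in the kernel the equivalent would simply be repeating the print_hex_dump() call as described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DUMP_LEN 64	/* matches the 0x40-byte objects in the ci_hw_qh dumps */

/* Copy len bytes through a volatile pointer so every byte is really re-read. */
static void snapshot(const volatile uint8_t *src, uint8_t *dst, size_t len)
{
	for (size_t i = 0; i < len; i++)
		dst[i] = src[i];
}

/*
 * Read DUMP_LEN bytes at region `passes` times in total and return how
 * many of the later passes differ from the first one.
 */
static unsigned int count_changed_passes(const volatile uint8_t *region,
					 unsigned int passes)
{
	uint8_t first[DUMP_LEN], again[DUMP_LEN];
	unsigned int changed = 0;

	assert(region != NULL);

	snapshot(region, first, DUMP_LEN);
	for (unsigned int p = 1; p < passes; p++) {
		snapshot(region, again, DUMP_LEN);
		if (memcmp(first, again, DUMP_LEN) != 0)
			changed++;
	}
	return changed;
}
```

If later passes differ from the first, that would point towards unreliable reads; if every pass shows the same wrong bytes, the bad value is more likely to really be in RAM.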
Another idea would be to print a hexdump of each object as it's
allocated and the next object.
Maybe something like this (untested, may need tweaks to get it to build,
you'll also need to revert Thierry's patch):
diff --git a/mm/dmapool.c b/mm/dmapool.c
index 6d4b97e7e9e9..3db1e9b63809 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -219,6 +219,47 @@ static void pool_initialise_page(struct dma_pool *pool, struct dma_page *page)
 	} while (offset < pool->allocation);
 }
 
+#ifdef DMAPOOL_DEBUG
+static int verify_one(struct dma_pool *pool, struct dma_page *page,
+		      unsigned int offset, const char *desc)
+{
+	dma_addr_t handle = page->dma + offset;
+	u8 *data = page->vaddr + offset;
+	int i;
+
+	for (i = sizeof(page->offset); i < pool->size; i++) {
+		if (data[i] == POOL_POISON_FREED)
+			continue;
+		if (pool->dev)
+			dev_err(pool->dev,
+				"%s %s, %pad (corrupted)\n",
+				desc, pool->name, &handle);
+		else
+			pr_err("%s %s, %pad (corrupted)\n",
+				desc, pool->name, &handle);
+
+		/*
+		 * Dump the first 4 bytes even if they are not
+		 * POOL_POISON_FREED
+		 */
+		print_hex_dump(KERN_ERR, "", DUMP_PREFIX_OFFSET, 16, 1,
+				data, pool->size, 1);
+		return 1;
+	}
+	return 0;
+}
+
+static void verify_free(struct dma_pool *pool, struct dma_page *page, const char *desc)
+{
+	unsigned int offset;
+
+	for (offset = page->offset; offset < page->allocation;
+	     offset = *(int *)(page->vaddr + offset))
+		if (verify_one(pool, page, offset, desc))
+			break;
+}
+#endif
+
 static struct dma_page *pool_alloc_page(struct dma_pool *pool, gfp_t mem_flags)
 {
 	struct dma_page *page;
@@ -235,6 +276,9 @@ static struct dma_page *pool_alloc_page(struct dma_pool *pool, gfp_t mem_flags)
 		pool_initialise_page(pool, page);
 		page->in_use = 0;
 		page->offset = 0;
+#ifdef DMAPOOL_DEBUG
+		verify_free(pool, page, "pool_alloc_page");
+#endif
 	} else {
 		kfree(page);
 		page = NULL;
@@ -345,35 +389,17 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
 	list_add(&page->page_list, &pool->page_list);
 ready:
 	page->in_use++;
+#ifdef DMAPOOL_DEBUG
+	verify_free(pool, page, "dma_pool_alloc pre");
+#endif
 	offset = page->offset;
 	page->offset = *(int *)(page->vaddr + offset);
 	retval = offset + page->vaddr;
 	*handle = offset + page->dma;
 #ifdef DMAPOOL_DEBUG
-	{
-		int i;
-		u8 *data = retval;
-		/* page->offset is stored in first 4 bytes */
-		for (i = sizeof(page->offset); i < pool->size; i++) {
-			if (data[i] == POOL_POISON_FREED)
-				continue;
-			if (pool->dev)
-				dev_err(pool->dev,
-					"dma_pool_alloc %s, %p (corrupted)\n",
-					pool->name, retval);
-			else
-				pr_err("dma_pool_alloc %s, %p (corrupted)\n",
-					pool->name, retval);
-
-			/*
-			 * Dump the first 4 bytes even if they are not
-			 * POOL_POISON_FREED
-			 */
-			print_hex_dump(KERN_ERR, "", DUMP_PREFIX_OFFSET, 16, 1,
-					data, pool->size, 1);
-			break;
-		}
-	}
+	verify_one(pool, page, offset, "dma_pool_alloc");
+	if (page->offset < pool->allocation)
+		verify_one(pool, page, page->offset, "dma_pool_alloc next");
 	if (!(mem_flags & __GFP_ZERO))
 		memset(retval, POOL_POISON_ALLOCATED, pool->size);
 #endif
* Re: Unstable Kernel behavior on an ARM based board
2019-03-05 14:23 ` Russell King - ARM Linux admin
@ 2019-03-05 14:57 ` Embedded Engineer
2019-03-05 14:58 ` Russell King - ARM Linux admin
0 siblings, 1 reply; 22+ messages in thread
From: Embedded Engineer @ 2019-03-05 14:57 UTC (permalink / raw)
To: Russell King - ARM Linux admin
Cc: Andrew Lunn, Vladimir Murzin, Jon Hunter, Thierry Reding,
linux-tegra, linux-arm-kernel
On Tue, Mar 5, 2019 at 7:23 PM Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:
> + for (offset = page->offset; offset < page->allocation;
> + offset = *(int *)(page->vaddr + offset))
error: 'struct dma_page' has no member named 'allocation'. So I
replaced 'page->allocation' with 'page->in_use'. Did you really mean
that? If so, the boot logs are here:
https://pastebin.com/rgfGdYcj
* Re: Unstable Kernel behavior on an ARM based board
2019-03-05 14:57 ` Embedded Engineer
@ 2019-03-05 14:58 ` Russell King - ARM Linux admin
2019-03-05 15:11 ` Embedded Engineer
0 siblings, 1 reply; 22+ messages in thread
From: Russell King - ARM Linux admin @ 2019-03-05 14:58 UTC (permalink / raw)
To: Embedded Engineer
Cc: Andrew Lunn, Vladimir Murzin, Jon Hunter, Thierry Reding,
linux-tegra, linux-arm-kernel
On Tue, Mar 05, 2019 at 07:57:18PM +0500, Embedded Engineer wrote:
> On Tue, Mar 5, 2019 at 7:23 PM Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
> > + for (offset = page->offset; offset < page->allocation;
> > + offset = *(int *)(page->vaddr + offset))
>
> error: 'struct dma_page' has no member named 'allocation'. So I
> replaced 'page->allocation' with 'page->in_use'. Did you really mean
> that? If so, the boot logs are here:
Should've been pool->allocation. Sorry about that.
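To see why the bound matters, here is a small userspace model of the free-list walk. It is a sketch: PAGE_BYTES and OBJ_BYTES stand in for pool->allocation and pool->size, the page setup is a simplified version of what pool_initialise_page() does (it ignores pool->boundary), and the function names are made up for this example. page->in_use is a count of allocated objects, not a byte offset, so using it as the bound stops the walk almost immediately.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_BYTES   4096u	/* stands in for pool->allocation */
#define OBJ_BYTES    64u	/* stands in for pool->size */
#define POISON_FREED 0xa7	/* free-object poison byte */

/* Poison the page, then chain each object to the next via its first word. */
static void init_page(uint8_t *page)
{
	uint32_t offset = 0;

	memset(page, POISON_FREED, PAGE_BYTES);
	while (offset < PAGE_BYTES) {
		uint32_t next = offset + OBJ_BYTES;

		memcpy(page + offset, &next, sizeof(next));
		offset = next;
	}
}

/*
 * Follow the "next" words starting at first_free and count how many
 * objects are visited while the offset stays below bound.
 */
static unsigned int walk_free_list(const uint8_t *page, uint32_t first_free,
				   uint32_t bound)
{
	unsigned int visited = 0;
	uint32_t offset = first_free;

	assert(bound <= PAGE_BYTES);

	while (offset < bound) {
		visited++;
		memcpy(&offset, page + offset, sizeof(offset));
	}
	return visited;
}
```

With a 4096-byte page of 64-byte objects, a bound of 4096 visits all 64 free objects. A bound of 0 or 1, which are the values in_use has at the two verify_free() call sites on a fresh page, visits none or one.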
* Re: Unstable Kernel behavior on an ARM based board
2019-03-05 14:58 ` Russell King - ARM Linux admin
@ 2019-03-05 15:11 ` Embedded Engineer
2019-03-05 15:31 ` Russell King - ARM Linux admin
0 siblings, 1 reply; 22+ messages in thread
From: Embedded Engineer @ 2019-03-05 15:11 UTC (permalink / raw)
To: Russell King - ARM Linux admin
Cc: Andrew Lunn, Vladimir Murzin, Jon Hunter, Thierry Reding,
linux-tegra, linux-arm-kernel
On Tue, Mar 5, 2019 at 7:58 PM Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:
>
> Should've been pool->allocation. Sorry about that.
No problems, here are the new logs:
https://pastebin.com/dfey3LwB
* Re: Unstable Kernel behavior on an ARM based board
2019-03-05 15:11 ` Embedded Engineer
@ 2019-03-05 15:31 ` Russell King - ARM Linux admin
2019-03-05 15:44 ` Embedded Engineer
[not found] ` <47d0bea5-0dfe-9ce0-fedc-92061769e0a1@gmx.net>
0 siblings, 2 replies; 22+ messages in thread
From: Russell King - ARM Linux admin @ 2019-03-05 15:31 UTC (permalink / raw)
To: Embedded Engineer
Cc: Andrew Lunn, Vladimir Murzin, Jon Hunter, Thierry Reding,
linux-tegra, linux-arm-kernel
On Tue, Mar 05, 2019 at 08:11:22PM +0500, Embedded Engineer wrote:
> On Tue, Mar 5, 2019 at 7:58 PM Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
> >
> > Should've been pool->allocation. Sorry about that.
>
> No problems, here are the new logs:
>
> https://pastebin.com/dfey3LwB
Thanks - the patch I posted substantially increases the amount of checking
that is done... so not surprisingly we find new forms of corruption:
tegra-ehci 7d004000.usb: pool_alloc_page ehci_qh, 0xac050240 (corrupted)
00000000: a0 02 00 00 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ....kkkkkkkkkkkk
00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
00000020: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
00000040: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
00000050: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
and that corruption occurred _right_ after we allocated the page, memset
the entire page to 0xa7, and wrote the "next" pointers.
Again, similar scenario to the above:
tegra-ehci 7d004000.usb: pool_alloc_page ehci_qtd, 0xac0510c0 (corrupted)
00000000: 20 01 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ...............
00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
00000020: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
00000040: e0 01 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
00000050: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
which is again right after the page is allocated and initialised.
If we look at the ci_hw_qh case, which is the one originally identified:
tegra-udc 7d000000.usb: pool_alloc_page ci_hw_qh, 0xac056080 (corrupted)
00000000: c0 00 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
00000020: 80 00 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 ................
Again, just allocated the coherent DMA page, memset() it and written
the offsets to it, and it is already corrupted. Tegra124 does not
appear to be dma-coherent, so these allocations will be for normal,
uncached memory. That means the cache won't be loading entire
cachelines at a time from memory for these accesses, but will be
reading them byte by byte as we print the hex values.
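For anyone following along, the dumps can be decoded mechanically. This is a sketch only: page_offset(), next_offset() and first_bad_byte() are names invented here, 0xa7 is the POOL_POISON_FREED value from mm/dmapool.c, and a 4 KiB page is assumed when taking the offset from the DMA address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_POISON_FREED 0xa7	/* poison byte for free pool objects */

/* Offset of an object within its page, assuming 4 KiB pages. */
static uint32_t page_offset(uint32_t dma_addr)
{
	return dma_addr & 0xfffu;
}

/* Little-endian "next free offset" held in the first four bytes. */
static uint32_t next_offset(const uint8_t *obj)
{
	return (uint32_t)obj[0] | (uint32_t)obj[1] << 8 |
	       (uint32_t)obj[2] << 16 | (uint32_t)obj[3] << 24;
}

/*
 * Index of the first byte after the next-offset word that is not the
 * poison value, or -1 if the rest of the object is intact.
 */
static int first_bad_byte(const uint8_t *obj, size_t size)
{
	assert(obj != NULL);

	for (size_t i = sizeof(uint32_t); i < size; i++)
		if (obj[i] != POOL_POISON_FREED)
			return (int)i;
	return -1;
}
```

For the ci_hw_qh dump, the DMA address 0xac056080 gives page offset 0x080, the first word gives next = 0x0c0 (so objects are 0x40 bytes), and the first non-poison byte is at 0x20, matching the earlier logs. The ehci_qh dump (0xac050240, next = 0x2a0) implies 0x60-byte objects with the first bad byte at offset 4.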
The window for this corruption occurring is now very small.
Right now, I don't have anything further to add beyond what I've
already suggested as causes - this is *definitely* memory corruption
either by something else writing to memory, by the CPU writes not
properly being stored in RAM or the CPU not being able to reliably
read data back from RAM.
I wonder whether any of the memory testers run with normal, uncached
memory.
* Re: Unstable Kernel behavior on an ARM based board
2019-03-05 15:31 ` Russell King - ARM Linux admin
@ 2019-03-05 15:44 ` Embedded Engineer
2019-03-15 8:55 ` Marcel Ziswiler
[not found] ` <47d0bea5-0dfe-9ce0-fedc-92061769e0a1@gmx.net>
1 sibling, 1 reply; 22+ messages in thread
From: Embedded Engineer @ 2019-03-05 15:44 UTC (permalink / raw)
To: Russell King - ARM Linux admin
Cc: Andrew Lunn, Vladimir Murzin, Jon Hunter, Thierry Reding,
linux-tegra, linux-arm-kernel
On Tue, Mar 5, 2019 at 8:31 PM Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:
>
> Right now, I don't have anything further to add beyond what I've
> already suggested as causes - this is *definitely* memory corruption
> either by something else writing to memory, by the CPU writes not
> properly being stored in RAM or the CPU not being able to reliably
> read data back from RAM.
Thanks a lot for your help. I will try updating u-boot to a newer version
so that we can eliminate the chance that u-boot has left something on
in an undesired state.
* Re: Unstable Kernel behavior on an ARM based board
[not found] ` <47d0bea5-0dfe-9ce0-fedc-92061769e0a1@gmx.net>
@ 2019-03-09 7:50 ` Embedded Engineer
0 siblings, 0 replies; 22+ messages in thread
From: Embedded Engineer @ 2019-03-09 7:50 UTC (permalink / raw)
To: Clemens Koller, Thierry Reding, linux-tegra, Andrew Lunn,
Vladimir Murzin, linux-arm-kernel, Jon Hunter
On Tue, Mar 5, 2019 at 9:01 PM Clemens Koller <clemens.ml@gmx.net> wrote:
>
> Yes, this really smells like memory timing issues.
> Did you try the more extensive memory test of the latest u-boot? The regular one is quite naive. This is usually *not* enabled as default.
Unfortunately I was unable to get the latest (or any other upstream)
u-boot running on my board, or even on the Jetson TK1 kit. Mainline
u-boot appears to have support for the Jetson TK1, but I don't
know why it's not working. The mtest command in NVIDIA's downstream
version of u-boot did not report any errors.
> Then, a Shmoo plot with different memory timing/voltage/temperature might be useful as well as a PCB layout review.
I tried running the Shmoo test again, but it generated the BCT file with
the same parameters, so it didn't seem to work.
I also stopped u-boot at the command line and checked with an oscilloscope
whether there is any activity on the data lines between the DDR and the
TK1 processor. There was no activity on the data lines, which makes me
believe that when u-boot is idle, no peripheral (or DMA) is writing to
memory and corrupting the memory used by the kernel, as suggested
by Thierry Reding.
Can anyone suggest something else? It seems I'm running out of
options to test :(
* Re: Unstable Kernel behavior on an ARM based board
2019-03-05 15:44 ` Embedded Engineer
@ 2019-03-15 8:55 ` Marcel Ziswiler
0 siblings, 0 replies; 22+ messages in thread
From: Marcel Ziswiler @ 2019-03-15 8:55 UTC (permalink / raw)
To: linux@armlinux.org.uk, embed786@gmail.com
Cc: andrew@lunn.ch, vladimir.murzin@arm.com, jonathanh@nvidia.com,
thierry.reding@gmail.com, linux-tegra@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Tue, 2019-03-05 at 20:44 +0500, Embedded Engineer wrote:
> On Tue, Mar 5, 2019 at 8:31 PM Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
> > Right now, I don't have anything further to add beyond what I've
> > already suggested as causes - this is *definitely* memory
> > corruption
> > either by something else writing to memory, by the CPU writes not
> > properly being stored in RAM or the CPU not being able to reliably
> > read data back from RAM.
>
> Thanks a lot for your help. I will try updating u-boot to a newer
> version so that we can eliminate the chance that u-boot has left
> something on in an undesired state.
Sorry, I just saw this thread now. I have quite some TK1 experience
from our Apalis TK1 bring-up. For us, mainline U-Boot works quite nicely,
but I do remember some magic called RAM repair that NVIDIA did in
their downstream, which fixed a strange hang issue we saw at times
during our extensive validation & verification:
http://git.toradex.com/cgit/u-boot-toradex.git/commit/?h=2016.11-toradex&id=df2b46ba248687c208767865abe5fca32a43faaf