linux-arm-kernel.lists.infradead.org archive mirror
* Memory Incoherence Issue
@ 2017-02-03  2:00 Doug Berger
  2017-02-09 14:33 ` Will Deacon
From: Doug Berger @ 2017-02-03  2:00 UTC (permalink / raw)
  To: linux-arm-kernel

We have a device that is based on a dual-core A15 MPCore host CPU
complex that has been exhibiting a problem with very infrequent memory
corruption when exercising a user space memory tester program
(memtester) in a system designed around a v3.14 Linux environment.
Unfortunately, it is not possible to update this system to the latest
kernel version for testing at this time.

We originally suspected hardware issues with the memory, but found no
apparent dependencies on environmental factors such as voltage and
temperature.

The behavior is similar to the issue that was patched in the ARM
architecture Linux kernel and referenced here:
http://lists.infradead.org/pipermail/linux-arm-kernel/2015-January/319761.html

This patch is included in our kernel, and our cores are supposed to
contain the hardware fix for ARM erratum 798181; as a result, while the
kernel contains the ARM_ERRATA_798181 patch, erratum_a15_798181_handler()
is NULL.

The general failure case can be described as follows:
A memtester process runs a set of simple memory tests over an address
range. The range is allocated at the start of the program (based on
command-line parameters) and is split into two buffers (named buf_a and
buf_b) whose virtual addresses are separated by a fixed offset of half
the size of the range. Each memory test follows the same basic
procedure: write a pattern to both buffers, then read back and compare
the results. The buffers are accessed through pointers to volatile
unsigned long integers (32-bit in this case) in simple loops over the
size of each buffer, with each pointer dereferenced and incremented on
every iteration.

For example, a specific memory test might contain one loop that writes a
value to the first unsigned long integer location in buf_a and to the
first unsigned long integer location in buf_b, increments both pointers,
and repeats until both buffers are filled with the same content. After
the first loop completes, a second loop reads the first unsigned long
integer location in buf_a and the first in buf_b and compares them. If
the read values do not match, an error message is output that displays
the mismatched values (the pointers are dereferenced again for the
displayed values). The second loop then advances the pointers and
continues comparing corresponding entries in each buffer until all have
been compared.

The memtester program is configured to repeat a set of memory tests over
the same cacheable, shareable, normal memory address range indefinitely.

In preproduction testing, we received reports that when the memtester
process was run on approximately 100 systems, a few would output error
messages reflecting mismatched values after a day or two, and we have
been trying to determine the cause of the errors.

Observations:

The most common failure pattern is a mismatch over a 32KB (32768-byte)
range within the buffers during a single memory test, with subsequent
memory tests showing no errors.

The next most common pattern is a mismatch over a 64-byte (cache line
length) range within the buffers during a single memory test, with
subsequent memory tests showing no errors.

When the data pattern of a particular memory test can be recognized, the
error messages generally show the mismatched data displayed from buf_a
and buf_b to be from two consecutive tests (i.e. one buffer appears to
hold stale data within the mismatch range).

The mismatched ranges appear to start on virtual addresses aligned to
the size of the mismatch range. For 32KB mismatches, the underlying
physical addresses are only page aligned (i.e. 4KB, not 32KB).

There is no obvious correlation in the location of a mismatch range
within a buffer.

Our L1 cache size is 32KB, but it seems unlikely that the alternating
buffer access pattern of memtester would allow the L1 data cache to
contain only lines from one buffer, which would be needed to account for
32KB of stale data.

One theory is that a page table walk might somehow read the wrong values
in a cache line of page table entries. Since we are using long
descriptors in our translation tables, this would amount to 8 64-bit
page table entries and would mismap 8 4KB pages, or 32KB. However, we
have not been able to come up with a scenario that could cause this.

We tried switching to short descriptors for the page tables
(CONFIG_ARM_LPAE=n) to see whether we would start getting 64KB failure
ranges, which would support this theory, but we have yet to see any
failure range longer than 64 bytes in that configuration.
There is some evidence in our testing that the failures may require
process migrations between processor cores, since using taskset to set
the affinity of the processes appears to prevent the problem. We have
tried running multiple memtester processes in parallel, and also forcing
memtester processes to switch back and forth between processors, with
perhaps a slightly higher failure rate, though likely not statistically
significant.
Tests with many processes seem to show more 64-byte (or shorter)
failures, and the mismatched data seems less likely to be from two
consecutive tests. The data values may be from two different tests, and
in some more interesting cases one of the buffers is observed to contain
page table entries, which suggests data leakage between user space
processes.
The error behavior is almost always transient, with the appearance that
a comparison is using stale data (e.g. from a cache) that becomes
coherent during the compare loop. Some mismatch ranges are shorter than
64 bytes or 32KB. We have even seen the extreme case where the values
read and compared mismatched, but when they were reread for output in
the error message the values matched, even though there were no writes
to the buffers between the reads.
We have also had some failures where the mismatch range is stable over
subsequent memory tests. In these cases it appears that the values of
one of the buffers in a 32KB mismatch range match the content of our
boot ROM. It is suspected that the writes of a test pattern may be
corrupting a page table such that the corresponding virtual addresses
are being mapped to the boot ROM. Attempts by memtester to write the
next pattern to the buffer fail to change the value of the ROM so the
failures reappear in the same 32KB range of the buffers in each memory
test that follows the first failure. The expected test pattern in this
case was 0x00000800FFFFF7FF, which, if stored in a long-descriptor page
table entry, would point to our boot ROM physical address of
0x00FFFFF000. However, I would expect a user space write to this address
to fault, since AP[2:1] is 11b (read-only at all privilege levels).

My current thinking is that the data cache lines themselves may not be
getting directly corrupted; perhaps there is instead a problem with the
cache indexing that allows the wrong cache line content to be returned
on a cache read, or allows a cache write to store data in the wrong
cache line. It would appear from the failure logs that, under some
circumstance, the data transactions initiated by the TLB page-table-walk
bus master and those initiated by the CPU load/store master may
interfere in a way that allows the data from one to be incorrectly
observed within the data cache(s) by the other.

Does this type of failure ring any bells?
Are you aware of any test code or procedures that specifically stress
these hardware subsystems (i.e. the TLB and data caches) to detect
timing or implementation errors in an A15 MPCore system?
We would greatly appreciate any suggestions about what may be happening,
methods of gaining increased visibility into the source of the failures,
or further experiments you think might help determine their root cause
and its solution.

Regards,
    Doug


* Memory Incoherence Issue
  2017-02-03  2:00 Memory Incoherence Issue Doug Berger
@ 2017-02-09 14:33 ` Will Deacon
  2017-03-01  0:50   ` Doug Berger
From: Will Deacon @ 2017-02-09 14:33 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Doug,

On Thu, Feb 02, 2017 at 06:00:02PM -0800, Doug Berger wrote:
> We have a device that is based on a dual-core A15 MPCore host CPU
> complex that has been exhibiting a problem with very infrequent memory
> corruption when exercising a user space memory tester program
> (memtester) in a system designed around a v3.14 Linux environment.
> Unfortunately, it is not possible to update this system to the latest
> kernel version for testing at this time.

So what are the options for changing the kernel being used here? Are you
using v3.14, or a stable variant?

There have been many fixes since 3.14 (e.g. 8e6480667246 ("ARM: 8299/1:
mm: ensure local active ASID is marked as allocated on rollover")), so
you could simply be hitting a known, fixed issue.

Will


* Memory Incoherence Issue
  2017-02-09 14:33 ` Will Deacon
@ 2017-03-01  0:50   ` Doug Berger
From: Doug Berger @ 2017-03-01  0:50 UTC (permalink / raw)
  To: linux-arm-kernel

On 02/09/2017 06:33 AM, Will Deacon wrote:
> Hi Doug,
> 
> On Thu, Feb 02, 2017 at 06:00:02PM -0800, Doug Berger wrote:
>> We have a device that is based on a dual-core A15 MPCore host CPU
>> complex that has been exhibiting a problem with very infrequent memory
>> corruption when exercising a user space memory tester program
>> (memtester) in a system designed around a v3.14 Linux environment.
>> Unfortunately, it is not possible to update this system to the latest
>> kernel version for testing at this time.
> 
> So what are the options for changing the kernel being used here? Are you
> using v3.14, or a stable variant?
> 
> There have been many fixes since 3.14 (e.g. 8e6480667246 ("ARM: 8299/1:
> mm: ensure local active ASID is marked as allocated on rollover")), so
> you could simply be hitting a known, fixed issue.
> 
> Will
> 

Following up for the curious:
The observed failures have been associated with a software bug in the
Broadcom Brahma B15 readahead cache support patch originally submitted
for review here:
http://lists.infradead.org/pipermail/linux-arm-kernel/2015-March/328706.html

The software bug was caught and corrected before the updated Broadcom
Brahma B15 readahead cache support patch was resubmitted here:

So you were correct that we were in fact hitting a known, fixed issue.

Unfortunately, the fix had not made its way into the 3.14 kernel used on
this system.

As usual, the error is obvious once you know it's there: the polling
loop masked the read-back value with the register offset
(RAC_FLUSH_REG) instead of the flush status bit (FLUSH_RAC):
static inline void __b15_rac_flush(void)
{
	u32 reg;

	__raw_writel(FLUSH_RAC, b15_rac_base + RAC_FLUSH_REG);
	do {
		/* This dmb() is required to force the Bus Interface Unit
		 * to clean outstanding writes, and forces an idle cycle
		 * to be inserted.
		 */
		dmb();
		reg = __raw_readl(b15_rac_base + RAC_FLUSH_REG);
-	} while (reg & RAC_FLUSH_REG);
+	} while (reg & FLUSH_RAC);
}

My only consolation is that you missed it too ;).

Thanks so much for your consideration and support,
    Doug

