From mboxrd@z Thu Jan  1 00:00:00 1970
From: jgunthorpe@obsidianresearch.com (Jason Gunthorpe)
Date: Fri, 10 Jan 2014 17:19:32 -0700
Subject: [PATCH] ARM: Kirkwood: Add support for Excito Bubba B3
In-Reply-To: <20140110230248.GO9681@lunn.ch>
References: <1388247131-19301-1-git-send-email-andrew@lunn.ch>
 <20131228170114.GH19878@titan.lakedaemon.net>
 <20140102194924.GA3321@obsidianresearch.com>
 <1388702192.16958.54.camel@hastur.hellion.org.uk>
 <20140102230832.GB9339@obsidianresearch.com>
 <20140110192032.GQ19878@titan.lakedaemon.net>
 <20140110194437.GJ18269@obsidianresearch.com>
 <20140110230248.GO9681@lunn.ch>
Message-ID: <20140111001931.GM18269@obsidianresearch.com>
To: linux-arm-kernel@lists.infradead.org
List-Id: linux-arm-kernel.lists.infradead.org

On Sat, Jan 11, 2014 at 12:02:48AM +0100, Andrew Lunn wrote:
> > > So, I'd prefer to handle this more gracefully.  I don't have much
> > > experience at the low-level init of the caches, couldn't we enable and
> > > flush rather than throwing the error?
> > 
> > It cannot be solved in cache-feroceon-l2.c.
> > 
> > In many cases you won't even get that far:
> > - It will blow up in head.S when the cache is off: decompressor wrote
> >   writeback data into the L2 and uncached fetches see memory
> >   content prior to decompression
> > - It will blow up after head.S enables the L1: decompressor wrote
> >   data into the L2 but the relocation writes done with the L1 off
> >   made changes to memory that were not captured in the L2.
> 
> My impression from booting many times while doing a bisect was
> that it got as far as:
> 
> Uncompressing Linux... done, booting the kernel.
> Booting Linux on physical CPU 0x0
> Linux version 3.13.0-rc1-dirty (lunn at laptop) (gcc version 4.4.5 (Debian 4.4.5-8) ) #7 PREEMPT Fri Jan 3 09:25:07 CST 2014           
> CPU: Feroceon 88FR131 [56251311] revision 1 (ARMv5TE), cr=00053977
> CPU: VIVT data cache, VIVT instruction cache
> Machine: Marvell Kirkwood (Flattened Device Tree), model: QNAP TS219 family
> Memory policy: ECC disabled, Data cache writeback
> 
> every time.

It should be deterministic, but 'random' in the sense that it is
sensitive to exactly the data pattern generated by the decompressor and
exactly the data pattern generated by the relocation writes. So the
corruption will vary depending on the kernel image content (position
of the relocations) and also the size of the L2.

You are likely seeing the second failure I predicted, which occurs
when a cache line containing a relocation is used and happens to
still be in the L2.

The first failure would require the kernel image to be smaller than
the cache or the cache eviction policy to be such that the initial
part of the decompressed image remains dirty and unevicted in the
L2.
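
As a rough illustration, here is a toy Python model of the two failure
modes. It is not kernel code: the class and function names are invented
for this sketch, the addresses and values are placeholders, and each
address stands in for a whole cache line.

```python
class ToyWriteBackL2:
    """Toy write-back cache in front of a dict that plays the role of RAM."""

    def __init__(self, memory):
        self.memory = memory   # backing RAM: address -> value
        self.dirty = {}        # dirty lines not yet written back to RAM

    def write(self, addr, value):
        # Write-back policy: the line is updated in the cache only.
        self.dirty[addr] = value

    def read(self, addr):
        # A hit in the cache wins over whatever RAM holds.
        if addr in self.dirty:
            return self.dirty[addr]
        return self.memory[addr]


def failure_with_cache_off():
    """First failure: an uncached fetch sees pre-decompression memory."""
    memory = {0x8000: "stale"}
    l2 = ToyWriteBackL2(memory)
    l2.write(0x8000, "kernel")   # decompressor output stays dirty in the L2
    return memory[0x8000]        # fetch with caches off goes straight to RAM


def failure_after_cache_on():
    """Second failure: a cached read misses the relocation write."""
    memory = {0x8000: "stale"}
    l2 = ToyWriteBackL2(memory)
    l2.write(0x8000, "insn")         # decompressor output, dirty in the L2
    memory[0x8000] = "insn+reloc"    # relocation write bypasses the L2
    return l2.read(0x8000)           # cache hit returns the unrelocated line
```

In this model the first function returns the stale memory content and
the second returns the line without its relocation, which is the
behaviour described in the two cases above.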

> It does however seem possible to insert platform specific code soon
> after it prints:
> 
> Machine: Marvell Kirkwood (Flattened Device Tree), model: QNAP TS219 family

So if you attempt to do the flush/disable at this point you are
relying on the L2 containing no dirty lines that overlap with lines
that were changed by the relocator while the cache was off.

This is *probably* true in many cases, but it is sketchy, and creates
a hidden, hard-to-find pitfall. This is basically the same 'probably'
that enables non-relocated kernels to boot.

Basically, you might get a bootable system, but it is technically
wrong and has a theoretical hole.
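
To make the 'probably' concrete, here is a minimal Python sketch of a
late flush. The helper names, addresses and values are hypothetical,
each address stands in for one cache line, and the relocator is assumed
to write straight to RAM with the cache off.

```python
def flush_l2(memory, l2_dirty):
    """Write every dirty toy-L2 line back to RAM, then drop the lines."""
    memory.update(l2_dirty)
    l2_dirty.clear()


def late_flush(l2_dirty_addrs):
    """Boot with a relocation at 0x0 and flush the L2 afterwards.

    l2_dirty_addrs is the set of lines still dirty in the L2 when the
    relocator runs; any other line is assumed to have been evicted to
    RAM already.
    """
    image = {0x0: "insn", 0x20: "data"}          # decompressor output
    l2_dirty = {a: v for a, v in image.items() if a in l2_dirty_addrs}
    memory = {a: (v if a not in l2_dirty else "stale")
              for a, v in image.items()}
    memory[0x0] = "insn+reloc"                   # relocator, cache off
    flush_l2(memory, l2_dirty)                   # the late flush/disable
    return memory


# No overlap: the relocated line was already evicted, so the flush is safe.
safe = late_flush({0x20})
# Overlap: the dirty line is written back over the relocation.
broken = late_flush({0x0, 0x20})
```

With no overlap the relocation survives the flush; with overlap the
write-back replaces the relocated line with the unrelocated one. Which
case you get depends on what happens to be dirty in the L2 at that
moment.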

Jason