* I-cache/D-cache inconsistency issue with page cache
@ 2011-09-23 11:57 Mike Hommey
2011-09-23 19:39 ` Russell King - ARM Linux
0 siblings, 1 reply; 11+ messages in thread
From: Mike Hommey @ 2011-09-23 11:57 UTC (permalink / raw)
To: linux-arm-kernel
Hi,
We've been hitting random crashes at startup with Firefox on tegras
(under Android), and narrowed it down to an I-cache/D-cache
inconsistency. A reduced testcase of the issue looks like the following
(compile as ARM, not Thumb):
-----------------8<--------------
#include <sys/mman.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

/* A one-instruction ARM function that returns immediately. */
__asm__(
    ".text\n"
    ".align 4\n"
    ".type foo, %function\n"
    "foo:\n"
    " bx lr\n"
);
static void foo() __attribute__((used));

int main(int argc, char *argv[]) {
    if (argc < 2)
        return 0;
    int fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0600);
    ftruncate(fd, 4096);
    /* Write the instruction into the file through a shared mapping. */
    void *m = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0);
    memcpy(m, foo, 4);
    munmap(m, 4096);
    /* Map the same page again, executable this time, and jump into it. */
    void *mx = mmap(NULL, 4096, PROT_EXEC, MAP_SHARED, fd, 0);
    void (*func)(void) = (void (*)(void)) mx;
    func();
    return 0;
}
----------------->8--------------
We've been able to reliably reproduce this with the reduced testcase above
on tegras under both Android and Ubuntu (Maverick). It doesn't seem to
happen on all kinds of ARM processors, though.
A corresponding real world use case is that we are (were) uncompressing
libraries in mmap()ed memory and dlopen()ing the resulting file. We have
been doing so for a long time, but only recently we got a library small
enough to trigger an actual problem.
Something along these lines has been discussed on this very list:
http://lists.infradead.org/pipermail/linux-arm-kernel/2009-September/001074.html
What happens in practice with the above code is that by the time we jump
into the copied function, RAM still has the zeroed out page, while the
actual content is still in D-cache. Execution thus happens on zeroes, up
to the point it reaches the next page, which in most cases would not be
mapped, thus a segmentation fault.
In our real world scenario, the execution would start on zeroes, up to
some point where memory would have actual content, at which time we
crash with SIGILL at a cache line boundary (addresses ending in 0x20,
0x40, 0x60 or 0x80), depending on how much D-cache would have been
flushed in between because there are various things happening between
the uncompression and the execution of init functions in the library.
This didn't happen until we had a library smaller than 4KB with an init
function.
Adding a cache flush in between does solve the problem. I however think
the kernel should mitigate by making sure the page cache backing PROT_EXEC
mappings is fresh.
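For reference, a minimal sketch of that workaround, not the code Firefox
actually used. The helper name write_code_to_file is hypothetical, and the
sketch assumes the flush is issued on the writable mapping, over the bytes
just written, before munmap(). It relies on GCC's __builtin___clear_cache,
which on ARM Linux is typically implemented on top of the cacheflush system
call and does nothing on architectures with coherent instruction fetch,
such as x86.

```c
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: write `len` bytes of machine code into a fresh file
 * through a shared mapping, flushing the caches over the written range
 * before the mapping is torn down. Returns 0 on success, -1 on error. */
int write_code_to_file(const char *path, const void *code, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || len > (size_t)page)
        return -1;

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, page) != 0) {
        close(fd);
        return -1;
    }

    char *m = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (m == MAP_FAILED)
        return -1;

    memcpy(m, code, len);

    /* Ask for the D-cache to be cleaned and the I-cache invalidated over
     * the bytes just written. */
    __builtin___clear_cache(m, m + len);

    return munmap(m, (size_t)page);
}
```

The file can then be mapped PROT_EXEC (or dlopen()ed) as in the testcase.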
Please note that I'm not expecting
    void *m = mmap(NULL, 4096, PROT_WRITE | PROT_EXEC, MAP_SHARED, fd, 0);
    memcpy(m, foo, 4);
    void (*func)(void) = (void (*)(void)) m;
    func();
to work; that would be unreasonable.
Cheers,
Mike
PS: Please Cc me, I'm not subscribed.
* I-cache/D-cache inconsistency issue with page cache
2011-09-23 11:57 I-cache/D-cache inconsistency issue with page cache Mike Hommey
@ 2011-09-23 19:39 ` Russell King - ARM Linux
2011-09-24 9:35 ` Mike Hommey
0 siblings, 1 reply; 11+ messages in thread
From: Russell King - ARM Linux @ 2011-09-23 19:39 UTC (permalink / raw)
To: linux-arm-kernel
On Fri, Sep 23, 2011 at 01:57:21PM +0200, Mike Hommey wrote:
> We've been hitting random crashes at startup with Firefox on tegras
> (under Android), and narrowed it down to a I-cache/D-cache
> inconsistency. A reduced testcase of the issue looks like the following
> (compile as ARM, not Thumb):
If you write code at run time, you need to use the sys_cacheflush
API to ensure that it's properly synchronized with the I-cache. It's
a well known issue, and it applies to any harvard cache structured
CPU which doesn't automatically ensure coherence (which essentially
means all ARMs.)
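As a rough illustration of the call being referred to: on 32-bit ARM Linux,
cacheflush is an ARM-private system call taking a start address, an end
address and a flags word that must be zero. The wrapper name
sync_code_range and the fallback used on other architectures are
assumptions made for this sketch.

```c
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#if defined(__linux__) && defined(__arm__)
#include <asm/unistd.h> /* __ARM_NR_cacheflush */
#endif

/* Hypothetical wrapper: make instruction fetches see the bytes written to
 * [start, start + len). Returns 0 on success, -1 on error. */
int sync_code_range(void *start, size_t len)
{
    char *begin = start;
    char *end = begin + len;
#if defined(__linux__) && defined(__arm__) && defined(__ARM_NR_cacheflush)
    /* ARM-private system call: start, end, flags (flags must be 0). */
    return (int)syscall(__ARM_NR_cacheflush, (long)begin, (long)end, 0L);
#else
    /* Assumption for non-ARM builds: fall back to the compiler builtin,
     * which does whatever the target needs (possibly nothing). */
    __builtin___clear_cache(begin, end);
    return 0;
#endif
}
```

A caller would invoke sync_code_range(buf, n) after writing n bytes of code
to buf and before jumping into it.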
* I-cache/D-cache inconsistency issue with page cache
2011-09-23 19:39 ` Russell King - ARM Linux
@ 2011-09-24 9:35 ` Mike Hommey
2011-09-24 9:47 ` Russell King - ARM Linux
2011-09-24 10:14 ` Siarhei Siamashka
0 siblings, 2 replies; 11+ messages in thread
From: Mike Hommey @ 2011-09-24 9:35 UTC (permalink / raw)
To: linux-arm-kernel
On Fri, Sep 23, 2011 at 08:39:41PM +0100, Russell King - ARM Linux wrote:
> On Fri, Sep 23, 2011 at 01:57:21PM +0200, Mike Hommey wrote:
> > We've been hitting random crashes at startup with Firefox on tegras
> > (under Android), and narrowed it down to a I-cache/D-cache
> > inconsistency. A reduced testcase of the issue looks like the following
> > (compile as ARM, not Thumb):
>
> If you write code at run time, you need to use the sys_cacheflush
> API to ensure that it's properly synchronized with the I-cache. It's
> a well known issue, and it applies to any harvard cache structured
> CPU which doesn't automatically ensure coherence (which essentially
> means all ARMs.)
I do agree it's reasonable to have applications doing that handle
cache synchronization themselves. I said as much in my message. But I
think the kernel should make sure that its page cache is fresh when
it maps it PROT_EXEC. I think it's unreasonable to expect applications
doing mmap(PROT_WRITE), inflate, munmap, something, mmap(PROT_EXEC),
and executing something there to have to handle cache synchronisation
themselves. Especially when it's very CPU dependent (the testcase does
not even fail on all ARMs, only tegras, apparently). I'm not talking
about actual code generation here, which needs platform-dependent behaviour.
Mike
* I-cache/D-cache inconsistency issue with page cache
2011-09-24 9:35 ` Mike Hommey
@ 2011-09-24 9:47 ` Russell King - ARM Linux
2011-09-24 10:07 ` Mike Hommey
2011-09-25 9:51 ` Catalin Marinas
2011-09-24 10:14 ` Siarhei Siamashka
1 sibling, 2 replies; 11+ messages in thread
From: Russell King - ARM Linux @ 2011-09-24 9:47 UTC (permalink / raw)
To: linux-arm-kernel
On Sat, Sep 24, 2011 at 11:35:44AM +0200, Mike Hommey wrote:
> On Fri, Sep 23, 2011 at 08:39:41PM +0100, Russell King - ARM Linux wrote:
> > On Fri, Sep 23, 2011 at 01:57:21PM +0200, Mike Hommey wrote:
> > > We've been hitting random crashes at startup with Firefox on tegras
> > > (under Android), and narrowed it down to a I-cache/D-cache
> > > inconsistency. A reduced testcase of the issue looks like the following
> > > (compile as ARM, not Thumb):
> >
> > If you write code at run time, you need to use the sys_cacheflush
> > API to ensure that it's properly synchronized with the I-cache. It's
> > a well known issue, and it applies to any harvard cache structured
> > CPU which doesn't automatically ensure coherence (which essentially
> > means all ARMs.)
>
> I do agree it's reasonable to have applications doing that to handle
> cache synchronization themselves. I wrote such in my message. But I
> think the kernel should make sure that its page cache is fresh when
> it maps it PROT_EXEC. I think it's unreasonable to expect applications
> doing mmap(PROT_WRITE), inflate, munmap, something, mmap(PROT_EXEC),
> and execute something there to have to handle cache synchronisation
> themselves. Especially when it's very CPU dependent (the testcase does
> not even fail on all ARMs, only tegras, apparently). I'm not talking
> actual code generation here, which needs platform-dependent behaviour.
Ok. Which kernel are you trying this with, and which CPU (please
confirm Cortex-A9)?
* I-cache/D-cache inconsistency issue with page cache
2011-09-24 9:47 ` Russell King - ARM Linux
@ 2011-09-24 10:07 ` Mike Hommey
2011-09-24 10:12 ` Russell King - ARM Linux
2011-09-25 9:51 ` Catalin Marinas
1 sibling, 1 reply; 11+ messages in thread
From: Mike Hommey @ 2011-09-24 10:07 UTC (permalink / raw)
To: linux-arm-kernel
On Sat, Sep 24, 2011 at 10:47:34AM +0100, Russell King - ARM Linux wrote:
> On Sat, Sep 24, 2011 at 11:35:44AM +0200, Mike Hommey wrote:
> > On Fri, Sep 23, 2011 at 08:39:41PM +0100, Russell King - ARM Linux wrote:
> > > On Fri, Sep 23, 2011 at 01:57:21PM +0200, Mike Hommey wrote:
> > > > We've been hitting random crashes at startup with Firefox on tegras
> > > > (under Android), and narrowed it down to a I-cache/D-cache
> > > > inconsistency. A reduced testcase of the issue looks like the following
> > > > (compile as ARM, not Thumb):
> > >
> > > If you write code at run time, you need to use the sys_cacheflush
> > > API to ensure that it's properly synchronized with the I-cache. It's
> > > a well known issue, and it applies to any harvard cache structured
> > > CPU which doesn't automatically ensure coherence (which essentially
> > > means all ARMs.)
> >
> > I do agree it's reasonable to have applications doing that to handle
> > cache synchronization themselves. I wrote such in my message. But I
> > think the kernel should make sure that its page cache is fresh when
> > it maps it PROT_EXEC. I think it's unreasonable to expect applications
> > doing mmap(PROT_WRITE), inflate, munmap, something, mmap(PROT_EXEC),
> > and execute something there to have to handle cache synchronisation
> > themselves. Especially when it's very CPU dependent (the testcase does
> > not even fail on all ARMs, only tegras, apparently). I'm not talking
> > actual code generation here, which needs platform-dependent behaviour.
>
> Ok. Which kernel are you trying this with, and which CPU (please
> confirm Cortex-A9)?
This has been seen on tegra boards under Ubuntu Maverick
(2.6.35.7.something) and under Android (2.6.32.9.something) and on the Asus
Transformer (Android, 2.6.36.3.something). All Cortex-A9 tegras. It has
*not* been reproduced on pandaboards (Cortex-A9 OMAP4430).
Mike
* I-cache/D-cache inconsistency issue with page cache
2011-09-24 10:07 ` Mike Hommey
@ 2011-09-24 10:12 ` Russell King - ARM Linux
0 siblings, 0 replies; 11+ messages in thread
From: Russell King - ARM Linux @ 2011-09-24 10:12 UTC (permalink / raw)
To: linux-arm-kernel
On Sat, Sep 24, 2011 at 12:07:01PM +0200, Mike Hommey wrote:
> This has been seen on tegra boards under Ubuntu Maverick
> (2.6.35.7.something) and under Android (2.6.32.9.something) and on the Asus
> Transformer (Android, 2.6.36.3.something). All Cortex-A9 tegras. It has
> *not* been reproduced on pandaboards (Cortex-A9 OMAP4430).
Ah, your kernels are probably too old.
You need to ensure that you have at least 6012191 (ARM: 6380/1:
Introduce __sync_icache_dcache() for VIPT caches) and 85848dd (ARM:
6381/1: Use lazy cache flushing on ARMv7 SMP systems). Note that both
these depend on some preceding patches too.
* I-cache/D-cache inconsistency issue with page cache
2011-09-24 9:35 ` Mike Hommey
2011-09-24 9:47 ` Russell King - ARM Linux
@ 2011-09-24 10:14 ` Siarhei Siamashka
1 sibling, 0 replies; 11+ messages in thread
From: Siarhei Siamashka @ 2011-09-24 10:14 UTC (permalink / raw)
To: linux-arm-kernel
On Sat, Sep 24, 2011 at 12:35 PM, Mike Hommey <mh@glandium.org> wrote:
> On Fri, Sep 23, 2011 at 08:39:41PM +0100, Russell King - ARM Linux wrote:
>> On Fri, Sep 23, 2011 at 01:57:21PM +0200, Mike Hommey wrote:
>> > We've been hitting random crashes at startup with Firefox on tegras
>> > (under Android), and narrowed it down to a I-cache/D-cache
>> > inconsistency. A reduced testcase of the issue looks like the following
>> > (compile as ARM, not Thumb):
>>
>> If you write code at run time, you need to use the sys_cacheflush
>> API to ensure that it's properly synchronized with the I-cache. It's
>> a well known issue, and it applies to any harvard cache structured
>> CPU which doesn't automatically ensure coherence (which essentially
>> means all ARMs.)
>
> I do agree it's reasonable to have applications doing that to handle
> cache synchronization themselves. I wrote such in my message. But I
> think the kernel should make sure that its page cache is fresh when
> it maps it PROT_EXEC. I think it's unreasonable to expect applications
> doing mmap(PROT_WRITE), inflate, munmap, something, mmap(PROT_EXEC),
> and execute something there to have to handle cache synchronisation
> themselves. Especially when it's very CPU dependent (the testcase does
> not even fail on all ARMs, only tegras, apparently). I'm not talking
> actual code generation here, which needs platform-dependent behaviour.
Unfortunately we can't rely on what is reasonable, but have to
strictly follow how it is specified to work. My understanding is that
'sys_cacheflush' has always been mandatory on ARM Linux and is
orthogonal to 'mmap'/'mprotect', no matter what discussion threads you
find in the mailing list archives. Being new to the ARM architecture at
the time, I was also burned by this issue years ago [1], when
various pieces of documentation did not match each other, there was a
transition from OABI to EABI ongoing, etc.
The way forward could be to ask the Linux man-pages maintainers to
update the entries for 'mmap' and 'mprotect' to explicitly state that
certain architectures require instruction/data cache
synchronization no matter how you play with the protection flags, and
that this can usually be done via GCC's '__builtin___clear_cache' function
[2].
1. http://ffmpeg.org/pipermail/ffmpeg-devel/2007-January/027847.html
2. http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
--
Best regards,
Siarhei Siamashka
* I-cache/D-cache inconsistency issue with page cache
2011-09-24 9:47 ` Russell King - ARM Linux
2011-09-24 10:07 ` Mike Hommey
@ 2011-09-25 9:51 ` Catalin Marinas
2011-09-25 10:34 ` Russell King - ARM Linux
1 sibling, 1 reply; 11+ messages in thread
From: Catalin Marinas @ 2011-09-25 9:51 UTC (permalink / raw)
To: linux-arm-kernel
On 24 September 2011 10:47, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Sat, Sep 24, 2011 at 11:35:44AM +0200, Mike Hommey wrote:
>> On Fri, Sep 23, 2011 at 08:39:41PM +0100, Russell King - ARM Linux wrote:
>> > On Fri, Sep 23, 2011 at 01:57:21PM +0200, Mike Hommey wrote:
>> > > We've been hitting random crashes at startup with Firefox on tegras
>> > > (under Android), and narrowed it down to a I-cache/D-cache
>> > > inconsistency. A reduced testcase of the issue looks like the following
>> > > (compile as ARM, not Thumb):
>> >
>> > If you write code at run time, you need to use the sys_cacheflush
>> > API to ensure that it's properly synchronized with the I-cache. It's
>> > a well known issue, and it applies to any harvard cache structured
>> > CPU which doesn't automatically ensure coherence (which essentially
>> > means all ARMs.)
>>
>> I do agree it's reasonable to have applications doing that to handle
>> cache synchronization themselves. I wrote such in my message. But I
>> think the kernel should make sure that its page cache is fresh when
>> it maps it PROT_EXEC. I think it's unreasonable to expect applications
>> doing mmap(PROT_WRITE), inflate, munmap, something, mmap(PROT_EXEC),
>> and execute something there to have to handle cache synchronisation
>> themselves. Especially when it's very CPU dependent (the testcase does
>> not even fail on all ARMs, only tegras, apparently). I'm not talking
>> actual code generation here, which needs platform-dependent behaviour.
>
> Ok. Which kernel are you trying this with, and which CPU (please
> confirm Cortex-A9)?
I had a discussion on Friday with the Firefox guys here in ARM. We
need to do some investigation next week but some random unverified
thoughts (that's on A9) - the scenario seems to be that a library
decompresses some data to a file using mmap(write) (which happens to
be code but it doesn't need to know that) while some other application
part tries, at a later time, to execute code in the same file using
mmap(exec).
By default, a new page cache page is dirty. At a first look,
mmap(write) and further access would not trigger a cache operation in
__sync_icache_dcache() and the page is still marked as dirty. Later
on, when the page is munmap'ed and mmap'ed(exec),
__sync_icache_dcache() (during fault processing) would flush the
D-cache and invalidate the I-cache, while marking the page 'clean'.
I wonder whether during the first mmap(write) and uncompressing, the
'clean' state could be set (maybe by some flush_dcache_page() call). This
state would be preserved in the page cache page status and a
subsequent __sync_icache_dcache(), even from a different file, would
just notice that the page is 'clean'.
As I said, just some thoughts, I haven't tested this theory yet.
--
Catalin
* I-cache/D-cache inconsistency issue with page cache
2011-09-25 9:51 ` Catalin Marinas
@ 2011-09-25 10:34 ` Russell King - ARM Linux
2011-09-25 15:26 ` Catalin Marinas
0 siblings, 1 reply; 11+ messages in thread
From: Russell King - ARM Linux @ 2011-09-25 10:34 UTC (permalink / raw)
To: linux-arm-kernel
On Sun, Sep 25, 2011 at 10:51:30AM +0100, Catalin Marinas wrote:
> I had a discussion on Friday with the Firefox guys here in ARM. We
> need to do some investigation next week but some random unverified
> thoughts (that's on A9) - the scenario seems to be that a library
> decompresses some data to a file using mmap(write) (which happens to
> be code but it doesn't need to know that) while some other application
> part tries, at a later time, to execute code in the same file using
> mmap(exec).
>
> By default, a new page cache page is dirty. At a first look,
> mmap(write) and further access would not trigger a cache operation in
> __sync_icache_dcache() and the page is still marked as dirty. Later
> on, when the page is munmap'ed and mmap'ed(exec),
> __sync_icache_dcache() (during fault processing) would flush the
> D-cache and invalidate the I-cache, while marking the page 'clean'.
>
> I wonder whether during the first mmap(write) and uncompressing, the
> 'clean' state could be set (maybe some flush_dcache_page) call. This
> state would be preserved in the page cache page status and a
> subsequent __sync_icache_dcache(), even from a different file, would
> just notice that the page is 'clean'.
>
> As I said, just some thoughts, I haven't tested this theory yet.
Not quite. Whenever we establish any page in the system which is
executable, we always flush the D cache and entire I cache.
As I've already pointed out though, the report is against old kernels
which don't have this code, so there's no point in us speculating about
it until the issue has been confirmed against a kernel which we expect
_not_ to have the issue in the first place (rather than one on which we
_do_ expect it to go wrong.)
* I-cache/D-cache inconsistency issue with page cache
2011-09-25 10:34 ` Russell King - ARM Linux
@ 2011-09-25 15:26 ` Catalin Marinas
2011-09-25 19:30 ` Russell King - ARM Linux
0 siblings, 1 reply; 11+ messages in thread
From: Catalin Marinas @ 2011-09-25 15:26 UTC (permalink / raw)
To: linux-arm-kernel
On 25 September 2011 11:34, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Sun, Sep 25, 2011 at 10:51:30AM +0100, Catalin Marinas wrote:
>> I had a discussion on Friday with the Firefox guys here in ARM. We
>> need to do some investigation next week but some random unverified
>> thoughts (that's on A9) - the scenario seems to be that a library
>> decompresses some data to a file using mmap(write) (which happens to
>> be code but it doesn't need to know that) while some other application
>> part tries, at a later time, to execute code in the same file using
>> mmap(exec).
>>
>> By default, a new page cache page is dirty. At a first look,
>> mmap(write) and further access would not trigger a cache operation in
>> __sync_icache_dcache() and the page is still marked as dirty. Later
>> on, when the page is munmap'ed and mmap'ed(exec),
>> __sync_icache_dcache() (during fault processing) would flush the
>> D-cache and invalidate the I-cache, while marking the page 'clean'.
>>
>> I wonder whether during the first mmap(write) and uncompressing, the
>> 'clean' state could be set (maybe some flush_dcache_page) call. This
>> state would be preserved in the page cache page status and a
>> subsequent __sync_icache_dcache(), even from a different file, would
>> just notice that the page is 'clean'.
>>
>> As I said, just some thoughts, I haven't tested this theory yet.
>
> Not quite. Whenever we establish any page in the system which is
> executable, we always flush the D cache and entire I cache.
We flush the D-cache only if the page was not marked 'clean'. Is there
any chance that the page gets marked as clean before the first part of
the application writes the data (uncompressing) via a mmap(write)
mapping? If that happened, a subsequent mmap(exec) of the same
page (as the kernel would most likely find it in the page cache) would
find it 'clean' and skip the D-cache flush.
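The scenario in question can be modelled in user space. The following is a
toy model only, not kernel code: struct toy_page, its fields and the toy_*
functions are all invented for illustration, `ram` stands for what
instruction fetch sees, `dcache` for data not yet written back, and
`marked_clean` for the page's 'clean' state. It encodes the untested
hypothesis above, nothing more.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum { LINE = 32 }; /* toy "page" consisting of a single cache line */

struct toy_page {
    unsigned char ram[LINE];    /* what instruction fetch would see */
    unsigned char dcache[LINE]; /* data written but not yet written back */
    bool dcache_dirty;          /* dcache holds newer data than ram */
    bool marked_clean;          /* stand-in for the page's 'clean' state */
};

/* A store through a writable mapping: lands in the D-cache only and,
 * in this model, never touches marked_clean. */
void toy_write(struct toy_page *p, const void *src, size_t len)
{
    if (len > LINE)
        len = LINE;
    if (!p->dcache_dirty)
        memcpy(p->dcache, p->ram, LINE); /* fill the line from RAM first */
    memcpy(p->dcache, src, len);
    p->dcache_dirty = true;
}

/* Establishing an executable mapping with a lazy flush: write back only
 * if the page is not already marked clean. */
void toy_map_exec(struct toy_page *p)
{
    if (!p->marked_clean) {
        if (p->dcache_dirty) {
            memcpy(p->ram, p->dcache, LINE);
            p->dcache_dirty = false;
        }
        p->marked_clean = true;
    }
}

/* Instruction fetch reads RAM, never the D-cache. */
const unsigned char *toy_fetch(const struct toy_page *p)
{
    return p->ram;
}
```

In this model, a page that starts out not marked clean gets written back
when it is mapped executable, so the fetch sees the new bytes. A page that
was marked clean before toy_write() skips the write-back, and the fetch
still sees zeroes, which would match the symptom in the original report.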
> As I've already pointed out though, the report is against old kernels
> which doesn't have this code, so there's no point us speculating about
> it until the issue has been confirmed against a kernel which we expect
> _not_ to have the issue in the first place (rather than one which we
> _do_ expect it to go wrong.)
Yes, they should definitely try a more recent kernel.
--
Catalin
* I-cache/D-cache inconsistency issue with page cache
2011-09-25 15:26 ` Catalin Marinas
@ 2011-09-25 19:30 ` Russell King - ARM Linux
0 siblings, 0 replies; 11+ messages in thread
From: Russell King - ARM Linux @ 2011-09-25 19:30 UTC (permalink / raw)
To: linux-arm-kernel
On Sun, Sep 25, 2011 at 04:26:42PM +0100, Catalin Marinas wrote:
> On 25 September 2011 11:34, Russell King - ARM Linux
> <linux@arm.linux.org.uk> wrote:
> > As I've already pointed out though, the report is against old kernels
> > which doesn't have this code, so there's no point us speculating about
> > it until the issue has been confirmed against a kernel which we expect
> > _not_ to have the issue in the first place (rather than one which we
> > _do_ expect it to go wrong.)
>
> Yes, they should definitely try a more recent kernel.
As the kernels which have been reported as having the issue do not
have __sync_icache_dcache, all bets are off for things behaving
correctly. I don't see any point speculating further until the
problem is confirmed on a kernel _with_ __sync_icache_dcache.
Thread overview: 11+ messages
2011-09-23 11:57 I-cache/D-cache inconsistency issue with page cache Mike Hommey
2011-09-23 19:39 ` Russell King - ARM Linux
2011-09-24 9:35 ` Mike Hommey
2011-09-24 9:47 ` Russell King - ARM Linux
2011-09-24 10:07 ` Mike Hommey
2011-09-24 10:12 ` Russell King - ARM Linux
2011-09-25 9:51 ` Catalin Marinas
2011-09-25 10:34 ` Russell King - ARM Linux
2011-09-25 15:26 ` Catalin Marinas
2011-09-25 19:30 ` Russell King - ARM Linux
2011-09-24 10:14 ` Siarhei Siamashka