linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: O_DIRECT patch for processors with VIPT cache for mainline kernel (specifically arm in our case)
       [not found]   ` <20081119204315.GB17209@flint.arm.linux.org.uk>
@ 2008-11-20  6:59     ` Nick Piggin
  2008-11-20  9:19       ` Russell King - ARM Linux
  2008-11-20 12:28       ` Dmitry Adamushko
  0 siblings, 2 replies; 13+ messages in thread
From: Nick Piggin @ 2008-11-20  6:59 UTC (permalink / raw)
  To: Russell King - ARM Linux, linux-fsdevel
  Cc: Naval Saini, linux-arch, linux-arm-kernel, linux-kernel,
	naval.saini

On Thursday 20 November 2008 07:43, Russell King - ARM Linux wrote:
> On Wed, Nov 19, 2008 at 05:40:23PM +1100, Nick Piggin wrote:
> > It would be interesting to know exactly what problem you are seeing.
> >
> > ARM I think is supposed to handle aliasing problems by flushing
> > caches at appropriate points. It would be nice to know what's going
> > wrong and whether we can cover those holes.
>
> I think there's a problem here: the existing cache handling API is
> designed around the MM's manipulation of page tables.  It is generally
> not designed to handle aliasing between multiple mappings of the same
> page, except with one exception: page cache pages mmap'd into userspace,
> which is handled via flush_dcache_page().

Right, flush_dcache_page.


> O_DIRECT on ARM is probably completely untested for the most part.  It's
> not something that is encountered very often, and as such gets zero
> testing.  I've certainly never had the tools to be able to test it out,
> so even on VIVT it's probably completely buggy.  Bear in mind that most
> of my modern platforms use MTD (either cramfs or in the rare case jffs2)
> so I'm not sure that I could sensibly even test O_DIRECT - isn't O_DIRECT
> for use with proper block devices?

Yes, although there are other things that use get_user_pages as well
(eg. splice, ptrace). So it would be interesting if they have corner
case problems as well.


> As it's probably clear, I've no clue of the O_DIRECT implementation or
> how it's supposed to work.  At a guess, it probably needs a new cache
> handling API to be designed, since the existing flush_cache_(range|page)
> are basically no-ops on VIPT.  Or maybe we need some way to mark the
> userspace pages as being write-through or uncacheable depending on the
> features we have available on the processor.  Or something.  Don't know.
>
> So, what is O_DIRECT, and where can I find some information about it?
> I don't see anything in Documentation/ describing it.

(open(2), with O_DIRECT flag)

O_DIRECT uses "get_user_pages()" pages (user mapped pages) to feed the
block layer with rather than pagecache pages. However pagecache pages
can also be user mapped, so I'm thinking there should be enough cache
flushing APIs to be able to handle either case.

Basically, an O_DIRECT write involves:

- The program storing into some virtual address, then passing that virtual
  address as the buffer to write(2).

- The kernel will get_user_pages() to get the struct page * of that user
  virtual address. At this point, get_user_pages does flush_dcache_page.
  (Which should write back the user caches?)

- Then the struct page is sent to the block layer (it won't tend to be
  touched by the kernel via the kernel linear map, unless we have like an
  "emulated" block device block device like 'brd').

- Even if it is read via the kernel linear map, AFAIKS, we should be OK
  due to the flush_dcache_page().

An O_DIRECT read involves:

- Same first 2 steps as O_DIRECT write, including flush_dcache_page. So the
  user mapping should not have any previously dirtied lines around.

- The page is sent to the block layer, which stores into the page. Some
  block devices like 'brd' will potentially store via the kernel linear map
  here, and they probably don't do enough cache flushing. But a regular
  block device should go via DMA, which AFAIK should be OK? (the user address
  should remain invalidated because it would be a bug to read from the buffer
  before the read has completed)

So I can't see exactly where the problem would be. Can you?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: O_DIRECT patch for processors with VIPT cache for mainline kernel (specifically arm in our case)
  2008-11-20  6:59     ` O_DIRECT patch for processors with VIPT cache for mainline kernel (specifically arm in our case) Nick Piggin
@ 2008-11-20  9:19       ` Russell King - ARM Linux
  2008-11-20 10:55         ` Naval Saini
  2008-11-21 17:10         ` Catalin Marinas
  2008-11-20 12:28       ` Dmitry Adamushko
  1 sibling, 2 replies; 13+ messages in thread
From: Russell King - ARM Linux @ 2008-11-20  9:19 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-fsdevel, Naval Saini, linux-arch, linux-arm-kernel,
	linux-kernel, naval.saini

On Thu, Nov 20, 2008 at 05:59:00PM +1100, Nick Piggin wrote:
> Basically, an O_DIRECT write involves:
> 
> - The program storing into some virtual address, then passing that virtual
>   address as the buffer to write(2).
> 
> - The kernel will get_user_pages() to get the struct page * of that user
>   virtual address. At this point, get_user_pages does flush_dcache_page.
>   (Which should write back the user caches?)
> 
> - Then the struct page is sent to the block layer (it won't tend to be
>   touched by the kernel via the kernel linear map, unless we have like an
>   "emulated" block device block device like 'brd').
> 
> - Even if it is read via the kernel linear map, AFAIKS, we should be OK
>   due to the flush_dcache_page().

That seems sane, and yes, flush_dcache_page() will write back and
invalidate dirty cache lines in both the kernel and user mappings.

> An O_DIRECT read involves:
> 
> - Same first 2 steps as O_DIRECT write, including flush_dcache_page. So the
>   user mapping should not have any previously dirtied lines around.
> 
> - The page is sent to the block layer, which stores into the page. Some
>   block devices like 'brd' will potentially store via the kernel linear map
>   here, and they probably don't do enough cache flushing. But a regular
>   block device should go via DMA, which AFAIK should be OK? (the user address
>   should remain invalidated because it would be a bug to read from the buffer
>   before the read has completed)

This is where things get icky with lots of drivers - DMA is fine, but
many PIO based drivers don't handle the implications of writing to the
kernel page cache page when there may be CPU cache side effects.

If the cache is in read allocate mode, then in this case there shouldn't
be any dirty cache lines.  (That's not always the case though, esp. via
conventional IO.)  If the cache is in write allocate mode, PIO data will
sit in the kernel mapping and won't be visible to userspace.

That is a years-old bug, one that I've been unable to run tests for here
(because my platforms don't have the right combinations of CPUs supporting
write alloc and/or a problem block driver.)  I've even been accused of
being uncooperative over testing possible bug fixes by various people
(if I don't have hardware which can show the problem, how can I test
possible fixes?)  So I've given up with that issue - as far as I'm
concerned, it's a problem for others to sort out.

Do we know what hardware, which IO drivers are being used, and any
relevent configuration of the drivers?

-------------------------------------------------------------------
List admin: http://lists.arm.linux.org.uk/mailman/listinfo/linux-arm-kernel
FAQ:        http://www.arm.linux.org.uk/mailinglists/faq.php
Etiquette:  http://www.arm.linux.org.uk/mailinglists/etiquette.php

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: O_DIRECT patch for processors with VIPT cache for mainline kernel (specifically arm in our case)
  2008-11-20  9:19       ` Russell King - ARM Linux
@ 2008-11-20 10:55         ` Naval Saini
  2008-11-21 17:10         ` Catalin Marinas
  1 sibling, 0 replies; 13+ messages in thread
From: Naval Saini @ 2008-11-20 10:55 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Nick Piggin, linux-fsdevel, linux-arch, linux-arm-kernel,
	linux-kernel, naval.saini

Hi Russel/Nick:

Thanks for the replies - I would like to reply ASAP to the mails (out
of interest also) - but have been unable to for  some reasons. I donot
understand all that has been talked about in the discussions and would
spend more time before I reply to your suggestions.

We are working on arm 11 based embedded systems and were debugging an
issue with mkfs.xfs (uses O_DIRECT) utility. After quite a bit of
debugging, we found a pattern in addresses of user space buffers
allocated with memalign. For some address (of these buffers), the
format would result in corrupt disk. By applying this patch, I was
able to allocate memory that was page color aligned (as required by
arm technical ref. manual). We have observed the same problems in VIVT
caches (arm 9 processor) based. We also found that we could do away
with the problem by disabling cache.

Rest, the problem was observed not during O_DIRECT write (but at time
of read). When the same buffer was used and when user virtual address
(UVA) and kernal virtual address (KVA) did not meet alignment
requirements, the d_cache_invalidate operation (in consistent_sync
func in consistent.c) on KVA did not have the desired affect. When the
same memory was accessed from UVA (that should have been invalidated
along with KVA invalidation), it corrupted the buffer we had just
read. This in mkfs.xfs utility, corrupted the disk format. In lay man
terms, looked like KVA and UVA occupied seperate locations in cache,
when they were used one after the other (specifically KVA after UVA).

Hope this would be of immidiate help. I would comment on the strageies
proposed by nick in next mail.

Further as i have said, our kernel version in linux-2.6.18.5 ,
uclibc-0.9.28 , mkfs is irrelevant.

Thanks,
Naval


On 11/20/08, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:
> On Thu, Nov 20, 2008 at 05:59:00PM +1100, Nick Piggin wrote:
>  > Basically, an O_DIRECT write involves:
>  >
>  > - The program storing into some virtual address, then passing that virtual
>  >   address as the buffer to write(2).
>  >
>  > - The kernel will get_user_pages() to get the struct page * of that user
>  >   virtual address. At this point, get_user_pages does flush_dcache_page.
>  >   (Which should write back the user caches?)
>  >
>  > - Then the struct page is sent to the block layer (it won't tend to be
>  >   touched by the kernel via the kernel linear map, unless we have like an
>  >   "emulated" block device block device like 'brd').
>  >
>  > - Even if it is read via the kernel linear map, AFAIKS, we should be OK
>  >   due to the flush_dcache_page().
>
>
> That seems sane, and yes, flush_dcache_page() will write back and
>  invalidate dirty cache lines in both the kernel and user mappings.
>
>
>  > An O_DIRECT read involves:
>  >
>  > - Same first 2 steps as O_DIRECT write, including flush_dcache_page. So the
>  >   user mapping should not have any previously dirtied lines around.
>  >
>  > - The page is sent to the block layer, which stores into the page. Some
>  >   block devices like 'brd' will potentially store via the kernel linear map
>  >   here, and they probably don't do enough cache flushing. But a regular
>  >   block device should go via DMA, which AFAIK should be OK? (the user address
>  >   should remain invalidated because it would be a bug to read from the buffer
>  >   before the read has completed)
>
>
> This is where things get icky with lots of drivers - DMA is fine, but
>  many PIO based drivers don't handle the implications of writing to the
>  kernel page cache page when there may be CPU cache side effects.
>
>  If the cache is in read allocate mode, then in this case there shouldn't
>  be any dirty cache lines.  (That's not always the case though, esp. via
>  conventional IO.)  If the cache is in write allocate mode, PIO data will
>  sit in the kernel mapping and won't be visible to userspace.
>
>  That is a years-old bug, one that I've been unable to run tests for here
>  (because my platforms don't have the right combinations of CPUs supporting
>  write alloc and/or a problem block driver.)  I've even been accused of
>  being uncooperative over testing possible bug fixes by various people
>  (if I don't have hardware which can show the problem, how can I test
>  possible fixes?)  So I've given up with that issue - as far as I'm
>  concerned, it's a problem for others to sort out.
>
>  Do we know what hardware, which IO drivers are being used, and any
>  relevent configuration of the drivers?
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: O_DIRECT patch for processors with VIPT cache for mainline kernel (specifically arm in our case)
  2008-11-20  6:59     ` O_DIRECT patch for processors with VIPT cache for mainline kernel (specifically arm in our case) Nick Piggin
  2008-11-20  9:19       ` Russell King - ARM Linux
@ 2008-11-20 12:28       ` Dmitry Adamushko
  2008-11-20 13:25         ` Nick Piggin
  1 sibling, 1 reply; 13+ messages in thread
From: Dmitry Adamushko @ 2008-11-20 12:28 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Russell King - ARM Linux, linux-fsdevel, Naval Saini, linux-arch,
	linux-arm-kernel, linux-kernel, naval.saini, Ralf Baechle

2008/11/20 Nick Piggin <nickpiggin@yahoo.com.au>:
>
> [ ... ]
>
> - The page is sent to the block layer, which stores into the page. Some
>  block devices like 'brd' will potentially store via the kernel linear map
>  here, and they probably don't do enough cache flushing.

btw., if someone is curious, here is another case of what may happen
on VIPT systems when someone uses a "virtual" block device (like
'brd') as, heh, a swap :-)

http://www.linux-mips.org/archives/linux-mips/2008-11/msg00038.html


-- 
Best regards,
Dmitry Adamushko

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: O_DIRECT patch for processors with VIPT cache for mainline kernel (specifically arm in our case)
  2008-11-20 12:28       ` Dmitry Adamushko
@ 2008-11-20 13:25         ` Nick Piggin
  2008-11-20 13:55           ` Ralf Baechle
  2008-11-20 13:59           ` Dmitry Adamushko
  0 siblings, 2 replies; 13+ messages in thread
From: Nick Piggin @ 2008-11-20 13:25 UTC (permalink / raw)
  To: Dmitry Adamushko
  Cc: Russell King - ARM Linux, linux-fsdevel, Naval Saini, linux-arch,
	linux-arm-kernel, linux-kernel, naval.saini, Ralf Baechle

On Thursday 20 November 2008 23:28, Dmitry Adamushko wrote:
> 2008/11/20 Nick Piggin <nickpiggin@yahoo.com.au>:
> > [ ... ]
> >
> > - The page is sent to the block layer, which stores into the page. Some
> >  block devices like 'brd' will potentially store via the kernel linear
> > map here, and they probably don't do enough cache flushing.
>
> btw., if someone is curious, here is another case of what may happen
> on VIPT systems when someone uses a "virtual" block device (like
> 'brd') as, heh, a swap :-)
>
> http://www.linux-mips.org/archives/linux-mips/2008-11/msg00038.html

Right... Now I'm lacking knowledge when it comes to devices, but I
think it is probably reasonable for the block device layer to ensure
the physical memory is uptodate after it signals request completion.

That is, there shouldn't be any potentially aliasing dirty lines.
Block devices which do any writeout via the kernel linear address
(eg. brd) should do a flush_dcache_page.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: O_DIRECT patch for processors with VIPT cache for mainline kernel (specifically arm in our case)
  2008-11-20 13:25         ` Nick Piggin
@ 2008-11-20 13:55           ` Ralf Baechle
  2008-11-20 15:27             ` Matthew Wilcox
  2008-11-20 16:30             ` Russell King - ARM Linux
  2008-11-20 13:59           ` Dmitry Adamushko
  1 sibling, 2 replies; 13+ messages in thread
From: Ralf Baechle @ 2008-11-20 13:55 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Dmitry Adamushko, Russell King - ARM Linux, linux-fsdevel,
	Naval Saini, linux-arch, linux-arm-kernel, linux-kernel,
	naval.saini

On Fri, Nov 21, 2008 at 12:25:39AM +1100, Nick Piggin wrote:

> > > - The page is sent to the block layer, which stores into the page. Some
> > >  block devices like 'brd' will potentially store via the kernel linear
> > > map here, and they probably don't do enough cache flushing.
> >
> > btw., if someone is curious, here is another case of what may happen
> > on VIPT systems when someone uses a "virtual" block device (like
> > 'brd') as, heh, a swap :-)
> >
> > http://www.linux-mips.org/archives/linux-mips/2008-11/msg00038.html
> 
> Right... Now I'm lacking knowledge when it comes to devices, but I
> think it is probably reasonable for the block device layer to ensure
> the physical memory is uptodate after it signals request completion.
> 
> That is, there shouldn't be any potentially aliasing dirty lines.
> Block devices which do any writeout via the kernel linear address
> (eg. brd) should do a flush_dcache_page.

It's better to avoid aliases than dealing with them by flushing.  A way to
avoid aliases whenever a page is mapped to userspace, one creates a mapping
at a carefully choosen address that doesn't alias.  On architectures with
software reload TLBs such as MIPS that's very cheap and the entire
cacheflush with all it's associated pains can go away.  Right now MIPS uses
such a mechanism:

  void *kmap_coherent(struct page *page, unsigned long addr);
  void kunmap_coherent(void);

within the architecture private implementation but it could be use beyond
that, probably on all architectures though I know that there would be some
solvable issues on PARISC.  Lightweight, no ordering constraints between
kernel and userspace accesses, so also no locking needed.

Does this look like a possible avenue?

  Ralf

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: O_DIRECT patch for processors with VIPT cache for mainline kernel (specifically arm in our case)
  2008-11-20 13:25         ` Nick Piggin
  2008-11-20 13:55           ` Ralf Baechle
@ 2008-11-20 13:59           ` Dmitry Adamushko
  1 sibling, 0 replies; 13+ messages in thread
From: Dmitry Adamushko @ 2008-11-20 13:59 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Russell King - ARM Linux, linux-fsdevel, Naval Saini, linux-arch,
	linux-arm-kernel, linux-kernel, naval.saini, Ralf Baechle

2008/11/20 Nick Piggin <nickpiggin@yahoo.com.au>:
> On Thursday 20 November 2008 23:28, Dmitry Adamushko wrote:
>> 2008/11/20 Nick Piggin <nickpiggin@yahoo.com.au>:
>> > [ ... ]
>> >
>> > - The page is sent to the block layer, which stores into the page. Some
>> >  block devices like 'brd' will potentially store via the kernel linear
>> > map here, and they probably don't do enough cache flushing.
>>
>> btw., if someone is curious, here is another case of what may happen
>> on VIPT systems when someone uses a "virtual" block device (like
>> 'brd') as, heh, a swap :-)
>>
>> http://www.linux-mips.org/archives/linux-mips/2008-11/msg00038.html
>
> Right... Now I'm lacking knowledge when it comes to devices, but I
> think it is probably reasonable for the block device layer to ensure
> the physical memory is uptodate after it signals request completion.
>
> That is, there shouldn't be any potentially aliasing dirty lines.
> Block devices which do any writeout via the kernel linear address
> (eg. brd) should do a flush_dcache_page.
>

Yeah, but my point is that flush_dcache_page() in its current
incarnation will _not_ always match the expectetions of such block
devices (e.g. brd or compcache) when those devices are used as "swap"
:-)

IOW, although flush_dcache_page() is in place, it won't work as expected.

-- 
Best regards,
Dmitry Adamushko

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: O_DIRECT patch for processors with VIPT cache for mainline kernel (specifically arm in our case)
  2008-11-20 13:55           ` Ralf Baechle
@ 2008-11-20 15:27             ` Matthew Wilcox
  2008-11-20 17:17               ` Ralf Baechle
  2008-11-20 16:30             ` Russell King - ARM Linux
  1 sibling, 1 reply; 13+ messages in thread
From: Matthew Wilcox @ 2008-11-20 15:27 UTC (permalink / raw)
  To: Ralf Baechle
  Cc: Nick Piggin, Dmitry Adamushko, Russell King - ARM Linux,
	linux-fsdevel, Naval Saini, linux-arch, linux-arm-kernel,
	linux-kernel, naval.saini, James Bottomley

On Thu, Nov 20, 2008 at 01:55:58PM +0000, Ralf Baechle wrote:
> It's better to avoid aliases than dealing with them by flushing.  A way to
> avoid aliases whenever a page is mapped to userspace, one creates a mapping
> at a carefully choosen address that doesn't alias.  On architectures with
> software reload TLBs such as MIPS that's very cheap and the entire
> cacheflush with all it's associated pains can go away.  Right now MIPS uses
> such a mechanism:
> 
>   void *kmap_coherent(struct page *page, unsigned long addr);
>   void kunmap_coherent(void);
> 
> within the architecture private implementation but it could be use beyond
> that, probably on all architectures though I know that there would be some
> solvable issues on PARISC.  Lightweight, no ordering constraints between
> kernel and userspace accesses, so also no locking needed.

I'm not quite sure why you need kmap_coherent().  If a page is mapped into
userspace, you can find what address it's mapped to from
page->mapping->i_mmap and page->index.  OTOH, that's potentially
expensive since you need to grab the spinlock, and unless you have all
user addresses coherent with each other (like parisc does), you need to
figure out which process to be coherent with.

I know James Bottomley did an experiment (and did an OLS presentation
...) on unmapping the entire page cache and greatly expanding the kmap
area to do just this kind of thing.  I think he even got a speedup.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: O_DIRECT patch for processors with VIPT cache for mainline kernel (specifically arm in our case)
  2008-11-20 13:55           ` Ralf Baechle
  2008-11-20 15:27             ` Matthew Wilcox
@ 2008-11-20 16:30             ` Russell King - ARM Linux
  1 sibling, 0 replies; 13+ messages in thread
From: Russell King - ARM Linux @ 2008-11-20 16:30 UTC (permalink / raw)
  To: Ralf Baechle
  Cc: Nick Piggin, Dmitry Adamushko, linux-fsdevel, Naval Saini,
	linux-arch, linux-arm-kernel, linux-kernel, naval.saini

On Thu, Nov 20, 2008 at 01:55:58PM +0000, Ralf Baechle wrote:
> It's better to avoid aliases than dealing with them by flushing.  A way to
> avoid aliases whenever a page is mapped to userspace, one creates a mapping
> at a carefully choosen address that doesn't alias.  On architectures with
> software reload TLBs such as MIPS that's very cheap and the entire
> cacheflush with all it's associated pains can go away.  Right now MIPS uses
> such a mechanism:
> 
>   void *kmap_coherent(struct page *page, unsigned long addr);
>   void kunmap_coherent(void);

Yes, we effectively already do that for our copy_user_highpage/
clear_user_highpage (note: we really want to override both of those
functions, not just one as the code in include/linux/highmem.h is
currently setup.  And I wish that V4L didn't use the non-highmem
clear_user_page() call...)

Why?  On ARM, we're starting to support highmem on VIVT and VIPT, and
having *_user_highpage() kmap, and then have the *_user_page() setup
yet another mapping for VIPT caches is just silly.

Which reminds me that I need to get a patch to make clear_user_highpage()
overridable out on the relevent lists...

> within the architecture private implementation but it could be use beyond
> that, probably on all architectures though I know that there would be some
> solvable issues on PARISC.  Lightweight, no ordering constraints between
> kernel and userspace accesses, so also no locking needed.

The "no locking needed" depends.  If we're going to be asking drivers
to copy use such an interface in order to copy data into page cache
pages, you can't have two drivers calling it from interrupt context
without some form of locking.

Nor if you're SMP and you don't have the mappings setup per-CPU.

However, as davem has said in the past, the result of a device reading
or writing the page cache should result in the same conditions no
matter how the read or write is actually done.  That's only common
sense.  So ensuring that device PIO accesses are done with the same
cache colour as userspace results in a difference between PIO and DMA,
one which could potentially bite if we end up reading from the kernel
mapping of those pages first.

If we extend the argument that "everything must be the same cache colour"
then we need cache colouring when allocating page cache pages - since
page cache pages are part of the kernel direct mapped memory, their
colour is already fixed from the time that you've setup the initial
kernel memory mappings.  In ARMs case of an aliasing VIPT cache, that
means that VA [13:12] must equal PA [13:12] for every single mapping
no matter where...

That would be nice, since we don't then have to worry about aliases
between kernel and userspace anymore, so we don't have to play these
remapping games in kernel space. ;)

-------------------------------------------------------------------
List admin: http://lists.arm.linux.org.uk/mailman/listinfo/linux-arm-kernel
FAQ:        http://www.arm.linux.org.uk/mailinglists/faq.php
Etiquette:  http://www.arm.linux.org.uk/mailinglists/etiquette.php

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: O_DIRECT patch for processors with VIPT cache for mainline kernel (specifically arm in our case)
  2008-11-20 15:27             ` Matthew Wilcox
@ 2008-11-20 17:17               ` Ralf Baechle
  2008-11-20 17:40                 ` Matthew Wilcox
  0 siblings, 1 reply; 13+ messages in thread
From: Ralf Baechle @ 2008-11-20 17:17 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Nick Piggin, Dmitry Adamushko, Russell King - ARM Linux,
	linux-fsdevel, Naval Saini, linux-arch, linux-arm-kernel,
	linux-kernel, naval.saini, James Bottomley

On Thu, Nov 20, 2008 at 08:27:19AM -0700, Matthew Wilcox wrote:

> I'm not quite sure why you need kmap_coherent().  If a page is mapped into
> userspace, you can find what address it's mapped to from
> page->mapping->i_mmap and page->index.  OTOH, that's potentially

Even if we know the userspace address of a page we do not necessarily have
a usable mapping for kernel purposes.  The userspace mapping might be r/o
when we need r/w or it might be in another process.  kmap_coherent takes
the job of creating a r/w mapping on a suitable kernel virtual address
that will avoid any aliases.

> page->mapping->i_mmap and page->index.  OTOH, that's potentially
> expensive since you need to grab the spinlock, and unless you have all
> user addresses coherent with each other (like parisc does), you need to
> figure out which process to be coherent with.

Having all userspace addresses of a page across all processes coherent with
each other is the only practicable solution in Linux; at least I don't think
how otherwise and within the currently kernel framework a platform could
sanely handle userspace-userspace aliases.  So we're talking about extending
this to cover userspace-kernelspace aliases.

The original reason for the introduction of kmap_coherent was avoiding
a cache alias in when a multi-threaded process forks.  The issue has been
debated on lkml in 2006 as part of my submission of a patchset under the
subject of "Fix COW D-cache aliasing on fork".  The description is somewhat
lengthy so I omit it here.

One of the ugly parts of kmap_coherent() is that it cannot be used safely
if the page has been marked as dirty by flush_dcache_page(); the callers
know about this and deal with it.

> I know James Bottomley did an experiment (and did an OLS presentation
> ...) on unmapping the entire page cache and greatly expanding the kmap
> area to do just this kind of thing.  I think he even got a speedup.

The speedup is no surprise.

  Ralf

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: O_DIRECT patch for processors with VIPT cache for mainline kernel (specifically arm in our case)
  2008-11-20 17:17               ` Ralf Baechle
@ 2008-11-20 17:40                 ` Matthew Wilcox
  2008-11-20 19:30                   ` Russell King - ARM Linux
  0 siblings, 1 reply; 13+ messages in thread
From: Matthew Wilcox @ 2008-11-20 17:40 UTC (permalink / raw)
  To: Ralf Baechle
  Cc: Nick Piggin, Dmitry Adamushko, Russell King - ARM Linux,
	linux-fsdevel, Naval Saini, linux-arch, linux-arm-kernel,
	linux-kernel, naval.saini, James Bottomley

On Thu, Nov 20, 2008 at 05:17:14PM +0000, Ralf Baechle wrote:
> On Thu, Nov 20, 2008 at 08:27:19AM -0700, Matthew Wilcox wrote:
> > I'm not quite sure why you need kmap_coherent().  If a page is mapped into
> > userspace, you can find what address it's mapped to from
> > page->mapping->i_mmap and page->index.  OTOH, that's potentially
> 
> Even if we know the userspace address of a page we do not necessarily have
> a usable mapping for kernel purposes.  The userspace mapping might be r/o
> when we need r/w or it might be in another process.  kmap_coherent takes
> the job of creating a r/w mapping on a suitable kernel virtual address
> that will avoid any aliases.

Ah, I didn't do a good enough job of explaining.  My question was why
you needed the 'address' argument to kmap_coherent (and thus had a
different prototype from kmap) instead of just implementing kmap() on
mips.

> The original reason for the introduction of kmap_coherent was avoiding
> a cache alias in when a multi-threaded process forks.  The issue has been
> debated on lkml in 2006 as part of my submission of a patchset under the
> subject of "Fix COW D-cache aliasing on fork".  The description is somewhat
> lengthy so I omit it here.
> 
> One of the ugly parts of kmap_coherent() is that it cannot be used safely
> if the page has been marked as dirty by flush_dcache_page(); the callers
> know about this and deal with it.

If you fix kmap() to only give you addresses that are coherent with
userspace, you no longer need flush_dcache_page().

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: O_DIRECT patch for processors with VIPT cache for mainline kernel (specifically arm in our case)
  2008-11-20 17:40                 ` Matthew Wilcox
@ 2008-11-20 19:30                   ` Russell King - ARM Linux
  0 siblings, 0 replies; 13+ messages in thread
From: Russell King - ARM Linux @ 2008-11-20 19:30 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ralf Baechle, Nick Piggin, Dmitry Adamushko, linux-fsdevel,
	Naval Saini, linux-arch, linux-arm-kernel, linux-kernel,
	naval.saini, James Bottomley

On Thu, Nov 20, 2008 at 10:40:54AM -0700, Matthew Wilcox wrote:
> On Thu, Nov 20, 2008 at 05:17:14PM +0000, Ralf Baechle wrote:
> > On Thu, Nov 20, 2008 at 08:27:19AM -0700, Matthew Wilcox wrote:
> > > I'm not quite sure why you need kmap_coherent().  If a page is mapped into
> > > userspace, you can find what address it's mapped to from
> > > page->mapping->i_mmap and page->index.  OTOH, that's potentially
> > 
> > Even if we know the userspace address of a page we do not necessarily have
> > a usable mapping for kernel purposes.  The userspace mapping might be r/o
> > when we need r/w or it might be in another process.  kmap_coherent takes
> > the job of creating a r/w mapping on a suitable kernel virtual address
> > that will avoid any aliases.
> 
> Ah, I didn't do a good enough job of explaining.  My question was why
> you needed the 'address' argument to kmap_coherent (and thus had a
> different prototype from kmap) instead of just implementing kmap() on
> mips.

Ok, having discussed this with Matthew, I'm beginning to really warm
to the idea of always kmapping page cache accesses.

That nicely means that we never touch the page cache via the kernel
direct mapping, which in turn means we can reduce the amount of RAM
mapped, freeing up space for kmap in the virtual address space.

I think this should work really nicely, and make flush_dcache_page()
almost a no-op for aliasing VIPT caches (on ARM, we still have a
problem with instruction/data cache aliasing, but that's a far easier
issue to solve.)

Ralf: were you saying that you already had an implementation similar
to Matthew's?  As I see it, if you use page->index to determine the
userspace mapping colouring (which we do on ARM) then kmap() already
has access the necessary information to correctly place the page...

-------------------------------------------------------------------
List admin: http://lists.arm.linux.org.uk/mailman/listinfo/linux-arm-kernel
FAQ:        http://www.arm.linux.org.uk/mailinglists/faq.php
Etiquette:  http://www.arm.linux.org.uk/mailinglists/etiquette.php

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: O_DIRECT patch for processors with VIPT cache for mainline kernel (specifically arm in our case)
  2008-11-20  9:19       ` Russell King - ARM Linux
  2008-11-20 10:55         ` Naval Saini
@ 2008-11-21 17:10         ` Catalin Marinas
  1 sibling, 0 replies; 13+ messages in thread
From: Catalin Marinas @ 2008-11-21 17:10 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Nick Piggin, linux-fsdevel, Naval Saini, linux-arch,
	linux-arm-kernel, linux-kernel, naval.saini

On Thu, 2008-11-20 at 09:19 +0000, Russell King - ARM Linux wrote:
> On Thu, Nov 20, 2008 at 05:59:00PM +1100, Nick Piggin wrote:
> > - The page is sent to the block layer, which stores into the page. Some
> >   block devices like 'brd' will potentially store via the kernel linear map
> >   here, and they probably don't do enough cache flushing. But a regular
> >   block device should go via DMA, which AFAIK should be OK? (the user address
> >   should remain invalidated because it would be a bug to read from the buffer
> >   before the read has completed)
> 
> This is where things get icky with lots of drivers - DMA is fine, but
> many PIO based drivers don't handle the implications of writing to the
> kernel page cache page when there may be CPU cache side effects.

And for PIO devices, what cache flushing function should be used if the
page isn't available (maybe just a pointer to a buffer) to do a
flush_dcache_page? Should this be done in the filesystem layer?

> If the cache is in read allocate mode, then in this case there shouldn't
> be any dirty cache lines.  (That's not always the case though, esp. via
> conventional IO.)  If the cache is in write allocate mode, PIO data will
> sit in the kernel mapping and won't be visible to userspace.

This problem re-appeared in our tests when someone started using an ext2
filesystem with mtd+slram with write-allocate caches. Basically the D
cache isn't flushed to ensure its coherency with the I cache.

The very dirty workaround was to use flush_icache_range (I know, it's
meant for something completely different) in mtd_blkdevs.c. The slram.c
doesn't have access to any page information either and I noticed that
there are very few block devices that call flush_dcache_page directly.
Should this be done by the filesystem layer?

It looks to me like some filesystems like cramfs call flush_dcache_page
in their "readpage" functions but ext2 doesn't. My other, less dirty,
workaround for the mtd+slram problem is below. It appears to solve the
problem though I'm not sure it is the right solution (ext2 uses
mpage_readpages).

diff --git a/fs/mpage.c b/fs/mpage.c
index 552b80b..979a4a9 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -51,6 +51,7 @@ static void mpage_end_io_read(struct bio *bio, int err)
 			prefetchw(&bvec->bv_page->flags);
 
 		if (uptodate) {
+			flush_dcache_page(page);
 			SetPageUptodate(page);
 		} else {
 			ClearPageUptodate(page);


Thanks.

-- 
Catalin

^ permalink raw reply related	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2008-11-21 17:10 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <d8dd82df0811180606w503563ach8650ab07dcd0a35c@mail.gmail.com>
     [not found] ` <200811191740.23638.nickpiggin@yahoo.com.au>
     [not found]   ` <20081119204315.GB17209@flint.arm.linux.org.uk>
2008-11-20  6:59     ` O_DIRECT patch for processors with VIPT cache for mainline kernel (specifically arm in our case) Nick Piggin
2008-11-20  9:19       ` Russell King - ARM Linux
2008-11-20 10:55         ` Naval Saini
2008-11-21 17:10         ` Catalin Marinas
2008-11-20 12:28       ` Dmitry Adamushko
2008-11-20 13:25         ` Nick Piggin
2008-11-20 13:55           ` Ralf Baechle
2008-11-20 15:27             ` Matthew Wilcox
2008-11-20 17:17               ` Ralf Baechle
2008-11-20 17:40                 ` Matthew Wilcox
2008-11-20 19:30                   ` Russell King - ARM Linux
2008-11-20 16:30             ` Russell King - ARM Linux
2008-11-20 13:59           ` Dmitry Adamushko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).