Re: [patch 6/12] hold atomic kmaps across generic_file_read

From: Daniel Phillips <phillips@arcor.de>
To: Linus Torvalds <torvalds@transmeta.com>
Cc: Andrew Morton <akpm@zip.com.au>, lkml <linux-kernel@vger.kernel.org>
Subject: Re: [patch 6/12] hold atomic kmaps across generic_file_read
Date: Sat, 10 Aug 2002 20:16:36 +0200
Message-ID: <E17damz-0001Zq-00@starship>
In-Reply-To: <Pine.LNX.4.44.0208100948100.2134-100000@home.transmeta.com>

On Saturday 10 August 2002 19:01, Linus Torvalds wrote:
> On Sat, 10 Aug 2002, Daniel Phillips wrote:
> > Sorry, this connection is too subtle for me.  I see why we want to do
> > this, and in fact I've been researching how to do it for the last few
> > weeks, but I don't see how it's related to the atomic kmap path.  Could
> > you please explain, in words of one syllable?
> 
> We cannot do that optimization generally. I'll give you two reasons, both 
> of which are sufficient on their own:
> 
>  - doing the page table walk is simply slower than doing the memcpy if the
>    page is just there. So you have to have a good heuristic on when it
>    might be worthwhile to do page table tricks. That heuristic should 
>    include "is the page directly accessible". Which is exactly what you 
>    get if you have an "atomic copy_to_user() that returns failure if it
>    cannot be done atomically".
> 
>  - Even if walking the page tables were to be fast (ie ignoring #1), 
>    replacing a page in virtual memory is absolutely not. Especially not on 
>    SMP, where replacing a page in memory implies doing CPU crosscalls in 
>    order to invalidate the TLB on other CPUs for the old page. So before 
>    you do the "clever VM stuff", you had better have a heuristic that says
>    "this page isn't mapped, so it doesn't need the expensive cross-calls".
> 
>    Again: guess what gives you pretty much exactly that heuristic?
> 
> See?

Yes, I see.  Easy, when you put it that way.

> The fact is, "memcpy()" is damned fast for a lot of cases, because it 
> natively uses the TLB and existing caches. It's slow for other cases, but 
> you want to have a good _heuristic_ for when you might want to try to 
> avoid the slow case without avoiding the fast case. Without that heuristic 
> you can't do the optimization sanely.
> 
> And obviously the heuristic should be a really fast one. The atomic 
> copy_to_user() is the _perfect_ heuristic, because if it just does the 
> memcpy there is absolutely zero overhead (it just does it). The overhead 
> comes in only in the case where we're going to be slowed down by the fault 
> anyway, _and_ where we want to do the clever tricks.

So the overhead consists of inc/deccing preempt_count around the
copy_*_user, which fakes do_page_fault into forcing an early return.
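
To make that mechanism concrete, here is a rough user-space model of it.
This is a sketch under stated assumptions, not the kernel code: every
name prefixed with model_ is hypothetical, the "fault handler" is an
ordinary function, and "page present" is just a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for preempt_count: nonzero means "may not sleep". */
static int model_preempt_count;

struct model_page {
    bool present;        /* is the destination directly accessible? */
    char data[64];
};

/* Stand-in for the fault handler: it refuses to do the (sleeping)
 * fault-in work when called in atomic context and returns early. */
static bool model_fault(struct model_page *pg)
{
    if (model_preempt_count > 0)
        return false;    /* early return: the copy reports failure */
    pg->present = true;  /* slow path, only allowed when we may sleep */
    return true;
}

/* Returns the number of bytes NOT copied, in the style of copy_to_user(). */
static size_t model_copy(struct model_page *dst, const char *src, size_t n)
{
    assert(n <= sizeof(dst->data));
    if (!dst->present && !model_fault(dst))
        return n;
    memcpy(dst->data, src, n);
    return 0;
}

/* The "atomic" variant: the only added cost on the fast path is the
 * increment and decrement around the ordinary copy. */
static size_t model_copy_atomic(struct model_page *dst, const char *src,
                                size_t n)
{
    size_t left;

    model_preempt_count++;
    left = model_copy(dst, src, n);
    model_preempt_count--;
    return left;
}
```

A nonzero return from model_copy_atomic() is the heuristic: the page was
not directly accessible, so the caller may consider something smarter
than a plain memcpy.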

> > While I'm feeling disoriented, what exactly is the deadlock path for a
> > write from a mmaped, not uptodate page, to the same page?  And why does
> > __get_user need to touch the page in *two* places to instantiate it?
> 
> It doesn't touch it twice. It touches _both_ of the potential pages that 
> will be involved in the memcpy - since the copy may well not be 
> page-aligned in user space.

Oh duh.  I stared at that for the longest time, without realizing there's no
alignment requirement.
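
The arithmetic behind the two touches, as a small sketch. The page size
of 4096 and the names pages_spanned() and touch_ends() are assumptions
made for illustration; they are not taken from the kernel source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u   /* assumed page size for this sketch */

/* Number of distinct pages covered by the byte range [addr, addr + len). */
static size_t pages_spanned(uintptr_t addr, size_t len)
{
    assert(len > 0);
    uintptr_t first = addr / MODEL_PAGE_SIZE;
    uintptr_t last = (addr + len - 1) / MODEL_PAGE_SIZE;
    return (size_t)(last - first) + 1;
}

/* Read the first and last byte of the buffer. For a copy of at most one
 * page in length, those two reads touch every page the copy can involve. */
static void touch_ends(const volatile char *buf, size_t len)
{
    if (len == 0)
        return;
    (void)buf[0];
    (void)buf[len - 1];
}
```

A page-aligned, page-sized buffer spans one page; shift it by a single
byte and it spans two, which is why one touch is not enough.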

> The deadlock is when you do a write of a page into a mapping of the very 
> same page that isn't yet mapped. What happens is:
> 
>  - the write has gotten the page lock. Since the write knows that the whole 
>    page is going to be overwritten, it is _not_ marked uptodate, and the
>    old contents (garbage from the allocation) are left alone.
> 
>  - the copy_from_user() pagefaults and tries to bring in the _same_ page 
>    into user land.
> 
>  - that involves reading in the page and making sure it is up-to-date
> 
>  - but since the write has already locked the page, you now have a 
>    deadlock. The write cannot continue, since it needs the old contents,
>    and the old contents cannot be read in since the write holds the page
>    lock.
> 
> The "copy_from_user() atomically" solves the problem quite nicely. If the 
> atomic copy fails, we can afford to do the things that we cannot afford to 
> do normally (because the thing never triggers under real load, and real 
> load absolutely _needs_ to not try to get the page up-to-date before the 
> write). 
> 
> So with the atomic copy-from-user, we can trap the problem only when it is 
> a problem, and go full speed normally.

That's all crystal clear now.  (Though the way do_page_fault finesses
copy_from_user into returning early is a little - how should I put it -
opaque.  Yes, I see it, but...)
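
For what it is worth, the shape of the write path can be modeled in user
space roughly as follows. This is only a sketch: all names are
hypothetical, and the fallback shown (drop the page lock, fault the
source in, retry) is one assumed way of "doing the things we cannot
afford to do normally", not necessarily what the patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct model_cache_page {
    bool locked;
    char data[32];
};

/* A user buffer that may be an mmap of a page cache page. */
struct model_user_buf {
    bool mapped;                       /* already faulted in? */
    struct model_cache_page *backing;  /* page it maps, or NULL */
    char bytes[32];
};

/* Faulting in needs the backing page's lock. Returning false models the
 * deadlock: the fault would wait forever on a lock the writer holds. */
static bool model_fault_in(struct model_user_buf *u)
{
    if (u->backing && u->backing->locked)
        return false;
    u->mapped = true;
    return true;
}

/* Atomic copy: never faults, reports the bytes it could not copy. */
static size_t model_copy_from_user_atomic(char *dst,
                                          const struct model_user_buf *u,
                                          size_t n)
{
    if (!u->mapped)
        return n;
    memcpy(dst, u->bytes, n);
    return 0;
}

/* Returns the number of retries needed, or -1 if the fault-in failed. */
static int model_write(struct model_cache_page *pg,
                       struct model_user_buf *u, size_t n)
{
    int retries = 0;

    assert(n <= sizeof(pg->data));
    for (;;) {
        pg->locked = true;
        size_t left = model_copy_from_user_atomic(pg->data, u, n);
        pg->locked = false;
        if (left == 0)
            return retries;            /* fast path: full speed */
        /* Rare path: the lock is dropped, so faulting in is safe. */
        if (!model_fault_in(u))
            return -1;
        retries++;
    }
}
```

The common case takes the first branch with no extra work; only a write
whose source is not yet mapped pays for the unlock and retry.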

I'm sure you're aware there's a lot more you can do with these tricks
than just zero-copy read - there's zero-copy write as well, and there
are both of the above, except a full pte page at a time.  There could
even be a file to file copy if there were an interface for it.

I don't see what prevents the read optimization even with a mmapped
page, the page just becomes CoW in all of the mapped region, the read
destination and the page cache.

-- 
Daniel
