linux-kernel.vger.kernel.org archive mirror
From: Steven Rostedt <rostedt@goodmis.org>
To: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	Andi Kleen <andi@firstfloor.org>, Ingo Molnar <mingo@elte.hu>,
	LKML <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Steven Rostedt <rostedt@rostedt.homelinux.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Christoph Hellwig <hch@lst.de>, Li Zefan <lizf@cn.fujitsu.com>,
	Lai Jiangshan <laijs@cn.fujitsu.com>,
	Johannes Berg <johannes.berg@intel.com>,
	Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>,
	Arnaldo Carvalho de Melo <acme@infradead.org>,
	Tom Zanussi <tzanussi@gmail.com>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Jeremy Fitzhardinge <jeremy@goop.org>,
	"Frank Ch. Eigler" <fche@redhat.com>,
	Tejun Heo <htejun@gmail.com>
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe
Date: Thu, 15 Jul 2010 10:46:13 -0400
Message-ID: <1279205173.4190.53.camel@localhost>
In-Reply-To: <20100715141118.GA6417@nowhere>

On Thu, 2010-07-15 at 16:11 +0200, Frederic Weisbecker wrote:

> >  - make sure that you only ever use _one_ single top-level entry for
> > all vmalloc issues, and can make sure that all processes are created
> > with that static entry filled in. This is optimal, but it just doesn't
> > work on all architectures (eg on 32-bit x86, it would limit the
> > vmalloc space to 4MB in non-PAE, whatever)
> 
> 
> But then, even if you ensure that, don't we need to also fill lower level
> entries for the new mapping.

If I understand your question, you do not need to worry about the lower
level entries because all the processes will share the same top level.

process 1's PGD ------,
                      |
                      +------> PMD --> ...
                      |
process 2's PGD ------'

Thus we have one lower-level page table shared by all processes. The
issue arises when the vm space crosses the PMD boundary and we need to
update the PGDs of all processes to point to the new PMD we add to
cover the growth of the vm space.
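
To make the sharing concrete, here is a minimal toy model in Python,
not kernel code. The names (VMALLOC_SLOT, shared_pmd, new_pgd,
translate) are made up for illustration, and the tables are plain lists
and dicts. It only shows that a mapping added to the shared lower-level
table is visible through every process's top-level table (PGD) without
touching any of them.

```python
# Toy model: each process has its own top-level table (PGD), but the
# vmalloc slot in every PGD refers to the same lower-level table object.
VMALLOC_SLOT = 3   # hypothetical PGD index covering the vmalloc range

shared_pmd = {}    # lower-level table: page index -> physical frame


def new_pgd():
    """Create a per-process PGD with the shared entry pre-filled."""
    pgd = [None] * 4
    pgd[VMALLOC_SLOT] = shared_pmd
    return pgd


def translate(pgd, slot, index):
    """Walk the two-level toy table; return the frame, or None if unmapped."""
    table = pgd[slot]
    if table is None:
        return None
    return table.get(index)


pgd1 = new_pgd()
pgd2 = new_pgd()

# A new vmalloc mapping only touches the shared lower-level table...
shared_pmd[7] = 0x1234

# ...and is visible through both PGDs without updating either of them.
print(translate(pgd1, VMALLOC_SLOT, 7) == translate(pgd2, VMALLOC_SLOT, 7))
```

The trouble described above corresponds to needing a second slot: a
slot that is still None in every existing PGD would have to be filled
in one by one.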

> 
> Also, why is this a worry for vmalloc but not for kmalloc? Don't we also
> risk to add a new memory mapping for new memory allocated with kmalloc?

Because all of memory (well, 800-some megs on 32-bit) is mapped into
the kernel address space of all processes. That is, kmalloc only uses
this memory (as does get_free_page()). All processes have a PMD (or PUD,
whatever) that maps this memory. The issue only arises when we use new
virtual memory, which vmalloc does. Vmalloc may map to physical memory
that is already mapped in all processes, but the address that vmalloc
uses to access that memory is not yet mapped.
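
As a rough sketch of why kmalloc never needs a new mapping: in the
always-mapped region, the kernel virtual address is just the physical
address plus a fixed offset. The constants below are assumptions for
the example (the common 32-bit x86 3G/1G split with an 896 MB direct
mapping), and the function names are only illustrative.

```python
PAGE_OFFSET = 0xC0000000           # assumed start of the kernel direct mapping
LOWMEM_BYTES = 896 * 1024 * 1024   # assumed size of the directly mapped region


def phys_to_virt(phys):
    """Translate a physical address inside the direct mapping."""
    if not 0 <= phys < LOWMEM_BYTES:
        raise ValueError("physical address is outside the direct mapping")
    return PAGE_OFFSET + phys


def virt_to_phys(virt):
    """Inverse translation; pure arithmetic, no page table changes."""
    phys = virt - PAGE_OFFSET
    if not 0 <= phys < LOWMEM_BYTES:
        raise ValueError("virtual address is outside the direct mapping")
    return phys


print(hex(phys_to_virt(0x00100000)))
```

Nothing here ever adds a page table entry, which is the point: the
mapping exists before the allocation happens.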

The usual reason the kernel uses vmalloc is to get a large range of
memory that is contiguous in virtual address space. Vmalloc can map
several pages as one contiguous piece of memory that in reality is
several different pages scattered around physical memory. kmalloc can
only hand out pages that are contiguous in physical memory. That is, if
kmalloc gets 8192 bytes on an arch with 4096 byte pages, it will
allocate two consecutive pages in physical memory. If two contiguous
pages are not available, even if thousands of single pages are, the
kmalloc will fail, whereas the vmalloc will not.
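
The difference can be sketched with a toy free-frame map. This is only
a model of the two policies, not how the real allocators are written;
alloc_contiguous and alloc_scattered are hypothetical names.

```python
def alloc_contiguous(free, count):
    """kmalloc-like: find `count` physically adjacent free frames."""
    run = 0
    for frame, is_free in enumerate(free):
        run = run + 1 if is_free else 0
        if run == count:
            start = frame - count + 1
            return list(range(start, frame + 1))
    return None


def alloc_scattered(free, count):
    """vmalloc-like: any `count` free frames will do."""
    frames = [f for f, is_free in enumerate(free) if is_free][:count]
    return frames if len(frames) == count else None


# Every other frame is free: plenty of memory, but no two adjacent frames.
free = [i % 2 == 0 for i in range(16)]

print(alloc_contiguous(free, 2))   # no adjacent pair exists
print(alloc_scattered(free, 2))    # two separate frames are enough
```

With this fragmented map the contiguous request fails while the
scattered one succeeds, which is the 8192-byte case described above.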

A vmalloc allocation can use two different pages and just set up the
page tables to make them contiguous from the kernel's point of view.
Note, this comes at a cost. One is that when we do this, we may need to
update a bunch of page tables. The other is that we must spend TLB
entries on these separate pages. Kmalloc and get_free_page() use the
big memory mappings. That is, if the TLB allows us to map large pages,
we can do that for kernel memory since we just want the contiguous
memory as it is in physical memory.

Thus the kernel maps the physical memory with as few TLB entries as
possible (large pages and large TLB entries). If we can map 64K pages,
we do that. Then kmalloc just allocates within this range; it does not
need to map any pages. They are already mapped.
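
As a back-of-the-envelope illustration of the TLB cost, assuming one
TLB entry per mapped page and ignoring how many entries the hardware
really has, the sketch below counts the entries needed to cover a
buffer at different page sizes. The function name and the 8 MB buffer
are just illustrative picks.

```python
KB = 1024
MB = 1024 * KB


def tlb_entries_needed(region_bytes, page_bytes):
    """Number of page-sized mappings needed to cover the region (rounded up)."""
    return -(-region_bytes // page_bytes)


buffer_bytes = 8 * MB
for page_bytes in (4 * KB, 64 * KB, 4 * MB):
    print(page_bytes // KB, "KB pages:",
          tlb_entries_needed(buffer_bytes, page_bytes))
```

A vmalloc area built from scattered 4K pages sits at the expensive end
of this range, while the large-page kernel mapping sits at the cheap
end.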

Does this make a bit more sense?

> 
> 
> 
> >  - at vmalloc time, when adding a new page directory entry, walk all
> > the tens of thousands of existing page tables under a lock that
> > guarantees that we don't add any new ones (ie it will lock out fork())
> > and add the required pgd entry to them.
> > 
> >  - or just take the fault and do the "fill the page tables" on demand.
> > 
> > Quite frankly, most of the time it's probably better to make that last
> > choice (unless your hardware makes it easy to make the first choice,
> > which is obviously simplest for everybody). It makes it _much_ cheaper
> > to do vmalloc. It also avoids that nasty latency issue. And it's just
> > simpler too, and has no interesting locking issues with how/when you
> > expose the page tables in fork() etc.
> > 
> > So the only downside is that you do end up taking a fault in the
> > (rare) case where you have a newly created task that didn't get an
> > even newer vmalloc entry.
> 
> 
> But then how did the previous tasks get this new mapping? You said
> we don't walk through every process page tables for vmalloc.

Actually we don't even need to walk the page tables in the first task
(although we might do that). When the kernel accesses that memory we
take a page fault; the fault handler sees that the address is in the
vmalloc area and fills in the page tables for the task at that time.
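
A toy model of that "fill on demand" step, under the assumption that
vmalloc records the mapping in one reference table and each task copies
the missing top-level entry from it when it faults. All names here
(reference_pgd, vmalloc_map, access) are hypothetical and the tables
are plain lists and dicts, so treat it as a sketch of the idea only.

```python
VMALLOC_SLOTS = range(2, 4)   # hypothetical top-level slots for vmalloc space

reference_pgd = [None] * 4    # master kernel table, updated by vmalloc itself


def vmalloc_map(slot, index, frame):
    """vmalloc-like: install the mapping in the reference table only."""
    if reference_pgd[slot] is None:
        reference_pgd[slot] = {}
    reference_pgd[slot][index] = frame


def access(task_pgd, slot, index):
    """Return (frame, faulted). On a miss in the vmalloc range, copy the
    top-level entry from the reference table and retry the lookup."""
    faulted = False
    if task_pgd[slot] is None:
        if slot not in VMALLOC_SLOTS or reference_pgd[slot] is None:
            raise MemoryError("genuine bad access")
        task_pgd[slot] = reference_pgd[slot]   # the fill-on-demand step
        faulted = True
    frame = task_pgd[slot].get(index)
    if frame is None:
        raise MemoryError("genuine bad access")
    return frame, faulted


task = [None] * 4             # a task that has never touched vmalloc space
vmalloc_map(2, 5, 0xABC)
print(access(task, 2, 5))     # first touch takes the fault and fixes up
print(access(task, 2, 5))     # later touches do not fault
```

No task's table is walked at vmalloc time in this model; each task pays
for one fault the first time it touches the new range.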

> 
> I would understand this race if we were to walk on every processes page
> tables and add the new mapping on them, but we missed one new task that
> forked or so, because we didn't lock (or just rcu).
> 
> 
> 
> > And that fault can sometimes be in an
> > interrupt or an NMI. Normally it's trivial to handle that fairly
> > simple nested fault. But NMI has that inconvenient "iret unblocks
> > NMI's, because there is no dedicated 'nmiret' instruction" problem on
> > x86.
> 
> 
> Yeah.
> 
> 
> So the parts of the problem I don't understand are:
> 
> - why don't we have this problem with kmalloc() ?

I hope I explained that above.

> - did I understand well the race that makes the fault necessary,
>   ie: we walk the tasklist lockless, add the new mapping if
>   not present, but we might miss a task lately forked, but
>   the fault will fix that.

I'm lost on this race. If we take a lock and walk all page tables I
think the race goes away. So I don't understand this either.

-- Steve


