From: Frederic Weisbecker <fweisbec@gmail.com>
To: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
Andi Kleen <andi@firstfloor.org>, Ingo Molnar <mingo@elte.hu>,
LKML <linux-kernel@vger.kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
Peter Zijlstra <peterz@infradead.org>,
Steven Rostedt <rostedt@rostedt.homelinux.com>,
Thomas Gleixner <tglx@linutronix.de>,
Christoph Hellwig <hch@lst.de>, Li Zefan <lizf@cn.fujitsu.com>,
Lai Jiangshan <laijs@cn.fujitsu.com>,
Johannes Berg <johannes.berg@intel.com>,
Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>,
Arnaldo Carvalho de Melo <acme@infradead.org>,
Tom Zanussi <tzanussi@gmail.com>,
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
"H. Peter Anvin" <hpa@zytor.com>,
Jeremy Fitzhardinge <jeremy@goop.org>,
"Frank Ch. Eigler" <fche@redhat.com>,
Tejun Heo <htejun@gmail.com>
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe
Date: Fri, 16 Jul 2010 12:47:26 +0200 [thread overview]
Message-ID: <20100716104716.GA5377@nowhere> (raw)
In-Reply-To: <1279205173.4190.53.camel@localhost>
On Thu, Jul 15, 2010 at 10:46:13AM -0400, Steven Rostedt wrote:
> On Thu, 2010-07-15 at 16:11 +0200, Frederic Weisbecker wrote:
>
> > > - make sure that you only ever use _one_ single top-level entry for
> > > all vmalloc issues, and can make sure that all processes are created
> > > with that static entry filled in. This is optimal, but it just doesn't
> > > work on all architectures (eg on 32-bit x86, it would limit the
> > > vmalloc space to 4MB in non-PAE, whatever)
> >
> >
> > But then, even if you ensure that, don't we need to also fill lower level
> > entries for the new mapping.
>
> If I understand your question, you do not need to worry about the lower
> level entries because all the processes will share the same top level.
>
> process 1's GPD ------,
> |
> +------> PMD --> ...
> |
> process 2' GPD -------'
>
> Thus we have one page entry shared by all processes. The issue happens
> when the vm space crosses the PMD boundary and we need to update all the
> GPD's of all processes to point to the new PMD we need to add to handle
> the spread of the vm space.
Oh right. We point to that PMD, and the update has been made itself inside
the lower level entries pointed by the PMD. Indeed.
>
> >
> > Also, why is this a worry for vmalloc but not for kmalloc? Don't we also
> > risk to add a new memory mapping for new memory allocated with kmalloc?
>
> Because all of memory (well 800 some megs on 32 bit) is mapped into
> memory for all processes. That is, kmalloc only uses this memory (as
> does get_free_page()). All processes have a PMD (or PUD, whatever) that
> maps this memory. The issues only arises when we use new virtual memory,
> which vmalloc does. Vmalloc may map to physical memory that is already
> mapped to all processes but the address that the vmalloc uses to access
> that memory is not yet mapped.
Ok I see.
>
> The usual reason the kernel uses vmalloc is to get a contiguous range of
> memory. The vmalloc can map several pages as one contiguous piece of
> memory that in reality is several different pages scattered around
> physical memory. kmalloc can only map pages that are contiguous in
> physical memory. That is, if kmalloc gets 8192 bytes on an arch with
> 4096 byte pages, it will allocate two consecutive pages in physical
> memory. If two contiguous pages are not available even if thousand of
> single pages are, the kmalloc will fail, where as the vmalloc will not.
>
> An allocation of vmalloc can use two different pages and just map the
> page table to make them contiguous in view of the kernel. Note, this
> comes at a cost. One is when we do this, we suffer the case where we
> need to update a bunch of page tables. The other is that we must waste
> TLB entries to point to these separate pages. Kmalloc and
> get_free_page() uses the big memory mappings. That is, if the TLB allows
> us to map large pages, we can do that for kernel memory since we just
> want the contiguous memory as it is in physical memory.
>
> Thus the kernel maps the physical memory with the fewest TLB entries as
> needed (large pages and large TLB entries). If we can map 64K pages, we
> do that. Then kmalloc just allocates within this range, it does not need
> to map any pages. They are already mapped.
>
> Does this make a bit more sense?
Totally! You've made it very clear to me.
Moreover I did not know we can have such variable page size. I mean I thought
we can have variable page size but that would apply to every pages.
>
> >
> >
> >
> > > - at vmalloc time, when adding a new page directory entry, walk all
> > > the tens of thousands of existing page tables under a lock that
> > > guarantees that we don't add any new ones (ie it will lock out fork())
> > > and add the required pgd entry to them.
> > >
> > > - or just take the fault and do the "fill the page tables" on demand.
> > >
> > > Quite frankly, most of the time it's probably better to make that last
> > > choice (unless your hardware makes it easy to make the first choice,
> > > which is obviously simplest for everybody). It makes it _much_ cheaper
> > > to do vmalloc. It also avoids that nasty latency issue. And it's just
> > > simpler too, and has no interesting locking issues with how/when you
> > > expose the page tables in fork() etc.
> > >
> > > So the only downside is that you do end up taking a fault in the
> > > (rare) case where you have a newly created task that didn't get an
> > > even newer vmalloc entry.
> >
> >
> > But then how did the previous tasks get this new mapping? You said
> > we don't walk through every process page tables for vmalloc.
>
> Actually we don't even need to walk the page tables in the first task
> (although we might do that). When the kernel accesses that memory we
> take the page fault, the page fault will see that this memory is vmalloc
> data and fill in the page tables for the task at that time.
Right.
> >
> > I would understand this race if we were to walk on every processes page
> > tables and add the new mapping on them, but we missed one new task that
> > forked or so, because we didn't lock (or just rcu).
> >
> >
> >
> > > And that fault can sometimes be in an
> > > interrupt or an NMI. Normally it's trivial to handle that fairly
> > > simple nested fault. But NMI has that inconvenient "iret unblocks
> > > NMI's, because there is no dedicated 'nmiret' instruction" problem on
> > > x86.
> >
> >
> > Yeah.
> >
> >
> > So the parts of the problem I don't understand are:
> >
> > - why don't we have this problem with kmalloc() ?
>
> I hope I explained that above.
Yeah :)
Thanks a lot for your explanations!
next prev parent reply other threads:[~2010-07-16 10:47 UTC|newest]
Thread overview: 163+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-07-14 15:49 [patch 0/2] x86: NMI-safe trap handlers Mathieu Desnoyers
2010-07-14 15:49 ` [patch 1/2] x86_64 page fault NMI-safe Mathieu Desnoyers
2010-07-14 16:28 ` Linus Torvalds
2010-07-14 17:06 ` Mathieu Desnoyers
2010-07-14 18:10 ` Linus Torvalds
2010-07-14 18:46 ` Ingo Molnar
2010-07-14 19:14 ` Linus Torvalds
2010-07-14 19:36 ` Frederic Weisbecker
2010-07-14 19:54 ` Linus Torvalds
2010-07-14 20:17 ` Mathieu Desnoyers
2010-07-14 20:55 ` Linus Torvalds
2010-07-14 21:18 ` Ingo Molnar
2010-07-14 22:14 ` Frederic Weisbecker
2010-07-14 22:31 ` Mathieu Desnoyers
2010-07-14 22:48 ` Frederic Weisbecker
2010-07-14 23:11 ` Mathieu Desnoyers
2010-07-14 23:38 ` Frederic Weisbecker
2010-07-15 16:26 ` Mathieu Desnoyers
2010-08-03 17:18 ` Peter Zijlstra
2010-08-03 18:25 ` Mathieu Desnoyers
2010-08-04 6:46 ` Peter Zijlstra
2010-08-04 7:14 ` Ingo Molnar
2010-08-04 14:45 ` Mathieu Desnoyers
2010-08-04 14:56 ` Peter Zijlstra
2010-08-06 1:49 ` Mathieu Desnoyers
2010-08-06 9:51 ` Peter Zijlstra
2010-08-06 13:46 ` Mathieu Desnoyers
2010-08-06 6:18 ` Masami Hiramatsu
2010-08-06 9:50 ` Peter Zijlstra
2010-08-06 13:37 ` Mathieu Desnoyers
2010-08-07 9:51 ` Masami Hiramatsu
2010-08-09 16:53 ` Frederic Weisbecker
2010-08-03 18:56 ` Linus Torvalds
2010-08-03 19:45 ` Mathieu Desnoyers
2010-08-03 20:02 ` Linus Torvalds
2010-08-03 20:10 ` Ingo Molnar
2010-08-03 20:21 ` Ingo Molnar
2010-08-03 21:16 ` Mathieu Desnoyers
2010-08-03 20:54 ` Mathieu Desnoyers
2010-08-04 6:27 ` Peter Zijlstra
2010-08-04 14:06 ` Mathieu Desnoyers
2010-08-04 14:50 ` Peter Zijlstra
2010-08-06 1:42 ` Mathieu Desnoyers
2010-08-06 10:11 ` Peter Zijlstra
2010-08-06 11:14 ` Peter Zijlstra
2010-08-06 14:15 ` Mathieu Desnoyers
2010-08-06 14:13 ` Mathieu Desnoyers
2010-08-11 14:44 ` Steven Rostedt
2010-08-11 14:34 ` Steven Rostedt
2010-08-15 13:35 ` Mathieu Desnoyers
2010-08-15 16:33 ` Avi Kivity
2010-08-15 16:44 ` Mathieu Desnoyers
2010-08-15 16:51 ` Avi Kivity
2010-08-15 18:31 ` Mathieu Desnoyers
2010-08-16 10:49 ` Avi Kivity
2010-08-16 11:29 ` Avi Kivity
2010-08-04 6:46 ` Dave Chinner
2010-08-04 7:21 ` Ingo Molnar
2010-07-14 23:40 ` Steven Rostedt
2010-07-14 19:41 ` Linus Torvalds
2010-07-14 19:56 ` Andi Kleen
2010-07-14 20:05 ` Mathieu Desnoyers
2010-07-14 20:07 ` Andi Kleen
2010-07-14 20:08 ` H. Peter Anvin
2010-07-14 23:32 ` Tejun Heo
2010-07-14 22:31 ` Frederic Weisbecker
2010-07-14 22:56 ` Linus Torvalds
2010-07-14 23:09 ` Andi Kleen
2010-07-14 23:22 ` Linus Torvalds
2010-07-15 14:11 ` Frederic Weisbecker
2010-07-15 14:35 ` Andi Kleen
2010-07-16 11:21 ` Frederic Weisbecker
2010-07-15 14:46 ` Steven Rostedt
2010-07-16 10:47 ` Frederic Weisbecker [this message]
2010-07-16 11:43 ` Steven Rostedt
2010-07-15 14:51 ` Linus Torvalds
2010-07-15 15:38 ` Linus Torvalds
2010-07-16 12:00 ` Frederic Weisbecker
2010-07-16 12:54 ` Steven Rostedt
2010-07-14 20:39 ` Mathieu Desnoyers
2010-07-14 21:23 ` Linus Torvalds
2010-07-14 21:45 ` Maciej W. Rozycki
2010-07-14 21:52 ` Linus Torvalds
2010-07-14 22:31 ` Maciej W. Rozycki
2010-07-14 22:21 ` Mathieu Desnoyers
2010-07-14 22:37 ` Linus Torvalds
2010-07-14 22:51 ` Jeremy Fitzhardinge
2010-07-14 23:02 ` Linus Torvalds
2010-07-14 23:54 ` Jeremy Fitzhardinge
2010-07-15 1:23 ` Linus Torvalds
2010-07-15 1:45 ` Linus Torvalds
2010-07-15 18:31 ` Mathieu Desnoyers
2010-07-15 18:43 ` Linus Torvalds
2010-07-15 18:48 ` Linus Torvalds
2010-07-15 22:01 ` Mathieu Desnoyers
2010-07-15 22:16 ` Linus Torvalds
2010-07-15 22:24 ` H. Peter Anvin
2010-07-15 22:26 ` Linus Torvalds
2010-07-15 22:46 ` H. Peter Anvin
2010-07-15 22:58 ` Andi Kleen
2010-07-15 23:20 ` H. Peter Anvin
2010-07-15 23:23 ` Linus Torvalds
2010-07-15 23:41 ` H. Peter Anvin
2010-07-15 23:44 ` Linus Torvalds
2010-07-15 23:46 ` H. Peter Anvin
2010-07-15 23:48 ` Andi Kleen
2010-07-15 22:30 ` Mathieu Desnoyers
2010-07-16 19:13 ` Mathieu Desnoyers
2010-07-15 16:44 ` Mathieu Desnoyers
2010-07-15 16:49 ` Linus Torvalds
2010-07-15 17:38 ` Mathieu Desnoyers
2010-07-15 20:44 ` H. Peter Anvin
2010-07-18 11:03 ` Avi Kivity
2010-07-18 17:36 ` Linus Torvalds
2010-07-18 18:04 ` Avi Kivity
2010-07-18 18:22 ` Linus Torvalds
2010-07-19 7:32 ` Avi Kivity
2010-07-18 18:17 ` Linus Torvalds
2010-07-18 18:43 ` Steven Rostedt
2010-07-18 19:26 ` Linus Torvalds
2010-07-14 15:49 ` [patch 2/2] x86 NMI-safe INT3 and Page Fault Mathieu Desnoyers
2010-07-14 16:42 ` Maciej W. Rozycki
2010-07-14 18:12 ` Mathieu Desnoyers
2010-07-14 19:21 ` Maciej W. Rozycki
2010-07-14 19:58 ` Mathieu Desnoyers
2010-07-14 20:36 ` Maciej W. Rozycki
2010-07-16 12:28 ` Avi Kivity
2010-07-16 14:49 ` Mathieu Desnoyers
2010-07-16 15:34 ` Andi Kleen
2010-07-16 15:40 ` Mathieu Desnoyers
2010-07-16 16:47 ` Avi Kivity
2010-07-16 16:58 ` Mathieu Desnoyers
2010-07-16 17:54 ` Avi Kivity
2010-07-16 18:05 ` H. Peter Anvin
2010-07-16 18:15 ` Avi Kivity
2010-07-16 18:17 ` H. Peter Anvin
2010-07-16 18:28 ` Avi Kivity
2010-07-16 18:37 ` Linus Torvalds
2010-07-16 19:26 ` Avi Kivity
2010-07-16 21:39 ` Linus Torvalds
2010-07-16 22:07 ` Andi Kleen
2010-07-16 22:26 ` Linus Torvalds
2010-07-16 22:41 ` Andi Kleen
2010-07-17 1:15 ` Linus Torvalds
2010-07-16 22:40 ` Mathieu Desnoyers
2010-07-18 9:23 ` Avi Kivity
2010-07-16 18:22 ` Mathieu Desnoyers
2010-07-16 18:32 ` Avi Kivity
2010-07-16 19:29 ` H. Peter Anvin
2010-07-16 19:39 ` Avi Kivity
2010-07-16 19:32 ` Andi Kleen
2010-07-16 18:25 ` Linus Torvalds
2010-07-16 19:30 ` Andi Kleen
2010-07-18 9:26 ` Avi Kivity
2010-07-16 19:28 ` Andi Kleen
2010-07-16 19:32 ` Avi Kivity
2010-07-16 19:34 ` Andi Kleen
2010-08-04 9:46 ` Peter Zijlstra
2010-08-04 20:23 ` H. Peter Anvin
2010-07-14 17:06 ` [patch 0/2] x86: NMI-safe trap handlers Andi Kleen
2010-07-14 17:08 ` Mathieu Desnoyers
2010-07-14 18:56 ` Andi Kleen
2010-07-14 23:29 ` Tejun Heo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20100716104716.GA5377@nowhere \
--to=fweisbec@gmail.com \
--cc=acme@infradead.org \
--cc=akpm@linux-foundation.org \
--cc=andi@firstfloor.org \
--cc=fche@redhat.com \
--cc=hch@lst.de \
--cc=hpa@zytor.com \
--cc=htejun@gmail.com \
--cc=jeremy@goop.org \
--cc=johannes.berg@intel.com \
--cc=kosaki.motohiro@jp.fujitsu.com \
--cc=laijs@cn.fujitsu.com \
--cc=linux-kernel@vger.kernel.org \
--cc=lizf@cn.fujitsu.com \
--cc=masami.hiramatsu.pt@hitachi.com \
--cc=mathieu.desnoyers@efficios.com \
--cc=mingo@elte.hu \
--cc=peterz@infradead.org \
--cc=rostedt@goodmis.org \
--cc=rostedt@rostedt.homelinux.com \
--cc=tglx@linutronix.de \
--cc=torvalds@linux-foundation.org \
--cc=tzanussi@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.