All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ian Campbell <Ian.Campbell@eu.citrix.com>
To: Ian Jackson <Ian.Jackson@eu.citrix.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>,
	"xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>
Subject: Re: [PATCH 10 of 24] libxc: infrastructure for hypercall safe data buffers
Date: Tue, 7 Sep 2010 19:44:46 +0100	[thread overview]
Message-ID: <1283885086.15518.32.camel@localhost.localdomain> (raw)
In-Reply-To: <19590.29997.199583.59386@mariner.uk.xensource.com>

On Tue, 2010-09-07 at 18:23 +0100, Ian Jackson wrote: 
> Ian Campbell writes ("Re: [Xen-devel] [PATCH 10 of 24] libxc: infrastructure for hypercall safe data buffers"):
> > It's not clear what phase 2 actually is (although phase 3 is clearly
> > profit), I don't think any existing syscalls do what we need. mlock
> > (avoiding the stack) gets pretty close and so far the issues with mlock
> > seem to have been more potential than hurting us in practice, but it
> > pays to be prepared e.g. for more aggressive page migration/coalescing
> > in the future, I think.
> 
> Ian and I discussed this extensively on IRC, during which conversation
> I became convinced that mlock() must do what we want.  Having read the
> code in the kernel I'm not not so sure.

After we had our discussion some other conversation I had (I forget
where/with whom) which made me pretty sure we were wrong as well.

> The ordinary userspace access functions are all written to cope with
> pagefaults and retry the access.  So userspace addresses are not in
> general valid in kernel mode even if you've called functions to try to
> test them.

Correct, the difference between a normal userspace access function and a
hypercall is that it is possible to inject (and handle) and page fault
in the former case whereas we cannot inject a page fault to a VCPU while
it is processing a hypercall.

(Maybe it is possible in principle to make all hypercalls restartable
such that we can return to the guest in order to inject page faults but
its not the case right now and I suspect it would be an enormous amount
of work to make it so)

>   It's not clear what mlock prevents; does it prevent NUMA
> page migration ?  If not then I think indeed the page could be made
> not present by one VCPU editing the page tables while another VCPU is
> entering the hypercall, so that the 2nd VCPU will get a spurious
> EFAULT.

I think you are right, these kinds of page faults are possible.

It seems that mlock is only specified to prevent major page faults (i.e.
those requiring I/O to service) but doesn't specify anything regarding
minor page faults. It ensures that the data is resident in RAM but not
necessarily that it is continuously mapped into your virtual address
space nor writeable.

Minor page faults could be caused by NUMA migration (as you say), CoW
mappings or by the kernel trying to consolidate free memory in order to
satisfy a higher order allocation (Linux has recently gained this exact
functionality, I believe). I'm sure there are a host of other potential
causes too...

It's possible that historically most of these potential minor fault
causes were either not implemented in the kernels we were using for
domain 0 (e.g. consolidation is pretty new) or not likely to hit in
practice (e.g. perhaps libxc's usage patterns make it likely that any
CoW mappings are already dealt with by the time the hypercall happens).

Going forward I think it's likely that NUMA migration and memory
consolidation and the like will become more widespread.

> OTOH: there must be other things that work like Xen - what about user
> mode device drivers of various kinds ?  Do X servers not mlock memory
> and expect to be able to tell the video card to DMA to it ?  etc.

DMA would require physical (or more strictly DMA) addresses rather than
virtual addresses so locking the page into a particular virtual address
space doesn't matter all that much from a DMA point of view. I don't
think pure user mode device drivers can do DMA, there is always some
sort of kernel stub required.

In any case the kernel has been moving away from needing privileged X
servers with direct access to hardware in favour of KMS for a while so
I'm not sure an appeal to any similarity we may have with that case
helps us much.

> I think if linux-kernel think that people haven't assumed that mlock()
> actually pins the page, they're mistaken - and it's likely to be not
> just us.

Unfortunately, I think we're reasonably unique. 

Ian.

  reply	other threads:[~2010-09-07 18:44 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-09-06 13:38 [PATCH 00 of 24] [RFC] libxc: hypercall buffers Ian Campbell
2010-09-06 13:38 ` [PATCH 01 of 24] xen: define raw version of set_xen_guest_handle Ian Campbell
2010-09-06 13:38 ` [PATCH 02 of 24] libxc: flask: use (un)lock pages rather than open coding m(un)lock Ian Campbell
2010-09-06 13:38 ` [PATCH 03 of 24] libxc: pass an xc_interface handle to page locking functions Ian Campbell
2010-09-06 13:38 ` [PATCH 04 of 24] libxc: Remove unnecessary double indirection from xc_readconsolering Ian Campbell
2010-09-06 13:38 ` [PATCH 05 of 24] libxc: use correct size of struct xen_mc Ian Campbell
2010-09-06 13:38 ` [PATCH 06 of 24] libxc: add to xc_domain_maximum_gpfn Ian Campbell
2010-09-06 13:38 ` [PATCH 07 of 24] libxc: replace open-coded use of XENMEM_decrease_reservation Ian Campbell
2010-09-06 13:38 ` [PATCH 08 of 24] libxc: simplify performance counters API Ian Campbell
2010-09-06 13:38 ` [PATCH 09 of 24] libxc: simplify lock profiling API Ian Campbell
2010-09-06 13:38 ` [PATCH 10 of 24] libxc: infrastructure for hypercall safe data buffers Ian Campbell
2010-09-07  8:44   ` Jeremy Fitzhardinge
2010-09-07  9:56     ` Ian Campbell
2010-09-07 17:23       ` Ian Jackson
2010-09-07 18:44         ` Ian Campbell [this message]
2010-09-07 23:31         ` Jeremy Fitzhardinge
2010-09-06 13:38 ` [PATCH 11 of 24] libxc: convert xc_version over to hypercall buffers Ian Campbell
2010-09-06 13:38 ` [PATCH 12 of 24] libxc: convert domctl interfaces " Ian Campbell
2010-09-06 13:38 ` [PATCH 13 of 24] libxc: convert shadow domctl interfaces and save/restore " Ian Campbell
2010-09-06 13:38 ` [PATCH 14 of 24] libxc: convert sysctl interfaces " Ian Campbell
2010-09-06 13:38 ` [PATCH 15 of 24] libxc: convert watchdog interface " Ian Campbell
2010-09-06 13:38 ` [PATCH 16 of 24] libxc: convert acm interfaces " Ian Campbell
2010-09-06 13:38 ` [PATCH 17 of 24] libxc: convert evtchn " Ian Campbell
2010-09-06 13:38 ` [PATCH 18 of 24] libxc: convert schedop " Ian Campbell
2010-09-06 13:38 ` [PATCH 19 of 24] libxc: convert physdevop interface " Ian Campbell
2010-09-06 13:38 ` [PATCH 20 of 24] libxc: convert flask interfaces " Ian Campbell
2010-09-06 13:38 ` [PATCH 21 of 24] libxc: convert hvmop " Ian Campbell
2010-09-06 13:38 ` [PATCH 22 of 24] libxc: convert mca interface " Ian Campbell
2010-09-06 13:38 ` [PATCH 23 of 24] libxc: convert tmem " Ian Campbell
2010-09-06 13:38 ` [PATCH 24 of 24] libxc: convert gnttab interfaces " Ian Campbell
2010-09-06 13:41 ` [PATCH 00 of 24] [RFC] libxc: " Ian Campbell
2010-09-07 16:35 ` Ian Jackson
2010-09-07 16:36   ` Ian Campbell
2010-09-07 17:28     ` Ian Jackson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1283885086.15518.32.camel@localhost.localdomain \
    --to=ian.campbell@eu.citrix.com \
    --cc=Ian.Jackson@eu.citrix.com \
    --cc=jeremy@goop.org \
    --cc=xen-devel@lists.xensource.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.