From: Rik van Riel <riel@redhat.com>
To: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>,
	"Nakajima, Jun" <jun.nakajima@intel.com>
Cc: "lsf-pc@lists.linuxfoundation.org"
	<lsf-pc@lists.linuxfoundation.org>,
	Linux Memory Management List <linux-mm@kvack.org>,
	Linux kernel Mailing List <linux-kernel@vger.kernel.org>,
	KVM list <kvm@vger.kernel.org>
Subject: Re: [LSF/MM TOPIC] VM containers
Date: Mon, 25 Jan 2016 12:25:54 -0500
Message-ID: <56A65AA2.6040307@redhat.com>
In-Reply-To: <20160124170656.6c5460a3@lxorguk.ukuu.org.uk>

On 01/24/2016 12:06 PM, One Thousand Gnomes wrote:
>>> That changes some of the goals the memory management subsystem has,
>>> from "use all the resources effectively" to "use as few resources as
>>> necessary, in case the host needs the memory for something else".
> 
> Also "and take guidance/provide telemetry" - because you want to tune the
> VM behaviours based upon policy and to learn from them for when you re-run
> that container.
> 
>> Beyond memory consumption, I would be interested in whether we can harden
>> the kernel through paravirt interfaces for memory protection in VMs (if
>> any). For example, the hypervisor could write-protect part of the page
>> tables or kernel data structures in VMs; would that help?
> 
> There are four behaviours I can think of, some of which you see in
> various hypervisors and security hardening systems:
> 
> - die on write (a write here causes a security trap and termination after
>   the guest has marked the page range die-on-write, and it cannot be
>   unmarked). The guest OS at boot can, for example, mark all its code as
>   die-on-write.
> - irrevocably read only (VM never allows page to be rewritten by guest
>   after the guest marks the page range irrevocably r/o)

For these we get the question "how do we make it harder for the
guest to remap the page tables to point at read/write memory,
and modify that instead of the read-only memory?"

On "smaller" guests (less than 1TB in size), it may be enough to
ensure that the kernel PUD pointer points to the (read-only) kernel
PUD at context switch time, placing the main kernel page tables,
kernel text, and some other things in read-only memory.
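To make the two sticky modes and the page-table question concrete, here is a toy Python model. It is a sketch, not any hypervisor's interface: the `Hypervisor`, `Guest` and `kernel_pud_ok` names, and treating a kernel mapping update as a single write to the frame holding the kernel PUD, are assumptions made for illustration. It shows that protecting a data frame alone does not help if the guest can still rewrite the mapping that points at it, and it models the context-switch check as comparing the kernel entry of the incoming top-level table against the protected PUD frame.

```python
from enum import Enum, auto


class Prot(Enum):
    NORMAL = auto()
    DIE_ON_WRITE = auto()    # any later write terminates the guest
    IRREVOCABLE_RO = auto()  # later writes are refused; the guest keeps running


class GuestKilled(Exception):
    """Raised when the guest writes to a die-on-write frame."""


class Hypervisor:
    """Hypothetical host-side table of guest-requested, sticky frame protection."""

    def __init__(self, nr_frames):
        self.state = [Prot.NORMAL] * nr_frames
        self.memory = [None] * nr_frames

    def mark(self, pfn, prot):
        # One-way: a protected frame can never be unmarked or re-marked.
        if self.state[pfn] is not Prot.NORMAL:
            return False
        self.state[pfn] = prot
        return True

    def write(self, pfn, value):
        if self.state[pfn] is Prot.DIE_ON_WRITE:
            raise GuestKilled(pfn)
        if self.state[pfn] is Prot.IRREVOCABLE_RO:
            return False  # write dropped
        self.memory[pfn] = value
        return True


class Guest:
    """Hypothetical guest whose kernel mappings live in one page-table frame."""

    def __init__(self, hv, kernel_pud_pfn):
        self.hv = hv
        self.kernel_pud_pfn = kernel_pud_pfn
        self.kernel_map = {}  # kernel virtual page -> frame number
        # Kernel entry of the current top-level table, i.e. which frame
        # the running context uses as its kernel PUD.
        self.top_level_kernel_entry = kernel_pud_pfn

    def remap(self, vpage, pfn):
        # Changing a kernel mapping is a write to the page-table frame, so
        # it is subject to that frame's protection.
        if not self.hv.write(self.kernel_pud_pfn, (vpage, pfn)):
            return False
        self.kernel_map[vpage] = pfn
        return True


def kernel_pud_ok(hv, guest, protected_pud_pfn):
    # Context-switch check: the incoming context must still use the
    # protected, read-only kernel PUD rather than a writable copy.
    return (guest.top_level_kernel_entry == protected_pud_pfn
            and hv.state[protected_pud_pfn] is Prot.IRREVOCABLE_RO)
```

Marking the kernel PUD frame irrevocably read-only is what stops a remap to writable memory; the context-switch check covers the other route, where the guest installs a top-level table pointing at a writable PUD of its own.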

> - asynchronous faulting (pages the guest thinks are in its memory but
>   are in fact on the host's swap cause a subscribable fault in the guest
>   so that it can, where possible, be context switched)

KVM (and s390) already do the asynchronous page fault trick.
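A toy model of the guest side of asynchronous page faults as described in the quoted text: the host reports that a page is not present and hands the guest a token, the guest parks the faulting task under that token and runs something else instead of stalling the whole vCPU, and a later "page ready" notification carrying the same token makes the task runnable again. The token-keyed shape is meant only to illustrate the general idea; the class and method names are hypothetical and are not the KVM or s390 interface.

```python
from collections import deque


class GuestScheduler:
    """Hypothetical guest-side handling of host-reported async page faults."""

    def __init__(self, tasks):
        self.runqueue = deque(tasks)
        self.waiting = {}  # token -> task blocked on a host-side page-in
        self.current = None

    def schedule(self):
        # Pick the next runnable task; None means the vCPU would idle.
        self.current = self.runqueue.popleft() if self.runqueue else None
        return self.current

    def page_not_present(self, token):
        # The page the current task touched is in host swap: park the task
        # under the host's token and switch to another task.
        self.waiting[token] = self.current
        return self.schedule()

    def page_ready(self, token):
        # The host has brought the page back in: requeue the parked task.
        task = self.waiting.pop(token)
        self.runqueue.append(task)
        return task
```

For example, with tasks `A`, `B` and `C`, a not-present report while `A` runs lets `B` run immediately, and the matching ready report puts `A` at the back of the run queue.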

> - free if needed - marking pages as freed up and either you get a page
>   back as it was or a fault and a zeroed page

People have worked on this for KVM. I do not remember what
happened to the code.
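A toy model of the "free if needed" semantics quoted above: the guest marks pages whose contents it could regenerate, the host may drop them under memory pressure without swapping them out, and when the guest takes a page back it either gets the old contents or a zeroed page plus an indication that the data is gone (a `False` flag here, standing in for the fault). The class and method names are hypothetical; this is not the KVM work mentioned above, whose code is not referenced in this thread.

```python
PAGE_SIZE = 4096


class FreeIfNeeded:
    """Hypothetical host-side model of "free if needed" guest pages."""

    def __init__(self):
        self.pages = {}         # guest frame number -> page contents
        self.volatile = set()   # frames the guest marked free-if-needed
        self.discarded = set()  # marked frames the host has dropped

    def write(self, gfn, data):
        page = self.pages.setdefault(gfn, bytearray(PAGE_SIZE))
        page[:len(data)] = data

    def mark_free_if_needed(self, gfn):
        self.volatile.add(gfn)

    def host_reclaim(self):
        # Under memory pressure the host can drop marked pages outright,
        # with no need to write them to swap first.
        for gfn in self.volatile:
            self.pages.pop(gfn, None)
            self.discarded.add(gfn)
        self.volatile.clear()

    def guest_take_back(self, gfn):
        """Return (contents_preserved, page) for a previously marked frame."""
        self.volatile.discard(gfn)
        if gfn in self.discarded:
            # False stands in for the fault telling the guest its data is gone.
            self.discarded.remove(gfn)
            self.pages[gfn] = bytearray(PAGE_SIZE)  # fresh zeroed page
            return False, self.pages[gfn]
        return True, self.pages.setdefault(gfn, bytearray(PAGE_SIZE))
```

The design point is that the host never has to pay for swap I/O on these pages, and the guest only pays (by regenerating the data) when the host actually needed the memory.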

-- 
All rights reversed
