From mboxrd@z Thu Jan 1 00:00:00 1970
From: Rik van Riel
Subject: Re: [LSF/MM TOPIC] VM containers
Date: Mon, 25 Jan 2016 12:25:54 -0500
Message-ID: <56A65AA2.6040307@redhat.com>
References: <56A2511F.1080900@redhat.com>
 <439BF796-53D3-48C9-8578-A0733DDE8001@intel.com>
 <20160124170656.6c5460a3@lxorguk.ukuu.org.uk>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Cc: "lsf-pc@lists.linuxfoundation.org",
 Linux Memory Management List,
 Linux kernel Mailing List,
 KVM list
To: One Thousand Gnomes, "Nakajima, Jun"
Return-path:
In-Reply-To: <20160124170656.6c5460a3@lxorguk.ukuu.org.uk>
Sender: linux-kernel-owner@vger.kernel.org
List-Id: kvm.vger.kernel.org

On 01/24/2016 12:06 PM, One Thousand Gnomes wrote:
>>> That changes some of the goals the memory management subsystem has,
>>> from "use all the resources effectively" to "use as few resources as
>>> necessary, in case the host needs the memory for something else".
>
> Also "and take guidance/provide telemetry" - because you want to tune the
> VM behaviours based upon policy and to learn from them for when you re-run
> that container.
>
>> Beyond memory consumption, I would be interested whether we can harden the kernel by the paravirt interfaces for memory protection in VMs (if any). For example, the hypervisor could write-protect part of the page tables or kernel data structures in VMs, and does it help?
>
> There are four behaviours I can think of, some of which you see in
> various hypervisors and security hardening systems
>
> - die on write (a write here causes a security trap and termination after
>   the guest has marked the page range die on write, and it cannot be
>   unmarked). The guest OS at boot can for example mark all it's code as
>   die-on-write.
> - irrevocably read only (VM never allows page to be rewritten by guest
>   after the guest marks the page range irrevocably r/o)

For these we get the question "how do we make it harder for the
guest to remap the page tables to point at read/write memory,
and modify that instead of the read-only memory?"

On "smaller" guests (less than 1TB in size), it may be enough to
ensure that the kernel PUD pointer points to the (read-only)
kernel PUD at context switch time, placing the main kernel page
tables, kernel text, and some other things in read-only memory.

> - asynchronous faulting (pages the guest thinks are in it's memory but
>   are in fact on the hosts swap cause a subscribable fault in the guest
>   so that it can (where possible) be context switched

KVM (and s390) already do the asynchronous page fault trick.

> - free if needed - marking pages as freed up and either you get a page
>   back as it was or a fault and a zeroed page

People have worked on this for KVM. I do not remember what
happened to the code.

-- 
All rights reversed