From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ben Catterall
Subject: Re: RFC: HVM de-privileged mode scheduling considerations
Date: Tue, 11 Aug 2015 11:40:02 +0100
Message-ID: <55C9D102.9040409@citrix.com>
References: <55BF6E38.509@citrix.com> <55BF72AB.8070100@citrix.com> <1438612487.31129.9.camel@citrix.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: George Dunlap, Ian Campbell
Cc: Andrew Cooper, Dario Faggioli, "xen-devel@lists.xen.org"
List-Id: xen-devel@lists.xenproject.org

On 04/08/15 14:46, George Dunlap wrote:
> On Mon, Aug 3, 2015 at 3:34 PM, Ian Campbell wrote:
>> On Mon, 2015-08-03 at 14:54 +0100, Andrew Cooper wrote:
>>> On 03/08/15 14:35, Ben Catterall wrote:
>>>> Hi all,
>>>>
>>>> I am working on an x86 proof-of-concept to evaluate if it is feasible
>>>> to move device models and x86 emulation code for HVM guests into a
>>>> de-privileged context.
>>>>
>>>> I was hoping to get feedback from relevant maintainers on scheduling
>>>> considerations for this system to mitigate potential DoS attacks.
>>>>
>>>> Many thanks in advance,
>>>> Ben
>>>>
>>>> This is intended as a proof-of-concept, with the aim of determining if
>>>> this idea is feasible within performance constraints.
>>>>
>>>> Motivation
>>>> ----------
>>>> The motivation for moving the device models and x86 emulation code
>>>> into ring 3 is to mitigate a system compromise due to a bug in any of
>>>> these systems. These systems are currently part of the hypervisor and,
>>>> consequently, a bug in any of these could allow an attacker to gain
>>>> control (or perform a DoS) of Xen and/or guests.
>>>>
>>>> Migrating between PCPUs
>>>> -----------------------
>>>> There is a need to support migration between pcpus so that the
>>>> scheduler can still perform this operation. However, there is an issue
>>>> to resolve. Currently, I have a per-vcpu copy of the Xen ring 0 stack
>>>> up to the point of entering the de-privileged mode. This allows us to
>>>> restore this stack and then continue from the entry point when we have
>>>> finished in de-privileged mode. There will be per-pcpu data on these
>>>> per-vcpu stacks such as saved stack frame pointers for the per-pcpu
>>>> stack, smp_processor_id() responses etc.
>>>>
>>>> Therefore, it will be necessary to lock the vcpu to the current pcpu
>>>> when it enters this user mode so that it does not wake up on a
>>>> different pcpu where such pointers and other data are invalid. We can
>>>> do this by setting a hard affinity to the pcpu that the vcpu is
>>>> executing on. See common/wait.c which does something similar to what I
>>>> am doing.
>>>>
>>>> However, needing to have hard affinity to a pcpu leads to the
>>>> following problem:
>>>> - An attacker could lock multiple vcpus to a single pcpu, leading to a
>>>> DoS. This could be achieved by spinning in a loop in Xen
>>>> de-privileged mode (assuming a bug in this mode) and performing this
>>>> operation on multiple vcpus at once. The attacker could wait until all
>>>> of their vcpus were on the same pcpu and then execute this attack.
>>>> This could cause the pcpu to, effectively, lock up, as it will be
>>>> under heavy load, and we would be unable to move work elsewhere.
>>>>
>>>> A solution to the DoS would be to force migration to another pcpu
>>>> if, say, 100 quanta have passed while the vcpu has remained in
>>>> de-privileged mode. This forcing of migration would require us to
>>>> forcibly complete the de-privileged operation, and then, just before
>>>> returning into the guest, force a cpu change.
>>>> We could not just force
>>>> a migration at the schedule call point as the Xen stack needs to
>>>> unwind to free up resources. We would reset this count each time we
>>>> completed a de-privileged mode operation.
>>>>
>>>> A legitimate long-running de-privileged operation would trigger this
>>>> forced migration mechanism. However, it is unlikely that such
>>>> operations will be needed and the count can be adjusted appropriately
>>>> to mitigate this.
>>>>
>>>> Any suggestions or feedback would be appreciated!
>>>
>>> I don't see why any scheduling support is needed.
>>>
>>> Currently all operations like this are run synchronously in the vmexit
>>> context of the vcpu. Any current DoS is already a real issue.
>>
>> The point is that this work is supposed to mitigate (or eliminate) such
>> issues, so we would like to remove this existing real issue.
>>
>> IOW while it might be expected that an in-Xen DM can DoS the system, an
>> in-Xen-ring3 DM should not be able to do so.
>>
>>> In any reasonable situation, emulation of a device is a small state
>>> mutation and occasionally kicking off a further action to perform. (The
>>> far bigger risk from this kind of emulation is following bad
>>> pointers/etc, rather than long loops.)
>>>
>>> I think it would be entirely reasonable to have a deadline for a single
>>> execution of depriv mode, after which the domain is declared malicious
>>> and killed.
>>
>> I think this could make sense, it's essentially a harsher variant of Ben's
>> suggestion to abort an attempt to process the MMIO in order to migrate to
>> another pcpu, but it has the benefit of being easier to implement and
>> easier to reason about in terms of interactions with other aspects of the
>> system (i.e. it seems to remove the need to think of ways an attacker might
>> game that other system).
>>
>>> We already have this for host pcpus - the watchdog defaults to 5
>>> seconds. Having a similar cutoff for depriv mode should be fine.
>>
>> That's a reasonable analogy.
>>
>> Perhaps we would want the depriv-watchdog to be some 1/N fraction of the
>> pcpu-watchdog, for a smallish N, to avoid the risk of any slop in the
>> timing allowing the pcpu watchdog to fire. N=3 for example (on the grounds
>> that N=2 is probably sufficient, so N=3 must be awesome).
>
> +1
>
> -George
>

Thanks all! I'll do this then. Appreciate the feedback!

Ben
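
For illustration only, the deadline agreed on above (a depriv watchdog set to 1/N of the pcpu watchdog, with N=3 and the 5 second default mentioned in the thread) could be sketched roughly as follows. This is a standalone sketch, not Xen code: struct depriv_state, the depriv_*() helpers and both macros are hypothetical names invented for this example, and the caller is assumed to supply a monotonic timestamp in nanoseconds.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 5 s, the pcpu watchdog default mentioned in the thread. */
#define PCPU_WATCHDOG_NS        (5ULL * 1000000000ULL)
/* N = 3, as suggested in the thread. */
#define DEPRIV_WATCHDOG_DIVISOR 3ULL

/* Hypothetical per-vcpu bookkeeping; not an existing Xen structure. */
struct depriv_state {
    uint64_t entry_ns; /* time at which de-privileged mode was entered */
    bool     active;   /* true while the vcpu is in de-privileged mode */
};

/* Deadline for a single execution of de-privileged mode. */
static inline uint64_t depriv_deadline_ns(void)
{
    return PCPU_WATCHDOG_NS / DEPRIV_WATCHDOG_DIVISOR;
}

/* Record the entry time when the vcpu drops into de-privileged mode. */
static inline void depriv_enter(struct depriv_state *s, uint64_t now_ns)
{
    s->entry_ns = now_ns;
    s->active = true;
}

/* Clear the state when the de-privileged operation completes. */
static inline void depriv_exit(struct depriv_state *s)
{
    s->active = false;
}

/*
 * Returns true if the current execution has overrun its deadline. In the
 * scheme discussed above, the caller would then declare the domain
 * malicious and kill it.
 */
static inline bool depriv_expired(const struct depriv_state *s,
                                  uint64_t now_ns)
{
    return s->active && (now_ns - s->entry_ns) > depriv_deadline_ns();
}
```

Something like depriv_expired() would be polled from a periodic timer or at the schedule call point. Because the deadline is a fraction of the pcpu watchdog period, there is some slack before the pcpu watchdog itself could fire.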