From mboxrd@z Thu Jan  1 00:00:00 1970
From: Ben Catterall <Ben.Catterall@citrix.com>
Subject: Re: RFC: HVM de-privileged mode scheduling
	considerations
Date: Tue, 11 Aug 2015 11:40:02 +0100
Message-ID: <55C9D102.9040409@citrix.com>
References: <55BF6E38.509@citrix.com>	<55BF72AB.8070100@citrix.com>
	<1438612487.31129.9.camel@citrix.com>
	<CAFLBxZYbKjLTQSSJqeCQctHmOYdj9aVv5bH0RemBvGME7pj6_A@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <CAFLBxZYbKjLTQSSJqeCQctHmOYdj9aVv5bH0RemBvGME7pj6_A@mail.gmail.com>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: George Dunlap <George.Dunlap@eu.citrix.com>, Ian Campbell <ian.campbell@citrix.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>, Dario Faggioli <dario.faggioli@citrix.com>, "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>
List-Id: xen-devel@lists.xenproject.org


On 04/08/15 14:46, George Dunlap wrote:
> On Mon, Aug 3, 2015 at 3:34 PM, Ian Campbell <ian.campbell@citrix.com> wrote:
>> On Mon, 2015-08-03 at 14:54 +0100, Andrew Cooper wrote:
>>> On 03/08/15 14:35, Ben Catterall wrote:
>>>> Hi all,
>>>>
>>>> I am working on an x86 proof-of-concept to evaluate if it is feasible
>>>> to move device models and x86 emulation code for HVM guests into a
>>>> de-privileged context.
>>>>
>>>> I was hoping to get feedback from relevant maintainers on scheduling
>>>> considerations for this system to mitigate potential DoS attacks.
>>>>
>>>> Many thanks in advance,
>>>> Ben
>>>>
>>>> This is intended as a proof-of-concept, with the aim of determining if
>>>> this idea is feasible within performance constraints.
>>>>
>>>> Motivation
>>>> ----------
>>>> The motivation for moving the device models and x86 emulation code
>>>> into ring 3 is to mitigate a system compromise due to a bug in any of
>>>> these systems. These systems are currently part of the hypervisor and,
>>>> consequently, a bug in any of these could allow an attacker to gain
>>>> control of (or perform a DoS on) Xen and/or guests.
>>>>
>>>> Migrating between PCPUs
>>>> -----------------------
>>>> There is a need to support migration between pcpus so that the
>>>> scheduler can still perform this operation. However, there is an issue
>>>> to resolve. Currently, I have a per-vcpu copy of the Xen ring 0 stack
>>>> up to the point of entering the de-privileged mode. This allows us to
>>>> restore this stack and then continue from the entry point when we have
>>>> finished in de-privileged mode. There will be per-pcpu data on these
>>>> per-vcpu stacks such as saved stack frame pointers for the per-pcpu
>>>> stack, smp_processor_id() responses etc.
>>>>
>>>> Therefore, it will be necessary to lock the vcpu to the current pcpu
>>>> when it enters this user mode so that it does not wake up on a
>>>> different pcpu where such pointers and other data are invalid. We can
>>>> do this by setting a hard affinity to the pcpu that the vcpu is
>>>> executing on. See common/wait.c which does something similar to what I
>>>> am doing.
>>>>
>>>> However, needing to have hard affinity to a pcpu leads to the
>>>> following problem:
>>>> - An attacker could lock multiple vcpus to a single pcpu, leading to a
>>>> DoS. This could be achieved by spinning in a loop in Xen
>>>> de-privileged mode (assuming a bug in this mode) and performing this
>>>> operation on multiple vcpus at once. The attacker could wait until all
>>>> of their vcpus were on the same pcpu and then execute this attack.
>>>> This could cause the pcpu to, effectively, lock up, as it will be
>>>> under heavy load, and we would be unable to move work elsewhere.
>>>>
>>>> A solution to the DoS would be to force migration to another pcpu if
>>>> the vcpu has remained in de-privileged mode for, say, 100 quanta.
>>>> This forcing of migration would require us to
>>>> forcibly complete the de-privileged operation, and then, just before
>>>> returning into the guest, force a cpu change. We could not just force
>>>> a migration at the schedule call point as the Xen stack needs to
>>>> unwind to free up resources. We would reset this count each time we
>>>> completed a de-privileged mode operation.
>>>>
>>>> A legitimate long-running de-privileged operation would trigger this
>>>> forced migration mechanism. However, it is unlikely that such
>>>> operations will be needed and the count can be adjusted appropriately
>>>> to mitigate this.
>>>>
>>>> Any suggestions or feedback would be appreciated!
>>>
>>> I don't see why any scheduling support is needed.
>>>
>>> Currently all operations like this are run synchronously in the vmexit
>>> context of the vcpu.  Any current DoS is already a real issue.
>>
>> The point is that this work is supposed to mitigate (or eliminate) such
>> issues, so we would like to remove this existing real issue.
>>
>> IOW while it might be expected that an in-Xen DM can DoS the system, an
>> in-Xen-ring3 DM should not be able to do so.
>>
>>> In any reasonable situation, emulation of a device is a small state
>>> mutation and occasionally kicking off a further action to perform.  (The
>>> far bigger risk from this kind of emulation is following bad
>>> pointers/etc, rather than long loops.)
>>>
>>> I think it would be entirely reasonable to have a deadline for a single
>>> execution of depriv mode, after which the domain is declared malicious
>>> and killed.
>>
>> I think this could make sense, it's essentially a harsher variant of Ben's
>> suggestion to abort an attempt to process the MMIO in order to migrate to
>> another pcpu, but it has the benefit of being easier to implement and
>> easier to reason about in terms of interactions with other aspects of the
>> system (i.e. it seems to remove the need to think of ways an attacker might
>> game that other system).
>>
>>> We already have this for host pcpus - the watchdog defaults to 5
>>> seconds.  Having a similar cutoff for depriv mode should be fine.
>>
>> That's a reasonable analogy.
>>
>> Perhaps we would want the depriv-watchdog to be some 1/N fraction of the
>> pcpu-watchdog, for a smallish N, to avoid the risk of any slop in the
>> timing allowing the pcpu watchdog to fire. N=3 for example (on the grounds
>> that N=2 is probably sufficient, so N=3 must be awesome).
>
> +1
>
>   -George
>
Thanks all! I'll do this then. Appreciate the feedback!
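
To make sure we mean the same thing, here is a rough sketch of the
deadline check as I understand the suggestion. It is only an
illustration, not code from the Xen tree: every name in it
(depriv_state, depriv_enter, depriv_exit, depriv_deadline_exceeded and
the two macros) is made up, the 5 second value is the pcpu watchdog
default mentioned above, N=3 is Ian's example, and it assumes a
monotonic nanosecond timestamp is available at entry and at each check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values: 5 s pcpu watchdog default, N=3 as suggested above. */
#define PCPU_WATCHDOG_NS        (5ULL * 1000000000ULL)
#define DEPRIV_WATCHDOG_DIVISOR 3ULL
#define DEPRIV_DEADLINE_NS      (PCPU_WATCHDOG_NS / DEPRIV_WATCHDOG_DIVISOR)

/* Hypothetical per-vcpu bookkeeping for one depriv-mode execution. */
struct depriv_state {
    bool     in_depriv;   /* vcpu is currently running depriv code */
    uint64_t entered_ns;  /* monotonic time at entry to depriv mode */
};

/* Record the time at which the vcpu enters depriv mode. */
static void depriv_enter(struct depriv_state *s, uint64_t now_ns)
{
    s->in_depriv = true;
    s->entered_ns = now_ns;
}

/* Clear the state when the depriv operation completes normally. */
static void depriv_exit(struct depriv_state *s)
{
    assert(s->in_depriv);
    s->in_depriv = false;
}

/*
 * True if a single depriv execution has overrun its deadline. The
 * caller (e.g. a timer tick) would then treat the domain as malicious
 * and kill it, well before the pcpu watchdog itself could fire.
 */
static bool depriv_deadline_exceeded(const struct depriv_state *s,
                                     uint64_t now_ns)
{
    return s->in_depriv && (now_ns - s->entered_ns) > DEPRIV_DEADLINE_NS;
}
```

The deadline is per execution of depriv mode, so it is re-armed on
every entry. With the assumed values it comes to roughly 1.67 seconds.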

Ben