From: Jag Raman <jag.raman@oracle.com>
To: Stefan Hajnoczi <stefanha@gmail.com>
Cc: "Elena Ufimtseva" <elena.ufimtseva@oracle.com>,
"John G Johnson" <john.g.johnson@oracle.com>,
sstabellini@kernel.org, konrad.wilk@oracle.com,
qemu-devel@nongnu.org, "Philippe Mathieu-Daudé" <f4bug@amsat.org>,
ross.lagerwall@citrix.com, liran.alon@oracle.com,
"Stefan Hajnoczi" <stefanha@redhat.com>,
kanth.ghatraju@oracle.com
Subject: Re: [Qemu-devel] [multiprocess RFC PATCH 36/37] multi-process: add the concept description to docs/devel/qemu-multiprocess
Date: Tue, 11 Jun 2019 11:53:05 -0400
Message-ID: <735c942b-9ab9-86df-d112-d6b1fc7e90f9@oracle.com>
In-Reply-To: <20190523104018.GE26632@stefanha-x1.localdomain>
On 5/23/2019 6:40 AM, Stefan Hajnoczi wrote:
> On Tue, May 07, 2019 at 03:00:52PM -0400, Jag Raman wrote:
>> Hi Stefan,
>>
>> Thank you very much for your feedback. Following is a summary of the
>> discussions our team had regarding your feedback.
>>
>> On 4/25/2019 11:44 AM, Stefan Hajnoczi wrote:
>>>
>>> Can multiple LSI SCSI controllers be launched such that each process
>>> only has access to a subset of disk images? Or is the disk image label
>>> per-VM so that there is no isolation between LSI SCSI controller
>>> processes for that VM?
>>
>> Yes, it is possible to provide each process with access to a subset of
>> disk images. The Orchestrator (libvirt, etc.) assigns a set of MCS
>> Categories to each VM, then device instances can be isolated by being
>> assigned a subset of the VM’s Categories.
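>>
>> As a purely hypothetical illustration (the exact types and categories
>> below are made up, not taken from an actual policy): a VM assigned
>> categories c10,c20 could have its LSI process and its disk image pinned
>> to c10 only, so another device process of the same VM labelled with
>> only c20 cannot open that image:
>>
>>   QEMU control plane        system_u:system_r:svirt_t:s0:c10,c20
>>   LSI SCSI device process   system_u:system_r:svirt_t:s0:c10
>>   LSI's disk image          system_u:object_r:svirt_image_t:s0:c10
>>   other device process      system_u:system_r:svirt_t:s0:c20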
>>
>>>
>>> My concern with this overall approach is the practicality vs its
>>> benefits. Regarding practicality, each emulated device needs to be
>>> proxied separately. The QEMU subsystem used by the device also needs to
>>> be proxied. Global state, monitor commands, and live migration all
>>> require code changes to support proxied operation. This is very
>>> invasive.
>>>
>>> Then each emulated device needs an SELinux policy to achieve the
>>> benefits of confinement. I have no idea how to correctly write a policy
>>> like this and it's likely that developers who contribute a single new
>>> device will not be proficient in it either. Writing these policies is a
>>> rare thing and few people will be good at this. It also makes me worry
>>> about how we test and review them.
>>
>> We also think that having an SELinux policy per device would become
>> complicated. Our proposal, therefore, is to define SELinux policies for
>> each device class - viz. disk, network, console, graphics, etc. The
>> "fedora-selinux" upstream repo [1] will contain these policies, so the
>> device developer doesn't have to worry about defining new policies for
>> each device. This would reduce the complexity of the SELinux policies.
>
> Have you considered using Linux namespaces? I'm beginning to think that
> SELinux becomes less relevant with pid and mount namespaces to isolate
> processes. The advantage of namespaces is that they are easy to
> understand and can be expressed in code instead of a policy file in a
> separate package. This is the approach we're taking with virtiofsd
> (vhost-user device backend for virtio-fs).
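>
> Roughly, a minimal sketch of that idea in C (not virtiofsd's actual
> code; it assumes the backend process has the privileges needed to
> unshare namespaces, e.g. CAP_SYS_ADMIN or a user namespace):
>
>     #define _GNU_SOURCE
>     #include <sched.h>
>     #include <sys/mount.h>
>     #include <unistd.h>
>     #include <stdio.h>
>     #include <stdlib.h>
>
>     /* Confine the device backend before it starts serving requests. */
>     static void sandbox(const char *newroot)
>     {
>         /* New mount and PID namespaces for this process. */
>         if (unshare(CLONE_NEWNS | CLONE_NEWPID) < 0) {
>             perror("unshare"); exit(1);
>         }
>         /* Keep mount changes private to this namespace. */
>         if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) < 0) {
>             perror("mount"); exit(1);
>         }
>         /* Restrict the filesystem view to what the backend needs. */
>         if (chroot(newroot) < 0 || chdir("/") < 0) {
>             perror("chroot"); exit(1);
>         }
>         /* CLONE_NEWPID only takes effect for children, so fork and
>          * continue in the child, which is PID 1 of the new namespace. */
>         pid_t child = fork();
>         if (child < 0) {
>             perror("fork"); exit(1);
>         }
>         if (child > 0) {
>             exit(0);
>         }
>     }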
>
>>>
>>> Despite the efforts required in making this work, all processes still
>>> effectively have full access to the guest since they can access guest
>>> RAM. What I mean is that the device is actually not confined to its
>>> host process (e.g. LSI SCSI controller process) because it can write
>>> code to executable guest RAM pages. The guest will then execute that
>>> code and therefore all guest I/O (networking, disk, etc) is still
>>> available indirectly to the "confined" processes. They are not really
>>> sandboxed from the outside world, regardless of how strict the SELinux
>>> policy is :(.
>>>
>>> There are performance issues due to proxying as well, but let's ignore
>>> them for now and focus on security.
>>
>> We are also focusing on performance. Please take a look at the following
>> blog for an initial report on performance. The results are for an iSCSI
>> backend in Oracle Cloud. We are working on collecting data on a much
>> heavier IOPS workload like an NVMe backend.
>>
>> https://blogs.oracle.com/linux/towards-a-more-secure-qemu-hypervisor%2c-part-3-of-3-v2
>
> Hard to reach a conclusion without also looking at CPU utilization.
> IOPS alone don't tell the story.
>
> If the system has spare CPU cycles then the performance results between
> built-in LSI and separate LSI will be similar, but the efficiency
> (IOPS/CPU%) will actually have decreased due to the extra CPU cycles
> required to forward hardware register accesses to the device emulation
> process.
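>
> To illustrate the metric: 200k IOPS at 20% host CPU is 10,000 IOPS per
> CPU%, while the same 200k IOPS at 40% CPU is only 5,000 IOPS per CPU%,
> i.e. identical throughput but half the efficiency.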
>
> If you rerun on a system without spare CPU cycles then IOPS degradation
> would become apparent. I'm not saying this is necessarily the case
> (maybe the overhead really doesn't have a significant effect), but the
> graph shown in the blog post isn't enough to draw a conclusion either
> way.
Hi Stefan,

We are working on getting a better idea of the CPU utilization while
the performance test is running. We're looking forward to discussing
this during the forthcoming KVM meeting.

Thank you!
--
Jag
>
> Regarding the proposed QEMU bypass, similar mechanisms already exist in
> some form via kvm.ko's ioeventfd and coalesced MMIO features.
>
> Today ioeventfd is only used for performance-critical hardware
> registers, so kvm.ko doesn't use a sophisticated dispatch mechanism. If
> you want to use it for all hardware register accesses handled by a
> separate process then ioeventfd probably needs to be tweaked somewhat to
> make it more scalable for that case.
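>
> A rough sketch of the registration as it works today (standard
> KVM_IOEVENTFD usage; the helper name and the vm_fd parameter are
> assumptions for illustration, not taken from this patch series):
>
>     #include <linux/kvm.h>
>     #include <stdint.h>
>     #include <sys/eventfd.h>
>     #include <sys/ioctl.h>
>
>     /* Wire a 4-byte MMIO register at guest-physical address 'gpa' to an
>      * eventfd: a guest write wakes whoever polls the eventfd instead of
>      * causing a heavyweight exit back into the VMM's MMIO dispatch. */
>     static int add_kick_eventfd(int vm_fd, uint64_t gpa)
>     {
>         int efd = eventfd(0, EFD_NONBLOCK);
>         struct kvm_ioeventfd kick = {
>             .addr  = gpa,
>             .len   = 4,
>             .fd    = efd,
>             .flags = 0,   /* no datamatch: any written value triggers */
>         };
>         if (efd < 0 || ioctl(vm_fd, KVM_IOEVENTFD, &kick) < 0) {
>             return -1;
>         }
>         return efd;       /* the device process polls this fd */
>     }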
>
> Coalesced MMIO is also cool. kvm.ko can accumulate guest MMIO writes in
> a buffer that is only collected at a later point in time. This improves
> performance for devices that require multiple hardware register writes
> to kick off an I/O operation (only the last one really needs to be
> trapped by the device emulation code!). This sounds similar to an MMIO
> access shared ring buffer.
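>
> Registering a coalesced region is likewise a single vm ioctl (a sketch;
> it assumes KVM_CAP_COALESCED_MMIO is available, vm_fd is the VM file
> descriptor, and the address is made up):
>
>     /* Writes to this page of MMIO space are queued by kvm.ko in a ring
>      * buffer and only handled the next time the vcpu exits anyway,
>      * instead of trapping on every single register write. */
>     struct kvm_coalesced_mmio_zone zone = {
>         .addr = 0xfebf1000,
>         .size = 0x1000,
>     };
>     ioctl(vm_fd, KVM_REGISTER_COALESCED_MMIO, &zone);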
>
>>>
>>> How do the benefits compare against today's monolithic approach? If the
>>> guest exploits monolithic QEMU it has full access to all host files and
>>> APIs available to QEMU. However, these are largely just the resources
>>> that belong to the guest anyway - not resources we are trying to keep
>>> away from the guest. With multi-process QEMU each process still has
>>> access to all guest interfaces via the code injection I mentioned above,
>>> but the SELinux policy could restrict access to some resources. But
>>> this benefit is really small in my opinion, given that the resources
>>> belong to the guest anyway and the guest can already access them.
>>
>> The primary focus of our project is to defend the host from a malicious
>> guest. The code injection problem you outlined above involves part of
>> the guest attacking itself, but not the host. Therefore, this wouldn't
>> compromise our objective.
>>
>> As you know, there are some parts of QEMU which are not directly
>> accessible from the guest (via drivers, etc.), which we prefer to call
>> the control plane. It executes ioctls to the host kernel and has access
>> to a broader set of syscalls, which the device emulation code doesn’t
>> need. We want to protect the control plane from emulated devices. In the
>> case where a device injects code into the RAM to attack another device
>> on the same VM, the control plane would still be protected.
>
> Are you aware of any cases where the syscall attack surface led to an
> exploitable bug in QEMU? Any proof-of-concept exploit code or a CVE?
>
>> Another benefit of the project would be detecting and reporting
>> failures in the emulated devices. For instance, in cases like
>> CVE-2018-18849, where an emulated device hangs or crashes, it wouldn't
>> directly crash the QEMU process as well. QEMU could detect the failure,
>> log the problem and exit, instead of dumping core or hanging.
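>>
>> A rough sketch of what such detection could look like in the control
>> plane (hypothetical code, not from the patch series; real code would
>> defer the logging out of the signal handler):
>>
>>     #include <signal.h>
>>     #include <stdio.h>
>>     #include <sys/wait.h>
>>
>>     /* SIGCHLD handler: if a device emulation process dies, report which
>>      * process failed and how, then let the control plane decide whether
>>      * to pause the VM, restart the device, or exit cleanly. */
>>     static void device_child_exited(int sig)
>>     {
>>         int status;
>>         pid_t pid = waitpid(-1, &status, WNOHANG);
>>         if (pid > 0 && WIFSIGNALED(status)) {
>>             fprintf(stderr, "device process %d killed by signal %d\n",
>>                     (int)pid, WTERMSIG(status));
>>         }
>>     }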
>
> Debugging is a lot easier with a coredump though :). I would rather
> have a coredump than a nice message that says "LSI died".
>
>>>
>>> I think you can implement this for a handful of devices as a one-time
>>> thing, but the invasiveness and the impracticality of getting wide
>>> coverage of QEMU make this approach questionable.
>>>
>>> Am I mistaken about the invasiveness or impracticality?
>>
>> We are not planning to implement this for all devices since it would be
>> impractical. But the project adds a framework for implementing more
>> devices in the future.
>>
>> One other thing we would like to bring to your attention is that the
>> project doesn't affect current usage. The same devices could still be
>> used as part of a monolithic QEMU if the user chooses to do so.
>
> I don't follow; to me this proposal seems extremely invasive and
> requires awareness from all developers.
>
> QEMU contains global state (like net/net.c:net_clients or
> block.c:all_bdrv_states) and QMP commands that access global state. All
> of this needs to be carefully proxied to avoid losing functionality as
> fundamental as the QMP monitor.
>
> This is what worries me about this project. There are amazing niche
> features like record/replay that have been integrated into QEMU without
> requiring all developers to be aware of how they work. If you can
> achieve this then I would have no reservations.
>
> Right now I don't see that this will be possible and that's why I'm
> challenging you to justify that the reduction in system call attack
> surface is actually worth the invasive changes required.
>
> Do you see a way to solve the issues I've mentioned?
>
> Stefan
>