From: Jag Raman <jag.raman@oracle.com>
To: Stefan Hajnoczi <stefanha@gmail.com>
Cc: "Elena Ufimtseva" <elena.ufimtseva@oracle.com>,
"John G Johnson" <john.g.johnson@oracle.com>,
sstabellini@kernel.org, konrad.wilk@oracle.com,
qemu-devel@nongnu.org, "Philippe Mathieu-Daudé" <f4bug@amsat.org>,
ross.lagerwall@citrix.com, liran.alon@oracle.com,
"Stefan Hajnoczi" <stefanha@redhat.com>,
kanth.ghatraju@oracle.com
Subject: Re: [Qemu-devel] [multiprocess RFC PATCH 36/37] multi-process: add the concept description to docs/devel/qemu-multiprocess
Date: Tue, 11 Jun 2019 11:53:05 -0400
Message-ID: <735c942b-9ab9-86df-d112-d6b1fc7e90f9@oracle.com>
In-Reply-To: <20190523104018.GE26632@stefanha-x1.localdomain>
On 5/23/2019 6:40 AM, Stefan Hajnoczi wrote:
> On Tue, May 07, 2019 at 03:00:52PM -0400, Jag Raman wrote:
>> Hi Stefan,
>>
>> Thank you very much for your feedback. Following is a summary of the
>> discussions our team had regarding your feedback.
>>
>> On 4/25/2019 11:44 AM, Stefan Hajnoczi wrote:
>>>
>>> Can multiple LSI SCSI controllers be launched such that each process
>>> only has access to a subset of disk images? Or is the disk image label
>>> per-VM so that there is no isolation between LSI SCSI controller
>>> processes for that VM?
>>
>> Yes, it is possible to provide each process with access to a subset of
>> disk images. The Orchestrator (libvirt, etc.) assigns a set of MCS
>> Categories to each VM, then device instances can be isolated by being
>> assigned a subset of the VM’s Categories.
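>>
>> As a purely hypothetical illustration (the exact types and categories
>> below are made up, not taken from an actual policy): a VM assigned
>> categories c10,c20 could have its LSI process and its disk image pinned
>> to c10 only, so another device process of the same VM labelled with
>> only c20 cannot open that image:
>>
>>   QEMU control plane        system_u:system_r:svirt_t:s0:c10,c20
>>   LSI SCSI device process   system_u:system_r:svirt_t:s0:c10
>>   LSI's disk image          system_u:object_r:svirt_image_t:s0:c10
>>   other device process      system_u:system_r:svirt_t:s0:c20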
>>
>>>
>>> My concern with this overall approach is the practicality vs its
>>> benefits. Regarding practicality, each emulated device needs to be
>>> proxied separately. The QEMU subsystem used by the device also needs to
>>> be proxied. Global state, monitor commands, and live migration all
>>> require code changes to support proxied operation. This is very
>>> invasive.
>>>
>>> Then each emulated device needs an SELinux policy to achieve the
>>> benefits of confinement. I have no idea how to correctly write a policy
>>> like this and it's likely that developers who contribute a single new
>>> device will not be proficient in it either. Writing these policies is a
>>> rare thing and few people will be good at this. It also makes me worry
>>> about how we test and review them.
>>
>> We also think that having an SELinux policy per device would become
>> complicated. Our proposal, therefore, is to define SELinux policies for
>> each device class - viz. disk, network, console, graphics, etc. The
>> "fedora-selinux" upstream repo [1] will contain these policies, so the
>> device developer doesn't have to worry about defining new policies for
>> each device. This would reduce the complexity of the SELinux policies.
>
> Have you considered using Linux namespaces? I'm beginning to think that
> SELinux becomes less relevant with pid and mount namespaces to isolate
> processes. The advantage of namespaces is that they are easy to
> understand and can be expressed in code instead of a policy file in a
> separate package. This is the approach we're taking with virtiofsd
> (vhost-user device backend for virtio-fs).
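>
> Roughly, a minimal sketch of that idea in C (not virtiofsd's actual
> code; it assumes the backend process has the privileges needed to
> unshare namespaces, e.g. CAP_SYS_ADMIN or a user namespace):
>
>     #define _GNU_SOURCE
>     #include <sched.h>
>     #include <sys/mount.h>
>     #include <unistd.h>
>     #include <stdio.h>
>     #include <stdlib.h>
>
>     /* Confine the device backend before it starts serving requests. */
>     static void sandbox(const char *newroot)
>     {
>         /* New mount and PID namespaces for this process. */
>         if (unshare(CLONE_NEWNS | CLONE_NEWPID) < 0) {
>             perror("unshare"); exit(1);
>         }
>         /* Keep mount changes private to this namespace. */
>         if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) < 0) {
>             perror("mount"); exit(1);
>         }
>         /* Restrict the filesystem view to what the backend needs. */
>         if (chroot(newroot) < 0 || chdir("/") < 0) {
>             perror("chroot"); exit(1);
>         }
>         /* CLONE_NEWPID only takes effect for children, so fork and
>          * continue in the child, which is PID 1 of the new namespace. */
>         pid_t child = fork();
>         if (child < 0) {
>             perror("fork"); exit(1);
>         }
>         if (child > 0) {
>             exit(0);
>         }
>     }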
>
>>>
>>> Despite the efforts required in making this work, all processes still
>>> effectively have full access to the guest since they can access guest
>>> RAM. What I mean is that the device is actually not confined to its
>>> host process (e.g. LSI SCSI controller process) because it can write
>>> code to executable guest RAM pages. The guest will then execute that
>>> code and therefore all guest I/O (networking, disk, etc) is still
>>> available indirectly to the "confined" processes. They are not really
>>> sandboxed from the outside world, regardless of how strict the SELinux
>>> policy is :(.
>>>
>>> There are performance issues due to proxying as well, but let's ignore
>>> them for now and focus on security.
>>
>> We are also focusing on performance. Please take a look at the following
>> blog for an initial report on performance. The results are for an iSCSI
>> backend in Oracle Cloud. We are working on collecting data on a much
>> heavier IOPS workload like an NVMe backend.
>>
>> https://blogs.oracle.com/linux/towards-a-more-secure-qemu-hypervisor%2c-part-3-of-3-v2
>
> Hard to reach a conclusion without also looking at CPU utilization.
> IOPS alone don't tell the story.
>
> If the system has spare CPU cycles then the performance results between
> built-in LSI and separate LSI will be similar, but the efficiency
> (IOPS/CPU%) will actually have decreased due to the extra CPU cycles
> required to forward hardware register accesses to the device emulation
> process.
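>
> To illustrate the metric: 200k IOPS at 20% host CPU is 10,000 IOPS per
> CPU%, while the same 200k IOPS at 40% CPU is only 5,000 IOPS per CPU%,
> i.e. identical throughput but half the efficiency.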
>
> If you rerun on a system without spare CPU cycles then IOPS degradation
> would become apparent. I'm not saying this is necessarily the case
> (maybe the overhead really doesn't have a significant effect), but the
> graph shown in the blog post isn't enough to draw a conclusion either
> way.
Hi Stefan,

We are working on getting a better idea of the CPU utilization while
the performance test is running. We're looking forward to discussing
this during the forthcoming KVM meeting.

Thank you!
--
Jag
>
> Regarding the proposed QEMU bypass, similar mechanisms already exist in
> some form via kvm.ko's ioeventfd and coalesced MMIO features.
>
> Today ioeventfd is only used for performance-critical hardware
> registers, so kvm.ko doesn't use a sophisticated dispatch mechanism. If
> you want to use it for all hardware register accesses handled by a
> separate process then ioeventfd probably needs to be tweaked somewhat to
> make it more scalable for that case.
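>
> A rough sketch of the registration as it works today (standard
> KVM_IOEVENTFD usage; the helper name and the vm_fd parameter are
> assumptions for illustration, not taken from this patch series):
>
>     #include <linux/kvm.h>
>     #include <stdint.h>
>     #include <sys/eventfd.h>
>     #include <sys/ioctl.h>
>
>     /* Wire a 4-byte MMIO register at guest-physical address 'gpa' to an
>      * eventfd: a guest write wakes whoever polls the eventfd instead of
>      * causing a heavyweight exit back into the VMM's MMIO dispatch. */
>     static int add_kick_eventfd(int vm_fd, uint64_t gpa)
>     {
>         int efd = eventfd(0, EFD_NONBLOCK);
>         struct kvm_ioeventfd kick = {
>             .addr  = gpa,
>             .len   = 4,
>             .fd    = efd,
>             .flags = 0,   /* no datamatch: any written value triggers */
>         };
>         if (efd < 0 || ioctl(vm_fd, KVM_IOEVENTFD, &kick) < 0) {
>             return -1;
>         }
>         return efd;       /* the device process polls this fd */
>     }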
>
> Coalesced MMIO is also cool. kvm.ko can accumulate guest MMIO writes in
> a buffer that is only collected at a later point in time. This improves
> performance for devices that require multiple hardware register writes
> to kick off an I/O operation (only the last one really needs to be
> trapped by the device emulation code!). This sounds similar to an MMIO
> access shared ring buffer.
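>
> Registering a coalesced region is likewise a single vm ioctl (a sketch;
> it assumes KVM_CAP_COALESCED_MMIO is available, vm_fd is the VM file
> descriptor, and the address is made up):
>
>     /* Writes to this page of MMIO space are queued by kvm.ko in a ring
>      * buffer and only handled the next time the vcpu exits anyway,
>      * instead of trapping on every single register write. */
>     struct kvm_coalesced_mmio_zone zone = {
>         .addr = 0xfebf1000,
>         .size = 0x1000,
>     };
>     ioctl(vm_fd, KVM_REGISTER_COALESCED_MMIO, &zone);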
>
>>>
>>> How do the benefits compare against today's monolithic approach? If the
>>> guest exploits monolithic QEMU it has full access to all host files and
>>> APIs available to QEMU. However, these are largely just the resources
>>> that belong to the guest anyway - not resources we are trying to keep
>>> away from the guest. With multi-process QEMU each process still has
>>> access to all guest interfaces via the code injection I mentioned above,
>>> but the SELinux policy could restrict access to some resources. But
>>> this benefit is really small in my opinion, given that the resources
>>> belong to the guest anyway and the guest can already access them.
>>
>> The primary focus of our project is to defend the host from a malicious
>> guest. The code injection problem you outlined above involves part of
>> the guest attacking itself, but not the host. Therefore, this wouldn't
>> compromise our objective.
>>
>> As you know, there are some parts of QEMU which are not directly
>> accessible from the guest (via drivers, etc.), which we prefer to call
>> the control plane. It executes ioctls to the host kernel and has access
>> to a broader set of syscalls, which the device emulation code doesn’t
>> need. We want to protect the control plane from emulated devices. In the
>> case where a device injects code into the RAM to attack another device
>> on the same VM, the control plane would still be protected.
>
> Are you aware of any cases where the syscall attack surface led to an
> exploitable bug in QEMU? Any proof-of-concept exploit code or a CVE?
>
>> Another benefit of the project would be detecting and reporting
>> failures in the emulated devices. For instance, in cases like
>> CVE-2018-18849, where an emulated device hangs or crashes, it wouldn't
>> directly crash the QEMU process as well. QEMU could detect the failure,
>> log the problem and exit, instead of dumping core or hanging.
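>>
>> A rough sketch of what such detection could look like in the control
>> plane (hypothetical code, not from the patch series; real code would
>> defer the logging out of the signal handler):
>>
>>     #include <signal.h>
>>     #include <stdio.h>
>>     #include <sys/wait.h>
>>
>>     /* SIGCHLD handler: if a device emulation process dies, report which
>>      * process failed and how, then let the control plane decide whether
>>      * to pause the VM, restart the device, or exit cleanly. */
>>     static void device_child_exited(int sig)
>>     {
>>         int status;
>>         pid_t pid = waitpid(-1, &status, WNOHANG);
>>         if (pid > 0 && WIFSIGNALED(status)) {
>>             fprintf(stderr, "device process %d killed by signal %d\n",
>>                     (int)pid, WTERMSIG(status));
>>         }
>>     }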
>
> Debugging is a lot easier with a coredump though :). I would rather
> have a coredump than a nice message that says "LSI died".
>
>>>
>>> I think you can implement this for a handful of devices as a one-time
>>> thing, but the invasiveness and the impracticality of getting wide
>>> coverage of QEMU make this approach questionable.
>>>
>>> Am I mistaken about the invasiveness or impracticality?
>>
>> We are not planning to implement this for all devices since it would be
>> impractical. But the project adds a framework for implementing more
>> devices in the future.
>>
>> One other thing we would like to bring to your attention is that the
>> project doesn't affect current usage. The same devices could still be
>> used as part of a monolithic QEMU if the user chooses to do so.
>
> I don't follow; to me this proposal seems extremely invasive and
> requires awareness from all developers.
>
> QEMU contains global state (like net/net.c:net_clients or
> block.c:all_bdrv_states) and QMP commands that access global state. All
> of this needs to be carefully proxied to avoid losing functionality as
> fundamental as the QMP monitor.
>
> This is what worries me about this project. There are amazing niche
> features like record/replay that have been integrated into QEMU without
> requiring all developers to be aware of how they work. If you can
> achieve this then I would have no reservations.
>
> Right now I don't see that this will be possible and that's why I'm
> challenging you to justify that the reduction in system call attack
> surface is actually worth the invasive changes required.
>
> Do you see a way to solve the issues I've mentioned?
>
> Stefan
>