Linux virtualization list
 help / color / mirror / Atom feed
* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Michael S. Tsirkin @ 2018-08-03 19:07 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: robh, srikar, mpe, Will Deacon, linux-kernel, linuxram,
	virtualization, Christoph Hellwig, paulus, marc.zyngier, joe,
	robin.murphy, david, linuxppc-dev, elfring, haren,
	Anshuman Khandual
In-Reply-To: <eb1750e90e4bd45da297fa6f78f8ef93671b7c2f.camel@kernel.crashing.org>

On Fri, Aug 03, 2018 at 10:58:36AM -0500, Benjamin Herrenschmidt wrote:
> On Fri, 2018-08-03 at 00:05 -0700, Christoph Hellwig wrote:
> > >   2- Make virtio use the DMA API with our custom platform-provided
> > > swiotlb callbacks when needed, that is when not using IOMMU *and*
> > > running on a secure VM in our case.
> > 
> > And total NAK the customer platform-provided part of this.  We need
> > a flag passed in from the hypervisor that the device needs all bus
> > specific dma api treatment, and then just use the normal plaform
> > dma mapping setup. 
> 
> Christoph, as I have explained already, we do NOT have a way to provide
> such a flag as neither the hypervisor nor qemu knows anything about
> this when the VM is created.

I think the fact you can't add flags from the hypervisor is
a sign of a problematic architecture, you should look at
adding that down the road - you will likely need it at some point.

However in this specific case, the flag does not need to come from the
hypervisor, it can be set by arch boot code I think.
Christoph do you see a problem with that?

> >  To get swiotlb you'll need to then use the DT/ACPI
> > dma-range property to limit the addressable range, and a swiotlb
> > capable plaform will use swiotlb automatically.
> 
> This cannot be done as you describe it.
> 
> The VM is created as a *normal* VM. The DT stuff is generated by qemu
> at a point where it has *no idea* that the VM will later become secure
> and thus will have to restrict which pages can be used for "DMA".
> 
> The VM will *at runtime* turn itself into a secure VM via interactions
> with the security HW and the Ultravisor layer (which sits below the
> HV). This happens way after the DT has been created and consumed, the
> qemu devices instanciated etc...
> 
> Only the guest kernel knows because it initates the transition. When
> that happens, the virtio devices have already been used by the guest
> firmware, bootloader, possibly another kernel that kexeced the "secure"
> one, etc... 
> 
> So instead of running around saying NAK NAK NAK, please explain how we
> can solve that differently.
> 
> Ben.
> 

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Michael S. Tsirkin @ 2018-08-03 19:08 UTC (permalink / raw)
  To: Jason Wang
  Cc: robh, srikar, benh, linuxram, linux-kernel, virtualization, hch,
	paulus, mpe, joe, david, linuxppc-dev, elfring, haren,
	Anshuman Khandual
In-Reply-To: <b31d2162-884c-0ed4-00df-e7debe0cc0e7@redhat.com>

On Fri, Aug 03, 2018 at 10:41:41AM +0800, Jason Wang wrote:
> 
> 
> On 2018年08月03日 04:55, Michael S. Tsirkin wrote:
> > On Fri, Jul 20, 2018 at 09:29:37AM +0530, Anshuman Khandual wrote:
> > > This patch series is the follow up on the discussions we had before about
> > > the RFC titled [RFC,V2] virtio: Add platform specific DMA API translation
> > > for virito devices (https://patchwork.kernel.org/patch/10417371/). There
> > > were suggestions about doing away with two different paths of transactions
> > > with the host/QEMU, first being the direct GPA and the other being the DMA
> > > API based translations.
> > > 
> > > First patch attempts to create a direct GPA mapping based DMA operations
> > > structure called 'virtio_direct_dma_ops' with exact same implementation
> > > of the direct GPA path which virtio core currently has but just wrapped in
> > > a DMA API format. Virtio core must use 'virtio_direct_dma_ops' instead of
> > > the arch default in absence of VIRTIO_F_IOMMU_PLATFORM flag to preserve the
> > > existing semantics. The second patch does exactly that inside the function
> > > virtio_finalize_features(). The third patch removes the default direct GPA
> > > path from virtio core forcing it to use DMA API callbacks for all devices.
> > > Now with that change, every device must have a DMA operations structure
> > > associated with it. The fourth patch adds an additional hook which gives
> > > the platform an opportunity to do yet another override if required. This
> > > platform hook can be used on POWER Ultravisor based protected guests to
> > > load up SWIOTLB DMA callbacks to do the required (as discussed previously
> > > in the above mentioned thread how host is allowed to access only parts of
> > > the guest GPA range) bounce buffering into the shared memory for all I/O
> > > scatter gather buffers to be consumed on the host side.
> > > 
> > > Please go through these patches and review whether this approach broadly
> > > makes sense. I will appreciate suggestions, inputs, comments regarding
> > > the patches or the approach in general. Thank you.
> > Jason did some work on profiling this. Unfortunately he reports
> > about 4% extra overhead from this switch on x86 with no vIOMMU.
> 
> The test is rather simple, just run pktgen (pktgen_sample01_simple.sh) in
> guest and measure PPS on tap on host.
> 
> Thanks

Could you supply host configuration involved please?

> > 
> > I expect he's writing up the data in more detail, but
> > just wanted to let you know this would be one more
> > thing to debug before we can just switch to DMA APIs.
> > 
> > 
> > > Anshuman Khandual (4):
> > >    virtio: Define virtio_direct_dma_ops structure
> > >    virtio: Override device's DMA OPS with virtio_direct_dma_ops selectively
> > >    virtio: Force virtio core to use DMA API callbacks for all virtio devices
> > >    virtio: Add platform specific DMA API translation for virito devices
> > > 
> > >   arch/powerpc/include/asm/dma-mapping.h |  6 +++
> > >   arch/powerpc/platforms/pseries/iommu.c |  6 +++
> > >   drivers/virtio/virtio.c                | 72 ++++++++++++++++++++++++++++++++++
> > >   drivers/virtio/virtio_pci_common.h     |  3 ++
> > >   drivers/virtio/virtio_ring.c           | 65 +-----------------------------
> > >   5 files changed, 89 insertions(+), 63 deletions(-)
> > > 
> > > -- 
> > > 2.9.3
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [net-next, v6, 6/7] net-sysfs: Add interface for Rx queue(s) map per Tx queue
From: Michael S. Tsirkin @ 2018-08-03 19:12 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: alexander.h.duyck, willemdebruijn.kernel, Nambiar, Amritha,
	netdev, tom, alexander.duyck, virtualization, edumazet, rostedt,
	hannes, sridhar.samudrala, tom, davem, gaowanlong
In-Reply-To: <20180803190650.GA2157@outlook.office365.com>

On Fri, Aug 03, 2018 at 12:06:51PM -0700, Andrei Vagin wrote:
> On Fri, Aug 03, 2018 at 12:08:05AM +0300, Michael S. Tsirkin wrote:
> > On Thu, Aug 02, 2018 at 02:04:12PM -0700, Nambiar, Amritha wrote:
> > > On 8/1/2018 5:11 PM, Andrei Vagin wrote:
> > > > On Tue, Jul 10, 2018 at 07:28:49PM -0700, Nambiar, Amritha wrote:
> > > >> With this patch series, I introduced static_key for XPS maps
> > > >> (xps_needed), so static_key_slow_inc() is used to switch branches. The
> > > >> definition of static_key_slow_inc() has cpus_read_lock in place. In the
> > > >> virtio_net driver, XPS queues are initialized after setting the
> > > >> queue:cpu affinity in virtnet_set_affinity() which is already protected
> > > >> within cpus_read_lock. Hence, the warning here trying to acquire
> > > >> cpus_read_lock when it is already held.
> > > >>
> > > >> A quick fix for this would be to just extract netif_set_xps_queue() out
> > > >> of the lock by simply wrapping it with another put/get_online_cpus
> > > >> (unlock right before and hold lock right after).
> > > > 
> > > > virtnet_set_affinity() is called from virtnet_cpu_online(), which is
> > > > called under cpus_read_lock too.
> > > > 
> > > > It looks like now we can't call netif_set_xps_queue() from cpu hotplug
> > > > callbacks.
> > > > 
> > > > I can suggest a very straightforward fix for this problem. The patch is
> > > > attached.
> > > > 
> > > 
> > > Thanks for looking into this. I was thinking of fixing this in the
> > > virtio_net driver by moving the XPS initialization (and have a new
> > > get_affinity utility) in the ndo_open (so it is together with other tx
> > > preparation) instead of probe. Your patch solves this in general for
> > > setting up cpu hotplug callbacks which is under cpus_read_lock.
> > 
> > 
> > I like this too. Could you repost in a standard way
> > (inline, with your signoff etc) so we can ack this for
> > net-next?
> 
> When I was testing this patch, I got the following kasan warning. Michael,
> could you take a look at it. Maybe you will understand what was going wrong there.
> 
> https://api.travis-ci.org/v3/job/410701353/log.txt
> 
> [    7.275033] ==================================================================
> [    7.275226] BUG: KASAN: slab-out-of-bounds in virtnet_poll+0xaa1/0xd00
> [    7.275359] Read of size 8 at addr ffff8801d444a000 by task ip/370
> [    7.275488] 
> [    7.275610] CPU: 1 PID: 370 Comm: ip Not tainted 4.18.0-rc6+ #1
> [    7.275613] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> [    7.275616] Call Trace:
> [    7.275621]  <IRQ>
> [    7.275630]  dump_stack+0x71/0xab
> [    7.275640]  print_address_description+0x6a/0x270
> [    7.275648]  kasan_report+0x258/0x380
> [    7.275653]  ? virtnet_poll+0xaa1/0xd00
> [    7.275661]  virtnet_poll+0xaa1/0xd00
> [    7.275680]  ? receive_buf+0x5920/0x5920
> [    7.275689]  ? do_raw_spin_unlock+0x54/0x220
> [    7.275699]  ? find_held_lock+0x32/0x1c0
> [    7.275710]  ? rcu_process_callbacks+0xa60/0xd20
> [    7.275736]  net_rx_action+0x2ee/0xad0
> [    7.275748]  ? rcu_note_context_switch+0x320/0x320
> [    7.275754]  ? napi_complete_done+0x300/0x300
> [    7.275763]  ? native_apic_msr_write+0x27/0x30
> [    7.275768]  ? lapic_next_event+0x5b/0x90
> [    7.275775]  ? clockevents_program_event+0x21d/0x2f0
> [    7.275791]  __do_softirq+0x19a/0x623
> [    7.275807]  do_softirq_own_stack+0x2a/0x40
> [    7.275811]  </IRQ>
> [    7.275818]  do_softirq.part.18+0x6a/0x80
> [    7.275825]  __local_bh_enable_ip+0x49/0x50
> [    7.275829]  virtnet_open+0x129/0x440
> [    7.275841]  __dev_open+0x189/0x2c0
> [    7.275848]  ? dev_set_rx_mode+0x30/0x30
> [    7.275857]  ? do_raw_spin_unlock+0x54/0x220
> [    7.275866]  __dev_change_flags+0x3a9/0x4f0
> [    7.275873]  ? dev_set_allmulti+0x10/0x10
> [    7.275889]  dev_change_flags+0x7a/0x150
> [    7.275900]  do_setlink+0x9fe/0x2e40
> [    7.275910]  ? deref_stack_reg+0xad/0xe0
> [    7.275917]  ? __read_once_size_nocheck.constprop.6+0x10/0x10
> [    7.275922]  ? find_held_lock+0x32/0x1c0
> [    7.275929]  ? rtnetlink_put_metrics+0x460/0x460
> [    7.275935]  ? virtqueue_add_sgs+0x9e2/0xde0
> [    7.275953]  ? virtscsi_add_cmd+0x454/0x780
> [    7.275964]  ? find_held_lock+0x32/0x1c0
> [    7.275973]  ? deref_stack_reg+0xad/0xe0
> [    7.275979]  ? __read_once_size_nocheck.constprop.6+0x10/0x10
> [    7.275985]  ? lock_downgrade+0x5e0/0x5e0
> [    7.275993]  ? memset+0x1f/0x40
> [    7.276008]  ? nla_parse+0x33/0x290
> [    7.276016]  rtnl_newlink+0x954/0x1120
> [    7.276030]  ? rtnl_link_unregister+0x250/0x250
> [    7.276044]  ? is_bpf_text_address+0x5/0x60
> [    7.276054]  ? lock_downgrade+0x5e0/0x5e0
> [    7.276057]  ? lock_acquire+0x10b/0x2a0
> [    7.276072]  ? deref_stack_reg+0xad/0xe0
> [    7.276078]  ? __read_once_size_nocheck.constprop.6+0x10/0x10
> [    7.276085]  ? __kernel_text_address+0xe/0x30
> [    7.276090]  ? unwind_get_return_address+0x5f/0xa0
> [    7.276103]  ? find_held_lock+0x32/0x1c0
> [    7.276110]  ? is_bpf_text_address+0x5/0x60
> [    7.276124]  ? deref_stack_reg+0xad/0xe0
> [    7.276131]  ? __read_once_size_nocheck.constprop.6+0x10/0x10
> [    7.276136]  ? depot_save_stack+0x2d9/0x460
> [    7.276142]  ? deref_stack_reg+0xad/0xe0
> [    7.276156]  ? find_held_lock+0x32/0x1c0
> [    7.276164]  ? is_bpf_text_address+0x5/0x60
> [    7.276170]  ? __lock_acquire.isra.29+0xe8/0x1bb0
> [    7.276212]  ? rtnetlink_rcv_msg+0x5d6/0x930
> [    7.276222]  ? lock_downgrade+0x5e0/0x5e0
> [    7.276226]  ? lock_acquire+0x10b/0x2a0
> [    7.276240]  rtnetlink_rcv_msg+0x69e/0x930
> [    7.276249]  ? rtnl_calcit.isra.31+0x2f0/0x2f0
> [    7.276255]  ? __lock_acquire.isra.29+0xe8/0x1bb0
> [    7.276265]  ? netlink_deliver_tap+0x8d/0x8e0
> [    7.276276]  netlink_rcv_skb+0x127/0x350
> [    7.276281]  ? rtnl_calcit.isra.31+0x2f0/0x2f0
> [    7.276289]  ? netlink_ack+0x970/0x970
> [    7.276299]  ? __alloc_skb+0xc2/0x520
> [    7.276311]  netlink_unicast+0x40f/0x5d0
> [    7.276320]  ? netlink_attachskb+0x580/0x580
> [    7.276325]  ? _copy_from_iter_full+0x157/0x6f0
> [    7.276331]  ? import_iovec+0x90/0x390
> [    7.276338]  ? get_page_from_freelist+0x1e89/0x3e30
> [    7.276347]  ? apparmor_socket_sock_rcv_skb+0x10/0x10
> [    7.276357]  netlink_sendmsg+0x65e/0xb00
> [    7.276367]  ? netlink_unicast+0x5d0/0x5d0
> [    7.276373]  ? copy_msghdr_from_user+0x206/0x340
> [    7.276388]  ? netlink_unicast+0x5d0/0x5d0
> [    7.276394]  sock_sendmsg+0xb3/0xf0
> [    7.276401]  ___sys_sendmsg+0x604/0x8b0
> [    7.276410]  ? copy_msghdr_from_user+0x340/0x340
> [    7.276416]  ? find_held_lock+0x32/0x1c0
> [    7.276424]  ? __handle_mm_fault+0xc85/0x3140
> [    7.276433]  ? lock_downgrade+0x5e0/0x5e0
> [    7.276439]  ? mem_cgroup_commit_charge+0xb4/0xf80
> [    7.276453]  ? _raw_spin_unlock+0x24/0x30
> [    7.276458]  ? __handle_mm_fault+0xc85/0x3140
> [    7.276467]  ? __pmd_alloc+0x430/0x430
> [    7.276473]  ? find_held_lock+0x32/0x1c0
> [    7.276485]  ? __fget_light+0x55/0x1f0
> [    7.276497]  ? __sys_sendmsg+0xd2/0x170
> [    7.276502]  __sys_sendmsg+0xd2/0x170
> [    7.276508]  ? __ia32_sys_shutdown+0x70/0x70
> [    7.276516]  ? handle_mm_fault+0x1f9/0x6a0
> [    7.276528]  ? up_read+0x1c/0x110
> [    7.276534]  ? __do_page_fault+0x4a6/0xa80
> [    7.276554]  do_syscall_64+0xa0/0x280
> [    7.276558]  ? prepare_exit_to_usermode+0x88/0x130
> [    7.276566]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [    7.276572] RIP: 0033:0x7ffbe9a2f160
> [    7.276574] Code: c3 48 8b 05 2a 2d 2c 00 f7 db 64 89 18 48 83 cb ff eb dd 0f 1f 80 00 00 00 00 83 3d 1d 8f 2c 00 00 75 10 b8 2e 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 ee cb 00 00 48 89 04 24 
> [    7.276728] RSP: 002b:00007ffc5a2d6108 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
> [    7.276735] RAX: ffffffffffffffda RBX: 00007ffc5a2da220 RCX: 00007ffbe9a2f160
> [    7.276738] RDX: 0000000000000000 RSI: 00007ffc5a2d6150 RDI: 0000000000000003
> [    7.276741] RBP: 00007ffc5a2d6150 R08: 0000000000000000 R09: 0000000000000003
> [    7.276744] R10: 00007ffc5a2d5ed0 R11: 0000000000000246 R12: 000000005b6177de
> [    7.276748] R13: 0000000000000000 R14: 00000000006473a0 R15: 00007ffc5a2da918
> [    7.276763] 
> [    7.276895] Allocated by task 1:
> [    7.277026]  kasan_kmalloc+0xa0/0xd0
> [    7.277030]  __kmalloc+0x13a/0x250
> [    7.277034]  init_vqs+0xd0/0x11c0
> [    7.277038]  virtnet_probe+0xb99/0x1ad0
> [    7.277045]  virtio_dev_probe+0x3fc/0x890
> [    7.277052]  driver_probe_device+0x6c4/0xcc0
> [    7.277056]  __driver_attach+0x232/0x2c0
> [    7.277060]  bus_for_each_dev+0x118/0x1a0
> [    7.277064]  bus_add_driver+0x390/0x6e0
> [    7.277068]  driver_register+0x18e/0x400
> [    7.277076]  virtio_net_driver_init+0x6d/0x90
> [    7.277080]  do_one_initcall+0xa8/0x348
> [    7.277085]  kernel_init_freeable+0x42d/0x4c8
> [    7.277090]  kernel_init+0xf/0x130
> [    7.277095]  ret_from_fork+0x35/0x40
> [    7.277097] 
> [    7.277223] Freed by task 0:
> [    7.277347] (stack is not available)
> [    7.277473] 
> [    7.277596] The buggy address belongs to the object at ffff8801d4449100
> [    7.277596]  which belongs to the cache kmalloc-4096 of size 4096
> [    7.277769] The buggy address is located 3840 bytes inside of
> [    7.277769]  4096-byte region [ffff8801d4449100, ffff8801d444a100)
> [    7.277932] The buggy address belongs to the page:
> [    7.278066] page:ffffea0007511200 count:1 mapcount:0 mapping:ffff8801db002600 index:0x0 compound_mapcount: 0
> [    7.278230] flags: 0x17fff8000008100(slab|head)
> [    7.278363] raw: 017fff8000008100 dead000000000100 dead000000000200 ffff8801db002600
> [    7.278516] raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
> [    7.278664] page dumped because: kasan: bad access detected
> [    7.278790] 
> [    7.278904] Memory state around the buggy address:
> [    7.279031]  ffff8801d4449f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> [    7.279175]  ffff8801d4449f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> [    7.279323] >ffff8801d444a000: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> [    7.279468]                    ^
> [    7.279584]  ffff8801d444a080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> [    7.279729]  ffff8801d444a100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> [    7.279870] ==================================================================
> [    7.280011] Disabling lock debugging due to kernel taint
> [    7.632219] random: mktemp: uninitialized urandom read (6 bytes read)
> [    8.052850] random: mktemp: uninitialized urandom read (6 bytes read)
> [    8.335209] random: cloud-init: uninitialized urandom read (32 bytes read)
> [    8.796331] random: cloud-init: uninitialized urandom read (32 bytes read)
> [    9.162551] random: mktemp: uninitialized urandom read (12 bytes read)
> [    9.384504] random: ssh-keygen: uninitialized urandom read (32 bytes read)
> [    9.839586] init: failsafe main process (724) killed by TERM signal
> [   14.865443] postgres (1245): /proc/1245/oom_adj is deprecated, please use /proc/1245/oom_score_adj instead.
> [   17.213418] random: crng init done
> [   17.580892] init: plymouth-upstart-bridge main process ended, respawning

I suspect an off by one somewhere. I'm looking at the patch
and I don't see it but these things are hard to spot sometimes ...

> > 
> > > >> But this may not a
> > > >> clean solution. It'd help if I can get suggestions on what would be a
> > > >> clean option to fix this without extensively changing the code in
> > > >> virtio_net. Is it mandatory to protect the affinitization with
> > > >> read_lock? I don't see similar lock in other drivers while setting the
> > > >> affinity. I understand this warning should go away, but isn't it safe to
> > > >> have multiple readers.
> > > >>
> > > >>> On Fri, Jun 29, 2018 at 09:27:07PM -0700, Amritha Nambiar wrote:

^ permalink raw reply

* Re: [PATCH v3 2/2] virtio_balloon: replace oom notifier with shrinker
From: Michael S. Tsirkin @ 2018-08-03 19:15 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, penguin-kernel, linux-kernel, mhocko, linux-mm, akpm,
	virtualization
In-Reply-To: <1533285146-25212-3-git-send-email-wei.w.wang@intel.com>

On Fri, Aug 03, 2018 at 04:32:26PM +0800, Wei Wang wrote:
> The OOM notifier is getting deprecated to use for the reasons:
> - As a callout from the oom context, it is too subtle and easy to
>   generate bugs and corner cases which are hard to track;
> - It is called too late (after the reclaiming has been performed).
>   Drivers with large amuont of reclaimable memory is expected to
>   release them at an early stage of memory pressure;
> - The notifier callback isn't aware of oom contrains;
> Link: https://lkml.org/lkml/2018/7/12/314
> 
> This patch replaces the virtio-balloon oom notifier with a shrinker
> to release balloon pages on memory pressure. The balloon pages are
> given back to mm adaptively by returning the number of pages that the
> reclaimer is asking for (i.e. sc->nr_to_scan).
> 
> Currently the max possible value of sc->nr_to_scan passed to the balloon
> shrinker is SHRINK_BATCH, which is 128. This is smaller than the
> limitation that only VIRTIO_BALLOON_ARRAY_PFNS_MAX (256) pages can be
> returned via one invocation of leak_balloon. But this patch still
> considers the case that SHRINK_BATCH or shrinker->batch could be changed
> to a value larger than VIRTIO_BALLOON_ARRAY_PFNS_MAX, which will need to
> do multiple invocations of leak_balloon.
> 
> Historically, the feature VIRTIO_BALLOON_F_DEFLATE_ON_OOM has been used
> to release balloon pages on OOM. We continue to use this feature bit for
> the shrinker, so the shrinker is only registered when this feature bit
> has been negotiated with host.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>


Could you add data at how was this tested and how did guest
behaviour change. Which configurations see an improvement?

> ---
>  drivers/virtio/virtio_balloon.c | 111 ++++++++++++++++++++++------------------
>  1 file changed, 60 insertions(+), 51 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 8100e77..612a359 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -27,7 +27,6 @@
>  #include <linux/slab.h>
>  #include <linux/module.h>
>  #include <linux/balloon_compaction.h>
> -#include <linux/oom.h>
>  #include <linux/wait.h>
>  #include <linux/mm.h>
>  #include <linux/mount.h>
> @@ -40,13 +39,8 @@
>   */
>  #define VIRTIO_BALLOON_PAGES_PER_PAGE (unsigned)(PAGE_SIZE >> VIRTIO_BALLOON_PFN_SHIFT)
>  #define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256
> -#define OOM_VBALLOON_DEFAULT_PAGES 256
>  #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
>  
> -static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> -module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> -MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> -
>  #ifdef CONFIG_BALLOON_COMPACTION
>  static struct vfsmount *balloon_mnt;
>  #endif
> @@ -86,8 +80,8 @@ struct virtio_balloon {
>  	/* Memory statistics */
>  	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
>  
> -	/* To register callback in oom notifier call chain */
> -	struct notifier_block nb;
> +	/* To register a shrinker to shrink memory upon memory pressure */
> +	struct shrinker shrinker;
>  };
>  
>  static struct virtio_device_id id_table[] = {
> @@ -365,38 +359,6 @@ static void update_balloon_size(struct virtio_balloon *vb)
>  		      &actual);
>  }
>  
> -/*
> - * virtballoon_oom_notify - release pages when system is under severe
> - *			    memory pressure (called from out_of_memory())
> - * @self : notifier block struct
> - * @dummy: not used
> - * @parm : returned - number of freed pages
> - *
> - * The balancing of memory by use of the virtio balloon should not cause
> - * the termination of processes while there are pages in the balloon.
> - * If virtio balloon manages to release some memory, it will make the
> - * system return and retry the allocation that forced the OOM killer
> - * to run.
> - */
> -static int virtballoon_oom_notify(struct notifier_block *self,
> -				  unsigned long dummy, void *parm)
> -{
> -	struct virtio_balloon *vb;
> -	unsigned long *freed;
> -	unsigned num_freed_pages;
> -
> -	vb = container_of(self, struct virtio_balloon, nb);
> -	if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> -		return NOTIFY_OK;
> -
> -	freed = parm;
> -	num_freed_pages = leak_balloon(vb, oom_pages);
> -	update_balloon_size(vb);
> -	*freed += num_freed_pages;
> -
> -	return NOTIFY_OK;
> -}
> -
>  static void update_balloon_stats_func(struct work_struct *work)
>  {
>  	struct virtio_balloon *vb;
> @@ -550,6 +512,53 @@ static struct file_system_type balloon_fs = {
>  
>  #endif /* CONFIG_BALLOON_COMPACTION */
>  
> +static unsigned long virtio_balloon_shrinker_scan(struct shrinker *shrinker,
> +						  struct shrink_control *sc)
> +{
> +	unsigned long pages_to_free, pages_freed = 0;
> +	struct virtio_balloon *vb = container_of(shrinker,
> +					struct virtio_balloon, shrinker);
> +
> +	pages_to_free = sc->nr_to_scan * VIRTIO_BALLOON_PAGES_PER_PAGE;
> +
> +	/*
> +	 * One invocation of leak_balloon can deflate at most
> +	 * VIRTIO_BALLOON_ARRAY_PFNS_MAX balloon pages, so we call it
> +	 * multiple times to deflate pages till reaching pages_to_free.
> +	 */
> +	while (vb->num_pages && pages_to_free) {
> +		pages_to_free -= pages_freed;
> +		pages_freed += leak_balloon(vb, pages_to_free);
> +	}
> +	update_balloon_size(vb);
> +
> +	return pages_freed / VIRTIO_BALLOON_PAGES_PER_PAGE;
> +}
> +
> +static unsigned long virtio_balloon_shrinker_count(struct shrinker *shrinker,
> +						   struct shrink_control *sc)
> +{
> +	struct virtio_balloon *vb = container_of(shrinker,
> +					struct virtio_balloon, shrinker);
> +
> +	return vb->num_pages / VIRTIO_BALLOON_PAGES_PER_PAGE;
> +}
> +
> +static void virtio_balloon_unregister_shrinker(struct virtio_balloon *vb)
> +{
> +	unregister_shrinker(&vb->shrinker);
> +}
> +
> +static int virtio_balloon_register_shrinker(struct virtio_balloon *vb)
> +{
> +	vb->shrinker.scan_objects = virtio_balloon_shrinker_scan;
> +	vb->shrinker.count_objects = virtio_balloon_shrinker_count;
> +	vb->shrinker.batch = 0;
> +	vb->shrinker.seeks = DEFAULT_SEEKS;
> +
> +	return register_shrinker(&vb->shrinker);
> +}
> +
>  static int virtballoon_probe(struct virtio_device *vdev)
>  {
>  	struct virtio_balloon *vb;
> @@ -582,17 +591,10 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	if (err)
>  		goto out_free_vb;
>  
> -	vb->nb.notifier_call = virtballoon_oom_notify;
> -	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
> -	err = register_oom_notifier(&vb->nb);
> -	if (err < 0)
> -		goto out_del_vqs;
> -
>  #ifdef CONFIG_BALLOON_COMPACTION
>  	balloon_mnt = kern_mount(&balloon_fs);
>  	if (IS_ERR(balloon_mnt)) {
>  		err = PTR_ERR(balloon_mnt);
> -		unregister_oom_notifier(&vb->nb);
>  		goto out_del_vqs;
>  	}
>  
> @@ -601,13 +603,20 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	if (IS_ERR(vb->vb_dev_info.inode)) {
>  		err = PTR_ERR(vb->vb_dev_info.inode);
>  		kern_unmount(balloon_mnt);
> -		unregister_oom_notifier(&vb->nb);
>  		vb->vb_dev_info.inode = NULL;
>  		goto out_del_vqs;
>  	}
>  	vb->vb_dev_info.inode->i_mapping->a_ops = &balloon_aops;
>  #endif
> -
> +	/*
> +	 * We continue to use VIRTIO_BALLOON_F_DEFLATE_ON_OOM to decide if a
> +	 * shrinker needs to be registered to relieve memory pressure.
> +	 */
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM)) {
> +		err = virtio_balloon_register_shrinker(vb);
> +		if (err)
> +			goto out_del_vqs;
> +	}
>  	virtio_device_ready(vdev);
>  
>  	if (towards_target(vb))
> @@ -639,8 +648,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
>  {
>  	struct virtio_balloon *vb = vdev->priv;
>  
> -	unregister_oom_notifier(&vb->nb);
> -
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> +		virtio_balloon_unregister_shrinker(vb);
>  	spin_lock_irq(&vb->stop_update_lock);
>  	vb->stop_update = true;
>  	spin_unlock_irq(&vb->stop_update_lock);
> -- 
> 2.7.4

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Michael S. Tsirkin @ 2018-08-03 19:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: robh, srikar, Benjamin Herrenschmidt, Will Deacon, linux-kernel,
	linuxram, virtualization, paulus, marc.zyngier, mpe, joe,
	robin.murphy, david, linuxppc-dev, elfring, haren,
	Anshuman Khandual
In-Reply-To: <20180803070507.GA1344@infradead.org>

On Fri, Aug 03, 2018 at 12:05:07AM -0700, Christoph Hellwig wrote:
> On Thu, Aug 02, 2018 at 04:13:09PM -0500, Benjamin Herrenschmidt wrote:
> > So let's differenciate the two problems of having an IOMMU (real or
> > emulated) which indeeds adds overhead etc... and using the DMA API.
> > 
> > At the moment, virtio does this all over the place:
> > 
> > 	if (use_dma_api)
> > 		dma_map/alloc_something(...)
> > 	else
> > 		use_pa
> > 
> > The idea of the patch set is to do two, somewhat orthogonal, changes
> > that together achieve what we want. Let me know where you think there
> > is "a bunch of issues" because I'm missing it:
> > 
> >  1- Replace the above if/else constructs with just calling the DMA API,
> > and have virtio, at initialization, hookup its own dma_ops that just
> > "return pa" (roughly) when the IOMMU stuff isn't used.
> > 
> > This adds an indirect function call to the path that previously didn't
> > have one (the else case above). Is that a significant/measurable
> > overhead ?
> 
> If you call it often enough it does:
> 
> https://www.spinics.net/lists/netdev/msg495413.html
> 
> >  2- Make virtio use the DMA API with our custom platform-provided
> > swiotlb callbacks when needed, that is when not using IOMMU *and*
> > running on a secure VM in our case.
> 
> And total NAK the customer platform-provided part of this.  We need
> a flag passed in from the hypervisor that the device needs all bus
> specific dma api treatment, and then just use the normal plaform
> dma mapping setup.  To get swiotlb you'll need to then use the DT/ACPI
> dma-range property to limit the addressable range, and a swiotlb
> capable plaform will use swiotlb automatically.

It seems reasonable to teach a platform to override dma-range
for a specific device e.g. in case it knows about bugs in ACPI.

-- 
MST

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Benjamin Herrenschmidt @ 2018-08-04  1:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: robh, srikar, mpe, Will Deacon, linux-kernel, linuxram,
	virtualization, Christoph Hellwig, paulus, marc.zyngier, joe,
	robin.murphy, david, linuxppc-dev, elfring, haren,
	Anshuman Khandual
In-Reply-To: <20180803220443-mutt-send-email-mst@kernel.org>

On Fri, 2018-08-03 at 22:07 +0300, Michael S. Tsirkin wrote:
> On Fri, Aug 03, 2018 at 10:58:36AM -0500, Benjamin Herrenschmidt wrote:
> > On Fri, 2018-08-03 at 00:05 -0700, Christoph Hellwig wrote:
> > > >   2- Make virtio use the DMA API with our custom platform-provided
> > > > swiotlb callbacks when needed, that is when not using IOMMU *and*
> > > > running on a secure VM in our case.
> > > 
> > > And total NAK the customer platform-provided part of this.  We need
> > > a flag passed in from the hypervisor that the device needs all bus
> > > specific dma api treatment, and then just use the normal plaform
> > > dma mapping setup. 
> > 
> > Christoph, as I have explained already, we do NOT have a way to provide
> > such a flag as neither the hypervisor nor qemu knows anything about
> > this when the VM is created.
> 
> I think the fact you can't add flags from the hypervisor is
> a sign of a problematic architecture, you should look at
> adding that down the road - you will likely need it at some point.

Well, we can later in the boot process. At VM creation time, it's just
a normal VM. The VM firmware, bootloader etc... are just operating
normally etc...

Later on, (we may have even already run Linux at that point,
unsecurely, as we can use Linux as a bootloader under some
circumstances), we start a "secure image".

This is a kernel zImage that includes a "ticket" that has the
appropriate signature etc... so that when that kernel starts, it can
authenticate with the ultravisor, be verified (along with its ramdisk)
etc... and copied (by the UV) into secure memory & run from there.

At that point, the hypervisor is informed that the VM has become
secure.

So at that point, we could exit to qemu to inform it of the change, and
have it walk the qtree and "Switch" all the virtio devices to use the
IOMMU I suppose, but it feels a lot grosser to me.

That's the only other option I can think of.

> However in this specific case, the flag does not need to come from the
> hypervisor, it can be set by arch boot code I think.
> Christoph do you see a problem with that?

The above could do that yes. Another approach would be to do it from a
small virtio "quirk" that pokes a bit in the device to force it to
iommu mode when it detects that we are running in a secure VM. That's a
bit warty on the virito side but probably not as much as having a qemu
one that walks of the virtio devices to change how they behave.

What do you reckon ?

Cheers,
Ben.

> > >  To get swiotlb you'll need to then use the DT/ACPI
> > > dma-range property to limit the addressable range, and a swiotlb
> > > capable plaform will use swiotlb automatically.
> > 
> > This cannot be done as you describe it.
> > 
> > The VM is created as a *normal* VM. The DT stuff is generated by qemu
> > at a point where it has *no idea* that the VM will later become secure
> > and thus will have to restrict which pages can be used for "DMA".
> > 
> > The VM will *at runtime* turn itself into a secure VM via interactions
> > with the security HW and the Ultravisor layer (which sits below the
> > HV). This happens way after the DT has been created and consumed, the
> > qemu devices instanciated etc...
> > 
> > Only the guest kernel knows because it initates the transition. When
> > that happens, the virtio devices have already been used by the guest
> > firmware, bootloader, possibly another kernel that kexeced the "secure"
> > one, etc... 
> > 
> > So instead of running around saying NAK NAK NAK, please explain how we
> > can solve that differently.
> > 
> > Ben.
> > 

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Benjamin Herrenschmidt @ 2018-08-04  1:16 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: robh, srikar, mpe, Will Deacon, linux-kernel, linuxram,
	virtualization, Christoph Hellwig, paulus, marc.zyngier, joe,
	robin.murphy, david, linuxppc-dev, elfring, haren,
	Anshuman Khandual
In-Reply-To: <20180803220443-mutt-send-email-mst@kernel.org>

On Fri, 2018-08-03 at 22:07 +0300, Michael S. Tsirkin wrote:
> On Fri, Aug 03, 2018 at 10:58:36AM -0500, Benjamin Herrenschmidt wrote:
> > On Fri, 2018-08-03 at 00:05 -0700, Christoph Hellwig wrote:
> > > >   2- Make virtio use the DMA API with our custom platform-provided
> > > > swiotlb callbacks when needed, that is when not using IOMMU *and*
> > > > running on a secure VM in our case.
> > > 
> > > And total NAK the customer platform-provided part of this.  We need
> > > a flag passed in from the hypervisor that the device needs all bus
> > > specific dma api treatment, and then just use the normal plaform
> > > dma mapping setup. 
> > 
> > Christoph, as I have explained already, we do NOT have a way to provide
> > such a flag as neither the hypervisor nor qemu knows anything about
> > this when the VM is created.
> 
> I think the fact you can't add flags from the hypervisor is
> a sign of a problematic architecture, you should look at
> adding that down the road - you will likely need it at some point.

Well, we can later in the boot process. At VM creation time, it's just
a normal VM. The VM firmware, bootloader etc... are just operating
normally etc...

Later on, (we may have even already run Linux at that point,
unsecurely, as we can use Linux as a bootloader under some
circumstances), we start a "secure image".

This is a kernel zImage that includes a "ticket" that has the
appropriate signature etc... so that when that kernel starts, it can
authenticate with the ultravisor, be verified (along with its ramdisk)
etc... and copied (by the UV) into secure memory & run from there.

At that point, the hypervisor is informed that the VM has become
secure.

So at that point, we could exit to qemu to inform it of the change, and
have it walk the qtree and "Switch" all the virtio devices to use the
IOMMU I suppose, but it feels a lot grosser to me.

That's the only other option I can think of.

> However in this specific case, the flag does not need to come from the
> hypervisor, it can be set by arch boot code I think.
> Christoph do you see a problem with that?

The above could do that yes. Another approach would be to do it from a
small virtio "quirk" that pokes a bit in the device to force it to
iommu mode when it detects that we are running in a secure VM. That's a
bit warty on the virito side but probably not as much as having a qemu
one that walks of the virtio devices to change how they behave.

What do you reckon ?

What we want to avoid is to expose any of this to the *end user* or
libvirt or any other higher level of the management stack. We really
want that stuff to remain contained between the VM itself, KVM and
maybe qemu.

We will need some other qemu changes for migration so that's ok. But
the minute you start touching libvirt and the higher levels it becomes
a nightmare.

Cheers,
Ben.

> > >  To get swiotlb you'll need to then use the DT/ACPI
> > > dma-range property to limit the addressable range, and a swiotlb
> > > capable plaform will use swiotlb automatically.
> > 
> > This cannot be done as you describe it.
> > 
> > The VM is created as a *normal* VM. The DT stuff is generated by qemu
> > at a point where it has *no idea* that the VM will later become secure
> > and thus will have to restrict which pages can be used for "DMA".
> > 
> > The VM will *at runtime* turn itself into a secure VM via interactions
> > with the security HW and the Ultravisor layer (which sits below the
> > HV). This happens way after the DT has been created and consumed, the
> > qemu devices instanciated etc...
> > 
> > Only the guest kernel knows because it initates the transition. When
> > that happens, the virtio devices have already been used by the guest
> > firmware, bootloader, possibly another kernel that kexeced the "secure"
> > one, etc... 
> > 
> > So instead of running around saying NAK NAK NAK, please explain how we
> > can solve that differently.
> > 
> > Ben.
> > 

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Benjamin Herrenschmidt @ 2018-08-04  1:18 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: robh, srikar, mpe, Will Deacon, linux-kernel, linuxram,
	virtualization, Christoph Hellwig, paulus, marc.zyngier, joe,
	robin.murphy, david, linuxppc-dev, elfring, haren,
	Anshuman Khandual
In-Reply-To: <20180803220443-mutt-send-email-mst@kernel.org>

On Fri, 2018-08-03 at 22:07 +0300, Michael S. Tsirkin wrote:
> On Fri, Aug 03, 2018 at 10:58:36AM -0500, Benjamin Herrenschmidt wrote:
> > On Fri, 2018-08-03 at 00:05 -0700, Christoph Hellwig wrote:
> > > >   2- Make virtio use the DMA API with our custom platform-provided
> > > > swiotlb callbacks when needed, that is when not using IOMMU *and*
> > > > running on a secure VM in our case.
> > > 
> > > And total NAK the customer platform-provided part of this.  We need
> > > a flag passed in from the hypervisor that the device needs all bus
> > > specific dma api treatment, and then just use the normal plaform
> > > dma mapping setup. 
> > 
> > Christoph, as I have explained already, we do NOT have a way to provide
> > such a flag as neither the hypervisor nor qemu knows anything about
> > this when the VM is created.
> 
> I think the fact you can't add flags from the hypervisor is
> a sign of a problematic architecture, you should look at
> adding that down the road - you will likely need it at some point.

Well, we can later in the boot process. At VM creation time, it's just
a normal VM. The VM firmware, bootloader etc... are just operating
normally etc...

Later on, (we may have even already run Linux at that point,
unsecurely, as we can use Linux as a bootloader under some
circumstances), we start a "secure image".

This is a kernel zImage that includes a "ticket" that has the
appropriate signature etc... so that when that kernel starts, it can
authenticate with the ultravisor, be verified (along with its ramdisk)
etc... and copied (by the UV) into secure memory & run from there.

At that point, the hypervisor is informed that the VM has become
secure.

So at that point, we could exit to qemu to inform it of the change, and
have it walk the qtree and "Switch" all the virtio devices to use the
IOMMU I suppose, but it feels a lot grosser to me.

That's the only other option I can think of.

> However in this specific case, the flag does not need to come from the
> hypervisor, it can be set by arch boot code I think.
> Christoph do you see a problem with that?

The above could do that yes. Another approach would be to do it from a
small virtio "quirk" that pokes a bit in the device to force it to
iommu mode when it detects that we are running in a secure VM. That's a
bit warty on the virito side but probably not as much as having a qemu
one that walks of the virtio devices to change how they behave.

What do you reckon ?

What we want to avoid is to expose any of this to the *end user* or
libvirt or any other higher level of the management stack. We really
want that stuff to remain contained between the VM itself, KVM and
maybe qemu.

We will need some other qemu changes for migration so that's ok. But
the minute you start touching libvirt and the higher levels it becomes
a nightmare.

Cheers,
Ben.

> > >  To get swiotlb you'll need to then use the DT/ACPI
> > > dma-range property to limit the addressable range, and a swiotlb
> > > capable plaform will use swiotlb automatically.
> > 
> > This cannot be done as you describe it.
> > 
> > The VM is created as a *normal* VM. The DT stuff is generated by qemu
> > at a point where it has *no idea* that the VM will later become secure
> > and thus will have to restrict which pages can be used for "DMA".
> > 
> > The VM will *at runtime* turn itself into a secure VM via interactions
> > with the security HW and the Ultravisor layer (which sits below the
> > HV). This happens way after the DT has been created and consumed, the
> > qemu devices instanciated etc...
> > 
> > Only the guest kernel knows because it initates the transition. When
> > that happens, the virtio devices have already been used by the guest
> > firmware, bootloader, possibly another kernel that kexeced the "secure"
> > one, etc... 
> > 
> > So instead of running around saying NAK NAK NAK, please explain how we
> > can solve that differently.
> > 
> > Ben.
> > 

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Benjamin Herrenschmidt @ 2018-08-04  1:21 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Wang
  Cc: robh, srikar, mpe, linuxram, linux-kernel, virtualization, hch,
	paulus, joe, david, linuxppc-dev, elfring, haren,
	Anshuman Khandual
In-Reply-To: <20180803220812-mutt-send-email-mst@kernel.org>

On Fri, 2018-08-03 at 22:08 +0300, Michael S. Tsirkin wrote:
> > > > Please go through these patches and review whether this approach broadly
> > > > makes sense. I will appreciate suggestions, inputs, comments regarding
> > > > the patches or the approach in general. Thank you.
> > > 
> > > Jason did some work on profiling this. Unfortunately he reports
> > > about 4% extra overhead from this switch on x86 with no vIOMMU.
> > 
> > The test is rather simple, just run pktgen (pktgen_sample01_simple.sh) in
> > guest and measure PPS on tap on host.
> > 
> > Thanks
> 
> Could you supply host configuration involved please?

I wonder how much of that could be caused by Spectre mitigations
blowing up indirect function calls...

Cheers,
Ben.

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Benjamin Herrenschmidt @ 2018-08-04  1:22 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: robh, srikar, mpe, Will Deacon, linux-kernel, linuxram,
	virtualization, Christoph Hellwig, paulus, marc.zyngier, joe,
	robin.murphy, david, linuxppc-dev, elfring, haren,
	Anshuman Khandual
In-Reply-To: <20180803220443-mutt-send-email-mst@kernel.org>

On Fri, 2018-08-03 at 22:07 +0300, Michael S. Tsirkin wrote:
> On Fri, Aug 03, 2018 at 10:58:36AM -0500, Benjamin Herrenschmidt wrote:
> > On Fri, 2018-08-03 at 00:05 -0700, Christoph Hellwig wrote:
> > > >   2- Make virtio use the DMA API with our custom platform-provided
> > > > swiotlb callbacks when needed, that is when not using IOMMU *and*
> > > > running on a secure VM in our case.
> > > 
> > > And total NAK the customer platform-provided part of this.  We need
> > > a flag passed in from the hypervisor that the device needs all bus
> > > specific dma api treatment, and then just use the normal plaform
> > > dma mapping setup. 
> > 
> > Christoph, as I have explained already, we do NOT have a way to provide
> > such a flag as neither the hypervisor nor qemu knows anything about
> > this when the VM is created.
> 
> I think the fact you can't add flags from the hypervisor is
> a sign of a problematic architecture, you should look at
> adding that down the road - you will likely need it at some point.

(Appologies if you got this twice, my mailer had a brain fart and I don't
know if the first one got through & am about to disappear in a plane for 17h)

Well, we can later in the boot process. At VM creation time, it's just
a normal VM. The VM firmware, bootloader etc... are just operating
normally etc...

Later on, (we may have even already run Linux at that point,
unsecurely, as we can use Linux as a bootloader under some
circumstances), we start a "secure image".

This is a kernel zImage that includes a "ticket" that has the
appropriate signature etc... so that when that kernel starts, it can
authenticate with the ultravisor, be verified (along with its ramdisk)
etc... and copied (by the UV) into secure memory & run from there.

At that point, the hypervisor is informed that the VM has become
secure.

So at that point, we could exit to qemu to inform it of the change, and
have it walk the qtree and "Switch" all the virtio devices to use the
IOMMU I suppose, but it feels a lot grosser to me.

That's the only other option I can think of.

> However in this specific case, the flag does not need to come from the
> hypervisor, it can be set by arch boot code I think.
> Christoph do you see a problem with that?

The above could do that yes. Another approach would be to do it from a
small virtio "quirk" that pokes a bit in the device to force it to
iommu mode when it detects that we are running in a secure VM. That's a
bit warty on the virito side but probably not as much as having a qemu
one that walks of the virtio devices to change how they behave.

What do you reckon ?

What we want to avoid is to expose any of this to the *end user* or
libvirt or any other higher level of the management stack. We really
want that stuff to remain contained between the VM itself, KVM and
maybe qemu.

We will need some other qemu changes for migration so that's ok. But
the minute you start touching libvirt and the higher levels it becomes
a nightmare.

Cheers,
Ben.

> > >  To get swiotlb you'll need to then use the DT/ACPI
> > > dma-range property to limit the addressable range, and a swiotlb
> > > capable plaform will use swiotlb automatically.
> > 
> > This cannot be done as you describe it.
> > 
> > The VM is created as a *normal* VM. The DT stuff is generated by qemu
> > at a point where it has *no idea* that the VM will later become secure
> > and thus will have to restrict which pages can be used for "DMA".
> > 
> > The VM will *at runtime* turn itself into a secure VM via interactions
> > with the security HW and the Ultravisor layer (which sits below the
> > HV). This happens way after the DT has been created and consumed, the
> > qemu devices instanciated etc...
> > 
> > Only the guest kernel knows because it initates the transition. When
> > that happens, the virtio devices have already been used by the guest
> > firmware, bootloader, possibly another kernel that kexeced the "secure"
> > one, etc... 
> > 
> > So instead of running around saying NAK NAK NAK, please explain how we
> > can solve that differently.
> > 
> > Ben.
> > 

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Christoph Hellwig @ 2018-08-04  8:15 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: robh, srikar, Benjamin Herrenschmidt, Will Deacon, linux-kernel,
	linuxram, virtualization, Christoph Hellwig, paulus, marc.zyngier,
	mpe, joe, robin.murphy, david, linuxppc-dev, elfring, haren,
	Anshuman Khandual
In-Reply-To: <20180803221634-mutt-send-email-mst@kernel.org>

On Fri, Aug 03, 2018 at 10:17:32PM +0300, Michael S. Tsirkin wrote:
> It seems reasonable to teach a platform to override dma-range
> for a specific device e.g. in case it knows about bugs in ACPI.

A platform will be able override dma-range using the dev->bus_dma_mask
field starting in 4.19.  But we'll still need a way how to

  a) document in the virtio spec that all bus dma quirks are to be
     applied
  b) a way to document in a virtio-related spec how the bus handles
     dma for Ben's totally fucked up hypervisor.  Without that there
     is not way we'll get interoperable implementations.

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Christoph Hellwig @ 2018-08-04  8:21 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: robh, srikar, Michael S. Tsirkin, mpe, Will Deacon, linux-kernel,
	linuxram, virtualization, Christoph Hellwig, paulus, marc.zyngier,
	joe, robin.murphy, david, linuxppc-dev, elfring, haren,
	Anshuman Khandual
In-Reply-To: <22310f58605169fe9de83abf78b59f593ff7fbb7.camel@kernel.crashing.org>

On Fri, Aug 03, 2018 at 01:58:46PM -0500, Benjamin Herrenschmidt wrote:
> You are saying something along the lines of "I don't like an
> instruction in your ISA, let's not support your entire CPU architecture
> in Linux".

No.  I'm saying if you can't describe your architecture in the virtio
spec document it is bogus.

> Our setup is not fucked. It makes a LOT of sense and it's a very
> sensible design. It's hitting a problem due to a corner case oddity in
> virtio bypassing the MMU, we've worked around such corner cases many
> times in the past without any problem, I fail to see what the problem
> is here.

No matter if you like it or not (I don't!) virtio is defined to bypass
dma translations, it is very clearly stated in the spec.  It has some
ill-defined bits to bypass it, so if you want the dma mapping API
to be used you'll have to set that bit (in its original form, a refined
form, or an entirely newly defined sane form) and make sure your
hypersivors always sets it.  It's not rocket science, just a little bit
for work to make sure your setup is actually going to work reliably
and portably.

> We aren't going to cancel years of HW and SW development for our

Maybe you should have actually read the specs you are claiming to
implemented before spending all that effort.

^ permalink raw reply

* Re: [PATCH net-next] vhost: switch to use new message format
From: David Miller @ 2018-08-04 20:21 UTC (permalink / raw)
  To: jasowang; +Cc: netdev, virtualization, linux-kernel, kvm, mst
In-Reply-To: <1533279891-12249-1-git-send-email-jasowang@redhat.com>

From: Jason Wang <jasowang@redhat.com>
Date: Fri,  3 Aug 2018 15:04:51 +0800

> So fixing this by introducing a new message type with an explicit
> 32bit reserved field after type like:
> 
> struct vhost_msg_v2 {
> 	int type;
> 	__u32 reserved;

Please use fixed sized types consistently.  Use 's32' instead of 'int'
here.

Thanks!

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Michael S. Tsirkin @ 2018-08-05  0:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: robh, srikar, Benjamin Herrenschmidt, Will Deacon, linux-kernel,
	linuxram, virtualization, paulus, marc.zyngier, mpe, joe,
	robin.murphy, david, linuxppc-dev, elfring, haren,
	Anshuman Khandual
In-Reply-To: <20180804081500.GA1455@infradead.org>

On Sat, Aug 04, 2018 at 01:15:00AM -0700, Christoph Hellwig wrote:
> On Fri, Aug 03, 2018 at 10:17:32PM +0300, Michael S. Tsirkin wrote:
> > It seems reasonable to teach a platform to override dma-range
> > for a specific device e.g. in case it knows about bugs in ACPI.
> 
> A platform will be able override dma-range using the dev->bus_dma_mask
> field starting in 4.19.  But we'll still need a way how to
> 
>   a) document in the virtio spec that all bus dma quirks are to be
>      applied

I agree it's a good idea. In particular I suspect that PLATFORM_IOMMU
should be extended to cover that. But see below.

>   b) a way to document in a virtio-related spec how the bus handles
>      dma for Ben's totally fucked up hypervisor.  Without that there
>      is not way we'll get interoperable implementations.


So in this case however I'm not sure what exactly do we want to add. It
seems that from point of view of the device, there is nothing special -
it just gets a PA and writes there.  It also seems that guest does not
need to get any info from the device either. Instead guest itself needs
device to DMA into specific addresses, for its own reasons.

It seems that the fact that within guest it's implemented using a bounce
buffer and that it's easiest to do by switching virtio to use the DMA API
isn't something virtio spec concerns itself with.

I'm open to suggestions.

-- 
MST

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Michael S. Tsirkin @ 2018-08-05  0:22 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: robh, srikar, mpe, Will Deacon, linux-kernel, linuxram,
	virtualization, Christoph Hellwig, paulus, marc.zyngier, joe,
	robin.murphy, david, linuxppc-dev, elfring, haren,
	Anshuman Khandual
In-Reply-To: <051fd78e15595b414839fa8f9d445b9f4d7576c6.camel@kernel.crashing.org>

On Fri, Aug 03, 2018 at 08:16:21PM -0500, Benjamin Herrenschmidt wrote:
> On Fri, 2018-08-03 at 22:07 +0300, Michael S. Tsirkin wrote:
> > On Fri, Aug 03, 2018 at 10:58:36AM -0500, Benjamin Herrenschmidt wrote:
> > > On Fri, 2018-08-03 at 00:05 -0700, Christoph Hellwig wrote:
> > > > >   2- Make virtio use the DMA API with our custom platform-provided
> > > > > swiotlb callbacks when needed, that is when not using IOMMU *and*
> > > > > running on a secure VM in our case.
> > > > 
> > > > And total NAK the customer platform-provided part of this.  We need
> > > > a flag passed in from the hypervisor that the device needs all bus
> > > > specific dma api treatment, and then just use the normal plaform
> > > > dma mapping setup. 
> > > 
> > > Christoph, as I have explained already, we do NOT have a way to provide
> > > such a flag as neither the hypervisor nor qemu knows anything about
> > > this when the VM is created.
> > 
> > I think the fact you can't add flags from the hypervisor is
> > a sign of a problematic architecture, you should look at
> > adding that down the road - you will likely need it at some point.
> 
> Well, we can later in the boot process. At VM creation time, it's just
> a normal VM. The VM firmware, bootloader etc... are just operating
> normally etc...

I see the allure of this, but I think down the road you will
discover passing a flag in libvirt XML saying
"please use a secure mode" or whatever is a good idea.

Even thought it is probably not required to address this
specific issue.

For example, I don't think ballooning works in secure mode,
you will be able to teach libvirt not to try to add a
balloon to the guest.

> Later on, (we may have even already run Linux at that point,
> unsecurely, as we can use Linux as a bootloader under some
> circumstances), we start a "secure image".
> 
> This is a kernel zImage that includes a "ticket" that has the
> appropriate signature etc... so that when that kernel starts, it can
> authenticate with the ultravisor, be verified (along with its ramdisk)
> etc... and copied (by the UV) into secure memory & run from there.
> 
> At that point, the hypervisor is informed that the VM has become
> secure.
> 
> So at that point, we could exit to qemu to inform it of the change,

That's probably a good idea too.

> and
> have it walk the qtree and "Switch" all the virtio devices to use the
> IOMMU I suppose, but it feels a lot grosser to me.

That part feels gross, yes.

> That's the only other option I can think of.
> 
> > However in this specific case, the flag does not need to come from the
> > hypervisor, it can be set by arch boot code I think.
> > Christoph do you see a problem with that?
> 
> The above could do that yes. Another approach would be to do it from a
> small virtio "quirk" that pokes a bit in the device to force it to
> iommu mode when it detects that we are running in a secure VM. That's a
> bit warty on the virito side but probably not as much as having a qemu
> one that walks of the virtio devices to change how they behave.
> 
> What do you reckon ?

I think you are right that for the dma limit the hypervisor doesn't seem
to need to know.

> What we want to avoid is to expose any of this to the *end user* or
> libvirt or any other higher level of the management stack. We really
> want that stuff to remain contained between the VM itself, KVM and
> maybe qemu.
>
> We will need some other qemu changes for migration so that's ok. But
> the minute you start touching libvirt and the higher levels it becomes
> a nightmare.
> 
> Cheers,
> Ben.

I don't believe you'll be able to avoid that entirely. The split between
libvirt and qemu is more about community than about code, random bits of
functionality tend to land on random sides of that fence.  Better add a
tag in domain XML early is my advice. Having said that, it's your
hypervisor. I'm just suggesting that when hypervisor does somehow need
to care then I suspect most people won't be receptive to the argument
that changing libvirt is a nightmare.

> > > >  To get swiotlb you'll need to then use the DT/ACPI
> > > > dma-range property to limit the addressable range, and a swiotlb
> > > > capable plaform will use swiotlb automatically.
> > > 
> > > This cannot be done as you describe it.
> > > 
> > > The VM is created as a *normal* VM. The DT stuff is generated by qemu
> > > at a point where it has *no idea* that the VM will later become secure
> > > and thus will have to restrict which pages can be used for "DMA".
> > > 
> > > The VM will *at runtime* turn itself into a secure VM via interactions
> > > with the security HW and the Ultravisor layer (which sits below the
> > > HV). This happens way after the DT has been created and consumed, the
> > > qemu devices instanciated etc...
> > > 
> > > Only the guest kernel knows because it initates the transition. When
> > > that happens, the virtio devices have already been used by the guest
> > > firmware, bootloader, possibly another kernel that kexeced the "secure"
> > > one, etc... 
> > > 
> > > So instead of running around saying NAK NAK NAK, please explain how we
> > > can solve that differently.
> > > 
> > > Ben.
> > > 

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Michael S. Tsirkin @ 2018-08-05  0:23 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: robh, srikar, mpe, Will Deacon, linux-kernel, linuxram,
	virtualization, Christoph Hellwig, paulus, marc.zyngier, joe,
	robin.murphy, david, linuxppc-dev, elfring, haren,
	Anshuman Khandual
In-Reply-To: <398d498048a5b288cce06003c0808bfefa1e9b1f.camel@kernel.crashing.org>

On Fri, Aug 03, 2018 at 08:22:11PM -0500, Benjamin Herrenschmidt wrote:
> (Appologies if you got this twice, my mailer had a brain fart and I don't
> know if the first one got through & am about to disappear in a plane for 17h)

I got like 3 of these. I hope that's true for everyone as I replied to
1st one.

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Michael S. Tsirkin @ 2018-08-05  0:24 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: robh, srikar, mpe, linuxram, linux-kernel, virtualization, hch,
	paulus, joe, david, linuxppc-dev, elfring, haren,
	Anshuman Khandual
In-Reply-To: <01c74680c4b3aa25d9b4375a9ab5e10046b7c71b.camel@kernel.crashing.org>

On Fri, Aug 03, 2018 at 08:21:26PM -0500, Benjamin Herrenschmidt wrote:
> On Fri, 2018-08-03 at 22:08 +0300, Michael S. Tsirkin wrote:
> > > > > Please go through these patches and review whether this approach broadly
> > > > > makes sense. I will appreciate suggestions, inputs, comments regarding
> > > > > the patches or the approach in general. Thank you.
> > > > 
> > > > Jason did some work on profiling this. Unfortunately he reports
> > > > about 4% extra overhead from this switch on x86 with no vIOMMU.
> > > 
> > > The test is rather simple, just run pktgen (pktgen_sample01_simple.sh) in
> > > guest and measure PPS on tap on host.
> > > 
> > > Thanks
> > 
> > Could you supply host configuration involved please?
> 
> I wonder how much of that could be caused by Spectre mitigations
> blowing up indirect function calls...
> 
> Cheers,
> Ben.

I won't be surprised. If yes I suggested a way to mitigate the overhead.

-- 
MSR 

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Michael S. Tsirkin @ 2018-08-05  0:27 UTC (permalink / raw)
  To: Will Deacon
  Cc: robh, srikar, Benjamin Herrenschmidt, linuxram, linux-kernel,
	virtualization, Christoph Hellwig, paulus, marc.zyngier, mpe, joe,
	robin.murphy, david, linuxppc-dev, elfring, haren,
	Anshuman Khandual
In-Reply-To: <20180801081637.GA14438@arm.com>

On Wed, Aug 01, 2018 at 09:16:38AM +0100, Will Deacon wrote:
> On Tue, Jul 31, 2018 at 03:36:22PM -0500, Benjamin Herrenschmidt wrote:
> > On Tue, 2018-07-31 at 10:30 -0700, Christoph Hellwig wrote:
> > > > However the question people raise is that DMA API is already full of
> > > > arch-specific tricks the likes of which are outlined in your post linked
> > > > above. How is this one much worse?
> > > 
> > > None of these warts is visible to the driver, they are all handled in
> > > the architecture (possibly on a per-bus basis).
> > > 
> > > So for virtio we really need to decide if it has one set of behavior
> > > as specified in the virtio spec, or if it behaves exactly as if it
> > > was on a PCI bus, or in fact probably both as you lined up.  But no
> > > magic arch specific behavior inbetween.
> > 
> > The only arch specific behaviour is needed in the case where it doesn't
> > behave like PCI. In this case, the PCI DMA ops are not suitable, but in
> > our secure VMs, we still need to make it use swiotlb in order to bounce
> > through non-secure pages.
> 
> On arm/arm64, the problem we have is that legacy virtio devices on the MMIO
> transport (so definitely not PCI) have historically been advertised by qemu
> as not being cache coherent, but because the virtio core has bypassed DMA
> ops then everything has happened to work. If we blindly enable the arch DMA
> ops, we'll plumb in the non-coherent ops and start getting data corruption,
> so we do need a way to quirk virtio as being "always coherent" if we want to
> use the DMA ops (which we do, because our emulation platforms have an IOMMU
> for all virtio devices).
> 
> Will

Right that's not very different from placing the device within the IOMMU
domain but in fact bypassing the IOMMU. I wonder whether anyone ever
needs a non coherent virtio-mmio. If yes we can extend
PLATFORM_IOMMU to cover that or add another bit.

What exactly do the non-coherent ops do that causes the corruption?

-- 
MST

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Benjamin Herrenschmidt @ 2018-08-05  0:53 UTC (permalink / raw)
  To: Christoph Hellwig, Michael S. Tsirkin
  Cc: robh, srikar, mpe, Will Deacon, linux-kernel, linuxram,
	virtualization, paulus, marc.zyngier, joe, robin.murphy, david,
	linuxppc-dev, elfring, haren, Anshuman Khandual
In-Reply-To: <20180804081500.GA1455@infradead.org>

On Sat, 2018-08-04 at 01:15 -0700, Christoph Hellwig wrote:
>   b) a way to document in a virtio-related spec how the bus handles
>      dma for Ben's totally fucked up hypervisor.  Without that there
>      is not way we'll get interoperable implementations.

Christoph, this isn't a totally fucked up hypervisor. It's not even
about the hypervisor itself, I mean seriously, man, can you at least
bother reading what I described is going on with the security
architecture ?

Anyway, Michael is onto what could possibly be an alternative approach,
by having us tell qemu to flip to iommu mode at secure VM boot time.
Let's see where that leads.

Cheers,
Ben.

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Benjamin Herrenschmidt @ 2018-08-05  1:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: robh, srikar, Michael S. Tsirkin, mpe, Will Deacon, linux-kernel,
	linuxram, virtualization, paulus, marc.zyngier, joe, robin.murphy,
	david, linuxppc-dev, elfring, haren, Anshuman Khandual
In-Reply-To: <20180804082120.GB4421@infradead.org>

On Sat, 2018-08-04 at 01:21 -0700, Christoph Hellwig wrote:
> No matter if you like it or not (I don't!) virtio is defined to bypass
> dma translations, it is very clearly stated in the spec.  It has some
> ill-defined bits to bypass it, so if you want the dma mapping API
> to be used you'll have to set that bit (in its original form, a refined
> form, or an entirely newly defined sane form) and make sure your
> hypersivors always sets it.  It's not rocket science, just a little bit
> for work to make sure your setup is actually going to work reliably
> and portably.

I think you are conflating completely different things, let me try to
clarify, we might actually be talking past each other.

> > We aren't going to cancel years of HW and SW development for our
> 
> Maybe you should have actually read the specs you are claiming to
> implemented before spending all that effort.

Anyway, let's cool our respective jets and sort that out, there are
indeed other approaches than overriding the DMA ops with special ones,
though I find them less tasty ... but here's my attempt at a (simpler)
description.

Bear with me for the long-ish email, this tries to describe the system
so you get an idea where we come from, and options we can use to get
out of this.

So we *are* implementing the spec, since qemu is currently unmodified:

Default virtio will bypass the iommu emulated by qemu as per spec etc..

On the Linux side, thus, virtio "sees" a normal iommu-bypassing device
and will treat it as such.

The problem is the assumption in the middle that qemu can access all
guest pages directly, which holds true for traditional VMs, but breaks
when the VM in our case turns itself into a secure VM. This isn't under
the action (or due to changes in) the hypervisor. KVM operates (almost)
normally here.

But there's this (very thin and open source btw) layer underneath
called ultravisor, which exploits some HW facilities to maintain a
separate pool of "secure" memory, which cannot be physically accessed
by a non-secure entity.

So in our scenario, qemu and KVM create a VM totally normally, there is
no changes required to the VM firmware, bootloader(s), etc... in fact
we support Linux based bootloaders, and those will work as normal linux
would in a VM, virtio works normally, etc...

Until that VM (via grub or kexec for example) loads a "secure image".

That secure image is a Linux kernel which has been "wrapped" (to simply
imagine a modified zImage wrapper though that's not entirely exact).

When that is run, before it modifies it's .data, it will interact with
the ultravisor using a specific HW facility to make itself secure. What
happens then is that the UV cryptographically verifies the kernel and
ramdisk, and copies them to the secure memory where execution returns.

The Ultravisor is then involved as a small shim for hypercalls between
the secure VM and KVM to prevent leakage of information (sanitize
registers etc...).

Now at this point, qemu can no longer access the secure VM pages
(there's more to this, such as using HMM to allow migration/encryption
accross etc... but let's not get bogged down).

So virtio can no longer access any page in the VM.

Now the VM *can* request from the Ultravisor some selected pages to be
made "insecure" and thus shared with qemu. This is how we handle some
of the pages used in our paravirt stuff, and that's how we want to deal
with virtio, by creating an insecure swiotlb pool.

At this point, thus, there are two options.

 - One you have rejected, which is to have a way for "no-iommu" virtio
(which still doesn't use an iommu on the qemu side and doesn't need
to), to be forced to use some custom DMA ops on the VM side.

 - One, which sadly has more overhead and will require modifying more
pieces of the puzzle, which is to make qemu uses an emulated iommu.
Once we make qemu do that, we can then layer swiotlb on top of the
emulated iommu on the guest side, and pass that as dma_ops to virtio.

Now, assuming you still absolutely want us to go down the second
option, there are several ways to get there. We would prefer to avoid
requiring the user to pass some special option to qemu. That has an
impact up the food chain (libvirt, management tools etc...) and users
probably won't understand what it's about. In fact the *end user* might
not even need to know a VM is secure, though applications inside might.

There's the additional annoyance that currently our guest FW (SLOF)
cannot deal with virtio in IOMMU mode, but that's fixable.

From there, refer to the email chain between Michael and I where we are
discussing options to "switch" virtio at runtime on the qemu side.

Any comment or suggestion ?

Cheers,
Ben.

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Benjamin Herrenschmidt @ 2018-08-05  1:11 UTC (permalink / raw)
  To: Michael S. Tsirkin, Christoph Hellwig
  Cc: robh, srikar, mpe, Will Deacon, linux-kernel, linuxram,
	virtualization, paulus, marc.zyngier, joe, robin.murphy, david,
	linuxppc-dev, elfring, haren, Anshuman Khandual
In-Reply-To: <20180805030326-mutt-send-email-mst@kernel.org>

On Sun, 2018-08-05 at 03:09 +0300, Michael S. Tsirkin wrote:
> It seems that the fact that within guest it's implemented using a bounce
> buffer and that it's easiest to do by switching virtio to use the DMA API
> isn't something virtio spec concerns itself with.

Right, this is my reasoning as well. See this other (long) email I just
sent to Christoph to explain the whole flow.

> I'm open to suggestions.

Cheers,
Ben.

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Benjamin Herrenschmidt @ 2018-08-05  4:52 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: robh, srikar, mpe, Will Deacon, linux-kernel, linuxram,
	virtualization, Christoph Hellwig, paulus, marc.zyngier, joe,
	robin.murphy, david, linuxppc-dev, elfring, haren,
	Anshuman Khandual
In-Reply-To: <20180805031046-mutt-send-email-mst@kernel.org>

On Sun, 2018-08-05 at 03:22 +0300, Michael S. Tsirkin wrote:
> I see the allure of this, but I think down the road you will
> discover passing a flag in libvirt XML saying
> "please use a secure mode" or whatever is a good idea.
> 
> Even thought it is probably not required to address this
> specific issue.
> 
> For example, I don't think ballooning works in secure mode,
> you will be able to teach libvirt not to try to add a
> balloon to the guest.

Right, we'll need some quirk to disable balloons  in the guest I
suppose.

Passing something from libvirt is cumbersome because the end user may
not even need to know about secure VMs. There are use cases where the
security is a contract down to some special application running inside
the secure VM, the sysadmin knows nothing about.

Also there's repercussions all the way to admin tools, web UIs etc...
so it's fairly wide ranging.

So as long as we only need to quirk a couple of devices, it's much
better contained that way.

> > Later on, (we may have even already run Linux at that point,
> > unsecurely, as we can use Linux as a bootloader under some
> > circumstances), we start a "secure image".
> > 
> > This is a kernel zImage that includes a "ticket" that has the
> > appropriate signature etc... so that when that kernel starts, it can
> > authenticate with the ultravisor, be verified (along with its ramdisk)
> > etc... and copied (by the UV) into secure memory & run from there.
> > 
> > At that point, the hypervisor is informed that the VM has become
> > secure.
> > 
> > So at that point, we could exit to qemu to inform it of the change,
> 
> That's probably a good idea too.

We probably will have to tell qemu eventually for migration, as we'll
need some kind of key exchange phase etc... to deal with the crypto
aspects (the actual page copy is sorted via encrypting the secure pages
back to normal pages in qemu, but we'll need extra metadata).

> > and
> > have it walk the qtree and "Switch" all the virtio devices to use the
> > IOMMU I suppose, but it feels a lot grosser to me.
> 
> That part feels gross, yes.
> 
> > That's the only other option I can think of.
> > 
> > > However in this specific case, the flag does not need to come from the
> > > hypervisor, it can be set by arch boot code I think.
> > > Christoph do you see a problem with that?
> > 
> > The above could do that yes. Another approach would be to do it from a
> > small virtio "quirk" that pokes a bit in the device to force it to
> > iommu mode when it detects that we are running in a secure VM. That's a
> > bit warty on the virito side but probably not as much as having a qemu
> > one that walks of the virtio devices to change how they behave.
> > 
> > What do you reckon ?
> 
> I think you are right that for the dma limit the hypervisor doesn't seem
> to need to know.

It's not just a limit mind you. It's a range, at least if we allocate
just a single pool of insecure pages. swiotlb feels like a better
option for us.

> > What we want to avoid is to expose any of this to the *end user* or
> > libvirt or any other higher level of the management stack. We really
> > want that stuff to remain contained between the VM itself, KVM and
> > maybe qemu.
> > 
> > We will need some other qemu changes for migration so that's ok. But
> > the minute you start touching libvirt and the higher levels it becomes
> > a nightmare.
> > 
> > Cheers,
> > Ben.
> 
> I don't believe you'll be able to avoid that entirely. The split between
> libvirt and qemu is more about community than about code, random bits of
> functionality tend to land on random sides of that fence.  Better add a
> tag in domain XML early is my advice. Having said that, it's your
> hypervisor. I'm just suggesting that when hypervisor does somehow need
> to care then I suspect most people won't be receptive to the argument
> that changing libvirt is a nightmare.

It only needs to care at runtime. The problem isn't changing libvirt
per-se, I don't have a problem with that. The problem is that it means
creating two categories of machines "secure" and "non-secure", which is
end-user visible, and thus has to be escalated to all the various
management stacks, UIs, etc... out there.

In addition, there are some cases where the individual creating the VMs
may not have any idea that they are secure.

But yes, if we have to, we'll do it. However, so far, we don't think
it's a great idea.

Cheers,
Ben.

> > > > >   To get swiotlb you'll need to then use the DT/ACPI
> > > > > dma-range property to limit the addressable range, and a swiotlb
> > > > > capable plaform will use swiotlb automatically.
> > > > 
> > > > This cannot be done as you describe it.
> > > > 
> > > > The VM is created as a *normal* VM. The DT stuff is generated by qemu
> > > > at a point where it has *no idea* that the VM will later become secure
> > > > and thus will have to restrict which pages can be used for "DMA".
> > > > 
> > > > The VM will *at runtime* turn itself into a secure VM via interactions
> > > > with the security HW and the Ultravisor layer (which sits below the
> > > > HV). This happens way after the DT has been created and consumed, the
> > > > qemu devices instanciated etc...
> > > > 
> > > > Only the guest kernel knows because it initates the transition. When
> > > > that happens, the virtio devices have already been used by the guest
> > > > firmware, bootloader, possibly another kernel that kexeced the "secure"
> > > > one, etc... 
> > > > 
> > > > So instead of running around saying NAK NAK NAK, please explain how we
> > > > can solve that differently.
> > > > 
> > > > Ben.

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Christoph Hellwig @ 2018-08-05  7:25 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: robh, srikar, Benjamin Herrenschmidt, Will Deacon, linux-kernel,
	linuxram, virtualization, Christoph Hellwig, paulus, marc.zyngier,
	mpe, joe, robin.murphy, david, linuxppc-dev, elfring, haren,
	Anshuman Khandual
In-Reply-To: <20180805030326-mutt-send-email-mst@kernel.org>

On Sun, Aug 05, 2018 at 03:09:55AM +0300, Michael S. Tsirkin wrote:
> So in this case however I'm not sure what exactly do we want to add. It
> seems that from point of view of the device, there is nothing special -
> it just gets a PA and writes there.  It also seems that guest does not
> need to get any info from the device either. Instead guest itself needs
> device to DMA into specific addresses, for its own reasons.
> 
> It seems that the fact that within guest it's implemented using a bounce
> buffer and that it's easiest to do by switching virtio to use the DMA API
> isn't something virtio spec concerns itself with.

And that is exactly what we added bus_dma_mask for - the case where
the device itself has not limitation (or a bigger limitation), but
the platform limits the accessible dma ranges.  One typical case is
a PCIe root port that is only connected to the CPU through an
interconnect that is limited to 32 address bits for example.

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Christoph Hellwig @ 2018-08-05  7:29 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: robh, srikar, Michael S. Tsirkin, mpe, Will Deacon, linux-kernel,
	linuxram, virtualization, Christoph Hellwig, paulus, marc.zyngier,
	joe, robin.murphy, david, linuxppc-dev, elfring, haren,
	Anshuman Khandual
In-Reply-To: <a3ec0a454928e880dabc3782ea906e318634abf9.camel@kernel.crashing.org>

On Sun, Aug 05, 2018 at 11:10:15AM +1000, Benjamin Herrenschmidt wrote:
>  - One you have rejected, which is to have a way for "no-iommu" virtio
> (which still doesn't use an iommu on the qemu side and doesn't need
> to), to be forced to use some custom DMA ops on the VM side.
> 
>  - One, which sadly has more overhead and will require modifying more
> pieces of the puzzle, which is to make qemu uses an emulated iommu.
> Once we make qemu do that, we can then layer swiotlb on top of the
> emulated iommu on the guest side, and pass that as dma_ops to virtio.

Or number three:  have a a virtio feature bit that tells the VM
to use whatever dma ops the platform thinks are appropinquate for
the bus it pretends to be on.  Then set a dma-range that is limited
to your secure memory range (if you really need it to be runtime
enabled only after a device reset that rescans) and use the normal
dma mapping code to bounce buffer.

^ permalink raw reply

* Re: [PATCH net-next 0/6] virtio_net: Add ethtool stat items
From: Tariq Toukan @ 2018-08-05 13:15 UTC (permalink / raw)
  To: David Miller, mst, toshiaki.makita1, makita.toshiaki
  Cc: ranro, netdev, Eran Ben Elisha, virtualization, Maor Gottlieb
In-Reply-To: <20180725.125909.840615930497803596.davem@davemloft.net>



On 25/07/2018 10:59 PM, David Miller wrote:
> From: "Michael S. Tsirkin" <mst@redhat.com>
> Date: Wed, 25 Jul 2018 12:40:12 +0300
> 
>> On Mon, Jul 23, 2018 at 11:36:03PM +0900, Toshiaki Makita wrote:
>>> From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
>>>
>>> Add some ethtool stat items useful for performance analysis.
>>>
>>> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
>>
>> Series:
>>
>> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> 
> Series applied.
> 
>> Patch 1 seems appropriate for stable, even though it's minor.
> 
> Ok, I'll put patch #1 also into 'net' and queue it up for -stable.
> 
> Thanks.
> 

Hi,
Our verification team encountered the following degradation, introduced 
by this series. Please check KASAN bug report below.

We tested before and after the series, but didn't bisect within it. We 
can do if necessary.

Thanks,
Tariq


[   46.166544] BUG: KASAN: use-after-free in virtnet_poll+0xd9c/0xfd1 
[virtio_net]
[   46.166593] Read of size 8 at addr ffff8803400da608 by task ip/1013

[   46.166603] CPU: 3 PID: 1013 Comm: ip Not tainted 
4.18.0-rc6-for-upstream-dbg-2018-07-26_19-45-52-64 #1
[   46.166606] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
BIOS 1.10.2-1ubuntu1 04/01/2014
[   46.166609] Call Trace:
[   46.166613]  <IRQ>
[   46.166624]  dump_stack+0xf0/0x17b
[   46.166647]  ? show_regs_print_info+0x5/0x5
[   46.166654]  ? printk+0x9c/0xc3
[   46.166661]  ? kmsg_dump_rewind_nolock+0xf5/0xf5
[   46.166677]  ? virtnet_poll+0xd9c/0xfd1 [virtio_net]
[   46.166685]  print_address_description+0xea/0x380
[   46.166701]  ? virtnet_poll+0xd9c/0xfd1 [virtio_net]
[   46.166711]  kasan_report+0x1dd/0x460
[   46.166727]  ? virtnet_poll+0xd9c/0xfd1 [virtio_net]
[   46.166743]  virtnet_poll+0xd9c/0xfd1 [virtio_net]
[   46.166767]  ? receive_buf+0x2e30/0x2e30 [virtio_net]
[   46.166796]  ? sched_clock_cpu+0x18/0x2b0
[   46.166809]  ? print_irqtrace_events+0x280/0x280
[   46.166817]  ? print_irqtrace_events+0x280/0x280
[   46.166830]  ? rcu_process_callbacks+0xc5e/0x12d0
[   46.166838]  ? kvm_clock_read+0x1f/0x30
[   46.166857]  ? mark_held_locks+0xd5/0x170
[   46.166867]  ? net_rx_action+0x2aa/0x10e0
[   46.166882]  net_rx_action+0x4bc/0x10e0
[   46.166906]  ? napi_complete_done+0x480/0x480
[   46.166925]  ? print_irqtrace_events+0x280/0x280
[   46.166935]  ? sched_clock+0x5/0x10
[   46.166952]  ? __lock_is_held+0xcb/0x1a0
[   46.166982]  __do_softirq+0x2c4/0xdf4
[   46.167010]  do_softirq_own_stack+0x2a/0x40
[   46.167031]  </IRQ>
[   46.167038]  ? virtnet_napi_enable+0x37/0xa0 [virtio_net]
[   46.167044]  do_softirq.part.11+0x69/0x70
[   46.167065]  __local_bh_enable_ip+0x1d9/0x250
[   46.167076]  virtnet_open+0x13c/0x440 [virtio_net]
[   46.167099]  __dev_open+0x1cf/0x390
[   46.167108]  ? dev_set_rx_mode+0x30/0x30
[   46.167113]  ? __local_bh_enable_ip+0xf7/0x250
[   46.167124]  ? trace_hardirqs_on_caller+0x38d/0x6c0
[   46.167130]  ? __dev_change_flags+0x18d/0x630
[   46.167142]  __dev_change_flags+0x469/0x630
[   46.167152]  ? dev_set_allmulti+0x10/0x10
[   46.167157]  ? __lock_acquire+0x9c1/0x4b50
[   46.167166]  ? print_irqtrace_events+0x280/0x280
[   46.167174]  ? kvm_clock_read+0x1f/0x30
[   46.167191]  ? rtnetlink_put_metrics+0x530/0x530
[   46.167205]  dev_change_flags+0x8b/0x160
[   46.167236]  do_setlink+0xa17/0x39f0
[   46.167248]  ? sched_clock_cpu+0x18/0x2b0
[   46.167267]  ? rtnl_dump_ifinfo+0xd20/0xd20
[   46.167277]  ? print_irqtrace_events+0x280/0x280
[   46.167285]  ? kvm_clock_read+0x1f/0x30
[   46.167316]  ? debug_show_all_locks+0x3b0/0x3b0
[   46.167321]  ? kvm_sched_clock_read+0x5/0x10
[   46.167327]  ? sched_clock+0x5/0x10
[   46.167333]  ? sched_clock_cpu+0x18/0x2b0
[   46.167382]  ? memset+0x1f/0x40
[   46.167408]  ? nla_parse+0x8c/0x3f0
[   46.167419]  ? rtnetlink_put_metrics+0x530/0x530
[   46.167427]  ? kvm_sched_clock_read+0x5/0x10
[   46.167433]  ? sched_clock+0x5/0x10
[   46.167439]  ? sched_clock_cpu+0x18/0x2b0
[   46.167457]  rtnl_newlink+0x9dc/0x13a0
[   46.167484]  ? rtnl_link_unregister+0x2b0/0x2b0
[   46.167513]  ? kvm_clock_read+0x1f/0x30
[   46.167538]  ? print_irqtrace_events+0x280/0x280
[   46.167546]  ? kvm_clock_read+0x1f/0x30
[   46.167551]  ? kvm_sched_clock_read+0x5/0x10
[   46.167557]  ? sched_clock+0x5/0x10
[   46.167562]  ? sched_clock_cpu+0x18/0x2b0
[   46.167567]  ? kvm_clock_read+0x1f/0x30
[   46.167598]  ? __lock_acquire+0x9c1/0x4b50
[   46.167640]  ? debug_show_all_locks+0x3b0/0x3b0
[   46.167646]  ? kvm_clock_read+0x1f/0x30
[   46.167651]  ? kvm_sched_clock_read+0x5/0x10
[   46.167673]  ? debug_show_all_locks+0x3b0/0x3b0
[   46.167698]  ? is_bpf_text_address+0x87/0x130
[   46.167707]  ? print_irqtrace_events+0x280/0x280
[   46.167714]  ? kvm_clock_read+0x1f/0x30
[   46.167718]  ? kvm_sched_clock_read+0x5/0x10
[   46.167723]  ? sched_clock+0x5/0x10
[   46.167728]  ? sched_clock_cpu+0x18/0x2b0
[   46.167753]  ? get_lock_stats+0x18/0x160
[   46.167877]  ? __lock_is_held+0xcb/0x1a0
[   46.167908]  rtnetlink_rcv_msg+0x575/0xaa0
[   46.167913]  ? kvm_clock_read+0x1f/0x30
[   46.167925]  ? validate_linkmsg+0x900/0x900
[   46.167945]  ? netlink_deliver_tap+0x1cc/0xf30
[   46.167950]  ? kvm_clock_read+0x1f/0x30
[   46.167965]  netlink_rcv_skb+0x13c/0x3a0
[   46.167972]  ? validate_linkmsg+0x900/0x900
[   46.167984]  ? netlink_ack+0xcd0/0xcd0
[   46.168030]  netlink_unicast+0x45a/0x6a0
[   46.168061]  ? netlink_attachskb+0x770/0x770
[   46.168075]  ? import_iovec+0xa8/0x460
[   46.168093]  netlink_sendmsg+0x68e/0xdf0
[   46.168127]  ? netlink_unicast+0x6a0/0x6a0
[   46.168133]  ? copy_msghdr_from_user+0x216/0x350
[   46.168160]  ? netlink_unicast+0x6a0/0x6a0
[   46.168168]  sock_sendmsg+0xdb/0x160
[   46.168193]  ___sys_sendmsg+0x6b3/0xbd0
[   46.168207]  ? copy_msghdr_from_user+0x350/0x350
[   46.168221]  ? do_raw_spin_unlock+0xae/0x310
[   46.168248]  ? _raw_spin_unlock+0x2e/0x50
[   46.168257]  ? __handle_mm_fault+0xb65/0x2e90
[   46.168278]  ? handle_mm_fault+0x28f/0xa70
[   46.168284]  ? kvm_clock_read+0x1f/0x30
[   46.168289]  ? kvm_sched_clock_read+0x5/0x10
[   46.168303]  ? __do_page_fault+0x549/0xd00
[   46.168308]  ? kvm_clock_read+0x1f/0x30
[   46.168313]  ? kvm_sched_clock_read+0x5/0x10
[   46.168318]  ? sched_clock+0x5/0x10
[   46.168324]  ? sched_clock_cpu+0x18/0x2b0
[   46.168336]  ? __fget_light+0x5c/0x280
[   46.168357]  ? __sys_sendmsg+0xd2/0x170
[   46.168362]  __sys_sendmsg+0xd2/0x170
[   46.168371]  ? __ia32_sys_shutdown+0x90/0x90
[   46.168382]  ? handle_mm_fault+0x363/0xa70
[   46.168397]  ? up_read+0x1c/0x130
[   46.168403]  ? __do_page_fault+0x549/0xd00
[   46.168443]  ? do_syscall_64+0x18/0x540
[   46.168459]  do_syscall_64+0xa4/0x540
[   46.168470]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   46.168477] RIP: 0033:0x7fa59e680087
[   46.168481] Code: 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 80 00 00 
00 00 8b 05 aa 97 2c 00 48 63 d2 48 63 ff 85 c0 75 18 b8 2e 00 00 00 0f 
05 <48> 3d 00 f0 ff ff 77 59 f3 c3 0f 1f 80 00 00 00 00 53 48 89 f3 48
[   46.168717] RSP: 002b:00007ffe83f2fef8 EFLAGS: 00000246 ORIG_RAX: 
000000000000002e
[   46.168726] RAX: ffffffffffffffda RBX: 00000000006744c0 RCX: 
00007fa59e680087
[   46.168731] RDX: 0000000000000000 RSI: 00007ffe83f2ff40 RDI: 
0000000000000003
[   46.168735] RBP: 000000005b5d727c R08: 0000000000000001 R09: 
fefefeff77686d74
[   46.168740] R10: 0000000000000006 R11: 0000000000000246 R12: 
00007ffe83f38000
[   46.168744] R13: 0000000000000000 R14: 00007ffe83f38728 R15: 
00007ffe83f37fd8

[   46.168778] Allocated by task 499:
[   46.168784]  kasan_kmalloc+0xa0/0xd0
[   46.168789]  __kmalloc+0x191/0x3a0
[   46.168795]  mpi_powm+0x956/0x2360
[   46.168801]  rsa_enc+0x1f0/0x3a0
[   46.168806]  pkcs1pad_verify+0x4c4/0x750
[   46.168815]  public_key_verify_signature+0x58b/0xac0
[   46.168821]  pkcs7_validate_trust+0x3bd/0x710
[   46.168830]  verify_pkcs7_signature+0xe8/0x1b0
[   46.168837]  mod_verify_sig+0x1d4/0x2a0
[   46.168842]  load_module+0x1689/0x6590
[   46.168847]  __do_sys_finit_module+0x192/0x1c0
[   46.168852]  do_syscall_64+0xa4/0x540
[   46.168857]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

[   46.168864] Freed by task 499:
[   46.168869]  __kasan_slab_free+0x11d/0x160
[   46.168874]  kfree+0x151/0x650
[   46.168878]  mpi_powm+0x621/0x2360
[   46.168883]  rsa_enc+0x1f0/0x3a0
[   46.168887]  pkcs1pad_verify+0x4c4/0x750
[   46.168892]  public_key_verify_signature+0x58b/0xac0
[   46.168897]  pkcs7_validate_trust+0x3bd/0x710
[   46.168902]  verify_pkcs7_signature+0xe8/0x1b0
[   46.168906]  mod_verify_sig+0x1d4/0x2a0
[   46.168911]  load_module+0x1689/0x6590
[   46.168916]  __do_sys_finit_module+0x192/0x1c0
[   46.168921]  do_syscall_64+0xa4/0x540
[   46.168925]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

[   46.168933] The buggy address belongs to the object at ffff8803400da588
                 which belongs to the cache kmalloc-2048 of size 2048
[   46.168938] The buggy address is located 128 bytes inside of
                 2048-byte region [ffff8803400da588, ffff8803400dad88)
[   46.168942] The buggy address belongs to the page:
[   46.168947] page:ffffea000d003600 count:1 mapcount:0 
mapping:ffff880355011540 index:0x0 compound_mapcount: 0
[   46.169272] flags: 0x2fffff80008100(slab|head)
[   46.169358] raw: 002fffff80008100 ffffea000d13a608 ffffea000d43e608 
ffff880355011540
[   46.169364] raw: 0000000000000000 00000000000d000d 00000001ffffffff 
0000000000000000
[   46.169369] page dumped because: kasan: bad access detected

[   46.169377] Memory state around the buggy address:
[   46.169383]  ffff8803400da500: fc fc fc fc fc fc fc fc fc fc fc fc fc 
fc fc fc
[   46.169388]  ffff8803400da580: fc fb fb fb fb fb fb fb fb fb fb fb fb 
fb fb fb
[   46.169394] >ffff8803400da600: fb fb fb fb fb fb fb fb fb fb fb fb fb 
fb fb fb
[   46.169398]                       ^
[   46.169403]  ffff8803400da680: fb fb fb fb fb fb fb fb fb fb fb fb fb 
fb fb fb
[   46.169408]  ffff8803400da700: fb fb fb fb fb fb fb fb fb fb fb fb fb 
fb fb fb
[   46.169427] 
==================================================================
[   46.169431] Disabling lock debugging due to kernel taint

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox