* Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()
From: Jason Wang @ 2018-08-03 3:07 UTC (permalink / raw)
To: Tonghao Zhang; +Cc: Linux Kernel Network Developers, mst, virtualization
In-Reply-To: <CAMDZJNUMBYOVh+pzaykRq23ePqNR7BOA8wP_=i85-PXvwaxuBA@mail.gmail.com>
On 2018年08月03日 10:51, Tonghao Zhang wrote:
> On Thu, Aug 2, 2018 at 5:23 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>>
>> On 2018年08月02日 16:41, Toshiaki Makita wrote:
>>> On 2018/08/02 17:18, Jason Wang wrote:
>>>> On 2018年08月01日 17:52, Tonghao Zhang wrote:
>>>>>> +static void vhost_net_busy_poll_check(struct vhost_net *net,
>>>>>> + struct vhost_virtqueue *rvq,
>>>>>> + struct vhost_virtqueue *tvq,
>>>>>> + bool rx)
>>>>>> +{
>>>>>> + struct socket *sock = rvq->private_data;
>>>>>> +
>>>>>> + if (rx)
>>>>>> + vhost_net_busy_poll_try_queue(net, tvq);
>>>>>> + else if (sock && sk_has_rx_data(sock->sk))
>>>>>> + vhost_net_busy_poll_try_queue(net, rvq);
>>>>>> + else {
>>>>>> + /* On tx here, sock has no rx data, so we
>>>>>> + * will wait for sock wakeup for rx, and
>>>>>> + * vhost_enable_notify() is not needed. */
>>>>> A possible case is we do have rx data but guest does not refill the rx
>>>>> queue. In this case we may lose notifications from guest.
>>>> Yes, should consider this case. thanks.
>>> I'm a bit confused. Isn't this covered by the previous
>>> "else if (sock && sk_has_rx_data(...))" block?
>> The problem is it does nothing if vhost_vq_avail_empty() is true and
>> vhost_enble_notify() is false.
>>
>>>>>>> +
>>>>>>> + cpu_relax();
>>>>>>> + }
>>>>>>> +
>>>>>>> + preempt_enable();
>>>>>>> +
>>>>>>> + if (!rx)
>>>>>>> + vhost_net_enable_vq(net, vq);
>>>>>> No need to enable rx virtqueue, if we are sure handle_rx() will be
>>>>>> called soon.
>>>>> If we disable rx virtqueue in handle_tx and don't send packets from
>>>>> guest anymore(handle_tx is not called), so we can wake up for sock rx.
>>>>> so the network is broken.
>>>> Not sure I understand here. I mean is we schedule work for handle_rx(),
>>>> there's no need to enable it since handle_rx() will do this for us.
>>> Looks like in the last "else" block in vhost_net_busy_poll_check() we
>>> need to enable vq since in that case we have no rx data and handle_rx()
>>> is not scheduled.
>>>
>> Yes.
> So we will use the vhost_has_work() to check whether or not the
> handle_rx is scheduled ?
> If we use the vhost_has_work(), the work in the dev work_list may be
> rx work, or tx work, right ?
Yes. We can add a boolean to record whether or not we've called
vhost_poll_queue() for rvq. And avoid calling vhost_net_enable_vq() if
it was true.
So here's the needed changes:
1) Split the wakeup disabling to another patch
2) Squash the vhost_net_busy_poll_try_queue() and
vhost_net_busy_poll_check() into vhost_net_busy_poll() and reduce
duplicated checks.
3) If possible, rename the boolean rx in vhost_net_busy_poll() to
poll_rx, this makes code more readable.
Thanks
>> Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()
From: Jason Wang @ 2018-08-03 3:07 UTC (permalink / raw)
To: Toshiaki Makita, Tonghao Zhang
Cc: Linux Kernel Network Developers, virtualization, mst
In-Reply-To: <3272c3b4-a44c-8554-329e-8a5e1a59aafd@redhat.com>
On 2018年08月02日 17:23, Jason Wang wrote:
>>>>>>
>>>>> No need to enable rx virtqueue, if we are sure handle_rx() will be
>>>>> called soon.
>>>> If we disable rx virtqueue in handle_tx and don't send packets from
>>>> guest anymore(handle_tx is not called), so we can wake up for sock rx.
>>>> so the network is broken.
>>> Not sure I understand here. I mean is we schedule work for handle_rx(),
>>> there's no need to enable it since handle_rx() will do this for us.
>> Looks like in the last "else" block in vhost_net_busy_poll_check() we
>> need to enable vq since in that case we have no rx data and handle_rx()
>> is not scheduled.
>>
>
> Yes.
>
> Thanks
Rethink about this, looks not. We enable rx wakeups in this case, so if
there's pending data, handle_rx() will be schedule after
vhost_net_enable_vq().
Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()
From: Tonghao Zhang @ 2018-08-03 2:51 UTC (permalink / raw)
To: jasowang; +Cc: Linux Kernel Network Developers, mst, virtualization
In-Reply-To: <3272c3b4-a44c-8554-329e-8a5e1a59aafd@redhat.com>
On Thu, Aug 2, 2018 at 5:23 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
>
> On 2018年08月02日 16:41, Toshiaki Makita wrote:
> > On 2018/08/02 17:18, Jason Wang wrote:
> >> On 2018年08月01日 17:52, Tonghao Zhang wrote:
> >>>> +static void vhost_net_busy_poll_check(struct vhost_net *net,
> >>>> + struct vhost_virtqueue *rvq,
> >>>> + struct vhost_virtqueue *tvq,
> >>>> + bool rx)
> >>>> +{
> >>>> + struct socket *sock = rvq->private_data;
> >>>> +
> >>>> + if (rx)
> >>>> + vhost_net_busy_poll_try_queue(net, tvq);
> >>>> + else if (sock && sk_has_rx_data(sock->sk))
> >>>> + vhost_net_busy_poll_try_queue(net, rvq);
> >>>> + else {
> >>>> + /* On tx here, sock has no rx data, so we
> >>>> + * will wait for sock wakeup for rx, and
> >>>> + * vhost_enable_notify() is not needed. */
> >>> A possible case is we do have rx data but guest does not refill the rx
> >>> queue. In this case we may lose notifications from guest.
> >> Yes, should consider this case. thanks.
> > I'm a bit confused. Isn't this covered by the previous
> > "else if (sock && sk_has_rx_data(...))" block?
>
> The problem is it does nothing if vhost_vq_avail_empty() is true and
> vhost_enble_notify() is false.
>
> >
> >>>>> +
> >>>>> + cpu_relax();
> >>>>> + }
> >>>>> +
> >>>>> + preempt_enable();
> >>>>> +
> >>>>> + if (!rx)
> >>>>> + vhost_net_enable_vq(net, vq);
> >>>> No need to enable rx virtqueue, if we are sure handle_rx() will be
> >>>> called soon.
> >>> If we disable rx virtqueue in handle_tx and don't send packets from
> >>> guest anymore(handle_tx is not called), so we can wake up for sock rx.
> >>> so the network is broken.
> >> Not sure I understand here. I mean is we schedule work for handle_rx(),
> >> there's no need to enable it since handle_rx() will do this for us.
> > Looks like in the last "else" block in vhost_net_busy_poll_check() we
> > need to enable vq since in that case we have no rx data and handle_rx()
> > is not scheduled.
> >
>
> Yes.
So we will use the vhost_has_work() to check whether or not the
handle_rx is scheduled ?
If we use the vhost_has_work(), the work in the dev work_list may be
rx work, or tx work, right ?
> Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Jason Wang @ 2018-08-03 2:41 UTC (permalink / raw)
To: Michael S. Tsirkin, Anshuman Khandual
Cc: robh, srikar, benh, linuxram, linux-kernel, virtualization, hch,
paulus, mpe, joe, linuxppc-dev, elfring, haren, david
In-Reply-To: <20180802235332-mutt-send-email-mst@kernel.org>
On 2018年08月03日 04:55, Michael S. Tsirkin wrote:
> On Fri, Jul 20, 2018 at 09:29:37AM +0530, Anshuman Khandual wrote:
>> This patch series is the follow up on the discussions we had before about
>> the RFC titled [RFC,V2] virtio: Add platform specific DMA API translation
>> for virito devices (https://patchwork.kernel.org/patch/10417371/). There
>> were suggestions about doing away with two different paths of transactions
>> with the host/QEMU, first being the direct GPA and the other being the DMA
>> API based translations.
>>
>> First patch attempts to create a direct GPA mapping based DMA operations
>> structure called 'virtio_direct_dma_ops' with exact same implementation
>> of the direct GPA path which virtio core currently has but just wrapped in
>> a DMA API format. Virtio core must use 'virtio_direct_dma_ops' instead of
>> the arch default in absence of VIRTIO_F_IOMMU_PLATFORM flag to preserve the
>> existing semantics. The second patch does exactly that inside the function
>> virtio_finalize_features(). The third patch removes the default direct GPA
>> path from virtio core forcing it to use DMA API callbacks for all devices.
>> Now with that change, every device must have a DMA operations structure
>> associated with it. The fourth patch adds an additional hook which gives
>> the platform an opportunity to do yet another override if required. This
>> platform hook can be used on POWER Ultravisor based protected guests to
>> load up SWIOTLB DMA callbacks to do the required (as discussed previously
>> in the above mentioned thread how host is allowed to access only parts of
>> the guest GPA range) bounce buffering into the shared memory for all I/O
>> scatter gather buffers to be consumed on the host side.
>>
>> Please go through these patches and review whether this approach broadly
>> makes sense. I will appreciate suggestions, inputs, comments regarding
>> the patches or the approach in general. Thank you.
> Jason did some work on profiling this. Unfortunately he reports
> about 4% extra overhead from this switch on x86 with no vIOMMU.
The test is rather simple, just run pktgen (pktgen_sample01_simple.sh)
in guest and measure PPS on tap on host.
Thanks
>
> I expect he's writing up the data in more detail, but
> just wanted to let you know this would be one more
> thing to debug before we can just switch to DMA APIs.
>
>
>> Anshuman Khandual (4):
>> virtio: Define virtio_direct_dma_ops structure
>> virtio: Override device's DMA OPS with virtio_direct_dma_ops selectively
>> virtio: Force virtio core to use DMA API callbacks for all virtio devices
>> virtio: Add platform specific DMA API translation for virito devices
>>
>> arch/powerpc/include/asm/dma-mapping.h | 6 +++
>> arch/powerpc/platforms/pseries/iommu.c | 6 +++
>> drivers/virtio/virtio.c | 72 ++++++++++++++++++++++++++++++++++
>> drivers/virtio/virtio_pci_common.h | 3 ++
>> drivers/virtio/virtio_ring.c | 65 +-----------------------------
>> 5 files changed, 89 insertions(+), 63 deletions(-)
>>
>> --
>> 2.9.3
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()
From: Jason Wang @ 2018-08-03 2:38 UTC (permalink / raw)
To: Toshiaki Makita, Tonghao Zhang
Cc: Linux Kernel Network Developers, virtualization, mst
In-Reply-To: <e2917cb0-f8dd-90c7-5a44-5333ad1d1ee8@lab.ntt.co.jp>
On 2018年08月02日 17:57, Toshiaki Makita wrote:
> On 2018/08/02 18:23, Jason Wang wrote:
>> On 2018年08月02日 16:41, Toshiaki Makita wrote:
>>> On 2018/08/02 17:18, Jason Wang wrote:
>>>> On 2018年08月01日 17:52, Tonghao Zhang wrote:
>>>>>> +static void vhost_net_busy_poll_check(struct vhost_net *net,
>>>>>> + struct vhost_virtqueue *rvq,
>>>>>> + struct vhost_virtqueue *tvq,
>>>>>> + bool rx)
>>>>>> +{
>>>>>> + struct socket *sock = rvq->private_data;
>>>>>> +
>>>>>> + if (rx)
>>>>>> + vhost_net_busy_poll_try_queue(net, tvq);
>>>>>> + else if (sock && sk_has_rx_data(sock->sk))
>>>>>> + vhost_net_busy_poll_try_queue(net, rvq);
>>>>>> + else {
>>>>>> + /* On tx here, sock has no rx data, so we
>>>>>> + * will wait for sock wakeup for rx, and
>>>>>> + * vhost_enable_notify() is not needed. */
>>>>> A possible case is we do have rx data but guest does not refill the rx
>>>>> queue. In this case we may lose notifications from guest.
>>>> Yes, should consider this case. thanks.
>>> I'm a bit confused. Isn't this covered by the previous
>>> "else if (sock && sk_has_rx_data(...))" block?
>> The problem is it does nothing if vhost_vq_avail_empty() is true and
>> vhost_enble_notify() is false.
> If vhost_enable_notify() is false, guest will eventually kicks vq, no?
>
Yes, you are right.
Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Michael S. Tsirkin @ 2018-08-02 21:51 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: robh, srikar, mpe, Will Deacon, linux-kernel, linuxram,
virtualization, Christoph Hellwig, paulus, marc.zyngier, joe,
robin.murphy, david, linuxppc-dev, elfring, haren,
Anshuman Khandual
In-Reply-To: <de4888b6457e220776e16a9c8958ff0886ffc66c.camel@kernel.crashing.org>
On Thu, Aug 02, 2018 at 04:13:09PM -0500, Benjamin Herrenschmidt wrote:
> On Thu, 2018-08-02 at 23:52 +0300, Michael S. Tsirkin wrote:
> > > Yes, this is the purpose of Anshuman original patch (I haven't looked
> > > at the details of the patch in a while but that's what I told him to
> > > implement ;-) :
> > >
> > > - Make virtio always use DMA ops to simplify the code path (with a set
> > > of "transparent" ops for legacy)
> > >
> > > and
> > >
> > > - Provide an arch hook allowing us to "override" those "transparent"
> > > DMA ops with some custom ones that do the appropriate swiotlb gunk.
> > >
> > > Cheers,
> > > Ben.
> > >
> >
> > Right but as I tried to say doing that brings us to a bunch of issues
> > with using DMA APIs in virtio. Put simply DMA APIs weren't designed for
> > guest to hypervisor communication.
>
> I'm not sure I see the problem, see below
>
> > When we do (as is the case with PLATFORM_IOMMU right now) this adds a
> > bunch of overhead which we need to get rid of if we are to switch to
> > PLATFORM_IOMMU by default. We need to fix that.
>
> So let's differenciate the two problems of having an IOMMU (real or
> emulated) which indeeds adds overhead etc... and using the DMA API.
Well actually it's the other way around. An iommu in theory doesn't need
to bring overhead if you set it in bypass mode. Which does imply the
iommu supports bypass mode. Is that universally the case? DMA API does
see Christoph's list of things it does some of which add overhead.
> At the moment, virtio does this all over the place:
>
> if (use_dma_api)
> dma_map/alloc_something(...)
> else
> use_pa
>
> The idea of the patch set is to do two, somewhat orthogonal, changes
> that together achieve what we want. Let me know where you think there
> is "a bunch of issues" because I'm missing it:
>
> 1- Replace the above if/else constructs with just calling the DMA API,
> and have virtio, at initialization, hookup its own dma_ops that just
> "return pa" (roughly) when the IOMMU stuff isn't used.
>
> This adds an indirect function call to the path that previously didn't
> have one (the else case above). Is that a significant/measurable
> overhead ?
Seems to be :( Jason reports about 4%. I wonder whether we can support
map_sg and friends being NULL, then use that when mapping is an
identity. A conditional branch there is likely very cheap.
Would this cover all platforms with kvm (which is where we care
most about performance)?
> This change stands alone, and imho "cleans" up virtio by avoiding all
> that if/else "2 path" and unless it adds a measurable overhead, should
> probably be done.
>
> 2- Make virtio use the DMA API with our custom platform-provided
> swiotlb callbacks when needed, that is when not using IOMMU *and*
> running on a secure VM in our case.
>
> This benefits from -1- by making us just plumb in a different set of
> DMA ops we would have cooked up specifically for virtio in our arch
> code (or in virtio itself but build arch-conditionally in a separate
> file). But it doesn't strictly need it -1-:
>
> Now, -2- doesn't strictly needs -1-. We could have just done another
> xen-like hack that forces the DMA API "ON" for virtio when running in a
> secure VM.
>
> The problem if we do that however is that we also then need the arch
> PCI code to make sure it hooks up the virtio PCI devices with the
> special "magic" DMA ops that avoid the iommu but still do swiotlb, ie,
> not the same as other PCI devices. So it will have to play games such
> as checking vendor/device IDs for virtio, checking the IOMMU flag,
> etc... from the arch code which really bloody sucks when assigning PCI
> DMA ops.
>
> However, if we do it the way we plan here, on top of -1-, with a hook
> called from virtio into the arch to "override" the virtio DMA ops, then
> we avoid the problem completely: The arch hook would only be called by
> virtio if the IOMMU flag is *not* set. IE only when using that special
> "hypervisor" iommu bypass. If the IOMMU flag is set, virtio uses normal
> PCI dma ops as usual.
>
> That way, we have a very clear semantic: This hook is purely about
> replacing those "null" DMA ops that just return PA introduced in -1-
> with some arch provided specially cooked up DMA ops for non-IOMMU
> virtio that know about the arch special requirements. For us bounce
> buffering.
>
> Is there something I'm missing ?
>
> Cheers,
> Ben.
Right so I was trying to write it up in a systematic way, but just to
give you one example, if there is a system where DMA API handles
coherency issues, or flushing of some buffers, then our PLATFORM_IOMMU
flag causes that to happen.
And we kinda worked around this without the IOMMU by basically saying
"ok we do not really need DMA API so let's just bypass it" and
it was kind of ok except now everyone is switching to vIOMMU
just in case. So now people do want some parts of what DMA API does,
such as the bounce buffer use, or IOMMU mappings.
And maybe in the end the solution is going to be to do something similar
to virt_Xmb except for DMA APIs: add APIs that handle just the
addressing bits but without the overhead. See
commit 6a65d26385bf487926a0616650927303058551e3
asm-generic: implement virt_xxx memory barriers
for reference, it's a similar set of issues.
So it's not a problem with your patches as such, it's just that
they don't solve that harder problem.
--
MST
^ permalink raw reply
* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Benjamin Herrenschmidt @ 2018-08-02 21:13 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: robh, srikar, mpe, Will Deacon, linux-kernel, linuxram,
virtualization, Christoph Hellwig, paulus, marc.zyngier, joe,
robin.murphy, david, linuxppc-dev, elfring, haren,
Anshuman Khandual
In-Reply-To: <20180802225738-mutt-send-email-mst@kernel.org>
On Thu, 2018-08-02 at 23:52 +0300, Michael S. Tsirkin wrote:
> > Yes, this is the purpose of Anshuman original patch (I haven't looked
> > at the details of the patch in a while but that's what I told him to
> > implement ;-) :
> >
> > - Make virtio always use DMA ops to simplify the code path (with a set
> > of "transparent" ops for legacy)
> >
> > and
> >
> > - Provide an arch hook allowing us to "override" those "transparent"
> > DMA ops with some custom ones that do the appropriate swiotlb gunk.
> >
> > Cheers,
> > Ben.
> >
>
> Right but as I tried to say doing that brings us to a bunch of issues
> with using DMA APIs in virtio. Put simply DMA APIs weren't designed for
> guest to hypervisor communication.
I'm not sure I see the problem, see below
> When we do (as is the case with PLATFORM_IOMMU right now) this adds a
> bunch of overhead which we need to get rid of if we are to switch to
> PLATFORM_IOMMU by default. We need to fix that.
So let's differenciate the two problems of having an IOMMU (real or
emulated) which indeeds adds overhead etc... and using the DMA API.
At the moment, virtio does this all over the place:
if (use_dma_api)
dma_map/alloc_something(...)
else
use_pa
The idea of the patch set is to do two, somewhat orthogonal, changes
that together achieve what we want. Let me know where you think there
is "a bunch of issues" because I'm missing it:
1- Replace the above if/else constructs with just calling the DMA API,
and have virtio, at initialization, hookup its own dma_ops that just
"return pa" (roughly) when the IOMMU stuff isn't used.
This adds an indirect function call to the path that previously didn't
have one (the else case above). Is that a significant/measurable
overhead ?
This change stands alone, and imho "cleans" up virtio by avoiding all
that if/else "2 path" and unless it adds a measurable overhead, should
probably be done.
2- Make virtio use the DMA API with our custom platform-provided
swiotlb callbacks when needed, that is when not using IOMMU *and*
running on a secure VM in our case.
This benefits from -1- by making us just plumb in a different set of
DMA ops we would have cooked up specifically for virtio in our arch
code (or in virtio itself but build arch-conditionally in a separate
file). But it doesn't strictly need it -1-:
Now, -2- doesn't strictly needs -1-. We could have just done another
xen-like hack that forces the DMA API "ON" for virtio when running in a
secure VM.
The problem if we do that however is that we also then need the arch
PCI code to make sure it hooks up the virtio PCI devices with the
special "magic" DMA ops that avoid the iommu but still do swiotlb, ie,
not the same as other PCI devices. So it will have to play games such
as checking vendor/device IDs for virtio, checking the IOMMU flag,
etc... from the arch code which really bloody sucks when assigning PCI
DMA ops.
However, if we do it the way we plan here, on top of -1-, with a hook
called from virtio into the arch to "override" the virtio DMA ops, then
we avoid the problem completely: The arch hook would only be called by
virtio if the IOMMU flag is *not* set. IE only when using that special
"hypervisor" iommu bypass. If the IOMMU flag is set, virtio uses normal
PCI dma ops as usual.
That way, we have a very clear semantic: This hook is purely about
replacing those "null" DMA ops that just return PA introduced in -1-
with some arch provided specially cooked up DMA ops for non-IOMMU
virtio that know about the arch special requirements. For us bounce
buffering.
Is there something I'm missing ?
Cheers,
Ben.
^ permalink raw reply
* Re: [net-next, v6, 6/7] net-sysfs: Add interface for Rx queue(s) map per Tx queue
From: Michael S. Tsirkin @ 2018-08-02 21:08 UTC (permalink / raw)
To: Nambiar, Amritha
Cc: alexander.h.duyck, willemdebruijn.kernel, Andrei Vagin, tom,
netdev, alexander.duyck, virtualization, edumazet, rostedt,
hannes, sridhar.samudrala, tom, davem, gaowanlong
In-Reply-To: <c3fbd7f6-054b-a9a3-ea21-4169b26a0a41@intel.com>
On Thu, Aug 02, 2018 at 02:04:12PM -0700, Nambiar, Amritha wrote:
> On 8/1/2018 5:11 PM, Andrei Vagin wrote:
> > On Tue, Jul 10, 2018 at 07:28:49PM -0700, Nambiar, Amritha wrote:
> >> With this patch series, I introduced static_key for XPS maps
> >> (xps_needed), so static_key_slow_inc() is used to switch branches. The
> >> definition of static_key_slow_inc() has cpus_read_lock in place. In the
> >> virtio_net driver, XPS queues are initialized after setting the
> >> queue:cpu affinity in virtnet_set_affinity() which is already protected
> >> within cpus_read_lock. Hence, the warning here trying to acquire
> >> cpus_read_lock when it is already held.
> >>
> >> A quick fix for this would be to just extract netif_set_xps_queue() out
> >> of the lock by simply wrapping it with another put/get_online_cpus
> >> (unlock right before and hold lock right after).
> >
> > virtnet_set_affinity() is called from virtnet_cpu_online(), which is
> > called under cpus_read_lock too.
> >
> > It looks like now we can't call netif_set_xps_queue() from cpu hotplug
> > callbacks.
> >
> > I can suggest a very straightforward fix for this problem. The patch is
> > attached.
> >
>
> Thanks for looking into this. I was thinking of fixing this in the
> virtio_net driver by moving the XPS initialization (and have a new
> get_affinity utility) in the ndo_open (so it is together with other tx
> preparation) instead of probe. Your patch solves this in general for
> setting up cpu hotplug callbacks which is under cpus_read_lock.
I like this too. Could you repost in a standard way
(inline, with your signoff etc) so we can ack this for
net-next?
> >> But this may not a
> >> clean solution. It'd help if I can get suggestions on what would be a
> >> clean option to fix this without extensively changing the code in
> >> virtio_net. Is it mandatory to protect the affinitization with
> >> read_lock? I don't see similar lock in other drivers while setting the
> >> affinity. I understand this warning should go away, but isn't it safe to
> >> have multiple readers.
> >>
> >>> On Fri, Jun 29, 2018 at 09:27:07PM -0700, Amritha Nambiar wrote:
^ permalink raw reply
* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Michael S. Tsirkin @ 2018-08-02 20:55 UTC (permalink / raw)
To: Anshuman Khandual
Cc: robh, srikar, benh, linuxram, linux-kernel, virtualization, hch,
paulus, mpe, joe, linuxppc-dev, elfring, haren, david
In-Reply-To: <20180720035941.6844-1-khandual@linux.vnet.ibm.com>
On Fri, Jul 20, 2018 at 09:29:37AM +0530, Anshuman Khandual wrote:
> This patch series is the follow up on the discussions we had before about
> the RFC titled [RFC,V2] virtio: Add platform specific DMA API translation
> for virito devices (https://patchwork.kernel.org/patch/10417371/). There
> were suggestions about doing away with two different paths of transactions
> with the host/QEMU, first being the direct GPA and the other being the DMA
> API based translations.
>
> First patch attempts to create a direct GPA mapping based DMA operations
> structure called 'virtio_direct_dma_ops' with exact same implementation
> of the direct GPA path which virtio core currently has but just wrapped in
> a DMA API format. Virtio core must use 'virtio_direct_dma_ops' instead of
> the arch default in absence of VIRTIO_F_IOMMU_PLATFORM flag to preserve the
> existing semantics. The second patch does exactly that inside the function
> virtio_finalize_features(). The third patch removes the default direct GPA
> path from virtio core forcing it to use DMA API callbacks for all devices.
> Now with that change, every device must have a DMA operations structure
> associated with it. The fourth patch adds an additional hook which gives
> the platform an opportunity to do yet another override if required. This
> platform hook can be used on POWER Ultravisor based protected guests to
> load up SWIOTLB DMA callbacks to do the required (as discussed previously
> in the above mentioned thread how host is allowed to access only parts of
> the guest GPA range) bounce buffering into the shared memory for all I/O
> scatter gather buffers to be consumed on the host side.
>
> Please go through these patches and review whether this approach broadly
> makes sense. I will appreciate suggestions, inputs, comments regarding
> the patches or the approach in general. Thank you.
Jason did some work on profiling this. Unfortunately he reports
about 4% extra overhead from this switch on x86 with no vIOMMU.
I expect he's writing up the data in more detail, but
just wanted to let you know this would be one more
thing to debug before we can just switch to DMA APIs.
> Anshuman Khandual (4):
> virtio: Define virtio_direct_dma_ops structure
> virtio: Override device's DMA OPS with virtio_direct_dma_ops selectively
> virtio: Force virtio core to use DMA API callbacks for all virtio devices
> virtio: Add platform specific DMA API translation for virito devices
>
> arch/powerpc/include/asm/dma-mapping.h | 6 +++
> arch/powerpc/platforms/pseries/iommu.c | 6 +++
> drivers/virtio/virtio.c | 72 ++++++++++++++++++++++++++++++++++
> drivers/virtio/virtio_pci_common.h | 3 ++
> drivers/virtio/virtio_ring.c | 65 +-----------------------------
> 5 files changed, 89 insertions(+), 63 deletions(-)
>
> --
> 2.9.3
^ permalink raw reply
* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Michael S. Tsirkin @ 2018-08-02 20:53 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: robh, srikar, mpe, Will Deacon, linux-kernel, linuxram,
virtualization, Christoph Hellwig, paulus, marc.zyngier, joe,
robin.murphy, david, linuxppc-dev, elfring, haren,
Anshuman Khandual
In-Reply-To: <7779442d7889ee943b3e4ff6c63ec90b4a58b79d.camel@kernel.crashing.org>
On Thu, Aug 02, 2018 at 10:33:05AM -0500, Benjamin Herrenschmidt wrote:
> On Thu, 2018-08-02 at 00:56 +0300, Michael S. Tsirkin wrote:
> > > but it's not, VMs are
> > > created in "legacy" mode all the times and we don't know at VM creation
> > > time whether it will become a secure VM or not. The way our secure VMs
> > > work is that they start as a normal VM, load a secure "payload" and
> > > call the Ultravisor to "become" secure.
> > >
> > > So we're in a bit of a bind here. We need that one-liner optional arch
> > > hook to make virtio use swiotlb in that "IOMMU bypass" case.
> > >
> > > Ben.
> >
> > And just to make sure I understand, on your platform DMA APIs do include
> > some of the cache flushing tricks and this is why you don't want to
> > declare iommu support in the hypervisor?
>
> I'm not sure I parse what you mean.
>
> We don't need cache flushing tricks.
You don't but do real devices on same platform need them?
> The problem we have with our
> "secure" VMs is that:
>
> - At VM creation time we have no idea it's going to become a secure
> VM, qemu doesn't know anything about it, and thus qemu (or other
> management tools, libvirt etc...) are going to create "legacy" (ie
> iommu bypass) virtio devices.
>
> - Once the VM goes secure (early during boot but too late for qemu),
> it will need to make virtio do bounce-buffering via swiotlb because
> qemu cannot physically access most VM pages (blocked by HW security
> features), we need to bounce buffer using some unsecure pages that are
> accessible to qemu.
>
> That said, I wouldn't object for us to more generally switch long run
> to changing qemu so that virtio on powerpc starts using the IOMMU as a
> default provided we fix our guest firmware to understand it (it
> currently doesn't), and provided we verify that the performance impact
> on things like vhost is negligible.
>
> Cheers,
> Ben.
>
^ permalink raw reply
* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Michael S. Tsirkin @ 2018-08-02 20:52 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: robh, srikar, mpe, Will Deacon, linux-kernel, linuxram,
virtualization, Christoph Hellwig, paulus, marc.zyngier, joe,
robin.murphy, david, linuxppc-dev, elfring, haren,
Anshuman Khandual
In-Reply-To: <e9beb6bba5a7d05741e27e85b24a3b23ba7c2762.camel@kernel.crashing.org>
On Thu, Aug 02, 2018 at 12:53:41PM -0500, Benjamin Herrenschmidt wrote:
> On Thu, 2018-08-02 at 20:19 +0300, Michael S. Tsirkin wrote:
> >
> > I see. So yes, given that device does not know or care, using
> > virtio features is an awkward fit.
> >
> > So let's say as a quick fix for you maybe we could generalize the
> > xen_domain hack, instead of just checking xen_domain check some static
> > branch. Then teach xen and others to enable that.>
>
> > OK but problem then becomes this: if you do this and virtio device appears
> > behind a vIOMMU and it does not advertize the IOMMU flag, the
> > code will try to use the vIOMMU mappings and fail.
> >
> > It does look like even with trick above, you need a special version of
> > DMA ops that does just swiotlb but not any of the other things DMA API
> > might do.
> >
> > Thoughts?
>
> Yes, this is the purpose of Anshuman original patch (I haven't looked
> at the details of the patch in a while but that's what I told him to
> implement ;-) :
>
> - Make virtio always use DMA ops to simplify the code path (with a set
> of "transparent" ops for legacy)
>
> and
>
> - Provide an arch hook allowing us to "override" those "transparent"
> DMA ops with some custom ones that do the appropriate swiotlb gunk.
>
> Cheers,
> Ben.
>
Right but as I tried to say doing that brings us to a bunch of issues
with using DMA APIs in virtio. Put simply DMA APIs weren't designed for
guest to hypervisor communication.
When we do (as is the case with PLATFORM_IOMMU right now) this adds a
bunch of overhead which we need to get rid of if we are to switch to
PLATFORM_IOMMU by default. We need to fix that.
--
MST
^ permalink raw reply
* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Benjamin Herrenschmidt @ 2018-08-02 17:53 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: robh, srikar, mpe, Will Deacon, linux-kernel, linuxram,
virtualization, Christoph Hellwig, paulus, marc.zyngier, joe,
robin.murphy, david, linuxppc-dev, elfring, haren,
Anshuman Khandual
In-Reply-To: <20180802200646-mutt-send-email-mst@kernel.org>
On Thu, 2018-08-02 at 20:19 +0300, Michael S. Tsirkin wrote:
>
> I see. So yes, given that device does not know or care, using
> virtio features is an awkward fit.
>
> So let's say as a quick fix for you maybe we could generalize the
> xen_domain hack, instead of just checking xen_domain check some static
> branch. Then teach xen and others to enable that.>
> OK but problem then becomes this: if you do this and virtio device appears
> behind a vIOMMU and it does not advertize the IOMMU flag, the
> code will try to use the vIOMMU mappings and fail.
>
> It does look like even with trick above, you need a special version of
> DMA ops that does just swiotlb but not any of the other things DMA API
> might do.
>
> Thoughts?
Yes, this is the purpose of Anshuman original patch (I haven't looked
at the details of the patch in a while but that's what I told him to
implement ;-) :
- Make virtio always use DMA ops to simplify the code path (with a set
of "transparent" ops for legacy)
and
- Provide an arch hook allowing us to "override" those "transparent"
DMA ops with some custom ones that do the appropriate swiotlb gunk.
Cheers,
Ben.
^ permalink raw reply
* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Michael S. Tsirkin @ 2018-08-02 17:19 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: robh, srikar, mpe, Will Deacon, linux-kernel, linuxram,
virtualization, Christoph Hellwig, paulus, marc.zyngier, joe,
robin.murphy, david, linuxppc-dev, elfring, haren,
Anshuman Khandual
In-Reply-To: <82ccef6ec3d95ee43f3990a4a2d0aea87eb45e89.camel@kernel.crashing.org>
On Thu, Aug 02, 2018 at 11:01:26AM -0500, Benjamin Herrenschmidt wrote:
> On Thu, 2018-08-02 at 18:41 +0300, Michael S. Tsirkin wrote:
> >
> > > I don't completely agree:
> > >
> > > 1 - VIRTIO_F_IOMMU_PLATFORM is a property of the "other side", ie qemu
> > > for example. It indicates that the peer bypasses the normal platform
> > > iommu. The platform code in the guest has no real way to know that this
> > > is the case, this is a specific "feature" of the qemu implementation.
> > >
> > > 2 - VIRTIO_F_PLATFORM_DMA (or whatever you want to call it), is a
> > > property of the guest platform itself (not qemu), there's no way the
> > > "peer" can advertize it via the virtio negociated flags. At least for
> > > us. I don't know for sure whether that would be workable for the ARM
> > > case. In our case, qemu has no idea at VM creation time that the VM
> > > will turn itself into a secure VM and thus will require bounce
> > > buffering for IOs (including virtio).
> > >
> > > So unless we have another hook for the arch code to set
> > > VIRTIO_F_PLATFORM_DMA on selected (or all) virtio devices from the
> > > guest itself, I don't see that as a way to deal with it.
> > >
> > > > The other issue is VIRTIO_F_IO_BARRIER
> > > > which is very vaguely defined, and which needs a better definition.
> > > > And last but not least we'll need some text explaining the challenges
> > > > of hardware devices - I think VIRTIO_F_PLATFORM_DMA + VIRTIO_F_IO_BARRIER
> > > > is what would basically cover them, but a good description including
> > > > an explanation of why these matter.
> > >
> > > Ben.
> > >
> >
> > So is it true that from qemu point of view there is nothing special
> > going on? You pass in a PA, host writes there.
>
> Yes, qemu doesn't see a different. It's the guest that will bounce the
> pages via a pool of "insecure" pages that qemu can access. Normal pages
> in a secure VM come from PAs that qemu cannot physically access.
>
> Cheers,
> Ben.
>
I see. So yes, given that device does not know or care, using
virtio features is an awkward fit.
So let's say as a quick fix for you maybe we could generalize the
xen_domain hack, instead of just checking xen_domain check some static
branch. Then teach xen and others to enable that.
OK but problem then becomes this: if you do this and virtio device appears
behind a vIOMMU and it does not advertize the IOMMU flag, the
code will try to use the vIOMMU mappings and fail.
It does look like even with trick above, you need a special version of
DMA ops that does just swiotlb but not any of the other things DMA API
might do.
Thoughts?
--
MST
^ permalink raw reply
* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Benjamin Herrenschmidt @ 2018-08-02 16:01 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: robh, srikar, mpe, Will Deacon, linux-kernel, linuxram,
virtualization, Christoph Hellwig, paulus, marc.zyngier, joe,
robin.murphy, david, linuxppc-dev, elfring, haren,
Anshuman Khandual
In-Reply-To: <20180802182959-mutt-send-email-mst@kernel.org>
On Thu, 2018-08-02 at 18:41 +0300, Michael S. Tsirkin wrote:
>
> > I don't completely agree:
> >
> > 1 - VIRTIO_F_IOMMU_PLATFORM is a property of the "other side", ie qemu
> > for example. It indicates that the peer bypasses the normal platform
> > iommu. The platform code in the guest has no real way to know that this
> > is the case, this is a specific "feature" of the qemu implementation.
> >
> > 2 - VIRTIO_F_PLATFORM_DMA (or whatever you want to call it), is a
> > property of the guest platform itself (not qemu), there's no way the
> > "peer" can advertize it via the virtio negociated flags. At least for
> > us. I don't know for sure whether that would be workable for the ARM
> > case. In our case, qemu has no idea at VM creation time that the VM
> > will turn itself into a secure VM and thus will require bounce
> > buffering for IOs (including virtio).
> >
> > So unless we have another hook for the arch code to set
> > VIRTIO_F_PLATFORM_DMA on selected (or all) virtio devices from the
> > guest itself, I don't see that as a way to deal with it.
> >
> > > The other issue is VIRTIO_F_IO_BARRIER
> > > which is very vaguely defined, and which needs a better definition.
> > > And last but not least we'll need some text explaining the challenges
> > > of hardware devices - I think VIRTIO_F_PLATFORM_DMA + VIRTIO_F_IO_BARRIER
> > > is what would basically cover them, but a good description including
> > > an explanation of why these matter.
> >
> > Ben.
> >
>
> So is it true that from qemu point of view there is nothing special
> going on? You pass in a PA, host writes there.
Yes, qemu doesn't see a different. It's the guest that will bounce the
pages via a pool of "insecure" pages that qemu can access. Normal pages
in a secure VM come from PAs that qemu cannot physically access.
Cheers,
Ben.
^ permalink raw reply
* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Michael S. Tsirkin @ 2018-08-02 15:41 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: robh, srikar, mpe, Will Deacon, linux-kernel, linuxram,
virtualization, Christoph Hellwig, paulus, marc.zyngier, joe,
robin.murphy, david, linuxppc-dev, elfring, haren,
Anshuman Khandual
In-Reply-To: <26c1d3d50d8e081eed44fe9940fbefed34598cbd.camel@kernel.crashing.org>
On Thu, Aug 02, 2018 at 10:24:57AM -0500, Benjamin Herrenschmidt wrote:
> On Wed, 2018-08-01 at 01:36 -0700, Christoph Hellwig wrote:
> > We just need to figure out how to deal with devices that deviate
> > from the default. One things is that VIRTIO_F_IOMMU_PLATFORM really
> > should become VIRTIO_F_PLATFORM_DMA to cover the cases of non-iommu
> > dma tweaks (offsets, cache flushing), which seems well in spirit of
> > the original design.
>
> I don't completely agree:
>
> 1 - VIRTIO_F_IOMMU_PLATFORM is a property of the "other side", ie qemu
> for example. It indicates that the peer bypasses the normal platform
> iommu. The platform code in the guest has no real way to know that this
> is the case, this is a specific "feature" of the qemu implementation.
>
> 2 - VIRTIO_F_PLATFORM_DMA (or whatever you want to call it), is a
> property of the guest platform itself (not qemu), there's no way the
> "peer" can advertize it via the virtio negociated flags. At least for
> us. I don't know for sure whether that would be workable for the ARM
> case. In our case, qemu has no idea at VM creation time that the VM
> will turn itself into a secure VM and thus will require bounce
> buffering for IOs (including virtio).
>
> So unless we have another hook for the arch code to set
> VIRTIO_F_PLATFORM_DMA on selected (or all) virtio devices from the
> guest itself, I don't see that as a way to deal with it.
>
> > The other issue is VIRTIO_F_IO_BARRIER
> > which is very vaguely defined, and which needs a better definition.
> > And last but not least we'll need some text explaining the challenges
> > of hardware devices - I think VIRTIO_F_PLATFORM_DMA + VIRTIO_F_IO_BARRIER
> > is what would basically cover them, but a good description including
> > an explanation of why these matter.
>
> Ben.
>
So is it true that from qemu point of view there is nothing special
going on? You pass in a PA, host writes there.
--
MST
^ permalink raw reply
* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Benjamin Herrenschmidt @ 2018-08-02 15:33 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: robh, srikar, mpe, Will Deacon, linux-kernel, linuxram,
virtualization, Christoph Hellwig, paulus, marc.zyngier, joe,
robin.murphy, david, linuxppc-dev, elfring, haren,
Anshuman Khandual
In-Reply-To: <20180802003823-mutt-send-email-mst@kernel.org>
On Thu, 2018-08-02 at 00:56 +0300, Michael S. Tsirkin wrote:
> > but it's not, VMs are
> > created in "legacy" mode all the times and we don't know at VM creation
> > time whether it will become a secure VM or not. The way our secure VMs
> > work is that they start as a normal VM, load a secure "payload" and
> > call the Ultravisor to "become" secure.
> >
> > So we're in a bit of a bind here. We need that one-liner optional arch
> > hook to make virtio use swiotlb in that "IOMMU bypass" case.
> >
> > Ben.
>
> And just to make sure I understand, on your platform DMA APIs do include
> some of the cache flushing tricks and this is why you don't want to
> declare iommu support in the hypervisor?
I'm not sure I parse what you mean.
We don't need cache flushing tricks. The problem we have with our
"secure" VMs is that:
- At VM creation time we have no idea it's going to become a secure
VM, qemu doesn't know anything about it, and thus qemu (or other
management tools, libvirt etc...) are going to create "legacy" (ie
iommu bypass) virtio devices.
- Once the VM goes secure (early during boot but too late for qemu),
it will need to make virtio do bounce-buffering via swiotlb because
qemu cannot physically access most VM pages (blocked by HW security
features), we need to bounce buffer using some unsecure pages that are
accessible to qemu.
That said, I wouldn't object for us to more generally switch long run
to changing qemu so that virtio on powerpc starts using the IOMMU as a
default provided we fix our guest firmware to understand it (it
currently doesn't), and provided we verify that the performance impact
on things like vhost is negligible.
Cheers,
Ben.
^ permalink raw reply
* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Benjamin Herrenschmidt @ 2018-08-02 15:24 UTC (permalink / raw)
To: Christoph Hellwig, Will Deacon
Cc: robh, srikar, Michael S. Tsirkin, mpe, linuxram, linux-kernel,
virtualization, paulus, marc.zyngier, joe, robin.murphy, david,
linuxppc-dev, elfring, haren, Anshuman Khandual
In-Reply-To: <20180801083639.GF26378@infradead.org>
On Wed, 2018-08-01 at 01:36 -0700, Christoph Hellwig wrote:
> We just need to figure out how to deal with devices that deviate
> from the default. One things is that VIRTIO_F_IOMMU_PLATFORM really
> should become VIRTIO_F_PLATFORM_DMA to cover the cases of non-iommu
> dma tweaks (offsets, cache flushing), which seems well in spirit of
> the original design.
I don't completely agree:
1 - VIRTIO_F_IOMMU_PLATFORM is a property of the "other side", ie qemu
for example. It indicates that the peer bypasses the normal platform
iommu. The platform code in the guest has no real way to know that this
is the case, this is a specific "feature" of the qemu implementation.
2 - VIRTIO_F_PLATFORM_DMA (or whatever you want to call it), is a
property of the guest platform itself (not qemu), there's no way the
"peer" can advertize it via the virtio negociated flags. At least for
us. I don't know for sure whether that would be workable for the ARM
case. In our case, qemu has no idea at VM creation time that the VM
will turn itself into a secure VM and thus will require bounce
buffering for IOs (including virtio).
So unless we have another hook for the arch code to set
VIRTIO_F_PLATFORM_DMA on selected (or all) virtio devices from the
guest itself, I don't see that as a way to deal with it.
> The other issue is VIRTIO_F_IO_BARRIER
> which is very vaguely defined, and which needs a better definition.
> And last but not least we'll need some text explaining the challenges
> of hardware devices - I think VIRTIO_F_PLATFORM_DMA + VIRTIO_F_IO_BARRIER
> is what would basically cover them, but a good description including
> an explanation of why these matter.
Ben.
^ permalink raw reply
* Re: [PATCH v2 2/2] virtio_balloon: replace oom notifier with shrinker
From: Michael S. Tsirkin @ 2018-08-02 15:18 UTC (permalink / raw)
To: Wei Wang
Cc: virtio-dev, linux-kernel, virtualization, linux-mm, akpm,
Michal Hocko
In-Reply-To: <5B62DDCC.3030100@intel.com>
On Thu, Aug 02, 2018 at 06:32:44PM +0800, Wei Wang wrote:
> On 08/01/2018 07:34 PM, Michal Hocko wrote:
> > On Wed 01-08-18 19:12:25, Wei Wang wrote:
> > > On 07/30/2018 05:00 PM, Michal Hocko wrote:
> > > > On Fri 27-07-18 17:24:55, Wei Wang wrote:
> > > > > The OOM notifier is getting deprecated to use for the reasons mentioned
> > > > > here by Michal Hocko: https://lkml.org/lkml/2018/7/12/314
> > > > >
> > > > > This patch replaces the virtio-balloon oom notifier with a shrinker
> > > > > to release balloon pages on memory pressure.
> > > > It would be great to document the replacement. This is not a small
> > > > change...
> > > OK. I plan to document the following to the commit log:
> > >
> > > The OOM notifier is getting deprecated to use for the reasons:
> > > - As a callout from the oom context, it is too subtle and easy to
> > > generate bugs and corner cases which are hard to track;
> > > - It is called too late (after the reclaiming has been performed).
> > > Drivers with large amuont of reclaimable memory is expected to be
> > > released them at an early age of memory pressure;
> > > - The notifier callback isn't aware of the oom contrains;
> > > Link: https://lkml.org/lkml/2018/7/12/314
> > >
> > > This patch replaces the virtio-balloon oom notifier with a shrinker
> > > to release balloon pages on memory pressure. Users can set the amount of
> > > memory pages to release each time a shrinker_scan is called via the
> > > module parameter balloon_pages_to_shrink, and the default amount is 256
> > > pages. Historically, the feature VIRTIO_BALLOON_F_DEFLATE_ON_OOM has
> > > been used to release balloon pages on OOM. We continue to use this
> > > feature bit for the shrinker, so the shrinker is only registered when
> > > this feature bit has been negotiated with host.
> > Do you have any numbers for how does this work in practice?
>
> It works in this way: for example, we can set the parameter,
> balloon_pages_to_shrink, to shrink 1GB memory once shrink scan is called.
> Now, we have a 8GB guest, and we balloon out 7GB. When shrink scan is
> called, the balloon driver will get back 1GB memory and give them back to
> mm, then the ballooned memory becomes 6GB.
>
> When the shrinker scan is called the second time, another 1GB will be given
> back to mm. So the ballooned pages are given back to mm gradually.
I think what's being asked here is a description of tests that
were run. Which workloads see improved behaviour?
Our behaviour under memory pressure isn't great, in particular it is not
clear when it's safe to re-inflate the balloon, if host attempts to
re-inflate it too soon then we still get OOM. It would be better
if VIRTIO_BALLOON_F_DEFLATE_ON_OOM would somehow mean
"it's ok to ask for almost all of memory, if guest needs memory from
balloon for apps to function it can take it from the balloon".
--
MST
^ permalink raw reply
* Re: [PATCH v2 2/2] virtio_balloon: replace oom notifier with shrinker
From: Michal Hocko @ 2018-08-02 11:47 UTC (permalink / raw)
To: Wei Wang; +Cc: virtio-dev, mst, linux-kernel, virtualization, linux-mm, akpm
In-Reply-To: <5B62DDCC.3030100@intel.com>
On Thu 02-08-18 18:32:44, Wei Wang wrote:
> On 08/01/2018 07:34 PM, Michal Hocko wrote:
> > On Wed 01-08-18 19:12:25, Wei Wang wrote:
> > > On 07/30/2018 05:00 PM, Michal Hocko wrote:
> > > > On Fri 27-07-18 17:24:55, Wei Wang wrote:
> > > > > The OOM notifier is getting deprecated to use for the reasons mentioned
> > > > > here by Michal Hocko: https://lkml.org/lkml/2018/7/12/314
> > > > >
> > > > > This patch replaces the virtio-balloon oom notifier with a shrinker
> > > > > to release balloon pages on memory pressure.
> > > > It would be great to document the replacement. This is not a small
> > > > change...
> > > OK. I plan to document the following to the commit log:
> > >
> > > The OOM notifier is getting deprecated to use for the reasons:
> > > - As a callout from the oom context, it is too subtle and easy to
> > > generate bugs and corner cases which are hard to track;
> > > - It is called too late (after the reclaiming has been performed).
> > > Drivers with large amuont of reclaimable memory is expected to be
> > > released them at an early age of memory pressure;
> > > - The notifier callback isn't aware of the oom contrains;
> > > Link: https://lkml.org/lkml/2018/7/12/314
> > >
> > > This patch replaces the virtio-balloon oom notifier with a shrinker
> > > to release balloon pages on memory pressure. Users can set the amount of
> > > memory pages to release each time a shrinker_scan is called via the
> > > module parameter balloon_pages_to_shrink, and the default amount is 256
> > > pages. Historically, the feature VIRTIO_BALLOON_F_DEFLATE_ON_OOM has
> > > been used to release balloon pages on OOM. We continue to use this
> > > feature bit for the shrinker, so the shrinker is only registered when
> > > this feature bit has been negotiated with host.
> > Do you have any numbers for how does this work in practice?
>
> It works in this way: for example, we can set the parameter,
> balloon_pages_to_shrink, to shrink 1GB memory once shrink scan is called.
> Now, we have a 8GB guest, and we balloon out 7GB. When shrink scan is
> called, the balloon driver will get back 1GB memory and give them back to
> mm, then the ballooned memory becomes 6GB.
>
> When the shrinker scan is called the second time, another 1GB will be given
> back to mm. So the ballooned pages are given back to mm gradually.
>
> > Let's say
> > you have a medium page cache workload which triggers kswapd to do a
> > light reclaim? Hardcoded shrinking sounds quite dubious to me but I have
> > no idea how people expect this to work. Shouldn't this be more
> > adaptive? How precious are those pages anyway?
>
> Those pages are given to host to use usually because the guest has enough
> free memory, and host doesn't want to waste those pieces of memory as they
> are not used by this guest. When the guest needs them, it is reasonable that
> the guest has higher priority to take them back.
> But I'm not sure if there would be a more adaptive approach than "gradually
> giving back as the guest wants more".
I am not sure I follow. Let me be more specific. Say you have a trivial
stream IO triggering reclaim to recycle clean page cache. This will
invoke slab shrinkers as well. Do you really want to drop your batch of
pages on each invocation? Doesn't that remove them very quickly? Just
try to dd if=large_file of=/dev/null and see how your pages are
disappearing. Shrinkers usually scale the number of objects they are
going to reclaim based on the memory pressure (aka targer to be
reclaimed).
--
Michal Hocko
SUSE Labs
^ permalink raw reply
* CFP ICEIS 2019 - 21st Int.l Conf. on Enterprise Information Systems (Heraklion, Crete/Greece)
From: iceis @ 2018-08-02 11:31 UTC (permalink / raw)
To: virtualization
SUBMISSION DEADLINE
21st International Conference on Enterprise Information Systems
Submission Deadline: December 10, 2018
http://www.iceis.org/
May 3 - 5, 2019
Heraklion, Crete, Greece.
ICEIS is organized in 6 major tracks:
- Databases and Information Systems Integration
- Artificial Intelligence and Decision Support Systems
- Information Systems Analysis and Specification
- Software Agents and Internet Computing
- Human-Computer Interaction
- Enterprise Architecture
Proceedings will be submitted for indexation by: DBLP, Thomson Reuters, EI, SCOPUS and Semantic Scholar. <br/>
A short list of presented papers will be selected so that revised and extended versions of these papers will be published by Springer.
All papers presented at the congress venue will also be available at the SCITEPRESS Digital Library (http://www.scitepress.org/DigitalLibrary/).
Should you have any question please don’t hesitate contacting me.
Kind regards,
ICEIS Secretariat
Address: Av. D. Manuel I, 27A, 2º esq.
2910-595 Setubal, Portugal
Tel: +351 265 520 184
Fax: +351 265 520 186
Web: http://www.iceis.org/
e-mail: iceis.secretariat@insticc.org
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* CFP CLOSER 2019 - 8th Int.l Conf. on Cloud Computing and Services Science (Heraklion, Crete/Greece)
From: closer @ 2018-08-02 11:31 UTC (permalink / raw)
To: virtualization
SUBMISSION DEADLINE
8th International Conference on Cloud Computing and Services Science
Submission Deadline: December 10, 2018
http://closer.scitevents.org
May 2 - 4, 2019
Heraklion, Crete, Greece.
CLOSER is organized in 9 major tracks:
- Services Science
- Data as a Service
- Cloud Operations
- Edge Cloud and Fog Computing
- Service Modelling and Analytics
- Mobile Cloud Computing
- Cloud Computing Fundamentals
- Cloud Computing Platforms and Applications
- Cloud Computing Enabling Technology
Proceedings will be submitted for indexation by: DBLP, Thomson Reuters, EI, SCOPUS and Semantic Scholar. <br/>
A short list of presented papers will be selected so that revised and extended versions of these papers will be published by Springer.
All papers presented at the congress venue will also be available at the SCITEPRESS Digital Library (http://www.scitepress.org/DigitalLibrary/).
Should you have any question please don’t hesitate contacting me.
Kind regards,
CLOSER Secretariat
Address: Av. D. Manuel I, 27A, 2º esq.
2910-595 Setubal, Portugal
Tel: +351 265 520 184
Fax: +351 265 520 186
Web: http://closer.scitevents.org
e-mail: closer.secretariat@insticc.org
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* CFP VEHITS 2019 - 5th Int.l Conf. on Vehicle Technology and Intelligent Transport Systems (Heraklion, Crete/Greece)
From: vehits @ 2018-08-02 11:30 UTC (permalink / raw)
To: virtualization
SUBMISSION DEADLINE
5th International Conference on Vehicle Technology and Intelligent Transport Systems
Submission Deadline: December 10, 2018
http://www.vehits.org/
May 3 - 5, 2019
Heraklion, Crete, Greece.
VEHITS is organized in 5 major tracks:
- Intelligent Vehicle Technologies
- Intelligent Transport Systems and Infrastructure
- Connected Vehicles
- Sustainable Transport
- Data Analytics
Proceedings will be submitted for indexation by: DBLP, Thomson Reuters, EI, SCOPUS and Semantic Scholar. <br/>
With the presence of internationally distinguished keynote speakers:
Javier Sánchez-Medina, University of Las Palmas de Gran Canaria, Spain
Jeroen Ploeg, 2getthere B.V., Utrecht, The Netherlands Lead Cooperative Driving Eindhoven University of Technology, Eindhoven, The Netherlands Associate professor (part time), faculty of Mechanical Engineering, Dynamics and Control group, Netherlands
A short list of presented papers will be selected so that revised and extended versions of these papers will be published by Springer.
All papers presented at the congress venue will also be available at the SCITEPRESS Digital Library (http://www.scitepress.org/DigitalLibrary/).
Should you have any question please don’t hesitate contacting me.
Kind regards,
VEHITS Secretariat
Address: Av. D. Manuel I, 27A, 2º esq.
2910-595 Setubal, Portugal
Tel: +351 265 520 185
Fax: +351 265 520 186
Web: http://www.vehits.org/
e-mail: vehits.secretariat@insticc.org
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* CFP SMARTGREENS 2019 - 8th Int.l Conf. on Smart Cities and Green ICT Systems (Heraklion, Crete/Greece)
From: smartgreens @ 2018-08-02 11:30 UTC (permalink / raw)
To: virtualization
SUBMISSION DEADLINE
8th International Conference on Smart Cities and Green ICT Systems
Submission Deadline: December 10, 2018
http://www.smartgreens.org/
May 3 - 5, 2019
Heraklion, Crete, Greece.
SMARTGREENS is organized in 5 major tracks:
- Energy-Aware Systems and Technologies
- Sustainable Computing and Systems
- Smart Cities and Smart Buildings
- Demos and Use-Cases
- Smart and Digital Services
Proceedings will be submitted for indexation by: DBLP, Thomson Reuters, EI, SCOPUS and Semantic Scholar. <br/>
With the presence of internationally distinguished keynote speakers:
Norbert Streitz, Founder and Scientific Director, Smart Future Initiative, Germany
Rudolf Giffinger, Vienna University of Technology, Austria
A short list of presented papers will be selected so that revised and extended versions of these papers will be published by Springer.
All papers presented at the congress venue will also be available at the SCITEPRESS Digital Library (http://www.scitepress.org/DigitalLibrary/).
Should you have any question please don’t hesitate contacting me.
Kind regards,
SMARTGREENS Secretariat
Address: Av. D. Manuel I, 27A, 2º esq.
2910-595 Setubal, Portugal
Tel: +351 265 520 185
Fax: +351 265 520 186
Web: http://www.smartgreens.org/
e-mail: smartgreens.secretariat@insticc.org
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* CFP IoTBDS 2019 - 3rd Int.l Conf. on Internet of Things, Big Data and Security (Heraklion, Crete/Greece)
From: iotbds @ 2018-08-02 11:30 UTC (permalink / raw)
To: virtualization
SUBMISSION DEADLINE
3rd International Conference on Internet of Things, Big Data and Security
Submission Deadline: December 10, 2018
http://iotbds.org/
May 2 - 4, 2019
Heraklion, Crete, Greece.
IoTBDS is organized in 7 major tracks:
- Big Data Research
- Emerging Services and Analytics
- Internet of Things (IoT) Fundamentals
- Internet of Things (IoT) Applications
- Big Data for Multi-discipline Services
- Security, Privacy and Trust
- IoT Technologies
Proceedings will be submitted for indexation by: DBLP, Thomson Reuters, EI, SCOPUS and Semantic Scholar. <br/>
With the presence of internationally distinguished keynote speakers:
Francisco Herrera, University of Granada, Spain
A short list of presented papers will be selected so that revised and extended versions of these papers will be published by Springer.
All papers presented at the congress venue will also be available at the SCITEPRESS Digital Library (http://www.scitepress.org/DigitalLibrary/).
Should you have any question please don’t hesitate contacting me.
Kind regards,
IoTBDS Secretariat
Address: Av. D. Manuel I, 27A, 2º esq.
2910-595 Setubal, Portugal
Tel: +351 265 520 184
Fax: +351 265 520 186
Web: http://iotbds.org/
e-mail: iotbds.secretariat@insticc.org
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* CFP SENSORNETS 2019 - 8th Int.l Conf. on Sensor Networks (Prague/Czech Republic)
From: sensornets @ 2018-08-02 11:30 UTC (permalink / raw)
To: virtualization
SUBMISSION DEADLINE
8th International Conference on Sensor Networks
Submission Deadline: October 1, 2018
http://www.sensornets.org/
February 26 - 27, 2019
Prague, Czech Republic.
SENSORNETS is organized in 5 major tracks:
- Sensor Networks Software, Architectures and Applications
- Wireless Sensor Networks
- Energy and Environment
- Intelligent Data Analysis and Processing
- Security and Privacy in Sensor Networks
Proceedings will be submitted for indexation by: DBLP, Thomson Reuters, EI, SCOPUS and Semantic Scholar. <br/>
A short list of presented papers will be selected so that revised and extended versions of these papers will be published by Springer.
All papers presented at the congress venue will also be available at the SCITEPRESS Digital Library (http://www.scitepress.org/DigitalLibrary/).
Should you have any question please don’t hesitate contacting me.
Kind regards,
SENSORNETS Secretariat
Address: Av. D. Manuel I, 27A, 2º esq.
2910-595 Setubal, Portugal
Tel: +351 265 100 033
Fax: +351 265 520 186
Web: http://www.sensornets.org/
e-mail: sensornets.secretariat@insticc.org
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox