* Re: [PATCH] vhost_net: use packet weight for rx handler, too
From: Jason Wang @ 2018-04-24 9:11 UTC (permalink / raw)
To: Paolo Abeni, kvm; +Cc: netdev, virtualization, haibinzhang, Michael S. Tsirkin
In-Reply-To: <11f2a27cee0c660a611af381ac1b68d9526095e3.1524556673.git.pabeni@redhat.com>
On 2018-04-24 16:34, Paolo Abeni wrote:
> Similar to commit a2ac99905f1e ("vhost-net: set packet weight of
> tx polling to 2 * vq size"), we need a packet-based limit for
> handle_rx, too - otherwise, under rx flood with small packets,
> tx can be delayed for a very long time, even without busypolling.
>
> The pkt limit applied to handle_rx must be the same applied by
> handle_tx, or we will get unfair scheduling between rx and tx.
> Tying such limit to the queue length makes it less effective for
> large queue length values and can introduce large process
> scheduler latencies, so a constant value is used - like
> the existing bytes limit.
>
> The selected limit has been validated with PVP[1] performance
> test with different queue sizes:
>
> queue size     256    512   1024
>
> baseline       366    354    362
> weight 128     715    723    670
> weight 256     740    745    733
> weight 512     600    460    583
> weight 1024    423    427    418
>
> A packet weight of 256 gives peak performance in all the
> tested scenarios.
>
> No measurable regression in unidirectional performance tests has
> been detected.
>
> [1] https://developers.redhat.com/blog/2017/06/05/measuring-and-comparing-open-vswitch-performance/
>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> drivers/vhost/net.c | 12 ++++++++----
> 1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index bbf38befefb2..c4b49fca4871 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -46,8 +46,10 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
> #define VHOST_NET_WEIGHT 0x80000
>
> /* Max number of packets transferred before requeueing the job.
> - * Using this limit prevents one virtqueue from starving rx. */
> -#define VHOST_NET_PKT_WEIGHT(vq) ((vq)->num * 2)
> + * Using this limit prevents one virtqueue from starving others with small
> + * pkts.
> + */
> +#define VHOST_NET_PKT_WEIGHT 256
>
> /* MAX number of TX used buffers for outstanding zerocopy */
> #define VHOST_MAX_PEND 128
> @@ -587,7 +589,7 @@ static void handle_tx(struct vhost_net *net)
> vhost_zerocopy_signal_used(net, vq);
> vhost_net_tx_packet(net);
> if (unlikely(total_len >= VHOST_NET_WEIGHT) ||
> - unlikely(++sent_pkts >= VHOST_NET_PKT_WEIGHT(vq))) {
> + unlikely(++sent_pkts >= VHOST_NET_PKT_WEIGHT)) {
> vhost_poll_queue(&vq->poll);
> break;
> }
> @@ -769,6 +771,7 @@ static void handle_rx(struct vhost_net *net)
> struct socket *sock;
> struct iov_iter fixup;
> __virtio16 num_buffers;
> + int recv_pkts = 0;
>
> mutex_lock_nested(&vq->mutex, 0);
> sock = vq->private_data;
> @@ -872,7 +875,8 @@ static void handle_rx(struct vhost_net *net)
> if (unlikely(vq_log))
> vhost_log_write(vq, vq_log, log, vhost_len);
> total_len += vhost_len;
> - if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
> + if (unlikely(total_len >= VHOST_NET_WEIGHT) ||
> + unlikely(++recv_pkts >= VHOST_NET_PKT_WEIGHT)) {
> vhost_poll_queue(&vq->poll);
> goto out;
> }
The numbers look impressive.
Acked-by: Jason Wang <jasowang@redhat.com>
Thanks!
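For readers following the thread, the two budgets the patch combines can be modelled in userspace. This is a minimal sketch, not the vhost code: handle_pass(), struct poll_result and the BYTE_WEIGHT/PKT_WEIGHT names are hypothetical, with the values copied from VHOST_NET_WEIGHT and the new VHOST_NET_PKT_WEIGHT in the quoted diff.

```c
#include <stdbool.h>
#include <stddef.h>

/* Budgets mirroring the quoted patch: a byte limit and a constant packet limit. */
#define BYTE_WEIGHT 0x80000
#define PKT_WEIGHT  256

struct poll_result {
	size_t pkts;     /* packets handled in this pass */
	size_t bytes;    /* bytes handled in this pass */
	bool   requeued; /* true if the pass stopped early on a budget */
};

/*
 * Hypothetical model of one handler pass: consume packets of the given
 * lengths until the queue is empty or either budget is exhausted.
 */
static struct poll_result handle_pass(const size_t *pkt_len, size_t n)
{
	struct poll_result r = { 0, 0, false };
	size_t i;

	for (i = 0; i < n; i++) {
		r.bytes += pkt_len[i];
		r.pkts++;
		if (r.bytes >= BYTE_WEIGHT || r.pkts >= PKT_WEIGHT) {
			/* Yield so that the other direction gets a turn. */
			r.requeued = true;
			break;
		}
	}
	return r;
}
```

In this model a flood of 64-byte packets trips the packet budget after 256 packets, long before the byte budget would, while 64 KiB packets trip the byte budget first, after 8 packets.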
* Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
From: Mikulas Patocka @ 2018-04-24 11:04 UTC (permalink / raw)
To: David Rientjes
Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
Michal Hocko, linux-mm, edumazet, Andrew Morton, virtualization,
David Miller, Vlastimil Babka
In-Reply-To: <alpine.DEB.2.21.1804231945410.58980@chino.kir.corp.google.com>
On Mon, 23 Apr 2018, David Rientjes wrote:
> On Mon, 23 Apr 2018, Mikulas Patocka wrote:
>
> > The kvmalloc function tries to use kmalloc and falls back to vmalloc if
> > kmalloc fails.
> >
> > Unfortunately, some kernel code has bugs - it uses kvmalloc and then
> > uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> > found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> > code.
> >
> > These bugs are hard to reproduce because kvmalloc falls back to vmalloc
> > only if memory is fragmented.
> >
> > In order to detect these bugs reliably I submit this patch that changes
> > kvmalloc to fall back to vmalloc with 1/2 probability if CONFIG_DEBUG_SG
> > is turned on. CONFIG_DEBUG_SG is used, because it makes the DMA API layer
> > verify the addresses passed to it, and so the user will get a reliable
> > stacktrace.
>
> Why not just do it unconditionally? Sounds better than "50% of the time
> this will catch bugs".
Because kmalloc (with slub_debug) detects buffer overflows better than
vmalloc. vmalloc detects buffer overflows only at a page boundary. This is
intended for debugging kernels and debugging kernels should detect as many
bugs as possible.
Mikulas
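The page-boundary point can be put in numbers. The sketch below is an illustration only, assuming a 4096-byte page; page_align() and vmalloc_slack() are hypothetical helper names, not kernel functions. It computes how many bytes past the end of a request still land inside the last mapped page, where an overflowing write would not fault.

```c
#include <stddef.h>

#define PAGE_SIZE 4096u /* assumed page size for this illustration */

/* Round size up to the next multiple of the page size. */
static size_t page_align(size_t size)
{
	return (size + PAGE_SIZE - 1) / PAGE_SIZE * PAGE_SIZE;
}

/*
 * Bytes beyond the requested size that still fall inside the last mapped
 * page. A write into this range does not cross a page boundary.
 */
static size_t vmalloc_slack(size_t size)
{
	return page_align(size) - size;
}
```

For a 5000-byte request this gives 3192 bytes of slack; only a request that is an exact multiple of the page size has none.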
* Re: [PATCH] gpu: drm: qxl: Adding new typedef vm_fault_t
From: Daniel Vetter @ 2018-04-24 12:11 UTC (permalink / raw)
To: Gerd Hoffmann
Cc: willy, airlied, Gustavo Padovan, Maarten Lankhorst, linux-kernel,
dri-devel, virtualization, Sean Paul, Souptick Joarder, airlied
In-Reply-To: <20180423104924.ozalv7x5ppjifnxc@sirius.home.kraxel.org>
On Mon, Apr 23, 2018 at 12:49:24PM +0200, Gerd Hoffmann wrote:
> On Tue, Apr 17, 2018 at 07:08:44PM +0530, Souptick Joarder wrote:
> > Use new return type vm_fault_t for fault handler. For
> > now, this is just documenting that the function returns
> > a VM_FAULT value rather than an errno. Once all instances
> > are converted, vm_fault_t will become a distinct type.
> >
> > Reference id -> 1c8f422059ae ("mm: change return type to
> > vm_fault_t")
>
> Hmm, that commit isn't yet in drm-misc-next.
> Will drm-misc-next merge with 4.17-rcX soon?
For backmerge requests you need to cc/ping the drm-misc maintainers.
Adding them. I think the hold-up was also that Dave was still on
vacation.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
* Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
From: Mikulas Patocka @ 2018-04-24 12:29 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Christoph Lameter, dm-devel, eric.dumazet, mst, netdev,
linux-kernel, Michal Hocko, Pekka Enberg, linux-mm, edumazet,
David Rientjes, Joonsoo Kim, Andrew Morton, virtualization,
David Miller, Vlastimil Babka
In-Reply-To: <20180424034643.GA26636@bombadil.infradead.org>
On Mon, 23 Apr 2018, Matthew Wilcox wrote:
> On Mon, Apr 23, 2018 at 08:06:16PM -0400, Mikulas Patocka wrote:
> > Some bugs (such as buffer overflows) are better detected
> > with kmalloc code, so we must test the kmalloc path too.
>
> Well now, this brings up another item for the collective TODO list --
> implement redzone checks for vmalloc. Unless this is something already
> taken care of by kasan or similar.
The kmalloc overflow testing is also not ideal - it rounds the size up to
the next slab size and detects buffer overflows only at this boundary.
Some time ago, I made a "kmalloc guard" patch that places a magic number
immediately after the requested size - so that it can detect overflows at
a byte boundary
( https://www.redhat.com/archives/dm-devel/2014-September/msg00018.html )
That patch found a bug in crypto code:
( http://lkml.iu.edu/hypermail/linux/kernel/1409.1/02325.html )
Mikulas
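The guard idea can be sketched in userspace as follows. This is not the linked patch: guarded_alloc(), guarded_check(), guarded_free(), the header layout and the magic byte are all assumptions made for illustration, built on malloc() rather than the slab allocator.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_MAGIC 0x5a /* arbitrary byte pattern chosen for illustration */
#define GUARD_LEN   8

/* Header stored in front of the user buffer so the check can find the guard. */
struct guard_hdr {
	size_t size; /* size the caller asked for */
};

/* Allocate size bytes followed directly by GUARD_LEN magic bytes. */
static void *guarded_alloc(size_t size)
{
	struct guard_hdr *h = malloc(sizeof(*h) + size + GUARD_LEN);
	unsigned char *p;

	if (!h)
		return NULL;
	h->size = size;
	p = (unsigned char *)(h + 1);
	/* The guard starts at the first byte past the requested size. */
	memset(p + size, GUARD_MAGIC, GUARD_LEN);
	return p;
}

/* Returns true if the guard bytes are intact. */
static bool guarded_check(const void *ptr)
{
	const struct guard_hdr *h = (const struct guard_hdr *)ptr - 1;
	const unsigned char *g = (const unsigned char *)ptr + h->size;
	size_t i;

	for (i = 0; i < GUARD_LEN; i++)
		if (g[i] != GUARD_MAGIC)
			return false;
	return true;
}

static void guarded_free(void *ptr)
{
	free((struct guard_hdr *)ptr - 1);
}
```

Because the guard sits right after the requested size rather than after the rounded-up slab size, even a one-byte overrun of a 13-byte buffer is visible at check time. The returned pointer is only aligned for size_t in this sketch, which is enough for byte buffers.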
* Re: [PATCH] gpu: drm: qxl: Adding new typedef vm_fault_t
From: Gerd Hoffmann @ 2018-04-24 12:35 UTC (permalink / raw)
To: Souptick Joarder, airlied, linux-kernel, willy, virtualization,
dri-devel, airlied, Maarten Lankhorst, Gustavo Padovan, Sean Paul
In-Reply-To: <20180424121151.GX31310@phenom.ffwll.local>
On Tue, Apr 24, 2018 at 02:11:51PM +0200, Daniel Vetter wrote:
> On Mon, Apr 23, 2018 at 12:49:24PM +0200, Gerd Hoffmann wrote:
> > On Tue, Apr 17, 2018 at 07:08:44PM +0530, Souptick Joarder wrote:
> > > Use new return type vm_fault_t for fault handler. For
> > > now, this is just documenting that the function returns
> > > a VM_FAULT value rather than an errno. Once all instances
> > > are converted, vm_fault_t will become a distinct type.
> > >
> > > Reference id -> 1c8f422059ae ("mm: change return type to
> > > vm_fault_t")
> >
> > Hmm, that commit isn't yet in drm-misc-next.
> > Will drm-misc-next merge with 4.17-rcX soon?
>
> For backmerge requests you need to cc/ping the drm-misc maintainers.
> Adding them. I think the hold-up was also that Dave was still on
> vacation.
Ah, ok.
So my expectation that a backmerge happens anyway after -rc1/2 is in
line with reality; it is just delayed this time. I'll stay
tuned ;)
cheers,
Gerd
* Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
From: Michal Hocko @ 2018-04-24 12:51 UTC (permalink / raw)
To: Mikulas Patocka
Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
virtualization, linux-mm, edumazet, Andrew Morton, David Miller,
Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804232003100.2299@file01.intranet.prod.int.rdu2.redhat.com>
On Mon 23-04-18 20:06:16, Mikulas Patocka wrote:
[...]
> @@ -404,6 +405,12 @@ void *kvmalloc_node(size_t size, gfp_t f
> */
> WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
>
> +#ifdef CONFIG_DEBUG_SG
> + /* Catch bugs when the caller uses DMA API on the result of kvmalloc. */
> + if (!(prandom_u32_max(2) & 1))
> + goto do_vmalloc;
> +#endif
I really do not think there is anything DEBUG_SG specific here. Why don't
you simply follow the should_failslab path or even reuse the function?
> +
> /*
> * We want to attempt a large physically contiguous block first because
> * it is less likely to fragment multiple larger blocks and therefore
> @@ -427,6 +434,9 @@ void *kvmalloc_node(size_t size, gfp_t f
> if (ret || size <= PAGE_SIZE)
> return ret;
>
> +#ifdef CONFIG_DEBUG_SG
> +do_vmalloc:
> +#endif
> return __vmalloc_node_flags_caller(size, node, flags,
> __builtin_return_address(0));
> }
--
Michal Hocko
SUSE Labs
* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Michal Hocko @ 2018-04-24 13:31 UTC (permalink / raw)
To: Mikulas Patocka
Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
virtualization, linux-mm, edumazet, Andrew Morton, David Miller,
Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804232006540.2299@file01.intranet.prod.int.rdu2.redhat.com>
On Mon 23-04-18 20:25:15, Mikulas Patocka wrote:
>
>
> On Mon, 23 Apr 2018, Michal Hocko wrote:
>
> > On Mon 23-04-18 10:06:08, Mikulas Patocka wrote:
> >
> > > > > He didn't want to fix vmalloc(GFP_NOIO)
> > > >
> > > > I don't remember that conversation, so I don't know whether I agree with
> > > > his reasoning or not. But we are supposed to be moving away from GFP_NOIO
> > > > towards marking regions with memalloc_noio_save() / restore. If you do
> > > > that, you won't need vmalloc(GFP_NOIO).
> > >
> > > He said the same thing a year ago. And there was small progress. 6 out of
> > > 27 __vmalloc calls were converted to memalloc_noio_save in a year - 5 in
> > > infiniband and 1 in btrfs. (the whole discussion is here
> > > http://lkml.iu.edu/hypermail/linux/kernel/1706.3/04681.html )
> >
> > Well, this is not that easy. It requires cooperation from maintainers.
> > I can only do so much. I've posted patches in the past and have been
> > actively bringing up this topic at LSFMM for the last two years...
>
> You're right - but you have chosen the uneasy path.
Yes.
> Fixing __vmalloc code
> is easy and it doesn't require cooperation with maintainers.
But it is a hack against the intention of the scope API. It also allows
maintainers to not care about their broken code.
> > > He refuses a 15-line patch to fix the GFP_NOIO bug because he believes that in 4
> > > years, the kernel will be refactored and GFP_NOIO will be eliminated. Why
> > > does he have veto over this part of the code? I'd much rather argue with
> > > people who have constructive comments about fixing bugs than with him.
> >
> > I didn't NACK the patch AFAIR. I've said it is not a good idea longterm.
> > I would be much more willing to change my mind if you would back your
> > patch by a real bug report. Hacks are acceptable when we have a real
> > issue on our hands. But if we want to fix a potential issue, then better
> > do it properly.
>
> Developers should fix bugs in advance, not wait until a crash happens,
> is analyzed and reported.
I agree. But are those existing users broken in the first place? I have
seen so many GFP_NOFS abuses that I would dare to guess that most of
those vmalloc NOFS abusers can be simply turned into GFP_KERNEL. Maybe
that is the reason we haven't heard any complaints in years.
--
Michal Hocko
SUSE Labs
* Re: [PATCH] vhost_net: use packet weight for rx handler, too
From: David Miller @ 2018-04-24 14:02 UTC (permalink / raw)
To: pabeni; +Cc: kvm, mst, netdev, virtualization, haibinzhang
In-Reply-To: <11f2a27cee0c660a611af381ac1b68d9526095e3.1524556673.git.pabeni@redhat.com>
From: Paolo Abeni <pabeni@redhat.com>
Date: Tue, 24 Apr 2018 10:34:36 +0200
> Similar to commit a2ac99905f1e ("vhost-net: set packet weight of
> tx polling to 2 * vq size"), we need a packet-based limit for
> handle_rx, too - otherwise, under rx flood with small packets,
> tx can be delayed for a very long time, even without busypolling.
>
> The pkt limit applied to handle_rx must be the same applied by
> handle_tx, or we will get unfair scheduling between rx and tx.
> Tying such limit to the queue length makes it less effective for
> large queue length values and can introduce large process
> scheduler latencies, so a constant value is used - like
> the existing bytes limit.
>
> The selected limit has been validated with PVP[1] performance
> test with different queue sizes:
>
> queue size     256    512   1024
>
> baseline       366    354    362
> weight 128     715    723    670
> weight 256     740    745    733
> weight 512     600    460    583
> weight 1024    423    427    418
>
> A packet weight of 256 gives peak performance in all the
> tested scenarios.
>
> No measurable regression in unidirectional performance tests has
> been detected.
>
> [1] https://developers.redhat.com/blog/2017/06/05/measuring-and-comparing-open-vswitch-performance/
>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Applied to net-next, thanks.
* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Mikulas Patocka @ 2018-04-24 15:30 UTC (permalink / raw)
To: Michal Hocko
Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
virtualization, linux-mm, edumazet, Andrew Morton, David Miller,
Vlastimil Babka
In-Reply-To: <20180424133146.GG17484@dhcp22.suse.cz>
On Tue, 24 Apr 2018, Michal Hocko wrote:
> On Mon 23-04-18 20:25:15, Mikulas Patocka wrote:
>
> > Fixing __vmalloc code
> > is easy and it doesn't require cooperation with maintainers.
>
> But it is a hack against the intention of the scope api.
It is not! You can fix __vmalloc now and you can convert the kernel to the
scope API in 4 years. It's not one way or the other.
> It also allows maintainers to not care about their broken code.
Most maintainers don't even know that it's broken. Out of 14 subsystems
using __vmalloc with GFP_NOIO/NOFS, only 2 realized that its
implementation is broken and implemented a workaround (me and the XFS
developers).
Misimplementing a function in a subtle and hard-to-notice way won't drive
developers away from using it.
> > > > He refuses a 15-line patch to fix the GFP_NOIO bug because he believes that in 4
> > > > years, the kernel will be refactored and GFP_NOIO will be eliminated. Why
> > > > does he have veto over this part of the code? I'd much rather argue with
> > > > people who have constructive comments about fixing bugs than with him.
> > >
> > > I didn't NACK the patch AFAIR. I've said it is not a good idea longterm.
> > > I would be much more willing to change my mind if you would back your
> > > patch by a real bug report. Hacks are acceptable when we have a real
> > > > issue on our hands. But if we want to fix a potential issue, then better
> > > > do it properly.
> >
> > > Developers should fix bugs in advance, not wait until a crash happens,
> > is analyzed and reported.
>
> I agree. But are those existing users broken in the first place? I have
> seen so many GFP_NOFS abuses that I would dare to guess that most of
> those vmalloc NOFS abusers can be simply turned into GFP_KERNEL. Maybe
> that is the reason we haven't heard any complaints in years.
alloc_pages reclaims clean pages and most of the hard work is done by kswapd,
so GFP_KERNEL doesn't cause many issues with writeback. But cheating isn't
justified just because you can get away with it. Incorrect GFP flags cause
real problems with shrinkers - because shrinkers are called from alloc_pages
and they do respond to GFP flags.
I reported a deadlock due to GFP issues (9d28eb12447). And the worst
thing about these bug reports is that they are totally unreproducible and
I get nothing but a stack trace in bugzilla. I had to guess what happened
and I couldn't even test if the patch fixed the bug.
I'm not really happy that you are deliberately leaving these issues behind
and making excuses.
Mikulas
* Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
From: Mikulas Patocka @ 2018-04-24 15:50 UTC (permalink / raw)
To: Michal Hocko
Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
virtualization, linux-mm, edumazet, Andrew Morton, David Miller,
Vlastimil Babka
In-Reply-To: <20180424125121.GA17484@dhcp22.suse.cz>
On Tue, 24 Apr 2018, Michal Hocko wrote:
> On Mon 23-04-18 20:06:16, Mikulas Patocka wrote:
> [...]
> > @@ -404,6 +405,12 @@ void *kvmalloc_node(size_t size, gfp_t f
> > */
> > WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
> >
> > +#ifdef CONFIG_DEBUG_SG
> > + /* Catch bugs when the caller uses DMA API on the result of kvmalloc. */
> > + if (!(prandom_u32_max(2) & 1))
> > + goto do_vmalloc;
> > +#endif
>
> I really do not think there is anything DEBUG_SG specific here. Why don't
> you simply follow the should_failslab path or even reuse the function?
CONFIG_DEBUG_SG is enabled by default in RHEL and Fedora debug kernels (if
you don't like CONFIG_DEBUG_SG, pick any other option that is enabled
there).
The fault-injection framework is off by default and it must be explicitly
enabled and configured by the user - and most users won't enable it.
Mikulas
> > +
> > /*
> > * We want to attempt a large physically contiguous block first because
> > * it is less likely to fragment multiple larger blocks and therefore
> > @@ -427,6 +434,9 @@ void *kvmalloc_node(size_t size, gfp_t f
> > if (ret || size <= PAGE_SIZE)
> > return ret;
> >
> > +#ifdef CONFIG_DEBUG_SG
> > +do_vmalloc:
> > +#endif
> > return __vmalloc_node_flags_caller(size, node, flags,
> > __builtin_return_address(0));
> > }
>
> --
> Michal Hocko
> SUSE Labs
>
* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Michal Hocko @ 2018-04-24 16:12 UTC (permalink / raw)
To: Mikulas Patocka
Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
virtualization, linux-mm, edumazet, Andrew Morton, David Miller,
Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804241107010.31601@file01.intranet.prod.int.rdu2.redhat.com>
On Tue 24-04-18 11:30:40, Mikulas Patocka wrote:
>
>
> On Tue, 24 Apr 2018, Michal Hocko wrote:
>
> > On Mon 23-04-18 20:25:15, Mikulas Patocka wrote:
> >
> > > Fixing __vmalloc code
> > > is easy and it doesn't require cooperation with maintainers.
> >
> > But it is a hack against the intention of the scope api.
>
> It is not!
This discussion simply doesn't make much sense it seems. The scope API
is to document the scope of the reclaim recursion critical section. That
certainly is not a utility function like vmalloc.
--
Michal Hocko
SUSE Labs
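For context, the scope approach discussed in this thread can be modelled in a few lines. This is a simplified single-threaded userspace sketch, not kernel code: the flag values, the task_noio variable and the noio_save()/noio_restore()/effective_gfp() names are hypothetical stand-ins for the per-task flag set by memalloc_noio_save() and the masking the allocator applies to the caller's GFP flags.

```c
/* Simplified GFP bits; the values are arbitrary for this model. */
#define GFP_IO     0x1u
#define GFP_FS     0x2u
#define GFP_KERNEL (GFP_IO | GFP_FS)
#define GFP_NOIO   0u

/* A per-task flag in the kernel; a plain global in this model. */
static unsigned int task_noio;

/* Enter a no-I/O scope and return the previous state so scopes can nest. */
static unsigned int noio_save(void)
{
	unsigned int old = task_noio;

	task_noio = 1;
	return old;
}

/* Leave the scope by putting back the state returned by noio_save(). */
static void noio_restore(unsigned int old)
{
	task_noio = old;
}

/* The mask an allocation effectively runs with while inside the scope. */
static unsigned int effective_gfp(unsigned int gfp)
{
	if (task_noio)
		gfp &= ~(GFP_IO | GFP_FS);
	return gfp;
}
```

In this model every allocation made between save and restore is treated as no-I/O regardless of the flags the caller passed, including allocations made by helpers further down the call chain, and nested scopes unwind correctly because each restore puts back the state its own save returned.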
* Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
From: Michal Hocko @ 2018-04-24 16:29 UTC (permalink / raw)
To: Mikulas Patocka
Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
virtualization, linux-mm, edumazet, Andrew Morton, David Miller,
Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804241142340.15660@file01.intranet.prod.int.rdu2.redhat.com>
On Tue 24-04-18 11:50:30, Mikulas Patocka wrote:
>
>
> On Tue, 24 Apr 2018, Michal Hocko wrote:
>
> > On Mon 23-04-18 20:06:16, Mikulas Patocka wrote:
> > [...]
> > > @@ -404,6 +405,12 @@ void *kvmalloc_node(size_t size, gfp_t f
> > > */
> > > WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
> > >
> > > +#ifdef CONFIG_DEBUG_SG
> > > + /* Catch bugs when the caller uses DMA API on the result of kvmalloc. */
> > > + if (!(prandom_u32_max(2) & 1))
> > > + goto do_vmalloc;
> > > +#endif
> >
> > I really do not think there is anything DEBUG_SG specific here. Why don't
> > you simply follow the should_failslab path or even reuse the function?
>
> CONFIG_DEBUG_SG is enabled by default in RHEL and Fedora debug kernels (if
> you don't like CONFIG_DEBUG_SG, pick any other option that is enabled
> there).
Are you telling me that you are shaping a debugging functionality based
on what RHEL has enabled? And you call me evil. This is just ridiculous.
> The fault-injection framework is off by default and it must be explicitly
> enabled and configured by the user - and most users won't enable it.
It can be enabled easily. And if you care enough for your debugging
kernel then just make it enabled unconditionally.
--
Michal Hocko
SUSE Labs
* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Michal Hocko @ 2018-04-24 16:29 UTC (permalink / raw)
To: Mikulas Patocka
Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
virtualization, linux-mm, edumazet, Andrew Morton, David Miller,
Vlastimil Babka
In-Reply-To: <20180424161242.GK17484@dhcp22.suse.cz>
On Tue 24-04-18 10:12:42, Michal Hocko wrote:
> On Tue 24-04-18 11:30:40, Mikulas Patocka wrote:
> >
> >
> > On Tue, 24 Apr 2018, Michal Hocko wrote:
> >
> > > On Mon 23-04-18 20:25:15, Mikulas Patocka wrote:
> > >
> > > > Fixing __vmalloc code
> > > > is easy and it doesn't require cooperation with maintainers.
> > >
> > > But it is a hack against the intention of the scope api.
> >
> > It is not!
>
> This discussion simply doesn't make much sense it seems. The scope API
> is to document the scope of the reclaim recursion critical section. That
> certainly is not a utility function like vmalloc.
http://lkml.kernel.org/r/20180424162712.GL17484@dhcp22.suse.cz
let's see how it rolls this time.
--
Michal Hocko
SUSE Labs
* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Mikulas Patocka @ 2018-04-24 16:33 UTC (permalink / raw)
To: Michal Hocko
Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
virtualization, linux-mm, edumazet, Andrew Morton, David Miller,
Vlastimil Babka
In-Reply-To: <20180424161242.GK17484@dhcp22.suse.cz>
On Tue, 24 Apr 2018, Michal Hocko wrote:
> On Tue 24-04-18 11:30:40, Mikulas Patocka wrote:
> >
> >
> > On Tue, 24 Apr 2018, Michal Hocko wrote:
> >
> > > On Mon 23-04-18 20:25:15, Mikulas Patocka wrote:
> > >
> > > > Fixing __vmalloc code
> > > > is easy and it doesn't require cooperation with maintainers.
> > >
> > > But it is a hack against the intention of the scope api.
> >
> > It is not!
>
> This discussion simply doesn't make much sense it seems. The scope API
> is to document the scope of the reclaim recursion critical section. That
> certainly is not a utility function like vmalloc.
That 15-line __vmalloc bugfix doesn't prevent you (or any other kernel
developer) from converting the code to the scope API. You make nonsensical
excuses.
Mikulas
* Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
From: Mikulas Patocka @ 2018-04-24 17:00 UTC (permalink / raw)
To: Michal Hocko
Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
virtualization, linux-mm, edumazet, Andrew Morton, David Miller,
Vlastimil Babka
In-Reply-To: <20180424162906.GM17484@dhcp22.suse.cz>
On Tue, 24 Apr 2018, Michal Hocko wrote:
> On Tue 24-04-18 11:50:30, Mikulas Patocka wrote:
> >
> >
> > On Tue, 24 Apr 2018, Michal Hocko wrote:
> >
> > > On Mon 23-04-18 20:06:16, Mikulas Patocka wrote:
> > > [...]
> > > > @@ -404,6 +405,12 @@ void *kvmalloc_node(size_t size, gfp_t f
> > > > */
> > > > WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
> > > >
> > > > +#ifdef CONFIG_DEBUG_SG
> > > > + /* Catch bugs when the caller uses DMA API on the result of kvmalloc. */
> > > > + if (!(prandom_u32_max(2) & 1))
> > > > + goto do_vmalloc;
> > > > +#endif
> > >
> > > I really do not think there is anything DEBUG_SG specific here. Why don't
> > > you simply follow the should_failslab path or even reuse the function?
> >
> > CONFIG_DEBUG_SG is enabled by default in RHEL and Fedora debug kernels (if
> > you don't like CONFIG_DEBUG_SG, pick any other option that is enabled
> > there).
>
> Are you telling me that you are shaping a debugging functionality based
> on what RHEL has enabled? And you call me evil. This is just ridiculous.
>
> > The fault-injection framework is off by default and it must be explicitly
> > enabled and configured by the user - and most users won't enable it.
>
> It can be enabled easily. And if you care enough for your debugging
> kernel then just make it enabled unconditionally.
So, should we add a new option CONFIG_KVMALLOC_FALLBACK_DEFAULT? I'm not
quite sure if 3 lines of debugging code need an extra option, but if you
don't want to reuse any existing debug option, it may be possible. Adding
it to the RHEL debug kernel would be trivial.
Mikulas
* Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
From: Michal Hocko @ 2018-04-24 17:03 UTC (permalink / raw)
To: Mikulas Patocka
Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
virtualization, linux-mm, edumazet, Andrew Morton, David Miller,
Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804241250350.28995@file01.intranet.prod.int.rdu2.redhat.com>
On Tue 24-04-18 13:00:11, Mikulas Patocka wrote:
>
>
> On Tue, 24 Apr 2018, Michal Hocko wrote:
>
> > On Tue 24-04-18 11:50:30, Mikulas Patocka wrote:
> > >
> > >
> > > On Tue, 24 Apr 2018, Michal Hocko wrote:
> > >
> > > > On Mon 23-04-18 20:06:16, Mikulas Patocka wrote:
> > > > [...]
> > > > > @@ -404,6 +405,12 @@ void *kvmalloc_node(size_t size, gfp_t f
> > > > > */
> > > > > WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
> > > > >
> > > > > +#ifdef CONFIG_DEBUG_SG
> > > > > + /* Catch bugs when the caller uses DMA API on the result of kvmalloc. */
> > > > > + if (!(prandom_u32_max(2) & 1))
> > > > > + goto do_vmalloc;
> > > > > +#endif
> > > >
> > > > I really do not think there is anything DEBUG_SG specific here. Why don't
> > > > you simply follow the should_failslab path or even reuse the function?
> > >
> > > CONFIG_DEBUG_SG is enabled by default in RHEL and Fedora debug kernels (if
> > > you don't like CONFIG_DEBUG_SG, pick any other option that is enabled
> > > there).
> >
> > Are you telling me that you are shaping a debugging functionality based
> > on what RHEL has enabled? And you call me evil. This is just ridiculous.
> >
> > > The fault-injection framework is off by default and it must be explicitly
> > > enabled and configured by the user - and most users won't enable it.
> >
> > It can be enabled easily. And if you care enough for your debugging
> > kernel then just make it enabled unconditionally.
>
> So, should we add a new option CONFIG_KVMALLOC_FALLBACK_DEFAULT? I'm not
> quite sure if 3 lines of debugging code need an extra option, but if you
> don't want to reuse any existing debug option, it may be possible. Adding
> it to the RHEL debug kernel would be trivial.
Wouldn't it be equally trivial to simply enable the fault injection? You
would get additional failure paths testing as a bonus.
--
Michal Hocko
SUSE Labs
* Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
From: Matthew Wilcox @ 2018-04-24 17:16 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Christoph Lameter, dm-devel, eric.dumazet, mst, netdev,
linux-kernel, Michal Hocko, Pekka Enberg, linux-mm, edumazet,
David Rientjes, Joonsoo Kim, Andrew Morton, virtualization,
David Miller, Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804240818530.28016@file01.intranet.prod.int.rdu2.redhat.com>
On Tue, Apr 24, 2018 at 08:29:14AM -0400, Mikulas Patocka wrote:
>
>
> On Mon, 23 Apr 2018, Matthew Wilcox wrote:
>
> > On Mon, Apr 23, 2018 at 08:06:16PM -0400, Mikulas Patocka wrote:
> > > Some bugs (such as buffer overflows) are better detected
> > > with kmalloc code, so we must test the kmalloc path too.
> >
> > Well now, this brings up another item for the collective TODO list --
> > implement redzone checks for vmalloc. Unless this is something already
> > taken care of by kasan or similar.
>
> The kmalloc overflow testing is also not ideal - it rounds the size up to
> the next slab size and detects buffer overflows only at this boundary.
>
> Some time ago, I made a "kmalloc guard" patch that places a magic number
> immediately after the requested size - so that it can detect overflows at
> a byte boundary
> ( https://www.redhat.com/archives/dm-devel/2014-September/msg00018.html )
>
> That patch found a bug in crypto code:
> ( http://lkml.iu.edu/hypermail/linux/kernel/1409.1/02325.html )
Is it still worth doing this, now that we have kasan?
* Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
From: Mikulas Patocka @ 2018-04-24 17:28 UTC (permalink / raw)
To: Michal Hocko
Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
virtualization, linux-mm, edumazet, Andrew Morton, David Miller,
Vlastimil Babka
In-Reply-To: <20180424170349.GQ17484@dhcp22.suse.cz>
On Tue, 24 Apr 2018, Michal Hocko wrote:
> On Tue 24-04-18 13:00:11, Mikulas Patocka wrote:
> >
> >
> > On Tue, 24 Apr 2018, Michal Hocko wrote:
> >
> > > On Tue 24-04-18 11:50:30, Mikulas Patocka wrote:
> > > >
> > > >
> > > > On Tue, 24 Apr 2018, Michal Hocko wrote:
> > > >
> > > > > On Mon 23-04-18 20:06:16, Mikulas Patocka wrote:
> > > > > [...]
> > > > > > @@ -404,6 +405,12 @@ void *kvmalloc_node(size_t size, gfp_t f
> > > > > > */
> > > > > > WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
> > > > > >
> > > > > > +#ifdef CONFIG_DEBUG_SG
> > > > > > + /* Catch bugs when the caller uses DMA API on the result of kvmalloc. */
> > > > > > + if (!(prandom_u32_max(2) & 1))
> > > > > > + goto do_vmalloc;
> > > > > > +#endif
> > > > >
> > > > > I really do not think there is anything DEBUG_SG specific here. Why don't
> > > > > you simply follow the should_failslab path or even reuse the function?
> > > >
> > > > CONFIG_DEBUG_SG is enabled by default in RHEL and Fedora debug kernels (if
> > > > you don't like CONFIG_DEBUG_SG, pick any other option that is enabled
> > > > there).
> > >
> > > Are you telling me that you are shaping a debugging functionality based
> > > on what RHEL has enabled? And you call me evil. This is just ridiculous.
> > >
> > > > The fault-injection framework is off by default and it must be explicitly
> > > > enabled and configured by the user - and most users won't enable it.
> > >
> > > It can be enabled easily. And if you care enough for your debugging
> > > kernel then just make it enabled unconditionally.
> >
> > So, should we add a new option CONFIG_KVMALLOC_FALLBACK_DEFAULT? I'm not
> > quite sure if 3 lines of debugging code need an extra option, but if you
> > don't want to reuse any existing debug option, it may be possible. Adding
> > it to the RHEL debug kernel would be trivial.
>
> Wouldn't it be equally trivial to simply enable the fault injection? You
> would get additional failure paths testing as a bonus.
The RHEL and Fedora debugging kernels are compiled with fault injection.
But the fault-injection framework will do nothing unless it is enabled by
a kernel parameter or debugfs write.
Most users don't know about the fault injection kernel parameters or
debugfs files and won't enable them. We need a CONFIG_ option to enable it
by default in the debugging kernels (and we could add a kernel parameter
to override the default, fine-tune the fallback probability etc.)
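A rough userspace sketch of that idea, for illustration only.
FALLBACK_DEFAULT_PERCENT stands in for the CONFIG_ default and
fallback_percent for the kernel parameter. Both names, and the xorshift
generator used in place of prandom_u32_max(), are assumptions made for this
example and are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical build-time default, standing in for a CONFIG_ option. */
#ifndef FALLBACK_DEFAULT_PERCENT
#define FALLBACK_DEFAULT_PERCENT 50
#endif

/* Hypothetical runtime override, standing in for a kernel parameter. */
unsigned int fallback_percent = FALLBACK_DEFAULT_PERCENT;

/* Small xorshift PRNG; the kernel patch calls prandom_u32_max() instead. */
uint32_t rng_state = 0x12345678u;

uint32_t next_random(void)
{
	rng_state ^= rng_state << 13;
	rng_state ^= rng_state >> 17;
	rng_state ^= rng_state << 5;
	return rng_state;
}

/* Returns true when the allocation should skip the kmalloc attempt. */
bool force_vmalloc_fallback(void)
{
	assert(fallback_percent <= 100);
	return next_random() % 100 < fallback_percent;
}
```

With fallback_percent set to 0 the fallback is never forced, and with 100 it
is always forced, so a debug kernel could ship with a non-zero default while
a production kernel keeps 0.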
Mikulas
* Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
From: Michal Hocko @ 2018-04-24 17:38 UTC (permalink / raw)
To: Mikulas Patocka
Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
virtualization, linux-mm, edumazet, Andrew Morton, David Miller,
Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804241319390.28995@file01.intranet.prod.int.rdu2.redhat.com>
On Tue 24-04-18 13:28:49, Mikulas Patocka wrote:
>
>
> On Tue, 24 Apr 2018, Michal Hocko wrote:
>
> > On Tue 24-04-18 13:00:11, Mikulas Patocka wrote:
> > >
> > >
> > > On Tue, 24 Apr 2018, Michal Hocko wrote:
> > >
> > > > On Tue 24-04-18 11:50:30, Mikulas Patocka wrote:
> > > > >
> > > > >
> > > > > On Tue, 24 Apr 2018, Michal Hocko wrote:
> > > > >
> > > > > > On Mon 23-04-18 20:06:16, Mikulas Patocka wrote:
> > > > > > [...]
> > > > > > > @@ -404,6 +405,12 @@ void *kvmalloc_node(size_t size, gfp_t f
> > > > > > > */
> > > > > > > WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
> > > > > > >
> > > > > > > +#ifdef CONFIG_DEBUG_SG
> > > > > > > + /* Catch bugs when the caller uses DMA API on the result of kvmalloc. */
> > > > > > > + if (!(prandom_u32_max(2) & 1))
> > > > > > > + goto do_vmalloc;
> > > > > > > +#endif
> > > > > >
> > > > > > I really do not think there is anything DEBUG_SG specific here. Why you
> > > > > > simply do not follow should_failslab path or even reuse the function?
> > > > >
> > > > > CONFIG_DEBUG_SG is enabled by default in RHEL and Fedora debug kernel (if
> > > > > you don't like CONFIG_DEBUG_SG, pick any other option that is enabled
> > > > > there).
> > > >
> > > > Are you telling me that you are shaping a debugging functionality basing
> > > > on what RHEL has enabled? And you call me evil. This is just rediculous.
> > > >
> > > > > Fail-injection framework is if off by default and it must be explicitly
> > > > > enabled and configured by the user - and most users won't enable it.
> > > >
> > > > It can be enabled easily. And if you care enough for your debugging
> > > > kernel then just make it enabled unconditionally.
> > >
> > > So, should we add a new option CONFIG_KVMALLOC_FALLBACK_DEFAULT? I'm not
> > > quite sure if 3 lines of debugging code need an extra option, but if you
> > > don't want to reuse any existing debug option, it may be possible. Adding
> > > it to the RHEL debug kernel would be trivial.
> >
> > Wouldn't it be equally trivial to simply enable the fault injection? You
> > would get additional failure paths testing as a bonus.
>
> The RHEL and Fedora debugging kernels are compiled with fault injection.
> But the fault-injection framework will do nothing unless it is enabled by
> a kernel parameter or debugfs write.
>
> Most users don't know about the fault injection kernel parameters or
> debugfs files and won't enabled it. We need a CONFIG_ option to enable it
> by default in the debugging kernels (and we could add a kernel parameter
> to override the default, fine-tune the fallback probability etc.)
If it is a real issue to install the debugging kernel with the required
kernel parameter, then a config option to make it default-on makes sense
to me.
--
Michal Hocko
SUSE Labs
* Re: [PATCH 0/6] virtio-console: spec compliance fixes
From: Michael S. Tsirkin @ 2018-04-24 18:41 UTC (permalink / raw)
To: linux-kernel
Cc: Arnd Bergmann, Amit Shah, Greg Kroah-Hartman, stable,
virtualization
In-Reply-To: <1524248223-393618-1-git-send-email-mst@redhat.com>
On Fri, Apr 20, 2018 at 09:17:59PM +0300, Michael S. Tsirkin wrote:
> Turns out virtio console tries to take a buffer out of an active vq.
> Works by sheer luck, and is explicitly forbidden by spec. And while
> going over it I saw that error handling is also broken -
> failure is easy to trigger if I force allocations to fail.
>
> Lightly tested.
Amit - any feedback before I push these patches?
> Michael S. Tsirkin (6):
> virtio_console: don't tie bufs to a vq
> virtio: add ability to iterate over vqs
> virtio_console: free buffers after reset
> virtio_console: drop custom control queue cleanup
> virtio_console: move removal code
> virtio_console: reset on out of memory
>
> drivers/char/virtio_console.c | 155 ++++++++++++++++++++----------------------
> include/linux/virtio.h | 3 +
> 2 files changed, 75 insertions(+), 83 deletions(-)
>
> --
> MST
>
* Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
From: Mikulas Patocka @ 2018-04-24 18:41 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Christoph Lameter, dm-devel, eric.dumazet, mst, netdev,
linux-kernel, Michal Hocko, Pekka Enberg, linux-mm, edumazet,
David Rientjes, Joonsoo Kim, Andrew Morton, virtualization,
David Miller, Vlastimil Babka
In-Reply-To: <20180424171651.GC30577@bombadil.infradead.org>
On Tue, 24 Apr 2018, Matthew Wilcox wrote:
> On Tue, Apr 24, 2018 at 08:29:14AM -0400, Mikulas Patocka wrote:
> >
> >
> > On Mon, 23 Apr 2018, Matthew Wilcox wrote:
> >
> > > On Mon, Apr 23, 2018 at 08:06:16PM -0400, Mikulas Patocka wrote:
> > > > Some bugs (such as buffer overflows) are better detected
> > > > with kmalloc code, so we must test the kmalloc path too.
> > >
> > > Well now, this brings up another item for the collective TODO list --
> > > implement redzone checks for vmalloc. Unless this is something already
> > > taken care of by kasan or similar.
> >
> > The kmalloc overflow testing is also not ideal - it rounds the size up to
> > the next slab size and detects buffer overflows only at this boundary.
> >
> > Some times ago, I made a "kmalloc guard" patch that places a magic number
> > immediatelly after the requested size - so that it can detect overflows at
> > byte boundary
> > ( https://www.redhat.com/archives/dm-devel/2014-September/msg00018.html )
> >
> > That patch found a bug in crypto code:
> > ( http://lkml.iu.edu/hypermail/linux/kernel/1409.1/02325.html )
>
> Is it still worth doing this, now we have kasan?
The kmalloc guard has much lower overhead than kasan.
(BTW. when I tried kasan, it oopsed with persistent memory)
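For readers unfamiliar with the guard idea quoted above, here is a minimal
userspace sketch, assuming a one-byte magic value and a small size header.
The names guard_alloc(), guard_check() and guard_free() are hypothetical; the
actual patch linked above hooks into the slab allocator and may differ in
detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical guard value; not taken from the referenced patch. */
#define GUARD_MAGIC 0x5aU

/* Header stored in front of the data so the check knows the exact size. */
struct guard_hdr {
	size_t size;	/* size the caller asked for */
};

/* Allocate size bytes followed by one guard byte. */
void *guard_alloc(size_t size)
{
	struct guard_hdr *hdr = malloc(sizeof(*hdr) + size + 1);
	unsigned char *data;

	if (!hdr)
		return NULL;
	hdr->size = size;
	data = (unsigned char *)(hdr + 1);
	data[size] = GUARD_MAGIC;
	return data;
}

/* Returns true if the byte right after the requested size is intact. */
bool guard_check(const void *ptr)
{
	const struct guard_hdr *hdr = (const struct guard_hdr *)ptr - 1;
	const unsigned char *data = ptr;

	return data[hdr->size] == GUARD_MAGIC;
}

/* Verify the guard, then release the whole allocation. */
void guard_free(void *ptr)
{
	assert(guard_check(ptr));
	free((struct guard_hdr *)ptr - 1);
}
```

Because the guard sits at the exact requested size and not at the next slab
size, even a one-byte overflow changes it. The cost is one stored size and
one compare per free, which is the kind of overhead difference meant above.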
Mikulas
* Re: [PATCH 1/6] virtio_console: don't tie bufs to a vq
From: Michael S. Tsirkin @ 2018-04-24 18:56 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: Arnd Bergmann, Amit Shah, linux-kernel, stable, virtualization
In-Reply-To: <20180421073005.GA3744@kroah.com>
On Sat, Apr 21, 2018 at 09:30:05AM +0200, Greg Kroah-Hartman wrote:
> On Fri, Apr 20, 2018 at 09:18:01PM +0300, Michael S. Tsirkin wrote:
> > an allocated buffer doesn't need to be tied to a vq -
> > only vq->vdev is ever used. Pass the function the
> > just what it needs - the vdev.
> >
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> > drivers/char/virtio_console.c | 14 +++++++-------
> > 1 file changed, 7 insertions(+), 7 deletions(-)
> >
> > diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
> > index 468f061..3e56f32 100644
> > --- a/drivers/char/virtio_console.c
> > +++ b/drivers/char/virtio_console.c
> > @@ -422,7 +422,7 @@ static void reclaim_dma_bufs(void)
> > }
> > }
> >
> > -static struct port_buffer *alloc_buf(struct virtqueue *vq, size_t buf_size,
> > +static struct port_buffer *alloc_buf(struct virtio_device *vdev, size_t buf_size,
> > int pages)
> > {
> > struct port_buffer *buf;
> > @@ -445,16 +445,16 @@ static struct port_buffer *alloc_buf(struct virtqueue *vq, size_t buf_size,
> > return buf;
> > }
> >
> > - if (is_rproc_serial(vq->vdev)) {
> > + if (is_rproc_serial(vdev)) {
> > /*
> > * Allocate DMA memory from ancestor. When a virtio
> > * device is created by remoteproc, the DMA memory is
> > * associated with the grandparent device:
> > * vdev => rproc => platform-dev.
> > */
> > - if (!vq->vdev->dev.parent || !vq->vdev->dev.parent->parent)
> > + if (!vdev->dev.parent || !vdev->dev.parent->parent)
> > goto free_buf;
> > - buf->dev = vq->vdev->dev.parent->parent;
> > + buf->dev = vdev->dev.parent->parent;
> >
> > /* Increase device refcnt to avoid freeing it */
> > get_device(buf->dev);
> > @@ -838,7 +838,7 @@ static ssize_t port_fops_write(struct file *filp, const char __user *ubuf,
> >
> > count = min((size_t)(32 * 1024), count);
> >
> > - buf = alloc_buf(port->out_vq, count, 0);
> > + buf = alloc_buf(port->portdev->vdev, count, 0);
> > if (!buf)
> > return -ENOMEM;
> >
> > @@ -957,7 +957,7 @@ static ssize_t port_fops_splice_write(struct pipe_inode_info *pipe,
> > if (ret < 0)
> > goto error_out;
> >
> > - buf = alloc_buf(port->out_vq, 0, pipe->nrbufs);
> > + buf = alloc_buf(port->portdev->vdev, 0, pipe->nrbufs);
> > if (!buf) {
> > ret = -ENOMEM;
> > goto error_out;
> > @@ -1374,7 +1374,7 @@ static unsigned int fill_queue(struct virtqueue *vq, spinlock_t *lock)
> >
> > nr_added_bufs = 0;
> > do {
> > - buf = alloc_buf(vq, PAGE_SIZE, 0);
> > + buf = alloc_buf(vq->vdev, PAGE_SIZE, 0);
> > if (!buf)
> > break;
> >
> > --
> > MST
>
> <formletter>
>
> This is not the correct way to submit patches for inclusion in the
> stable kernel tree. Please read:
> https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> for how to do this properly.
>
> </formletter>
Thanks!
I have some questions about this one:
Cc: <stable@vger.kernel.org> # 3.3.x: a1f84a3: sched: Check for idle
Cc: <stable@vger.kernel.org> # 3.3.x: 1b9508f: sched: Rate-limit newidle
Cc: <stable@vger.kernel.org> # 3.3.x: fd21073: sched: Fix affinity logic
Cc: <stable@vger.kernel.org> # 3.3.x
Signed-off-by: Ingo Molnar <mingo@elte.hu>
1. What does the kernel version mean? Can I omit it?
2. When I rebase to add the tag, this changes the commit IDs of the
following commits in the same tree, breaking their tags
in the process. Pretty annoying. Any idea how to do it better?
Thanks!
--
MST
* [RFC v3 0/5] virtio: support packed ring
From: Tiwei Bie @ 2018-04-25 5:15 UTC (permalink / raw)
To: mst, jasowang, virtualization, linux-kernel, netdev; +Cc: wexu
Hello everyone,
This RFC implements packed ring support in virtio driver.
Some simple functional tests have been done with Jason's
packed ring implementation in vhost:
https://lkml.org/lkml/2018/4/23/12
Both ping and netperf worked as expected (with EVENT_IDX
disabled), but there are the following known issues:
1. Reloading the guest driver will break the Tx/Rx;
2. Zeroing the flags when detaching a used desc will
break the guest -> host path.
Some simple functional tests have also been done with
Wei's packed ring implementation in QEMU:
http://lists.nongnu.org/archive/html/qemu-devel/2018-04/msg00342.html
Both ping and netperf worked as expected (with EVENT_IDX
disabled). Reloading the guest driver also worked as expected.
TODO:
- Refinements (for code and commit log) and bug fixes;
- Discuss/fix/test EVENT_IDX support;
- Test devices other than net;
RFC v2 -> RFC v3:
- Split into small patches (Jason);
- Add helper virtqueue_use_indirect() (Jason);
- Just set id for the last descriptor of a list (Jason);
- Calculate the prev in virtqueue_add_packed() (Jason);
- Fix/improve desc suppression code (Jason/MST);
- Refine the code layout for XXX_split/packed and wrappers (MST);
- Fix the comments and API in uapi (MST);
- Remove the BUG_ON() for indirect (Jason);
- Some other refinements and bug fixes;
RFC v1 -> RFC v2:
- Add indirect descriptor support - compile test only;
- Add event suppression support - compile test only;
- Move vring_packed_init() out of uapi (Jason, MST);
- Merge two loops into one in virtqueue_add_packed() (Jason);
- Split vring_unmap_one() for packed ring and split ring (Jason);
- Avoid using '%' operator (Jason);
- Rename free_head -> next_avail_idx (Jason);
- Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
- Some other refinements and bug fixes;
Thanks!
Tiwei Bie (5):
virtio: add packed ring definitions
virtio_ring: support creating packed ring
virtio_ring: add packed ring support
virtio_ring: add event idx support in packed ring
virtio_ring: enable packed ring
drivers/virtio/virtio_ring.c | 1271 ++++++++++++++++++++++++++++--------
include/linux/virtio_ring.h | 8 +-
include/uapi/linux/virtio_config.h | 12 +-
include/uapi/linux/virtio_ring.h | 36 +
4 files changed, 1049 insertions(+), 278 deletions(-)
--
2.11.0
* [RFC v3 1/5] virtio: add packed ring definitions
From: Tiwei Bie @ 2018-04-25 5:15 UTC (permalink / raw)
To: mst, jasowang, virtualization, linux-kernel, netdev; +Cc: wexu
In-Reply-To: <20180425051550.24342-1-tiwei.bie@intel.com>
Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
---
include/uapi/linux/virtio_config.h | 12 +++++++++++-
include/uapi/linux/virtio_ring.h | 36 ++++++++++++++++++++++++++++++++++++
2 files changed, 47 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/virtio_config.h b/include/uapi/linux/virtio_config.h
index 308e2096291f..a6e392325e3a 100644
--- a/include/uapi/linux/virtio_config.h
+++ b/include/uapi/linux/virtio_config.h
@@ -49,7 +49,7 @@
* transport being used (eg. virtio_ring), the rest are per-device feature
* bits. */
#define VIRTIO_TRANSPORT_F_START 28
-#define VIRTIO_TRANSPORT_F_END 34
+#define VIRTIO_TRANSPORT_F_END 36
#ifndef VIRTIO_CONFIG_NO_LEGACY
/* Do we get callbacks when the ring is completely used, even if we've
@@ -71,4 +71,14 @@
* this is for compatibility with legacy systems.
*/
#define VIRTIO_F_IOMMU_PLATFORM 33
+
+/* This feature indicates support for the packed virtqueue layout. */
+#define VIRTIO_F_RING_PACKED 34
+
+/*
+ * This feature indicates that all buffers are used by the device
+ * in the same order in which they have been made available.
+ */
+#define VIRTIO_F_IN_ORDER 35
+
#endif /* _UAPI_LINUX_VIRTIO_CONFIG_H */
diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
index 6d5d5faa989b..3932cb80c347 100644
--- a/include/uapi/linux/virtio_ring.h
+++ b/include/uapi/linux/virtio_ring.h
@@ -44,6 +44,9 @@
/* This means the buffer contains a list of buffer descriptors. */
#define VRING_DESC_F_INDIRECT 4
+#define VRING_DESC_F_AVAIL(b) ((b) << 7)
+#define VRING_DESC_F_USED(b) ((b) << 15)
+
/* The Host uses this in used->flags to advise the Guest: don't kick me when
* you add a buffer. It's unreliable, so it's simply an optimization. Guest
* will still kick if it's out of buffers. */
@@ -53,6 +56,10 @@
* optimization. */
#define VRING_AVAIL_F_NO_INTERRUPT 1
+#define VRING_EVENT_F_ENABLE 0x0
+#define VRING_EVENT_F_DISABLE 0x1
+#define VRING_EVENT_F_DESC 0x2
+
/* We support indirect buffer descriptors */
#define VIRTIO_RING_F_INDIRECT_DESC 28
@@ -171,4 +178,33 @@ static inline int vring_need_event(__u16 event_idx, __u16 new_idx, __u16 old)
return (__u16)(new_idx - event_idx - 1) < (__u16)(new_idx - old);
}
+struct vring_packed_desc_event {
+ /* __virtio16 off : 15; // Descriptor Event Offset
+ * __virtio16 wrap : 1; // Descriptor Event Wrap Counter */
+ __virtio16 off_wrap;
+ /* __virtio16 flags : 2; // Descriptor Event Flags */
+ __virtio16 flags;
+};
+
+struct vring_packed_desc {
+ /* Buffer Address. */
+ __virtio64 addr;
+ /* Buffer Length. */
+ __virtio32 len;
+ /* Buffer ID. */
+ __virtio16 id;
+ /* The flags depending on descriptor type. */
+ __virtio16 flags;
+};
+
+struct vring_packed {
+ unsigned int num;
+
+ struct vring_packed_desc *desc;
+
+ struct vring_packed_desc_event *driver;
+
+ struct vring_packed_desc_event *device;
+};
+
#endif /* _UAPI_LINUX_VIRTIO_RING_H */
--
2.11.0
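
A small sketch of how the two flag macros above could be combined with a ring
wrap counter, as I understand the packed ring layout in the virtio spec
draft: the driver marks a descriptor available by setting AVAIL to its wrap
counter and USED to the inverse, and a descriptor counts as used once both
bits equal the wrap counter. The helper names desc_flags_avail() and
desc_is_used() are made up for this example and are not part of this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Copied from the uapi hunk above. */
#define VRING_DESC_F_AVAIL(b)	((b) << 7)
#define VRING_DESC_F_USED(b)	((b) << 15)

/* The two bits live at positions 7 and 15 of the 16-bit flags field. */
static_assert(VRING_DESC_F_AVAIL(1) == 0x0080, "AVAIL is bit 7");
static_assert(VRING_DESC_F_USED(1) == 0x8000, "USED is bit 15");

/* Flags a driver would write to make a descriptor available. */
uint16_t desc_flags_avail(bool wrap_counter)
{
	return (uint16_t)(VRING_DESC_F_AVAIL(wrap_counter) |
			  VRING_DESC_F_USED(!wrap_counter));
}

/* True when both bits match the wrap counter, i.e. the device used it. */
bool desc_is_used(uint16_t flags, bool wrap_counter)
{
	bool avail = !!(flags & VRING_DESC_F_AVAIL(1));
	bool used = !!(flags & VRING_DESC_F_USED(1));

	return avail == used && used == wrap_counter;
}
```

Taking the wrap counter as a macro argument keeps the uapi header free of
any state; the driver-side counter itself is kept in the virtqueue, not in
the descriptor.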
* [RFC v3 2/5] virtio_ring: support creating packed ring
From: Tiwei Bie @ 2018-04-25 5:15 UTC (permalink / raw)
To: mst, jasowang, virtualization, linux-kernel, netdev; +Cc: wexu
In-Reply-To: <20180425051550.24342-1-tiwei.bie@intel.com>
This commit introduces support for creating a packed ring.
All split-ring-specific functions get a _split suffix.
Some necessary stubs for the packed ring are also added.
Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
---
drivers/virtio/virtio_ring.c | 764 ++++++++++++++++++++++++++++---------------
include/linux/virtio_ring.h | 8 +-
2 files changed, 513 insertions(+), 259 deletions(-)
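
As a worked example of the layout arithmetic used by vring_size_packed() and
vring_init_packed() in the diff below, here is a userspace sketch. The struct
and function names are stand-ins chosen for this example, with plain integer
types substituted for the __virtio types, and it assumes the ring memory
itself starts at an address that is a multiple of align.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins with the same sizes as the uapi structs. */
struct packed_desc {		/* 16 bytes */
	uint64_t addr;
	uint32_t len;
	uint16_t id;
	uint16_t flags;
};

struct packed_desc_event {	/* 4 bytes */
	uint16_t off_wrap;
	uint16_t flags;
};

/* Byte offset of the driver event suppression structure from the ring start. */
size_t driver_event_offset(unsigned int num, size_t align)
{
	/* The mask trick only works for power-of-two alignments. */
	assert(align != 0 && (align & (align - 1)) == 0);

	return (sizeof(struct packed_desc) * num + align - 1) & ~(align - 1);
}

/* Descriptor area rounded up to align, plus the two event structures. */
size_t ring_size_packed(unsigned int num, size_t align)
{
	return driver_event_offset(num, align) +
	       sizeof(struct packed_desc_event) * 2;
}
```

For a 256-entry ring with 4096-byte alignment the descriptors occupy exactly
4096 bytes, so the driver area starts at offset 4096, the device area
directly after it at 4100, and the total is 4104 bytes.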
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 71458f493cf8..e164822ca66e 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -64,8 +64,8 @@ struct vring_desc_state {
struct vring_virtqueue {
struct virtqueue vq;
- /* Actual memory layout for this queue */
- struct vring vring;
+ /* Is this a packed ring? */
+ bool packed;
/* Can we use weak barriers? */
bool weak_barriers;
@@ -79,19 +79,45 @@ struct vring_virtqueue {
/* Host publishes avail event idx */
bool event;
- /* Head of free buffer list. */
- unsigned int free_head;
/* Number we've added since last sync. */
unsigned int num_added;
/* Last used index we've seen. */
u16 last_used_idx;
- /* Last written value to avail->flags */
- u16 avail_flags_shadow;
+ union {
+ /* Available for split ring */
+ struct {
+ /* Actual memory layout for this queue. */
+ struct vring vring;
- /* Last written value to avail->idx in guest byte order */
- u16 avail_idx_shadow;
+ /* Head of free buffer list. */
+ unsigned int free_head;
+
+ /* Last written value to avail->flags */
+ u16 avail_flags_shadow;
+
+ /* Last written value to avail->idx in
+ * guest byte order. */
+ u16 avail_idx_shadow;
+ };
+
+ /* Available for packed ring */
+ struct {
+ /* Actual memory layout for this queue. */
+ struct vring_packed vring_packed;
+
+ /* Driver ring wrap counter. */
+ u8 wrap_counter;
+
+ /* Index of the next avail descriptor. */
+ unsigned int next_avail_idx;
+
+ /* Last written value to driver->flags in
+ * guest byte order. */
+ u16 event_flags_shadow;
+ };
+ };
/* How to notify other side. FIXME: commonalize hcalls! */
bool (*notify)(struct virtqueue *vq);
@@ -201,8 +227,17 @@ static dma_addr_t vring_map_single(const struct vring_virtqueue *vq,
cpu_addr, size, direction);
}
-static void vring_unmap_one(const struct vring_virtqueue *vq,
- struct vring_desc *desc)
+static int vring_mapping_error(const struct vring_virtqueue *vq,
+ dma_addr_t addr)
+{
+ if (!vring_use_dma_api(vq->vq.vdev))
+ return 0;
+
+ return dma_mapping_error(vring_dma_dev(vq), addr);
+}
+
+static void vring_unmap_one_split(const struct vring_virtqueue *vq,
+ struct vring_desc *desc)
{
u16 flags;
@@ -226,17 +261,9 @@ static void vring_unmap_one(const struct vring_virtqueue *vq,
}
}
-static int vring_mapping_error(const struct vring_virtqueue *vq,
- dma_addr_t addr)
-{
- if (!vring_use_dma_api(vq->vq.vdev))
- return 0;
-
- return dma_mapping_error(vring_dma_dev(vq), addr);
-}
-
-static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
- unsigned int total_sg, gfp_t gfp)
+static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
+ unsigned int total_sg,
+ gfp_t gfp)
{
struct vring_desc *desc;
unsigned int i;
@@ -257,14 +284,14 @@ static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
return desc;
}
-static inline int virtqueue_add(struct virtqueue *_vq,
- struct scatterlist *sgs[],
- unsigned int total_sg,
- unsigned int out_sgs,
- unsigned int in_sgs,
- void *data,
- void *ctx,
- gfp_t gfp)
+static inline int virtqueue_add_split(struct virtqueue *_vq,
+ struct scatterlist *sgs[],
+ unsigned int total_sg,
+ unsigned int out_sgs,
+ unsigned int in_sgs,
+ void *data,
+ void *ctx,
+ gfp_t gfp)
{
struct vring_virtqueue *vq = to_vvq(_vq);
struct scatterlist *sg;
@@ -303,7 +330,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
/* If the host supports indirect descriptor tables, and we have multiple
* buffers, then go indirect. FIXME: tune this threshold */
if (vq->indirect && total_sg > 1 && vq->vq.num_free)
- desc = alloc_indirect(_vq, total_sg, gfp);
+ desc = alloc_indirect_split(_vq, total_sg, gfp);
else {
desc = NULL;
WARN_ON_ONCE(total_sg > vq->vring.num && !vq->indirect);
@@ -424,7 +451,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
for (n = 0; n < total_sg; n++) {
if (i == err_idx)
break;
- vring_unmap_one(vq, &desc[i]);
+ vring_unmap_one_split(vq, &desc[i]);
i = virtio16_to_cpu(_vq->vdev, vq->vring.desc[i].next);
}
@@ -435,6 +462,355 @@ static inline int virtqueue_add(struct virtqueue *_vq,
return -EIO;
}
+static bool virtqueue_kick_prepare_split(struct virtqueue *_vq)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+ u16 new, old;
+ bool needs_kick;
+
+ START_USE(vq);
+ /* We need to expose available array entries before checking avail
+ * event. */
+ virtio_mb(vq->weak_barriers);
+
+ old = vq->avail_idx_shadow - vq->num_added;
+ new = vq->avail_idx_shadow;
+ vq->num_added = 0;
+
+#ifdef DEBUG
+ if (vq->last_add_time_valid) {
+ WARN_ON(ktime_to_ms(ktime_sub(ktime_get(),
+ vq->last_add_time)) > 100);
+ }
+ vq->last_add_time_valid = false;
+#endif
+
+ if (vq->event) {
+ needs_kick = vring_need_event(virtio16_to_cpu(_vq->vdev, vring_avail_event(&vq->vring)),
+ new, old);
+ } else {
+ needs_kick = !(vq->vring.used->flags & cpu_to_virtio16(_vq->vdev, VRING_USED_F_NO_NOTIFY));
+ }
+ END_USE(vq);
+ return needs_kick;
+}
+
+static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
+ void **ctx)
+{
+ unsigned int i, j;
+ __virtio16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT);
+
+ /* Clear data ptr. */
+ vq->desc_state[head].data = NULL;
+
+ /* Put back on free list: unmap first-level descriptors and find end */
+ i = head;
+
+ while (vq->vring.desc[i].flags & nextflag) {
+ vring_unmap_one_split(vq, &vq->vring.desc[i]);
+ i = virtio16_to_cpu(vq->vq.vdev, vq->vring.desc[i].next);
+ vq->vq.num_free++;
+ }
+
+ vring_unmap_one_split(vq, &vq->vring.desc[i]);
+ vq->vring.desc[i].next = cpu_to_virtio16(vq->vq.vdev, vq->free_head);
+ vq->free_head = head;
+
+ /* Plus final descriptor */
+ vq->vq.num_free++;
+
+ if (vq->indirect) {
+ struct vring_desc *indir_desc = vq->desc_state[head].indir_desc;
+ u32 len;
+
+ /* Free the indirect table, if any, now that it's unmapped. */
+ if (!indir_desc)
+ return;
+
+ len = virtio32_to_cpu(vq->vq.vdev, vq->vring.desc[head].len);
+
+ BUG_ON(!(vq->vring.desc[head].flags &
+ cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
+ BUG_ON(len == 0 || len % sizeof(struct vring_desc));
+
+ for (j = 0; j < len / sizeof(struct vring_desc); j++)
+ vring_unmap_one_split(vq, &indir_desc[j]);
+
+ kfree(indir_desc);
+ vq->desc_state[head].indir_desc = NULL;
+ } else if (ctx) {
+ *ctx = vq->desc_state[head].indir_desc;
+ }
+}
+
+static inline bool more_used_split(const struct vring_virtqueue *vq)
+{
+ return vq->last_used_idx != virtio16_to_cpu(vq->vq.vdev, vq->vring.used->idx);
+}
+
+static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq,
+ unsigned int *len,
+ void **ctx)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+ void *ret;
+ unsigned int i;
+ u16 last_used;
+
+ START_USE(vq);
+
+ if (unlikely(vq->broken)) {
+ END_USE(vq);
+ return NULL;
+ }
+
+ if (!more_used_split(vq)) {
+ pr_debug("No more buffers in queue\n");
+ END_USE(vq);
+ return NULL;
+ }
+
+ /* Only get used array entries after they have been exposed by host. */
+ virtio_rmb(vq->weak_barriers);
+
+ last_used = (vq->last_used_idx & (vq->vring.num - 1));
+ i = virtio32_to_cpu(_vq->vdev, vq->vring.used->ring[last_used].id);
+ *len = virtio32_to_cpu(_vq->vdev, vq->vring.used->ring[last_used].len);
+
+ if (unlikely(i >= vq->vring.num)) {
+ BAD_RING(vq, "id %u out of range\n", i);
+ return NULL;
+ }
+ if (unlikely(!vq->desc_state[i].data)) {
+ BAD_RING(vq, "id %u is not a head!\n", i);
+ return NULL;
+ }
+
+ /* detach_buf_split clears data, so grab it now. */
+ ret = vq->desc_state[i].data;
+ detach_buf_split(vq, i, ctx);
+ vq->last_used_idx++;
+ /* If we expect an interrupt for the next entry, tell host
+ * by writing event index and flush out the write before
+ * the read in the next get_buf call. */
+ if (!(vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT))
+ virtio_store_mb(vq->weak_barriers,
+ &vring_used_event(&vq->vring),
+ cpu_to_virtio16(_vq->vdev, vq->last_used_idx));
+
+#ifdef DEBUG
+ vq->last_add_time_valid = false;
+#endif
+
+ END_USE(vq);
+ return ret;
+}
+
+static void virtqueue_disable_cb_split(struct virtqueue *_vq)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+
+ if (!(vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT)) {
+ vq->avail_flags_shadow |= VRING_AVAIL_F_NO_INTERRUPT;
+ if (!vq->event)
+ vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
+ }
+}
+
+static unsigned virtqueue_enable_cb_prepare_split(struct virtqueue *_vq)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+ u16 last_used_idx;
+
+ START_USE(vq);
+
+ /* We optimistically turn back on interrupts, then check if there was
+ * more to do. */
+ /* Depending on the VIRTIO_RING_F_EVENT_IDX feature, we need to
+ * either clear the flags bit or point the event index at the next
+ * entry. Always do both to keep code simple. */
+ if (vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT) {
+ vq->avail_flags_shadow &= ~VRING_AVAIL_F_NO_INTERRUPT;
+ if (!vq->event)
+ vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
+ }
+ vring_used_event(&vq->vring) = cpu_to_virtio16(_vq->vdev, last_used_idx = vq->last_used_idx);
+ END_USE(vq);
+ return last_used_idx;
+}
+
+static bool virtqueue_poll_split(struct virtqueue *_vq, unsigned last_used_idx)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+
+ virtio_mb(vq->weak_barriers);
+ return (u16)last_used_idx != virtio16_to_cpu(_vq->vdev, vq->vring.used->idx);
+}
+
+static bool virtqueue_enable_cb_delayed_split(struct virtqueue *_vq)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+ u16 bufs;
+
+ START_USE(vq);
+
+ /* We optimistically turn back on interrupts, then check if there was
+ * more to do. */
+ /* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
+ * either clear the flags bit or point the event index at the next
+ * entry. Always update the event index to keep code simple. */
+ if (vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT) {
+ vq->avail_flags_shadow &= ~VRING_AVAIL_F_NO_INTERRUPT;
+ if (!vq->event)
+ vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
+ }
+ /* TODO: tune this threshold */
+ bufs = (u16)(vq->avail_idx_shadow - vq->last_used_idx) * 3 / 4;
+
+ virtio_store_mb(vq->weak_barriers,
+ &vring_used_event(&vq->vring),
+ cpu_to_virtio16(_vq->vdev, vq->last_used_idx + bufs));
+
+ if (unlikely((u16)(virtio16_to_cpu(_vq->vdev, vq->vring.used->idx) - vq->last_used_idx) > bufs)) {
+ END_USE(vq);
+ return false;
+ }
+
+ END_USE(vq);
+ return true;
+}
+
+static void *virtqueue_detach_unused_buf_split(struct virtqueue *_vq)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+ unsigned int i;
+ void *buf;
+
+ START_USE(vq);
+
+ for (i = 0; i < vq->vring.num; i++) {
+ if (!vq->desc_state[i].data)
+ continue;
+ /* detach_buf clears data, so grab it now. */
+ buf = vq->desc_state[i].data;
+ detach_buf_split(vq, i, NULL);
+ vq->avail_idx_shadow--;
+ vq->vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->avail_idx_shadow);
+ END_USE(vq);
+ return buf;
+ }
+ /* That should have freed everything. */
+ BUG_ON(vq->vq.num_free != vq->vring.num);
+
+ END_USE(vq);
+ return NULL;
+}
+
+/*
+ * The layout for the packed ring is a continuous chunk of memory
+ * which looks like this.
+ *
+ * struct vring_packed {
+ * // The actual descriptors (16 bytes each)
+ * struct vring_packed_desc desc[num];
+ *
+ * // Padding to the next align boundary.
+ * char pad[];
+ *
+ * // Driver Event Suppression
+ * struct vring_packed_desc_event driver;
+ *
+ * // Device Event Suppression
+ * struct vring_packed_desc_event device;
+ * };
+ */
+static inline void vring_init_packed(struct vring_packed *vr, unsigned int num,
+ void *p, unsigned long align)
+{
+ vr->num = num;
+ vr->desc = p;
+ vr->driver = (void *)(((uintptr_t)p + sizeof(struct vring_packed_desc)
+ * num + align - 1) & ~(align - 1));
+ vr->device = vr->driver + 1;
+}
+
+static inline unsigned vring_size_packed(unsigned int num, unsigned long align)
+{
+ return ((sizeof(struct vring_packed_desc) * num + align - 1)
+ & ~(align - 1)) + sizeof(struct vring_packed_desc_event) * 2;
+}
+
+static inline int virtqueue_add_packed(struct virtqueue *_vq,
+ struct scatterlist *sgs[],
+ unsigned int total_sg,
+ unsigned int out_sgs,
+ unsigned int in_sgs,
+ void *data,
+ void *ctx,
+ gfp_t gfp)
+{
+ return -EIO;
+}
+
+static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
+{
+ return false;
+}
+
+static inline bool more_used_packed(const struct vring_virtqueue *vq)
+{
+ return false;
+}
+
+static void *virtqueue_get_buf_ctx_packed(struct virtqueue *_vq,
+ unsigned int *len,
+ void **ctx)
+{
+ return NULL;
+}
+
+static void virtqueue_disable_cb_packed(struct virtqueue *_vq)
+{
+}
+
+static unsigned virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
+{
+ return 0;
+}
+
+static bool virtqueue_poll_packed(struct virtqueue *_vq, unsigned last_used_idx)
+{
+ return false;
+}
+
+static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
+{
+ return false;
+}
+
+static void *virtqueue_detach_unused_buf_packed(struct virtqueue *_vq)
+{
+ return NULL;
+}
+
+static inline int virtqueue_add(struct virtqueue *_vq,
+ struct scatterlist *sgs[],
+ unsigned int total_sg,
+ unsigned int out_sgs,
+ unsigned int in_sgs,
+ void *data,
+ void *ctx,
+ gfp_t gfp)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+
+ return vq->packed ? virtqueue_add_packed(_vq, sgs, total_sg, out_sgs,
+ in_sgs, data, ctx, gfp) :
+ virtqueue_add_split(_vq, sgs, total_sg, out_sgs,
+ in_sgs, data, ctx, gfp);
+}
+
/**
* virtqueue_add_sgs - expose buffers to other end
* @vq: the struct virtqueue we're talking about.
@@ -551,34 +927,9 @@ EXPORT_SYMBOL_GPL(virtqueue_add_inbuf_ctx);
bool virtqueue_kick_prepare(struct virtqueue *_vq)
{
struct vring_virtqueue *vq = to_vvq(_vq);
- u16 new, old;
- bool needs_kick;
- START_USE(vq);
- /* We need to expose available array entries before checking avail
- * event. */
- virtio_mb(vq->weak_barriers);
-
- old = vq->avail_idx_shadow - vq->num_added;
- new = vq->avail_idx_shadow;
- vq->num_added = 0;
-
-#ifdef DEBUG
- if (vq->last_add_time_valid) {
- WARN_ON(ktime_to_ms(ktime_sub(ktime_get(),
- vq->last_add_time)) > 100);
- }
- vq->last_add_time_valid = false;
-#endif
-
- if (vq->event) {
- needs_kick = vring_need_event(virtio16_to_cpu(_vq->vdev, vring_avail_event(&vq->vring)),
- new, old);
- } else {
- needs_kick = !(vq->vring.used->flags & cpu_to_virtio16(_vq->vdev, VRING_USED_F_NO_NOTIFY));
- }
- END_USE(vq);
- return needs_kick;
+ return vq->packed ? virtqueue_kick_prepare_packed(_vq) :
+ virtqueue_kick_prepare_split(_vq);
}
EXPORT_SYMBOL_GPL(virtqueue_kick_prepare);
@@ -626,58 +977,9 @@ bool virtqueue_kick(struct virtqueue *vq)
}
EXPORT_SYMBOL_GPL(virtqueue_kick);
-static void detach_buf(struct vring_virtqueue *vq, unsigned int head,
- void **ctx)
-{
- unsigned int i, j;
- __virtio16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT);
-
- /* Clear data ptr. */
- vq->desc_state[head].data = NULL;
-
- /* Put back on free list: unmap first-level descriptors and find end */
- i = head;
-
- while (vq->vring.desc[i].flags & nextflag) {
- vring_unmap_one(vq, &vq->vring.desc[i]);
- i = virtio16_to_cpu(vq->vq.vdev, vq->vring.desc[i].next);
- vq->vq.num_free++;
- }
-
- vring_unmap_one(vq, &vq->vring.desc[i]);
- vq->vring.desc[i].next = cpu_to_virtio16(vq->vq.vdev, vq->free_head);
- vq->free_head = head;
-
- /* Plus final descriptor */
- vq->vq.num_free++;
-
- if (vq->indirect) {
- struct vring_desc *indir_desc = vq->desc_state[head].indir_desc;
- u32 len;
-
- /* Free the indirect table, if any, now that it's unmapped. */
- if (!indir_desc)
- return;
-
- len = virtio32_to_cpu(vq->vq.vdev, vq->vring.desc[head].len);
-
- BUG_ON(!(vq->vring.desc[head].flags &
- cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
- BUG_ON(len == 0 || len % sizeof(struct vring_desc));
-
- for (j = 0; j < len / sizeof(struct vring_desc); j++)
- vring_unmap_one(vq, &indir_desc[j]);
-
- kfree(indir_desc);
- vq->desc_state[head].indir_desc = NULL;
- } else if (ctx) {
- *ctx = vq->desc_state[head].indir_desc;
- }
-}
-
static inline bool more_used(const struct vring_virtqueue *vq)
{
- return vq->last_used_idx != virtio16_to_cpu(vq->vq.vdev, vq->vring.used->idx);
+ return vq->packed ? more_used_packed(vq) : more_used_split(vq);
}
/**
@@ -700,57 +1002,9 @@ void *virtqueue_get_buf_ctx(struct virtqueue *_vq, unsigned int *len,
void **ctx)
{
struct vring_virtqueue *vq = to_vvq(_vq);
- void *ret;
- unsigned int i;
- u16 last_used;
- START_USE(vq);
-
- if (unlikely(vq->broken)) {
- END_USE(vq);
- return NULL;
- }
-
- if (!more_used(vq)) {
- pr_debug("No more buffers in queue\n");
- END_USE(vq);
- return NULL;
- }
-
- /* Only get used array entries after they have been exposed by host. */
- virtio_rmb(vq->weak_barriers);
-
- last_used = (vq->last_used_idx & (vq->vring.num - 1));
- i = virtio32_to_cpu(_vq->vdev, vq->vring.used->ring[last_used].id);
- *len = virtio32_to_cpu(_vq->vdev, vq->vring.used->ring[last_used].len);
-
- if (unlikely(i >= vq->vring.num)) {
- BAD_RING(vq, "id %u out of range\n", i);
- return NULL;
- }
- if (unlikely(!vq->desc_state[i].data)) {
- BAD_RING(vq, "id %u is not a head!\n", i);
- return NULL;
- }
-
- /* detach_buf clears data, so grab it now. */
- ret = vq->desc_state[i].data;
- detach_buf(vq, i, ctx);
- vq->last_used_idx++;
- /* If we expect an interrupt for the next entry, tell host
- * by writing event index and flush out the write before
- * the read in the next get_buf call. */
- if (!(vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT))
- virtio_store_mb(vq->weak_barriers,
- &vring_used_event(&vq->vring),
- cpu_to_virtio16(_vq->vdev, vq->last_used_idx));
-
-#ifdef DEBUG
- vq->last_add_time_valid = false;
-#endif
-
- END_USE(vq);
- return ret;
+ return vq->packed ? virtqueue_get_buf_ctx_packed(_vq, len, ctx) :
+ virtqueue_get_buf_ctx_split(_vq, len, ctx);
}
EXPORT_SYMBOL_GPL(virtqueue_get_buf_ctx);
@@ -772,12 +1026,10 @@ void virtqueue_disable_cb(struct virtqueue *_vq)
{
struct vring_virtqueue *vq = to_vvq(_vq);
- if (!(vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT)) {
- vq->avail_flags_shadow |= VRING_AVAIL_F_NO_INTERRUPT;
- if (!vq->event)
- vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
- }
-
+ if (vq->packed)
+ virtqueue_disable_cb_packed(_vq);
+ else
+ virtqueue_disable_cb_split(_vq);
}
EXPORT_SYMBOL_GPL(virtqueue_disable_cb);
@@ -796,23 +1048,9 @@ EXPORT_SYMBOL_GPL(virtqueue_disable_cb);
unsigned virtqueue_enable_cb_prepare(struct virtqueue *_vq)
{
struct vring_virtqueue *vq = to_vvq(_vq);
- u16 last_used_idx;
- START_USE(vq);
-
- /* We optimistically turn back on interrupts, then check if there was
- * more to do. */
- /* Depending on the VIRTIO_RING_F_EVENT_IDX feature, we need to
- * either clear the flags bit or point the event index at the next
- * entry. Always do both to keep code simple. */
- if (vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT) {
- vq->avail_flags_shadow &= ~VRING_AVAIL_F_NO_INTERRUPT;
- if (!vq->event)
- vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
- }
- vring_used_event(&vq->vring) = cpu_to_virtio16(_vq->vdev, last_used_idx = vq->last_used_idx);
- END_USE(vq);
- return last_used_idx;
+ return vq->packed ? virtqueue_enable_cb_prepare_packed(_vq) :
+ virtqueue_enable_cb_prepare_split(_vq);
}
EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare);
@@ -829,8 +1067,8 @@ bool virtqueue_poll(struct virtqueue *_vq, unsigned last_used_idx)
{
struct vring_virtqueue *vq = to_vvq(_vq);
- virtio_mb(vq->weak_barriers);
- return (u16)last_used_idx != virtio16_to_cpu(_vq->vdev, vq->vring.used->idx);
+ return vq->packed ? virtqueue_poll_packed(_vq, last_used_idx) :
+ virtqueue_poll_split(_vq, last_used_idx);
}
EXPORT_SYMBOL_GPL(virtqueue_poll);
@@ -868,34 +1106,9 @@ EXPORT_SYMBOL_GPL(virtqueue_enable_cb);
bool virtqueue_enable_cb_delayed(struct virtqueue *_vq)
{
struct vring_virtqueue *vq = to_vvq(_vq);
- u16 bufs;
- START_USE(vq);
-
- /* We optimistically turn back on interrupts, then check if there was
- * more to do. */
- /* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
- * either clear the flags bit or point the event index at the next
- * entry. Always update the event index to keep code simple. */
- if (vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT) {
- vq->avail_flags_shadow &= ~VRING_AVAIL_F_NO_INTERRUPT;
- if (!vq->event)
- vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
- }
- /* TODO: tune this threshold */
- bufs = (u16)(vq->avail_idx_shadow - vq->last_used_idx) * 3 / 4;
-
- virtio_store_mb(vq->weak_barriers,
- &vring_used_event(&vq->vring),
- cpu_to_virtio16(_vq->vdev, vq->last_used_idx + bufs));
-
- if (unlikely((u16)(virtio16_to_cpu(_vq->vdev, vq->vring.used->idx) - vq->last_used_idx) > bufs)) {
- END_USE(vq);
- return false;
- }
-
- END_USE(vq);
- return true;
+ return vq->packed ? virtqueue_enable_cb_delayed_packed(_vq) :
+ virtqueue_enable_cb_delayed_split(_vq);
}
EXPORT_SYMBOL_GPL(virtqueue_enable_cb_delayed);
@@ -910,27 +1123,9 @@ EXPORT_SYMBOL_GPL(virtqueue_enable_cb_delayed);
void *virtqueue_detach_unused_buf(struct virtqueue *_vq)
{
struct vring_virtqueue *vq = to_vvq(_vq);
- unsigned int i;
- void *buf;
- START_USE(vq);
-
- for (i = 0; i < vq->vring.num; i++) {
- if (!vq->desc_state[i].data)
- continue;
- /* detach_buf clears data, so grab it now. */
- buf = vq->desc_state[i].data;
- detach_buf(vq, i, NULL);
- vq->avail_idx_shadow--;
- vq->vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->avail_idx_shadow);
- END_USE(vq);
- return buf;
- }
- /* That should have freed everything. */
- BUG_ON(vq->vq.num_free != vq->vring.num);
-
- END_USE(vq);
- return NULL;
+ return vq->packed ? virtqueue_detach_unused_buf_packed(_vq) :
+ virtqueue_detach_unused_buf_split(_vq);
}
EXPORT_SYMBOL_GPL(virtqueue_detach_unused_buf);
@@ -955,7 +1150,8 @@ irqreturn_t vring_interrupt(int irq, void *_vq)
EXPORT_SYMBOL_GPL(vring_interrupt);
struct virtqueue *__vring_new_virtqueue(unsigned int index,
- struct vring vring,
+ union vring_union vring,
+ bool packed,
struct virtio_device *vdev,
bool weak_barriers,
bool context,
@@ -963,19 +1159,20 @@ struct virtqueue *__vring_new_virtqueue(unsigned int index,
void (*callback)(struct virtqueue *),
const char *name)
{
- unsigned int i;
+ unsigned int num, i;
struct vring_virtqueue *vq;
- vq = kmalloc(sizeof(*vq) + vring.num * sizeof(struct vring_desc_state),
+ num = packed ? vring.vring_packed.num : vring.vring_split.num;
+
+ vq = kmalloc(sizeof(*vq) + num * sizeof(struct vring_desc_state),
GFP_KERNEL);
if (!vq)
return NULL;
- vq->vring = vring;
vq->vq.callback = callback;
vq->vq.vdev = vdev;
vq->vq.name = name;
- vq->vq.num_free = vring.num;
+ vq->vq.num_free = num;
vq->vq.index = index;
vq->we_own_ring = false;
vq->queue_dma_addr = 0;
@@ -984,9 +1181,8 @@ struct virtqueue *__vring_new_virtqueue(unsigned int index,
vq->weak_barriers = weak_barriers;
vq->broken = false;
vq->last_used_idx = 0;
- vq->avail_flags_shadow = 0;
- vq->avail_idx_shadow = 0;
vq->num_added = 0;
+ vq->packed = packed;
list_add_tail(&vq->vq.list, &vdev->vqs);
#ifdef DEBUG
vq->in_use = false;
@@ -997,18 +1193,37 @@ struct virtqueue *__vring_new_virtqueue(unsigned int index,
!context;
vq->event = virtio_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX);
+ if (vq->packed) {
+ vq->vring_packed = vring.vring_packed;
+ vq->next_avail_idx = 0;
+ vq->wrap_counter = 1;
+ vq->event_flags_shadow = 0;
+ } else {
+ vq->vring = vring.vring_split;
+ vq->avail_flags_shadow = 0;
+ vq->avail_idx_shadow = 0;
+
+ /* Put everything in free lists. */
+ vq->free_head = 0;
+ for (i = 0; i < num-1; i++)
+ vq->vring.desc[i].next = cpu_to_virtio16(vdev, i + 1);
+ }
+
/* No callback? Tell other side not to bother us. */
if (!callback) {
- vq->avail_flags_shadow |= VRING_AVAIL_F_NO_INTERRUPT;
- if (!vq->event)
- vq->vring.avail->flags = cpu_to_virtio16(vdev, vq->avail_flags_shadow);
+ if (packed) {
+ vq->event_flags_shadow = VRING_EVENT_F_DISABLE;
+ vq->vring_packed.driver->flags = cpu_to_virtio16(vdev,
+ vq->event_flags_shadow);
+ } else {
+ vq->avail_flags_shadow |= VRING_AVAIL_F_NO_INTERRUPT;
+ if (!vq->event)
+ vq->vring.avail->flags = cpu_to_virtio16(vdev,
+ vq->avail_flags_shadow);
+ }
}
- /* Put everything in free lists. */
- vq->free_head = 0;
- for (i = 0; i < vring.num-1; i++)
- vq->vring.desc[i].next = cpu_to_virtio16(vdev, i + 1);
- memset(vq->desc_state, 0, vring.num * sizeof(struct vring_desc_state));
+ memset(vq->desc_state, 0, num * sizeof(struct vring_desc_state));
return &vq->vq;
}
@@ -1056,6 +1271,12 @@ static void vring_free_queue(struct virtio_device *vdev, size_t size,
}
}
+static inline int
+__vring_size(unsigned int num, unsigned long align, bool packed)
+{
+ return packed ? vring_size_packed(num, align) : vring_size(num, align);
+}
+
struct virtqueue *vring_create_virtqueue(
unsigned int index,
unsigned int num,
@@ -1072,7 +1293,8 @@ struct virtqueue *vring_create_virtqueue(
void *queue = NULL;
dma_addr_t dma_addr;
size_t queue_size_in_bytes;
- struct vring vring;
+ union vring_union vring;
+ bool packed;
/* We assume num is a power of 2. */
if (num & (num - 1)) {
@@ -1080,9 +1302,13 @@ struct virtqueue *vring_create_virtqueue(
return NULL;
}
+ packed = virtio_has_feature(vdev, VIRTIO_F_RING_PACKED);
+
/* TODO: allocate each queue chunk individually */
- for (; num && vring_size(num, vring_align) > PAGE_SIZE; num /= 2) {
- queue = vring_alloc_queue(vdev, vring_size(num, vring_align),
+ for (; num && __vring_size(num, vring_align, packed) > PAGE_SIZE;
+ num /= 2) {
+ queue = vring_alloc_queue(vdev, __vring_size(num, vring_align,
+ packed),
&dma_addr,
GFP_KERNEL|__GFP_NOWARN|__GFP_ZERO);
if (queue)
@@ -1094,17 +1320,21 @@ struct virtqueue *vring_create_virtqueue(
if (!queue) {
/* Try to get a single page. You are my only hope! */
- queue = vring_alloc_queue(vdev, vring_size(num, vring_align),
+ queue = vring_alloc_queue(vdev, __vring_size(num, vring_align,
+ packed),
&dma_addr, GFP_KERNEL|__GFP_ZERO);
}
if (!queue)
return NULL;
- queue_size_in_bytes = vring_size(num, vring_align);
- vring_init(&vring, num, queue, vring_align);
+ queue_size_in_bytes = __vring_size(num, vring_align, packed);
+ if (packed)
+ vring_init_packed(&vring.vring_packed, num, queue, vring_align);
+ else
+ vring_init(&vring.vring_split, num, queue, vring_align);
- vq = __vring_new_virtqueue(index, vring, vdev, weak_barriers, context,
- notify, callback, name);
+ vq = __vring_new_virtqueue(index, vring, packed, vdev, weak_barriers,
+ context, notify, callback, name);
if (!vq) {
vring_free_queue(vdev, queue_size_in_bytes, queue,
dma_addr);
@@ -1130,10 +1360,17 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
void (*callback)(struct virtqueue *vq),
const char *name)
{
- struct vring vring;
- vring_init(&vring, num, pages, vring_align);
- return __vring_new_virtqueue(index, vring, vdev, weak_barriers, context,
- notify, callback, name);
+ union vring_union vring;
+ bool packed;
+
+ packed = virtio_has_feature(vdev, VIRTIO_F_RING_PACKED);
+ if (packed)
+ vring_init_packed(&vring.vring_packed, num, pages, vring_align);
+ else
+ vring_init(&vring.vring_split, num, pages, vring_align);
+
+ return __vring_new_virtqueue(index, vring, packed, vdev, weak_barriers,
+ context, notify, callback, name);
}
EXPORT_SYMBOL_GPL(vring_new_virtqueue);
@@ -1143,7 +1380,9 @@ void vring_del_virtqueue(struct virtqueue *_vq)
if (vq->we_own_ring) {
vring_free_queue(vq->vq.vdev, vq->queue_size_in_bytes,
- vq->vring.desc, vq->queue_dma_addr);
+ vq->packed ? (void *)vq->vring_packed.desc :
+ (void *)vq->vring.desc,
+ vq->queue_dma_addr);
}
list_del(&_vq->list);
kfree(vq);
@@ -1185,7 +1424,7 @@ unsigned int virtqueue_get_vring_size(struct virtqueue *_vq)
struct vring_virtqueue *vq = to_vvq(_vq);
- return vq->vring.num;
+ return vq->packed ? vq->vring_packed.num : vq->vring.num;
}
EXPORT_SYMBOL_GPL(virtqueue_get_vring_size);
@@ -1228,6 +1467,10 @@ dma_addr_t virtqueue_get_avail_addr(struct virtqueue *_vq)
BUG_ON(!vq->we_own_ring);
+ if (vq->packed)
+ return vq->queue_dma_addr + ((char *)vq->vring_packed.driver -
+ (char *)vq->vring_packed.desc);
+
return vq->queue_dma_addr +
((char *)vq->vring.avail - (char *)vq->vring.desc);
}
@@ -1239,11 +1482,16 @@ dma_addr_t virtqueue_get_used_addr(struct virtqueue *_vq)
BUG_ON(!vq->we_own_ring);
+ if (vq->packed)
+ return vq->queue_dma_addr + ((char *)vq->vring_packed.device -
+ (char *)vq->vring_packed.desc);
+
return vq->queue_dma_addr +
((char *)vq->vring.used - (char *)vq->vring.desc);
}
EXPORT_SYMBOL_GPL(virtqueue_get_used_addr);
+/* Only available for split ring */
const struct vring *virtqueue_get_vring(struct virtqueue *vq)
{
return &to_vvq(vq)->vring;
diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
index bbf32524ab27..a0075894ad16 100644
--- a/include/linux/virtio_ring.h
+++ b/include/linux/virtio_ring.h
@@ -60,6 +60,11 @@ static inline void virtio_store_mb(bool weak_barriers,
struct virtio_device;
struct virtqueue;
+union vring_union {
+ struct vring vring_split;
+ struct vring_packed vring_packed;
+};
+
/*
* Creates a virtqueue and allocates the descriptor ring. If
* may_reduce_num is set, then this may allocate a smaller ring than
@@ -79,7 +84,8 @@ struct virtqueue *vring_create_virtqueue(unsigned int index,
/* Creates a virtqueue with a custom layout. */
struct virtqueue *__vring_new_virtqueue(unsigned int index,
- struct vring vring,
+ union vring_union vring,
+ bool packed,
struct virtio_device *vdev,
bool weak_barriers,
bool ctx,
--
2.11.0
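
For readers skimming the diff, the pattern applied throughout is the same: each exported virtqueue_* entry point keeps its signature and branches once on vq->packed to a *_packed or *_split helper, while the ring layout travels through a union alongside a bool that says which member is live. Below is a minimal standalone sketch of that pattern. It is not the kernel code: every demo_* name is hypothetical, and the structs are stripped down to the fields needed to show the dispatch.

```c
/* Illustrative sketch only: all demo_* names are hypothetical and are
 * not the kernel's definitions. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_split  { unsigned int num; unsigned int free_head; };
struct demo_packed { unsigned int num; unsigned int wrap_counter; };

/* Both layouts share storage; only one member is valid at a time. */
union demo_ring_union {
	struct demo_split  split;
	struct demo_packed packed;
};

struct demo_vq {
	bool packed;                  /* selects the live union member */
	union demo_ring_union ring;
};

static unsigned int demo_size_split(const struct demo_vq *vq)
{
	return vq->ring.split.num;
}

static unsigned int demo_size_packed(const struct demo_vq *vq)
{
	return vq->ring.packed.num;
}

/* Public entry point: one branch on the layout flag, then delegate. */
unsigned int demo_get_ring_size(const struct demo_vq *vq)
{
	assert(vq != NULL);
	return vq->packed ? demo_size_packed(vq) : demo_size_split(vq);
}

/* The layout is chosen once at creation time and recorded in the flag. */
struct demo_vq demo_new_vq(unsigned int num, bool packed)
{
	struct demo_vq vq = { .packed = packed };

	if (packed) {
		vq.ring.packed.num = num;
		vq.ring.packed.wrap_counter = 1;
	} else {
		vq.ring.split.num = num;
		vq.ring.split.free_head = 0;
	}
	return vq;
}
```

Because the flag is fixed when the queue is created, callers never need to know which layout is in use; the cost is one predictable branch per call.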