Linux virtualization list
* Re: [PATCH 0/6] virtio-console: spec compliance fixes
From: Michael S. Tsirkin @ 2018-04-24 18:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Arnd Bergmann, Amit Shah, Greg Kroah-Hartman, stable,
	virtualization
In-Reply-To: <1524248223-393618-1-git-send-email-mst@redhat.com>

On Fri, Apr 20, 2018 at 09:17:59PM +0300, Michael S. Tsirkin wrote:
> Turns out virtio console tries to take a buffer out of an active vq.
> Works by sheer luck, and is explicitly forbidden by spec.  And while
> going over it I saw that error handling is also broken -
> failure is easy to trigger if I force allocations to fail.
> 
> Lightly tested.

Amit - any feedback before I push these patches?

> Michael S. Tsirkin (6):
>   virtio_console: don't tie bufs to a vq
>   virtio: add ability to iterate over vqs
>   virtio_console: free buffers after reset
>   virtio_console: drop custom control queue cleanup
>   virtio_console: move removal code
>   virtio_console: reset on out of memory
> 
>  drivers/char/virtio_console.c | 155 ++++++++++++++++++++----------------------
>  include/linux/virtio.h        |   3 +
>  2 files changed, 75 insertions(+), 83 deletions(-)
> 
> -- 
> MST
> 


* Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
From: Michal Hocko @ 2018-04-24 17:38 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
	virtualization, linux-mm, edumazet, Andrew Morton, David Miller,
	Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804241319390.28995@file01.intranet.prod.int.rdu2.redhat.com>

On Tue 24-04-18 13:28:49, Mikulas Patocka wrote:
> 
> 
> On Tue, 24 Apr 2018, Michal Hocko wrote:
> 
> > On Tue 24-04-18 13:00:11, Mikulas Patocka wrote:
> > > 
> > > 
> > > On Tue, 24 Apr 2018, Michal Hocko wrote:
> > > 
> > > > On Tue 24-04-18 11:50:30, Mikulas Patocka wrote:
> > > > > 
> > > > > 
> > > > > On Tue, 24 Apr 2018, Michal Hocko wrote:
> > > > > 
> > > > > > On Mon 23-04-18 20:06:16, Mikulas Patocka wrote:
> > > > > > [...]
> > > > > > > @@ -404,6 +405,12 @@ void *kvmalloc_node(size_t size, gfp_t f
> > > > > > >  	 */
> > > > > > >  	WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
> > > > > > >  
> > > > > > > +#ifdef CONFIG_DEBUG_SG
> > > > > > > +	/* Catch bugs when the caller uses DMA API on the result of kvmalloc. */
> > > > > > > +	if (!(prandom_u32_max(2) & 1))
> > > > > > > +		goto do_vmalloc;
> > > > > > > +#endif
> > > > > > 
> > > > > > I really do not think there is anything DEBUG_SG specific here. Why don't
> > > > > > you simply follow the should_failslab path or even reuse the function?
> > > > > 
> > > > > CONFIG_DEBUG_SG is enabled by default in the RHEL and Fedora debug kernels (if
> > > > > you don't like CONFIG_DEBUG_SG, pick any other option that is enabled
> > > > > there).
> > > > 
> > > > Are you telling me that you are shaping debugging functionality based
> > > > on what RHEL has enabled? And you call me evil. This is just ridiculous.
> > > > 
> > > > > The fault-injection framework is off by default and it must be explicitly
> > > > > enabled and configured by the user - and most users won't enable it.
> > > > 
> > > > It can be enabled easily. And if you care enough for your debugging
> > > > kernel then just make it enabled unconditionally.
> > > 
> > > So, should we add a new option CONFIG_KVMALLOC_FALLBACK_DEFAULT? I'm not 
> > > quite sure if 3 lines of debugging code need an extra option, but if you 
> > > don't want to reuse any existing debug option, it may be possible. Adding 
> > > it to the RHEL debug kernel would be trivial.
> > 
> > Wouldn't it be equally trivial to simply enable the fault injection? You
> > would get additional failure paths testing as a bonus.
> 
> The RHEL and Fedora debugging kernels are compiled with fault injection. 
> But the fault-injection framework will do nothing unless it is enabled by 
> a kernel parameter or debugfs write.
> 
> Most users don't know about the fault injection kernel parameters or
> debugfs files and won't enable it. We need a CONFIG_ option to enable it
> by default in the debugging kernels (and we could add a kernel parameter 
> to override the default, fine-tune the fallback probability etc.)

If it is a real issue to install the debugging kernel with the required
kernel parameter, then a config option for the default-on setting makes
sense to me.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
From: Mikulas Patocka @ 2018-04-24 17:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
	virtualization, linux-mm, edumazet, Andrew Morton, David Miller,
	Vlastimil Babka
In-Reply-To: <20180424170349.GQ17484@dhcp22.suse.cz>



On Tue, 24 Apr 2018, Michal Hocko wrote:

> On Tue 24-04-18 13:00:11, Mikulas Patocka wrote:
> > 
> > 
> > On Tue, 24 Apr 2018, Michal Hocko wrote:
> > 
> > > On Tue 24-04-18 11:50:30, Mikulas Patocka wrote:
> > > > 
> > > > 
> > > > On Tue, 24 Apr 2018, Michal Hocko wrote:
> > > > 
> > > > > On Mon 23-04-18 20:06:16, Mikulas Patocka wrote:
> > > > > [...]
> > > > > > @@ -404,6 +405,12 @@ void *kvmalloc_node(size_t size, gfp_t f
> > > > > >  	 */
> > > > > >  	WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
> > > > > >  
> > > > > > +#ifdef CONFIG_DEBUG_SG
> > > > > > +	/* Catch bugs when the caller uses DMA API on the result of kvmalloc. */
> > > > > > +	if (!(prandom_u32_max(2) & 1))
> > > > > > +		goto do_vmalloc;
> > > > > > +#endif
> > > > > 
> > > > > I really do not think there is anything DEBUG_SG specific here. Why don't
> > > > > you simply follow the should_failslab path or even reuse the function?
> > > > 
> > > > CONFIG_DEBUG_SG is enabled by default in the RHEL and Fedora debug kernels (if
> > > > you don't like CONFIG_DEBUG_SG, pick any other option that is enabled
> > > > there).
> > > 
> > > Are you telling me that you are shaping debugging functionality based
> > > on what RHEL has enabled? And you call me evil. This is just ridiculous.
> > > 
> > > > The fault-injection framework is off by default and it must be explicitly
> > > > enabled and configured by the user - and most users won't enable it.
> > > 
> > > It can be enabled easily. And if you care enough for your debugging
> > > kernel then just make it enabled unconditionally.
> > 
> > So, should we add a new option CONFIG_KVMALLOC_FALLBACK_DEFAULT? I'm not 
> > quite sure if 3 lines of debugging code need an extra option, but if you 
> > don't want to reuse any existing debug option, it may be possible. Adding 
> > it to the RHEL debug kernel would be trivial.
> 
> Wouldn't it be equally trivial to simply enable the fault injection? You
> would get additional failure paths testing as a bonus.

The RHEL and Fedora debugging kernels are compiled with fault injection. 
But the fault-injection framework will do nothing unless it is enabled by 
a kernel parameter or debugfs write.

Most users don't know about the fault injection kernel parameters or
debugfs files and won't enable it. We need a CONFIG_ option to enable it
by default in the debugging kernels (and we could add a kernel parameter 
to override the default, fine-tune the fallback probability etc.)

Mikulas
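
For illustration only, the "build-time default plus runtime override" idea
proposed above could be modelled in plain userspace C roughly as follows.
This is a sketch under assumptions, not kernel code: SK_FALLBACK_DEFAULT_ON
stands in for the proposed CONFIG_ option, sk_fallback_percent() and its
override argument stand in for a kernel parameter, and the 50% default is
taken from the 1/2 probability used in the patch. All of these names are
hypothetical.

```c
/* Hypothetical build-time default; a debug kernel config would set it to 1. */
#ifndef SK_FALLBACK_DEFAULT_ON
#define SK_FALLBACK_DEFAULT_ON 1
#endif

/*
 * Resolve the fallback probability in percent.
 * A negative override means "no parameter given, use the build-time default".
 * Values above 100 are clamped to 100.
 */
static int sk_fallback_percent(int override_percent)
{
    if (override_percent < 0)
        return SK_FALLBACK_DEFAULT_ON ? 50 : 0;
    if (override_percent > 100)
        return 100;
    return override_percent;
}
```

With this shape, a debug build gets the fallback without any user action,
while a parameter can still turn it off (0) or tune it.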


* Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
From: Matthew Wilcox @ 2018-04-24 17:16 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Christoph Lameter, dm-devel, eric.dumazet, mst, netdev,
	linux-kernel, Michal Hocko, Pekka Enberg, linux-mm, edumazet,
	David Rientjes, Joonsoo Kim, Andrew Morton, virtualization,
	David Miller, Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804240818530.28016@file01.intranet.prod.int.rdu2.redhat.com>

On Tue, Apr 24, 2018 at 08:29:14AM -0400, Mikulas Patocka wrote:
> 
> 
> On Mon, 23 Apr 2018, Matthew Wilcox wrote:
> 
> > On Mon, Apr 23, 2018 at 08:06:16PM -0400, Mikulas Patocka wrote:
> > > Some bugs (such as buffer overflows) are better detected
> > > with kmalloc code, so we must test the kmalloc path too.
> > 
> > Well now, this brings up another item for the collective TODO list --
> > implement redzone checks for vmalloc.  Unless this is something already
> > taken care of by kasan or similar.
> 
> The kmalloc overflow testing is also not ideal - it rounds the size up to 
> the next slab size and detects buffer overflows only at this boundary.
> 
> Some time ago, I made a "kmalloc guard" patch that places a magic number
> immediately after the requested size - so that it can detect overflows at
> byte granularity
> ( https://www.redhat.com/archives/dm-devel/2014-September/msg00018.html )
> 
> That patch found a bug in crypto code:
> ( http://lkml.iu.edu/hypermail/linux/kernel/1409.1/02325.html )

Is it still worth doing this, now we have kasan?


* Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
From: Michal Hocko @ 2018-04-24 17:03 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
	virtualization, linux-mm, edumazet, Andrew Morton, David Miller,
	Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804241250350.28995@file01.intranet.prod.int.rdu2.redhat.com>

On Tue 24-04-18 13:00:11, Mikulas Patocka wrote:
> 
> 
> On Tue, 24 Apr 2018, Michal Hocko wrote:
> 
> > On Tue 24-04-18 11:50:30, Mikulas Patocka wrote:
> > > 
> > > 
> > > On Tue, 24 Apr 2018, Michal Hocko wrote:
> > > 
> > > > On Mon 23-04-18 20:06:16, Mikulas Patocka wrote:
> > > > [...]
> > > > > @@ -404,6 +405,12 @@ void *kvmalloc_node(size_t size, gfp_t f
> > > > >  	 */
> > > > >  	WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
> > > > >  
> > > > > +#ifdef CONFIG_DEBUG_SG
> > > > > +	/* Catch bugs when the caller uses DMA API on the result of kvmalloc. */
> > > > > +	if (!(prandom_u32_max(2) & 1))
> > > > > +		goto do_vmalloc;
> > > > > +#endif
> > > > 
> > > > I really do not think there is anything DEBUG_SG specific here. Why don't
> > > > you simply follow the should_failslab path or even reuse the function?
> > > 
> > > CONFIG_DEBUG_SG is enabled by default in the RHEL and Fedora debug kernels (if
> > > you don't like CONFIG_DEBUG_SG, pick any other option that is enabled
> > > there).
> > 
> > Are you telling me that you are shaping debugging functionality based
> > on what RHEL has enabled? And you call me evil. This is just ridiculous.
> > 
> > > The fault-injection framework is off by default and it must be explicitly
> > > enabled and configured by the user - and most users won't enable it.
> > 
> > It can be enabled easily. And if you care enough for your debugging
> > kernel then just make it enabled unconditionally.
> 
> So, should we add a new option CONFIG_KVMALLOC_FALLBACK_DEFAULT? I'm not 
> quite sure if 3 lines of debugging code need an extra option, but if you 
> don't want to reuse any existing debug option, it may be possible. Adding 
> it to the RHEL debug kernel would be trivial.

Wouldn't it be equally trivial to simply enable the fault injection? You
would get additional failure paths testing as a bonus.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
From: Mikulas Patocka @ 2018-04-24 17:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
	virtualization, linux-mm, edumazet, Andrew Morton, David Miller,
	Vlastimil Babka
In-Reply-To: <20180424162906.GM17484@dhcp22.suse.cz>



On Tue, 24 Apr 2018, Michal Hocko wrote:

> On Tue 24-04-18 11:50:30, Mikulas Patocka wrote:
> > 
> > 
> > On Tue, 24 Apr 2018, Michal Hocko wrote:
> > 
> > > On Mon 23-04-18 20:06:16, Mikulas Patocka wrote:
> > > [...]
> > > > @@ -404,6 +405,12 @@ void *kvmalloc_node(size_t size, gfp_t f
> > > >  	 */
> > > >  	WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
> > > >  
> > > > +#ifdef CONFIG_DEBUG_SG
> > > > +	/* Catch bugs when the caller uses DMA API on the result of kvmalloc. */
> > > > +	if (!(prandom_u32_max(2) & 1))
> > > > +		goto do_vmalloc;
> > > > +#endif
> > > 
> > > I really do not think there is anything DEBUG_SG specific here. Why don't
> > > you simply follow the should_failslab path or even reuse the function?
> > 
> > CONFIG_DEBUG_SG is enabled by default in the RHEL and Fedora debug kernels (if
> > you don't like CONFIG_DEBUG_SG, pick any other option that is enabled
> > there).
> 
> Are you telling me that you are shaping debugging functionality based
> on what RHEL has enabled? And you call me evil. This is just ridiculous.
> 
> > The fault-injection framework is off by default and it must be explicitly
> > enabled and configured by the user - and most users won't enable it.
> 
> It can be enabled easily. And if you care enough for your debugging
> kernel then just make it enabled unconditionally.

So, should we add a new option CONFIG_KVMALLOC_FALLBACK_DEFAULT? I'm not 
quite sure if 3 lines of debugging code need an extra option, but if you 
don't want to reuse any existing debug option, it may be possible. Adding 
it to the RHEL debug kernel would be trivial.

Mikulas


* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Mikulas Patocka @ 2018-04-24 16:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
	virtualization, linux-mm, edumazet, Andrew Morton, David Miller,
	Vlastimil Babka
In-Reply-To: <20180424161242.GK17484@dhcp22.suse.cz>



On Tue, 24 Apr 2018, Michal Hocko wrote:

> On Tue 24-04-18 11:30:40, Mikulas Patocka wrote:
> > 
> > 
> > On Tue, 24 Apr 2018, Michal Hocko wrote:
> > 
> > > On Mon 23-04-18 20:25:15, Mikulas Patocka wrote:
> > > 
> > > > Fixing __vmalloc code 
> > > > is easy and it doesn't require cooperation with maintainers.
> > > 
> > > But it is a hack against the intention of the scope api.
> > 
> > It is not!
> 
> This discussion simply doesn't make much sense it seems. The scope API
> is to document the scope of the reclaim recursion critical section. That
> certainly is not a utility function like vmalloc.

That 15-line __vmalloc bugfix doesn't prevent you (or any other kernel 
developer) from converting the code to the scope API. You make nonsensical 
excuses.

Mikulas


* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Michal Hocko @ 2018-04-24 16:29 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
	virtualization, linux-mm, edumazet, Andrew Morton, David Miller,
	Vlastimil Babka
In-Reply-To: <20180424161242.GK17484@dhcp22.suse.cz>

On Tue 24-04-18 10:12:42, Michal Hocko wrote:
> On Tue 24-04-18 11:30:40, Mikulas Patocka wrote:
> > 
> > 
> > On Tue, 24 Apr 2018, Michal Hocko wrote:
> > 
> > > On Mon 23-04-18 20:25:15, Mikulas Patocka wrote:
> > > 
> > > > Fixing __vmalloc code 
> > > > is easy and it doesn't require cooperation with maintainers.
> > > 
> > > But it is a hack against the intention of the scope api.
> > 
> > It is not!
> 
> This discussion simply doesn't make much sense it seems. The scope API
> is to document the scope of the reclaim recursion critical section. That
> certainly is not a utility function like vmalloc.

http://lkml.kernel.org/r/20180424162712.GL17484@dhcp22.suse.cz

let's see how it rolls this time.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
From: Michal Hocko @ 2018-04-24 16:29 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
	virtualization, linux-mm, edumazet, Andrew Morton, David Miller,
	Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804241142340.15660@file01.intranet.prod.int.rdu2.redhat.com>

On Tue 24-04-18 11:50:30, Mikulas Patocka wrote:
> 
> 
> On Tue, 24 Apr 2018, Michal Hocko wrote:
> 
> > On Mon 23-04-18 20:06:16, Mikulas Patocka wrote:
> > [...]
> > > @@ -404,6 +405,12 @@ void *kvmalloc_node(size_t size, gfp_t f
> > >  	 */
> > >  	WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
> > >  
> > > +#ifdef CONFIG_DEBUG_SG
> > > +	/* Catch bugs when the caller uses DMA API on the result of kvmalloc. */
> > > +	if (!(prandom_u32_max(2) & 1))
> > > +		goto do_vmalloc;
> > > +#endif
> > 
> > I really do not think there is anything DEBUG_SG specific here. Why don't
> > you simply follow the should_failslab path or even reuse the function?
> 
> CONFIG_DEBUG_SG is enabled by default in the RHEL and Fedora debug kernels (if
> you don't like CONFIG_DEBUG_SG, pick any other option that is enabled
> there).

Are you telling me that you are shaping debugging functionality based
on what RHEL has enabled? And you call me evil. This is just ridiculous.

> The fault-injection framework is off by default and it must be explicitly
> enabled and configured by the user - and most users won't enable it.

It can be enabled easily. And if you care enough for your debugging
kernel then just make it enabled unconditionally.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Michal Hocko @ 2018-04-24 16:12 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
	virtualization, linux-mm, edumazet, Andrew Morton, David Miller,
	Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804241107010.31601@file01.intranet.prod.int.rdu2.redhat.com>

On Tue 24-04-18 11:30:40, Mikulas Patocka wrote:
> 
> 
> On Tue, 24 Apr 2018, Michal Hocko wrote:
> 
> > On Mon 23-04-18 20:25:15, Mikulas Patocka wrote:
> > 
> > > Fixing __vmalloc code 
> > > is easy and it doesn't require cooperation with maintainers.
> > 
> > But it is a hack against the intention of the scope api.
> 
> It is not!

This discussion simply doesn't make much sense it seems. The scope API
is to document the scope of the reclaim recursion critical section. That
certainly is not a utility function like vmalloc.
-- 
Michal Hocko
SUSE Labs
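
As a reading aid, the scope-style approach referred to in this thread
(marking a region with memalloc_noio_save() / restore instead of passing
GFP_NOIO to each call) can be modelled in userspace C along these lines.
This is only a sketch under stated assumptions: the sk_* names and the
SK_GFP_* bit values are hypothetical, and a plain global stands in for
per-task state.

```c
#include <stdbool.h>

/* Hypothetical flag bits for this model; the kernel's real values differ. */
#define SK_GFP_IO     0x1u
#define SK_GFP_FS     0x2u
#define SK_GFP_KERNEL (SK_GFP_IO | SK_GFP_FS)

/* "No I/O" marker for the current task; a global stands in for task state. */
static bool task_noio;

/* Enter a no-I/O scope and return the previous state for the later restore. */
static bool sk_noio_save(void)
{
    bool old = task_noio;

    task_noio = true;
    return old;
}

/* Leave the scope by restoring the state the matching save returned. */
static void sk_noio_restore(bool old)
{
    task_noio = old;
}

/* Every allocation filters its flags through the current scope. */
static unsigned sk_effective_gfp(unsigned gfp)
{
    if (task_noio)
        gfp &= ~(SK_GFP_IO | SK_GFP_FS);
    return gfp;
}
```

The point of the design is that a nested helper (vmalloc included) can keep
asking for SK_GFP_KERNEL and still be constrained by the enclosing scope,
and that save/restore pairs nest correctly because each restore only puts
back what its own save observed.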


* Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
From: Mikulas Patocka @ 2018-04-24 15:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
	virtualization, linux-mm, edumazet, Andrew Morton, David Miller,
	Vlastimil Babka
In-Reply-To: <20180424125121.GA17484@dhcp22.suse.cz>



On Tue, 24 Apr 2018, Michal Hocko wrote:

> On Mon 23-04-18 20:06:16, Mikulas Patocka wrote:
> [...]
> > @@ -404,6 +405,12 @@ void *kvmalloc_node(size_t size, gfp_t f
> >  	 */
> >  	WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
> >  
> > +#ifdef CONFIG_DEBUG_SG
> > +	/* Catch bugs when the caller uses DMA API on the result of kvmalloc. */
> > +	if (!(prandom_u32_max(2) & 1))
> > +		goto do_vmalloc;
> > +#endif
> 
> I really do not think there is anything DEBUG_SG specific here. Why don't
> you simply follow the should_failslab path or even reuse the function?

CONFIG_DEBUG_SG is enabled by default in the RHEL and Fedora debug kernels (if
you don't like CONFIG_DEBUG_SG, pick any other option that is enabled
there).

The fault-injection framework is off by default and it must be explicitly
enabled and configured by the user - and most users won't enable it.

Mikulas

> > +
> >  	/*
> >  	 * We want to attempt a large physically contiguous block first because
> >  	 * it is less likely to fragment multiple larger blocks and therefore
> > @@ -427,6 +434,9 @@ void *kvmalloc_node(size_t size, gfp_t f
> >  	if (ret || size <= PAGE_SIZE)
> >  		return ret;
> >  
> > +#ifdef CONFIG_DEBUG_SG
> > +do_vmalloc:
> > +#endif
> >  	return __vmalloc_node_flags_caller(size, node, flags,
> >  			__builtin_return_address(0));
> >  }
> 
> -- 
> Michal Hocko
> SUSE Labs
> 


* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Mikulas Patocka @ 2018-04-24 15:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
	virtualization, linux-mm, edumazet, Andrew Morton, David Miller,
	Vlastimil Babka
In-Reply-To: <20180424133146.GG17484@dhcp22.suse.cz>



On Tue, 24 Apr 2018, Michal Hocko wrote:

> On Mon 23-04-18 20:25:15, Mikulas Patocka wrote:
> 
> > Fixing __vmalloc code 
> > is easy and it doesn't require cooperation with maintainers.
> 
> But it is a hack against the intention of the scope api.

It is not! You can fix __vmalloc now and you can convert the kernel to the 
scope API in 4 years. It's not one way or the other.

> It also allows maintainers to not care about their broken code.

Most maintainers don't even know that it's broken. Out of 14 subsystems 
using __vmalloc with GFP_NOIO/NOFS, only 2 realized that its 
implementation is broken and implemented a workaround (me and the XFS 
developers).

Misimplementing a function in a subtle and hard-to-notice way won't drive 
developers away from using it.

> > > > He refuses 15-line patch to fix GFP_NOIO bug because he believes that in 4 
> > > > years, the kernel will be refactored and GFP_NOIO will be eliminated. Why 
> > > > does he have veto over this part of the code? I'd much rather argue with 
> > > > people who have constructive comments about fixing bugs than with him.
> > > 
> > > I didn't NACK the patch AFAIR. I've said it is not a good idea long term.
> > > I would be much more willing to change my mind if you would back your
> > > patch by a real bug report. Hacks are acceptable when we have a real
> > > issue on our hands. But if we want to fix a potential issue then better
> > > do it properly.
> > 
> > Developers should fix bugs in advance, not wait until a crash happens,
> > is analyzed and reported.
> 
> I agree. But are those existing users broken in the first place? I have
> seen so many GFP_NOFS abuses that I would dare to guess that most of
> those vmalloc NOFS abusers can be simply turned into GFP_KERNEL. Maybe
> that is the reason we haven't heard any complaints in years.

alloc_pages reclaims clean pages and most of the hard work is done by kswapd,
so GFP_KERNEL doesn't cause many issues with writeback. But cheating isn't
justified even if you can get away with it. Incorrect GFP flags cause real
problems with shrinkers - because shrinkers are called from alloc_pages
and they do respond to GFP flags.

I have reported a deadlock due to GFP issues (9d28eb12447). And the worst
thing about these bug reports is that they are totally unreproducible and
I get nothing but a stacktrace in bugzilla. I had to guess what happened
and I couldn't even test if the patch fixed the bug.

I'm not really happy that you are deliberately leaving these issues behind 
and making excuses.

Mikulas


* Re: [PATCH] vhost_net: use packet weight for rx handler, too
From: David Miller @ 2018-04-24 14:02 UTC (permalink / raw)
  To: pabeni; +Cc: kvm, mst, netdev, virtualization, haibinzhang
In-Reply-To: <11f2a27cee0c660a611af381ac1b68d9526095e3.1524556673.git.pabeni@redhat.com>

From: Paolo Abeni <pabeni@redhat.com>
Date: Tue, 24 Apr 2018 10:34:36 +0200

> Similar to commit a2ac99905f1e ("vhost-net: set packet weight of
> tx polling to 2 * vq size"), we need a packet-based limit for
> handle_rx, too - otherwise, under rx flood with small packets,
> tx can be delayed for a very long time, even without busypolling.
> 
> The pkt limit applied to handle_rx must be the same as the one applied
> by handle_tx, or we will get unfair scheduling between rx and tx.
> Tying such a limit to the queue length makes it less effective for
> large queue length values and can introduce large process
> scheduler latencies, so a constant value is used - like
> the existing bytes limit.
> 
> The selected limit has been validated with PVP[1] performance
> test with different queue sizes:
> 
> queue size		256	512	1024
> 
> baseline		366	354	362
> weight 128		715	723	670
> weight 256		740	745	733
> weight 512		600	460	583
> weight 1024		423	427	418
> 
> A packet weight of 256 gives peak performance in all the
> tested scenarios.
> 
> No measurable regression in unidirectional performance tests has
> been detected.
> 
> [1] https://developers.redhat.com/blog/2017/06/05/measuring-and-comparing-open-vswitch-performance/
> 
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Applied to net-next, thanks.
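
To make the quoted commit message easier to follow, the combined byte and
packet budget it describes can be sketched in userspace C as below. This is
an illustrative model, not the vhost_net code: rx_round() and its parameter
names are hypothetical, and packets are represented only by their lengths.

```c
#include <stddef.h>

/*
 * Process packets until either the byte budget or the packet budget is
 * used up, then stop so the other direction gets a turn.
 * Returns the number of packets handled in this round.
 */
static size_t rx_round(const size_t *pkt_len, size_t n_pkts,
                       size_t byte_weight, size_t pkt_weight)
{
    size_t total_len = 0;
    size_t handled = 0;

    while (handled < n_pkts) {
        total_len += pkt_len[handled];
        handled++;
        /* Budget exhausted: yield; the rest is left for a later round. */
        if (total_len >= byte_weight || handled >= pkt_weight)
            break;
    }
    return handled;
}
```

With only a byte budget, a flood of small packets can keep one round running
for a very long time; the packet budget bounds the round regardless of
packet size.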


* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Michal Hocko @ 2018-04-24 13:31 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
	virtualization, linux-mm, edumazet, Andrew Morton, David Miller,
	Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804232006540.2299@file01.intranet.prod.int.rdu2.redhat.com>

On Mon 23-04-18 20:25:15, Mikulas Patocka wrote:
> 
> 
> On Mon, 23 Apr 2018, Michal Hocko wrote:
> 
> > On Mon 23-04-18 10:06:08, Mikulas Patocka wrote:
> > 
> > > > > He didn't want to fix vmalloc(GFP_NOIO)
> > > > 
> > > > I don't remember that conversation, so I don't know whether I agree with
> > > > his reasoning or not.  But we are supposed to be moving away from GFP_NOIO
> > > > towards marking regions with memalloc_noio_save() / restore.  If you do
> > > > that, you won't need vmalloc(GFP_NOIO).
> > > 
> > > He said the same thing a year ago. And there was small progress. 6 out of 
> > > 27 __vmalloc calls were converted to memalloc_noio_save in a year - 5 in 
> > > infiniband and 1 in btrfs. (the whole discussion is here 
> > > http://lkml.iu.edu/hypermail/linux/kernel/1706.3/04681.html )
> > 
> > Well this is not that easy. It requires cooperation from maintainers.
> > I can only do so much. I've posted patches in the past and have been
> > actively bringing up this topic at LSFMM for the last two years...
> 
> You're right - but you have chosen the uneasy path.

Yes.

> Fixing __vmalloc code 
> is easy and it doesn't require cooperation with maintainers.

But it is a hack against the intention of the scope api. It also allows
maintainers to not care about their broken code.

> > > He refuses 15-line patch to fix GFP_NOIO bug because he believes that in 4 
> > > years, the kernel will be refactored and GFP_NOIO will be eliminated. Why 
> > > does he have veto over this part of the code? I'd much rather argue with 
> > > people who have constructive comments about fixing bugs than with him.
> > 
> > I didn't NACK the patch AFAIR. I've said it is not a good idea long term.
> > I would be much more willing to change my mind if you would back your
> > patch by a real bug report. Hacks are acceptable when we have a real
> > issue on our hands. But if we want to fix a potential issue then better
> > do it properly.
> 
> Developers should fix bugs in advance, not wait until a crash happens,
> is analyzed and reported.

I agree. But are those existing users broken in the first place? I have
seen so many GFP_NOFS abuses that I would dare to guess that most of
those vmalloc NOFS abusers can be simply turned into GFP_KERNEL. Maybe
that is the reason we haven't heard any complaints in years.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
From: Michal Hocko @ 2018-04-24 12:51 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
	virtualization, linux-mm, edumazet, Andrew Morton, David Miller,
	Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804232003100.2299@file01.intranet.prod.int.rdu2.redhat.com>

On Mon 23-04-18 20:06:16, Mikulas Patocka wrote:
[...]
> @@ -404,6 +405,12 @@ void *kvmalloc_node(size_t size, gfp_t f
>  	 */
>  	WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
>  
> +#ifdef CONFIG_DEBUG_SG
> +	/* Catch bugs when the caller uses DMA API on the result of kvmalloc. */
> +	if (!(prandom_u32_max(2) & 1))
> +		goto do_vmalloc;
> +#endif

I really do not think there is anything DEBUG_SG specific here. Why don't
you simply follow the should_failslab path or even reuse the function?

> +
>  	/*
>  	 * We want to attempt a large physically contiguous block first because
>  	 * it is less likely to fragment multiple larger blocks and therefore
> @@ -427,6 +434,9 @@ void *kvmalloc_node(size_t size, gfp_t f
>  	if (ret || size <= PAGE_SIZE)
>  		return ret;
>  
> +#ifdef CONFIG_DEBUG_SG
> +do_vmalloc:
> +#endif
>  	return __vmalloc_node_flags_caller(size, node, flags,
>  			__builtin_return_address(0));
>  }

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] gpu: drm: qxl: Adding new typedef vm_fault_t
From: Gerd Hoffmann @ 2018-04-24 12:35 UTC (permalink / raw)
  To: Souptick Joarder, airlied, linux-kernel, willy, virtualization,
	dri-devel, airlied, Maarten Lankhorst, Gustavo Padovan, Sean Paul
In-Reply-To: <20180424121151.GX31310@phenom.ffwll.local>

On Tue, Apr 24, 2018 at 02:11:51PM +0200, Daniel Vetter wrote:
> On Mon, Apr 23, 2018 at 12:49:24PM +0200, Gerd Hoffmann wrote:
> > On Tue, Apr 17, 2018 at 07:08:44PM +0530, Souptick Joarder wrote:
> > > Use new return type vm_fault_t for fault handler. For
> > > now, this is just documenting that the function returns
> > > a VM_FAULT value rather than an errno. Once all instances
> > > are converted, vm_fault_t will become a distinct type.
> > > 
> > > Reference id -> 1c8f422059ae ("mm: change return type to
> > > vm_fault_t")
> > 
> > Hmm, that commit isn't yet in drm-misc-next.
> > Will drm-misc-next merge with 4.17-rcX soon?
> 
> For backmerge requests you need to cc/ping the drm-misc maintainers.
> Adding them. I think the hold-up also was that Dave was still on
> vacation.

Ah, ok.

So my expectation that a backmerge happens anyway after -rc1/2 is in
line with reality; it is just delayed this time.  I'll stay
tuned ;)

cheers,
  Gerd
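
For context, the "typedef as documentation" step described in the quoted
patch can be illustrated with a small userspace C sketch. Everything here is
hypothetical: sk_vm_fault_t, the SK_VM_FAULT_* values and both handlers are
stand-ins chosen for the example and do not match the kernel's definitions.

```c
/* Hypothetical stand-ins for VM_FAULT_* codes; the values are illustrative. */
#define SK_VM_FAULT_SIGBUS 0x0002
#define SK_VM_FAULT_NOPAGE 0x0100

/* At this stage the typedef is only an alias that documents intent. */
typedef int sk_vm_fault_t;

/* Before: the signature does not say whether an errno or a fault code comes back. */
static int fault_old(int backing_ok)
{
    return backing_ok ? SK_VM_FAULT_NOPAGE : SK_VM_FAULT_SIGBUS;
}

/* After: same behavior, but the return type names what is returned. */
static sk_vm_fault_t fault_new(int backing_ok)
{
    return backing_ok ? SK_VM_FAULT_NOPAGE : SK_VM_FAULT_SIGBUS;
}
```

Because the alias is still a plain int, converting handlers one by one
changes no behavior; the compiler can only start rejecting mixed-up errno
returns once the type is made distinct.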


* Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
From: Mikulas Patocka @ 2018-04-24 12:29 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Lameter, dm-devel, eric.dumazet, mst, netdev,
	linux-kernel, Michal Hocko, Pekka Enberg, linux-mm, edumazet,
	David Rientjes, Joonsoo Kim, Andrew Morton, virtualization,
	David Miller, Vlastimil Babka
In-Reply-To: <20180424034643.GA26636@bombadil.infradead.org>



On Mon, 23 Apr 2018, Matthew Wilcox wrote:

> On Mon, Apr 23, 2018 at 08:06:16PM -0400, Mikulas Patocka wrote:
> > Some bugs (such as buffer overflows) are better detected
> > with kmalloc code, so we must test the kmalloc path too.
> 
> Well now, this brings up another item for the collective TODO list --
> implement redzone checks for vmalloc.  Unless this is something already
> taken care of by kasan or similar.

The kmalloc overflow testing is also not ideal - it rounds the size up to 
the next slab size and detects buffer overflows only at this boundary.

Some time ago, I made a "kmalloc guard" patch that places a magic number
immediately after the requested size - so that it can detect overflows at
byte granularity
( https://www.redhat.com/archives/dm-devel/2014-September/msg00018.html )

That patch found a bug in crypto code:
( http://lkml.iu.edu/hypermail/linux/kernel/1409.1/02325.html )

Mikulas
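
The guard idea described above can be sketched in userspace C as follows.
This is a minimal model under assumptions, not the linked patch: it uses
malloc instead of the slab allocator, a single guard byte instead of a magic
number, and the names guarded_alloc(), guard_intact() and GUARD_MAGIC are
hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define GUARD_MAGIC 0xA5u  /* arbitrary marker value chosen for this sketch */

/* Allocate size bytes plus one guard byte placed directly after them. */
static void *guarded_alloc(size_t size)
{
    unsigned char *p;

    if (size == SIZE_MAX)  /* size + 1 would wrap around */
        return NULL;
    p = malloc(size + 1);
    if (!p)
        return NULL;
    p[size] = GUARD_MAGIC;
    return p;
}

/* Return true if the guard byte after the requested size is still intact. */
static bool guard_intact(const void *ptr, size_t size)
{
    return ((const unsigned char *)ptr)[size] == GUARD_MAGIC;
}
```

Since the guard sits at the requested size rather than at the rounded-up
allocation size, even a one-byte overrun is visible when the check runs,
for example at free time.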


* Re: [PATCH] gpu: drm: qxl: Adding new typedef vm_fault_t
From: Daniel Vetter @ 2018-04-24 12:11 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: willy, airlied, Gustavo Padovan, Maarten Lankhorst, linux-kernel,
	dri-devel, virtualization, Sean Paul, Souptick Joarder, airlied
In-Reply-To: <20180423104924.ozalv7x5ppjifnxc@sirius.home.kraxel.org>

On Mon, Apr 23, 2018 at 12:49:24PM +0200, Gerd Hoffmann wrote:
> On Tue, Apr 17, 2018 at 07:08:44PM +0530, Souptick Joarder wrote:
> > Use new return type vm_fault_t for fault handler. For
> > now, this is just documenting that the function returns
> > a VM_FAULT value rather than an errno. Once all instances
> > are converted, vm_fault_t will become a distinct type.
> > 
> > Reference id -> 1c8f422059ae ("mm: change return type to
> > vm_fault_t")
> 
> Hmm, that commit isn't yet in drm-misc-next.
> Will drm-misc-next merge with 4.17-rcX soon?

For backmerge requests you need to cc/ping the drm-misc maintainers.
Adding them. I think the hold-up was also that Dave was still on
vacation.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply

* Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
From: Mikulas Patocka @ 2018-04-24 11:04 UTC (permalink / raw)
  To: David Rientjes
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
	Michal Hocko, linux-mm, edumazet, Andrew Morton, virtualization,
	David Miller, Vlastimil Babka
In-Reply-To: <alpine.DEB.2.21.1804231945410.58980@chino.kir.corp.google.com>



On Mon, 23 Apr 2018, David Rientjes wrote:

> On Mon, 23 Apr 2018, Mikulas Patocka wrote:
> 
> > The kvmalloc function tries to use kmalloc and falls back to vmalloc if
> > kmalloc fails.
> > 
> > Unfortunately, some kernel code has bugs - it uses kvmalloc and then
> > uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> > found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> > code.
> > 
> > These bugs are hard to reproduce because kvmalloc falls back to vmalloc
> > only if memory is fragmented.
> > 
> > In order to detect these bugs reliably I submit this patch that changes
> > kvmalloc to fall back to vmalloc with 1/2 probability if CONFIG_DEBUG_SG
> > is turned on. CONFIG_DEBUG_SG is used, because it makes the DMA API layer
> > verify the addresses passed to it, and so the user will get a reliable
> > stacktrace.
> 
> Why not just do it unconditionally?  Sounds better than "50% of the time 
> this will catch bugs".

Because kmalloc (with slub_debug) detects buffer overflows better than 
vmalloc. vmalloc detects buffer overflows only at a page boundary. This is 
intended for debugging kernels and debugging kernels should detect as many 
bugs as possible.
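
A small model of the decision being discussed, as a sketch only. pick_path
and its parameters are hypothetical and not taken from the patch; the random
bit is passed in by the caller so that the behaviour stays easy to reason
about:

```c
#include <assert.h>
#include <stdbool.h>

/* Which allocator a kvmalloc-like call ends up using (model only). */
enum alloc_path { PATH_KMALLOC, PATH_VMALLOC };

/*
 * Hypothetical model of the debug behaviour: with the debug option on,
 * the low bit of rnd forces the vmalloc path half of the time, so both
 * the kmalloc path (better overflow detection with slub_debug) and the
 * vmalloc path (catches DMA-API and kfree misuse) get exercised.
 */
static enum alloc_path pick_path(bool debug_sg, bool kmalloc_would_fail,
				 unsigned int rnd)
{
	if (debug_sg && (rnd & 1))
		return PATH_VMALLOC;	/* forced fallback, 1/2 probability */
	/* normal behaviour: fall back only when kmalloc fails */
	return kmalloc_would_fail ? PATH_VMALLOC : PATH_KMALLOC;
}
```

Doing the fallback unconditionally would correspond to dropping the
(rnd & 1) test, at the cost of never running the kmalloc path on a
debugging kernel.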

Mikulas

^ permalink raw reply

* Re: [PATCH] vhost_net: use packet weight for rx handler, too
From: Jason Wang @ 2018-04-24  9:11 UTC (permalink / raw)
  To: Paolo Abeni, kvm; +Cc: netdev, virtualization, haibinzhang, Michael S. Tsirkin
In-Reply-To: <11f2a27cee0c660a611af381ac1b68d9526095e3.1524556673.git.pabeni@redhat.com>



On 2018-04-24 16:34, Paolo Abeni wrote:
> Similar to commit a2ac99905f1e ("vhost-net: set packet weight of
> tx polling to 2 * vq size"), we need a packet-based limit for
> handle_rx, too - otherwise, under rx flood with small packets,
> tx can be delayed for a very long time, even without busypolling.
>
> The pkt limit applied to handle_rx must be the same applied by
> handle_tx, or we will get unfair scheduling between rx and tx.
> Tying such limit to the queue length makes it less effective for
> large queue length values and can introduce large process
> scheduler latencies, so a constant value is used, like
> the existing bytes limit.
>
> The selected limit has been validated with PVP[1] performance
> test with different queue sizes:
>
> queue size		256	512	1024
>
> baseline		366	354	362
> weight 128		715	723	670
> weight 256		740	745	733
> weight 512		600	460	583
> weight 1024		423	427	418
>
> A packet weight of 256 gives peak performance in all the
> tested scenarios.
>
> No measurable regression in unidirectional performance tests has
> been detected.
>
> [1] https://developers.redhat.com/blog/2017/06/05/measuring-and-comparing-open-vswitch-performance/
>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
>   drivers/vhost/net.c | 12 ++++++++----
>   1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index bbf38befefb2..c4b49fca4871 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -46,8 +46,10 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
>   #define VHOST_NET_WEIGHT 0x80000
>   
>   /* Max number of packets transferred before requeueing the job.
> - * Using this limit prevents one virtqueue from starving rx. */
> -#define VHOST_NET_PKT_WEIGHT(vq) ((vq)->num * 2)
> + * Using this limit prevents one virtqueue from starving others with small
> + * pkts.
> + */
> +#define VHOST_NET_PKT_WEIGHT 256
>   
>   /* MAX number of TX used buffers for outstanding zerocopy */
>   #define VHOST_MAX_PEND 128
> @@ -587,7 +589,7 @@ static void handle_tx(struct vhost_net *net)
>   			vhost_zerocopy_signal_used(net, vq);
>   		vhost_net_tx_packet(net);
>   		if (unlikely(total_len >= VHOST_NET_WEIGHT) ||
> -		    unlikely(++sent_pkts >= VHOST_NET_PKT_WEIGHT(vq))) {
> +		    unlikely(++sent_pkts >= VHOST_NET_PKT_WEIGHT)) {
>   			vhost_poll_queue(&vq->poll);
>   			break;
>   		}
> @@ -769,6 +771,7 @@ static void handle_rx(struct vhost_net *net)
>   	struct socket *sock;
>   	struct iov_iter fixup;
>   	__virtio16 num_buffers;
> +	int recv_pkts = 0;
>   
>   	mutex_lock_nested(&vq->mutex, 0);
>   	sock = vq->private_data;
> @@ -872,7 +875,8 @@ static void handle_rx(struct vhost_net *net)
>   		if (unlikely(vq_log))
>   			vhost_log_write(vq, vq_log, log, vhost_len);
>   		total_len += vhost_len;
> -		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
> +		if (unlikely(total_len >= VHOST_NET_WEIGHT) ||
> +		    unlikely(++recv_pkts >= VHOST_NET_PKT_WEIGHT)) {
>   			vhost_poll_queue(&vq->poll);
>   			goto out;
>   		}

The numbers look impressive.

Acked-by: Jason Wang <jasowang@redhat.com>

Thanks!
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* [PATCH] vhost_net: use packet weight for rx handler, too
From: Paolo Abeni @ 2018-04-24  8:34 UTC (permalink / raw)
  To: kvm; +Cc: netdev, virtualization, haibinzhang, Michael S. Tsirkin

Similar to commit a2ac99905f1e ("vhost-net: set packet weight of
tx polling to 2 * vq size"), we need a packet-based limit for
handle_rx, too - otherwise, under rx flood with small packets,
tx can be delayed for a very long time, even without busypolling.

The pkt limit applied to handle_rx must be the same applied by
handle_tx, or we will get unfair scheduling between rx and tx.
Tying such limit to the queue length makes it less effective for
large queue length values and can introduce large process
scheduler latencies, so a constant value is used, like
the existing bytes limit.
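
As an illustration of how the two budgets interact, here is a standalone
model of one handler invocation. handle_burst and the NET_* names are
hypothetical; only the two constant values are taken from the patch below:

```c
#include <assert.h>
#include <stddef.h>

#define NET_WEIGHT	0x80000	/* byte budget, same value as VHOST_NET_WEIGHT */
#define NET_PKT_WEIGHT	256	/* packet budget, same value as in this patch */

/*
 * Model of one handler run: consume packets of pkt_len bytes each from
 * a backlog of `pending` packets and return how many were handled
 * before either budget ran out.
 */
static size_t handle_burst(size_t pending, size_t pkt_len)
{
	size_t total_len = 0, handled = 0;

	while (handled < pending) {
		handled++;		/* "process" one packet */
		total_len += pkt_len;
		if (total_len >= NET_WEIGHT || handled >= NET_PKT_WEIGHT)
			break;		/* the real handler requeues itself here */
	}
	return handled;
}
```

With only the byte budget, a flood of 64-byte packets would need
0x80000 / 64 = 8192 packets to trip the limit; with the packet budget the
run stops after 256, so the other direction gets a turn much sooner.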

The selected limit has been validated with PVP[1] performance
test with different queue sizes:

queue size		256	512	1024

baseline		366	354	362
weight 128		715	723	670
weight 256		740	745	733
weight 512		600	460	583
weight 1024		423	427	418

A packet weight of 256 gives peak performance in all the
tested scenarios.

No measurable regression in unidirectional performance tests has
been detected.

[1] https://developers.redhat.com/blog/2017/06/05/measuring-and-comparing-open-vswitch-performance/

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 drivers/vhost/net.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index bbf38befefb2..c4b49fca4871 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -46,8 +46,10 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
 #define VHOST_NET_WEIGHT 0x80000
 
 /* Max number of packets transferred before requeueing the job.
- * Using this limit prevents one virtqueue from starving rx. */
-#define VHOST_NET_PKT_WEIGHT(vq) ((vq)->num * 2)
+ * Using this limit prevents one virtqueue from starving others with small
+ * pkts.
+ */
+#define VHOST_NET_PKT_WEIGHT 256
 
 /* MAX number of TX used buffers for outstanding zerocopy */
 #define VHOST_MAX_PEND 128
@@ -587,7 +589,7 @@ static void handle_tx(struct vhost_net *net)
 			vhost_zerocopy_signal_used(net, vq);
 		vhost_net_tx_packet(net);
 		if (unlikely(total_len >= VHOST_NET_WEIGHT) ||
-		    unlikely(++sent_pkts >= VHOST_NET_PKT_WEIGHT(vq))) {
+		    unlikely(++sent_pkts >= VHOST_NET_PKT_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
 			break;
 		}
@@ -769,6 +771,7 @@ static void handle_rx(struct vhost_net *net)
 	struct socket *sock;
 	struct iov_iter fixup;
 	__virtio16 num_buffers;
+	int recv_pkts = 0;
 
 	mutex_lock_nested(&vq->mutex, 0);
 	sock = vq->private_data;
@@ -872,7 +875,8 @@ static void handle_rx(struct vhost_net *net)
 		if (unlikely(vq_log))
 			vhost_log_write(vq, vq_log, log, vhost_len);
 		total_len += vhost_len;
-		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
+		if (unlikely(total_len >= VHOST_NET_WEIGHT) ||
+		    unlikely(++recv_pkts >= VHOST_NET_PKT_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
 			goto out;
 		}
-- 
2.14.3

^ permalink raw reply related

* Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Stephen Hemminger @ 2018-04-24  5:07 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Jakub Kicinski, Netdev,
	virtualization, Siwei Liu, Sridhar Samudrala, David Miller
In-Reply-To: <20180424043042-mutt-send-email-mst@kernel.org>

On Tue, 24 Apr 2018 04:42:22 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Mon, Apr 23, 2018 at 06:25:03PM -0700, Stephen Hemminger wrote:
> > On Mon, 23 Apr 2018 12:44:39 -0700
> > Siwei Liu <loseweigh@gmail.com> wrote:
> >   
> > > On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin <mst@redhat.com> wrote:  
> > > > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:    
> > > >> On Mon, 23 Apr 2018 20:24:56 +0300
> > > >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > >>    
> > > >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:    
> > > >> > > > >
> > > >> > > > >I will NAK patches to change to common code for netvsc especially the
> > > >> > > > >three device model.  MS worked hard with distro vendors to support transparent
> > > >> > > > >mode, and we really can't have a new model; or do backport.
> > > >> > > > >
> > > >> > > > >Plus, DPDK is now dependent on existing model.    
> > > >> > > >
> > > >> > > > Sorry, but nobody here cares about dpdk or other similar oddities.    
> > > >> > >
> > > >> > > The network device model is a userspace API, and DPDK is a userspace application.    
> > > >> >
> > > >> > It is userspace but are you sure dpdk is actually poking at netdevs?
> > > >> > AFAIK it's normally banging device registers directly.
> > > >> >    
> > > >> > > You can't go breaking userspace even if you don't like the application.    
> > > >> >
> > > >> > Could you please explain how the proposed patchset is breaking
> > > >> > userspace? Ignoring DPDK for now, I don't think it changes the userspace
> > > >> > API at all.
> > > >> >    
> > > >>
> > > >> The DPDK has a device driver vdev_netvsc which scans the Linux network devices
> > > >> to look for Linux netvsc device and the paired VF device and setup the
> > > >> DPDK environment.  This setup creates a DPDK failsafe (bondingish) instance
> > > >> and sets up TAP support over the Linux netvsc device as well as the Mellanox
> > > >> VF device.
> > > >>
> > > >> So it depends on existing 2 device model. You can't go to a 3 device model
> > > >> or start hiding devices from userspace.    
> > > >
> > > > Okay so how does the existing patch break that? IIUC does not go to
> > > > a 3 device model since netvsc calls failover_register directly.
> > > >    
> > > >> Also, I am working on associating netvsc and VF device based on serial number
> > > >> rather than MAC address. The serial number is how Windows works now, and it makes
> > > >> sense for Linux and Windows to use the same mechanism if possible.    
> > > >
> > > > Maybe we should support same for virtio ...
> > > > Which serial do you mean? From vpd?
> > > >
> > > > I guess you will want to keep supporting MAC for old hypervisors?  
> > 
> > The serial number has always been in the hypervisor since original support of SR-IOV
> > in WS2008.  So no backward compatibility special cases would be needed.  
> 
> Is that a serial from real hardware or a hypervisor thing?
> 
> 

It is a hypervisor thing in the PCI hyperv code and the hyperv Netvsc interface.
It might also be in the PCI spec, but the value in Hyper-V is being generated by the host.

^ permalink raw reply

* Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
From: Matthew Wilcox @ 2018-04-24  3:46 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Michal Hocko,
	linux-mm, edumazet, Andrew Morton, virtualization, David Miller,
	Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804232003100.2299@file01.intranet.prod.int.rdu2.redhat.com>

On Mon, Apr 23, 2018 at 08:06:16PM -0400, Mikulas Patocka wrote:
> Some bugs (such as buffer overflows) are better detected
> with kmalloc code, so we must test the kmalloc path too.

Well now, this brings up another item for the collective TODO list --
implement redzone checks for vmalloc.  Unless this is something already
taken care of by kasan or similar.

^ permalink raw reply

* Re: [PATCH 3/6] virtio_console: free buffers after reset
From: Jason Wang @ 2018-04-24  2:40 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel
  Cc: Arnd Bergmann, Amit Shah, Greg Kroah-Hartman, stable,
	virtualization, stable
In-Reply-To: <1524248223-393618-4-git-send-email-mst@redhat.com>



On 2018-04-21 02:18, Michael S. Tsirkin wrote:
> Console driver is out of spec. The spec says:
> 	A driver MUST NOT decrement the available idx on a live
> 	virtqueue (ie. there is no way to “unexpose” buffers).
> and it does exactly that by trying to detach unused buffers
> without doing a device reset first.
>
> Defer detaching the buffers until device unplug.
>
> Of course this means we might get an interrupt for
> a vq without an attached port now. Handle that by
> discarding the consumed buffer.
>
> Reported-by: Tiwei Bie <tiwei.bie@intel.com>
> Fixes: b3258ff1d6 ("virtio: Decrement avail idx on buffer detach")
> CC: stable@kernel.org
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

I wonder whether or not we can have some BUG_ON() in 
virtqueue_detach_unused_buf() to detect such bugs (e.g by checking status?).
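
A rough userspace model of what such a check could look like. Everything
here (struct model_vq, detach_allowed, model_detach_unused_buf) is
hypothetical, and whether the status byte is the right thing to test is
exactly the open question; only the DRIVER_OK bit value matches
VIRTIO_CONFIG_S_DRIVER_OK:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define STATUS_DRIVER_OK 4	/* same value as VIRTIO_CONFIG_S_DRIVER_OK */

/* Toy virtqueue: a status byte plus a stack of exposed, unused buffers. */
struct model_vq {
	unsigned char status;	/* device status byte */
	void *bufs[8];		/* buffers exposed but not yet used */
	size_t nbufs;
};

/* Taking buffers back is only legal once the device is no longer live. */
static bool detach_allowed(const struct model_vq *vq)
{
	return !(vq->status & STATUS_DRIVER_OK);
}

/* Return the next unused buffer, or NULL; refuses to touch a live queue. */
static void *model_detach_unused_buf(struct model_vq *vq)
{
	if (!detach_allowed(vq))
		return NULL;	/* a kernel version might BUG_ON() or WARN_ON() here */
	if (!vq->nbufs)
		return NULL;
	return vq->bufs[--vq->nbufs];
}
```

In this model a caller that skips the reset gets nothing back and the
buffer stays exposed, which is the class of bug the check is meant to
make visible.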

Thanks

> ---
>   drivers/char/virtio_console.c | 49 +++++++++++++++++++++----------------------
>   1 file changed, 24 insertions(+), 25 deletions(-)
>
> diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
> index 3e56f32..26a66ff 100644
> --- a/drivers/char/virtio_console.c
> +++ b/drivers/char/virtio_console.c
> @@ -1402,7 +1402,6 @@ static int add_port(struct ports_device *portdev, u32 id)
>   {
>   	char debugfs_name[16];
>   	struct port *port;
> -	struct port_buffer *buf;
>   	dev_t devt;
>   	unsigned int nr_added_bufs;
>   	int err;
> @@ -1513,8 +1512,6 @@ static int add_port(struct ports_device *portdev, u32 id)
>   	return 0;
>   
>   free_inbufs:
> -	while ((buf = virtqueue_detach_unused_buf(port->in_vq)))
> -		free_buf(buf, true);
>   free_device:
>   	device_destroy(pdrvdata.class, port->dev->devt);
>   free_cdev:
> @@ -1539,34 +1536,14 @@ static void remove_port(struct kref *kref)
>   
>   static void remove_port_data(struct port *port)
>   {
> -	struct port_buffer *buf;
> -
>   	spin_lock_irq(&port->inbuf_lock);
>   	/* Remove unused data this port might have received. */
>   	discard_port_data(port);
>   	spin_unlock_irq(&port->inbuf_lock);
>   
> -	/* Remove buffers we queued up for the Host to send us data in. */
> -	do {
> -		spin_lock_irq(&port->inbuf_lock);
> -		buf = virtqueue_detach_unused_buf(port->in_vq);
> -		spin_unlock_irq(&port->inbuf_lock);
> -		if (buf)
> -			free_buf(buf, true);
> -	} while (buf);
> -
>   	spin_lock_irq(&port->outvq_lock);
>   	reclaim_consumed_buffers(port);
>   	spin_unlock_irq(&port->outvq_lock);
> -
> -	/* Free pending buffers from the out-queue. */
> -	do {
> -		spin_lock_irq(&port->outvq_lock);
> -		buf = virtqueue_detach_unused_buf(port->out_vq);
> -		spin_unlock_irq(&port->outvq_lock);
> -		if (buf)
> -			free_buf(buf, true);
> -	} while (buf);
>   }
>   
>   /*
> @@ -1791,13 +1768,24 @@ static void control_work_handler(struct work_struct *work)
>   	spin_unlock(&portdev->c_ivq_lock);
>   }
>   
> +static void flush_bufs(struct virtqueue *vq, bool can_sleep)
> +{
> +	struct port_buffer *buf;
> +	unsigned int len;
> +
> +	while ((buf = virtqueue_get_buf(vq, &len)))
> +		free_buf(buf, can_sleep);
> +}
> +
>   static void out_intr(struct virtqueue *vq)
>   {
>   	struct port *port;
>   
>   	port = find_port_by_vq(vq->vdev->priv, vq);
> -	if (!port)
> +	if (!port) {
> +		flush_bufs(vq, false);
>   		return;
> +	}
>   
>   	wake_up_interruptible(&port->waitqueue);
>   }
> @@ -1808,8 +1796,10 @@ static void in_intr(struct virtqueue *vq)
>   	unsigned long flags;
>   
>   	port = find_port_by_vq(vq->vdev->priv, vq);
> -	if (!port)
> +	if (!port) {
> +		flush_bufs(vq, false);
>   		return;
> +	}
>   
>   	spin_lock_irqsave(&port->inbuf_lock, flags);
>   	port->inbuf = get_inbuf(port);
> @@ -1984,6 +1974,15 @@ static const struct file_operations portdev_fops = {
>   
>   static void remove_vqs(struct ports_device *portdev)
>   {
> +	struct virtqueue *vq;
> +
> +	virtio_device_for_each_vq(portdev->vdev, vq) {
> +		struct port_buffer *buf;
> +
> +		flush_bufs(vq, true);
> +		while ((buf = virtqueue_detach_unused_buf(vq)))
> +			free_buf(buf, true);
> +	}
>   	portdev->vdev->config->del_vqs(portdev->vdev);
>   	kfree(portdev->in_vqs);
>   	kfree(portdev->out_vqs);


^ permalink raw reply

* Re: [RFC v2] virtio: support packed ring
From: Tiwei Bie @ 2018-04-24  1:49 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, virtualization, wexu
In-Reply-To: <20180424044250-mutt-send-email-mst@kernel.org>

On Tue, Apr 24, 2018 at 04:43:22AM +0300, Michael S. Tsirkin wrote:
> On Tue, Apr 24, 2018 at 09:37:47AM +0800, Tiwei Bie wrote:
> > On Tue, Apr 24, 2018 at 04:29:51AM +0300, Michael S. Tsirkin wrote:
> > > On Tue, Apr 24, 2018 at 09:16:38AM +0800, Tiwei Bie wrote:
> > > > On Tue, Apr 24, 2018 at 04:05:07AM +0300, Michael S. Tsirkin wrote:
> > > > > On Tue, Apr 24, 2018 at 08:54:52AM +0800, Jason Wang wrote:
> > > > > > 
> > > > > > 
> > > > > > On 2018-04-23 17:29, Tiwei Bie wrote:
> > > > > > > On Mon, Apr 23, 2018 at 01:42:14PM +0800, Jason Wang wrote:
> > > > > > > > On 2018-04-01 22:12, Tiwei Bie wrote:
> > > > > > > > > Hello everyone,
> > > > > > > > > 
> > > > > > > > > This RFC implements packed ring support for virtio driver.
> > > > > > > > > 
> > > > > > > > > The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
> > > > > > > > > by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
> > > > > > > > > Minor changes are needed for the vhost code, e.g. to kick the guest.
> > > > > > > > > 
> > > > > > > > > TODO:
> > > > > > > > > - Refinements and bug fixes;
> > > > > > > > > - Split into small patches;
> > > > > > > > > - Test indirect descriptor support;
> > > > > > > > > - Test/fix event suppression support;
> > > > > > > > > - Test devices other than net;
> > > > > > > > > 
> > > > > > > > > RFC v1 -> RFC v2:
> > > > > > > > > - Add indirect descriptor support - compile test only;
> > > > > > > > > - Add event suppression support - compile test only;
> > > > > > > > > - Move vring_packed_init() out of uapi (Jason, MST);
> > > > > > > > > - Merge two loops into one in virtqueue_add_packed() (Jason);
> > > > > > > > > - Split vring_unmap_one() for packed ring and split ring (Jason);
> > > > > > > > > - Avoid using '%' operator (Jason);
> > > > > > > > > - Rename free_head -> next_avail_idx (Jason);
> > > > > > > > > - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
> > > > > > > > > - Some other refinements and bug fixes;
> > > > > > > > > 
> > > > > > > > > Thanks!
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> > > > > > > > > ---
> > > > > > > > >    drivers/virtio/virtio_ring.c       | 1094 +++++++++++++++++++++++++++++-------
> > > > > > > > >    include/linux/virtio_ring.h        |    8 +-
> > > > > > > > >    include/uapi/linux/virtio_config.h |   12 +-
> > > > > > > > >    include/uapi/linux/virtio_ring.h   |   61 ++
> > > > > > > > >    4 files changed, 980 insertions(+), 195 deletions(-)
> > > > > > > > > 
> > > > > > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > > > > > index 71458f493cf8..0515dca34d77 100644
> > > > > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > > > > @@ -58,14 +58,15 @@
> > > > > > > > [...]
> > > > > > > > 
> > > > > > > > > +
> > > > > > > > > +	if (vq->indirect) {
> > > > > > > > > +		u32 len;
> > > > > > > > > +
> > > > > > > > > +		desc = vq->desc_state[head].indir_desc;
> > > > > > > > > +		/* Free the indirect table, if any, now that it's unmapped. */
> > > > > > > > > +		if (!desc)
> > > > > > > > > +			goto out;
> > > > > > > > > +
> > > > > > > > > +		len = virtio32_to_cpu(vq->vq.vdev,
> > > > > > > > > +				      vq->vring_packed.desc[head].len);
> > > > > > > > > +
> > > > > > > > > +		BUG_ON(!(vq->vring_packed.desc[head].flags &
> > > > > > > > > +			 cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
> > > > > > > > It looks to me spec does not force to keep VRING_DESC_F_INDIRECT here. So we
> > > > > > > > can safely remove this BUG_ON() here.
> > > > > > > > 
> > > > > > > > > +		BUG_ON(len == 0 || len % sizeof(struct vring_packed_desc));
> > > > > > > > Len could be ignored for used descriptor according to the spec, so we need
> > > > > > > > to remove this BUG_ON() too.
> > > > > > > Yeah, you're right! The BUG_ON() isn't right. I'll remove it.
> > > > > > > And I think something related to this in the spec isn't very
> > > > > > > clear currently.
> > > > > > > 
> > > > > > > In the spec, there are below words:
> > > > > > > 
> > > > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272
> > > > > > > """
> > > > > > > In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE
> > > > > > > is reserved and is ignored by the device.
> > > > > > > """
> > > > > > > 
> > > > > > > So when device writes back a used descriptor in this case,
> > > > > > > device may not set the VIRTQ_DESC_F_WRITE flag as the flag
> > > > > > > is reserved and should be ignored.
> > > > > > > 
> > > > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L170
> > > > > > > """
> > > > > > > Element Length is reserved for used descriptors without the
> > > > > > > VIRTQ_DESC_F_WRITE flag, and is ignored by drivers.
> > > > > > > """
> > > > > > > 
> > > > > > > And this is the way how driver ignores the `len` in an used
> > > > > > > descriptor.
> > > > > > > 
> > > > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L241
> > > > > > > """
> > > > > > > To increase ring capacity the driver can store a (read-only
> > > > > > > by the device) table of indirect descriptors anywhere in memory,
> > > > > > > and insert a descriptor in the main virtqueue (with \field{Flags}
> > > > > > > bit VIRTQ_DESC_F_INDIRECT on) that refers to a buffer element
> > > > > > > containing this indirect descriptor table;
> > > > > > > """
> > > > > > > 
> > > > > > > So the indirect descriptors in the table are read-only by
> > > > > > > the device. And the only descriptor which is writeable by
> > > > > > > the device is the descriptor in the main virtqueue (with
> > > > > > > Flags bit VIRTQ_DESC_F_INDIRECT on). So if we ignore the
> > > > > > > `len` in this descriptor, we won't be able to get the
> > > > > > > length of the data written by the device.
> > > > > > > 
> > > > > > > So I think the `len` in this descriptor will carry the
> > > > > > > length of the data written by the device (if the buffers
> > > > > > > are writable to the device) even if the VIRTQ_DESC_F_WRITE
> > > > > > > isn't set by the device. What do you think?
> > > > > > 
> > > > > > Yes I think so. But we'd better get clarification from Michael.
> > > > > 
> > > > > I think if you use a descriptor, and you want to supply len
> > > > > to guest, you set VIRTQ_DESC_F_WRITE in the used descriptor.
> > > > > Spec also says you must not set VIRTQ_DESC_F_INDIRECT then.
> > > > > If that's a problem we could look at relaxing that last requirement -
> > > > > does driver want INDIRECT in used descriptor to match
> > > > > the value in the avail descriptor for some reason?
> > > > 
> > > > For indirect, driver needs some way to get the length
> > > > of the data written by the device. And the descriptors
> > > > in the indirect table are read-only, so the only place
> > > > device could put this value is the descriptor with the
> > > > VIRTQ_DESC_F_INDIRECT flag set.
> > > 
> > > when writing out used descriptor, device should set VIRTQ_DESC_F_WRITE
> > > (and clear VIRTQ_DESC_F_INDIRECT).
> > 
> > So the spec allows device to set VIRTQ_DESC_F_WRITE bit
> > when writing out an used descriptor even if the corresponding
> > descriptors it just parsed don't have the VIRTQ_DESC_F_WRITE
> > bit set?
> > 
> > Best regards,
> > Tiwei Bie
> 
> I think so. In a used descriptor, VIRTQ_DESC_F_WRITE just means length
> is valid.

Got it. It's very neat. Thanks! :)
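
For reference, the rule can be sketched on the driver side as below.
struct model_packed_desc and used_len are hypothetical names, the fields
are assumed to be already in CPU byte order, and only the flag value is
the one the spec assigns to VIRTQ_DESC_F_WRITE:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DESC_F_WRITE	2	/* same value as VIRTQ_DESC_F_WRITE */

/* Simplified packed-ring descriptor, byte order handling left out. */
struct model_packed_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t id;
	uint16_t flags;
};

/*
 * Read the written length from a used descriptor. Returns false when
 * the WRITE flag is clear, in which case len is reserved and must be
 * ignored by the driver.
 */
static bool used_len(const struct model_packed_desc *d, uint32_t *len)
{
	if (!(d->flags & DESC_F_WRITE))
		return false;
	*len = d->len;
	return true;
}
```

So the driver never has to look at INDIRECT in the used descriptor; the
WRITE bit alone says whether len carries information.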

Best regards,
Tiwei Bie

> 
> -- 
> MST

^ permalink raw reply

