Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH] connector: Allow permission checking in the receiver callbacks
From: Lars Ellenberg @ 2009-09-30 13:20 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Philipp Reisner, linux-kernel, netdev, Andrew Morton
In-Reply-To: <20090930112057.GA15150@ioremap.net>

On Wed, Sep 30, 2009 at 03:20:57PM +0400, Evgeniy Polyakov wrote:
> Hi Philipp.
> 
> On Tue, Sep 29, 2009 at 04:48:07PM +0200, Philipp Reisner (philipp.reisner@linbit.com) wrote:
> > Various users of the connector should actually check if the
> > sender's capabilities of a netlink/connector packet are
> > actually sufficient for the operation they trigger. Up to
> > now the connector framework did not allow the kernel side
> > receiver to do so.
> > 
> > This patch set does the groundwork.
> > 
> > Philipp Reisner (4):
> >   connector: Keep the skb in cn_callback_data
> >   connector: Provide the sender's credentials to the callback
> >   connector/dm: Fixed a compilation warning
> >   connector: Removed the destruct_data callback since it is always
> >     kfree_skb()
> 
> Patches look good to me.
> Andrew please apply to the appropriate tree. I do not know whether it is
> acceptible now, since it is not a bugfix, but merely a simple cleanup.
> Feel free to add my signed off or ack, thank you.

Thanks.

Actually it is the basis for follow-up security fixes.

Without this, unprivileged user space is able to send arbitrary
connector requests to kernel subsystems, which have no way to verify the
privileges of the sender anymore, because that information, even though
available at the netlink layer, has been dropped by the connector.

Once this is applied, the various in-kernel receiving connector
callbacks can (and need to) add cap_raised(nsb->eff_cap, cap) where
appropriate. For example, you don't want some guest user to be able to
trigger a dst_del_node callback by sending a crafted netlink message,
right?

So it _is_ a (design-) bug fix.
Or am I missing something?

	Lars

^ permalink raw reply

* Re: [PATCH] ipvs: Add boundary check on ioctl arguments
From: Hannes Eder @ 2009-09-30 13:38 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Wensong Zhang, netdev, linux-kernel, Simon Horman
In-Reply-To: <20090930131109.2b3f71b8@infradead.org>

[cc: +Simon Horman]

Arjan van de Ven wrote:
 > From 761a182f96b3707e1fee44e1079ba227e48745d1 Mon Sep 17 00:00:00 2001
 > From: Arjan van de Ven <arjan@linux.intel.com>
 > Date: Wed, 30 Sep 2009 13:05:51 +0200
 > Subject: [PATCH] ipvs: Add boundary check on ioctl arguments
 >
 > The ipvs code has a nifty system for doing the size of ioctl command copies;
 > it defines an array with values into which it indexes the cmd to find the
 > right length.
 >
 > Unfortunately, the ipvs code forgot to check if the cmd was in the range
 > that the array provides, allowing for an index outside of the array,
 > which then gives a "garbage" result into the length, which then gets
 > used for copying into a stack buffer.
 >
 > Fix this by adding sanity checks on these as well as the copy size.
 >
 > Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
 > ---
 >  net/netfilter/ipvs/ip_vs_ctl.c |   14 +++++++++++++-
 >  1 files changed, 13 insertions(+), 1 deletions(-)
 >
 > diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
 > index ac624e5..3c52796 100644
 > --- a/net/netfilter/ipvs/ip_vs_ctl.c
 > +++ b/net/netfilter/ipvs/ip_vs_ctl.c
 > @@ -2077,6 +2077,10 @@ do_ip_vs_set_ctl(struct sock *sk, int cmd, void __user 
*user, unsigned int len)
 >  	if (!capable(CAP_NET_ADMIN))
 >  		return -EPERM;
 >
 > +	if (cmd < IP_VS_BASE_CTL || cmd > IP_VS_SO_SET_MAX + 1)
 > +		return -EINVAL;
 > +	if (len < 0 || len >  MAX_ARG_LEN)
 > +		return -EINVAL;
 >  	if (len != set_arglen[SET_CMDID(cmd)]) {
 >  		pr_err("set_ctl: len %u != %u\n",
 >  		       len, set_arglen[SET_CMDID(cmd)]);
 > @@ -2353,17 +2357,25 @@ do_ip_vs_get_ctl(struct sock *sk, int cmd, void __user 
*user, int *len)
 >  {
 >  	unsigned char arg[128];

can MAX_ARG_LEN be used here?

 >  	int ret = 0;
 > +	unsigned int copylen;
 >
 >  	if (!capable(CAP_NET_ADMIN))
 >  		return -EPERM;
 >
 > +	if (cmd < IP_VS_BASE_CTL || cmd > IP_VS_SO_GET_MAX + 1)
 > +		return -EINVAL;
 > +
 >  	if (*len < get_arglen[GET_CMDID(cmd)]) {
 >  		pr_err("get_ctl: len %u < %u\n",
 >  		       *len, get_arglen[GET_CMDID(cmd)]);
 >  		return -EINVAL;
 >  	}
 >
 > -	if (copy_from_user(arg, user, get_arglen[GET_CMDID(cmd)]) != 0)
 > +	copylen = get_arglen[GET_CMDID(cmd)];
 > +	if (copylen > 128)

I think it's better to use 'copylen > sizeof(arg)' here.

 > +		return -EINVAL;
 > +
 > +	if (copy_from_user(arg, user, copylen) != 0)
 >  		return -EFAULT;
 >
 >  	if (mutex_lock_interruptible(&__ip_vs_mutex))

^ permalink raw reply

* Re: [PATCH] bcm63xx_enet: timeout off by one in do_mdio_op()
From: Maxime Bizon @ 2009-09-30 14:01 UTC (permalink / raw)
  To: Roel Kluin; +Cc: netdev, Andrew Morton
In-Reply-To: <4AB7DD50.6030409@gmail.com>


On Mon, 2009-09-21 at 22:08 +0200, Roel Kluin wrote:

Hi Roel,

> `while (limit-- >= 0)' reaches -2 after the loop upon timeout.

The 1000us limit was chosen arbitrarily, since mdio access are much
shorter, and was just to prevent CPU lockup in case of hardware bug.

But it looks like a bug, and since you're the second one reporting this,
this should be fixed :)


> Signed-off-by: Roel Kluin <roel.kluin@gmail.com>

Acked-by: Maxime Bizon <mbizon@freebox.fr>

> ---
> Small chance to occur, probably.
> 
> diff --git a/drivers/net/bcm63xx_enet.c b/drivers/net/bcm63xx_enet.c
> index 09d2709..ba29dc3 100644
> --- a/drivers/net/bcm63xx_enet.c
> +++ b/drivers/net/bcm63xx_enet.c
> @@ -90,7 +90,7 @@ static int do_mdio_op(struct bcm_enet_priv *priv, unsigned int data)
>  		if (enet_readl(priv, ENET_IR_REG) & ENET_IR_MII)
>  			break;
>  		udelay(1);
> -	} while (limit-- >= 0);
> +	} while (limit-- > 0);
>  
>  	return (limit < 0) ? 1 : 0;
>  }


-- 
Maxime



^ permalink raw reply

* [PATCH] tg3: Remove prev_vlan_tag from struct tx_ring_info
From: Eric Dumazet @ 2009-09-30 14:07 UTC (permalink / raw)
  To: David S. Miller; +Cc: Matt Carlson, Linux Netdev List, Michael Chan

prev_vlan_tag field is not used.

Patch saves 512*8 bytes per tx queue ring on 64bit arches.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---

diff --git a/drivers/net/tg3.h b/drivers/net/tg3.h
index 82b45d8..524691c 100644
--- a/drivers/net/tg3.h
+++ b/drivers/net/tg3.h
@@ -2412,7 +2412,6 @@ struct ring_info {
 
 struct tx_ring_info {
 	struct sk_buff                  *skb;
-	u32                             prev_vlan_tag;
 };
 
 struct tg3_config_info {

^ permalink raw reply related

* Re: mac80211: NOHZ: local_softirq_pending 08
From: Michael Buesch @ 2009-09-30 14:33 UTC (permalink / raw)
  To: Oliver Hartkopp
  Cc: John W. Linville, Kalle Valo, linux-wireless, netdev,
	Johannes Berg
In-Reply-To: <4AC3475C.7000403@hartkopp.net>

On Wednesday 30 September 2009 13:56:12 Oliver Hartkopp wrote:
> John W. Linville wrote:
> > On Sat, Sep 12, 2009 at 06:41:12PM +0200, Oliver Hartkopp wrote:
> > 
> >> i cooked a patch that introduces netif_rx_ti() and fixes up the problems in
> >> mac80211 and the CAN subsystem.
> > 
> > Oliver,
> > 
> > Are you going to send this patch to Dave?  If you want me to carry
> > it instead, please resend it with a proper changelog including a
> > Signed-off-by line.  For that matter, Dave will most certainly want
> > that as well...
> 
> Hello John,
> 
> as i wrote here
> 
> http://marc.info/?l=linux-netdev&m=125277885910179&w=2
> 
> there are currently only three occurrences of checks that use netif_rx() and
> netif_rx_ni() depending on in_interrupt().
> 
> And regarding the suggested fix from Michael, that checked every(!) netif_rx()
> whether it is in interrupt or not, i was unsure if a netif_tx_ti() would make
> sense for only three cases?!?
> 
> If you think it makes sense, i can post a patch for that ... but:
> 
> Indeed it costs some additional investigation to prove whether netif_rx() or
> netif_rx_ni() should be used in each case. But IMHO this has to be done before
> providing a pump-gun function that solves the problem without thinking if we
> are in irq-context or not. I want to avoid that people are using netif_rx_ti()
> as some kind of default ...
> 
> I don't know how expensive in_interrupt() is, but it IMO should be avoided
> when the context for a code section can be determined in another way.

What if we just get the fix merged and discuss later whether it's worth to optimize a picosecond or not??
My patch fixes the _bug_. You can merge a more "efficient" fix later that saves one or two CPU cycles.

-- 
Greetings, Michael.

^ permalink raw reply

* Re: mac80211: NOHZ: local_softirq_pending 08
From: Kalle Valo @ 2009-09-30 14:47 UTC (permalink / raw)
  To: Michael Buesch
  Cc: Oliver Hartkopp, John W. Linville,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Johannes Berg
In-Reply-To: <200909301633.04376.mb-fseUSCV1ubazQB+pC5nmwQ@public.gmane.org>

Michael Buesch <mb-fseUSCV1ubazQB+pC5nmwQ@public.gmane.org> writes:

>> I don't know how expensive in_interrupt() is, but it IMO should be
>> avoided when the context for a code section can be determined in
>> another way.
>
> What if we just get the fix merged and discuss later whether it's
> worth to optimize a picosecond or not?? My patch fixes the _bug_.
> You can merge a more "efficient" fix later that saves one or two CPU
> cycles.

I agree with Michael. The bug is real and I have verified that
Michael's patch fixes the issue. Better to apply the patch now, it's
trivial to change the implementation if/when the network stack has
support for this.

-- 
Kalle Valo
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* N_PPP_SYNC ldisc BUG: sleeping function called from invalid context
From: Tilman Schmidt @ 2009-09-30 14:50 UTC (permalink / raw)
  To: linux-kernel, netdev; +Cc: Alan Cox

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

While testing pppd with capiplugin on kernel 2.6.31 (plus my CAPI and
Gigaset patches) I came across this BUG message. (Please ignore the
Tainted flag, it's from the Nvidia driver and doesn't influence what's
following.)

BUG: sleeping function called from invalid context at kernel/mutex.c:280
in_atomic(): 1, irqs_disabled(): 1, pid: 10, name: events/0
INFO: lockdep is turned off.
irq event stamp: 2034
hardirqs last  enabled at (2033): [<c0347b07>]
_spin_unlock_irqrestore+0x3c/0x6c
hardirqs last disabled at (2034): [<c03475d0>] _spin_lock_irq+0x11/0x3f
softirqs last  enabled at (2008): [<c012f598>] __do_softirq+0x114/0x11c
softirqs last disabled at (1971): [<c010593a>] do_softirq+0x69/0xc7
Pid: 10, comm: events/0 Tainted: P           2.6.31-vanilla #1
Call Trace:
 [<c0124a3c>] __might_sleep+0x10e/0x116
 [<c0346bdd>] mutex_lock_nested+0x18/0x31
 [<c026d39b>] tty_unthrottle+0x10/0x38
 [<f8dcc31f>] ppp_sync_receive+0x168/0x170 [ppp_synctty]
 [<f8fbb9ce>] handle_minor_recv+0x187/0x1cd [capi]
 [<f8fbc19b>] capi_recv_message+0x1d9/0x24e [capi]
 [<fa5be461>] recv_handler+0x56/0x6f [kernelcapi]
 [<c0138cba>] worker_thread+0x14a/0x21d
 [<c0138c9e>] ? worker_thread+0x12e/0x21d
 [<fa5be40b>] ? recv_handler+0x0/0x6f [kernelcapi]
 [<c013b375>] ? autoremove_wake_function+0x0/0x30
 [<c013b201>] kthread+0x64/0x69
 [<c0138b70>] ? worker_thread+0x0/0x21d
 [<c013b19d>] ? kthread+0x0/0x69
 [<c0103f33>] kernel_thread_helper+0x7/0x10

Turns out the ppp_sync_receive() function (drivers/net/ppp_synctty.c
line 385ff.) has a comment in front stating:

/*
 * This can now be called from hard interrupt level as well
 * as soft interrupt level or mainline.
 */

but calls tty_unthrottle() which in turn calls mutex_lock() which of
course can sleep.

That tty_unthrottle() call was already removed once by

commit a6540f731d506d9e82444cf0020e716613d4c46c
Author: Alan Cox <alan@linux.intel.com>
Date:   Mon Jun 22 18:42:29 2009 +0100

    ppp: Fix throttling bugs

but re-added by

commit 4a21b8cb3550f19f838f7c48345fbbf6a0e8536b
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Thu Jul 16 09:14:23 2009 -0700

    Revert "ppp: Fix throttling bugs"

    This reverts commit a6540f731d506d9e82444cf0020e716613d4c46c, as
    requested by Alan:

      "... as it was wrong, the pty code is now fixed and the fact this
       isn't reverted is breaking pptp setups."

Opinions?

Thanks,
Tilman

- --
Tilman Schmidt                    E-Mail: tilman@imap.cc
Bonn, Germany
Diese Nachricht besteht zu 100% aus wiederverwerteten Bits.
Ungeöffnet mindestens haltbar bis: (siehe Rückseite)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.4 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFKw3AyQ3+did9BuFsRAiZXAKCGaos/qZNTlStEP4SE90PA0ZoMAQCdFtvf
U9chE7at35y8c6CGGS1IGg0=
=Vpq4
-----END PGP SIGNATURE-----

^ permalink raw reply

* Re: mac80211: NOHZ: local_softirq_pending 08
From: Johannes Berg @ 2009-09-30 14:54 UTC (permalink / raw)
  To: Kalle Valo
  Cc: Michael Buesch, Oliver Hartkopp, John W. Linville, linux-wireless,
	netdev
In-Reply-To: <87ocosqykb.fsf@purkki.valot.fi>

[-- Attachment #1: Type: text/plain, Size: 489 bytes --]

On Wed, 2009-09-30 at 17:47 +0300, Kalle Valo wrote:

> I agree with Michael. The bug is real and I have verified that
> Michael's patch fixes the issue. Better to apply the patch now, it's
> trivial to change the implementation if/when the network stack has
> support for this.

FWIW, I think in mac80211 the in_interrupt() check can never return true
since we postpone all RX to the tasklet. But the tasklet seems to be ok
-- so should it really be in_interrupt()?

johannes

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply

* Re: mac80211: NOHZ: local_softirq_pending 08
From: Michael Buesch @ 2009-09-30 15:10 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Kalle Valo, Oliver Hartkopp, John W. Linville,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1254322466.3959.5.camel-YfaajirXv2244ywRPIzf9A@public.gmane.org>

On Wednesday 30 September 2009 16:54:26 Johannes Berg wrote:
> On Wed, 2009-09-30 at 17:47 +0300, Kalle Valo wrote:
> 
> > I agree with Michael. The bug is real and I have verified that
> > Michael's patch fixes the issue. Better to apply the patch now, it's
> > trivial to change the implementation if/when the network stack has
> > support for this.
> 
> FWIW, I think in mac80211 the in_interrupt() check can never return true
> since we postpone all RX to the tasklet. But the tasklet seems to be ok
> -- so should it really be in_interrupt()?

I think a tasklet is also in_interrupt(), because it's a softirq.
in_interrupt() returns false in process context. The problem appeared when
the b43 driver started passing RX frames while being in process context (threaded IRQ).
It previously was in tasklet (= softirq) context.

-- 
Greetings, Michael.
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] ipvs: Add boundary check on ioctl arguments
From: Arjan van de Ven @ 2009-09-30 15:18 UTC (permalink / raw)
  To: Hannes Eder; +Cc: Wensong Zhang, netdev, linux-kernel, Simon Horman
In-Reply-To: <4AC35F44.60707@google.com>

On Wed, 30 Sep 2009 15:38:12 +0200
Hannes Eder <heder@google.com> wrote:
>  > @@ -2353,17 +2357,25 @@ do_ip_vs_get_ctl(struct sock *sk, int cmd,
>  > void __user 
> *user, int *len)
>  >  {
>  >  	unsigned char arg[128];
> 
> can MAX_ARG_LEN be used here?

I am not convinced... it is a different numerical value,
so it could be an ABI change. Rather not do that in this
type of patch...

>  > +	copylen = get_arglen[GET_CMDID(cmd)];
>  > +	if (copylen > 128)
> 
> I think it's better to use 'copylen > sizeof(arg)' here.

fair enough; updated patch below

>From 28ae217858e683c0c94c02219d46a9a9c87f61c6 Mon Sep 17 00:00:00 2001
From: Arjan van de Ven <arjan@linux.intel.com>
Date: Wed, 30 Sep 2009 13:05:51 +0200
Subject: [PATCH] ipvs: Add boundary check on ioctl arguments

The ipvs code has a nifty system for doing the size of ioctl command copies;
it defines an array with values into which it indexes the cmd to find the
right length.

Unfortunately, the ipvs code forgot to check if the cmd was in the range
that the array provides, allowing for an index outside of the array,
which then gives a "garbage" result into the length, which then gets
used for copying into a stack buffer.

Fix this by adding sanity checks on these as well as the copy size.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 net/netfilter/ipvs/ip_vs_ctl.c |   14 +++++++++++++-
 1 files changed, 13 insertions(+), 1 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index ac624e5..7adc876 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -2077,6 +2077,10 @@ do_ip_vs_set_ctl(struct sock *sk, int cmd, void __user *user, unsigned int len)
 	if (!capable(CAP_NET_ADMIN))
 		return -EPERM;
 
+	if (cmd < IP_VS_BASE_CTL || cmd > IP_VS_SO_SET_MAX + 1)
+		return -EINVAL;
+	if (len < 0 || len >  sizeof(arg))
+		return -EINVAL;
 	if (len != set_arglen[SET_CMDID(cmd)]) {
 		pr_err("set_ctl: len %u != %u\n",
 		       len, set_arglen[SET_CMDID(cmd)]);
@@ -2353,17 +2357,25 @@ do_ip_vs_get_ctl(struct sock *sk, int cmd, void __user *user, int *len)
 {
 	unsigned char arg[128];
 	int ret = 0;
+	unsigned int copylen;
 
 	if (!capable(CAP_NET_ADMIN))
 		return -EPERM;
 
+	if (cmd < IP_VS_BASE_CTL || cmd > IP_VS_SO_GET_MAX + 1)
+		return -EINVAL;
+
 	if (*len < get_arglen[GET_CMDID(cmd)]) {
 		pr_err("get_ctl: len %u < %u\n",
 		       *len, get_arglen[GET_CMDID(cmd)]);
 		return -EINVAL;
 	}
 
-	if (copy_from_user(arg, user, get_arglen[GET_CMDID(cmd)]) != 0)
+	copylen = get_arglen[GET_CMDID(cmd)];
+	if (copylen > sizeof(arg))
+		return -EINVAL;
+
+	if (copy_from_user(arg, user, copylen) != 0)
 		return -EFAULT;
 
 	if (mutex_lock_interruptible(&__ip_vs_mutex))
-- 
1.6.2.5



-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply related

* Re: mac80211: NOHZ: local_softirq_pending 08
From: Johannes Berg @ 2009-09-30 15:21 UTC (permalink / raw)
  To: Michael Buesch
  Cc: Kalle Valo, Oliver Hartkopp, John W. Linville, linux-wireless,
	netdev
In-Reply-To: <200909301710.31082.mb@bu3sch.de>

[-- Attachment #1: Type: text/plain, Size: 773 bytes --]

On Wed, 2009-09-30 at 17:10 +0200, Michael Buesch wrote:
> On Wednesday 30 September 2009 16:54:26 Johannes Berg wrote:
> > On Wed, 2009-09-30 at 17:47 +0300, Kalle Valo wrote:
> > 
> > > I agree with Michael. The bug is real and I have verified that
> > > Michael's patch fixes the issue. Better to apply the patch now, it's
> > > trivial to change the implementation if/when the network stack has
> > > support for this.
> > 
> > FWIW, I think in mac80211 the in_interrupt() check can never return true
> > since we postpone all RX to the tasklet. But the tasklet seems to be ok
> > -- so should it really be in_interrupt()?
> 
> I think a tasklet is also in_interrupt(), because it's a softirq.

Ah, yes, indeed, in_interrupt() vs. in_irq().

johannes

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply

* Re: [PATCH] ipvs: Add boundary check on ioctl arguments
From: Hannes Eder @ 2009-09-30 15:33 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Wensong Zhang, netdev, linux-kernel, Simon Horman
In-Reply-To: <20090930171833.5ce0011d@infradead.org>

On Wed, Sep 30, 2009 at 17:18, Arjan van de Ven <arjan@infradead.org> wrote:
> On Wed, 30 Sep 2009 15:38:12 +0200
> Hannes Eder <heder@google.com> wrote:
>>  > @@ -2353,17 +2357,25 @@ do_ip_vs_get_ctl(struct sock *sk, int cmd,
>>  > void __user
>> *user, int *len)
>>  >  {
>>  >    unsigned char arg[128];
>>
>> can MAX_ARG_LEN be used here?
>
> I am not convinced... it is a different numerical value,
> so it could be an ABI change. Rather not do that in this
> type of patch...

For do_ip_vs_set_ctl MAX_ARG_LEN is used:

static int
do_ip_vs_set_ctl(struct sock *sk, int cmd, void __user *user, unsigned int len)
{
	int ret;
	unsigned char arg[MAX_ARG_LEN];
...

I assume that will be fine for do_ip_vs_get_ctl as well.

-Hannes

^ permalink raw reply

* Re: N_PPP_SYNC ldisc BUG: sleeping function called from invalid context
From: Alan Cox @ 2009-09-30 16:47 UTC (permalink / raw)
  To: Tilman Schmidt; +Cc: linux-kernel, netdev, Alan Cox
In-Reply-To: <4AC37032.4030901@imap.cc>

>  [<c026d39b>] tty_unthrottle+0x10/0x38
>  [<f8dcc31f>] ppp_sync_receive+0x168/0x170 [ppp_synctty]
>  [<f8fbb9ce>] handle_minor_recv+0x187/0x1cd [capi]
>  [<f8fbc19b>] capi_recv_message+0x1d9/0x24e [capi]

Really need to see the rest of the call trace to be sure

> Turns out the ppp_sync_receive() function (drivers/net/ppp_synctty.c
> line 385ff.) has a comment in front stating:
> 
> /*
>  * This can now be called from hard interrupt level as well
>  * as soft interrupt level or mainline.
>  */

Which is wrong. The flip_buffer_push -> rx processing path should never
be called from IRQ context and that was fixed for various drivers that
mis-set tty->low_latency, as well as in the PPP rework. The PPP case is
actually unrelated in many was.

> Opinions?

See how we got into that code direct from an IRQ path. The expectation of
the tty logic is that it gets processed from work queues either
specifically in driver or via tty_flip_buffer_push when tty->low_latency
= 0

^ permalink raw reply

* Re: [PATCH] tg3: Remove prev_vlan_tag from struct tx_ring_info
From: Matt Carlson @ 2009-09-30 17:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Matthew Carlson, Linux Netdev List, Michael Chan
In-Reply-To: <4AC36638.8070304@gmail.com>

On Wed, Sep 30, 2009 at 07:07:52AM -0700, Eric Dumazet wrote:
> prev_vlan_tag field is not used.
> 
> Patch saves 512*8 bytes per tx queue ring on 64bit arches.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Looks good to me.

Acked-by: Matthew Carlson <mcarlson@broadcom.com>

> ---
> 
> diff --git a/drivers/net/tg3.h b/drivers/net/tg3.h
> index 82b45d8..524691c 100644
> --- a/drivers/net/tg3.h
> +++ b/drivers/net/tg3.h
> @@ -2412,7 +2412,6 @@ struct ring_info {
>  
>  struct tx_ring_info {
>  	struct sk_buff                  *skb;
> -	u32                             prev_vlan_tag;
>  };
>  
>  struct tg3_config_info {
> 


^ permalink raw reply

* Re: mac80211: NOHZ: local_softirq_pending 08
From: Oliver Hartkopp @ 2009-09-30 17:51 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Michael Buesch, Kalle Valo, John W. Linville,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1254324077.3959.7.camel-YfaajirXv2244ywRPIzf9A@public.gmane.org>

Johannes Berg wrote:
> On Wed, 2009-09-30 at 17:10 +0200, Michael Buesch wrote:
>> On Wednesday 30 September 2009 16:54:26 Johannes Berg wrote:
>>> On Wed, 2009-09-30 at 17:47 +0300, Kalle Valo wrote:
>>>
>>>> I agree with Michael. The bug is real and I have verified that
>>>> Michael's patch fixes the issue. Better to apply the patch now, it's
>>>> trivial to change the implementation if/when the network stack has
>>>> support for this.
>>> FWIW, I think in mac80211 the in_interrupt() check can never return true
>>> since we postpone all RX to the tasklet. But the tasklet seems to be ok
>>> -- so should it really be in_interrupt()?
>> I think a tasklet is also in_interrupt(), because it's a softirq.
> 
> Ah, yes, indeed, in_interrupt() vs. in_irq().
> 

Oops!

I missed that for my previous patch i added for two occurrences in the CAN
sources.

I'm currently compiling the patch for netif_rx_ti() and will post it in some
minutes (for CAN and mac80211) when it runs without probs.

Regards,
Oliver
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH] net: fix NOHZ: local_softirq_pending 08
From: Oliver Hartkopp @ 2009-09-30 18:18 UTC (permalink / raw)
  To: David Miller
  Cc: Johannes Berg, Michael Buesch, Kalle Valo, John W. Linville,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <4AC39A90.6060602-fJ+pQTUTwRTk1uMJSBkQmQ@public.gmane.org>

[-- Attachment #1: Type: text/plain, Size: 505 bytes --]

Socket buffers that are generated and received inside softirqs or from process
context must not use netif_rx() that's intended to be used from irq context only.

This patch introduces a new helper function netif_rx_ti(skb) that tests for
in_interrupt() before invoking netif_rx() or netif_rx_ni().

It fixes the ratelimited kernel warning

        NOHZ: local_softirq_pending 08

in the mac80211 and can subsystems.

Signed-off-by: Oliver Hartkopp <oliver-fJ+pQTUTwRTk1uMJSBkQmQ@public.gmane.org>

---




[-- Attachment #2: net-NOHZ-local_softirq_pending-08.patch --]
[-- Type: text/x-patch, Size: 4031 bytes --]

diff --git a/drivers/net/can/vcan.c b/drivers/net/can/vcan.c
index 80ac563..899f3d3 100644
--- a/drivers/net/can/vcan.c
+++ b/drivers/net/can/vcan.c
@@ -80,7 +80,7 @@ static void vcan_rx(struct sk_buff *skb, struct net_device *dev)
 	skb->dev       = dev;
 	skb->ip_summed = CHECKSUM_UNNECESSARY;
 
-	netif_rx_ni(skb);
+	netif_rx_ti(skb);
 }
 
 static netdev_tx_t vcan_tx(struct sk_buff *skb, struct net_device *dev)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 94958c1..dc8dfb2 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1509,6 +1509,19 @@ extern int		netdev_budget;
 extern void netdev_run_todo(void);
 
 /**
+ *	netif_rx_ti - test for irq context and post buffer to the network code
+ *	@skb: buffer to post
+ *
+ */
+static inline int netif_rx_ti(struct sk_buff *skb)
+{
+	if (in_interrupt())
+		return netif_rx(skb);
+	else
+		return netif_rx_ni(skb);
+}
+
+/**
  *	dev_put - release reference to device
  *	@dev: network device
  *
diff --git a/net/can/af_can.c b/net/can/af_can.c
index 6068321..c21e7f4 100644
--- a/net/can/af_can.c
+++ b/net/can/af_can.c
@@ -199,8 +199,6 @@ static int can_create(struct net *net, struct socket *sock, int protocol)
  * @skb: pointer to socket buffer with CAN frame in data section
  * @loop: loopback for listeners on local CAN sockets (recommended default!)
  *
- * Due to the loopback this routine must not be called from hardirq context.
- *
  * Return:
  *  0 on success
  *  -ENETDOWN when the selected interface is down
@@ -280,7 +278,7 @@ int can_send(struct sk_buff *skb, int loop)
 	}
 
 	if (newskb)
-		netif_rx_ni(newskb);
+		netif_rx_ti(newskb);
 
 	/* update statistics */
 	can_stats.tx_frames++;
diff --git a/net/mac80211/cfg.c b/net/mac80211/cfg.c
index 5608f6c..bbcb4cb 100644
--- a/net/mac80211/cfg.c
+++ b/net/mac80211/cfg.c
@@ -606,7 +606,7 @@ static void ieee80211_send_layer2_update(struct sta_info *sta)
 	skb->dev = sta->sdata->dev;
 	skb->protocol = eth_type_trans(skb, sta->sdata->dev);
 	memset(skb->cb, 0, sizeof(skb->cb));
-	netif_rx(skb);
+	netif_rx_ti(skb);
 }
 
 static void sta_apply_parameters(struct ieee80211_local *local,
diff --git a/net/mac80211/main.c b/net/mac80211/main.c
index 797f539..1109f99 100644
--- a/net/mac80211/main.c
+++ b/net/mac80211/main.c
@@ -591,7 +591,7 @@ void ieee80211_tx_status(struct ieee80211_hw *hw, struct sk_buff *skb)
 				skb2 = skb_clone(skb, GFP_ATOMIC);
 				if (skb2) {
 					skb2->dev = prev_dev;
-					netif_rx(skb2);
+					netif_rx_ti(skb2);
 				}
 			}
 
@@ -600,7 +600,7 @@ void ieee80211_tx_status(struct ieee80211_hw *hw, struct sk_buff *skb)
 	}
 	if (prev_dev) {
 		skb->dev = prev_dev;
-		netif_rx(skb);
+		netif_rx_ti(skb);
 		skb = NULL;
 	}
 	rcu_read_unlock();
diff --git a/net/mac80211/rx.c b/net/mac80211/rx.c
index c01588f..5bb7c04 100644
--- a/net/mac80211/rx.c
+++ b/net/mac80211/rx.c
@@ -309,7 +309,7 @@ ieee80211_rx_monitor(struct ieee80211_local *local, struct sk_buff *origskb,
 			skb2 = skb_clone(skb, GFP_ATOMIC);
 			if (skb2) {
 				skb2->dev = prev_dev;
-				netif_rx(skb2);
+				netif_rx_ti(skb2);
 			}
 		}
 
@@ -320,7 +320,7 @@ ieee80211_rx_monitor(struct ieee80211_local *local, struct sk_buff *origskb,
 
 	if (prev_dev) {
 		skb->dev = prev_dev;
-		netif_rx(skb);
+		netif_rx_ti(skb);
 	} else
 		dev_kfree_skb(skb);
 
@@ -1349,7 +1349,7 @@ ieee80211_deliver_skb(struct ieee80211_rx_data *rx)
 			/* deliver to local stack */
 			skb->protocol = eth_type_trans(skb, dev);
 			memset(skb->cb, 0, sizeof(skb->cb));
-			netif_rx(skb);
+			netif_rx_ti(skb);
 		}
 	}
 
@@ -1943,7 +1943,7 @@ static void ieee80211_rx_cooked_monitor(struct ieee80211_rx_data *rx)
 			skb2 = skb_clone(skb, GFP_ATOMIC);
 			if (skb2) {
 				skb2->dev = prev_dev;
-				netif_rx(skb2);
+				netif_rx_ti(skb2);
 			}
 		}
 
@@ -1954,7 +1954,7 @@ static void ieee80211_rx_cooked_monitor(struct ieee80211_rx_data *rx)
 
 	if (prev_dev) {
 		skb->dev = prev_dev;
-		netif_rx(skb);
+		netif_rx_ti(skb);
 		skb = NULL;
 	} else
 		goto out_free_skb;

^ permalink raw reply related

* Re: [PATCH] net: fix NOHZ: local_softirq_pending 08
From: John W. Linville @ 2009-09-30 18:47 UTC (permalink / raw)
  To: Oliver Hartkopp
  Cc: David Miller, Johannes Berg, Michael Buesch, Kalle Valo,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <4AC3A0F1.3060306-fJ+pQTUTwRTk1uMJSBkQmQ@public.gmane.org>

On Wed, Sep 30, 2009 at 08:18:25PM +0200, Oliver Hartkopp wrote:
> Socket buffers that are generated and received inside softirqs or from process
> context must not use netif_rx() that's intended to be used from irq context only.
> 
> This patch introduces a new helper function netif_rx_ti(skb) that tests for
> in_interrupt() before invoking netif_rx() or netif_rx_ni().
> 
> It fixes the ratelimited kernel warning
> 
>         NOHZ: local_softirq_pending 08
> 
> in the mac80211 and can subsystems.
> 
> Signed-off-by: Oliver Hartkopp <oliver-fJ+pQTUTwRTk1uMJSBkQmQ@public.gmane.org>

http://bugzilla.kernel.org/show_bug.cgi?id=14278

Acked-by: John W. Linville <linville-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org>

-- 
John W. Linville		Someday the world will need a hero, and you
linville-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org			might be all we have.  Be ready.
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: N_PPP_SYNC ldisc BUG: sleeping function called from invalid context
From: Tilman Schmidt @ 2009-09-30 18:55 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux-kernel, netdev, Alan Cox
In-Reply-To: <20090930174704.796b24b9@lxorguk.ukuu.org.uk>

[-- Attachment #1: Type: text/plain, Size: 2412 bytes --]

Alan Cox schrieb:
>>  [<c026d39b>] tty_unthrottle+0x10/0x38
>>  [<f8dcc31f>] ppp_sync_receive+0x168/0x170 [ppp_synctty]
>>  [<f8fbb9ce>] handle_minor_recv+0x187/0x1cd [capi]
>>  [<f8fbc19b>] capi_recv_message+0x1d9/0x24e [capi]
> 
> Really need to see the rest of the call trace to be sure

There wasn't more than what I posted. I had six of them, they looked all
identical, and all of them ended after the kernel_thread_helper line. 

>> Turns out the ppp_sync_receive() function (drivers/net/ppp_synctty.c
>> line 385ff.) has a comment in front stating:
>>
>> /*
>>  * This can now be called from hard interrupt level as well
>>  * as soft interrupt level or mainline.
>>  */
> 
> Which is wrong. The flip_buffer_push -> rx processing path should never
> be called from IRQ context and that was fixed for various drivers that
> mis-set tty->low_latency, as well as in the PPP rework. The PPP case is
> actually unrelated in many was.

Might be worth correcting that text then before is misleads someone.

>> Opinions?
> 
> See how we got into that code direct from an IRQ path. The expectation of
> the tty logic is that it gets processed from work queues either
> specifically in driver or via tty_flip_buffer_push when tty->low_latency
> = 0

I'm at a loss here. According to all the backtraces:

- ppp_sync_receive() was called, as the LD's receive_buf method,
  via handle_recv_skb() [drivers/isdn/capi/capi.c line 504, inlined]
  from handle_minor_recv() [drivers/isdn/capi/capi.c line 519]

- handle_minor_recv() was called from capi_recv_message()
  [drivers/isdn/capi/capi.c line 656]

- capi_recv_message() was called, as the CAPI application's
  recv_message method, from recv_handler()
  [drivers/isdn/capi/kcapi.c line 268]

- recv_handler() is never called directly. It's only scheduled
  via the work queue ap->recv_work from capi_ctr_handle_message()
  [drivers/isdn/capi/kcapi.c line 349]

Even if we don't trust the backtraces, there's not much room for
another activation path. So for all I know, the expectation of the
tty logic should have been met. The call was indeed processed from
a work queue.

Why then does mutex_lock() complain?

-- 
Tilman Schmidt                    E-Mail: tilman@imap.cc
Bonn, Germany
Diese Nachricht besteht zu 100% aus wiederverwerteten Bits.
Ungeöffnet mindestens haltbar bis: (siehe Rückseite)


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 254 bytes --]

^ permalink raw reply

* Re: [PATCH] connector: Allow permission checking in the receiver callbacks
From: Evgeniy Polyakov @ 2009-09-30 19:29 UTC (permalink / raw)
  To: Lars Ellenberg; +Cc: Philipp Reisner, linux-kernel, netdev, Andrew Morton
In-Reply-To: <20090930132034.GE8032@barkeeper1-xen.linbit>

On Wed, Sep 30, 2009 at 03:20:35PM +0200, Lars Ellenberg (lars.ellenberg@linbit.com) wrote:
> Actually it is the basis for follow-up security fixes.
> 
> Without this, unprivileged user space is able to send arbitrary
> connector requests to kernel subsystems, which have no way to verify the
> privileges of the sender anymore, because that information, even though
> available at the netlink layer, has been dropped by the connector.

It is not. One can add some checks at receiving time which happens in
process context to get its credentials, but nothing in netlink itself
carry this info. Getting that connector schedules workqueue this ability
is lost.

> Once this is applied, the various in-kernel receiving connector
> callbacks can (and need to) add cap_raised(nsb->eff_cap, cap) where
> appropriate. For example, you don't want some guest user to be able to
> trigger a dst_del_node callback by sending a crafted netlink message,
> right?
> 
> So it _is_ a (design-) bug fix.
> Or am I missing something?

This patchset is not a bugfix, just a cleanup, since none in patchset
uses netlink_skb_parms and currently I see no users which are affected
by this behaviour in the mainline branch (not counting staging tree).

But if proposed configuration changes for DM are on the way, then I
agree and they should force this patchset into the tree as a bugfix.

-- 
	Evgeniy Polyakov

^ permalink raw reply

* Re: [PATCH] ipvs: Add boundary check on ioctl arguments
From: Julian Anastasov @ 2009-09-30 19:41 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Hannes Eder, Wensong Zhang, netdev, linux-kernel, Simon Horman
In-Reply-To: <20090930171833.5ce0011d@infradead.org>


	Hello,

On Wed, 30 Sep 2009, Arjan van de Ven wrote:

> fair enough; updated patch below
> 
> >From 28ae217858e683c0c94c02219d46a9a9c87f61c6 Mon Sep 17 00:00:00 2001
> From: Arjan van de Ven <arjan@linux.intel.com>
> Date: Wed, 30 Sep 2009 13:05:51 +0200
> Subject: [PATCH] ipvs: Add boundary check on ioctl arguments
> 
> The ipvs code has a nifty system for doing the size of ioctl command copies;
> it defines an array with values into which it indexes the cmd to find the
> right length.
> 
> Unfortunately, the ipvs code forgot to check if the cmd was in the range
> that the array provides, allowing for an index outside of the array,
> which then gives a "garbage" result into the length, which then gets
> used for copying into a stack buffer.

	do_ip_vs_get_ctl and do_ip_vs_set_ctl are nf_sockopt_ops
handlers, so the range is checked by nf_sockopt_find() in Netfilter
code. get_arglen[] and set_arglen[] are minimum values for
the length and they can be 0. Later len can be checked
additionally and surely can exceed 128 (include/linux/ip_vs.h has
all user structures). Can you show the exact cmd and len
used, may be there is error in some command or may be the
provided user structure is wrong?

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply

* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Gregory Haskins @ 2009-09-30 20:04 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ira W. Snyder, Michael S. Tsirkin, netdev, virtualization, kvm,
	linux-kernel, mingo, linux-mm, akpm, hpa, Rusty Russell, s.hetze,
	alacrityvm-devel
In-Reply-To: <4ABF33B2.4000805@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 35235 bytes --]

Avi Kivity wrote:
> On 09/26/2009 12:32 AM, Gregory Haskins wrote:
>>>>
>>>> I realize in retrospect that my choice of words above implies vbus _is_
>>>> complete, but this is not what I was saying.  What I was trying to
>>>> convey is that vbus is _more_ complete.  Yes, in either case some kind
>>>> of glue needs to be written.  The difference is that vbus implements
>>>> more of the glue generally, and leaves less required to be customized
>>>> for each iteration.
>>>>
>>>>        
>>>
>>> No argument there.  Since you care about non-virt scenarios and virtio
>>> doesn't, naturally vbus is a better fit for them as the code stands.
>>>      
>> Thanks for finally starting to acknowledge there's a benefit, at least.
>>    
> 
> I think I've mentioned vbus' finer grained layers as helpful here,
> though I doubt the value of this.  Hypervisors are added rarely, while
> devices and drivers are added (and modified) much more often.  I don't
> buy the anything-to-anything promise.

The ease in which a new hypervisor should be able to integrate into the
stack is only one of vbus's many benefits.

> 
>> To be more precise, IMO virtio is designed to be a performance oriented
>> ring-based driver interface that supports all types of hypervisors (e.g.
>> shmem based kvm, and non-shmem based Xen).  vbus is designed to be a
>> high-performance generic shared-memory interconnect (for rings or
>> otherwise) framework for environments where linux is the underpinning
>> "host" (physical or virtual).  They are distinctly different, but
>> complementary (the former addresses the part of the front-end, and
>> latter addresses the back-end, and a different part of the front-end).
>>    
> 
> They're not truly complementary since they're incompatible.

No, that is incorrect.  Not to be rude, but for clarity:

  Complementary \Com`ple*men"ta*ry\, a.
     Serving to fill out or to complete; as, complementary
     numbers.
     [1913 Webster]

Citation: www.dict.org

IOW: Something being complementary has nothing to do with guest/host
binary compatibility.  virtio-pci and virtio-vbus are both equally
complementary to virtio since they fill in the bottom layer of the
virtio stack.

So yes, vbus is truly complementary to virtio afaict.

> A 2.6.27 guest, or Windows guest with the existing virtio drivers, won't work
> over vbus.

Binary compatibility with existing virtio drivers, while nice to have,
is not a specific requirement nor goal.  We will simply load an updated
KMP/MSI into those guests and they will work again.  As previously
discussed, this is how more or less any system works today.  It's like
we are removing an old adapter card and adding a new one to "uprev the
silicon".

>  Further, non-shmem virtio can't work over vbus.

Actually I misspoke earlier when I said virtio works over non-shmem.
Thinking about it some more, both virtio and vbus fundamentally require
shared-memory, since sharing their metadata concurrently on both sides
is their raison d'être.

The difference is that virtio utilizes a pre-translation/mapping (via
->add_buf) from the guest side.  OTOH, vbus uses a post translation
scheme (via memctx) from the host-side.  If anything, vbus is actually
more flexible because it doesn't assume the entire guest address space
is directly mappable.

In summary, your statement is incorrect (though it is my fault for
putting that idea in your head).

>  Since
> virtio is guest-oriented and host-agnostic, it can't ignore
> non-shared-memory hosts (even though it's unlikely virtio will be
> adopted there)

Well, to be fair no one said it has to ignore them.  Either virtio-vbus
transport is present and available to the virtio stack, or it isn't.  If
its present, it may or may not publish objects for consumption.
Providing a virtio-vbus transport in no way limits or degrades the
existing capabilities of the virtio stack.  It only enhances them.

I digress.  The whole point is moot since I realized that the non-shmem
distinction isn't accurate anyway.  They both require shared-memory for
the metadata, and IIUC virtio requires the entire address space to be
mappable whereas vbus only assumes the metadata is.

> 
>> In addition, the kvm-connector used in AlacrityVM's design strives to
>> add value and improve performance via other mechanisms, such as dynamic
>>   allocation, interrupt coalescing (thus reducing exit-ratio, which is a
>> serious issue in KVM)
> 
> Do you have measurements of inter-interrupt coalescing rates (excluding
> intra-interrupt coalescing).

I actually do not have a rig setup to explicitly test inter-interrupt
rates at the moment.  Once things stabilize for me, I will try to
re-gather some numbers here.  Last time I looked, however, there were
some decent savings for inter as well.

Inter rates are interesting because they are what tends to ramp up with
IO load more than intra since guest interrupt mitigation techniques like
NAPI often quell intra-rates naturally.  This is especially true for
data-center, cloud, hpc-grid, etc, kind of workloads (vs vanilla
desktops, etc) that tend to have multiple IO ports (multi-homed nics,
disk-io, etc).  Those various ports tend to be workload-related to one
another (e.g. 3-tier web stack may use multi-homed network and disk-io
at the same time, trigged by one IO event).

An interesting thing here is that you don't even need a fancy
multi-homed setup to see the effects of my exit-ratio reduction work:
even single port configurations suffer from the phenomenon since many
devices have multiple signal-flows (e.g. network adapters tend to have
at least 3 flows: rx-ready, tx-complete, and control-events (link-state,
etc).  Whats worse, is that the flows often are indirectly related (for
instance, many host adapters will free tx skbs during rx operations, so
you tend to get bursts of tx-completes at the same time as rx-ready.  If
the flows map 1:1 with IDT, they will suffer the same problem.

In any case, here is an example run of a simple single-homed guest over
standard GigE.  Whats interesting here is that .qnotify to .notify
ratio, as this is the interrupt-to-signal ratio.  In this case, its
170047/151918, which comes out to about 11% savings in interrupt injections:

vbus-guest:/home/ghaskins # netperf -H dev
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
dev.laurelwood.net (192.168.1.10) port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

1048576  16384  16384    10.01     940.77
vbus-guest:/home/ghaskins # cat /sys/kernel/debug/pci-to-vbus-bridge
  .events                        : 170048
  .qnotify                       : 151918
  .qinject                       : 0
  .notify                        : 170047
  .inject                        : 18238
  .bridgecalls                   : 18
  .buscalls                      : 12
vbus-guest:/home/ghaskins # cat /proc/interrupts
            CPU0
   0:         87   IO-APIC-edge      timer
   1:          6   IO-APIC-edge      i8042
   4:        733   IO-APIC-edge      serial
   6:          2   IO-APIC-edge      floppy
   7:          0   IO-APIC-edge      parport0
   8:          0   IO-APIC-edge      rtc0
   9:          0   IO-APIC-fasteoi   acpi
  10:          0   IO-APIC-fasteoi   virtio1
  12:         90   IO-APIC-edge      i8042
  14:       3041   IO-APIC-edge      ata_piix
  15:       1008   IO-APIC-edge      ata_piix
  24:     151933   PCI-MSI-edge      vbus
  25:          0   PCI-MSI-edge      virtio0-config
  26:        190   PCI-MSI-edge      virtio0-input
  27:         28   PCI-MSI-edge      virtio0-output
 NMI:          0   Non-maskable interrupts
 LOC:       9854   Local timer interrupts
 SPU:          0   Spurious interrupts
 CNT:          0   Performance counter interrupts
 PND:          0   Performance pending work
 RES:          0   Rescheduling interrupts
 CAL:          0   Function call interrupts
 TLB:          0   TLB shootdowns
 TRM:          0   Thermal event interrupts
 THR:          0   Threshold APIC interrupts
 MCE:          0   Machine check exceptions
 MCP:          1   Machine check polls
 ERR:          0
 MIS:          0

Its important to note here that we are actually looking at the interrupt
rate, not the exit rate (which is usually a multiple of the interrupt
rate, since you have to factor in as many as three exits per interrupt
(IPI, window, EOI).  Therefore we saved about 18k interrupts in this 10
second burst, but we may have actually saved up to 54k exits in the
process. This is only over a 10 second window at GigE rates, so YMMV.
These numbers get even more dramatic on higher end hardware, but I
haven't had a chance to generate new numbers yet.

Looking at some external stats paints an even bleaker picture: "exits"
as reported by kvm_stat for virtio-pci based virtio-net tip the scales
at 65k/s vs 36k/s for vbus based venet.  And virtio is consuming ~30% of
my quad-core's cpu, vs 19% for venet during the test.  Its hard to know
which innovation or innovations may be responsible for the entire
reduction, but certainly the interrupt-to-signal ratio mentioned above
is probably helping.

The even worse news for 1:1 models is that the ratio of
exits-per-interrupt climbs with load (exactly when it hurts the most)
since that is when the probability that the vcpu will need all three
exits is the highest.

> 
>> and priortizable/nestable signals.
>>    
> 
> That doesn't belong in a bus.

Everyone is of course entitled to an opinion, but the industry as a
whole would disagree with you.  Signal path routing (1:1, aggregated,
etc) is at the discretion of the bus designer.  Most buses actually do
_not_ support 1:1 with IDT (think USB, SCSI, IDE, etc).

PCI is somewhat of an outlier in that regard afaict.  Its actually a
nice feature of PCI when its used within its design spec (HW).  For
SW/PV, 1:1 suffers from, among other issues, that "triple-exit scaling"
issue in the signal path I mentioned above.  This is one of the many
reasons I think PCI is not the best choice for PV.

> 
>> Today there is a large performance disparity between what a KVM guest
>> sees and what a native linux application sees on that same host.  Just
>> take a look at some of my graphs between "virtio", and "native", for
>> example:
>>
>> http://developer.novell.com/wiki/images/b/b7/31-rc4_throughput.png
>>    
> 
> That's a red herring.  The problem is not with virtio as an ABI, but
> with its implementation in userspace.  vhost-net should offer equivalent
> performance to vbus.

That's pure speculation.  I would advise you to reserve such statements
until after a proper bakeoff can be completed.  This is not to mention
that vhost-net does nothing to address our other goals, like scheduler
coordination and non-802.x fabrics.

> 
>> A dominant vbus design principle is to try to achieve the same IO
>> performance for all "linux applications" whether they be literally
>> userspace applications, or things like KVM vcpus or Ira's physical
>> boards.  It also aims to solve problems not previously expressible with
>> current technologies (even virtio), like nested real-time.
>>
>> And even though you repeatedly insist otherwise, the neat thing here is
>> that the two technologies mesh (at least under certain circumstances,
>> like when virtio is deployed on a shared-memory friendly linux backend
>> like KVM).  I hope that my stack diagram below depicts that clearly.
>>    
> 
> Right, when you ignore the points where they don't fit, it's a perfect
> mesh.

Where doesn't it fit?

> 
>>> But that's not a strong argument for vbus; instead of adding vbus you
>>> could make virtio more friendly to non-virt
>>>      
>> Actually, it _is_ a strong argument then because adding vbus is what
>> helps makes virtio friendly to non-virt, at least for when performance
>> matters.
>>    
> 
> As vhost-net shows, you can do that without vbus

Citation please.  Afaict, the one use case that we looked at for vhost
outside of KVM failed to adapt properly, so I do not see how this is true.

> and without breaking compatibility.

Compatibility with what?  vhost hasn't even been officially deployed in
KVM environments afaict, nevermind non-virt.  Therefore, how could it
possibly have compatibility constraints with something non-virt already?
 Citation please.

> 
> 
> 
>>> Right.  virtio assumes that it's in a virt scenario and that the guest
>>> architecture already has enumeration and hotplug mechanisms which it
>>> would prefer to use.  That happens to be the case for kvm/x86.
>>>      
>> No, virtio doesn't assume that.  It's stack provides the "virtio-bus"
>> abstraction and what it does assume is that it will be wired up to
>> something underneath. Kvm/x86 conveniently has pci, so the virtio-pci
>> adapter was created to reuse much of that facility.  For other things
>> like lguest and s360, something new had to be created underneath to make
>> up for the lack of pci-like support.
>>    
> 
> Right, I was wrong there.  But it does allow you to have a 1:1 mapping
> between native devices and virtio devices.

vbus allows you to have 1:1 if that is what you want, but we strive to
do better.

> 
> 
>>>> So to answer your question, the difference is that the part that has to
>>>> be customized in vbus should be a fraction of what needs to be
>>>> customized with vhost because it defines more of the stack.
>>>>        
>>> But if you want to use the native mechanisms, vbus doesn't have any
>>> added value.
>>>      
>> First of all, thats incorrect.  If you want to use the "native"
>> mechanisms (via the way the vbus-connector is implemented, for instance)
>> you at least still have the benefit that the backend design is more
>> broadly re-useable in more environments (like non-virt, for instance),
>> because vbus does a proper job of defining the requisite
>> layers/abstractions compared to vhost.  So it adds value even in that
>> situation.
>>    
> 
> Maybe.  If vhost-net isn't sufficient I'm sure there will be patches sent.

It isn't, and I've already done that.

> 
>> Second of all, with PV there is no such thing as "native".  It's
>> software so it can be whatever we want.  Sure, you could argue that the
>> guest may have built-in support for something like PCI protocol.

[1]

>> However, PCI protocol itself isn't suitable for high-performance PV out
>> of the can.  So you will therefore invariably require new software
>> layers on top anyway, even if part of the support is already included.
>>    
> 
> Of course there is such a thing as native, a pci-ready guest has tons of
> support built into it

I specifically mentioned that already ([1]).

You are also overstating its role, since the basic OS is what implements
the native support for bus-objects, hotswap, etc, _not_ PCI.  PCI just
rides underneath and feeds trivial events up, as do other bus-types
(usb, scsi, vbus, etc).  And once those events are fed, you still need a
PV layer to actually handle the bus interface in a high-performance
manner so its not like you really have a "native" stack in either case.

> that doesn't need to be retrofitted.

No, that is incorrect.  You have to heavily modify the pci model with
layers on top to get any kind of performance out of it.  Otherwise, we
would just use realtek emulation, which is technically the native PCI
you are apparently so enamored with.

Not to mention there are things you just plain can't do in PCI today,
like dynamically assign signal-paths, priority, and coalescing, etc.

> Since
> practically everyone (including Xen) does their paravirt drivers atop
> pci, the claim that pci isn't suitable for high performance is incorrect.

Actually IIUC, I think Xen bridges to their own bus as well (and only
where they have to), just like vbus.  They don't use PCI natively.  PCI
is perfectly suited as a bridge transport for PV, as I think the Xen and
vbus examples have demonstrated.  Its the 1:1 device-model where PCI has
the most problems.

> 
> 
>> And lastly, why would you _need_ to use the so called "native"
>> mechanism?  The short answer is, "you don't".  Any given system (guest
>> or bare-metal) already have a wide-range of buses (try running "tree
>> /sys/bus" in Linux).  More importantly, the concept of adding new buses
>> is widely supported in both the Windows and Linux driver model (and
>> probably any other guest-type that matters).  Therefore, despite claims
>> to the contrary, its not hard or even unusual to add a new bus to the
>> mix.
>>    
> 
> The short answer is "compatibility".

There was a point in time where the same could be said for virtio-pci
based drivers vs realtek and e1000, so that argument is demonstrably
silly.  No one tried to make virtio work in a binary compatible way with
realtek emulation, yet we all survived the requirement for loading a
virtio driver to my knowledge.

The bottom line is: Binary device compatibility is not required in any
other system (as long as you follow sensible versioning/id rules), so
why is KVM considered special?

The fact is, it isn't special (at least not in this regard).  What _is_
required is "support" and we fully intend to support these proposed
components.  I assure you that at least the users that care about
maximum performance will not generally mind loading a driver.  Most of
them would have to anyway if they want to get beyond realtek emulation.

> 
> 
>> In summary, vbus is simply one more bus of many, purpose built to
>> support high-end IO in a virt-like model, giving controlled access to
>> the linux-host underneath it.  You can write a high-performance layer
>> below the OS bus-model (vbus), or above it (virtio-pci) but either way
>> you are modifying the stack to add these capabilities, so we might as
>> well try to get this right.
>>
>> With all due respect, you are making a big deal out of a minor issue.
>>    
> 
> It's not minor to me.

I am certainly in no position to tell you how to feel, but this
declaration would seem from my perspective to be more of a means to an
end than a legitimate concern.  Otherwise we would never have had virtio
support in the first place, since it was not "compatible" with previous
releases.

> 
>>>> And, as
>>>> eluded to in my diagram, both virtio-net and vhost (with some
>>>> modifications to fit into the vbus framework) are potentially
>>>> complementary, not competitors.
>>>>
>>>>        
>>> Only theoretically.  The existing installed base would have to be thrown
>>> away
>>>      
>> "Thrown away" is pure hyperbole.  The installed base, worse case, needs
>> to load a new driver for a missing device.
> 
> Yes, we all know how fun this is.

Making systems perform 5x faster _is_ fun, yes.  I love what I do for a
living.

>  Especially if the device changed is your boot disk.

If and when that becomes a priority concern, that would be a function
transparently supported in the BIOS shipped with the hypervisor, and
would thus be invisible to the user.

>  You may not care about the pain caused to users, but I do, so I will
> continue to insist on compatibility.

No, you are incorrect on two counts.

1) Of course I care about pain to users or I wouldn't be funded.  Right
now the pain from my perspective is caused to users in the
high-performance community who want to deploy KVM based solutions.  They
are unable to do so due to its performance disparity compared to
bare-metal, outside of pass-through hardware which is not widely
available in a lot of existing deployments.  I aim to fix that disparity
while reusing the existing hardware investment by writing smarter
software, and I assure you that these users won't mind loading a driver
in the guest to take advantage of it.

For the users that don't care about maximum performance, there is no
change (and thus zero pain) required.  They can use realtek or virtio if
they really want to.  Neither is going away to my knowledge, and lets
face it: 2.6Gb/s out of virtio to userspace isn't *that* bad.  But "good
enough" isn't good enough, and I won't rest till we get to native
performance.  Additionally, I want to support previously unavailable
modes of operations (e.g. real-time) and advanced fabrics (e.g. IB).

2) True pain to users is not caused by lack of binary compatibility.
Its caused by lack of support.  And its a good thing or we would all be
emulating 8086 architecture forever...

..oh wait, I guess we kind of do that already ;).  But at least we can
slip in something more advanced once in a while (APIC vs PIC, USB vs
uart, iso9660 vs floppy, for instance) and update the guest stack
instead of insisting it must look like ISA forever for compatibility's sake.

> 
>>> or we'd need to support both.
>>>
>>>
>>>      
>> No matter what model we talk about, there's always going to be a "both"
>> since the userspace virtio models are probably not going to go away (nor
>> should they).
>>    
> 
> virtio allows you to have userspace-only, kernel-only, or
> start-with-userspace-and-move-to-kernel-later, all transparent to the
> guest.  In many cases we'll stick with userspace-only.

The user will not care where the model lives, per se.  Only that it is
supported, and it works well.

Likewise, I know from experience that the developer will not like
writing the same code twice, so the "runs in both" model is not
necessarily a great design trait either.

> 
>>> All this is after kvm has decoded that vbus is addresses.  It can't work
>>> without someone outside vbus deciding that.
>>>      
>> How the connector message is delivered is really not relevant.  Some
>> architectures will simply deliver the message point-to-point (like the
>> original hypercall design for KVM, or something like Ira's rig), and
>> some will need additional demuxing (like pci-bridge/pio based KVM).
>> It's an implementation detail of the connector.
>>
>> However, the real point here is that something needs to establish a
>> scoped namespace mechanism, add items to that namespace, and advertise
>> the presence of the items to the guest.  vbus has this facility built in
>> to its stack.  vhost doesn't, so it must come from elsewhere.
>>    
> 
> So we have: vbus needs a connector, vhost needs a connector.  vbus
> doesn't need userspace to program the addresses (but does need userspace
> to instantiate the devices and to program the bus address decode)

First of all, bus-decode is substantially easier than per-device decode
(you have to track all those per-device/per-signal fds somewhere,
integrate with hotswap, etc), and its only done once per guest at
startup and left alone.  So its already not apples to apples.

Second, while its true that the general kvm-connector bus-decode needs
to be programmed,  that is a function of adapting to the environment
that _you_ created for me.  The original kvm-connector was discovered
via cpuid and hypercalls, and didn't need userspace at all to set it up.
 Therefore it would be entirely unfair of you to turn around and somehow
try to use that trait of the design against me since you yourself
imposed it.

As an additional data point, our other connectors have no such
bus-decode programming requirement.  Therefore, this is clearly
just a property of the KVM environment, not a function of the overall
vbus design.

> vhost needs userspace to instantiate the devices and program the addresses.
> 

Right.  And among other shortcomings it also requires a KVM-esque memory
model (which is not always going to work as we recently discussed), and
a redundant device-model to back it up in userspace, which is a
development and maintenance burden, and an external bus-model (filled by
pio-bus in KVM today).

>>>> In fact, it's actually a simpler design to unify things this way
>>>> because
>>>> you avoid splitting the device model up. Consider how painful the vhost
>>>> implementation would be if it didn't already have the userspace
>>>> virtio-net to fall-back on.  This is effectively what we face for new
>>>> devices going forward if that model is to persist.
>>>>
>>>>        
>>>
>>> It doesn't have just virtio-net, it has userspace-based hostplug
>>>      
>> vbus has hotplug too: mkdir and rmdir
>>    
> 
> Does that work from nonprivileged processes?

It will with the ioctl based control interface that I'll merge shortly.

>  Does it work on Windows?

This question doesn't make sense.  Hotswap control occurs on the host,
which is always Linux.

If you were asking about whether a windows guest will support hotswap:
the answer is "yes".  Our windows driver presents a unique PDO/FDO pair
for each logical device instance that is pushed out (just like the built
in usb, pci, scsi bus drivers that windows supports natively).

> 
>> As an added bonus, its device-model is modular.  A developer can write a
>> new device model, compile it, insmod it to the host kernel, hotplug it
>> to the running guest with mkdir/ln, and the come back out again
>> (hotunplug with rmdir, rmmod, etc).  They may do this all without taking
>> the guest down, and while eating QEMU based IO solutions for breakfast
>> performance wise.
>>
>> Afaict, qemu can't do either of those things.
>>    
> 
> We've seen that herring before,

Citation?

> and it's redder than ever.

This is more hyperbole.  I doubt that there would be many that would
argue that a modular architecture (that we get for free with LKM
support) is not desirable, even if its never used dynamically with a
running guest.  OTOH, I actually use this dynamic feature all the time
as I test my components, so its at least useful to me.

> 
> 
> 
>>> Refactor instead of duplicating.
>>>      
>> There is no duplicating.  vbus has no equivalent today as virtio doesn't
>> define these layers.
>>    
> 
> So define them if they're missing.

I just did.

> 
> 
>>>>
>>>>       
>>>>>    Use libraries (virtio-shmem.ko, libvhost.so).
>>>>>
>>>>>          
>>>> What do you suppose vbus is?  vbus-proxy.ko = virtio-shmem.ko, and you
>>>> dont need libvhost.so per se since you can just use standard kernel
>>>> interfaces (like configfs/sysfs).  I could create an .so going forward
>>>> for the new ioctl-based interface, I suppose.
>>>>
>>>>        
>>> Refactor instead of rewriting.
>>>      
>> There is no rewriting.  vbus has no equivalent today as virtio doesn't
>> define these layers.
>>
>> By your own admission, you said if you wanted that capability, use a
>> library.  What I think you are not understanding is vbus _is_ that
>> library.  So what is the problem, exactly?
>>    
> 
> It's not compatible.

No, that is incorrect.  What you are apparently not understanding is
that not only is vbus that library, but its extensible.  So even if
compatibility is your goal (it doesn't need to be IMO) it can be
accommodated by how you interface to the library.

>  If you were truly worried about code duplication
> in virtio, you'd refactor it to remove the duplication,

My primary objective is creating an extensible, high-performance,
shared-memory interconnect for systems that utilize a Linux host as
their IO-hub.  It just so happens that virtio can sit nicely on top of
such a model because shmem-rings are a subclass of shmem.  As a result
of its design, vbus also helps to reduce code duplication in the stack
for new environments due to its extensible nature.

However, vbus also has goals beyond what virtio is providing today that
are of more concern, and part of that is designing a connector/bus that
eliminates the shortcomings in the current pci-based design.

> without affecting existing guests.

Already covered above.

> 
>>>>> For kvm/x86 pci definitely remains king.
>>>>>
>>>>>          
>>>> For full virtualization, sure.  I agree.  However, we are talking about
>>>> PV here.  For PV, PCI is not a requirement and is a technical dead-end
>>>> IMO.
>>>>
>>>> KVM seems to be the only virt solution that thinks otherwise (*), but I
>>>> believe that is primarily a condition of its maturity.  I aim to help
>>>> advance things here.
>>>>
>>>> (*) citation: xen has xenbus, lguest has lguest-bus, vmware has some
>>>> vmi-esq thing (I forget what its called) to name a few.  Love 'em or
>>>> hate 'em, most other hypervisors do something along these lines.  I'd
>>>> like to try to create one for KVM, but to unify them all (at least for
>>>> the Linux-based host designs).
>>>>
>>>>        
>>> VMware are throwing VMI away (won't be supported in their new product,
>>> and they've sent a patch to rip it off from Linux);
>>>      
>> vmware only cares about x86 iiuc, so probably not a good example.
>>    
> 
> Well, you brought it up.  Between you and me, I only care about x86 too.

Fair enough.

> 
>>> Xen has to tunnel
>>> xenbus in pci for full virtualization (which is where Windows is, and
>>> where Linux will be too once people realize it's faster).  lguest is
>>> meant as an example hypervisor, not an attempt to take over the world.
>>>      
>> So pick any other hypervisor, and the situation is often similar.
>>    
> 
> The situation is often pci.

Even if that were true, which is debatable, do not confuse "convenient"
with "optimal".  If you don't care about maximum performance and
advanced features like QOS, sure go ahead and use PCI.  Why not.

> 
>>
>>> An right now you can have a guest using pci to access a mix of
>>> userspace-emulated devices, userspace-emulated-but-kernel-accelerated
>>> virtio devices, and real host devices.  All on one dead-end bus.  Try
>>> that with vbus.
>>>      
>> vbus is not interested in userspace devices.  The charter is to provide
>> facilities for utilizing the host linux kernel's IO capabilities in the
>> most efficient, yet safe, manner possible.  Those devices that fit
>> outside that charter can ride on legacy mechanisms if that suits them
>> best.
>>    
> 
> vbus isn't, but I am.  I would prefer not to have to expose
> implementation decisions (kernel vs userspace) to the guest (vbus vs pci).
> 
>>>> That won't cut it.  For one, creating an eventfd is only part of the
>>>> equation.  I.e. you need to have originate/terminate somewhere
>>>> interesting (and in-kernel, otherwise use tuntap).
>>>>
>>>>        
>>> vbus needs the same thing so it cancels out.
>>>      
>> No, it does not.  vbus just needs a relatively simple single message
>> pipe between the guest and host (think "hypercall tunnel", if you will).
>>    
> 
> That's ioeventfd.  So far so similar.

No, that is incorrect.  For one, vhost uses them on a per-signal path
basis, whereas vbus only has one channel for the entire guest->host.

Second, I do not use ioeventfd anymore because it has too many problems
with the surrounding technology.  However, that is a topic for a
different thread.


> 
>>   Per queue/device addressing is handled by the same conceptual namespace
>> as the one that would trigger eventfds in the model you mention.  And
>> that namespace is built in to the vbus stack, and objects are registered
>> automatically as they are created.
>>
>> Contrast that to vhost, which requires some other kernel interface to
>> exist, and to be managed manually for each object that is created.  Your
>> libvhostconfig would need to somehow know how to perform this
>> registration operation, and there would have to be something in the
>> kernel to receive it, presumably on a per platform basis.  Solving this
>> problem generally would probably end up looking eerily like vbus,
>> because thats what vbus does.
>>    
> 
> vbus devices aren't magically instantiated.  Userspace needs to
> instantiate them too.  Sure, there's less work on the host side since
> you're using vbus instead of the native interface, but more work on the
> guest side since you're using vbus instead of the native interface.


No, that is incorrect.  The amount of "work" that a guest does is
actually the same in both cases, since the guest OS peforms the hotswap
handling natively for all bus types (at least for Linux and Windows).
You still need to have a PV layer to interface with those objects in
both cases, as well, so there is no such thing as "native interface" for
PV.  Its only a matter of where it occurs in the stack.

> 
> 
> 
>>> Well, let's see.  Can vbus today:
>>>
>>> - let userspace know which features are available (so it can decide if
>>> live migration is possible)
>>>      
>> yes, its in sysfs.
>>
>>   
>>> - let userspace limit which features are exposed to the guest (so it can
>>> make live migration possible among hosts of different capabilities)
>>>      
>> yes, its in sysfs.
>>    
> 
> Per-device?

Yes, see /sys/vbus/devices/$dev/ to get per-instance attributes

>  non-privileged-user capable?

The short answer is "not yet (I think)".  I need to write a patch to
properly set the mode attribute in sysfs, but I think this will be trivial.

> 
>>> - let userspace know which features were negotiated (so it can transfer
>>> them to the other host during live migration)
>>>      
>> no, but we can easily add ->save()/->restore() to the model going
>> forward, and the negotiated features are just a subcomponent if its
>> serialized stream.
>>
>>   
>>> - let userspace tell the kernel which features were negotiated (when
>>> live migration completes, to avoid requiring the guest to re-negotiate)
>>>      
>> that would be the function of the ->restore() deserializer.
>>
>>   
>>> - do all that from an unprivileged process
>>>      
>> yes, in the upcoming alacrityvm v0.3 with the ioctl based control plane.
>>    
> 
> Ah, so you have two control planes.

So what?  If anything, it goes to show how extensible the framework is
that a new plane could be added in 119 lines of code:

~/git/linux-2.6> stg show vbus-add-admin-ioctls.patch | diffstat
 Makefile       |    3 -
 config-ioctl.c |  117
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 119 insertions(+), 1 deletion(-)

if and when having two control planes exceeds its utility, I will submit
a simple patch that removes the useless one.

> 
>> Bottom line: vbus isn't done, especially w.r.t. live-migration..but that
>> is not an valid argument against the idea if you believe in
>> release-early/release-often. kvm wasn't (isn't) done either when it was
>> proposed/merged.
>>
>>    
> 
> kvm didn't have an existing counterpart in Linux when it was
> proposed/merged.
> 

And likewise, neither does vbus.

Kind Regards,
-Greg










[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 267 bytes --]

^ permalink raw reply

* Re: N_PPP_SYNC ldisc BUG: sleeping function called from invalid context
From: Jarek Poplawski @ 2009-09-30 20:28 UTC (permalink / raw)
  To: Tilman Schmidt; +Cc: Alan Cox, linux-kernel, netdev, Alan Cox
In-Reply-To: <4AC3A986.4080808@imap.cc>

Tilman Schmidt wrote, On 09/30/2009 08:55 PM:

> Alan Cox schrieb:
>>>  [<c026d39b>] tty_unthrottle+0x10/0x38
>>>  [<f8dcc31f>] ppp_sync_receive+0x168/0x170 [ppp_synctty]
>>>  [<f8fbb9ce>] handle_minor_recv+0x187/0x1cd [capi]
>>>  [<f8fbc19b>] capi_recv_message+0x1d9/0x24e [capi]
>> Really need to see the rest of the call trace to be sure
> 
> There wasn't more than what I posted. I had six of them, they looked all
> identical, and all of them ended after the kernel_thread_helper line. 
...
> Why then does mutex_lock() complain?


Maybe it doesn't matter here, but this: 

> INFO: lockdep is turned off.
suggests there was some lockdep issue/warning earlier.


Jarek P.

^ permalink raw reply

* [PATCH] Regression: e100_phy_init() isolates even selected PHY, causes 10 seconds boot delay
From: Bernhard Kaindl @ 2009-09-30 20:33 UTC (permalink / raw)
  To: David S. Miller, Bruce Allan; +Cc: Jeff Kirsher, netdev

[-- Attachment #1: Type: text/plain, Size: 6815 bytes --]

Dear David, Dear Bruce,

The current e100.c:e100_phy_init() electrically isolates all
the PHYs (even the selected PHY -- for a short time!) from the MII.

This happens only for a short duration before the isolation
of the selected PHY is reverted, but it's enough to cause a
major disturbance in the startup of our e100-based cards:

On a number of Embedded/Industry Pentium boards which are in use,
the result is that the initial DHCP negotiation takes more
than 10 seconds to complete with 2.6.30 and .31, while it's
done in a fraction of a second with 2.6.29 and earlier
(kernels tested with no delay range from 2.6.23 to 2.6.29)

That regression was introduced on March 31 in the by a patch
from Bruce which first appeared in 2.6.30-rc3:

http://marc.info/?l=linux-netdev&m=123766715429780&w=2
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=b55de80e49892002a1878013ab9aee1a30970be6

> From: Bruce Allan <bruce.w.allan@intel.com>
>
> This patch enables support for the new Intel 82552 adapter (new PHY paired
> with the existing MAC in the ICH7 chipset).  No new features are added to
> the driver, however there are minor changes due to updated registers and a
> few workarounds for hardware errata.

In the middle, the patch has two changes for e100_phy_init()

The first one looks like a code optimization and does not appear to matching
the criteria described by Bruce in his submission, so assume it made it into
the patch submission by accident:

> @@ -1276,16 +1294,12 @@ static int e100_phy_init(struct nic *nic)
> 	if (addr == 32)
> 		return -EAGAIN;
>
> -	/* Selected the phy and isolate the rest */
> -	for (addr = 0; addr < 32; addr++) {
> -		if (addr != nic->mii.phy_id) {
> -			mdio_write(netdev, addr, MII_BMCR, BMCR_ISOLATE);
> -		} else {
> -			bmcr = mdio_read(netdev, addr, MII_BMCR);
> -			mdio_write(netdev, addr, MII_BMCR,
> -				bmcr & ~BMCR_ISOLATE);
> -		}
> -	}
> +	/* Isolate all the PHY ids */
> +	for (addr = 0; addr < 32; addr++)
> +		mdio_write(netdev, addr, MII_BMCR, BMCR_ISOLATE);
> +	/* Select the discovered PHY */
> +	bmcr &= ~BMCR_ISOLATE;
> +	mdio_write(netdev, nic->mii.phy_id, MII_BMCR, bmcr);
>
> 	/* Get phy ID */
> 	id_lo = mdio_read(netdev, nic->mii.phy_id, MII_PHYSID1);

If this was meant as a workaround for a hardware errata, it should
have been mentioned to ensure that no one undoes the change without
knowing.

Anyway:

What this does is that it removes the 2.6.23-2.6.29 PHY isolation loop
which ensured that *ONLY* PHY addresses which *do not* match the selected
PHY address are electrically isolated:

> -	/* Selected the phy and isolate the rest */
> -	for (addr = 0; addr < 32; addr++) {
> -             if (addr != nic->mii.phy_id) {
> -                     mdio_write(netdev, addr, MII_BMCR, BMCR_ISOLATE);

The 2.6.29 loop cleared the isolate bit of the discovered PHY in the
else clause of this if statement:

> -		} else {
> -			bmcr = mdio_read(netdev, addr, MII_BMCR);
> -			mdio_write(netdev, addr, MII_BMCR,
> -				bmcr & ~BMCR_ISOLATE);
> -		}

This loop is then replaced with electrical isolation of *_ALL_* PHYs:

> +	/* Isolate all the PHY ids */
> +	for (addr = 0; addr < 32; addr++)
> +		mdio_write(netdev, addr, MII_BMCR, BMCR_ISOLATE);

Which is reverted for the discovered PHY afterwards:

> +	/* Select the discovered PHY */
> +	bmcr &= ~BMCR_ISOLATE;
> +	mdio_write(netdev, nic->mii.phy_id, MII_BMCR, bmcr);

This change resulted in a delay of the "Link Up" message from the
e100 watchdog routine and in a number of DHCP packages getting lost
for a duration of about five seconds.

I suppose that this may have powered-down our PHYs for a short moment
or at least disturbed the connection which it has with the MII and/or
the outside world.

In any case:

Reverting solely this change alone fixed the 10 second boot delay!

For more information, I attached two driver debug logs of the commands
between firmware load and "Link Up" with and without isolation of the
selected PHY.

----------------------------------------------------------------------

But indeed, there is potential for a possibly valid optimization:

The old e100_phy_init() was reading the "bmcr" variable twice.
First, when probing the PHYs are probed:

/* Discover phy addr by searching addrs in order {1,0,2,..., 31} */
for (addr = 0; addr < 32; addr++) {
	nic->mii.phy_id = (addr == 0) ? 1 : (addr == 1) ? 0 : addr;
	bmcr = mdio_read(netdev, nic->mii.phy_id, MII_BMCR);
	stat = mdio_read(netdev, nic->mii.phy_id, MII_BMSR);
	stat = mdio_read(netdev, nic->mii.phy_id, MII_BMSR);
	if (!((bmcr == 0xFFFF) || ((stat == 0) && (bmcr == 0))))
		break;
-> When the selected PHY is found, the loop aborts here, so bmcr
     has the bmcr of the selected PHY
}

and second, when the old PHY setup loop was clearing the isolate bit:

> -		} else {
> -			bmcr = mdio_read(netdev, addr, MII_BMCR);
> -			mdio_write(netdev, addr, MII_BMCR,
> -				bmcr & ~BMCR_ISOLATE);
> -		}

Reading the MII_BMCR value twice was therefore not strictly necessary,
so I think this was the optimization which Bruce had in mind, and his
other motivation may have been to simplify the PHY setup loop.

So, what should be done is to only isolate the other PHYs and just clear
the isolate bit of the selected PHY, in the simplest possible way.

The patch which we are now using with 2.6.31 to fix this regression
does the following:

* Remove the loop which isolates *ALL* PHYs and then clears of
   the isolate bit of the selected PHY.

* Keep the new code which just clears the isolate bit of the BMCR
   of the discovered PHY

* Afterwards, isolate the unused PHYs.

This essentially resembles what the 2.6.23-2.6.29 code did, only
clearing the isolate bit of the discovered PHY is explicitly done
first, while before, it was in the middle of the PHY isolation loop.

A plain revert is probably safest in case of real paranoia here.
Electrically, this should be the same what we had until 2.6.30-rc3:

Tested by me and my colleagues:

--- linux-2.6.31/drivers/net/e100.c
+++ linux-2.6.31/drivers/net/e100.c
@@ -1294,13 +1294,15 @@ static int e100_phy_init(struct nic *nic
   	if (addr == 32)
   		return -EAGAIN;

-	/* Isolate all the PHY ids */
-	for (addr = 0; addr < 32; addr++)
-		mdio_write(netdev, addr, MII_BMCR, BMCR_ISOLATE);
   	/* Select the discovered PHY */
   	bmcr &= ~BMCR_ISOLATE;
   	mdio_write(netdev, nic->mii.phy_id, MII_BMCR, bmcr);

+	/* Electrically isolate only the unused PHYs */
+	for (addr = 0; addr < 32; addr++)
+		if (addr != nic->mii.phy_id)
+			mdio_write(netdev, addr, MII_BMCR, BMCR_ISOLATE);
+
   	/* Get phy ID */
   	id_lo = mdio_read(netdev, nic->mii.phy_id, MII_PHYSID1);
   	id_hi = mdio_read(netdev, nic->mii.phy_id, MII_PHYSID2);

Signed-off-by: Bernhard Kaindl <bernhard.kaindl@gmx.net>

Best Regards,
Bernhard Kaindl

[-- Attachment #2: e100_phy_init_without_isolation.txt --]
[-- Type: text/plain, Size: 1135 bytes --]

Without the optimzation which isolates the PHY, the link goes up within a 1/10th of
a second after starting to load the firmware:

 7.588075 e100 0000:00:09.0 firmware: using built-in firmware e100/d101s_ucode.bin
 7.604133 e100_configure: [00-07]=16:08:00:01:00:00:26:07
 7.608117 e100_configure: [08-15]=01:00:2E:00:60:00:F2:48
 7.608117 e100_configure: [16-23]=00:40:FA:86:3F:05:00:00
 7.610785 e100_set_multicast_list: mc_count=0, flags=0x1002
 7.615390 e100_configure: [00-07]=16:08:00:01:00:00:26:07
 7.619380 e100_configure: [08-15]=01:00:2E:00:60:00:F2:48
 7.619380 e100_configure: [16-23]=00:40:FA:86:3F:05:00:00
 7.619380 e100_intr: stat_ack = 0x20
 7.622719 e100_watchdog: right now = -74502
 7.627045 mdio_ctrl: READ:addr=1, reg=4, data_in=0x0000, data_out=0x182405E1
 7.630994 mdio_ctrl: READ:addr=1, reg=0, data_in=0x0000, data_out=0x18203000
 7.634011 mdio_ctrl: READ:addr=1, reg=5, data_in=0x0000, data_out=0x182541E1
 7.639498 mdio_ctrl: READ:addr=1, reg=1, data_in=0x0000, data_out=0x1821782D
 7.643447 mdio_ctrl: READ:addr=1, reg=1, data_in=0x0000, data_out=0x1821782D
 7.646419 NIC Link is Up 100 Mbps Full Duplex


[-- Attachment #3: e100_phy_init_with_isolation.txt --]
[-- Type: text/plain, Size: 3422 bytes --]

With the isolation of the PHY, 'link up' takes more than 2 seconds:

 8.241194 e100 0000:00:09.0 firmware: using built-in firmware e100/d101s_ucode.bin
 8.259478 e100_configure: [00-07]=16:08:00:01:00:00:26:07
 8.263462 e100_configure: [08-15]=01:00:2E:00:60:00:F2:48
 8.263462 e100_configure: [16-23]=00:40:FA:86:3F:05:00:00
 8.266134 e100_set_multicast_list: mc_count=0, flags=0x1002
 8.270737 e100_configure: [00-07]=16:08:00:01:00:00:26:07
 8.274727 e100_configure: [08-15]=01:00:2E:00:60:00:F2:48
 8.274727 e100_configure: [16-23]=00:40:FA:86:3F:05:00:00
 8.274727 e100_intr: stat_ack = 0x20
 8.278066 e100_watchdog: right now = -74619
 8.282389 mdio_ctrl: READ:addr=1, reg=4, data_in=0x0000, data_out=0x182405E1
 8.286339 mdio_ctrl: READ:addr=1, reg=0, data_in=0x0000, data_out=0x18203000
 8.289356 mdio_ctrl: READ:addr=1, reg=5, data_in=0x0000, data_out=0x182501E1
 8.294842 mdio_ctrl: READ:addr=1, reg=1, data_in=0x0000, data_out=0x18217809
 8.298791 mdio_ctrl: READ:addr=1, reg=1, data_in=0x0000, data_out=0x18217809
 8.301810 mdio_ctrl: READ:addr=1, reg=1, data_in=0x0000, data_out=0x18217809
 8.307293 mdio_ctrl: READ:addr=1, reg=1, data_in=0x0000, data_out=0x18217809
 8.311244 mdio_ctrl: READ:addr=1, reg=1, data_in=0x0000, data_out=0x18217809
 8.314260 mdio_ctrl: READ:addr=1, reg=1, data_in=0x0000, data_out=0x18217809
 8.318211 e100_rx_indicate: status=0x0000
 8.322954 e100_tx_clean: cb[0]->status = 0xA000
 8.326946 e100_tx_clean: cb[1]->status = 0xA000
 8.329309 e100_tx_clean: cb[2]->status = 0xA000
 8.334034 e100_tx_clean: cb[3]->status = 0xA000
 8.338028 e100_tx_clean: cb[4]->status = 0xA000
...
 8.355248 e100_configure: [00-07]=16:08:00:01:00:00:26:07
 8.359238 e100_configure: [08-15]=01:00:2E:00:60:00:F2:48
 8.359238 e100_configure: [16-23]=00:40:FA:86:3F:05:00:00
 8.359238 e100_intr: stat_ack = 0x20
 8.362559 e100_rx_indicate: status=0x0000
 8.366550 e100_tx_clean: cb[5]->status = 0xA000
 8.369792 e100_tx_clean: cb[6]->status = 0xA000
 8.374548 e100_set_multicast_list: mc_count=1, flags=0x1003
 8.379152 e100_configure: [00-07]=16:08:00:01:00:00:26:07
 8.383143 e100_configure: [08-15]=01:00:2E:00:60:00:F2:48
 8.383143 e100_configure: [16-23]=00:40:FA:86:3F:05:00:00
 8.383143 e100_intr: stat_ack = 0x20
 8.386435 e100_rx_indicate: status=0x0000
 8.390425 e100_tx_clean: cb[7]->status = 0xA000
 8.393667 e100_tx_clean: cb[8]->status = 0xA000
 8.398399 e100_intr: stat_ack = 0x20
 8.403041 e100_rx_indicate: status=0x0000
 8.407033 e100_set_multicast_list: mc_count=1, flags=0x1003
 8.410971 e100_configure: [00-07]=16:08:00:01:00:00:26:07
 8.414961 e100_configure: [08-15]=01:00:2E:00:60:00:F2:48
 8.414961 e100_configure: [16-23]=00:40:FA:86:3F:05:00:00
 8.414961 e100_intr: stat_ack = 0x20
 8.418260 e100_rx_indicate: status=0x0000
 8.422249 e100_tx_clean: cb[9]->status = 0xA000
 8.425492 e100_tx_clean: cb[10]->status = 0xA000
 8.430746 e100_intr: stat_ack = 0x20
 8.435388 e100_rx_indicate: status=0x0000
10.567601 e100_watchdog: right now = -74046
10.571931 mdio_ctrl: READ:addr=1, reg=4, data_in=0x0000, data_out=0x182405E1
10.577416 mdio_ctrl: READ:addr=1, reg=0, data_in=0x0000, data_out=0x18203000
10.582911 mdio_ctrl: READ:addr=1, reg=5, data_in=0x0000, data_out=0x182541E1
10.586862 mdio_ctrl: READ:addr=1, reg=1, data_in=0x0000, data_out=0x18217829
10.589877 mdio_ctrl: READ:addr=1, reg=1, data_in=0x0000, data_out=0x1821782D
10.595317 NIC Link is Up 100 Mbps Full Duplex


^ permalink raw reply

* kernel doc / docbook pdfdocs question
From: Doug Maxey @ 2009-09-30 19:59 UTC (permalink / raw)
  To: Randy Dunlap; +Cc: netdev


Randy,

This may be slightly off topic for this list, but it does involve an
(as yet un-released) network driver. :)

Do you have any insight that could guide me toward a fix for an issue
seen with some header file constructs when trying to generate a pdf
docbook?

In my .tmpl file I do the "process my header file" construct:
...
!Ipath/to/myheader.h
...

In myheader.h an example decl that is giving me fits looks like:

struct foo {
	int bar;
	DECLARE_BITMAP(baz, LENGTH);
	int fotz;
};

It puts the bar and fotz decls in the output, but only outputs a
warning to stderr on the line with DECLARE_BITMAP, nothing about baz
(or DECLARE_BITMAP) is output into the .xml file.

My question is, would it be easier to pre-process the header, say with
gcc -EC via some rule in Documentation/DocBook/Makefile, or to try and
fix whatever gets called by scripts/basic/docbook?

I tried to find what the exact calling sequence was, but am getting
lost in what actually does the xml generation, and therefore how to
fix what is choking on the DECLARE_BITMAP.

Any tips will be very appreciated!

++doug



^ permalink raw reply

* Re: N_PPP_SYNC ldisc BUG: sleeping function called from invalid context
From: Jarek Poplawski @ 2009-09-30 21:12 UTC (permalink / raw)
  To: Tilman Schmidt; +Cc: Alan Cox, linux-kernel, netdev, Alan Cox
In-Reply-To: <4AC3A986.4080808@imap.cc>

Tilman Schmidt wrote, On 09/30/2009 08:55 PM:

> Alan Cox schrieb:
>>>  [<c026d39b>] tty_unthrottle+0x10/0x38
>>>  [<f8dcc31f>] ppp_sync_receive+0x168/0x170 [ppp_synctty]
>>>  [<f8fbb9ce>] handle_minor_recv+0x187/0x1cd [capi]
>>>  [<f8fbc19b>] capi_recv_message+0x1d9/0x24e [capi]
>> Really need to see the rest of the call trace to be sure
> 
> There wasn't more than what I posted. I had six of them, they looked all
> identical, and all of them ended after the kernel_thread_helper line. 
> 
>>> Turns out the ppp_sync_receive() function (drivers/net/ppp_synctty.c
>>> line 385ff.) has a comment in front stating:
>>>
>>> /*
>>>  * This can now be called from hard interrupt level as well
>>>  * as soft interrupt level or mainline.
>>>  */
>> Which is wrong. The flip_buffer_push -> rx processing path should never
>> be called from IRQ context and that was fixed for various drivers that
>> mis-set tty->low_latency, as well as in the PPP rework. The PPP case is
>> actually unrelated in many was.
> 
> Might be worth correcting that text then before is misleads someone.
> 
>>> Opinions?
>> See how we got into that code direct from an IRQ path. The expectation of
>> the tty logic is that it gets processed from work queues either
>> specifically in driver or via tty_flip_buffer_push when tty->low_latency
>> = 0
> 
> I'm at a loss here. According to all the backtraces:
> 
> - ppp_sync_receive() was called, as the LD's receive_buf method,
>   via handle_recv_skb() [drivers/isdn/capi/capi.c line 504, inlined]
>   from handle_minor_recv() [drivers/isdn/capi/capi.c line 519]
> 
> - handle_minor_recv() was called from capi_recv_message()
>   [drivers/isdn/capi/capi.c line 656]
> 
> - capi_recv_message() was called, as the CAPI application's
>   recv_message method, from recv_handler()
>   [drivers/isdn/capi/kcapi.c line 268]
> 
> - recv_handler() is never called directly. It's only scheduled
>   via the work queue ap->recv_work from capi_ctr_handle_message()
>   [drivers/isdn/capi/kcapi.c line 349]
> 
> Even if we don't trust the backtraces, there's not much room for
> another activation path. So for all I know, the expectation of the
> tty logic should have been met. The call was indeed processed from
> a work queue.
> 
> Why then does mutex_lock() complain?
 

Hmm... capi_recv_message() calls handle_minor_recv() under
spin_lock_irqsave(), doesn't it?

Jarek P.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox