Netdev List
* Re: [PATCH 0/8] SECURITY ISSUE with connector
From: Greg KH @ 2009-10-02 13:58 UTC (permalink / raw)
  To: Philipp Reisner
  Cc: linux-kernel, netdev, Andrew Morton, David S. Miller, dm-devel,
	Evgeniy Polyakov, linux-fbdev-devel
In-Reply-To: <1254487211-11810-1-git-send-email-philipp.reisner@linbit.com>

On Fri, Oct 02, 2009 at 02:40:03PM +0200, Philipp Reisner wrote:
> Affected: All code that uses connector, in kernel and out of mainline
> 
> The connector, as it is today, does not allow the in kernel receiving
> parts to do any checks on privileges of a message's sender.

So, assuming I know nothing about the connector architecture, what does
this mean in a security context?

> I know, there are not many out there that like connector, but as
> long as it is in the kernel, we have to fix the security issues it has!

And what specifically are the security issues?

> Please either drop connector, or someone who feels a bit responsible
> and has our beloved dictator's blessing, PLEASE PLEASE PLEASE take 
> this into your tree, and send the pull request to Linus.
> 
> Patches 1 to 4 are already Acked-by Evgeny, the connector's maintainer.
> Patches 5 to 7 are the obvious fixes to the connector user's code.

Obvious in what way?

thanks,

greg k-h


* [QUESTION] Packet Reordering detection and response with TCP in Reno-mode
From: Daniel Slot @ 2009-10-02 14:09 UTC (permalink / raw)
  To: netdev

I have some trouble understanding Linux TCP's reordering detection
and response algorithms.
When the SACK option is used, the threshold adaptation is understandable.
But in Reno-mode (without SACKs), reordering detection and response
are, IMHO, not clear.

Reordering detection:
How is it possible to determine the number of holes without SACK?
Simple DUPACKs do not provide enough information for such an estimation.

kernel 2.6.30.4 - net/ipv4/tcp_input.c -line 1934
static int tcp_limit_reno_sacked(struct tcp_sock *tp)
{
        u32 holes;

        holes = max(tp->lost_out, 1U);
        holes = min(holes, tp->packets_out);

        if ((tp->sacked_out + holes) > tp->packets_out) {
                tp->sacked_out = tp->packets_out - holes;
                return 1;
        }
        return 0;
}
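
To make the question concrete, the function above can be modelled in user
space. This is only a simplified sketch, not the kernel code: struct
reno_state and limit_reno_sacked() are hypothetical names that mirror just
the three counters the function touches. It suggests that the number of holes
is not measured at all without SACK; it is assumed to be lost_out, with a
floor of one and a ceiling of packets_out, and the DUPACK count is clamped to
fit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for the three tcp_sock counters used. */
struct reno_state {
	uint32_t packets_out;	/* segments in flight */
	uint32_t sacked_out;	/* in Reno mode: number of DUPACKs counted */
	uint32_t lost_out;	/* segments currently marked lost */
};

static uint32_t max_u32(uint32_t a, uint32_t b) { return a > b ? a : b; }
static uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }

/*
 * Mirrors the logic of tcp_limit_reno_sacked(): assume at least one hole,
 * and clamp sacked_out when DUPACKs plus holes exceed the segments in flight.
 * Returns 1 when a clamp was needed, which the caller treats as reordering.
 */
static int limit_reno_sacked(struct reno_state *st)
{
	uint32_t holes = max_u32(st->lost_out, 1U);
	int clamped = 0;

	holes = min_u32(holes, st->packets_out);

	if (st->sacked_out + holes > st->packets_out) {
		st->sacked_out = st->packets_out - holes;
		clamped = 1;
	}

	/* Post-condition: DUPACKs and holes now fit into the flight. */
	assert(st->sacked_out + holes <= st->packets_out);
	return clamped;
}
```

With 10 segments in flight, nothing marked lost and 10 DUPACKs counted, the
model clamps sacked_out to 9 and reports reordering; with 3 DUPACKs and 2
lost segments it leaves the counters alone.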

Reordering response:
Reordering detection in Reno-mode is only possible in the disorder phase.
When packet reordering has been detected in Reno-mode,
Linux's dupthresh (tp->reordering) is set to the number of packets in
flight (plus an addend passed in by the caller).
The question is: why choose the number of packets in flight as the new dupthresh?
And, more importantly, why adapt the dupthresh when its old value is
still sufficient?
Detecting reordering in the disorder phase means that nothing has been
retransmitted yet.

kernel 2.6.30.4 - net/ipv4/tcp_input.c -line 1952
static void tcp_check_reno_reordering(struct sock *sk, const int addend)
{
        struct tcp_sock *tp = tcp_sk(sk);
        if (tcp_limit_reno_sacked(tp))
                tcp_update_reordering(sk, tp->packets_out + addend, 0);
}


29/09/2009 Daniel Slot (slot.daniel(at)gmail.com)


* [RFC PATCH] net: add dataref destructor to sk_buff
From: Gregory Haskins @ 2009-10-02 14:20 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel, ghaskins

(Applies to davem/net-2.6.git:4fdb78d30)

Hi David, netdevs,

The following is an RFC for an attempt at a zero-copy transmit solution.

To be perfectly honest, I have no idea if this is the best solution, or if
there is truly a problem with skb->destructor that requires an alternate
mechanism.  What I do know is that this patch seems to work, and I would
like to see some kind of solution available upstream.  So I thought I would
send my hack out as at least a point of discussion.  FWIW: This has been
tested heavily in my rig and is technically suitable for inclusion after
review as is, if that is decided to be the optimal path forward here.

Thanks for your review and consideration,

Kind regards,
-Greg

----------------------------------------
From: Gregory Haskins <ghaskins@novell.com>
Subject: [RFC PATCH] net: add dataref destructor to sk_buff

What: The skb->destructor field is reportedly unreliable for ensuring
that all shinfo users have dropped their references.  Therefore, we add
a distinct ->release() method for the shinfo structure which is closely
tied to the underlying page resources we want to protect.

Why: We want to add zero-copy transmit support for AlacrityVM guests.
In order to support this, the host kernel must map guest pages directly
into a paged-skb and send it as normal.  put_page() alone is not
sufficient lifetime management since the pages are ultimately allocated
from within the guest.  Therefore, we need higher-level notification
when the skb is finally freed on the host so we can then inject a proper
"tx-complete" event into the guest context.
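
The lifetime rule described above can be illustrated outside the kernel. The
following is a minimal user-space sketch, not the patch itself: struct
shared_buf, buf_get(), buf_put() and tx_complete() are hypothetical names,
and the plain int counter stands in for the atomic dataref. It shows the
general idea of a release hook attached to the shared, reference-counted part
of a buffer, so that the owner of the underlying memory is notified exactly
once, when the last reference is dropped.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared part of a buffer (cf. skb_shared_info). */
struct shared_buf {
	int refs;				/* outstanding references (not atomic here) */
	void *priv;				/* opaque cookie for the owner */
	void (*release)(struct shared_buf *sb);	/* run when refs drops to zero */
};

/* Take an additional reference to the shared data. */
static void buf_get(struct shared_buf *sb)
{
	sb->refs++;
}

/* Drop one reference; run the release hook only on the final put. */
static void buf_put(struct shared_buf *sb)
{
	assert(sb->refs > 0);
	if (--sb->refs == 0 && sb->release)
		sb->release(sb);
}

/* Example hook: counts "tx-complete" notifications through priv. */
static void tx_complete(struct shared_buf *sb)
{
	int *completions = sb->priv;

	(*completions)++;
}
```

A buffer that starts with one reference and is shared once needs two
buf_put() calls before tx_complete() fires, which is the property the
->release() method is meant to provide for the guest pages.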

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/skbuff.h |    2 ++
 net/core/skbuff.c      |    9 +++++++++
 2 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index df7b23a..02cdab6 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -207,6 +207,8 @@ struct skb_shared_info {
 	/* Intermediate layers must ensure that destructor_arg
 	 * remains valid until skb destructor */
 	void *		destructor_arg;
+	void *          priv;
+	void           (*release)(struct sk_buff *skb);
 };
 
 /* We divide dataref into two halves.  The higher 16 bits hold references
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 80a9616..a7e40a9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -219,6 +219,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 	shinfo->tx_flags.flags = 0;
 	skb_frag_list_init(skb);
 	memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps));
+	shinfo->release = NULL;
+	shinfo->priv = NULL;
 
 	if (fclone) {
 		struct sk_buff *child = skb + 1;
@@ -350,6 +352,9 @@ static void skb_release_data(struct sk_buff *skb)
 		if (skb_has_frags(skb))
 			skb_drop_fraglist(skb);
 
+		if (skb_shinfo(skb)->release)
+			skb_shinfo(skb)->release(skb);
+
 		kfree(skb->head);
 	}
 }
@@ -514,6 +519,8 @@ int skb_recycle_check(struct sk_buff *skb, int skb_size)
 	shinfo->tx_flags.flags = 0;
 	skb_frag_list_init(skb);
 	memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps));
+	shinfo->release = NULL;
+	shinfo->priv = NULL;
 
 	memset(skb, 0, offsetof(struct sk_buff, tail));
 	skb->data = skb->head + NET_SKB_PAD;
@@ -856,6 +863,8 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 	skb->hdr_len  = 0;
 	skb->nohdr    = 0;
 	atomic_set(&skb_shinfo(skb)->dataref, 1);
+	skb_shinfo(skb)->release = NULL;
+	skb_shinfo(skb)->priv = NULL;
 	return 0;
 
 nodata:



* [PATCH] TCPCT-1: adding a sysctl
From: William Allen Simpson @ 2009-10-02 14:58 UTC (permalink / raw)
  To: netdev
In-Reply-To: <20091001225705.788d38ba@nehalam>

[-- Attachment #1: Type: text/plain, Size: 1510 bytes --]

Stephen Hemminger wrote:
> BUT numbered sysctl values are deprecated and should no longer be added.
> The current way is to use CTL_UNNUMBERED instead, if you use CTL_UNNUMBERED
> then the table does not need to be changed.
> 
Thank you, that was immensely helpful.  I was using an old (related) example.

While I've long had credit in BSD-derived systems, this is the first time I've
tried to implement something for the Linux kernel -- although I did give
permission 15 or so years ago for a fair amount of my stuff to be ported here
under GPL....

This is a straightforward re-implementation of an earlier, reviewed patch
that no longer applies cleanly:

   http://thread.gmane.org/gmane.linux.network/102586

With the original author's permission:

Adam Langley wrote:
# I'm afraid that my draft is now mostly dead!
#
# Please feel free to use any of the code that you found if it helps you
# and all the best with it,
#

The principal difference is using a TCP option to carry the cookie nonce,
instead of an offset to a random nonce in the data.  This allows several
related concepts to use the same extension option.  This cookie option has
been suggested for many years.

   http://www.merit.net/mail.archives/nanog/1996-09/msg00235.html

Also, as mentioned earlier, I added a sysctl to turn on and off the cookie
feature globally.  The cookies are useful even without SYN data.

Since I'm new around here, this first patch is just the ioctl and sysctl.

Any suggestions for improvement?  Or general approval?


[-- Attachment #2: tcpct-1.patch --]
[-- Type: text/plain, Size: 11082 bytes --]

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 61723a7..a8d8a88 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -96,6 +96,7 @@ enum {
 #define TCP_QUICKACK		12	/* Block/reenable quick acks */
 #define TCP_CONGESTION		13	/* Congestion control algorithm */
 #define TCP_MD5SIG		14	/* TCP MD5 Signature (RFC2385) */
+#define TCP_COOKIE_DATA		15	/* TCP Cookie Transactions extension */
 
 #define TCPI_OPT_TIMESTAMPS	1
 #define TCPI_OPT_SACK		2
@@ -170,6 +171,33 @@ struct tcp_md5sig {
 	__u8	tcpm_key[TCP_MD5SIG_MAXKEYLEN];		/* key (binary) */
 };
 
+/* for TCP_COOKIE_DATA socket option */
+#define TCP_COOKIE_MAX		16		/* 128-bits */
+#define TCP_COOKIE_MIN		 8		/*  64-bits */
+#define TCP_COOKIE_PAIR_SIZE	(2*TCP_COOKIE_MAX)
+
+#define TCP_S_DATA_MAX		64U		/* after TCP+IP options */
+#define TCP_S_DATA_MSS_DEFAULT	536U		/* default MSS (RFC1122) */
+
+/* Flags for both getsockopt and setsockopt */
+#define TCP_COOKIE_IN_ALWAYS	(1 << 0)	/* Discard SYN without cookie */
+#define TCP_COOKIE_OUT_NEVER	(1 << 1)	/* Prohibit outgoing cookies.
+						   Supercedes the others. */
+
+/* Flags for getsockopt */
+#define TCP_S_DATA_IN		(1 << 2)	/* Was data received? */
+#define TCP_S_DATA_OUT		(1 << 3)	/* Was data sent? */
+
+/* TCP Cookie Transactions data */
+struct tcp_cookie_data {
+	__u16	tcpcd_flags;			/* see above */
+	__u8	__tcpcd_pad1;			/* zero */
+	__u8	tcpcd_cookie_desired;		/* bytes */
+	__u16	tcpcd_s_data_desired;		/* bytes of variable data */
+	__u16	tcpcd_used;			/* bytes in value */
+	__u8	tcpcd_value[TCP_S_DATA_MSS_DEFAULT];
+};
+
 #ifdef __KERNEL__
 
 #include <linux/skbuff.h>
@@ -217,9 +245,13 @@ struct tcp_options_received {
 		sack_ok : 4,	/* SACK seen on SYN packet		*/
 		snd_wscale : 4,	/* Window scaling received from sender	*/
 		rcv_wscale : 4;	/* Window scaling to send to receiver	*/
-/*	SACKs data	*/
+#ifdef CONFIG_TCP_OPT_COOKIE_EXTENSION
+	u16	extend_ok:1;	/* Cookie{less,pair} seen		*/
+	u8	*cookie_copy;
+	u8	cookie_size;	/* bytes in copy */
+#endif
 	u8	num_sacks;	/* Number of SACK blocks		*/
-	u16	user_mss;  	/* mss requested by user in ioctl */
+	u16	user_mss;	/* mss requested by user in ioctl	*/
 	u16	mss_clamp;	/* Maximal mss, negotiated at connection setup */
 };
 
@@ -229,14 +261,27 @@ struct tcp_options_received {
  * only four options will fit in a standard TCP header */
 #define TCP_NUM_SACKS 4
 
+#ifdef CONFIG_TCP_OPT_COOKIE_EXTENSION
+struct tcp_cookie_pair;
+struct tcp_s_data_payload;
+#endif
+
 struct tcp_request_sock {
 	struct inet_request_sock 	req;
 #ifdef CONFIG_TCP_MD5SIG
 	/* Only used by TCP MD5 Signature so far. */
 	const struct tcp_request_sock_ops *af_specific;
 #endif
-	u32			 	rcv_isn;
-	u32			 	snt_isn;
+	u32				rcv_isn;
+	u32				snt_isn;
+#ifdef CONFIG_TCP_OPT_COOKIE_EXTENSION
+	u8				*cookie_copy;
+	u8				cookie_size;	/* bytes in copy */
+	u8				s_data_in:1,
+					s_data_out:1,
+					cookie_in_always:1,
+					cookie_out_never:1;
+#endif
 };
 
 static inline struct tcp_request_sock *tcp_rsk(const struct request_sock *req)
@@ -406,6 +451,33 @@ struct tcp_sock {
 /* TCP MD5 Signature Option information */
 	struct tcp_md5sig_info	*md5sig_info;
 #endif
+#ifdef CONFIG_TCP_OPT_COOKIE_EXTENSION
+	/* If s_data_desired > 0 and s_data_payload is non-NULL, then this
+	 * object holds a reference to it (s_data_payload->kref)
+	 */
+	struct tcp_s_data_payload	*s_data_payload;
+
+	/* When the cookie options are generated and exchanged, then this
+	 * object holds a reference to them (cookie_pair->kref)
+	 */
+	struct tcp_cookie_pair	  	*cookie_pair;
+
+	/* If s_data_payload is non-NULL, then this holds a copy of
+	 * s_data_payload->tsdpl_size.  Otherwise, this holds the user
+	 * specified tcpcd_s_data_desired (variable data).
+	 */
+	u16				s_data_desired;	/* bytes */
+
+	/* Initially, this holds the user specified tcpcd_cookie_desired.
+	 * Zero indicates default (sysctl_tcp_cookie_size).  After the
+	 * option has been exchanged, this holds the actual size.
+	 */
+	u8				cookie_desired;	/* bytes */
+	u8				s_data_in:1,
+					s_data_out:1,
+					cookie_in_always:1,
+					cookie_out_never:1;
+#endif
 };
 
 static inline struct tcp_sock *tcp_sk(const struct sock *sk)
@@ -424,6 +496,12 @@ struct tcp_timewait_sock {
 	u16			  tw_md5_keylen;
 	u8			  tw_md5_key[TCP_MD5SIG_MAXKEYLEN];
 #endif
+#ifdef CONFIG_TCP_OPT_COOKIE_EXTENSION
+	/* Few sockets in timewait have cookies; in that case, then this
+	 * object holds a reference to it (tw_cookie_pair->kref)
+	 */
+	struct tcp_cookie_pair	  *tw_cookie_pair;
+#endif
 };
 
 static inline struct tcp_timewait_sock *tcp_twsk(const struct sock *sk)
@@ -431,6 +509,6 @@ static inline struct tcp_timewait_sock *tcp_twsk(const struct sock *sk)
 	return (struct tcp_timewait_sock *)sk;
 }
 
-#endif
+#endif	/* __KERNEL__ */
 
 #endif	/* _LINUX_TCP_H */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 03a49c7..6755ed8 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -30,6 +30,7 @@
 #include <linux/dmaengine.h>
 #include <linux/crypto.h>
 #include <linux/cryptohash.h>
+#include <linux/kref.h>
 
 #include <net/inet_connection_sock.h>
 #include <net/inet_timewait_sock.h>
@@ -167,6 +168,7 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
 #define TCPOPT_SACK             5       /* SACK Block */
 #define TCPOPT_TIMESTAMP	8	/* Better RTT estimations/PAWS */
 #define TCPOPT_MD5SIG		19	/* MD5 Signature (RFC2385) */
+#define TCPOPT_COOKIE		253	/* Cookie extension (experimental) */
 
 /*
  *     TCP option lengths
@@ -177,6 +179,10 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
 #define TCPOLEN_SACK_PERM      2
 #define TCPOLEN_TIMESTAMP      10
 #define TCPOLEN_MD5SIG         18
+#define TCPOLEN_COOKIE_BASE    2	/* Cookie-less header extension */
+#define TCPOLEN_COOKIE_PAIR    3	/* Cookie pair header extension */
+#define TCPOLEN_COOKIE_MAX     (TCPOLEN_COOKIE_BASE+TCP_COOKIE_MAX)
+#define TCPOLEN_COOKIE_MIN     (TCPOLEN_COOKIE_BASE+TCP_COOKIE_MIN)
 
 /* But this is what stacks really send out. */
 #define TCPOLEN_TSTAMP_ALIGNED		12
@@ -237,6 +243,7 @@ extern int sysctl_tcp_base_mss;
 extern int sysctl_tcp_workaround_signed_windows;
 extern int sysctl_tcp_slow_start_after_idle;
 extern int sysctl_tcp_max_ssthresh;
+extern int sysctl_tcp_cookie_size;
 
 extern atomic_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
@@ -345,7 +352,12 @@ extern void tcp_enter_quickack_mode(struct sock *sk);
 
 static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
 {
- 	rx_opt->tstamp_ok = rx_opt->sack_ok = rx_opt->wscale_ok = rx_opt->snd_wscale = 0;
+#ifdef CONFIG_TCP_OPT_COOKIE_EXTENSION
+	rx_opt->cookie_copy = NULL;
+	rx_opt->cookie_size = rx_opt->extend_ok =
+#endif
+	rx_opt->tstamp_ok = rx_opt->sack_ok =
+	rx_opt->wscale_ok = rx_opt->snd_wscale = 0;
 }
 
 #define	TCP_ECN_OK		1
@@ -1480,6 +1492,46 @@ struct tcp_request_sock_ops {
 #endif
 };
 
+#ifdef CONFIG_TCP_OPT_COOKIE_EXTENSION
+/**
+ * This structure contains variable data that is to be included in the
+ * cookie option and compared with later incoming segments.
+ *
+ * A tcp_sock contains a pointer to the current value, and this is cloned to
+ * the tcp_timewait_sock.
+ */
+struct tcp_cookie_pair {
+	struct kref	kref;
+	/* 32-bit aligned for faster comparisons? */
+	u8		tcpcp_data[TCP_COOKIE_PAIR_SIZE];
+	u8		tcpcp_size;	/* of the cookie pair */
+};
+
+static inline void tcp_cookie_pair_release(struct kref *kref)
+{
+	kfree(container_of(kref, struct tcp_cookie_pair, kref));
+}
+
+/**
+ * This structure contains constant data that is to be included in the
+ * payload of SYN or SYNACK segments when the cookie option is present.
+ *
+ * This structure is immutable (save for the reference counter) once created.
+ * A tcp_sock contains a pointer to the current value, and this is cloned to
+ * the request socks as they are generated.
+ */
+struct tcp_s_data_payload {
+	struct kref	kref;
+	u16		tsdpl_size;	/* of the trailing payload */
+	u8		tsdpl_data[0];	/* trailing payload */
+};
+
+static inline void tcp_s_data_payload_release(struct kref *kref)
+{
+	kfree(container_of(kref, struct tcp_s_data_payload, kref));
+}
+#endif
+
 extern void tcp_v4_init(void);
 extern void tcp_init(void);
 
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index 70491d9..1cf3be5 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -627,3 +627,36 @@ config TCP_MD5SIG
 
 	  If unsure, say N.
 
+config TCP_OPT_COOKIE_EXTENSION
+	bool "TCP: Cookie option extension (EXPERIMENTAL)"
+	default n
+	depends on EXPERIMENTAL
+	select CRYPTO
+	select CRYPTO_MD5
+	---help---
+	  TCP/IP networking is open to an attack known as "SYN flooding".
+	  This denial-of-service attack prevents legitimate remote users
+	  from being able to connect to the computer during an ongoing
+	  attack and requires very little work from the attacker, who can
+	  operate from anywhere on the Internet.
+
+	  TCP Cookie Transactions (TCPCT) deter spoofing of client
+	  connections and prevent server resource exhaustion, by
+	  eliminating the need to maintain server state during <SYN>
+	  establishment and after <FIN> and <RST> termination of
+	  connections.  The TCPCT cookie exchange itself may optionally
+	  carry <SYN> data, limited in size to inhibit Denial of Service
+	  (DoS) attacks. Implements TCP header extension, allowing
+	  64-bit timestamps and more Selective Acknowledgments.
+
+	  Unlike the passive "SYN cookies" option, other TCP options will
+	  continue to work.  If configured, SYN cookies continue to function
+	  for those parties that do not use this Cookie extension option.
+
+	  If you say Y here, note that TCPCT isn't yet enabled by default.
+
+	  The sysctl "tcp_cookie_size" should be in the range 8 to 16,
+	  although any non-zero value will be adjusted automatically.
+
+	  If unsure, say N.
+
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 2dcf04d..25b60eb 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -712,6 +712,16 @@ static struct ctl_table ipv4_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+#ifdef CONFIG_TCP_OPT_COOKIE_EXTENSION
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "tcp_cookie_size",
+		.data		= &sysctl_tcp_cookie_size,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
+#endif
 	{
 		.ctl_name	= CTL_UNNUMBERED,
 		.procname	= "udp_mem",
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 5200aab..93af24c 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -59,6 +59,14 @@ int sysctl_tcp_base_mss __read_mostly = 512;
 /* By default, RFC2861 behavior.  */
 int sysctl_tcp_slow_start_after_idle __read_mostly = 1;
 
+#ifdef CONFIG_SYSCTL
+/* By default, let the user enable it. */
+int sysctl_tcp_cookie_size __read_mostly = 0;
+#else
+int sysctl_tcp_cookie_size __read_mostly = TCP_COOKIE_MAX;
+#endif
+
+
 /* Account for new data that has been sent to the network. */
 static void tcp_event_new_data_sent(struct sock *sk, struct sk_buff *skb)
 {


* Re: [PATCH] net: Fix wrong sizeof
From: Randy Dunlap @ 2009-10-02 15:14 UTC (permalink / raw)
  To: Jean Delvare; +Cc: LKML, netdev, linux-doc, stable
In-Reply-To: <20091002113038.1dc3d284@hyperion.delvare>

On Fri, 2 Oct 2009 11:30:38 +0200 Jean Delvare wrote:

> Which is why I have always preferred sizeof(struct foo) over
> sizeof(var).
> 
> Signed-off-by: Jean Delvare <khali@linux-fr.org>
> Cc: Randy Dunlap <rdunlap@xenotime.net>

Acked-by: Randy Dunlap <rdunlap@xenotime.net>

I also prefer to use sizeof(struct xyz) in my non-kernel code
instead of sizeof(var).
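
The bug class fixed by the quoted patch is easy to reproduce in isolation.
The sketch below uses a hypothetical struct config (not the kernel's
hwtstamp_config) and hypothetical helper names; it only demonstrates that
sizeof(&var) yields the size of a pointer, so a memset() sized that way
clears just the first few bytes of a larger structure.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical structure, deliberately larger than any pointer. */
struct config {
	unsigned char bytes[64];
};

/* Poison a struct, clear `len` bytes, and count the leading zero bytes. */
static size_t cleared_prefix(size_t len)
{
	struct config cfg;
	size_t n = 0;

	assert(len <= sizeof(cfg));
	memset(&cfg, 0xff, sizeof(cfg));	/* poison every byte */
	memset(&cfg, 0, len);			/* the clear under test */
	while (n < sizeof(cfg.bytes) && cfg.bytes[n] == 0)
		n++;
	return n;
}

/* Buggy pattern: sizeof(&cfg) is the size of a pointer to the struct. */
static size_t cleared_by_sizeof_addr(void)
{
	struct config cfg;

	return cleared_prefix(sizeof(&cfg));
}

/* Fixed pattern: sizeof(cfg) is the size of the struct itself. */
static size_t cleared_by_sizeof_var(void)
{
	struct config cfg;

	return cleared_prefix(sizeof(cfg));
}
```

The first helper reports only as many cleared bytes as a pointer occupies on
the build target, while the second reports the full structure size.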

> ---
> Stable team, the non-documentation part of this fix applies to 2.6.31,
> 2.6.30 and 2.6.27.
> 
>  Documentation/networking/timestamping/timestamping.c |    2 +-
>  drivers/net/iseries_veth.c                           |    2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> --- linux-2.6.32-rc1.orig/Documentation/networking/timestamping/timestamping.c	2009-06-10 05:05:27.000000000 +0200
> +++ linux-2.6.32-rc1/Documentation/networking/timestamping/timestamping.c	2009-10-02 11:07:19.000000000 +0200
> @@ -381,7 +381,7 @@ int main(int argc, char **argv)
>  	memset(&hwtstamp, 0, sizeof(hwtstamp));
>  	strncpy(hwtstamp.ifr_name, interface, sizeof(hwtstamp.ifr_name));
>  	hwtstamp.ifr_data = (void *)&hwconfig;
> -	memset(&hwconfig, 0, sizeof(&hwconfig));
> +	memset(&hwconfig, 0, sizeof(hwconfig));
>  	hwconfig.tx_type =
>  		(so_timestamping_flags & SOF_TIMESTAMPING_TX_HARDWARE) ?
>  		HWTSTAMP_TX_ON : HWTSTAMP_TX_OFF;
> --- linux-2.6.32-rc1.orig/drivers/net/iseries_veth.c	2009-09-28 10:28:42.000000000 +0200
> +++ linux-2.6.32-rc1/drivers/net/iseries_veth.c	2009-10-02 11:07:15.000000000 +0200
> @@ -495,7 +495,7 @@ static void veth_take_cap_ack(struct vet
>  			   cnx->remote_lp);
>  	} else {
>  		memcpy(&cnx->cap_ack_event, event,
> -		       sizeof(&cnx->cap_ack_event));
> +		       sizeof(cnx->cap_ack_event));
>  		cnx->state |= VETH_STATE_GOTCAPACK;
>  		veth_kick_statemachine(cnx);
>  	}
> 
> 
> -- 
> Jean Delvare


---
~Randy


* [PATCH 0/4] More device type integration
From: Marcel Holtmann @ 2009-10-02 15:15 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Johannes Berg, Greg KH

Hi Dave,

I followed the work from Johannes and made sure we can register the
device type for wireless devices via the netdev notifier callback for
all cfg80211 based devices. This way we don't have to touch any of
the drivers.

For the mobile broadband cards from Ericsson, the device type is now
set to "wwan" and it also uses "wwan%d" for the default interface name.

Regards

Marcel


Johannes Berg (1):
  net: introduce NETDEV_POST_INIT notifier

Marcel Holtmann (3):
  usbnet: Use wwan%d interface name for mobile broadband devices
  usbnet: Set device type for wlan and wwan devices
  cfg80211: assign device type in netdev notifier callback

 drivers/net/usb/cdc_ether.c |   20 ++++++++++++++------
 drivers/net/usb/usbnet.c    |   17 +++++++++++++++++
 include/linux/notifier.h    |    1 +
 include/linux/usb/usbnet.h  |    1 +
 net/core/dev.c              |    6 ++++++
 net/mac80211/iface.c        |    5 -----
 net/wireless/core.c         |    7 +++++++
 7 files changed, 46 insertions(+), 11 deletions(-)



* [PATCH 1/4] usbnet: Use wwan%d interface name for mobile broadband devices
From: Marcel Holtmann @ 2009-10-02 15:15 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Johannes Berg, Greg KH
In-Reply-To: <cover.1254495724.git.marcel@holtmann.org>

Add support for usbnet based devices like CDC-Ether to indicate that they
are actually mobile broadband devices. In that case use wwan%d as default
interface name.
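
The naming rule can be sketched as a small stand-alone function. This is an
illustration only: name_template() is a hypothetical helper, and the fallback
template is passed in by the caller rather than taken from usbnet. The two
flag values are the ones used in include/linux/usb/usbnet.h by this patch.

```c
#include <assert.h>
#include <stddef.h>

#define FLAG_WLAN	0x0080	/* use "wlan%d" names */
#define FLAG_WWAN	0x0400	/* use "wwan%d" names */

/*
 * Hypothetical helper: pick the interface name template from driver flags.
 * `fallback` is whatever template the caller chose before the flag checks.
 * WWAN is tested last, so it wins if both flags are set, matching the order
 * of the two strcpy() calls in the patch.
 */
static const char *name_template(unsigned int flags, const char *fallback)
{
	const char *name = fallback;

	assert(fallback != NULL);
	if ((flags & FLAG_WLAN) != 0)
		name = "wlan%d";
	if ((flags & FLAG_WWAN) != 0)
		name = "wwan%d";
	return name;
}
```

A driver_info with FLAG_WWAN set, such as mbm_info, would therefore end up
with the "wwan%d" template, while one with neither flag keeps the fallback.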

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
---
 drivers/net/usb/cdc_ether.c |   20 ++++++++++++++------
 drivers/net/usb/usbnet.c    |    3 +++
 include/linux/usb/usbnet.h  |    1 +
 3 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/drivers/net/usb/cdc_ether.c b/drivers/net/usb/cdc_ether.c
index 4a6aff5..71e65fc 100644
--- a/drivers/net/usb/cdc_ether.c
+++ b/drivers/net/usb/cdc_ether.c
@@ -420,6 +420,14 @@ static const struct driver_info	cdc_info = {
 	.status =	cdc_status,
 };
 
+static const struct driver_info mbm_info = {
+	.description =	"Mobile Broadband Network Device",
+	.flags =	FLAG_WWAN,
+	.bind = 	cdc_bind,
+	.unbind =	usbnet_cdc_unbind,
+	.status =	cdc_status,
+};
+
 /*-------------------------------------------------------------------------*/
 
 
@@ -532,32 +540,32 @@ static const struct usb_device_id	products [] = {
 	/* Ericsson F3507g */
 	USB_DEVICE_AND_INTERFACE_INFO(0x0bdb, 0x1900, USB_CLASS_COMM,
 			USB_CDC_SUBCLASS_MDLM, USB_CDC_PROTO_NONE),
-	.driver_info = (unsigned long) &cdc_info,
+	.driver_info = (unsigned long) &mbm_info,
 }, {
 	/* Ericsson F3507g ver. 2 */
 	USB_DEVICE_AND_INTERFACE_INFO(0x0bdb, 0x1902, USB_CLASS_COMM,
 			USB_CDC_SUBCLASS_MDLM, USB_CDC_PROTO_NONE),
-	.driver_info = (unsigned long) &cdc_info,
+	.driver_info = (unsigned long) &mbm_info,
 }, {
 	/* Ericsson F3607gw */
 	USB_DEVICE_AND_INTERFACE_INFO(0x0bdb, 0x1904, USB_CLASS_COMM,
 			USB_CDC_SUBCLASS_MDLM, USB_CDC_PROTO_NONE),
-	.driver_info = (unsigned long) &cdc_info,
+	.driver_info = (unsigned long) &mbm_info,
 }, {
 	/* Ericsson F3307 */
 	USB_DEVICE_AND_INTERFACE_INFO(0x0bdb, 0x1906, USB_CLASS_COMM,
 			USB_CDC_SUBCLASS_MDLM, USB_CDC_PROTO_NONE),
-	.driver_info = (unsigned long) &cdc_info,
+	.driver_info = (unsigned long) &mbm_info,
 }, {
 	/* Toshiba F3507g */
 	USB_DEVICE_AND_INTERFACE_INFO(0x0930, 0x130b, USB_CLASS_COMM,
 			USB_CDC_SUBCLASS_MDLM, USB_CDC_PROTO_NONE),
-	.driver_info = (unsigned long) &cdc_info,
+	.driver_info = (unsigned long) &mbm_info,
 }, {
 	/* Dell F3507g */
 	USB_DEVICE_AND_INTERFACE_INFO(0x413c, 0x8147, USB_CLASS_COMM,
 			USB_CDC_SUBCLASS_MDLM, USB_CDC_PROTO_NONE),
-	.driver_info = (unsigned long) &cdc_info,
+	.driver_info = (unsigned long) &mbm_info,
 },
 	{ },		// END
 };
diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index ca5ca5a..8124cf1 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -1295,6 +1295,9 @@ usbnet_probe (struct usb_interface *udev, const struct usb_device_id *prod)
 		/* WLAN devices should always be named "wlan%d" */
 		if ((dev->driver_info->flags & FLAG_WLAN) != 0)
 			strcpy(net->name, "wlan%d");
+		/* WWAN devices should always be named "wwan%d" */
+		if ((dev->driver_info->flags & FLAG_WWAN) != 0)
+			strcpy(net->name, "wwan%d");
 
 		/* maybe the remote can't receive an Ethernet MTU */
 		if (net->mtu > (dev->hard_mtu - net->hard_header_len))
diff --git a/include/linux/usb/usbnet.h b/include/linux/usb/usbnet.h
index f814730..86c31b7 100644
--- a/include/linux/usb/usbnet.h
+++ b/include/linux/usb/usbnet.h
@@ -90,6 +90,7 @@ struct driver_info {
 #define FLAG_WLAN	0x0080		/* use "wlan%d" names */
 #define FLAG_AVOID_UNLINK_URBS 0x0100	/* don't unlink urbs at usbnet_stop() */
 #define FLAG_SEND_ZLP	0x0200		/* hw requires ZLPs are sent */
+#define FLAG_WWAN	0x0400		/* use "wwan%d" names */
 
 
 	/* init device ... can sleep, or cause probe() failure */
-- 
1.6.2.5



* [PATCH 2/4] usbnet: Set device type for wlan and wwan devices
From: Marcel Holtmann @ 2009-10-02 15:15 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Johannes Berg, Greg KH
In-Reply-To: <cover.1254495724.git.marcel@holtmann.org>

For usbnet devices with FLAG_WLAN or FLAG_WWAN, set the proper device
type so that the uevent contains the correct value. This then allows easy
identification of the actual underlying technology of the Ethernet device.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
---
 drivers/net/usb/usbnet.c |   14 ++++++++++++++
 1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index 8124cf1..378da8c 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -1210,6 +1210,14 @@ static const struct net_device_ops usbnet_netdev_ops = {
 
 // precondition: never called in_interrupt
 
+static struct device_type wlan_type = {
+	.name	= "wlan",
+};
+
+static struct device_type wwan_type = {
+	.name	= "wwan",
+};
+
 int
 usbnet_probe (struct usb_interface *udev, const struct usb_device_id *prod)
 {
@@ -1325,6 +1333,12 @@ usbnet_probe (struct usb_interface *udev, const struct usb_device_id *prod)
 	dev->maxpacket = usb_maxpacket (dev->udev, dev->out, 1);
 
 	SET_NETDEV_DEV(net, &udev->dev);
+
+	if ((dev->driver_info->flags & FLAG_WLAN) != 0)
+		SET_NETDEV_DEVTYPE(net, &wlan_type);
+	if ((dev->driver_info->flags & FLAG_WWAN) != 0)
+		SET_NETDEV_DEVTYPE(net, &wwan_type);
+
 	status = register_netdev (net);
 	if (status)
 		goto out3;
-- 
1.6.2.5



* [PATCH 3/4] net: introduce NETDEV_POST_INIT notifier
From: Marcel Holtmann @ 2009-10-02 15:15 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Johannes Berg, Greg KH
In-Reply-To: <cover.1254495724.git.marcel@holtmann.org>

From: Johannes Berg <johannes@sipsolutions.net>

For various purposes including a wireless extensions
bugfix, we need to hook into the netdev creation
before netdev_register_kobject(). This will also ease
doing the dev type assignment that Marcel was working
on for cfg80211 drivers w/o touching them all.
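
The idea of a hook that runs before registration, and that may veto it, can
be sketched in user space. Everything below is a simplified assumption, not
the kernel's notifier API: struct fake_netdev, register_notifier(),
call_notifiers() and register_dev() are hypothetical, and callbacks return a
plain errno-style int instead of NOTIFY_* codes converted with
notifier_to_errno().

```c
#include <assert.h>
#include <stddef.h>

#define EVENT_POST_INIT	0x0010	/* same value as NETDEV_POST_INIT in the patch */
#define MAX_NOTIFIERS	4

struct fake_netdev {
	const char *devtype;	/* NULL until a notifier assigns one */
};

/* A notifier returns 0 to continue or a non-zero error value to veto. */
typedef int (*notifier_fn)(unsigned long event, struct fake_netdev *dev);

static notifier_fn chain[MAX_NOTIFIERS];
static size_t chain_len;

static void register_notifier(notifier_fn fn)
{
	assert(chain_len < MAX_NOTIFIERS);
	chain[chain_len++] = fn;
}

/* Call every notifier in order; stop at the first one that reports an error. */
static int call_notifiers(unsigned long event, struct fake_netdev *dev)
{
	for (size_t i = 0; i < chain_len; i++) {
		int ret = chain[i](event, dev);

		if (ret)
			return ret;
	}
	return 0;
}

/* Example subscriber: assign the device type before "registration". */
static int wireless_notifier(unsigned long event, struct fake_netdev *dev)
{
	if (event == EVENT_POST_INIT)
		dev->devtype = "wlan";
	return 0;
}

/* Registration aborts if any POST_INIT subscriber vetoes. */
static int register_dev(struct fake_netdev *dev)
{
	int ret = call_notifiers(EVENT_POST_INIT, dev);

	if (ret)
		return ret;
	/* The kobject would be registered here and would see dev->devtype. */
	return 0;
}
```

Because the subscriber runs before the registration step, the device type is
already in place when the device becomes visible, and no individual driver
has to set it.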

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
---
 include/linux/notifier.h |    1 +
 net/core/dev.c           |    6 ++++++
 2 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/include/linux/notifier.h b/include/linux/notifier.h
index 44428d2..29714b8 100644
--- a/include/linux/notifier.h
+++ b/include/linux/notifier.h
@@ -201,6 +201,7 @@ static inline int notifier_to_errno(int ret)
 #define NETDEV_PRE_UP		0x000D
 #define NETDEV_BONDING_OLDTYPE  0x000E
 #define NETDEV_BONDING_NEWTYPE  0x000F
+#define NETDEV_POST_INIT	0x0010
 
 #define SYS_DOWN	0x0001	/* Notify of system down */
 #define SYS_RESTART	SYS_DOWN
diff --git a/net/core/dev.c b/net/core/dev.c
index b8f74cf..a74c8fd 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4836,6 +4836,12 @@ int register_netdevice(struct net_device *dev)
 		dev->features |= NETIF_F_GSO;
 
 	netdev_initialize_kobject(dev);
+
+	ret = call_netdevice_notifiers(NETDEV_POST_INIT, dev);
+	ret = notifier_to_errno(ret);
+	if (ret)
+		goto err_uninit;
+
 	ret = netdev_register_kobject(dev);
 	if (ret)
 		goto err_uninit;
-- 
1.6.2.5



* [PATCH 4/4] cfg80211: assign device type in netdev notifier callback
From: Marcel Holtmann @ 2009-10-02 15:15 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Johannes Berg, Greg KH
In-Reply-To: <cover.1254495724.git.marcel@holtmann.org>

Instead of having to modify every non-mac80211 driver for device type assignment,
do this inside the netdev notifier callback of cfg80211. So all drivers
that integrate with cfg80211 will export a proper device type.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
---
 net/mac80211/iface.c |    5 -----
 net/wireless/core.c  |    7 +++++++
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/net/mac80211/iface.c b/net/mac80211/iface.c
index b8295cb..f6005ad 100644
--- a/net/mac80211/iface.c
+++ b/net/mac80211/iface.c
@@ -754,10 +754,6 @@ int ieee80211_if_change_type(struct ieee80211_sub_if_data *sdata,
 	return 0;
 }
 
-static struct device_type wiphy_type = {
-	.name	= "wlan",
-};
-
 int ieee80211_if_add(struct ieee80211_local *local, const char *name,
 		     struct net_device **new_dev, enum nl80211_iftype type,
 		     struct vif_params *params)
@@ -789,7 +785,6 @@ int ieee80211_if_add(struct ieee80211_local *local, const char *name,
 
 	memcpy(ndev->dev_addr, local->hw.wiphy->perm_addr, ETH_ALEN);
 	SET_NETDEV_DEV(ndev, wiphy_dev(local->hw.wiphy));
-	SET_NETDEV_DEVTYPE(ndev, &wiphy_type);
 
 	/* don't use IEEE80211_DEV_TO_SUB_IF because it checks too much */
 	sdata = netdev_priv(ndev);
diff --git a/net/wireless/core.c b/net/wireless/core.c
index 45b2be3..e6f02e9 100644
--- a/net/wireless/core.c
+++ b/net/wireless/core.c
@@ -625,6 +625,10 @@ static void wdev_cleanup_work(struct work_struct *work)
 	dev_put(wdev->netdev);
 }
 
+static struct device_type wiphy_type = {
+	.name	= "wlan",
+};
+
 static int cfg80211_netdev_notifier_call(struct notifier_block * nb,
 					 unsigned long state,
 					 void *ndev)
@@ -641,6 +645,9 @@ static int cfg80211_netdev_notifier_call(struct notifier_block * nb,
 	WARN_ON(wdev->iftype == NL80211_IFTYPE_UNSPECIFIED);
 
 	switch (state) {
+	case NETDEV_POST_INIT:
+		SET_NETDEV_DEVTYPE(dev, &wiphy_type);
+		break;
 	case NETDEV_REGISTER:
 		/*
 		 * NB: cannot take rdev->mtx here because this may be
-- 
1.6.2.5


^ permalink raw reply related

* [PATCH v3] net: Add vbus_enet driver
From: Gregory Haskins @ 2009-10-02 15:33 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel, alacrityvm-devel
In-Reply-To: <20090804010915.17855.2660.stgit@dev.haskins.net>

A virtualized 802.x network device based on the VBUS interface. It can be
used with any hypervisor/kernel that supports the virtual-ethernet/vbus
protocol.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: David S. Miller <davem@davemloft.net>

[ added several new features since last review:
	pre-mapped-transmit descriptors,
	event-queue,
	link-state event,
	tx-complete event
]

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 MAINTAINERS             |    7 
 drivers/net/Kconfig     |   14 +
 drivers/net/Makefile    |    1 
 drivers/net/vbus-enet.c | 1203 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/Kbuild    |    1 
 include/linux/venet.h   |  123 +++++
 6 files changed, 1349 insertions(+), 0 deletions(-)
 create mode 100644 drivers/net/vbus-enet.c
 create mode 100644 include/linux/venet.h
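
As a reading aid for the receive path in the driver below, here is a small
user-space sketch of a budgeted poll over a descriptor ring with an ownership
flag. It assumes a simplified model: a slot is consumable once its "sown"
(south-owned) flag is clear, zero-length slots are flushed buffers that are
recycled without being counted, and polling stops at the budget or at the first
slot still owned by the producer. All toy_/TOY_ names are hypothetical; the
real driver uses the ioq iterator API instead.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_RING_LEN 8

struct toy_desc {
	int len;  /* bytes written by the producer; 0 means "flushed" */
	int sown; /* 1 while the producer ("south") side owns the slot */
};

struct toy_ring {
	struct toy_desc desc[TOY_RING_LEN];
	size_t head; /* next slot the consumer will inspect */
	unsigned long rx_packets;
	unsigned long rx_bytes;
};

static void toy_ring_init(struct toy_ring *r)
{
	size_t i;

	r->head = 0;
	r->rx_packets = 0;
	r->rx_bytes = 0;
	for (i = 0; i < TOY_RING_LEN; i++) {
		r->desc[i].len = 0;
		r->desc[i].sown = 1; /* every slot starts out with the producer */
	}
}

/* Producer side: fill a slot and hand it over to the consumer. */
static void toy_produce(struct toy_ring *r, size_t slot, int len)
{
	assert(slot < TOY_RING_LEN);
	r->desc[slot].len = len;
	r->desc[slot].sown = 0;
}

/* Consumer side: process at most 'budget' packets, return how many. */
static int toy_poll(struct toy_ring *r, int budget)
{
	int npackets = 0;

	while (npackets < budget && !r->desc[r->head].sown) {
		struct toy_desc *d = &r->desc[r->head];

		if (d->len) {
			npackets++;
			r->rx_packets++;
			r->rx_bytes += (unsigned long)d->len;
		}
		/* A zero-length slot is a flushed buffer: recycled, not counted. */

		/* Give the slot back to the producer and advance. */
		d->len = 0;
		d->sown = 1;
		r->head = (r->head + 1) % TOY_RING_LEN;
	}

	return npackets;
}

/* Returns 1 if budget and flush handling behave as described. */
static int toy_poll_demo(void)
{
	struct toy_ring r;
	int first, second;

	toy_ring_init(&r);
	toy_produce(&r, 0, 100);
	toy_produce(&r, 1, 0); /* flushed buffer */
	toy_produce(&r, 2, 60);

	first = toy_poll(&r, 1);   /* budget stops after one packet */
	second = toy_poll(&r, 16); /* skips the flush, takes the last packet */

	return first == 1 && second == 1 && r.rx_bytes == 160 && r.head == 3;
}
```

The sketch leaves out memory barriers, buffer replacement and interrupt
re-enabling, all of which the actual poll routine has to deal with.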

diff --git a/MAINTAINERS b/MAINTAINERS
index b484756..ade37b5 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5456,6 +5456,13 @@ S:	Maintained
 F:	include/linux/vbus*
 F:	drivers/vbus/*
 
+VBUS ETHERNET DRIVER
+M:	Gregory Haskins <ghaskins@novell.com>
+S:	Maintained
+W:	http://developer.novell.com/wiki/index.php/AlacrityVM
+F:	include/linux/venet.h
+F:	drivers/net/vbus-enet.c
+
 VFAT/FAT/MSDOS FILESYSTEM
 M:	OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
 S:	Maintained
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 5ce7cba..722f892 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -3211,4 +3211,18 @@ config VIRTIO_NET
 	  This is the virtual network driver for virtio.  It can be used with
           lguest or QEMU based VMMs (like KVM or Xen).  Say Y or M.
 
+config VBUS_ENET
+	tristate "VBUS Ethernet Driver"
+	default n
+	select VBUS_PROXY
+	help
+	   A virtualized 802.x network device based on the VBUS
+	   "virtual-ethernet" interface.  It can be used with any
+	   hypervisor/kernel that supports the vbus+venet protocol.
+
+config VBUS_ENET_DEBUG
+	bool "Enable Debugging"
+	depends on VBUS_ENET
+	default n
+
 endif # NETDEVICES
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index ead8cab..2a3c7a9 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -277,6 +277,7 @@ obj-$(CONFIG_FS_ENET) += fs_enet/
 obj-$(CONFIG_NETXEN_NIC) += netxen/
 obj-$(CONFIG_NIU) += niu.o
 obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
+obj-$(CONFIG_VBUS_ENET) += vbus-enet.o
 obj-$(CONFIG_SFC) += sfc/
 
 obj-$(CONFIG_WIMAX) += wimax/
diff --git a/drivers/net/vbus-enet.c b/drivers/net/vbus-enet.c
new file mode 100644
index 0000000..e8a0553
--- /dev/null
+++ b/drivers/net/vbus-enet.c
@@ -0,0 +1,1203 @@
+/*
+ * vbus_enet - A virtualized 802.x network device based on the VBUS interface
+ *
+ * Copyright (C) 2009 Novell, Gregory Haskins <ghaskins@novell.com>
+ *
+ * Derived from the SNULL example from the book "Linux Device Drivers" by
+ * Alessandro Rubini, Jonathan Corbet, and Greg Kroah-Hartman, published
+ * by O'Reilly & Associates.
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/moduleparam.h>
+
+#include <linux/sched.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/errno.h>
+#include <linux/types.h>
+#include <linux/interrupt.h>
+
+#include <linux/in.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ip.h>
+#include <linux/tcp.h>
+#include <linux/skbuff.h>
+#include <linux/ioq.h>
+#include <linux/vbus_driver.h>
+
+#include <linux/in6.h>
+#include <asm/checksum.h>
+
+#include <linux/venet.h>
+
+MODULE_AUTHOR("Gregory Haskins");
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("virtual-ethernet");
+MODULE_VERSION("1");
+
+static int rx_ringlen = 256;
+module_param(rx_ringlen, int, 0444);
+static int tx_ringlen = 256;
+module_param(tx_ringlen, int, 0444);
+static int sg_enabled = 1;
+module_param(sg_enabled, int, 0444);
+
+#define PDEBUG(_dev, fmt, args...) dev_dbg(&(_dev)->dev, fmt, ## args)
+
+struct vbus_enet_queue {
+	struct ioq              *queue;
+	struct ioq_notifier      notifier;
+	unsigned long            count;
+};
+
+struct vbus_enet_priv {
+	spinlock_t                 lock;
+	struct net_device         *dev;
+	struct vbus_device_proxy  *vdev;
+	struct napi_struct         napi;
+	struct vbus_enet_queue     rxq;
+	struct {
+		struct vbus_enet_queue veq;
+		struct tasklet_struct  task;
+		struct sk_buff_head    outstanding;
+	} tx;
+	bool                       sg;
+	struct {
+		bool               enabled;
+		char              *pool;
+	} pmtd; /* pre-mapped transmit descriptors */
+	struct {
+		bool                   enabled;
+		bool                   linkstate;
+		bool                   txc;
+		unsigned long          evsize;
+		struct vbus_enet_queue veq;
+		struct tasklet_struct  task;
+		char                  *pool;
+	} evq;
+};
+
+static void vbus_enet_tx_reap(struct vbus_enet_priv *priv);
+
+static struct vbus_enet_priv *
+napi_to_priv(struct napi_struct *napi)
+{
+	return container_of(napi, struct vbus_enet_priv, napi);
+}
+
+static int
+queue_init(struct vbus_enet_priv *priv,
+	   struct vbus_enet_queue *q,
+	   int qid,
+	   size_t ringsize,
+	   void (*func)(struct ioq_notifier *))
+{
+	struct vbus_device_proxy *dev = priv->vdev;
+	int ret;
+
+	ret = vbus_driver_ioq_alloc(dev, qid, 0, ringsize, &q->queue);
+	if (ret < 0)
+		panic("ioq_alloc failed: %d\n", ret);
+
+	if (func) {
+		q->notifier.signal = func;
+		q->queue->notifier = &q->notifier;
+	}
+
+	q->count = ringsize;
+
+	return 0;
+}
+
+static int
+devcall(struct vbus_enet_priv *priv, u32 func, void *data, size_t len)
+{
+	struct vbus_device_proxy *dev = priv->vdev;
+
+	return dev->ops->call(dev, func, data, len, 0);
+}
+
+/*
+ * ---------------
+ * rx descriptors
+ * ---------------
+ */
+
+static void
+rxdesc_alloc(struct net_device *dev, struct ioq_ring_desc *desc, size_t len)
+{
+	struct sk_buff *skb;
+
+	len += ETH_HLEN;
+
+	skb = netdev_alloc_skb(dev, len + 2);
+	BUG_ON(!skb);
+
+	skb_reserve(skb, NET_IP_ALIGN); /* align IP on 16B boundary */
+
+	desc->cookie = (u64)skb;
+	desc->ptr    = (u64)__pa(skb->data);
+	desc->len    = len; /* total length  */
+	desc->valid  = 1;
+}
+
+static void
+rx_setup(struct vbus_enet_priv *priv)
+{
+	struct ioq *ioq = priv->rxq.queue;
+	struct ioq_iterator iter;
+	int ret;
+
+	/*
+	 * We want to iterate on the "valid" index.  By default the iterator
+	 * will not "autoupdate" which means it will not hypercall the host
+	 * with our changes.  This is good, because we are really just
+	 * initializing stuff here anyway.  Note that you can always manually
+	 * signal the host with ioq_signal() if the autoupdate feature is not
+	 * used.
+	 */
+	ret = ioq_iter_init(ioq, &iter, ioq_idxtype_valid, 0);
+	BUG_ON(ret < 0); /* will never fail unless seriously broken */
+
+	/*
+	 * Seek to the tail of the valid index (which should be our first
+	 * item, since the queue is brand-new)
+	 */
+	ret = ioq_iter_seek(&iter, ioq_seek_tail, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * Now populate each descriptor with an empty SKB and mark it valid
+	 */
+	while (!iter.desc->valid) {
+		rxdesc_alloc(priv->dev, iter.desc, priv->dev->mtu);
+
+		/*
+		 * This push operation will simultaneously advance the
+		 * valid-head index and increment our position in the queue
+		 * by one.
+		 */
+		ret = ioq_iter_push(&iter, 0);
+		BUG_ON(ret < 0);
+	}
+}
+
+static void
+rx_teardown(struct vbus_enet_priv *priv)
+{
+	struct ioq *ioq = priv->rxq.queue;
+	struct ioq_iterator iter;
+	int ret;
+
+	ret = ioq_iter_init(ioq, &iter, ioq_idxtype_valid, 0);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * free each valid descriptor
+	 */
+	while (iter.desc->valid) {
+		struct sk_buff *skb = (struct sk_buff *)iter.desc->cookie;
+
+		iter.desc->valid = 0;
+		wmb();
+
+		iter.desc->ptr = 0;
+		iter.desc->cookie = 0;
+
+		ret = ioq_iter_pop(&iter, 0);
+		BUG_ON(ret < 0);
+
+		dev_kfree_skb(skb);
+	}
+}
+
+static int
+tx_setup(struct vbus_enet_priv *priv)
+{
+	struct ioq *ioq    = priv->tx.veq.queue;
+	size_t      iovlen = sizeof(struct venet_iov) * (MAX_SKB_FRAGS-1);
+	size_t      len    = sizeof(struct venet_sg) + iovlen;
+	struct ioq_iterator iter;
+	int i;
+	int ret;
+
+	if (!priv->sg)
+		/*
+		 * There is nothing to do for a ring that is not using
+		 * scatter-gather
+		 */
+		return 0;
+
+	/* pre-allocate our descriptor pool if pmtd is enabled */
+	if (priv->pmtd.enabled) {
+		struct vbus_device_proxy *dev = priv->vdev;
+		size_t poollen = len * priv->tx.veq.count;
+		char *pool;
+		int shmid;
+
+		/* pmtdquery will return the shm-id to use for the pool */
+		ret = devcall(priv, VENET_FUNC_PMTDQUERY, NULL, 0);
+		BUG_ON(ret < 0);
+
+		shmid = ret;
+
+		pool = kzalloc(poollen, GFP_KERNEL | GFP_DMA);
+		if (!pool)
+			return -ENOMEM;
+
+		priv->pmtd.pool = pool;
+
+		ret = dev->ops->shm(dev, shmid, 0, pool, poollen, 0, NULL, 0);
+		BUG_ON(ret < 0);
+	}
+
+	ret = ioq_iter_init(ioq, &iter, ioq_idxtype_valid, 0);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_set, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * Now populate each descriptor with an empty SG descriptor
+	 */
+	for (i = 0; i < priv->tx.veq.count; i++) {
+		struct venet_sg *vsg;
+
+		if (priv->pmtd.enabled) {
+			size_t offset = (i * len);
+
+			vsg = (struct venet_sg *)&priv->pmtd.pool[offset];
+			iter.desc->ptr = (u64)offset;
+		} else {
+			vsg = kzalloc(len, GFP_KERNEL);
+			if (!vsg)
+				return -ENOMEM;
+
+			iter.desc->ptr = (u64)__pa(vsg);
+		}
+
+		iter.desc->cookie = (u64)vsg;
+		iter.desc->len    = len;
+
+		ret = ioq_iter_seek(&iter, ioq_seek_next, 0, 0);
+		BUG_ON(ret < 0);
+	}
+
+	return 0;
+}
+
+static void
+tx_teardown(struct vbus_enet_priv *priv)
+{
+	struct ioq *ioq = priv->tx.veq.queue;
+	struct ioq_iterator iter;
+	struct sk_buff *skb;
+	int ret;
+
+	/* forcefully free all outstanding transmissions */
+	while ((skb = __skb_dequeue(&priv->tx.outstanding)))
+		dev_kfree_skb(skb);
+
+	if (!priv->sg)
+		/*
+		 * There is nothing else to do for a ring that is not using
+		 * scatter-gather
+		 */
+		return;
+
+	if (priv->pmtd.enabled) {
+		/*
+		 * PMTD mode means we only need to free the pool
+		 */
+		kfree(priv->pmtd.pool);
+		return;
+	}
+
+	ret = ioq_iter_init(ioq, &iter, ioq_idxtype_valid, 0);
+	BUG_ON(ret < 0);
+
+	/* seek to position 0 */
+	ret = ioq_iter_seek(&iter, ioq_seek_set, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * free each valid descriptor
+	 */
+	while (iter.desc->cookie) {
+		struct venet_sg *vsg = (struct venet_sg *)iter.desc->cookie;
+
+		iter.desc->valid = 0;
+		wmb();
+
+		iter.desc->ptr = 0;
+		iter.desc->cookie = 0;
+
+		ret = ioq_iter_seek(&iter, ioq_seek_next, 0, 0);
+		BUG_ON(ret < 0);
+
+		kfree(vsg);
+	}
+}
+
+static void
+evq_teardown(struct vbus_enet_priv *priv)
+{
+	if (!priv->evq.enabled)
+		return;
+
+	ioq_put(priv->evq.veq.queue);
+	kfree(priv->evq.pool);
+}
+
+/*
+ * Open and close
+ */
+
+static int
+vbus_enet_open(struct net_device *dev)
+{
+	struct vbus_enet_priv *priv = netdev_priv(dev);
+	int ret;
+
+	ret = devcall(priv, VENET_FUNC_LINKUP, NULL, 0);
+	BUG_ON(ret < 0);
+
+	napi_enable(&priv->napi);
+
+	return 0;
+}
+
+static int
+vbus_enet_stop(struct net_device *dev)
+{
+	struct vbus_enet_priv *priv = netdev_priv(dev);
+	int ret;
+
+	napi_disable(&priv->napi);
+
+	ret = devcall(priv, VENET_FUNC_LINKDOWN, NULL, 0);
+	BUG_ON(ret < 0);
+
+	return 0;
+}
+
+/*
+ * Configuration changes (passed on by ifconfig)
+ */
+static int
+vbus_enet_config(struct net_device *dev, struct ifmap *map)
+{
+	if (dev->flags & IFF_UP) /* can't act on a running interface */
+		return -EBUSY;
+
+	/* Don't allow changing the I/O address */
+	if (map->base_addr != dev->base_addr) {
+		dev_warn(&dev->dev, "Can't change I/O address\n");
+		return -EOPNOTSUPP;
+	}
+
+	/* ignore other fields */
+	return 0;
+}
+
+static void
+vbus_enet_schedule_rx(struct vbus_enet_priv *priv)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&priv->lock, flags);
+
+	if (napi_schedule_prep(&priv->napi)) {
+		/* Disable further interrupts */
+		ioq_notify_disable(priv->rxq.queue, 0);
+		__napi_schedule(&priv->napi);
+	}
+
+	spin_unlock_irqrestore(&priv->lock, flags);
+}
+
+static int
+vbus_enet_change_mtu(struct net_device *dev, int new_mtu)
+{
+	struct vbus_enet_priv *priv = netdev_priv(dev);
+	int ret;
+
+	dev->mtu = new_mtu;
+
+	/*
+	 * FLUSHRX will cause the device to flush any outstanding
+	 * RX buffers.  They will appear to come in as 0 length
+	 * packets which we can simply discard and replace with new_mtu
+	 * buffers for the future.
+	 */
+	ret = devcall(priv, VENET_FUNC_FLUSHRX, NULL, 0);
+	BUG_ON(ret < 0);
+
+	vbus_enet_schedule_rx(priv);
+
+	return 0;
+}
+
+/*
+ * The poll implementation.
+ */
+static int
+vbus_enet_poll(struct napi_struct *napi, int budget)
+{
+	struct vbus_enet_priv *priv = napi_to_priv(napi);
+	int npackets = 0;
+	struct ioq_iterator iter;
+	int ret;
+
+	PDEBUG(priv->dev, "polling...\n");
+
+	/* We want to iterate on the head of the in-use index */
+	ret = ioq_iter_init(priv->rxq.queue, &iter, ioq_idxtype_inuse,
+			    IOQ_ITER_AUTOUPDATE);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * We stop if we have met the quota or there are no more packets.
+	 * The EOM is indicated by finding a packet that is still owned by
+	 * the south side
+	 */
+	while ((npackets < budget) && (!iter.desc->sown)) {
+		struct sk_buff *skb = (struct sk_buff *)iter.desc->cookie;
+
+		if (iter.desc->len) {
+			skb_put(skb, iter.desc->len);
+
+			/* Maintain stats */
+			npackets++;
+			priv->dev->stats.rx_packets++;
+			priv->dev->stats.rx_bytes += iter.desc->len;
+
+			/* Pass the buffer up to the stack */
+			skb->dev      = priv->dev;
+			skb->protocol = eth_type_trans(skb, priv->dev);
+			netif_receive_skb(skb);
+
+			mb();
+		} else
+			/*
+			 * the device may send a zero-length packet when it's
+			 * flushing references on the ring.  We can just drop
+			 * these on the floor
+			 */
+			dev_kfree_skb(skb);
+
+		/* Grab a new buffer to put in the ring */
+		rxdesc_alloc(priv->dev, iter.desc, priv->dev->mtu);
+
+		/* Advance the in-use tail */
+		ret = ioq_iter_pop(&iter, 0);
+		BUG_ON(ret < 0);
+	}
+
+	PDEBUG(priv->dev, "%d packets received\n", npackets);
+
+	/*
+	 * If we processed all packets, we're done; tell the kernel and
+	 * reenable ints
+	 */
+	if (ioq_empty(priv->rxq.queue, ioq_idxtype_inuse)) {
+		napi_complete(napi);
+		ioq_notify_enable(priv->rxq.queue, 0);
+		ret = 0;
+	} else
+		/* We couldn't process everything. */
+		ret = 1;
+
+	return ret;
+}
+
+/*
+ * Transmit a packet (called by the kernel)
+ */
+static int
+vbus_enet_tx_start(struct sk_buff *skb, struct net_device *dev)
+{
+	struct vbus_enet_priv *priv = netdev_priv(dev);
+	struct ioq_iterator    iter;
+	int ret;
+	unsigned long flags;
+
+	PDEBUG(priv->dev, "sending %d bytes\n", skb->len);
+
+	spin_lock_irqsave(&priv->lock, flags);
+
+	if (ioq_full(priv->tx.veq.queue, ioq_idxtype_valid)) {
+		/*
+		 * We must flow-control the kernel by disabling the
+		 * queue
+		 */
+		spin_unlock_irqrestore(&priv->lock, flags);
+		netif_stop_queue(dev);
+		dev_err(&priv->dev->dev, "tx on full queue bug\n");
+		return 1;
+	}
+
+	/*
+	 * We want to iterate on the tail of both the "inuse" and "valid" index
+	 * so we specify the "both" index
+	 */
+	ret = ioq_iter_init(priv->tx.veq.queue, &iter, ioq_idxtype_both,
+			    IOQ_ITER_AUTOUPDATE);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_tail, 0, 0);
+	BUG_ON(ret < 0);
+	BUG_ON(iter.desc->sown);
+
+	if (priv->sg) {
+		struct venet_sg *vsg = (struct venet_sg *)iter.desc->cookie;
+		struct scatterlist sgl[MAX_SKB_FRAGS+1];
+		struct scatterlist *sg;
+		int count, maxcount = ARRAY_SIZE(sgl);
+
+		sg_init_table(sgl, maxcount);
+
+		memset(vsg, 0, sizeof(*vsg));
+
+		vsg->cookie = (u64)skb;
+		vsg->len    = skb->len;
+
+		if (skb->ip_summed == CHECKSUM_PARTIAL) {
+			vsg->flags      |= VENET_SG_FLAG_NEEDS_CSUM;
+			vsg->csum.start  = skb->csum_start - skb_headroom(skb);
+			vsg->csum.offset = skb->csum_offset;
+		}
+
+		if (skb_is_gso(skb)) {
+			struct skb_shared_info *sinfo = skb_shinfo(skb);
+
+			vsg->flags |= VENET_SG_FLAG_GSO;
+
+			vsg->gso.hdrlen = skb_headlen(skb);
+			vsg->gso.size = sinfo->gso_size;
+			if (sinfo->gso_type & SKB_GSO_TCPV4)
+				vsg->gso.type = VENET_GSO_TYPE_TCPV4;
+			else if (sinfo->gso_type & SKB_GSO_TCPV6)
+				vsg->gso.type = VENET_GSO_TYPE_TCPV6;
+			else if (sinfo->gso_type & SKB_GSO_UDP)
+				vsg->gso.type = VENET_GSO_TYPE_UDP;
+			else
+				panic("Virtual-Ethernet: unknown GSO type " \
+				      "0x%x\n", sinfo->gso_type);
+
+			if (sinfo->gso_type & SKB_GSO_TCP_ECN)
+				vsg->flags |= VENET_SG_FLAG_ECN;
+		}
+
+		count = skb_to_sgvec(skb, sgl, 0, skb->len);
+
+		BUG_ON(count > maxcount);
+
+		for (sg = &sgl[0]; sg; sg = sg_next(sg)) {
+			struct venet_iov *iov = &vsg->iov[vsg->count++];
+
+			iov->len = sg->length;
+			iov->ptr = (u64)sg_phys(sg);
+		}
+
+		iter.desc->len = (u64)VSG_DESC_SIZE(vsg->count);
+
+	} else {
+		/*
+		 * non scatter-gather mode: simply put the skb right onto the
+		 * ring.
+		 */
+		iter.desc->cookie = (u64)skb;
+		iter.desc->len = (u64)skb->len;
+		iter.desc->ptr = (u64)__pa(skb->data);
+	}
+
+	iter.desc->valid  = 1;
+
+	priv->dev->stats.tx_packets++;
+	priv->dev->stats.tx_bytes += skb->len;
+
+	__skb_queue_tail(&priv->tx.outstanding, skb);
+
+	/*
+	 * This advances both indexes together implicitly, and then
+	 * signals the south side to consume the packet
+	 */
+	ret = ioq_iter_push(&iter, 0);
+	BUG_ON(ret < 0);
+
+	dev->trans_start = jiffies; /* save the timestamp */
+
+	if (ioq_full(priv->tx.veq.queue, ioq_idxtype_valid)) {
+		/*
+		 * If the queue is congested, we must flow-control the kernel
+		 */
+		PDEBUG(priv->dev, "backpressure tx queue\n");
+		netif_stop_queue(dev);
+	}
+
+	spin_unlock_irqrestore(&priv->lock, flags);
+
+	return 0;
+}
+
+/* assumes priv->lock held */
+static void
+vbus_enet_skb_complete(struct vbus_enet_priv *priv, struct sk_buff *skb)
+{
+	PDEBUG(priv->dev, "completed sending %d bytes\n",
+	       skb->len);
+
+	__skb_unlink(skb, &priv->tx.outstanding);
+	dev_kfree_skb(skb);
+}
+
+/*
+ * reclaim any outstanding completed tx packets
+ *
+ * assumes priv->lock held
+ */
+static void
+vbus_enet_tx_reap(struct vbus_enet_priv *priv)
+{
+	struct ioq_iterator iter;
+	int ret;
+
+	/*
+	 * We want to iterate on the head of the valid index, but we
+	 * do not want the iter_pop (below) to flip the ownership, so
+	 * we set the NOFLIPOWNER option
+	 */
+	ret = ioq_iter_init(priv->tx.veq.queue, &iter, ioq_idxtype_valid,
+			    IOQ_ITER_NOFLIPOWNER);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * We are done once we find the first packet either invalid or still
+	 * owned by the south-side
+	 */
+	while (iter.desc->valid && !iter.desc->sown) {
+
+		if (!priv->evq.txc) {
+			struct sk_buff *skb;
+
+			if (priv->sg) {
+				struct venet_sg *vsg;
+
+				vsg = (struct venet_sg *)iter.desc->cookie;
+				skb = (struct sk_buff *)vsg->cookie;
+			} else
+				skb = (struct sk_buff *)iter.desc->cookie;
+
+			/*
+			 * If TXC is not enabled, we are required to free
+			 * the buffer resources now
+			 */
+			vbus_enet_skb_complete(priv, skb);
+		}
+
+		/* Reset the descriptor */
+		iter.desc->valid  = 0;
+
+		/* Advance the valid-index head */
+		ret = ioq_iter_pop(&iter, 0);
+		BUG_ON(ret < 0);
+	}
+
+	/*
+	 * If we were previously stopped due to flow control, restart the
+	 * processing
+	 */
+	if (netif_queue_stopped(priv->dev)
+	    && !ioq_full(priv->tx.veq.queue, ioq_idxtype_valid)) {
+		PDEBUG(priv->dev, "re-enabling tx queue\n");
+		netif_wake_queue(priv->dev);
+	}
+}
+
+static void
+vbus_enet_timeout(struct net_device *dev)
+{
+	struct vbus_enet_priv *priv = netdev_priv(dev);
+	unsigned long flags;
+
+	dev_dbg(&dev->dev, "Transmit timeout\n");
+
+	spin_lock_irqsave(&priv->lock, flags);
+	vbus_enet_tx_reap(priv);
+	spin_unlock_irqrestore(&priv->lock, flags);
+}
+
+static void
+rx_isr(struct ioq_notifier *notifier)
+{
+	struct vbus_enet_priv *priv;
+	struct net_device  *dev;
+
+	priv = container_of(notifier, struct vbus_enet_priv, rxq.notifier);
+	dev = priv->dev;
+
+	if (!ioq_empty(priv->rxq.queue, ioq_idxtype_inuse))
+		vbus_enet_schedule_rx(priv);
+}
+
+static void
+deferred_tx_isr(unsigned long data)
+{
+	struct vbus_enet_priv *priv = (struct vbus_enet_priv *)data;
+	unsigned long flags;
+
+	PDEBUG(priv->dev, "deferred_tx_isr\n");
+
+	spin_lock_irqsave(&priv->lock, flags);
+	vbus_enet_tx_reap(priv);
+	spin_unlock_irqrestore(&priv->lock, flags);
+
+	ioq_notify_enable(priv->tx.veq.queue, 0);
+}
+
+static void
+tx_isr(struct ioq_notifier *notifier)
+{
+	struct vbus_enet_priv *priv;
+
+	priv = container_of(notifier, struct vbus_enet_priv, tx.veq.notifier);
+
+	PDEBUG(priv->dev, "tx_isr\n");
+
+	ioq_notify_disable(priv->tx.veq.queue, 0);
+	tasklet_schedule(&priv->tx.task);
+}
+
+static void
+evq_linkstate_event(struct vbus_enet_priv *priv,
+		    struct venet_event_header *header)
+{
+	struct venet_event_linkstate *event =
+		(struct venet_event_linkstate *)header;
+
+	switch (event->state) {
+	case 0:
+		netif_carrier_off(priv->dev);
+		break;
+	case 1:
+		netif_carrier_on(priv->dev);
+		break;
+	default:
+		break;
+	}
+}
+
+static void
+evq_txc_event(struct vbus_enet_priv *priv,
+	      struct venet_event_header *header)
+{
+	struct venet_event_txc *event =
+		(struct venet_event_txc *)header;
+	unsigned long flags;
+
+	spin_lock_irqsave(&priv->lock, flags);
+
+	vbus_enet_tx_reap(priv);
+	vbus_enet_skb_complete(priv, (struct sk_buff *)event->cookie);
+
+	spin_unlock_irqrestore(&priv->lock, flags);
+}
+
+static void
+deferred_evq_isr(unsigned long data)
+{
+	struct vbus_enet_priv *priv = (struct vbus_enet_priv *)data;
+	int nevents = 0;
+	struct ioq_iterator iter;
+	int ret;
+
+	PDEBUG(priv->dev, "evq: polling...\n");
+
+	/* We want to iterate on the head of the in-use index */
+	ret = ioq_iter_init(priv->evq.veq.queue, &iter, ioq_idxtype_inuse,
+			    IOQ_ITER_AUTOUPDATE);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * The EOM is indicated by finding a packet that is still owned by
+	 * the south side
+	 */
+	while (!iter.desc->sown) {
+		struct venet_event_header *header;
+
+		header = (struct venet_event_header *)iter.desc->cookie;
+
+		switch (header->id) {
+		case VENET_EVENT_LINKSTATE:
+			evq_linkstate_event(priv, header);
+			break;
+		case VENET_EVENT_TXC:
+			evq_txc_event(priv, header);
+			break;
+		default:
+			panic("venet: unexpected event id:%d of size %d\n",
+			      header->id, header->size);
+			break;
+		}
+
+		memset((void *)iter.desc->cookie, 0, priv->evq.evsize);
+
+		/* Advance the in-use tail */
+		ret = ioq_iter_pop(&iter, 0);
+		BUG_ON(ret < 0);
+
+		nevents++;
+	}
+
+	PDEBUG(priv->dev, "%d events received\n", nevents);
+
+	ioq_notify_enable(priv->evq.veq.queue, 0);
+}
+
+static void
+evq_isr(struct ioq_notifier *notifier)
+{
+	struct vbus_enet_priv *priv;
+
+	priv = container_of(notifier, struct vbus_enet_priv, evq.veq.notifier);
+
+	PDEBUG(priv->dev, "evq_isr\n");
+
+	ioq_notify_disable(priv->evq.veq.queue, 0);
+	tasklet_schedule(&priv->evq.task);
+}
+
+static int
+vbus_enet_sg_negcap(struct vbus_enet_priv *priv)
+{
+	struct net_device *dev = priv->dev;
+	struct venet_capabilities caps;
+	int ret;
+
+	memset(&caps, 0, sizeof(caps));
+
+	if (sg_enabled) {
+		caps.gid = VENET_CAP_GROUP_SG;
+		caps.bits |= (VENET_CAP_SG|VENET_CAP_TSO4|VENET_CAP_TSO6
+			      |VENET_CAP_ECN|VENET_CAP_PMTD);
+		/* note: exclude UFO for now due to stack bug */
+	}
+
+	ret = devcall(priv, VENET_FUNC_NEGCAP, &caps, sizeof(caps));
+	if (ret < 0)
+		return ret;
+
+	if (caps.bits & VENET_CAP_SG) {
+		priv->sg = true;
+
+		dev->features |= NETIF_F_SG|NETIF_F_HW_CSUM|NETIF_F_FRAGLIST;
+
+		if (caps.bits & VENET_CAP_TSO4)
+			dev->features |= NETIF_F_TSO;
+		if (caps.bits & VENET_CAP_UFO)
+			dev->features |= NETIF_F_UFO;
+		if (caps.bits & VENET_CAP_TSO6)
+			dev->features |= NETIF_F_TSO6;
+		if (caps.bits & VENET_CAP_ECN)
+			dev->features |= NETIF_F_TSO_ECN;
+
+		if (caps.bits & VENET_CAP_PMTD)
+			priv->pmtd.enabled = true;
+	}
+
+	return 0;
+}
+
+static int
+vbus_enet_evq_negcap(struct vbus_enet_priv *priv, unsigned long count)
+{
+	struct venet_capabilities caps;
+	int ret;
+
+	memset(&caps, 0, sizeof(caps));
+
+	caps.gid = VENET_CAP_GROUP_EVENTQ;
+	caps.bits |= VENET_CAP_EVQ_LINKSTATE;
+	caps.bits |= VENET_CAP_EVQ_TXC;
+
+	ret = devcall(priv, VENET_FUNC_NEGCAP, &caps, sizeof(caps));
+	if (ret < 0)
+		return ret;
+
+	if (caps.bits) {
+		struct vbus_device_proxy *dev = priv->vdev;
+		struct venet_eventq_query query;
+		size_t                    poollen;
+		struct ioq_iterator       iter;
+		char                     *pool;
+		int                       i;
+
+		priv->evq.enabled = true;
+
+		if (caps.bits & VENET_CAP_EVQ_LINKSTATE) {
+			/*
+			 * We will assume there is no carrier until we get
+			 * an event telling us otherwise
+			 */
+			netif_carrier_off(priv->dev);
+			priv->evq.linkstate = true;
+		}
+
+		if (caps.bits & VENET_CAP_EVQ_TXC)
+			priv->evq.txc = true;
+
+		memset(&query, 0, sizeof(query));
+
+		ret = devcall(priv, VENET_FUNC_EVQQUERY, &query, sizeof(query));
+		if (ret < 0)
+			return ret;
+
+		priv->evq.evsize = query.evsize;
+		poollen = query.evsize * count;
+
+		pool = kzalloc(poollen, GFP_KERNEL | GFP_DMA);
+		if (!pool)
+			return -ENOMEM;
+
+		priv->evq.pool = pool;
+
+		ret = dev->ops->shm(dev, query.dpid, 0,
+				    pool, poollen, 0, NULL, 0);
+		if (ret < 0)
+			return ret;
+
+		queue_init(priv, &priv->evq.veq, query.qid, count, evq_isr);
+
+		ret = ioq_iter_init(priv->evq.veq.queue,
+				    &iter, ioq_idxtype_valid, 0);
+		BUG_ON(ret < 0);
+
+		ret = ioq_iter_seek(&iter, ioq_seek_set, 0, 0);
+		BUG_ON(ret < 0);
+
+		/* Now populate each descriptor with an empty event */
+		for (i = 0; i < count; i++) {
+			size_t offset = (i * query.evsize);
+			void *addr = &priv->evq.pool[offset];
+
+			iter.desc->ptr    = (u64)offset;
+			iter.desc->cookie = (u64)addr;
+			iter.desc->len    = query.evsize;
+
+			ret = ioq_iter_push(&iter, 0);
+			BUG_ON(ret < 0);
+		}
+
+		/* Finally, enable interrupts */
+		tasklet_init(&priv->evq.task, deferred_evq_isr,
+			     (unsigned long)priv);
+		ioq_notify_enable(priv->evq.veq.queue, 0);
+	}
+
+	return 0;
+}
+
+static int
+vbus_enet_negcap(struct vbus_enet_priv *priv)
+{
+	int ret;
+
+	ret = vbus_enet_sg_negcap(priv);
+	if (ret < 0)
+		return ret;
+
+	return vbus_enet_evq_negcap(priv, tx_ringlen);
+}
+
+static int vbus_enet_set_tx_csum(struct net_device *dev, u32 data)
+{
+	struct vbus_enet_priv *priv = netdev_priv(dev);
+
+	if (data && !priv->sg)
+		return -ENOSYS;
+
+	return ethtool_op_set_tx_hw_csum(dev, data);
+}
+
+static struct ethtool_ops vbus_enet_ethtool_ops = {
+	.set_tx_csum = vbus_enet_set_tx_csum,
+	.set_sg      = ethtool_op_set_sg,
+	.set_tso     = ethtool_op_set_tso,
+	.get_link    = ethtool_op_get_link,
+};
+
+static const struct net_device_ops vbus_enet_netdev_ops = {
+	.ndo_open            = vbus_enet_open,
+	.ndo_stop            = vbus_enet_stop,
+	.ndo_set_config      = vbus_enet_config,
+	.ndo_start_xmit      = vbus_enet_tx_start,
+	.ndo_change_mtu	     = vbus_enet_change_mtu,
+	.ndo_tx_timeout      = vbus_enet_timeout,
+	.ndo_set_mac_address = eth_mac_addr,
+	.ndo_validate_addr   = eth_validate_addr,
+};
+
+/*
+ * This is called whenever a new vbus_device_proxy is added to the vbus
+ * with the matching VENET_ID
+ */
+static int
+vbus_enet_probe(struct vbus_device_proxy *vdev)
+{
+	struct net_device  *dev;
+	struct vbus_enet_priv *priv;
+	int ret;
+
+	printk(KERN_INFO "VENET: Found new device at %lld\n", vdev->id);
+
+	ret = vdev->ops->open(vdev, VENET_VERSION, 0);
+	if (ret < 0)
+		return ret;
+
+	dev = alloc_etherdev(sizeof(struct vbus_enet_priv));
+	if (!dev)
+		return -ENOMEM;
+
+	priv = netdev_priv(dev);
+
+	spin_lock_init(&priv->lock);
+	priv->dev  = dev;
+	priv->vdev = vdev;
+
+	ret = vbus_enet_negcap(priv);
+	if (ret < 0) {
+		printk(KERN_INFO "VENET: Error negotiating capabilities for " \
+		       "%lld\n",
+		       priv->vdev->id);
+		goto out_free;
+	}
+
+	skb_queue_head_init(&priv->tx.outstanding);
+
+	queue_init(priv, &priv->rxq, VENET_QUEUE_RX, rx_ringlen, rx_isr);
+	queue_init(priv, &priv->tx.veq, VENET_QUEUE_TX, tx_ringlen, tx_isr);
+
+	rx_setup(priv);
+	tx_setup(priv);
+
+	ioq_notify_enable(priv->rxq.queue, 0);  /* enable rx interrupts */
+
+	if (!priv->evq.txc) {
+		/*
+		 * If the TXC feature is present, we will receive our
+		 * tx-complete notification via the event-channel.  Therefore,
+		 * we only enable txq interrupts if the TXC feature is not
+		 * present.
+		 */
+		tasklet_init(&priv->tx.task, deferred_tx_isr,
+			     (unsigned long)priv);
+		ioq_notify_enable(priv->tx.veq.queue, 0);
+	}
+
+	dev->netdev_ops     = &vbus_enet_netdev_ops;
+	dev->watchdog_timeo = 5 * HZ;
+	SET_ETHTOOL_OPS(dev, &vbus_enet_ethtool_ops);
+	SET_NETDEV_DEV(dev, &vdev->dev);
+
+	netif_napi_add(dev, &priv->napi, vbus_enet_poll, 128);
+
+	ret = devcall(priv, VENET_FUNC_MACQUERY, priv->dev->dev_addr, ETH_ALEN);
+	if (ret < 0) {
+		printk(KERN_INFO "VENET: Error obtaining MAC address for " \
+		       "%lld\n",
+		       priv->vdev->id);
+		goto out_free;
+	}
+
+	dev->features |= NETIF_F_HIGHDMA;
+
+	ret = register_netdev(dev);
+	if (ret < 0) {
+		printk(KERN_INFO "VENET: error %i registering device \"%s\"\n",
+		       ret, dev->name);
+		goto out_free;
+	}
+
+	vdev->priv = priv;
+
+	return 0;
+
+ out_free:
+	free_netdev(dev);
+
+	return ret;
+}
+
+static int
+vbus_enet_remove(struct vbus_device_proxy *vdev)
+{
+	struct vbus_enet_priv *priv = (struct vbus_enet_priv *)vdev->priv;
+	struct vbus_device_proxy *dev = priv->vdev;
+
+	unregister_netdev(priv->dev);
+	napi_disable(&priv->napi);
+
+	rx_teardown(priv);
+	ioq_put(priv->rxq.queue);
+
+	tx_teardown(priv);
+	ioq_put(priv->tx.veq.queue);
+
+	if (priv->evq.enabled)
+		evq_teardown(priv);
+
+	dev->ops->close(dev, 0);
+
+	free_netdev(priv->dev);
+
+	return 0;
+}
+
+/*
+ * Finally, the module stuff
+ */
+
+static struct vbus_driver_ops vbus_enet_driver_ops = {
+	.probe  = vbus_enet_probe,
+	.remove = vbus_enet_remove,
+};
+
+static struct vbus_driver vbus_enet_driver = {
+	.type   = VENET_TYPE,
+	.owner  = THIS_MODULE,
+	.ops    = &vbus_enet_driver_ops,
+};
+
+static __init int
+vbus_enet_init_module(void)
+{
+	printk(KERN_INFO "Virtual Ethernet: Copyright (C) 2009 Novell, Gregory Haskins\n");
+	printk(KERN_DEBUG "VENET: Using %d/%d queue depth\n",
+	       rx_ringlen, tx_ringlen);
+	return vbus_driver_register(&vbus_enet_driver);
+}
+
+static __exit void
+vbus_enet_cleanup(void)
+{
+	vbus_driver_unregister(&vbus_enet_driver);
+}
+
+module_init(vbus_enet_init_module);
+module_exit(vbus_enet_cleanup);
diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index fa15bbf..911f7ef 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -359,6 +359,7 @@ unifdef-y += unistd.h
 unifdef-y += usbdevice_fs.h
 unifdef-y += utsname.h
 unifdef-y += vbus_pci.h
+unifdef-y += venet.h
 unifdef-y += videodev2.h
 unifdef-y += videodev.h
 unifdef-y += virtio_config.h
diff --git a/include/linux/venet.h b/include/linux/venet.h
new file mode 100644
index 0000000..b6bfd91
--- /dev/null
+++ b/include/linux/venet.h
@@ -0,0 +1,123 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * Virtual-Ethernet adapter
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_VENET_H
+#define _LINUX_VENET_H
+
+#include <linux/types.h>
+
+#define VENET_VERSION 1
+
+#define VENET_TYPE "virtual-ethernet"
+
+#define VENET_QUEUE_RX 0
+#define VENET_QUEUE_TX 1
+
+struct venet_capabilities {
+	__u32 gid;
+	__u32 bits;
+};
+
+#define VENET_CAP_GROUP_SG     0
+#define VENET_CAP_GROUP_EVENTQ 1
+
+/* CAPABILITIES-GROUP SG */
+#define VENET_CAP_SG     (1 << 0)
+#define VENET_CAP_TSO4   (1 << 1)
+#define VENET_CAP_TSO6   (1 << 2)
+#define VENET_CAP_ECN    (1 << 3)
+#define VENET_CAP_UFO    (1 << 4)
+#define VENET_CAP_PMTD   (1 << 5) /* pre-mapped tx desc */
+
+/* CAPABILITIES-GROUP EVENTQ */
+#define VENET_CAP_EVQ_LINKSTATE  (1 << 0)
+#define VENET_CAP_EVQ_TXC        (1 << 1) /* tx-complete */
+
+struct venet_iov {
+	__u32 len;
+	__u64 ptr;
+};
+
+#define VENET_SG_FLAG_NEEDS_CSUM (1 << 0)
+#define VENET_SG_FLAG_GSO        (1 << 1)
+#define VENET_SG_FLAG_ECN        (1 << 2)
+
+struct venet_sg {
+	__u64            cookie;
+	__u32            flags;
+	__u32            len;     /* total length of all iovs */
+	struct {
+		__u16    start;	  /* csum starting position */
+		__u16    offset;  /* offset to place csum */
+	} csum;
+	struct {
+#define VENET_GSO_TYPE_TCPV4	0	/* IPv4 TCP (TSO) */
+#define VENET_GSO_TYPE_UDP	1	/* IPv4 UDP (UFO) */
+#define VENET_GSO_TYPE_TCPV6	2	/* IPv6 TCP */
+		__u8     type;
+		__u16    hdrlen;
+		__u16    size;
+	} gso;
+	__u32            count;   /* nr of iovs */
+	struct venet_iov iov[1];
+};
+
+struct venet_eventq_query {
+	__u32 flags;
+	__u32 evsize;  /* size of each event */
+	__u32 dpid;    /* descriptor pool-id */
+	__u32 qid;
+	__u8  pad[16];
+};
+
+#define VENET_EVENT_LINKSTATE 0
+#define VENET_EVENT_TXC       1
+
+struct venet_event_header {
+	__u32 flags;
+	__u32 size;
+	__u32 id;
+};
+
+struct venet_event_linkstate {
+	struct venet_event_header header;
+	__u8                      state; /* 0 = down, 1 = up */
+};
+
+struct venet_event_txc {
+	struct venet_event_header header;
+	__u32                     txqid;
+	__u64                     cookie;
+};
+
+#define VSG_DESC_SIZE(count) (sizeof(struct venet_sg) + \
+			      sizeof(struct venet_iov) * ((count) - 1))
+
+#define VENET_FUNC_LINKUP    0
+#define VENET_FUNC_LINKDOWN  1
+#define VENET_FUNC_MACQUERY  2
+#define VENET_FUNC_NEGCAP    3 /* negotiate capabilities */
+#define VENET_FUNC_FLUSHRX   4
+#define VENET_FUNC_PMTDQUERY 5
+#define VENET_FUNC_EVQQUERY  6
+
+#endif /* _LINUX_VENET_H */


^ permalink raw reply related

* Re: [PATCH 0/8] SECURITY ISSUE with connector
From: Philipp Reisner @ 2009-10-02 15:54 UTC (permalink / raw)
  To: Greg KH
  Cc: linux-fbdev-devel, netdev, linux-kernel, dm-devel,
	Evgeniy Polyakov, Andrew Morton, David S. Miller
In-Reply-To: <20091002135859.GA9383@kroah.com>

> On Fri, Oct 02, 2009 at 02:40:03PM +0200, Philipp Reisner wrote:
> > Affected: All code that uses connector, in kernel and out of mainline
> >
> > The connector, as it is today, does not allow the in kernel receiving
> > parts to do any checks on privileges of a message's sender.
>
> So, assume I know nothing about the connector architecture, what does
> this mean in a security context?
>

Think of the connector as a layer on top of netlink that allows more
than a hard coded number of subsystems to use netlink.

Netlink is used e.g. to modify routing tables in the kernel.

As it is today, a subsystem utilising the connector cannot examine
the capabilities of the user/program that sent the netlink message.

If the same were true for netlink, then every unprivileged user
could change the routing tables on your box.

> > I know, there are not many out there that like connector, but as
> > long as it is in the kernel, we have to fix the security issues it has!
>
> And what specifically are the security issues?
>

Unprivileged users can trigger operations that are supposed to be
accessible only to users having CAP_SYS_ADMIN (or some other CAP_XXX).

> > Please either drop connector, or someone who feels a bit responsible
> > and has our beloved dictator's blessing, PLEASE PLEASE PLEASE take
> > this into your tree, and send the pull request to Linus.
> >
> > Patches 1 to 4 are already Acked-by Evgeny, the connector's maintainer.
> > Patches 5 to 7 are the obvious fixes to the connector user's code.
>
> Obvious in what way?
>

They limit processing of connector/netlink messages in these subsystems
to messages sent from root (or some user having CAP_SYS_ADMIN).

That is obvious for dst, because device setup and destruction is done by
connector messages.

This is obvious for pohmelfs because these connector messages are
used there to change some configuration.

This is obvious for uvesafb because the connector messages are used
there to delegate some video BIOS emulation to userspace.

Last but not least, dm's dirty logging in user space should be immune to
crafted netlink packets sent by an unprivileged user.

Patches 1 to 4 fix the framework and should be merged as soon as possible.

Patches 5 to 8 (not 7) should probably be blessed by the affected
subsystems' maintainers. I think I have put them all on CC.

HTH.

-phil

^ permalink raw reply

* Re: [BUG net-2.6] bluetooth/rfcomm : sleeping function called from invalid context at mm/slub.c:1719
From: Oliver Hartkopp @ 2009-10-02 16:04 UTC (permalink / raw)
  To: Dave Young
  Cc: Marcel Holtmann, Linux Netdev List,
	linux-bluetooth-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <a8e1da0910020401m2fb8493ax95ff55a3b66131a5-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Dave Young wrote:
> On Fri, Oct 2, 2009 at 2:28 PM, Oliver Hartkopp <oliver-fJ+pQTUTwRTk1uMJSBkQmQ@public.gmane.org> wrote:
>> Hello Marcel,
>>
>> with current net-2.6 tree ...
>>
>> While starting my PPP Bluetooth dialup networking, i got this:
> 
> Hi, oliver
> 
> please try following patch:
> http://patchwork.kernel.org/patch/51326/

Hi Dave,

that fixed it at ppp startup!

Tested-by: Oliver Hartkopp <oliver-fJ+pQTUTwRTk1uMJSBkQmQ@public.gmane.org>

Btw, when shutting down the PPP connection I still get this:

[  361.996887] INFO: trying to register non-static key.
[  361.996897] the code is fine but needs lockdep annotation.
[  361.996902] turning off the locking correctness validator.
[  361.996912] Pid: 0, comm: swapper Not tainted 2.6.31-08939-gdb8abec-dirty #22
[  361.996919] Call Trace:
[  361.996933]  [<c12e4fb2>] ? printk+0xf/0x11
[  361.996947]  [<c1042214>] register_lock_class+0x5a/0x295
[  361.996957]  [<c1043af2>] __lock_acquire+0x9b/0xc03
[  361.996967]  [<c104464b>] ? __lock_acquire+0xbf4/0xc03
[  361.996985]  [<fa59a168>] ? l2cap_get_chan_by_scid+0x35/0x43 [l2cap]
[  361.996995]  [<c104491f>] ? lock_release_non_nested+0x17b/0x1db
[  361.997008]  [<fa59a168>] ? l2cap_get_chan_by_scid+0x35/0x43 [l2cap]
[  361.997018]  [<c10426fd>] ? trace_hardirqs_off+0xb/0xd
[  361.997028]  [<c10446b6>] lock_acquire+0x5c/0x73
[  361.997039]  [<c124cd14>] ? skb_dequeue+0x12/0x4c
[  361.997049]  [<c12e6e23>] _spin_lock_irqsave+0x24/0x34
[  361.997058]  [<c124cd14>] ? skb_dequeue+0x12/0x4c
[  361.997066]  [<c124cd14>] skb_dequeue+0x12/0x4c
[  361.997075]  [<c124d579>] skb_queue_purge+0x14/0x1b
[  361.997088]  [<fa59ce3f>] l2cap_recv_frame+0xe9e/0x129a [l2cap]
[  361.997099]  [<c10421d1>] ? register_lock_class+0x17/0x295
[  361.997110]  [<c104464b>] ? __lock_acquire+0xbf4/0xc03
[  361.997128]  [<c104464b>] ? __lock_acquire+0xbf4/0xc03
[  361.997139]  [<c120de74>] ? uhci_giveback_urb+0xf2/0x162
[  361.997163]  [<f8bb4c45>] ? hci_rx_task+0xfe/0x1f8 [bluetooth]
[  361.997177]  [<fa59d2e4>] l2cap_recv_acldata+0xa9/0x1be [l2cap]
[  361.997190]  [<fa59d23b>] ? l2cap_recv_acldata+0x0/0x1be [l2cap]
[  361.997208]  [<f8bb4c77>] hci_rx_task+0x130/0x1f8 [bluetooth]
[  361.997219]  [<c102a098>] tasklet_action+0x6b/0xb2
[  361.997228]  [<c102a46b>] __do_softirq+0x82/0x101
[  361.997237]  [<c102a515>] do_softirq+0x2b/0x43
[  361.997246]  [<c102a619>] irq_exit+0x35/0x68
[  361.997256]  [<c1004513>] do_IRQ+0x80/0x96
[  361.997265]  [<c10030ae>] common_interrupt+0x2e/0x34
[  361.997275]  [<c104007b>] ? tick_device_uses_broadcast+0x71/0x7c
[  361.997286]  [<c11747a8>] ? acpi_idle_enter_simple+0x103/0x12e
[  361.997296]  [<c1174515>] acpi_idle_enter_bm+0xc3/0x253
[  361.997306]  [<c1238b6f>] cpuidle_idle_call+0x60/0x91
[  361.997315]  [<c1001d44>] cpu_idle+0x49/0x65
[  361.997324]  [<c12e2f0e>] start_secondary+0x190/0x195


Thanks,
Oliver

^ permalink raw reply

* Re: [PATCH 0/8] SECURITY ISSUE with connector
From: Greg KH @ 2009-10-02 16:10 UTC (permalink / raw)
  To: Philipp Reisner
  Cc: linux-fbdev-devel, netdev, linux-kernel, dm-devel,
	Evgeniy Polyakov, Andrew Morton, David S. Miller
In-Reply-To: <200910021754.12940.philipp.reisner@linbit.com>

On Fri, Oct 02, 2009 at 05:54:12PM +0200, Philipp Reisner wrote:
> > On Fri, Oct 02, 2009 at 02:40:03PM +0200, Philipp Reisner wrote:
> > > Affected: All code that uses connector, in kernel and out of mainline
> > >
> > > The connector, as it is today, does not allow the in kernel receiving
> > > parts to do any checks on privileges of a message's sender.
> >
> > So, assume I know nothing about the connector architecture, what does
> > this mean in a security context?
> >
> 
> Think of the connector as a layer on top of netlink that allows more
> than a hard coded number of subsystems to use netlink.
> 
> Netlink is used e.g. to modify routing tables in the kernel.
> 
> As it is today, a subsystem utilising the connector cannot examine
> the capabilities of the user/program that sent the netlink message.
> 
> If the same were true for netlink, then every unprivileged user
> could change the routing tables on your box.
> 
> > > I know, there are not many out there that like connector, but as
> > > long as it is in the kernel, we have to fix the security issues it has!
> >
> > And what specifically are the security issues?
> >
> 
> unprivileged users can trigger operations that are supposed to be only
> accessible to users having CAP_SYS_ADMIN (or some other CAP_XXX)

Ok, but it doesn't look like there are that many connector operations
right now, right?

Anyway, I have no objection to the patches, and figure they should go
through David's network tree.

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH 0/8] SECURITY ISSUE with connector
From: Lars Ellenberg @ 2009-10-02 16:21 UTC (permalink / raw)
  To: Greg KH
  Cc: linux-fbdev-devel, netdev, Philipp Reisner, linux-kernel,
	dm-devel, Evgeniy Polyakov, Andrew Morton, David S. Miller,
	Alasdair G Kergon
In-Reply-To: <20091002135859.GA9383@kroah.com>

On Fri, Oct 02, 2009 at 06:58:59AM -0700, Greg KH wrote:
> On Fri, Oct 02, 2009 at 02:40:03PM +0200, Philipp Reisner wrote:
> > Affected: All code that uses connector, in kernel and out of mainline
> > 
> > The connector, as it is today, does not allow the in kernel receiving
> > parts to do any checks on privileges of a message's sender.
> 
> So, assume I know nothing about the connector architecture, what does
> this mean in a security context?

Arbitrary unprivileged users may craft a netlink message, which gets delivered
through connector to callbacks (registered in kernel with cn_add_callback).

These callbacks will then act on the message, as if it originated from an
"expected" source.  But currently there is no mechanism to verify the origin,
even if the callbacks tried to.

> > I know, there are not many out there that like connector, but as
> > long as it is in the kernel, we have to fix the security issues it has!
> 
> And what specifically are the security issues?

For the cn_ulog_callback (dm-log-userspace-transfer.c),
someone would be able to fake completion (with or without error code)
of ulog entries, copying arbitrary data into receiving_pkg entries.

   /*
    * This is the connector callback that delivers data
    * that was sent from userspace.
    */
   static void cn_ulog_callback(void *data)
   {
           struct cn_msg *msg = (struct cn_msg *)data;
           struct dm_ulog_request *tfr = (struct dm_ulog_request *)(msg + 1);
   
           spin_lock(&receiving_list_lock);
           if (msg->len == 0)
                   fill_pkg(msg, NULL);
           else if (msg->len < sizeof(*tfr))
                   DMERR("Incomplete message received (expected %u, got %u): [%u]",
                         (unsigned)sizeof(*tfr), msg->len, msg->seq);
           else
                   fill_pkg(NULL, tfr);
           spin_unlock(&receiving_list_lock);
   }
   
   static int fill_pkg(struct cn_msg *msg, struct dm_ulog_request *tfr)
   {
           uint32_t rtn_seq = (msg) ? msg->seq : (tfr) ? tfr->seq : 0;
   ...
                   } else {
                           pkg->error = tfr->error;
                           memcpy(pkg->data, tfr->data, tfr->data_size);
                           *(pkg->data_size) = tfr->data_size;
                   }
                   complete(&pkg->complete);

   
   
should make that obvious: if an unprivileged user can deliver arbitrary msg to
cn_ulog_callback, that should at least be disruptive to services that use it.

fix: check origin of message for proper credentials (e.g. CAP_SYS_ADMIN).




What or how much damage a crafted message can do in uvesafb_cn_callback,
I'm not sure. But if I get the msg->seq right and get past the first
sanity check, then again arbitrary input is copied into some
kernel object, which will likely at least confuse that subsystem,
maybe do damage, or result in some sort of denial of service.

I just don't know what these uvesafb_ktask do, but I doubt that anyone but root
should be able to manipulate them.




In the case of dst and pohmelfs, it is (re|de)configuration of the respective
in-kernel objects, possibly exposing arbitrary data content.
	@Evgeniy - is that statement correct? Does something prevent an
	unprivileged user from exporting arbitrary things via dst?
At least some sort of denial of service should be possible there.

For DRBD, we of course have similar problems as long as we use the connector
in its current form as our configuration choice.



I'm not sure what actual harm can be done by arbitrarily calling
w1_reset_select_slave() or w1_process_command_io(),
but allowing unprivileged users to meddle with arbitrary devices is most likely
not the intended behaviour there, either.


The "obvious" way was to first make the credentials and capabilities of the
message origin available to these callbacks, and then test on "CAP_SYS_ADMIN".
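
As a rough userspace model of that check (this is not the kernel code:
struct sender_info, cap_is_raised() and handle_msg() are names made up for
illustration; only the number 21 matches CAP_SYS_ADMIN in
<linux/capability.h>), a callback that drops anything from an unprivileged
sender could look like this:

```c
#include <stdbool.h>
#include <stdint.h>

/* Same number as CAP_SYS_ADMIN in <linux/capability.h>. */
#define MODEL_CAP_SYS_ADMIN 21

/* Hypothetical stand-in for the sender credentials handed to a callback. */
struct sender_info {
	uint64_t eff_cap;	/* effective capability bitmask */
};

/* True if capability number 'cap' is set in the bitmask 'caps'. */
static bool cap_is_raised(uint64_t caps, unsigned int cap)
{
	return ((caps >> cap) & 1u) != 0;
}

/*
 * Model of a connector callback: act on the message only if the sender
 * has CAP_SYS_ADMIN.  Returns true if the message was processed, false
 * if it was dropped.
 */
bool handle_msg(const struct sender_info *who, int *state, int new_value)
{
	if (!cap_is_raised(who->eff_cap, MODEL_CAP_SYS_ADMIN))
		return false;	/* unprivileged sender: ignore the message */

	*state = new_value;
	return true;
}
```

The point is only the ordering: the privilege test comes before any part of
the message is acted upon.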


Note that the suggested usage of the connector for _userspace_ tools
is to bind() to some netlink socket, subscribing to appropriate multicast
groups, which will usually fail for unprivileged users in netlink_bind()
because of

        /* Only superuser is allowed to listen multicasts */
        if (nladdr->nl_groups) {
                if (!netlink_capable(sock, NL_NONROOT_RECV))
                        return -EPERM;
                err = netlink_realloc_groups(sk);
                if (err)
                        return err;
        }

So typical userspace tools will fail when used as non-root.
But if you leave out the bind, you are perfectly able to _send_ arbitrary
messages on that socket, even if you are not able to receive any replies from
connector kernel space in that case.



Cheers,

	Lars

^ permalink raw reply

* Re: [PATCH 5/8] dm/connector: Only process connector packages from privileged processes
From: Jonathan Brassow @ 2009-10-02 16:40 UTC (permalink / raw)
  To: device-mapper development
  Cc: linux-fbdev-devel, netdev, LKML, Philipp Reisner, Greg KH,
	Evgeniy Polyakov, Andrew Morton, David S. Miller, Alasdair Kergon
In-Reply-To: <1254487211-11810-6-git-send-email-philipp.reisner@linbit.com>


[-- Attachment #1.1: Type: text/plain, Size: 2190 bytes --]

This patch (and "[dm-devel] [PATCH 3/8] connector/dm: Fixed a
compilation warning") will likely collide with an earlier patch (which
agk is pushing) to fix the compilation warning
(https://www.redhat.com/archives/dm-devel/2009-September/msg00218.html),
but the fix-up will be trivial.

The dm-log-userspace code checks that incoming messages correspond to
requests that were sent to userspace by way of a sequence number.  If
they don't correspond, they are dropped.  So, you must be able to
receive the messages from this kernel module (be root) in order to be
able to respond with a message that will be accepted.  I can't completely
rule out the ability to guess a sequence number, and be able to beat
the log daemon in responding while the window of that sequence
number's validity is open though...  If someone could manage to pull
this off with accuracy, they could disrupt the creation of a device,
mimic a log device failure, or cause mirror resynchronization to occur
to a different area that may simultaneously be performing a write
(potential data corruption of a mirror).  It would be an impressive
feat to accomplish this, but I very much welcome the patch rather than
tempt fate.

Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>

  brassow

On Oct 2, 2009, at 7:40 AM, Philipp Reisner wrote:

> Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
> ---
> drivers/md/dm-log-userspace-transfer.c |    3 +++
> 1 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/md/dm-log-userspace-transfer.c b/drivers/md/dm-log-userspace-transfer.c
> index 1327e1a..54abf9e 100644
> --- a/drivers/md/dm-log-userspace-transfer.c
> +++ b/drivers/md/dm-log-userspace-transfer.c
> @@ -133,6 +133,9 @@ static void cn_ulog_callback(struct cn_msg *msg, struct netlink_skb_parms *nsp)
> {
> 	struct dm_ulog_request *tfr = (struct dm_ulog_request *)(msg + 1);
>
> +	if (!cap_raised(nsp->eff_cap, CAP_SYS_ADMIN))
> +		return;
> +
> 	spin_lock(&receiving_list_lock);
> 	if (msg->len == 0)
> 		fill_pkg(msg, NULL);
> -- 
> 1.6.0.4
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply

* Re: SPLICE_F_NONBLOCK semantics...
From: David Miller @ 2009-10-02 16:45 UTC (permalink / raw)
  To: jens.axboe
  Cc: torvalds, eric.dumazet, jgunthorpe, vl, opurdila, netdev,
	linux-kernel
In-Reply-To: <20091002074754.GE14918@kernel.dk>

From: Jens Axboe <jens.axboe@oracle.com>
Date: Fri, 2 Oct 2009 09:47:54 +0200

> The net patch looks fine and correct to me, feel free to add my acked-by
> if you want.

Thanks Jens.

^ permalink raw reply

* Re: [PATCH] net: Fix wrong sizeof
From: David Miller @ 2009-10-02 16:54 UTC (permalink / raw)
  To: khali; +Cc: linux-kernel, netdev, linux-doc, rdunlap, stable
In-Reply-To: <20091002113038.1dc3d284@hyperion.delvare>

From: Jean Delvare <khali@linux-fr.org>
Date: Fri, 2 Oct 2009 11:30:38 +0200

> Which is why I have always preferred sizeof(struct foo) over
> sizeof(var).
> 
> Signed-off-by: Jean Delvare <khali@linux-fr.org>
> Cc: Randy Dunlap <rdunlap@xenotime.net>

Any time you see "&" in a sizeof() expression, it's almost
certainly a bug.  Something for the folks with automated
tools to look for if they haven't already :-)
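
For anyone who hasn't been bitten by this yet, a tiny illustration (struct
foo and the two helpers are made up for the example, they are not from the
patch): sizeof(&var) is the size of a pointer, not of the object it points
to.

```c
#include <stddef.h>

/* Hypothetical structure, clearly larger than a pointer. */
struct foo {
	char name[64];
	int  id;
};

/* What the buggy form computes: the size of a 'struct foo *'. */
size_t wrong_size(void)
{
	struct foo f;

	return sizeof(&f);
}

/* What was meant: the size of the object itself. */
size_t right_size(void)
{
	struct foo f;

	return sizeof(f);
}
```

So a memset(&f, 0, sizeof(&f)) would clear only the first pointer-sized
chunk of the structure and leave the rest untouched.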

I'll apply this, thanks.

^ permalink raw reply

* Re: [PATCH 0/8] SECURITY ISSUE with connector
From: David Miller @ 2009-10-02 16:57 UTC (permalink / raw)
  To: philipp.reisner
  Cc: linux-fbdev-devel, greg, linux-kernel, dm-devel, netdev, zbr,
	akpm
In-Reply-To: <200910021754.12940.philipp.reisner@linbit.com>

From: Philipp Reisner <philipp.reisner@linbit.com>
Date: Fri, 2 Oct 2009 17:54:12 +0200

> Think of the connector as a layer on top of netlink that allows more
> than a hard coded number of subsystems to use netlink.

There are no such limits in netlink, we have 'genetlink' which allows
an arbitrary number of subsystems to use netlink.

What connector provides over netlink/genetlink is something different
altogether.

^ permalink raw reply

* Re: [net-2.6 PATCH] e1000e/igb/ixgbe: Don't report an error if devices don't support AER
From: David Miller @ 2009-10-02 17:04 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, elendil
In-Reply-To: <20091002071542.5072.23381.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Fri, 02 Oct 2009 00:15:48 -0700

> From: Frans Pop <elendil@planet.nl>
> 
> The only error returned by pci_{en,dis}able_pcie_error_reporting() is
> -EIO which simply means that Advanced Error Reporting is not supported.
> There is no need to report that, so remove the error check from e1000e,
> igb and ixgbe.
> 
> Signed-off-by: Frans Pop <elendil@planet.nl>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied, thanks.

^ permalink raw reply

* Adding to linux-next?
From: Gregory Haskins @ 2009-10-02 17:08 UTC (permalink / raw)
  To: linux-next, Stephen Rothwell
  Cc: linux-kernel@vger.kernel.org, netdev, David Miller,
	alacrityvm-devel@lists.sourceforge.net

[-- Attachment #1: Type: text/plain, Size: 2554 bytes --]

Hello Stephen, linux-next'ers,

I am looking for some guidance on policy/procedure governing inclusion
of a tree to linux-next.  For instance: Do I have to be arbitrarily
invited (e.g. by some committee on LKML), or do I explicitly request
consideration?  I tried to Google around for answers, and also found the
linux-next wiki, but I was not getting any clear answers.

I have these guest drivers to support IO on top of the AlacrityVM
hypervisor:

http://lkml.org/lkml/2009/8/3/278

The comments have since died down.  I realize this can mean anything
from "no objection" to "no interest" ;), but I assume the former unless
someone pipes up.

I believe I addressed the review comments and received an Ack from the
one maintainer of the tree that overlaps with the work (netdev/davem), here:

http://lkml.org/lkml/2009/8/3/505

Since the rest of the work doesn't really fall into any existing
subsystem, and David conceded that the netdev overlap portion should
carry elsewhere, I offer to fill this role myself from within the
AlacrityVM tree itself.

As such, I have taken the driver series and created a new branch here:

git://git.kernel.org/pub/scm/linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git
linux-next

Unlike the original posting, I have excluded the final ethernet patch
since I posted a v3 today (http://lkml.org/lkml/2009/10/2/239) that I
would like to have David re-Ack before including.

Once the driver has been suitably approved by David, and if he still
feels it's OK to carry in a tree other than netdev, I will re-add it to
the linux-next branch.

Because I am not really sure of the policies for linux-next, let me
state my intentions of this branch, since I am an unknown in the
maintainership role:

I will only post patches to this branch that:

*) do not fall into an existing maintained subsystem category, unless
the appropriate maintainer has relinquished the patch to carry in my tree.
*) have previously been posted to LKML for suitable review.

IOW: The purpose is not to sneak something in, or subvert a maintained
subsystem.  It is purely to carry pieces that have no other home and are
maintained under the AlacrityVM project.  You can find more details of
the project here:

http://developer.novell.com/wiki/index.php/AlacrityVM

If this is not acceptable, or I need to follow some other procedure,
please advise me on the proper steps.  Perhaps I will update the wiki
FAQ on what I learn from your responses :)

Thank you, and Kind Regards,
-Greg



^ permalink raw reply

* Re: Splice on blocking TCP sockets again..
From: Jason Gunthorpe @ 2009-10-02 17:10 UTC (permalink / raw)
  To: Volker Lendecke; +Cc: Eric Dumazet, netdev, Volker Lendecke
In-Reply-To: <E1Mssmb-004RJz-Hf@intern.SerNet.DE>

On Wed, Sep 30, 2009 at 08:37:13AM +0200, Volker Lendecke wrote:
> On Tue, Sep 29, 2009 at 06:48:20PM -0600, Jason Gunthorpe wrote:
> > FWIW, it looks like samba has a splice code now, but doesn't enable it
> > due to this issue?
> 
> Right. What I've learned from the comments is that splice is
> only usable in multi-threaded programs. One thread is
> reading, one is writing from the other end. I deferred using
> splice until we have the proper architecture to do sync
> syscalls in helper threads to make them virtually async.  We
> have some code for that now, but it's not a high priority
> for me at this moment.

So, it looks like thanks to Eric and davem that splice will be changed
so it can be blocking on the TCP and non-blocking on the PIPE.

I'd suggest a construct like the following as a compatibility
solution:

/* pipefd[] is the pipe(2) pair used as the splice intermediary */
struct pollfd pfd = {.fd = tcpfd, .events = POLLIN | POLLRDHUP};
while (..) {
   rc = splice(tcpfd,0,pipefd[1],0,count,SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
   if (rc == -1)
     //...
   if (rc == 0) {
       if (pfd.revents & POLLRDHUP)
          // oops, EOF on TCP

       /* Might be an old kernel that nonblocks on TCP, have to check
          if this is EOF or do blocking. */
       rc = poll(&pfd,1,-1);
       if (rc == -1)
          //...
   }

   rc = splice(pipefd[0],0,ofd,0,..., SPLICE_F_MOVE);
}

This should add no overhead in the case where the new splice blocks, and falls
back gracefully on older kernels.
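
Fleshed out a little, one pass of that loop might look like the sketch
below. It is only an illustration: relay_once() and its parameters are
names invented here, and it assumes the pipe is empty on entry, so that
EAGAIN from the first splice() means "no input yet" rather than "pipe
full".

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/*
 * Move up to 'count' bytes from in_fd to out_fd through the pipe p[].
 * Returns the number of bytes moved, 0 on EOF, or -1 on error with
 * errno set.  Hypothetical helper, assumes p[] is empty on entry.
 */
ssize_t relay_once(int in_fd, int p[2], int out_fd, size_t count)
{
	struct pollfd pfd = { .fd = in_fd, .events = POLLIN | POLLRDHUP };
	ssize_t in, out, done = 0;

	for (;;) {
		in = splice(in_fd, NULL, p[1], NULL, count,
			    SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
		if (in > 0)
			break;
		if (in == 0)
			return 0;	/* EOF on the input */
		if (errno != EAGAIN && errno != EINTR)
			return -1;

		/* Nothing ready yet: sleep in poll() instead of spinning. */
		if (poll(&pfd, 1, -1) == -1 && errno != EINTR)
			return -1;
	}

	/* Drain everything we just queued in the pipe to the output. */
	while (done < in) {
		out = splice(p[0], NULL, out_fd, NULL, in - done,
			     SPLICE_F_MOVE);
		if (out == -1) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (out == 0) {
			errno = EIO;	/* output accepted nothing */
			return -1;
		}
		done += out;
	}

	return done;
}
```

The caller would create p[] once with pipe(2) and call relay_once() in a
loop until it returns 0 or -1.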

Thanks,
Jason

^ permalink raw reply

* Re: [PATCH] Use sk_mark for routing lookup in more places
From: Maciej Żenczykowski @ 2009-10-02 17:25 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, atis, panther, netdev
In-Reply-To: <4AC598D7.9080900@gmail.com>

Cool!

As I've already pointed out in a post 2 or so weeks ago, we need the
exact same treatment in a ton of places throughout the code (tcp,
ipv6, decnet, etc...).

Maybe it would make more sense to create some constructor-like
functions for the flowi struct?
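
To make the idea concrete, something along these lines; a userspace sketch
only. struct flow_key is a cut-down stand-in for struct flowi, and struct
sock_model and flow_key_init() are hypothetical names, not existing kernel
API.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical subset of the socket fields a route lookup cares about. */
struct sock_model {
	uint32_t mark;		/* value set via SO_MARK */
	int      bound_dev_if;	/* bound output interface, 0 if none */
};

/* Cut-down stand-in for struct flowi. */
struct flow_key {
	int      oif;
	uint32_t mark;
	uint32_t daddr;
	uint32_t saddr;
};

/*
 * "Constructor": every caller gets the socket-derived fields (mark, oif)
 * filled in the same way, so no lookup site can forget one of them.
 */
void flow_key_init(struct flow_key *fl, const struct sock_model *sk,
		   uint32_t daddr, uint32_t saddr)
{
	memset(fl, 0, sizeof(*fl));
	fl->oif   = sk->bound_dev_if;
	fl->mark  = sk->mark;
	fl->daddr = daddr;
	fl->saddr = saddr;
}
```

Each protocol would then call the helper instead of open-coding the
initializer.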

On Thu, Oct 1, 2009 at 23:08, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Eric Dumazet a écrit :
>> Here is a followup on this area, thanks.
>>
>> [RFC] af_packet: fill skb->mark at xmit
>>
>> skb->mark may be used by classifiers, so fill it in case user
>> set a SO_MARK option on socket.
>>
>
> Maybe a more generic way to handle this for various protocols
> would be to fill skb->mark in sock_alloc_send_pskb()
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply

* Re: 2.6.32-rc1-git2: Reported regressions from 2.6.31
From: Rafael J. Wysocki @ 2009-10-02 17:32 UTC (permalink / raw)
  To: Stefan Richter
  Cc: Jaswinder Singh Rajput, Linux Kernel Mailing List, Adrian Bunk,
	Andrew Morton, Linus Torvalds, Natalie Protasevich,
	Kernel Testers List, Network Development, Linux ACPI,
	Linux PM List, Linux SCSI List, Linux Wireless List, DRI
In-Reply-To: <4AC5F975.6060505@s5r6.in-berlin.de>

On Friday 02 October 2009, Stefan Richter wrote:
> Jaswinder Singh Rajput wrote:
> > If you add one more entry say "Suspected commit :" then it will be great
> > and will solve regressions much faster.
> 
> Will?  Might.

In fact I add the "First-Bad-Commit" annotation where there is a bisection
result or it's possible to fix things by reverting a specific commit.

> > You can request submitter to
> > submit 'suspected commit' by git bisect and also specify git bisect
> > links like : (for more information about git bisect check
> > http://kerneltrap.org/node/11753)
> 
> I disagree.  A reporter should only be asked to bisect (using git or
> other tools) /if/ a developer determined that bisection may speed up the
> debugging process or is the only remaining option to make progress with
> a bug.
> 
> It would be wrong to steal a reporter's valuable time by asking for
> bisection before anybody familiar with the matter even had a first look
> at the report.

Agreed.

Thanks,
Rafael

^ permalink raw reply

* Re: [PATCH] TCPCT-1: adding a sysctl
From: William Allen Simpson @ 2009-10-02 17:52 UTC (permalink / raw)
  To: netdev
In-Reply-To: <4AC61505.8030701@gmail.com>

William Allen Simpson wrote:
> This is a straightforward re-implementation of an earlier patch, that no
> longer applies cleanly, that was reviewed:
> 
>   http://thread.gmane.org/gmane.linux.network/102586
> 
In that thread, David Miller wrote:

   "This looks mostly fine to me.  I would even advocate not using a config
   option for this."

It would make the code look cleaner, and with the sysctl instead, it
would probably be fine.  But SYN cookies have both.

Before I go much further, I'd like guidance.

^ permalink raw reply

