Netdev List
 help / color / mirror / Atom feed
* [PATCH 9/9] Fix leaking of kernel heap addresses in net/
From: Dan Rosenberg @ 2010-11-07 16:33 UTC (permalink / raw)
  To: chas, davem, kuznet, pekkas, jmorris, yoshfuji, kaber,
	remi.denis-courmont
  Cc: netdev, security, stable

Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>

diff -urp linux-2.6.37-rc1.orig/net/unix/af_unix.c linux-2.6.37-rc1/net/unix/af_unix.c
--- linux-2.6.37-rc1.orig/net/unix/af_unix.c	2010-11-01 07:54:12.000000000 -0400
+++ linux-2.6.37-rc1/net/unix/af_unix.c	2010-11-07 10:46:27.000000000 -0500
@@ -353,7 +353,8 @@ static void unix_sock_destructor(struct 
 	WARN_ON(!sk_unhashed(sk));
 	WARN_ON(sk->sk_socket);
 	if (!sock_flag(sk, SOCK_DEAD)) {
-		printk(KERN_INFO "Attempt to release alive unix socket: %p\n", sk);
+		printk(KERN_INFO "Attempt to release alive unix socket: %lu\n",
+			sock_i_ino(sk));
 		return;
 	}
 
@@ -2198,8 +2199,8 @@ static int unix_seq_show(struct seq_file
 		struct unix_sock *u = unix_sk(s);
 		unix_state_lock(s);
 
-		seq_printf(seq, "%p: %08X %08X %08X %04X %02X %5lu",
-			s,
+		seq_printf(seq, "%d: %08X %08X %08X %04X %02X %5lu",
+			0,
 			atomic_read(&s->sk_refcnt),
 			0,
 			s->sk_state == TCP_LISTEN ? __SO_ACCEPTCON : 0,




^ permalink raw reply

* Netlink limitations
From: Jan Engelhardt @ 2010-11-07 16:44 UTC (permalink / raw)
  To: David S. Miller; +Cc: pablo, netdev

Hi,


we mentioned it only briefly at the Netfilter workshop a few weeks ago, 
but as I am trying to figure out how to use Netlink in Xtables, 
Netlink's limitations really start ruining my day.

The well-known issue is that NL messages the kernel is supposed to 
receive have a max size of 64K, due to nlmsghdr's use of uint16_t. This 
is very problematic because attributes can easily amass more than 64K. 
Think of a chain full of rules, represented by a top-level attribute 
that nests attributes. The problem is bidirectional, a table 
dump has the same problem.

A further problem seems to be that the kernel does not seem to have 
support for receiving NLM_F_MULTI messages, so even assuming chains were 
just 40K, one cannot atomically replace an entire table with 2 chains of 
40K each. Trying to slap transaction support on _top_ of netlink is not 
going to work with the current implementation, because there is no 
notification of when the socket is closed before a NLMSG_DONE has been 
sent.

What I would also like is streaming support, i.e. that I can tag an 
attribute container (one that has nested attrs) with .len = -1 to define 
that the end of the container is given not by .len, but by a stop 
marker.

Hacks like nfnetlink or genetlink also seem unnecessary to me, and the 
limit of MAX_LINKS=32 most likely just stems from nl_table being an 
array that is not very sparse.

Perhaps it is time to replace Netlink by something new?
Trying to elicit some opinions.


Jan

^ permalink raw reply

* Re: [PATCH 0/9] Fix leaking of kernel heap addresses in net/
From: Eric Dumazet @ 2010-11-07 17:03 UTC (permalink / raw)
  To: Dan Rosenberg
  Cc: chas, davem, kuznet, pekkas, jmorris, yoshfuji, kaber,
	remi.denis-courmont, netdev, security, stable
In-Reply-To: <1289147492.3090.137.camel@Dan>

Le dimanche 07 novembre 2010 à 11:31 -0500, Dan Rosenberg a écrit :
> This patch series resolves the leakage of kernel heap addresses to
> userspace via network protocol /proc interfaces and public error
> messages.  Revealing this information is a bad idea from a security
> perspective for a number of reasons, the most obvious of which is it
> provides unprivileged users a mechanism by which to create a structure
> in the kernel heap containing function pointers, obtain the address of
> that structure, and overwrite those function pointers by leveraging
> other vulnerabilities.  It is my hope that by eliminating this
> information leakage, in conjunction with making statically-declared
> function pointer tables read-only (to be done in a separate patch
> series), we can at least add a small hurdle for the exploitation of a
> subset of kernel vulnerabilities.
> 
> To maintain compatibility with userspace programs relying on
> consistent /proc output, the output descriptions and number of fields
> are not changed.  When a unique identifier for the socket is desired,
> the socket address has been replaced with the socket inode number.  When
> the inode number is already present in the output, the address has been
> replaced with a 0.  In these cases, the format specifier has been
> changed to %d, because a %p output of 0 from kernel space is written as
> "(null)", while userspace %p can only parse "(nil)".
> 

NACK

Thats a pretty stupid patch series, sorry.

You are basically ruining a lot of debugging facilities we use every day
to find and fix _real_ bugs. The bugs that happen to crash machines of
our customers.

If you want to avoid a user reading kernel syslog, why dont you fix the
problem for non root users able to "dmesg" ? I personally dont care.

I am a root user on my machine, I _want_ to have some pretty basic
informations so that I can work on it, and I believe my work is useful.

There are pretty easy ways to not disclose "information", but your way
of using '0' for all values is the dumbest idea one could ever had.

A single XOR with a "root only visible, random value chosen at boot"
would be OK. At least we could continue our work, with litle burden.



^ permalink raw reply

* Re: [PATCH 3/9] Fix leaking of kernel heap addresses in net/
From: Eric Dumazet @ 2010-11-07 17:11 UTC (permalink / raw)
  To: Dan Rosenberg
  Cc: chas, davem, kuznet, pekkas, jmorris, yoshfuji, kaber,
	remi.denis-courmont, netdev, security, stable
In-Reply-To: <1289147508.3090.140.camel@Dan>

Le dimanche 07 novembre 2010 à 11:31 -0500, Dan Rosenberg a écrit :
> Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
> 
> diff -urp linux-2.6.37-rc1.orig/net/ipv4/af_inet.c linux-2.6.37-rc1/net/ipv4/af_inet.c
> --- linux-2.6.37-rc1.orig/net/ipv4/af_inet.c	2010-11-01 07:54:12.000000000 -0400
> +++ linux-2.6.37-rc1/net/ipv4/af_inet.c	2010-11-07 10:31:41.000000000 -0500
> @@ -139,12 +139,13 @@ void inet_sock_destruct(struct sock *sk)
>  	sk_mem_reclaim(sk);
>  
>  	if (sk->sk_type == SOCK_STREAM && sk->sk_state != TCP_CLOSE) {
> -		pr_err("Attempt to release TCP socket in state %d %p\n",
> -		       sk->sk_state, sk);
> +		pr_err("Attempt to release TCP socket in state %d %lu\n",
> +		       sk->sk_state, sock_i_ino(sk));
>  		return;
>  	}

Why a non root user can read this debugging stuff, I ask.

I suggest "dmesg" for them sends them a Windows XP system log,
and /proc/cpuinfo pretends machine is a Z80, somewhat modified to have
65536 cores.




^ permalink raw reply

* Re: [Security] [SECURITY] Fix leaking of kernel heap addresses via /proc
From: Ben Hutchings @ 2010-11-07 17:11 UTC (permalink / raw)
  To: Oliver Hartkopp
  Cc: David Miller, torvalds, drosenberg, chas, kuznet, pekkas, jmorris,
	yoshfuji, kaber, remi.denis-courmont, netdev, security
In-Reply-To: <4CD67F38.1060506@hartkopp.net>

[-- Attachment #1: Type: text/plain, Size: 969 bytes --]

On Sun, 2010-11-07 at 11:28 +0100, Oliver Hartkopp wrote:
[...]
> The layout break was ok in this case as the people using the CAN procfs stuff
> do this only when facing problems (with their applications) at runtime.
> 
> A discussed approach that won't break the procfs layout was to set the values
> to "0" and only fill them with real content depending on CONFIG_DEBUG_INFO .
> 
> Would that fit here?
> 
> Or maybe a different config option CONFIG_DEBUG_KERNEL_ADDR would do the job,
> as i don't know which distros enable CONFIG_DEBUG_INFO by default ...

I believe most distributions now enable CONFIG_DEBUG_INFO (though the
debug information is then stripped and shipped in separate packages).  I
also dislike this idea of an implicit trade-off between debugging and
security; such a trade-off may be necessary but then it should be
explicit.

Ben.

-- 
Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply

* Re: Netlink limitations
From: Patrick McHardy @ 2010-11-07 17:17 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: David S. Miller, pablo, netdev
In-Reply-To: <alpine.LNX.2.01.1011071743070.21289@obet.zrqbmnf.qr>

On 07.11.2010 17:44, Jan Engelhardt wrote:
> we mentioned it only briefly at the Netfilter workshop a few weeks ago, 
> but as I am trying to figure out how to use Netlink in Xtables, 
> Netlink's limitations really start ruining my day.
> 
> The well-known issue is that NL messages the kernel is supposed to 
> receive have a max size of 64K, due to nlmsghdr's use of uint16_t. This 
> is very problematic because attributes can easily amass more than 64K. 
> Think of a chain full of rules, represented by a top-level attribute 
> that nests attributes. The problem is bidirectional, a table 
> dump has the same problem.

Messages are not limited to 64k, individual attributes are. Holger
started working on a nlattr32, which uses 32 bit for the length
value.

> A further problem seems to be that the kernel does not seem to have 
> support for receiving NLM_F_MULTI messages, so even assuming chains were 
> just 40K, one cannot atomically replace an entire table with 2 chains of 
> 40K each. Trying to slap transaction support on _top_ of netlink is not 
> going to work with the current implementation, because there is no 
> notification of when the socket is closed before a NLMSG_DONE has been 
> sent.

There is, search for NETLINK_URELEASE in af_netlink.c. With 32 bit
attribute lengths this should not be needed anymore however.

> What I would also like is streaming support, i.e. that I can tag an 
> attribute container (one that has nested attrs) with .len = -1 to define 
> that the end of the container is given not by .len, but by a stop 
> marker.

That's somewhat similar to the nlattr32 idea, but a length of 0 makes
more sense since that's currently not used. In that case the length
would be read from a second length field which has 32 bits.

^ permalink raw reply

* [PATCH 1/5] Staging: ath6kl: Fix pointer casts on 64-bit architectures
From: Ben Hutchings @ 2010-11-07 17:19 UTC (permalink / raw)
  To: Greg Kroah-Hartman; +Cc: Vipin Mehta, devel, netdev

Remove unnecessary cast of firmware base address to integer before
adding an offset.

Fix direct use of sk_buff::network_header which is an offset rather
than a pointer on 64-bit architectures.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
---
Compile-tested only.  This should be 2.6.37 material.

Ben.

 drivers/staging/ath6kl/os/linux/ar6000_drv.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/staging/ath6kl/os/linux/ar6000_drv.c b/drivers/staging/ath6kl/os/linux/ar6000_drv.c
index c5a6d6c..a659f70 100644
--- a/drivers/staging/ath6kl/os/linux/ar6000_drv.c
+++ b/drivers/staging/ath6kl/os/linux/ar6000_drv.c
@@ -1126,7 +1126,7 @@ ar6000_transfer_bin_file(AR_SOFTC_T *ar, AR6K_BIN_FILE file, A_UINT32 address, A
         if ((board_ext_address) && (fw_entry->size == (board_data_size + board_ext_data_size))) {
             A_UINT32 param;
 
-            status = BMIWriteMemory(ar->arHifDevice, board_ext_address, (A_UCHAR *)(((A_UINT32)fw_entry->data) + board_data_size), board_ext_data_size);
+            status = BMIWriteMemory(ar->arHifDevice, board_ext_address, (A_UCHAR *)(fw_entry->data + board_data_size), board_ext_data_size);
 
             if (status != A_OK) {
                 AR_DEBUG_PRINTF(ATH_DEBUG_ERR, ("BMI operation failed: %d\n", __LINE__));
@@ -3030,7 +3030,8 @@ ar6000_data_tx(struct sk_buff *skb, struct net_device *dev)
         A_UINT8 csumDest=0;
         A_UINT8 csum=skb->ip_summed;
         if(csumOffload && (csum==CHECKSUM_PARTIAL)){
-            csumStart=skb->csum_start-(skb->network_header-skb->head)+sizeof(ATH_LLC_SNAP_HDR);
+            csumStart = (skb->head + skb->csum_start - skb_network_header(skb) +
+			 sizeof(ATH_LLC_SNAP_HDR));
             csumDest=skb->csum_offset+csumStart;
         }
 #endif
-- 
1.7.2.3




^ permalink raw reply related

* Re: [PATCH 0/9] Fix leaking of kernel heap addresses in net/
From: Dan Rosenberg @ 2010-11-07 17:25 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: chas, davem, kuznet, pekkas, jmorris, yoshfuji, kaber,
	remi.denis-courmont, netdev, security, stable
In-Reply-To: <1289149416.2478.143.camel@edumazet-laptop>


> NACK
> 
> Thats a pretty stupid patch series, sorry.
> 

I think it might be more constructive to avoid childish name-calling and
instead try to guide the conversation in a way that produces a patch
that would better fit your needs.  Even if you don't agree with the
approach, it's certainly not "stupid".

> You are basically ruining a lot of debugging facilities we use every day
> to find and fix _real_ bugs. The bugs that happen to crash machines of
> our customers.

I'm going to give you the benefit of the doubt and assume you're not
implying that security issues aren't "real" bugs, because that would be
utterly ridiculous.

> 
> If you want to avoid a user reading kernel syslog, why dont you fix the
> problem for non root users able to "dmesg" ? I personally dont care.
> 

This is simply the reality of the current situation.  At least while the
kernel syslog is available to unprivileged users, we need to be more
careful of what is visible through there.

> I am a root user on my machine, I _want_ to have some pretty basic
> informations so that I can work on it, and I believe my work is useful.
> 
> There are pretty easy ways to not disclose "information", but your way
> of using '0' for all values is the dumbest idea one could ever had.

I'm glad I'm capable of producing "the dumbest idea one could ever had".
You seem to be quite set on convincing unpaid volunteers such as myself
to stop sending in patches.

> 
> A single XOR with a "root only visible, random value chosen at boot"
> would be OK. At least we could continue our work, with litle burden.

Finally, a useful contribution.  I'll consider this option after hearing
from a few more people on the subject.

-Dan


^ permalink raw reply

* Re: [PATCH 0/9] Fix leaking of kernel heap addresses in net/
From: Eric Dumazet @ 2010-11-07 17:40 UTC (permalink / raw)
  To: Dan Rosenberg
  Cc: chas, davem, kuznet, pekkas, jmorris, yoshfuji, kaber,
	remi.denis-courmont, netdev, security, stable
In-Reply-To: <1289150714.3090.158.camel@Dan>

Le dimanche 07 novembre 2010 à 12:25 -0500, Dan Rosenberg a écrit :
> > NACK
> > 
> > Thats a pretty stupid patch series, sorry.
> > 
> 
> I think it might be more constructive to avoid childish name-calling and
> instead try to guide the conversation in a way that produces a patch
> that would better fit your needs.  Even if you don't agree with the
> approach, it's certainly not "stupid".
> 

It is stupid. Really Dan. The idea is stupid, not you.

> > You are basically ruining a lot of debugging facilities we use every day
> > to find and fix _real_ bugs. The bugs that happen to crash machines of
> > our customers.
> 
> I'm going to give you the benefit of the doubt and assume you're not
> implying that security issues aren't "real" bugs, because that would be
> utterly ridiculous.
> 

So what ? Because of security, we must accept even stupid patches ?

> > 
> > If you want to avoid a user reading kernel syslog, why dont you fix the
> > problem for non root users able to "dmesg" ? I personally dont care.
> > 
> 
> This is simply the reality of the current situation.  At least while the
> kernel syslog is available to unprivileged users, we need to be more
> careful of what is visible through there.
> 

So instead of fixing the problem, you are going to change thousand of
kernel printk() ?

> > I am a root user on my machine, I _want_ to have some pretty basic
> > informations so that I can work on it, and I believe my work is useful.
> > 
> > There are pretty easy ways to not disclose "information", but your way
> > of using '0' for all values is the dumbest idea one could ever had.
> 
> I'm glad I'm capable of producing "the dumbest idea one could ever had".
> You seem to be quite set on convincing unpaid volunteers such as myself
> to stop sending in patches.
> 

I am unpaid volunteer too.

I also had stupid ideas, and other guys said so.

So what ? Should I continue contributing to Linux, or assume I am stupid
and stop ?

> > 
> > A single XOR with a "root only visible, random value chosen at boot"
> > would be OK. At least we could continue our work, with litle burden.
> 
> Finally, a useful contribution.  I'll consider this option after hearing
> from a few more people on the subject.

I am glad you like it. But it also may a _very_ stupid idea. You really
want to have a _lot_ of agreement before even considering it.





^ permalink raw reply

* Re: netconf notes and materials
From: Rami Rosen @ 2010-11-07 19:06 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20091007.044718.233642056.davem@davemloft.net>

David,
1)  Great, thanks for the link!

2)  Regarding your "Linux Networking Futures 2010" slides :
 You mention in the fifth slide :
"XFS is in review state, 2.6.38 likely'.

I suppose you probably mean "XPS" patches,
the transmit Packet Steering patches by
Tom Herbert. Or am I wrong and don't know something ?

Rgs,
Rami Rosen



On Wed, Oct 7, 2009 at 1:47 PM, David Miller <davem@davemloft.net> wrote:
>
> Just a note that all of the available notes and slide etc.
> materials are available for netconf2009 at:
>
>        http://vger.kernel.org/netconf2009.html
>
> Enjoy.
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply

* [PATCH 1/2 v5] xps: Improvements in TX queue selection
From: Tom Herbert @ 2010-11-07 19:52 UTC (permalink / raw)
  To: davem, netdev; +Cc: eric.dumazet

In dev_pick_tx, don't do work in calculating queue index or setting
the index in the sock unless the device has more than one queue.  This
allows the sock to be set only with a queue index of a multi-queue
device which is desirable if device are stacked like in a tunnel.

We also allow the mapping of a socket to queue to be changed.  To
maintain in order packet transmission a flag (ooo_okay) has been
added to the sk_buff structure.  If a transport layer sets this flag
on a packet, the transmit queue can be changed for the socket.
Presumably, the transport would set this if there was no possbility
of creating OOO packets (for instance, there are no packets in flight
for the socket).  This patch includes the modification in TCP output
for setting this flag.

Signed-off-by: Tom Herbert <therbert@google.com>
---
 include/linux/skbuff.h |    3 ++-
 net/core/dev.c         |   18 +++++++++++-------
 net/ipv4/tcp_output.c  |    5 ++++-
 3 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index e6ba898..19f37a6 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -386,9 +386,10 @@ struct sk_buff {
 #else
 	__u8			deliver_no_wcard:1;
 #endif
+	__u8			ooo_okay:1;
 	kmemcheck_bitfield_end(flags2);
 
-	/* 0/14 bit hole */
+	/* 0/13 bit hole */
 
 #ifdef CONFIG_NET_DMA
 	dma_cookie_t		dma_cookie;
diff --git a/net/core/dev.c b/net/core/dev.c
index 0dd54a6..0ff9df6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2125,20 +2125,24 @@ static struct netdev_queue *dev_pick_tx(struct net_device *dev,
 	int queue_index;
 	const struct net_device_ops *ops = dev->netdev_ops;
 
-	if (ops->ndo_select_queue) {
+	if (dev->real_num_tx_queues == 1)
+		queue_index = 0;
+	else if (ops->ndo_select_queue) {
 		queue_index = ops->ndo_select_queue(dev, skb);
 		queue_index = dev_cap_txqueue(dev, queue_index);
 	} else {
 		struct sock *sk = skb->sk;
 		queue_index = sk_tx_queue_get(sk);
-		if (queue_index < 0 || queue_index >= dev->real_num_tx_queues) {
 
-			queue_index = 0;
-			if (dev->real_num_tx_queues > 1)
-				queue_index = skb_tx_hash(dev, skb);
+		if (queue_index < 0 || skb->ooo_okay ||
+		    queue_index >= dev->real_num_tx_queues) {
+			int old_index = queue_index;
 
-			if (sk) {
-				struct dst_entry *dst = rcu_dereference_check(sk->sk_dst_cache, 1);
+			queue_index = skb_tx_hash(dev, skb);
+
+			if (queue_index != old_index && sk) {
+				struct dst_entry *dst =
+				    rcu_dereference_check(sk->sk_dst_cache, 1);
 
 				if (dst && skb_dst(skb) == dst)
 					sk_tx_queue_set(sk, queue_index);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 05b1ecf..2b6eb36 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -822,8 +822,11 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 							   &md5);
 	tcp_header_size = tcp_options_size + sizeof(struct tcphdr);
 
-	if (tcp_packets_in_flight(tp) == 0)
+	if (tcp_packets_in_flight(tp) == 0) {
 		tcp_ca_event(sk, CA_EVENT_TX_START);
+		skb->ooo_okay = 1;
+	} else
+		skb->ooo_okay = 0;
 
 	skb_push(skb, tcp_header_size);
 	skb_reset_transport_header(skb);
-- 
1.7.3.1


^ permalink raw reply related

* [PATCH 2/2 v5] xps: Transmit Packet Steering
From: Tom Herbert @ 2010-11-07 19:52 UTC (permalink / raw)
  To: davem, netdev; +Cc: eric.dumazet

This patch implements transmit packet steering (XPS) for multiqueue
devices.  XPS selects a transmit queue during packet transmission based
on configuration.  This is done by mapping the CPU transmitting the
packet to a queue.  This is the transmit side analogue to RPS-- where
RPS is selecting a CPU based on receive queue, XPS selects a queue
based on the CPU (previously there was an XPS patch from Eric
Dumazet, but that might more appropriately be called transmit completion
steering).

Each transmit queue can be associated with a number of CPUs which will
use the queue to send packets.  This is configured as a CPU mask on a
per queue basis in:

/sys/class/net/eth<n>/queues/tx-<n>/xps_cpus

The mappings are stored per device in an inverted data structure that
maps CPUs to queues.  In the netdevice structure this is an array of
num_possible_cpu structures where each structure holds and array of
queue_indexes for queues which that CPU can use.

The benefits of XPS are improved locality in the per queue data
structures.  Also, transmit completions are more likely to be done
nearer to the sending thread, so this should promote locality back
to the socket on free (e.g. UDP).  The benefits of XPS are dependent on
cache hierarchy, application load, and other factors.  XPS would
nominally be configured so that a queue would only be shared by CPUs
which are sharing a cache, the degenerative configuration woud be that
each CPU has it's own queue.

Below are some benchmark results which show the potential benfit of
this patch.  The netperf test has 500 instances of netperf TCP_RR test
with 1 byte req. and resp.

bnx2x on 16 core AMD
   XPS (16 queues, 1 TX queue per CPU)  1234K at 100% CPU
   No XPS (16 queues)                   996K at 100% CPU

Signed-off-by: Tom Herbert <therbert@google.com>
---
 include/linux/netdevice.h |   32 ++++
 net/core/dev.c            |   54 +++++++-
 net/core/net-sysfs.c      |  367 ++++++++++++++++++++++++++++++++++++++++++++-
 net/core/net-sysfs.h      |    3 +
 4 files changed, 450 insertions(+), 6 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 072652d..b2ea7c0 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -503,6 +503,13 @@ struct netdev_queue {
 	struct Qdisc		*qdisc;
 	unsigned long		state;
 	struct Qdisc		*qdisc_sleeping;
+#ifdef CONFIG_RPS
+	struct netdev_queue	*first;
+	atomic_t		count;
+	struct xps_dev_maps	*xps_maps;
+	struct kobject		kobj;
+#endif
+
 /*
  * write mostly part
  */
@@ -530,6 +537,31 @@ struct rps_map {
 #define RPS_MAP_SIZE(_num) (sizeof(struct rps_map) + (_num * sizeof(u16)))
 
 /*
+ * This structure holds an XPS map which can be of variable length.  The
+ * map is an array of queues.
+ */
+struct xps_map {
+	unsigned int len;
+	unsigned int alloc_len;
+	struct rcu_head rcu;
+	u16 queues[0];
+};
+#define XPS_MAP_SIZE(_num) (sizeof(struct xps_map) + (_num * sizeof(u16)))
+#define XPS_MIN_MAP_ALLOC ((L1_CACHE_BYTES - sizeof(struct xps_map))	\
+    / sizeof(u16))
+
+/*
+ * This structure holds all XPS maps for device.  Maps are indexed by CPU.
+ */
+struct xps_dev_maps {
+	struct rcu_head rcu;
+	struct xps_map *cpu_map[0];
+};
+#define XPS_DEV_MAPS_SIZE (sizeof(struct xps_dev_maps) +		\
+    (nr_cpu_ids * sizeof(struct xps_map *)))
+#define netdev_get_xps_maps(dev) ((dev)->_tx[0].xps_maps)
+
+/*
  * The rps_dev_flow structure contains the mapping of a flow to a CPU and the
  * tail pointer for that CPU's input queue at the time of last enqueue.
  */
diff --git a/net/core/dev.c b/net/core/dev.c
index 0ff9df6..502aa50 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2119,6 +2119,43 @@ static inline u16 dev_cap_txqueue(struct net_device *dev, u16 queue_index)
 	return queue_index;
 }
 
+static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
+{
+#ifdef CONFIG_RPS
+	struct xps_dev_maps *dev_maps;
+	struct xps_map *map;
+	int queue_index = -1;
+
+	rcu_read_lock();
+	dev_maps = rcu_dereference(netdev_get_xps_maps(dev));
+	if (dev_maps) {
+		map = rcu_dereference(
+		    dev_maps->cpu_map[raw_smp_processor_id()]);
+		if (map) {
+			if (map->len == 1)
+				queue_index = map->queues[0];
+			else {
+				u32 hash;
+				if (skb->sk && skb->sk->sk_hash)
+					hash = skb->sk->sk_hash;
+				else
+					hash = (__force u16) skb->protocol ^
+					    skb->rxhash;
+				hash = jhash_1word(hash, hashrnd);
+				queue_index = map->queues[
+				    ((u64)hash * map->len) >> 32];
+			}
+			if (unlikely(queue_index >= dev->real_num_tx_queues))
+				queue_index = -1;
+		}
+	}
+	rcu_read_unlock();
+
+	return queue_index;
+#endif
+	return -1;
+}
+
 static struct netdev_queue *dev_pick_tx(struct net_device *dev,
 					struct sk_buff *skb)
 {
@@ -2138,7 +2175,9 @@ static struct netdev_queue *dev_pick_tx(struct net_device *dev,
 		    queue_index >= dev->real_num_tx_queues) {
 			int old_index = queue_index;
 
-			queue_index = skb_tx_hash(dev, skb);
+			queue_index = get_xps_queue(dev, skb);
+			if (queue_index < 0)
+				queue_index = skb_tx_hash(dev, skb);
 
 			if (queue_index != old_index && sk) {
 				struct dst_entry *dst =
@@ -5057,6 +5096,17 @@ static int netif_alloc_netdev_queues(struct net_device *dev)
 		return -ENOMEM;
 	}
 	dev->_tx = tx;
+#ifdef CONFIG_RPS
+	/*
+	 * Set a pointer to first element in the array which holds the
+	 * reference count.
+	 */
+	{
+		int i;
+		for (i = 0; i < count; i++)
+			tx[i].first = tx;
+	}
+#endif
 	return 0;
 }
 
@@ -5621,7 +5671,9 @@ void free_netdev(struct net_device *dev)
 
 	release_net(dev_net(dev));
 
+#ifndef CONFIG_RPS
 	kfree(dev->_tx);
+#endif
 
 	kfree(rcu_dereference_raw(dev->ingress_queue));
 
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index a5ff5a8..aa77d7f 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -770,18 +770,375 @@ net_rx_queue_update_kobjects(struct net_device *net, int old_num, int new_num)
 	return error;
 }
 
-static int rx_queue_register_kobjects(struct net_device *net)
+/*
+ * netdev_queue sysfs structures and functions.
+ */
+struct netdev_queue_attribute {
+	struct attribute attr;
+	ssize_t (*show)(struct netdev_queue *queue,
+	    struct netdev_queue_attribute *attr, char *buf);
+	ssize_t (*store)(struct netdev_queue *queue,
+	    struct netdev_queue_attribute *attr, const char *buf, size_t len);
+};
+#define to_netdev_queue_attr(_attr) container_of(_attr,		\
+    struct netdev_queue_attribute, attr)
+
+#define to_netdev_queue(obj) container_of(obj, struct netdev_queue, kobj)
+
+static ssize_t netdev_queue_attr_show(struct kobject *kobj,
+				      struct attribute *attr, char *buf)
+{
+	struct netdev_queue_attribute *attribute = to_netdev_queue_attr(attr);
+	struct netdev_queue *queue = to_netdev_queue(kobj);
+
+	if (!attribute->show)
+		return -EIO;
+
+	return attribute->show(queue, attribute, buf);
+}
+
+static ssize_t netdev_queue_attr_store(struct kobject *kobj,
+				       struct attribute *attr,
+				       const char *buf, size_t count)
+{
+	struct netdev_queue_attribute *attribute = to_netdev_queue_attr(attr);
+	struct netdev_queue *queue = to_netdev_queue(kobj);
+
+	if (!attribute->store)
+		return -EIO;
+
+	return attribute->store(queue, attribute, buf, count);
+}
+
+static const struct sysfs_ops netdev_queue_sysfs_ops = {
+	.show = netdev_queue_attr_show,
+	.store = netdev_queue_attr_store,
+};
+
+static inline unsigned int get_netdev_queue_index(struct netdev_queue *queue)
+{
+	struct net_device *dev = queue->dev;
+	int i;
+
+	for (i = 0; i < dev->num_tx_queues; i++)
+		if (queue == &dev->_tx[i])
+			break;
+
+	BUG_ON(i >= dev->num_tx_queues);
+
+	return i;
+}
+
+
+static ssize_t show_xps_map(struct netdev_queue *queue,
+			    struct netdev_queue_attribute *attribute, char *buf)
+{
+	struct netdev_queue *first = queue->first;
+	struct xps_dev_maps *dev_maps;
+	cpumask_var_t mask;
+	unsigned long index;
+	size_t len = 0;
+	int i;
+
+	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
+		return -ENOMEM;
+
+	index = get_netdev_queue_index(queue);
+
+	rcu_read_lock();
+	dev_maps = rcu_dereference(first->xps_maps);
+	if (dev_maps) {
+		for_each_possible_cpu(i) {
+			struct xps_map *map =
+			    rcu_dereference(dev_maps->cpu_map[i]);
+			if (map) {
+				int j;
+				for (j = 0; j < map->len; j++) {
+					if (map->queues[j] == index) {
+						cpumask_set_cpu(i, mask);
+						break;
+					}
+				}
+			}
+		}
+	}
+	len += cpumask_scnprintf(buf + len, PAGE_SIZE, mask);
+	if (PAGE_SIZE - len < 3) {
+		rcu_read_unlock();
+		free_cpumask_var(mask);
+		return -EINVAL;
+	}
+	rcu_read_unlock();
+
+	free_cpumask_var(mask);
+	len += sprintf(buf + len, "\n");
+	return len;
+}
+
+static void xps_map_release(struct rcu_head *rcu)
+{
+	struct xps_map *map = container_of(rcu, struct xps_map, rcu);
+
+	kfree(map);
+}
+
+static void xps_dev_maps_release(struct rcu_head *rcu)
+{
+	struct xps_dev_maps *dev_maps =
+	    container_of(rcu, struct xps_dev_maps, rcu);
+
+	kfree(dev_maps);
+}
+
+static DEFINE_MUTEX(xps_map_mutex);
+
+static ssize_t store_xps_map(struct netdev_queue *queue,
+		      struct netdev_queue_attribute *attribute,
+		      const char *buf, size_t len)
+{
+	struct netdev_queue *first = queue->first;
+	cpumask_var_t mask;
+	int err, i, cpu, pos, map_len, alloc_len, need_set;
+	unsigned long index;
+	struct xps_map *map, *new_map;
+	struct xps_dev_maps *dev_maps, *new_dev_maps;
+	int nonempty = 0;
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
+		return -ENOMEM;
+
+	index = get_netdev_queue_index(queue);
+
+	err = bitmap_parse(buf, len, cpumask_bits(mask), nr_cpumask_bits);
+	if (err) {
+		free_cpumask_var(mask);
+		return err;
+	}
+
+	new_dev_maps = kzalloc(max_t(unsigned,
+	    XPS_DEV_MAPS_SIZE, L1_CACHE_BYTES), GFP_KERNEL);
+	if (!new_dev_maps) {
+		free_cpumask_var(mask);
+		return err;
+	}
+
+	mutex_unlock(&xps_map_mutex);
+
+	dev_maps = first->xps_maps;
+
+	for_each_possible_cpu(cpu) {
+		new_map = map = dev_maps ? dev_maps->cpu_map[cpu] : NULL;
+
+		if (map) {
+			for (pos = 0; pos < map->len; pos++)
+				if (map->queues[pos] == index)
+					break;
+			map_len = map->len;
+			alloc_len = map->alloc_len;
+		} else
+			pos = map_len = alloc_len = 0;
+
+		need_set = cpu_isset(cpu, *mask) && cpu_online(cpu);
+
+		if (need_set && pos >= map_len) {
+			/* Need to add queue to this CPU's map */
+			if (map_len >= alloc_len) {
+				alloc_len = alloc_len ?
+				    2 * alloc_len : XPS_MIN_MAP_ALLOC;
+				new_map = kzalloc(XPS_MAP_SIZE(alloc_len),
+				    GFP_KERNEL);
+				if (!new_map)
+					goto error;
+				new_map->alloc_len = alloc_len;
+				for (i = 0; i < map_len; i++)
+					new_map->queues[i] = map->queues[i];
+				new_map->len = map_len;
+			}
+			new_map->queues[new_map->len++] = index;
+		} else if (!need_set && pos < map_len) {
+			/* Need to remove queue from this CPU's map */
+			if (map_len > 1)
+				new_map->queues[pos] =
+				    new_map->queues[--new_map->len];
+			else
+				new_map = NULL;
+		}
+		new_dev_maps->cpu_map[cpu] = new_map;
+	}
+
+	/* Cleanup old maps */
+	for_each_possible_cpu(cpu) {
+		map = dev_maps ? dev_maps->cpu_map[cpu] : NULL;
+		if (map && new_dev_maps->cpu_map[cpu] != map)
+			call_rcu(&map->rcu, xps_map_release);
+		if (new_dev_maps->cpu_map[cpu])
+			nonempty = 1;
+	}
+
+	if (nonempty)
+		rcu_assign_pointer(first->xps_maps, new_dev_maps);
+	else {
+		kfree(new_dev_maps);
+		rcu_assign_pointer(first->xps_maps, NULL);
+	}
+
+	if (dev_maps)
+		call_rcu(&dev_maps->rcu, xps_dev_maps_release);
+
+	mutex_unlock(&xps_map_mutex);
+
+	free_cpumask_var(mask);
+	return len;
+
+error:
+	mutex_unlock(&xps_map_mutex);
+
+	if (new_dev_maps)
+		for_each_possible_cpu(i)
+			kfree(new_dev_maps->cpu_map[i]);
+	kfree(new_dev_maps);
+	free_cpumask_var(mask);
+	return -ENOMEM;
+}
+
+static struct netdev_queue_attribute xps_cpus_attribute =
+    __ATTR(xps_cpus, S_IRUGO | S_IWUSR, show_xps_map, store_xps_map);
+
+static struct attribute *netdev_queue_default_attrs[] = {
+	&xps_cpus_attribute.attr,
+	NULL
+};
+
+static void netdev_queue_release(struct kobject *kobj)
 {
+	struct netdev_queue *queue = to_netdev_queue(kobj);
+	struct netdev_queue *first = queue->first;
+	struct xps_dev_maps *dev_maps;
+	struct xps_map *map;
+	unsigned long index;
+	int i, pos, nonempty = 0;
+
+	index = get_netdev_queue_index(queue);
+
+	mutex_lock(&xps_map_mutex);
+	dev_maps = first->xps_maps;
+
+	for_each_possible_cpu(i) {
+		map  = dev_maps ? dev_maps->cpu_map[i] : NULL;
+		if (!map)
+			continue;
+
+		for (pos = 0; pos < map->len; pos++)
+			if (map->queues[pos] == index)
+				break;
+
+		if (pos < map->len) {
+			if (map->len > 1)
+				map->queues[pos] = map->queues[--map->len];
+			else {
+				rcu_assign_pointer(dev_maps->cpu_map[i],
+				    NULL);
+				call_rcu(&map->rcu, xps_map_release);
+				map = NULL;
+			}
+		}
+
+		if (map)
+			nonempty = 1;
+	}
+
+	if (!nonempty) {
+		rcu_assign_pointer(first->xps_maps, NULL);
+		call_rcu(&dev_maps->rcu, xps_dev_maps_release);
+	}
+	mutex_unlock(&xps_map_mutex);
+
+	if (atomic_dec_and_test(&first->count))
+		kfree(first);
+}
+
+static struct kobj_type netdev_queue_ktype = {
+	.sysfs_ops = &netdev_queue_sysfs_ops,
+	.release = netdev_queue_release,
+	.default_attrs = netdev_queue_default_attrs,
+};
+
+static int netdev_queue_add_kobject(struct net_device *net, int index)
+{
+	struct netdev_queue *queue = net->_tx + index;
+	struct netdev_queue *first = queue->first;
+	struct kobject *kobj = &queue->kobj;
+	int error = 0;
+
+	kobj->kset = net->queues_kset;
+	error = kobject_init_and_add(kobj, &netdev_queue_ktype, NULL,
+	    "tx-%u", index);
+	if (error) {
+		kobject_put(kobj);
+		return error;
+	}
+
+	kobject_uevent(kobj, KOBJ_ADD);
+	atomic_inc(&first->count);
+
+	return error;
+}
+
+int
+netdev_queue_update_kobjects(struct net_device *net, int old_num, int new_num)
+{
+	int i;
+	int error = 0;
+
+	for (i = old_num; i < new_num; i++) {
+		error = netdev_queue_add_kobject(net, i);
+		if (error) {
+			new_num = old_num;
+			break;
+		}
+	}
+
+	while (--i >= new_num)
+		kobject_put(&net->_rx[i].kobj);
+
+	return error;
+}
+
+static int register_queue_kobjects(struct net_device *net)
+{
+	int error = 0, txq = 0, rxq = 0;
+
 	net->queues_kset = kset_create_and_add("queues",
 	    NULL, &net->dev.kobj);
 	if (!net->queues_kset)
 		return -ENOMEM;
-	return net_rx_queue_update_kobjects(net, 0, net->real_num_rx_queues);
+
+	error = net_rx_queue_update_kobjects(net, 0, net->real_num_rx_queues);
+	if (error)
+		goto error;
+	rxq = net->real_num_rx_queues;
+
+	error = netdev_queue_update_kobjects(net, 0,
+					     net->real_num_tx_queues);
+	if (error)
+		goto error;
+	txq = net->real_num_tx_queues;
+
+	return 0;
+
+error:
+	netdev_queue_update_kobjects(net, txq, 0);
+	net_rx_queue_update_kobjects(net, rxq, 0);
+	return error;
 }
 
-static void rx_queue_remove_kobjects(struct net_device *net)
+static void remove_queue_kobjects(struct net_device *net)
 {
 	net_rx_queue_update_kobjects(net, net->real_num_rx_queues, 0);
+	netdev_queue_update_kobjects(net, net->real_num_tx_queues, 0);
 	kset_unregister(net->queues_kset);
 }
 #endif /* CONFIG_RPS */
@@ -884,7 +1241,7 @@ void netdev_unregister_kobject(struct net_device * net)
 	kobject_get(&dev->kobj);
 
 #ifdef CONFIG_RPS
-	rx_queue_remove_kobjects(net);
+	remove_queue_kobjects(net);
 #endif
 
 	device_del(dev);
@@ -925,7 +1282,7 @@ int netdev_register_kobject(struct net_device *net)
 		return error;
 
 #ifdef CONFIG_RPS
-	error = rx_queue_register_kobjects(net);
+	error = register_queue_kobjects(net);
 	if (error) {
 		device_del(dev);
 		return error;
diff --git a/net/core/net-sysfs.h b/net/core/net-sysfs.h
index 778e157..25ec2ee 100644
--- a/net/core/net-sysfs.h
+++ b/net/core/net-sysfs.h
@@ -6,6 +6,9 @@ int netdev_register_kobject(struct net_device *);
 void netdev_unregister_kobject(struct net_device *);
 #ifdef CONFIG_RPS
 int net_rx_queue_update_kobjects(struct net_device *, int old_num, int new_num);
+int netdev_queue_update_kobjects(struct net_device *net,
+				 int old_num, int new_num);
+
 #endif
 
 #endif
-- 
1.7.3.1


^ permalink raw reply related

* Re: [PATCH 2/2 v5] xps: Transmit Packet Steering
From: Eric Dumazet @ 2010-11-07 20:40 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev
In-Reply-To: <alpine.DEB.1.00.1011071146390.29978@pokey.mtv.corp.google.com>

Le dimanche 07 novembre 2010 à 11:52 -0800, Tom Herbert a écrit :
> This patch implements transmit packet steering (XPS) for multiqueue
> devices.  XPS selects a transmit queue during packet transmission based
> on configuration.  This is done by mapping the CPU transmitting the
> packet to a queue.  This is the transmit side analogue to RPS-- where
> RPS is selecting a CPU based on receive queue, XPS selects a queue
> based on the CPU (previously there was an XPS patch from Eric
> Dumazet, but that might more appropriately be called transmit completion
> steering).
> 
> Each transmit queue can be associated with a number of CPUs which will
> use the queue to send packets.  This is configured as a CPU mask on a
> per queue basis in:
> 
> /sys/class/net/eth<n>/queues/tx-<n>/xps_cpus
> 
> The mappings are stored per device in an inverted data structure that
> maps CPUs to queues.  In the netdevice structure this is an array of
> num_possible_cpu structures where each structure holds and array of
> queue_indexes for queues which that CPU can use.
> 
> The benefits of XPS are improved locality in the per queue data
> structures.  Also, transmit completions are more likely to be done
> nearer to the sending thread, so this should promote locality back
> to the socket on free (e.g. UDP).  The benefits of XPS are dependent on
> cache hierarchy, application load, and other factors.  XPS would
> nominally be configured so that a queue would only be shared by CPUs
> which are sharing a cache, the degenerative configuration woud be that
> each CPU has it's own queue.
> 
> Below are some benchmark results which show the potential benfit of
> this patch.  The netperf test has 500 instances of netperf TCP_RR test
> with 1 byte req. and resp.
> 
> bnx2x on 16 core AMD
>    XPS (16 queues, 1 TX queue per CPU)  1234K at 100% CPU
>    No XPS (16 queues)                   996K at 100% CPU
> 
> Signed-off-by: Tom Herbert <therbert@google.com>
> ---
>  include/linux/netdevice.h |   32 ++++
>  net/core/dev.c            |   54 +++++++-
>  net/core/net-sysfs.c      |  367 ++++++++++++++++++++++++++++++++++++++++++++-
>  net/core/net-sysfs.h      |    3 +
>  4 files changed, 450 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 072652d..b2ea7c0 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -503,6 +503,13 @@ struct netdev_queue {
>  	struct Qdisc		*qdisc;
>  	unsigned long		state;
>  	struct Qdisc		*qdisc_sleeping;
> +#ifdef CONFIG_RPS
> +	struct netdev_queue	*first;
> +	atomic_t		count;
> +	struct xps_dev_maps	*xps_maps;

Tom, I still dont understand why *xps_maps is here, and not in
net_device ?

I am asking because netdev_get_xps_maps(dev) might be slowed down
because queue 0 state might change often (__QUEUE_STATE_XOFF)

This means _tx[0] becomes a very hot cache line, needed to access all
queues (from get_xps_queue())

Other than that, your patch seems fine (not tested yet)

Thanks



^ permalink raw reply

* [PATCH] SUNRPC: Simplify rpc_alloc_iostats by removing pointless local variable
From: Jesper Juhl @ 2010-11-07 21:11 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-nfs-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	J. Bruce Fields, Neil Brown, Trond Myklebust, David S. Miller

Hi,

We can simplify net/sunrpc/stats.c::rpc_alloc_iostats() a bit by getting 
rid of the unneeded local variable 'new'.


Please CC me on replies.


Signed-off-by: Jesper Juhl <jj-IYz4IdjRLj0sV2N9l4h3zg@public.gmane.org>
---
 stats.c |    4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git 
a/net/sunrpc/stats.c b/net/sunrpc/stats.c
index f71a731..80df89d 100644
--- a/net/sunrpc/stats.c
+++ b/net/sunrpc/stats.c
@@ -115,9 +115,7 @@ EXPORT_SYMBOL_GPL(svc_seq_show);
  */
 struct rpc_iostats *rpc_alloc_iostats(struct rpc_clnt *clnt)
 {
-	struct rpc_iostats *new;
-	new = kcalloc(clnt->cl_maxproc, sizeof(struct rpc_iostats), GFP_KERNEL);
-	return new;
+	return kcalloc(clnt->cl_maxproc, sizeof(struct rpc_iostats), GFP_KERNEL);
 }
 EXPORT_SYMBOL_GPL(rpc_alloc_iostats);
 


-- 
Jesper Juhl <jj-IYz4IdjRLj0sV2N9l4h3zg@public.gmane.org>             http://www.chaosbits.net/
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please.

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Registration of NGO for a Grant,contact me for more details.
From: NGO REGISTRATION @ 2010-11-07 18:32 UTC (permalink / raw)


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=_iso-2022-jp$ESC, Size: 0 bytes --]



^ permalink raw reply

* [PATCH 0/4] firewire-net updates
From: Stefan Richter @ 2010-11-07 21:38 UTC (permalink / raw)
  To: linux1394-devel; +Cc: netdev, linux-kernel

Here is a repost of yesterday's patches for firewire-net.  I dropped the
error logging limitation.

firewire: net: count stats.tx_packets and stats.tx_bytes
firewire: net: fix memory leaks
firewire: net: replace lists by counters
firewire: net: throttle TX queue before running out of tlabels

 drivers/firewire/net.c |  137 ++++++++++++++++++++++++-----------------
 1 file changed, 83 insertions(+), 54 deletions(-)
-- 
Stefan Richter
-=====-==-=- =-== --===
http://arcgraph.de/sr/



------------------------------------------------------------------------------
The Next 800 Companies to Lead America's Growth: New Video Whitepaper
David G. Thomson, author of the best-selling book "Blueprint to a 
Billion" shares his insights and actions to help propel your 
business during the next growth cycle. Listen Now!
http://p.sf.net/sfu/SAP-dev2dev

^ permalink raw reply

* [PATCH 1/4] firewire: net: count stats.tx_packets and stats.tx_bytes
From: Stefan Richter @ 2010-11-07 21:39 UTC (permalink / raw)
  To: linux1394-devel; +Cc: netdev, linux-kernel
In-Reply-To: <tkrat.036c5d10fb83f85d@s5r6.in-berlin.de>

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
---
 drivers/firewire/net.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

Index: b/drivers/firewire/net.c
===================================================================
--- a/drivers/firewire/net.c
+++ b/drivers/firewire/net.c
@@ -907,6 +907,7 @@ static int fwnet_send_packet(struct fwne
 static void fwnet_transmit_packet_done(struct fwnet_packet_task *ptask)
 {
 	struct fwnet_device *dev = ptask->dev;
+	struct sk_buff *skb = ptask->skb;
 	unsigned long flags;
 	bool free;
 
@@ -917,8 +918,11 @@ static void fwnet_transmit_packet_done(s
 	/* Check whether we or the networking TX soft-IRQ is last user. */
 	free = (ptask->outstanding_pkts == 0 && !list_empty(&ptask->pt_link));
 
-	if (ptask->outstanding_pkts == 0)
+	if (ptask->outstanding_pkts == 0) {
 		list_del(&ptask->pt_link);
+		dev->netdev->stats.tx_packets++;
+		dev->netdev->stats.tx_bytes += skb->len;
+	}
 
 	spin_unlock_irqrestore(&dev->lock, flags);
 
@@ -927,7 +931,6 @@ static void fwnet_transmit_packet_done(s
 		u16 fg_off;
 		u16 datagram_label;
 		u16 lf;
-		struct sk_buff *skb;
 
 		/* Update the ptask to point to the next fragment and send it */
 		lf = fwnet_get_hdr_lf(&ptask->hdr);
@@ -954,7 +957,7 @@ static void fwnet_transmit_packet_done(s
 			datagram_label = fwnet_get_hdr_dgl(&ptask->hdr);
 			break;
 		}
-		skb = ptask->skb;
+
 		skb_pull(skb, ptask->max_payload);
 		if (ptask->outstanding_pkts > 1) {
 			fwnet_make_sf_hdr(&ptask->hdr, RFC2374_HDR_INTFRAG,

-- 
Stefan Richter
-=====-==-=- =-== --===
http://arcgraph.de/sr/



------------------------------------------------------------------------------
The Next 800 Companies to Lead America's Growth: New Video Whitepaper
David G. Thomson, author of the best-selling book "Blueprint to a 
Billion" shares his insights and actions to help propel your 
business during the next growth cycle. Listen Now!
http://p.sf.net/sfu/SAP-dev2dev

^ permalink raw reply

* [PATCH 2/4] firewire: net: fix memory leaks
From: Stefan Richter @ 2010-11-07 21:40 UTC (permalink / raw)
  To: linux1394-devel; +Cc: netdev, linux-kernel
In-Reply-To: <tkrat.036c5d10fb83f85d@s5r6.in-berlin.de>

a) fwnet_transmit_packet_done used to poison ptask->pt_link by list_del.
If fwnet_send_packet checked later whether it was responsible to clean
up (in the border case that the TX soft IRQ was outpaced by the AT-req
tasklet on another CPU), it missed this because ptask->pt_link was no
longer shown as empty.

b) If fwnet_write_complete got an rcode other than RCODE_COMPLETE, we
missed to free the skb and ptask entirely.

Also, count stats.tx_dropped and stats.tx_errors when rcode != 0.

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
---
 drivers/firewire/net.c |   35 +++++++++++++++++++++++++++++++----
 1 file changed, 31 insertions(+), 4 deletions(-)

Index: b/drivers/firewire/net.c
===================================================================
--- a/drivers/firewire/net.c
+++ b/drivers/firewire/net.c
@@ -917,9 +917,10 @@ static void fwnet_transmit_packet_done(s
 
 	/* Check whether we or the networking TX soft-IRQ is last user. */
 	free = (ptask->outstanding_pkts == 0 && !list_empty(&ptask->pt_link));
+	if (free)
+		list_del(&ptask->pt_link);
 
 	if (ptask->outstanding_pkts == 0) {
-		list_del(&ptask->pt_link);
 		dev->netdev->stats.tx_packets++;
 		dev->netdev->stats.tx_bytes += skb->len;
 	}
@@ -974,6 +975,31 @@ static void fwnet_transmit_packet_done(s
 		fwnet_free_ptask(ptask);
 }
 
+static void fwnet_transmit_packet_failed(struct fwnet_packet_task *ptask)
+{
+	struct fwnet_device *dev = ptask->dev;
+	unsigned long flags;
+	bool free;
+
+	spin_lock_irqsave(&dev->lock, flags);
+
+	/* One fragment failed; don't try to send remaining fragments. */
+	ptask->outstanding_pkts = 0;
+
+	/* Check whether we or the networking TX soft-IRQ is last user. */
+	free = !list_empty(&ptask->pt_link);
+	if (free)
+		list_del(&ptask->pt_link);
+
+	dev->netdev->stats.tx_dropped++;
+	dev->netdev->stats.tx_errors++;
+
+	spin_unlock_irqrestore(&dev->lock, flags);
+
+	if (free)
+		fwnet_free_ptask(ptask);
+}
+
 static void fwnet_write_complete(struct fw_card *card, int rcode,
 				 void *payload, size_t length, void *data)
 {
@@ -981,11 +1007,12 @@ static void fwnet_write_complete(struct 
 
 	ptask = data;
 
-	if (rcode == RCODE_COMPLETE)
+	if (rcode == RCODE_COMPLETE) {
 		fwnet_transmit_packet_done(ptask);
-	else
+	} else {
 		fw_error("fwnet_write_complete: failed: %x\n", rcode);
-		/* ??? error recovery */
+		fwnet_transmit_packet_failed(ptask);
+	}
 }
 
 static int fwnet_send_packet(struct fwnet_packet_task *ptask)

-- 
Stefan Richter
-=====-==-=- =-== --===
http://arcgraph.de/sr/


------------------------------------------------------------------------------
The Next 800 Companies to Lead America's Growth: New Video Whitepaper
David G. Thomson, author of the best-selling book "Blueprint to a 
Billion" shares his insights and actions to help propel your 
business during the next growth cycle. Listen Now!
http://p.sf.net/sfu/SAP-dev2dev

^ permalink raw reply

* [PATCH 3/4] firewire: net: replace lists by counters
From: Stefan Richter @ 2010-11-07 21:40 UTC (permalink / raw)
  To: linux1394-devel; +Cc: linux-kernel, netdev
In-Reply-To: <tkrat.036c5d10fb83f85d@s5r6.in-berlin.de>

The current transmit code does not at all make use of
  - fwnet_device.packet_list
and only very limited use of
  - fwnet_device.broadcasted_list,
  - fwnet_device.queued_packets.
Their current function is to track whether the TX soft-IRQ finished
dealing with an skb when the AT-req tasklet takes over, and to discard
pending tx datagrams (if there are any) when the local node is removed.

The latter does actually contain a race condition bug with TX soft-IRQ
and AT-req tasklet.

Instead of these lists and the corresponding link in fwnet_packet_task,
  - a flag in fwnet_packet_task to track whether fwnet_tx is done,
  - a counter of queued datagrams in fwnet_device
do the job as well.

The above mentioned theoretic race condition is resolved by letting
fwnet_remove sleep until all datagrams were flushed.  It may sleep
almost arbitrarily long since fwnet_remove is executed in the context of
a multithreaded (concurrency managed) workqueue.

The type of max_payload is changed to u16 here to avoid waste in struct
fwnet_packet_task.  This value cannot exceed 4096 per IEEE 1394:2008
table 16-18 (or 32678 per specification of packet headers, if there is
ever going to be something else than beta mode).

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
---
 drivers/firewire/net.c |   73 +++++++++++++++--------------------------
 1 file changed, 26 insertions(+), 47 deletions(-)

Index: b/drivers/firewire/net.c
===================================================================
--- a/drivers/firewire/net.c
+++ b/drivers/firewire/net.c
@@ -7,6 +7,7 @@
  */
 
 #include <linux/bug.h>
+#include <linux/delay.h>
 #include <linux/device.h>
 #include <linux/ethtool.h>
 #include <linux/firewire.h>
@@ -170,15 +171,8 @@ struct fwnet_device {
 	struct fw_address_handler handler;
 	u64 local_fifo;
 
-	/* List of packets to be sent */
-	struct list_head packet_list;
-	/*
-	 * List of packets that were broadcasted.  When we get an ISO interrupt
-	 * one of them has been sent
-	 */
-	struct list_head broadcasted_list;
-	/* List of packets that have been sent but not yet acked */
-	struct list_head sent_list;
+	/* Number of tx datagrams that have been queued but not yet acked */
+	int queued_datagrams;
 
 	struct list_head peer_list;
 	struct fw_card *card;
@@ -196,7 +190,7 @@ struct fwnet_peer {
 	unsigned pdg_size;        /* pd_list size */
 
 	u16 datagram_label;       /* outgoing datagram label */
-	unsigned max_payload;     /* includes RFC2374_FRAG_HDR_SIZE overhead */
+	u16 max_payload;          /* includes RFC2374_FRAG_HDR_SIZE overhead */
 	int node_id;
 	int generation;
 	unsigned speed;
@@ -204,22 +198,18 @@ struct fwnet_peer {
 
 /* This is our task struct. It's used for the packet complete callback.  */
 struct fwnet_packet_task {
-	/*
-	 * ptask can actually be on dev->packet_list, dev->broadcasted_list,
-	 * or dev->sent_list depending on its current state.
-	 */
-	struct list_head pt_link;
 	struct fw_transaction transaction;
 	struct rfc2734_header hdr;
 	struct sk_buff *skb;
 	struct fwnet_device *dev;
 
 	int outstanding_pkts;
-	unsigned max_payload;
 	u64 fifo_addr;
 	u16 dest_node;
+	u16 max_payload;
 	u8 generation;
 	u8 speed;
+	u8 enqueued;
 };
 
 /*
@@ -916,9 +906,9 @@ static void fwnet_transmit_packet_done(s
 	ptask->outstanding_pkts--;
 
 	/* Check whether we or the networking TX soft-IRQ is last user. */
-	free = (ptask->outstanding_pkts == 0 && !list_empty(&ptask->pt_link));
+	free = (ptask->outstanding_pkts == 0 && ptask->enqueued);
 	if (free)
-		list_del(&ptask->pt_link);
+		dev->queued_datagrams--;
 
 	if (ptask->outstanding_pkts == 0) {
 		dev->netdev->stats.tx_packets++;
@@ -987,9 +977,9 @@ static void fwnet_transmit_packet_failed
 	ptask->outstanding_pkts = 0;
 
 	/* Check whether we or the networking TX soft-IRQ is last user. */
-	free = !list_empty(&ptask->pt_link);
+	free = ptask->enqueued;
 	if (free)
-		list_del(&ptask->pt_link);
+		dev->queued_datagrams--;
 
 	dev->netdev->stats.tx_dropped++;
 	dev->netdev->stats.tx_errors++;
@@ -1070,9 +1060,11 @@ static int fwnet_send_packet(struct fwne
 		spin_lock_irqsave(&dev->lock, flags);
 
 		/* If the AT tasklet already ran, we may be last user. */
-		free = (ptask->outstanding_pkts == 0 && list_empty(&ptask->pt_link));
+		free = (ptask->outstanding_pkts == 0 && !ptask->enqueued);
 		if (!free)
-			list_add_tail(&ptask->pt_link, &dev->broadcasted_list);
+			ptask->enqueued = true;
+		else
+			dev->queued_datagrams--;
 
 		spin_unlock_irqrestore(&dev->lock, flags);
 
@@ -1087,9 +1079,11 @@ static int fwnet_send_packet(struct fwne
 	spin_lock_irqsave(&dev->lock, flags);
 
 	/* If the AT tasklet already ran, we may be last user. */
-	free = (ptask->outstanding_pkts == 0 && list_empty(&ptask->pt_link));
+	free = (ptask->outstanding_pkts == 0 && !ptask->enqueued);
 	if (!free)
-		list_add_tail(&ptask->pt_link, &dev->sent_list);
+		ptask->enqueued = true;
+	else
+		dev->queued_datagrams--;
 
 	spin_unlock_irqrestore(&dev->lock, flags);
 
@@ -1351,10 +1345,12 @@ static netdev_tx_t fwnet_tx(struct sk_bu
 		max_payload += RFC2374_FRAG_HDR_SIZE;
 	}
 
+	dev->queued_datagrams++;
+
 	spin_unlock_irqrestore(&dev->lock, flags);
 
 	ptask->max_payload = max_payload;
-	INIT_LIST_HEAD(&ptask->pt_link);
+	ptask->enqueued    = 0;
 
 	fwnet_send_packet(ptask);
 
@@ -1500,14 +1496,9 @@ static int fwnet_probe(struct device *_d
 	dev->broadcast_rcv_context = NULL;
 	dev->broadcast_xmt_max_payload = 0;
 	dev->broadcast_xmt_datagramlabel = 0;
-
 	dev->local_fifo = FWNET_NO_FIFO_ADDR;
-
-	INIT_LIST_HEAD(&dev->packet_list);
-	INIT_LIST_HEAD(&dev->broadcasted_list);
-	INIT_LIST_HEAD(&dev->sent_list);
+	dev->queued_datagrams = 0;
 	INIT_LIST_HEAD(&dev->peer_list);
-
 	dev->card = card;
 	dev->netdev = net;
 
@@ -1565,7 +1556,7 @@ static int fwnet_remove(struct device *_
 	struct fwnet_peer *peer = dev_get_drvdata(_dev);
 	struct fwnet_device *dev = peer->dev;
 	struct net_device *net;
-	struct fwnet_packet_task *ptask, *pt_next;
+	int i;
 
 	mutex_lock(&fwnet_device_mutex);
 
@@ -1583,21 +1574,9 @@ static int fwnet_remove(struct device *_
 					      dev->card);
 			fw_iso_context_destroy(dev->broadcast_rcv_context);
 		}
-		list_for_each_entry_safe(ptask, pt_next,
-					 &dev->packet_list, pt_link) {
-			dev_kfree_skb_any(ptask->skb);
-			kmem_cache_free(fwnet_packet_task_cache, ptask);
-		}
-		list_for_each_entry_safe(ptask, pt_next,
-					 &dev->broadcasted_list, pt_link) {
-			dev_kfree_skb_any(ptask->skb);
-			kmem_cache_free(fwnet_packet_task_cache, ptask);
-		}
-		list_for_each_entry_safe(ptask, pt_next,
-					 &dev->sent_list, pt_link) {
-			dev_kfree_skb_any(ptask->skb);
-			kmem_cache_free(fwnet_packet_task_cache, ptask);
-		}
+		for (i = 0; dev->queued_datagrams && i < 5; i++)
+			ssleep(1);
+		WARN_ON(dev->queued_datagrams);
 		list_del(&dev->dev_link);
 
 		free_netdev(net);

-- 
Stefan Richter
-=====-==-=- =-== --===
http://arcgraph.de/sr/

^ permalink raw reply

* [PATCH 4/4] firewire: net: throttle TX queue before running out of tlabels
From: Stefan Richter @ 2010-11-07 21:41 UTC (permalink / raw)
  To: linux1394-devel; +Cc: netdev, linux-kernel
In-Reply-To: <tkrat.036c5d10fb83f85d@s5r6.in-berlin.de>

This prevents firewire-net from submitting write requests in fast
succession until failure due to all 64 transaction labels used up for
unfinished split transactions.  The netif_stop/wake_queue API is used
for this purpose.

The high watermark is set to considerably less than 64 (I chose 8) in
order to put less pressure on to responder peers.  Still, responders
which run Linux firewire-net still need to improved to keep up with
even this small amount of outstanding split transactions.  (This is
considering the case of only one responder at a time, or a primary
responder, because this is the most likely case with IP over 1394.)

This is based on the count of unfinished TX datagrams.  Maybe it should
rather be based on the count of unfinished TX fragments?  Many 1394a
CardBus cards accept only 1024k sized packets, i.e. require two fragments
per datagram at the standard MTU of 1500.  However, the chosen watermark
is still well below the maximum number of tlabels.

Without this stop/wake mechanism, datagrams were simply lost whenever
the tlabel pool was exhausted.

I wonder what a good net_device.tx_queue_len value is.  I just set it
to the same value as the chosen watermark for now.  Note that the net
scheduler still lets us exceed the high watermark occasionally.  This
could be prevented by an earlier queued_datagrams check in fwnet_tx with
a NETDEV_TX_BUSY return but such an early check did not actually
improve performance in my testing, i.e. did not reduce busy responder
situations.

Results:
  - I did not see changes to resulting throughput that were discernible
    from the usual measuring noise.
  - With this change, other requesters on the same node now have a
    chance to get spare transaction labels while firewire-net is
    transmitting.

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
---
 drivers/firewire/net.c |   36 ++++++++++++++++++++++++++++--------
 1 file changed, 28 insertions(+), 8 deletions(-)

Index: b/drivers/firewire/net.c
===================================================================
--- a/drivers/firewire/net.c
+++ b/drivers/firewire/net.c
@@ -28,8 +28,14 @@
 #include <asm/unaligned.h>
 #include <net/arp.h>
 
-#define FWNET_MAX_FRAGMENTS	25	/* arbitrary limit */
-#define FWNET_ISO_PAGE_COUNT	(PAGE_SIZE < 16 * 1024 ? 4 : 2)
+/* rx limits */
+#define FWNET_MAX_FRAGMENTS		25 /* arbitrary limit */
+#define FWNET_ISO_PAGE_COUNT		(PAGE_SIZE < 16 * 1024 ? 4 : 2)
+
+/* tx limits */
+#define FWNET_TX_QUEUE_LEN		8 /* ? */
+#define FWNET_MAX_QUEUED_DATAGRAMS	8 /* should keep AT DMA busy enough */
+#define FWNET_MIN_QUEUED_DATAGRAMS	2
 
 #define IEEE1394_BROADCAST_CHANNEL	31
 #define IEEE1394_ALL_NODES		(0xffc0 | 0x003f)
@@ -892,6 +898,20 @@ static void fwnet_free_ptask(struct fwne
 	kmem_cache_free(fwnet_packet_task_cache, ptask);
 }
 
+/* caller must hold dev->lock */
+static void inc_queued_datagrams(struct fwnet_device *dev)
+{
+	if (++dev->queued_datagrams >= FWNET_MAX_QUEUED_DATAGRAMS)
+		netif_stop_queue(dev->netdev);
+}
+
+/* caller must hold dev->lock */
+static void dec_queued_datagrams(struct fwnet_device *dev)
+{
+	if (--dev->queued_datagrams == FWNET_MIN_QUEUED_DATAGRAMS)
+		netif_wake_queue(dev->netdev);
+}
+
 static int fwnet_send_packet(struct fwnet_packet_task *ptask);
 
 static void fwnet_transmit_packet_done(struct fwnet_packet_task *ptask)
@@ -908,7 +928,7 @@ static void fwnet_transmit_packet_done(s
 	/* Check whether we or the networking TX soft-IRQ is last user. */
 	free = (ptask->outstanding_pkts == 0 && ptask->enqueued);
 	if (free)
-		dev->queued_datagrams--;
+		dec_queued_datagrams(dev);
 
 	if (ptask->outstanding_pkts == 0) {
 		dev->netdev->stats.tx_packets++;
@@ -979,7 +999,7 @@ static void fwnet_transmit_packet_failed
 	/* Check whether we or the networking TX soft-IRQ is last user. */
 	free = ptask->enqueued;
 	if (free)
-		dev->queued_datagrams--;
+		dec_queued_datagrams(dev);
 
 	dev->netdev->stats.tx_dropped++;
 	dev->netdev->stats.tx_errors++;
@@ -1064,7 +1084,7 @@ static int fwnet_send_packet(struct fwne
 		if (!free)
 			ptask->enqueued = true;
 		else
-			dev->queued_datagrams--;
+			dec_queued_datagrams(dev);
 
 		spin_unlock_irqrestore(&dev->lock, flags);
 
@@ -1083,7 +1103,7 @@ static int fwnet_send_packet(struct fwne
 	if (!free)
 		ptask->enqueued = true;
 	else
-		dev->queued_datagrams--;
+		dec_queued_datagrams(dev);
 
 	spin_unlock_irqrestore(&dev->lock, flags);
 
@@ -1345,7 +1365,7 @@ static netdev_tx_t fwnet_tx(struct sk_bu
 		max_payload += RFC2374_FRAG_HDR_SIZE;
 	}
 
-	dev->queued_datagrams++;
+	inc_queued_datagrams(dev);
 
 	spin_unlock_irqrestore(&dev->lock, flags);
 
@@ -1415,7 +1435,7 @@ static void fwnet_init_dev(struct net_de
 	net->addr_len		= FWNET_ALEN;
 	net->hard_header_len	= FWNET_HLEN;
 	net->type		= ARPHRD_IEEE1394;
-	net->tx_queue_len	= 10;
+	net->tx_queue_len	= FWNET_TX_QUEUE_LEN;
 	SET_ETHTOOL_OPS(net, &fwnet_ethtool_ops);
 }
 

-- 
Stefan Richter
-=====-==-=- =-== --===
http://arcgraph.de/sr/


------------------------------------------------------------------------------
The Next 800 Companies to Lead America's Growth: New Video Whitepaper
David G. Thomson, author of the best-selling book "Blueprint to a 
Billion" shares his insights and actions to help propel your 
business during the next growth cycle. Listen Now!
http://p.sf.net/sfu/SAP-dev2dev

^ permalink raw reply

* [BUG]: skge not working (as module) in 2.6.37-rc1
From: Marin Mitov @ 2010-11-07 21:45 UTC (permalink / raw)
  To: Stephen Hemminger, Stephen Hemminger, netdev, linux-kernel,
	David S. Miller

Hi Stephen,

skge as in 2.6.36 (and before) is working.
As in 2.6.37-rc1 it is not:

kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
kernel: IP: [<ffffffffa0005d20>] skge_devinit+0x270/0x2a0 [skge]
kernel: PGD d8657067 PUD d8658067 PMD 0
kernel: Oops: 0002 [#1] PREEMPT SMP
kernel: last sysfs file: /sys/devices/platform/mga_warp.0/firmware/mga_warp.0/loading
kernel: CPU 1
kernel: Modules linked in: skge(+)
kernel:
kernel: Pid: 2005, comm: insmod Not tainted 2.6.37-rc1 #2 A8V/System Product Name
kernel: RIP: 0010:[<ffffffffa0005d20>]  [<ffffffffa0005d20>] skge_devinit+0x270/0x2a0 [skge]
kernel: RSP: 0018:ffff8800ce477cb8  EFLAGS: 00010292
kernel: RAX: 0000000000000000 RBX: ffff88011f2cc800 RCX: ffffffff815ab260
kernel: RDX: ffffffff814e82a8 RSI: 0000000000000046 RDI: ffffffff815ab154
kernel: RBP: ffff8800ce477cd8 R08: 00000000ffffffff R09: 0000000000000000
kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800daa27480
kernel: R13: 0000000000000000 R14: ffff88011f2ccd80 R15: 0000000000000000
kernel: FS:  00007f2af73f3700(0000) GS:ffff8800dfd00000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
kernel: CR2: 0000000000000010 CR3: 00000000d865a000 CR4: 00000000000006e0
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
kernel: Process insmod (pid: 2005, threadinfo ffff8800ce476000, task ffff880109875370)
kernel: Stack:
kernel:  ffff88011fe5f800 0000000000000000 ffff8800daa27480 ffff8800daa274e8
kernel:  ffff8800ce477d28 ffffffffa0006f2d ffff8800ce477d08 ffffffff8119555a
kernel:  ffff8800ce477d38 ffff88011fe5f888 ffff88011fe5f800 ffffffffa0008120
kernel: Call Trace:
kernel:  [<ffffffffa0006f2d>] skge_probe+0x27c/0x4a7 [skge]
kernel:  [<ffffffff8119555a>] ? kobject_get+0x1a/0x30
kernel:  [<ffffffff811aa812>] local_pci_probe+0x12/0x20
kernel:  [<ffffffff811aaae0>] pci_device_probe+0x80/0xb0
kernel:  [<ffffffff812337fa>] ? driver_sysfs_add+0x7a/0xb0
kernel:  [<ffffffff81233931>] driver_probe_device+0x81/0x1a0
kernel:  [<ffffffff81233ae3>] __driver_attach+0x93/0xa0
kernel:  [<ffffffff81233a50>] ? __driver_attach+0x0/0xa0
kernel:  [<ffffffff8123301c>] bus_for_each_dev+0x5c/0x90
kernel:  [<ffffffff81233779>] driver_attach+0x19/0x20
kernel:  [<ffffffff81232908>] bus_add_driver+0x198/0x250
kernel:  [<ffffffffa000c000>] ? skge_init_module+0x0/0x3a [skge]
kernel:  [<ffffffff81233dd8>] driver_register+0x78/0x140
kernel:  [<ffffffffa000c000>] ? skge_init_module+0x0/0x3a [skge]
kernel:  [<ffffffff811aad91>] __pci_register_driver+0x51/0xd0
kernel:  [<ffffffff812d4840>] ? dmi_check_system+0x20/0x50
kernel:  [<ffffffffa000c038>] skge_init_module+0x38/0x3a [skge]
kernel:  [<ffffffff810001de>] do_one_initcall+0x3e/0x170
kernel:  [<ffffffff81066192>] sys_init_module+0xb2/0x200
kernel:  [<ffffffff810024ab>] system_call_fastpath+0x16/0x1b
kernel: Code: 39 e1 48 89 df e8 81 a7 33 e1 ba 15 0f 00 00 48 c7 c6 08 7c 00 a0 48 c7 c7 48 73 00 a0 31 c0 e8 c1 aa 39 e1 48 8b 83 00 03 00 00 <f0> 80 48 10 01 ba 17 0f 00 00 48 c7 c6 08 7c 00 a0 48 c7 c7 48
kernel: RIP  [<ffffffffa0005d20>] skge_devinit+0x270/0x2a0 [skge]
kernel:  RSP <ffff8800ce477cb8>
kernel: CR2: 0000000000000010
kernel: ---[ end trace ef29176d9e5b71a4 ]---

Reverting the changes in skge.c (2.6.36 -> 2.6.37-rc1) does not help.
Debugging with many printk embedded in skge_devinit() found the problem is in
netif_stop_queue(dev). Removing the statement (see the patch) helps - skge is working.

But I am not expert in the networking, so I am not sure I have solved the real problem.
May be some changes in the core networking are the real cause of the problem.

And by the way, why one should stop a queue that is not yet (at least explicitly) started? :-)

Best regards.

Marin Mitov

Signed-off-by: Marin Mitov <mitov@issp.bas.bg>

===========================================================
--- a/drivers/net/skge.c	2010-11-07 10:55:22.000000000 +0200
+++ b/drivers/net/skge.c	2010-11-07 20:55:43.000000000 +0200
@@ -3858,7 +3858,6 @@ static struct net_device *skge_devinit(s
 
 	/* device is off until link detection */
 	netif_carrier_off(dev);
-	netif_stop_queue(dev);
 
 	return dev;
 }

^ permalink raw reply

* Re: [Security] [SECURITY] Fix leaking of kernel heap addresses via /proc
From: Andi Kleen @ 2010-11-07 21:52 UTC (permalink / raw)
  To: Ted Ts'o
  Cc: Linus Torvalds, Dan Rosenberg, chas@cmf.nrl.navy.mil,
	davem@davemloft.net, kuznet@ms2.inr.ac.ru, pekkas@netcore.fi,
	jmorris@namei.org, yoshfuji@linux-ipv6.org, kaber@trash.net,
	remi.denis-courmont@nokia.com, netdev@vger.kernel.org,
	security@kernel.org
In-Reply-To: <20101106234840.GD2935@thunk.org>

Ted Ts'o <tytso@mit.edu> writes:

> Are there any userspace programs that might be reasonably expected to
> _use_ this information?  If there is, we could just pick a random
> number at boot time, and then XOR the heap adddress with that random
> number.

If any of the addresses can be guessed ever (and that is likely if it's
allocated at boot) determining the random value will be trivial
for everyone.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply

* Re: [PATCH 0/9] Fix leaking of kernel heap addresses in net/
From: Urs Thuermann @ 2010-11-07 21:53 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Dan Rosenberg, chas, davem, kuznet, pekkas, jmorris, yoshfuji,
	kaber, remi.denis-courmont, netdev, security, stable
In-Reply-To: <1289149416.2478.143.camel@edumazet-laptop>

Eric Dumazet <eric.dumazet@gmail.com> writes:

> You are basically ruining a lot of debugging facilities we use every
> day to find and fix _real_ bugs. The bugs that happen to crash
> machines of our customers.
> 
> If you want to avoid a user reading kernel syslog, why dont you fix
> the problem for non root users able to "dmesg" ? I personally dont
> care.

dmesg and syslog aren't the only problem.  There are files in
/proc/net/ that must be readable by all users in order for a number of
tools to work, e.g. netstat(8), and those files contain kernel
addresses of sock structs.

> There are pretty easy ways to not disclose "information", but your
> way of using '0' for all values is the dumbest idea one could ever
> had.
> 
> A single XOR with a "root only visible, random value chosen at boot"
> would be OK. At least we could continue our work, with litle burden.

Simple XOR with a secret constant doesn't help very much.  Many bits
of that hidden constant can be revealed quite easily.  On x86-32 for
example all addresses are in the high 1 GB address space so the
highest two address bits are 1.  Many structs of type sock are aligned
to 64 or higher powers of two so you can easily infer the lower 6 or
more bits of that hidden XOR constant.  If you know two structs of
alignment 2^n are contiguous in memory you can infer the lower n+1
bits.

I suggest either one of the following two solutions:

1. Make the contents of the proc-files in question dependent on some
   capability, e.g. CAP_SYS_ADMIN or some new CAP_KERNEL_DEBUG.  If
   the process opening the file owns that capability it will see the
   kernel address, otherwise it will see 0.

2. Replace the kernel address by the socket inode number or some other
   identifying value (as our patch for CAN does) and provide some
   mechanism for root only to translate that value to the
   corresponding kernel address.


urs

^ permalink raw reply

* Re: [Security] [SECURITY] Fix leaking of kernel heap addresses via /proc
From: Chas Williams (CONTRACTOR) @ 2010-11-07 22:48 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ted Ts'o, Linus Torvalds, Dan Rosenberg, davem@davemloft.net,
	kuznet@ms2.inr.ac.ru, pekkas@netcore.fi, jmorris@namei.org,
	yoshfuji@linux-ipv6.org, kaber@trash.net,
	remi.denis-courmont@nokia.com, netdev@vger.kernel.org,
	security@kernel.org
In-Reply-To: <87sjzcssx5.fsf@basil.nowhere.org>

In message <87sjzcssx5.fsf@basil.nowhere.org>,Andi Kleen writes:
>Ted Ts'o <tytso@mit.edu> writes:
>
>> Are there any userspace programs that might be reasonably expected to
>> _use_ this information?  If there is, we could just pick a random
>> number at boot time, and then XOR the heap adddress with that random
>> number.
>
>If any of the addresses can be guessed ever (and that is likely if it's
>allocated at boot) determining the random value will be trivial
>for everyone.

i suppose one could use idr to map the pointers to unique values.  the
infiniband code uses this technique>

^ permalink raw reply

* Re: [PATCH] virtio_net: Fix queue full check
From: Rusty Russell @ 2010-11-07 23:08 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Krishna Kumar2, davem, netdev, yvugenfi
In-Reply-To: <20101104122424.GA29830@redhat.com>

On Thu, 4 Nov 2010 10:54:24 pm Michael S. Tsirkin wrote:
> I thought about this some more.  I think the original
> code is actually correct in returning ENOSPC: indirect
> buffers are nice, but it's a mistake
> to rely on them as a memory allocation might fail.
> 
> And if you look at virtio-net, it is dropping packets
> under memory pressure which is not really a happy outcome:
> the packet will get freed, reallocated and we get another one,
> adding pressure on the allocator instead of releasing it
> until we free up some buffers.
> 
> So I now think we should calculate the capacity
> assuming non-indirect entries, and if we manage to
> use indirect, all the better.

I've long said it's a weakness in the network stack that it insists
drivers stop the tx queue before they *might* run out of room, leading to
worst-case assumptions and underutilization of the tx ring.

However, I lost that debate, and so your patch is the way it's supposed to
work.  The other main indirect user (block) doesn't care as its queue
allows for post-attempt blocking.

I enhanced your commentry a little:

Subject: virtio: return correct capacity to users
Date: Thu, 4 Nov 2010 14:24:24 +0200
From: "Michael S. Tsirkin" <mst@redhat.com>

We can't rely on indirect buffers for capacity
calculations because they need a memory allocation
which might fail.  In particular, virtio_net can get
into this situation under stress, and it drops packets
and performs badly.

So return the number of buffers we can guarantee users.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Reported-By: Krishna Kumar2 <krkumar2@in.ibm.com>

Thanks!
Rusty.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox