Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] net/core: Save the port number a netdevice uses
From: David Miller @ 2010-05-26  8:39 UTC (permalink / raw)
  To: eli; +Cc: netdev, linux-rdma, rdreier, yevgenyp
In-Reply-To: <20100526081702.GA17160@mtldesk030.lab.mtl.com>

From: Eli Cohen <eli@mellanox.co.il>
Date: Wed, 26 May 2010 11:17:02 +0300

> Today, there are no means to know which port of a hardware device a netdev
> interface uses. This patch adds a new field to struct net_device that is used
> to store this value. The network driver should use the SET_NETDEV_PORT_NUM()
> macro to set the port number for the device it manages. For drivers that do not
> set a value, a default value of 1 is set at alloc_netdev_mq().
> This patch also makes use of this feature in the mlx4_en driver.
> 
> Signed-off-by: Eli Cohen <eli@mellanox.co.il>

We have an existing dev_id, use it.

^ permalink raw reply

* [PATCH] net/core: Save the port number a netdevice uses
From: Eli Cohen @ 2010-05-26  8:17 UTC (permalink / raw)
  To: davem; +Cc: netdev, linux-rdma, rdreier, yevgenyp

Today, there are no means to know which port of a hardware device a netdev
interface uses. This patch adds a new field to struct net_device that is used
to store this value. The network driver should use the SET_NETDEV_PORT_NUM()
macro to set the port number for the device it manages. For drivers that do not
set a value, a default value of 1 is set at alloc_netdev_mq().
This patch also makes use of this feature in the mlx4_en driver.

Signed-off-by: Eli Cohen <eli@mellanox.co.il>
---
 drivers/net/mlx4/en_netdev.c |    1 +
 include/linux/netdevice.h    |    6 ++++++
 net/core/dev.c               |    1 +
 net/core/net-sysfs.c         |    2 ++
 4 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/drivers/net/mlx4/en_netdev.c b/drivers/net/mlx4/en_netdev.c
index 6c2b15b..d3df609 100644
--- a/drivers/net/mlx4/en_netdev.c
+++ b/drivers/net/mlx4/en_netdev.c
@@ -978,6 +978,7 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 	}
 
 	SET_NETDEV_DEV(dev, &mdev->dev->pdev->dev);
+	SET_NETDEV_PORT_NUM(dev, port);
 
 	/*
 	 * Initialize driver private data
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3857517..2a52a6a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -843,6 +843,7 @@ struct net_device {
 	unsigned char		perm_addr[MAX_ADDR_LEN]; /* permanent hw address */
 	unsigned char		addr_len;	/* hardware address length	*/
 	unsigned short          dev_id;		/* for shared network cards */
+	unsigned short          port_num;	/* for multiport devices */
 
 	struct netdev_hw_addr_list	uc;	/* Secondary unicast
 						   mac addresses */
@@ -1080,6 +1081,11 @@ static inline void *netdev_priv(const struct net_device *dev)
  */
 #define SET_NETDEV_DEVTYPE(net, devtype)	((net)->dev.type = (devtype))
 
+/*
+ * Set the port number of the physical device that this port net device uses
+ */
+#define SET_NETDEV_PORT_NUM(net, portnum)	((net)->port_num = (portnum))
+
 /**
  *	netif_napi_add - initialize a napi context
  *	@dev:  network device
diff --git a/net/core/dev.c b/net/core/dev.c
index 264137f..8e2f5df 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5471,6 +5471,7 @@ struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
 	dev->_tx = tx;
 	dev->num_tx_queues = queue_count;
 	dev->real_num_tx_queues = queue_count;
+	dev->port_num = 1;
 
 	dev->gso_max_size = GSO_MAX_SIZE;
 
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 59cfc7d..c3d9b39 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -97,6 +97,7 @@ NETDEVICE_SHOW(ifindex, fmt_dec);
 NETDEVICE_SHOW(features, fmt_long_hex);
 NETDEVICE_SHOW(type, fmt_dec);
 NETDEVICE_SHOW(link_mode, fmt_dec);
+NETDEVICE_SHOW(port_num, fmt_dec);
 
 /* use same locking rules as GIFHWADDR ioctl's */
 static ssize_t show_address(struct device *dev, struct device_attribute *attr,
@@ -299,6 +300,7 @@ static struct device_attribute net_class_attributes[] = {
 	__ATTR(features, S_IRUGO, show_features, NULL),
 	__ATTR(type, S_IRUGO, show_type, NULL),
 	__ATTR(link_mode, S_IRUGO, show_link_mode, NULL),
+	__ATTR(port_num, S_IRUGO, show_port_num, NULL),
 	__ATTR(address, S_IRUGO, show_address, NULL),
 	__ATTR(broadcast, S_IRUGO, show_broadcast, NULL),
 	__ATTR(carrier, S_IRUGO, show_carrier, NULL),
-- 
1.7.1


^ permalink raw reply related

* Re: Warning in net/ipv4/af_inet.c:154
From: David Miller @ 2010-05-26  7:56 UTC (permalink / raw)
  To: anton; +Cc: eric.dumazet, netdev
In-Reply-To: <20100526031943.GA28295@kryten>

From: Anton Blanchard <anton@samba.org>
Date: Wed, 26 May 2010 13:19:43 +1000

> I notice we update sk_forward_alloc in sk_mem_charge and sk_mem_uncharge.
> Since it isn't an atomic variable I went looking for a lock somewhere in
> the call chain (first thought was the socket lock). I couldn't find
> anything, but I could easily be missing something.

We take the lock properly for all of the skb_queue_rcv_skb() cases
but this rule isn't followed properly for skb_queue_err_skb().

Eric, look at even things like skb_tstamp_tx().  Nothing locks the
socket in those cases, yet we dip down into sock_queue_err_skb() and
thus invoke skb_set_owner_r which goes into sk_mem_charge() and does
the non-atomic update on ->sk_forward_alloc.

I am sure there are other cases with this problem involving
sock_queue_err_skb()...  ip_icmp_error() (via __udp4_lib_err()),
ipv6_icmp_error(), etc.

^ permalink raw reply

* Re: [patch] sctp: dubious bitfields in sctp_transport
From: David Miller @ 2010-05-26  7:40 UTC (permalink / raw)
  To: error27; +Cc: vladislav.yasevich, sri, yjwei, linux-sctp, netdev,
	kernel-janitors
In-Reply-To: <20100522202024.GL22515@bicker>

From: Dan Carpenter <error27@gmail.com>
Date: Sat, 22 May 2010 22:20:24 +0200

> Sparse complains because these one-bit bitfields are signed.
>   include/net/sctp/structs.h:879:24: error: dubious one-bit signed bitfield
>   include/net/sctp/structs.h:889:31: error: dubious one-bit signed bitfield
>   include/net/sctp/structs.h:895:26: error: dubious one-bit signed bitfield
>   include/net/sctp/structs.h:898:31: error: dubious one-bit signed bitfield
>   include/net/sctp/structs.h:901:27: error: dubious one-bit signed bitfield
> 
> It doesn't cause a problem in the current code, but it would be better
> to clean it up.  This was introduced by c0058a35aacc7: "sctp: Save some
> room in the sctp_transport by using bitfields".
> 
> Signed-off-by: Dan Carpenter <error27@gmail.com>

Applied, thanks Dan.

^ permalink raw reply

* Re: [patch] ipmr: off by one in __ipmr_fill_mroute()
From: David Miller @ 2010-05-26  7:39 UTC (permalink / raw)
  To: error27; +Cc: netdev, kuznet, pekkas, jmorris, yoshfuji, kaber, eric.dumazet
In-Reply-To: <20100522175357.GF22515@bicker>

From: Dan Carpenter <error27@gmail.com>
Date: Sat, 22 May 2010 19:54:48 +0200

> This fixes a smatch warning:
> 	net/ipv4/ipmr.c +1917 __ipmr_fill_mroute(12) error: buffer overflow
> 	'(mrt)->vif_table' 32 <= 32
> 
> The ipv6 version had the same issue.
> 
> Signed-off-by: Dan Carpenter <error27@gmail.com>

Introduced by commit "7438189b".

Applied, thanks Dan.

^ permalink raw reply

* Re: [PATCH] be2net: increase POST timeout for EEH recovery
From: David Miller @ 2010-05-26  7:33 UTC (permalink / raw)
  To: sathyap; +Cc: netdev
In-Reply-To: <20100526070048.GA2312@serverengines.com>

From: Sathya Perla <sathyap@serverengines.com>
Date: Wed, 26 May 2010 12:30:48 +0530

> Sometimes BE requires longer time for POST completion after an EEH reset.
> Increasing the timeout value accordingly.
> 
> Signed-off-by: Sathya Perla <sathyap@serverengines.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH] tcp: Socket option to set congestion window
From: David Miller @ 2010-05-26  7:33 UTC (permalink / raw)
  To: therbert; +Cc: shemminger, netdev, ycheng
In-Reply-To: <AANLkTinHGMtfw4Oydfgx0w7QLb-HyYSKdI-4smD-BEkq@mail.gmail.com>

From: Tom Herbert <therbert@google.com>
Date: Wed, 26 May 2010 00:06:35 -0700

> It's really not that simple.  In the application with multiple
> connections, congestion may only affect some number of connections, so
> more of the aggregate window may be preserved.  This is an unfairness
> issue between 1 and N connection scenarios which is a real problem.

If this is true, then by all account your patch allows things to be
even worse.

Because now applications can still open up N connections, but with an
even larger initial CWND, with potentially exponential ramifications
on network congestion.

So yet another reason not to consider this feature seriously.  It's
not an application level attribute, it's a network path one.  Please
take it seriously because I really mean it.

^ permalink raw reply

* [PATCH] be2net: increase POST timeout for EEH recovery
From: Sathya Perla @ 2010-05-26  7:00 UTC (permalink / raw)
  To: netdev

Sometimes BE requires longer time for POST completion after an EEH reset.
Increasing the timeout value accordingly.

Signed-off-by: Sathya Perla <sathyap@serverengines.com>
---
 drivers/net/benet/be_cmds.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/benet/be_cmds.c b/drivers/net/benet/be_cmds.c
index c911bfb..9d11dbf 100644
--- a/drivers/net/benet/be_cmds.c
+++ b/drivers/net/benet/be_cmds.c
@@ -294,7 +294,7 @@ int be_cmd_POST(struct be_adapter *adapter)
 		} else {
 			return 0;
 		}
-	} while (timeout < 20);
+	} while (timeout < 40);
 
 	dev_err(&adapter->pdev->dev, "POST timeout; stage=0x%x\n", stage);
 	return -1;
-- 
1.6.5.2


^ permalink raw reply related

* Re: [PATCH] tcp: Socket option to set congestion window
From: Tom Herbert @ 2010-05-26  7:06 UTC (permalink / raw)
  To: David Miller; +Cc: shemminger, netdev, ycheng
In-Reply-To: <20100525.225236.226781050.davem@davemloft.net>

On Tue, May 25, 2010 at 10:52 PM, David Miller <davem@davemloft.net> wrote:
>
> From: Stephen Hemminger <shemminger@vyatta.com>
> Date: Tue, 25 May 2010 22:08:58 -0700
>
> > The IETF TCP maintainers already think Linux TCP allows unsafe
> > operation, this will just allow more possible misuse and prove
> > their argument.  Until/unless this behavior was approved by
> > a wider set of research, I don't think it should be accepted at
> > this time.
>
> Yes, and two other points I'd like to add.
>
> 1) Stop pretending a network path characteristic can be made into
>   an application level one, else I'll stop reading your patches.
>
>   You can try to use smoke and mirrors to make your justification by
>   saying that an application can circumvent things right now by
>   openning up multiple connections.  But guess what?  If that act
>   overflows a network queue, we'll pull the CWND back on all of those
>   connections while their CWNDs are still small and therefore way
>   before things get out of hand.
>
It's really not that simple.  In the application with multiple
connections, congestion may only affect some number of connections, so
more of the aggregate window may be preserved.  This is an unfairness
issue between 1 and N connection scenarios which is a real problem.

>
>   Whereas if you set the initial window high, the CWND is wildly out
>   of control before we are even started.
>
>   And even after your patch the "abuse" ability is still there.  So
>   since your patch doesn't prevent the "abuse", you really don't care
>   about CWND abuse.  Instead, you simply want to pimp your feature.
>
> 2) The very last application I'd want to use something like this is a
>   damn web browser.
>

Right, this should be fixed in the server not at the browsers.
Unfortunately, web browsers seem to have lost any self control in
limiting the number of simultaneous connections that can be opened (we
managed to get IE8 to open over 100 of them).  So the cat's way out of
the bag.  Server's can rein this problem in by only allowing fewer
connections, but at the cost of losing latency is not much incentive!

>   Maybe a program, which is extremely sophisticated, like a database
>   or caching manager, that runs privileged and somehow has complete
>   and constantly updated knowledge of the network topology from end
>   to end.  And iff, and only iff, we only would let privileged
>   applications make the setting.
>
> Right now we only allow to do this via a route setting, exactly because:
>
> 1) It is a network path characteristic, full stop.
>
Thanks to NAT, the concept of a network path or even host specific
path is a weakened concept.  On the Internet this may be a path
characteristic per client, which unfortunately has no visibility in
the kernel other than per connection state.  When a single IP address
may have thousands of hosts behind it, caching TCP parameters for that
IP address is implicitly doing a huge aggregation-- probably dicey...

>
> 2) Only humans can really know what the exact end to end path
>   characteristics are on a per-route basis, and given that whether it
>   is safe to increase the initial CWND as a result.

In all but the most trivial networks, I do not believes humans are
capable of making an intelligent decision about this.  Don't get me
wrong, it's great that it can be set in the route, but there's nothing
at all that prevents naive abuse (2009 study showed that 15%
connections of connections on the Internet violate icw standards
anyway).  We have proposed in iETF to raise the initial congestion
window, but dynamic mechanisms that algorithmically determine safe
values are still of interest and may be safer which is what this patch
would allow.

Thanks for your comments!

^ permalink raw reply

* Re: [PATCH 0/2] fixes to arp_notify for virtual machine migration use case
From: David Miller @ 2010-05-26  6:08 UTC (permalink / raw)
  To: Ian.Campbell; +Cc: netdev, shemminger, Jeremy.Fitzhardinge, stable
In-Reply-To: <1274779201.24218.7164.camel@zakaz.uk.xensource.com>

From: Ian Campbell <Ian.Campbell@citrix.com>
Date: Tue, 25 May 2010 10:20:01 +0100

> Anyway, assuming the fact that arp_notify is disabled by default hasn't
> changed your mind, would adding NETDEV_NOTIFY_PEERS triggered by
> netif_notify_peers() be appropriate or would it be preferable to simply
> add netif_notify_peers() which generates the existing NETDEV_CHANGEADDR?

A seperate NETDEV_NOTIFY_PEERS seems the best idea to me.

> I don't think there's any need for a new sysctl so I'll gate the new
> option on the existing arp_notify one.

Yep, sounds good.

^ permalink raw reply

* Re: [PATCH] tcp: Socket option to set congestion window
From: David Miller @ 2010-05-26  5:52 UTC (permalink / raw)
  To: shemminger; +Cc: therbert, netdev, ycheng
In-Reply-To: <20100525220858.1071f238@nehalam>

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Tue, 25 May 2010 22:08:58 -0700

> The IETF TCP maintainers already think Linux TCP allows unsafe
> operation, this will just allow more possible misuse and prove
> their argument.  Until/unless this behavior was approved by
> a wider set of research, I don't think it should be accepted at
> this time.

Yes, and two other points I'd like to add.

1) Stop pretending a network path characteristic can be made into
   an application level one, else I'll stop reading your patches.

   You can try to use smoke and mirrors to make your justification by
   saying that an application can circumvent things right now by
   openning up multiple connections.  But guess what?  If that act
   overflows a network queue, we'll pull the CWND back on all of those
   connections while their CWNDs are still small and therefore way
   before things get out of hand.

   Whereas if you set the initial window high, the CWND is wildly out
   of control before we are even started.

   And even after your patch the "abuse" ability is still there.  So
   since your patch doesn't prevent the "abuse", you really don't care
   about CWND abuse.  Instead, you simply want to pimp your feature.

2) The very last application I'd want to use something like this is a
   damn web browser.

   Maybe a program, which is extremely sophisticated, like a database
   or caching manager, that runs privileged and somehow has complete
   and constantly updated knowledge of the network topology from end
   to end.  And iff, and only iff, we only would let privileged
   applications make the setting.

Right now we only allow to do this via a route setting, exactly because:

1) It is a network path characteristic, full stop.

2) Only humans can really know what the exact end to end path
   characteristics are on a per-route basis, and given that whether it
   is safe to increase the initial CWND as a result.

^ permalink raw reply

* RE: NULL Pointer Deference: NFS & Telnet
From: Eric Dumazet @ 2010-05-26  5:29 UTC (permalink / raw)
  To: Arce, Abraham
  Cc: David Miller, netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-omap-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	tony-4v6yS6AI5VpBDgjK7y7TUQ@public.gmane.org, Shilimkar, Santosh
In-Reply-To: <27F9C60D11D683428E133F85D2BB4A53043E3EDFF1-lTKHBJngVwKIQmiDNMet8wC/G2K4zDHf@public.gmane.org>

Le mardi 25 mai 2010 à 21:02 -0500, Arce, Abraham a écrit :
> Thanks David,
> 
> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > index f8abf68..eb81f76 100644
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -334,7 +334,7 @@ static void skb_release_data(struct sk_buff *skb)
> > >  	if (!skb->cloned ||
> > >  	    !atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
> > >  			       &skb_shinfo(skb)->dataref)) {
> > > -		if (skb_shinfo(skb)->nr_frags) {
> > > +		if (skb_shinfo(skb)->nr_frags && skb_has_frags(skb)) {
> > >  			int i;
> > >  			for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
> > >  				put_page(skb_shinfo(skb)->frags[i].page);
> > 
> > skb_shinfo(skb)->nr_frags counts the number of entries contained
> > in the skb_shinfo(skb)->frags[] array.
> > 
> > This has nothing to do with the frag list pointer,
> > skb_shinfo(skb)->frag_list, which is what skb_has_frags()
> > tests.
> > 
> > You've got some kind of memory corruption going on and it
> > appears to have nothing to do with the code paths you're
> > playing with here.
> 
> Do you have any recommendation on debugging technique/tool for this memory corruption issue?
> 
> Best Regards
> Abraham
> --

It seems quite strange. You have a skb->nr_frags > 0 value, but a
frags[i].page = 0 value

You might add following function :

shinfo_check(struct sk_buff *skb)
{
	struct skb_shared_info *shinfo = skb_shinfo(skb);
	int i;

	WARN_ON(shinfo->nr_frags >= MAX_SKB_FRAGS);
	for (i = 0; i < shinfo->nr_frags; i++)
		WARN_ON(!shinfo->frags[i].page);
}

And call it from various points, to check who corrupts your skb.



--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: Warning in net/ipv4/af_inet.c:154
From: Eric Dumazet @ 2010-05-26  5:18 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: netdev
In-Reply-To: <20100526031943.GA28295@kryten>

Le mercredi 26 mai 2010 à 13:19 +1000, Anton Blanchard a écrit :
> Hi,
> 
> > > Which is:
> > > 
> > >         WARN_ON(sk->sk_forward_alloc);
> > > 
> > 
> > Yes, the infamous one :)
> > 
> > Is it reproductible ? What kind of workload is it ?
> > What is the NIC involved ?
> 
> It was running sysbench against a postgresql database over localhost. In
> each case I checked, sk_forward_alloc was less than one page.

ok. I am a bit surprised postgresql uses UDP

> 
> I notice we update sk_forward_alloc in sk_mem_charge and sk_mem_uncharge.
> Since it isn't an atomic variable I went looking for a lock somewhere in
> the call chain (first thought was the socket lock). I couldn't find
> anything, but I could easily be missing something.
> 

UDP path uses socket lock indeed to protect sk_forward_alloc non atomic
updates.

Check skb_free_datagram_locked() for example : sk_mem_reclaim_partial()
is called inside the lock_sock_bh/unlock_sock_bh protected section, and
*after* the uncharge done by skb_orphan(). This function was changed
recently, so maybe your platform hit some bug somewhere.

In receive path, we hold the socket lock while calling
sock_queue_rcv_skb()




^ permalink raw reply

* Re: [PATCH] tcp: Socket option to set congestion window
From: Stephen Hemminger @ 2010-05-26  5:08 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev, ycheng
In-Reply-To: <alpine.DEB.1.00.1005252157150.27170@pokey.mtv.corp.google.com>

On Tue, 25 May 2010 22:01:13 -0700 (PDT)
Tom Herbert <therbert@google.com> wrote:

> This patch allows an application to set the TCP congestion window
> for a connection through a socket option.  The maximum value that
> may set is specified in a sysctl value.  When the sysctl is set to
> zero, the default value, the socket option is disabled.
> 
> The socket option is most useful to set the initial congestion
> window for a connection to a larger value than the default in
> order to improve latency.  This socket option would typically be
> used by an "intelligent" application which might have better knowledge
> than the kernel as to what an appropriate initial congestion window is.
> 
> One use of this might be with an application which maintains per
> client path characteristics.  This could allow setting the congestion
> window more precisely than which could be achieved through the
> route command.
> 
> A second use of this might be to reduce the number of simultaneous
> connections that a client might open to the server; for instance
> when a web browser opens multiple connections to a server.  With multiple
> connections the aggregate congestion window is larger than that of a
> single connecton (num_conns * cwnd), this effectively can be used to
> circumvent slowstart and improve latency.  With this socket option, a
> single connection with a large initial congestion window could be used,
> which retains the latency properties of multiple connections but
> nicely reducing # of connections (load) on the network.
> 
> The systctl to enable and control this feature is
> 
>   net.ipv4.tcp_user_cwnd_max
> 
> The socket option call would be:
> 
>   setsockopt(fd, IPPROTO_TCP, TCP_CWND, &val, sizeof (val))
> 
> where val is the congestion window in # MSS.
> 

The IETF TCP maintainers already think Linux TCP allows unsafe
operation, this will just allow more possible misuse and prove
their argument.  Until/unless this behavior was approved by
a wider set of research, I don't think it should be accepted at
this time.


-- 

^ permalink raw reply

* [PATCH] tcp: Socket option to set congestion window
From: Tom Herbert @ 2010-05-26  5:01 UTC (permalink / raw)
  To: davem; +Cc: netdev, ycheng

This patch allows an application to set the TCP congestion window
for a connection through a socket option.  The maximum value that
may set is specified in a sysctl value.  When the sysctl is set to
zero, the default value, the socket option is disabled.

The socket option is most useful to set the initial congestion
window for a connection to a larger value than the default in
order to improve latency.  This socket option would typically be
used by an "intelligent" application which might have better knowledge
than the kernel as to what an appropriate initial congestion window is.

One use of this might be with an application which maintains per
client path characteristics.  This could allow setting the congestion
window more precisely than which could be achieved through the
route command.

A second use of this might be to reduce the number of simultaneous
connections that a client might open to the server; for instance
when a web browser opens multiple connections to a server.  With multiple
connections the aggregate congestion window is larger than that of a
single connecton (num_conns * cwnd), this effectively can be used to
circumvent slowstart and improve latency.  With this socket option, a
single connection with a large initial congestion window could be used,
which retains the latency properties of multiple connections but
nicely reducing # of connections (load) on the network.

The systctl to enable and control this feature is

  net.ipv4.tcp_user_cwnd_max

The socket option call would be:

  setsockopt(fd, IPPROTO_TCP, TCP_CWND, &val, sizeof (val))

where val is the congestion window in # MSS.

Signed-off-by: Tom Herbert <therbert@google.com>
---
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index a778ee0..9e9692f 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -105,6 +105,7 @@ enum {
 #define TCP_COOKIE_TRANSACTIONS	15	/* TCP Cookie Transactions */
 #define TCP_THIN_LINEAR_TIMEOUTS 16      /* Use linear timeouts for thin streams*/
 #define TCP_THIN_DUPACK         17      /* Fast retrans. after 1 dupack */
+#define TCP_CWND		18	/* Set congestion window */

 /* for TCP_INFO socket option */
 #define TCPI_OPT_TIMESTAMPS	1
diff --git a/include/net/tcp.h b/include/net/tcp.h
index a144914..3d1f934 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -246,6 +246,7 @@ extern int sysctl_tcp_max_ssthresh;
 extern int sysctl_tcp_cookie_size;
 extern int sysctl_tcp_thin_linear_timeouts;
 extern int sysctl_tcp_thin_dupack;
+extern int sysctl_tcp_user_cwnd_max;

 extern atomic_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index d96c1da..b35d18f 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -597,6 +597,13 @@ static struct ctl_table ipv4_table[] = {
 		.mode           = 0644,
 		.proc_handler   = proc_dointvec
 	},
+        {
+		.procname       = "tcp_user_cwnd_max",
+		.data           = &sysctl_tcp_user_cwnd_max,
+		.maxlen         = sizeof(int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec
+	},
 	{
 		.procname	= "udp_mem",
 		.data		= &sysctl_udp_mem,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 6596b4f..0ca9832 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2370,6 +2370,24 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 		}
 		break;

+	case TCP_CWND:
+		if (sysctl_tcp_user_cwnd_max <= 0)
+			err = -EPERM;
+		else if (val > 0 && sk->sk_state == TCP_ESTABLISHED &&
+		    icsk->icsk_ca_state == TCP_CA_Open) {
+			u32 cwnd = val;
+			cwnd = min(cwnd, (u32)sysctl_tcp_user_cwnd_max);
+			cwnd = min(cwnd, tp->snd_cwnd_clamp);
+
+			if (tp->snd_cwnd != cwnd) {
+				tp->snd_cwnd = cwnd;
+				tp->snd_cwnd_stamp = tcp_time_stamp;
+				tp->snd_cwnd_cnt = 0;
+			}
+		} else
+			err = -EINVAL;
+		break;
+
 #ifdef CONFIG_TCP_MD5SIG
 	case TCP_MD5SIG:
 		/* Read the IP->Key mappings from userspace */
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index b4ed957..2d10a44 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -60,6 +60,8 @@ int sysctl_tcp_base_mss __read_mostly = 512;
 /* By default, RFC2861 behavior.  */
 int sysctl_tcp_slow_start_after_idle __read_mostly = 1;

+int sysctl_tcp_user_cwnd_max __read_mostly;
+
 int sysctl_tcp_cookie_size __read_mostly = 0; /* TCP_COOKIE_MAX */
 EXPORT_SYMBOL_GPL(sysctl_tcp_cookie_size);

^ permalink raw reply related

* Re: Warning in net/ipv4/af_inet.c:154
From: Anton Blanchard @ 2010-05-26  3:19 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1274801229.5020.80.camel@edumazet-laptop>

Hi,

> > Which is:
> > 
> >         WARN_ON(sk->sk_forward_alloc);
> > 
> 
> Yes, the infamous one :)
> 
> Is it reproductible ? What kind of workload is it ?
> What is the NIC involved ?

It was running sysbench against a postgresql database over localhost. In
each case I checked, sk_forward_alloc was less than one page.

I notice we update sk_forward_alloc in sk_mem_charge and sk_mem_uncharge.
Since it isn't an atomic variable I went looking for a lock somewhere in
the call chain (first thought was the socket lock). I couldn't find
anything, but I could easily be missing something.

Anton

^ permalink raw reply

* [RFC] IFLA_PORT_* iproute2 cmd line
From: Scott Feldman @ 2010-05-26  3:19 UTC (permalink / raw)
  To: netdev; +Cc: Chris Wright, Stephen Hemminger, Arnd Bergmann

I need to provide an iproute2 patch for IFLA_PORT_* and I wanted to hash out
the cmd line before I submit it.  Here's what I think would work based on
previous input from Arnd:

Usage:  ip port associate DEVICE [ vf NUM ] {PROFILE|VSI}
        ip port pre-associate DEVICE [ vf NUM ] VSI
        ip port pre-associate-rr DEVICE [ vf NUM ] VSI
        ip port dis-associate DEVICE [ vf NUM ]
        ip port show [ DEVICE [ vf NUM ] ]

        PROFILE := port-profile PORT-PROFILE
                   [ instance-uuid INSTANCE-UUID ]
                   [ host-uuid HOST-UUID ]

        VSI := vsi managerid MGR typeid VTID typeidversion VER
               [ instance-uuid INSTANCE-UUID ]

Comments?

-scott


^ permalink raw reply

* Re: linux-next: build warning in Linus' tree
From: Stephen Rothwell @ 2010-05-26  2:04 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-next, linux-kernel, NeilJay
In-Reply-To: <20100525.162446.183058933.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 530 bytes --]

Hi Dave,

On Tue, 25 May 2010 16:24:46 -0700 (PDT) David Miller <davem@davemloft.net> wrote:
>
> From: David Miller <davem@davemloft.net>
> Date: Tue, 25 May 2010 16:19:29 -0700 (PDT)
> 
> > Here is how I fixed this:
> > 
> > --------------------
> > drivers/net/usb/asix.c: Fix pointer cast.
> 
> Sorry, that only took care of one of the two warnings :-)
> 
> This patch is better.

Thanks, looks good.

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* RE: NULL Pointer Deference: NFS & Telnet
From: Arce, Abraham @ 2010-05-26  2:02 UTC (permalink / raw)
  To: David Miller
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-omap-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	tony-4v6yS6AI5VpBDgjK7y7TUQ@public.gmane.org, Shilimkar, Santosh
In-Reply-To: <20100525.185236.193707791.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>

Thanks David,

> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index f8abf68..eb81f76 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -334,7 +334,7 @@ static void skb_release_data(struct sk_buff *skb)
> >  	if (!skb->cloned ||
> >  	    !atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
> >  			       &skb_shinfo(skb)->dataref)) {
> > -		if (skb_shinfo(skb)->nr_frags) {
> > +		if (skb_shinfo(skb)->nr_frags && skb_has_frags(skb)) {
> >  			int i;
> >  			for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
> >  				put_page(skb_shinfo(skb)->frags[i].page);
> 
> skb_shinfo(skb)->nr_frags counts the number of entries contained
> in the skb_shinfo(skb)->frags[] array.
> 
> This has nothing to do with the frag list pointer,
> skb_shinfo(skb)->frag_list, which is what skb_has_frags()
> tests.
> 
> You've got some kind of memory corruption going on and it
> appears to have nothing to do with the code paths you're
> playing with here.

Do you have any recommendation on debugging technique/tool for this memory corruption issue?

Best Regards
Abraham
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: linux-next: build warning in Linus' tree
From: David Miller @ 2010-05-26  1:54 UTC (permalink / raw)
  To: herbert; +Cc: sfr, netdev, linux-next, linux-kernel, torvalds
In-Reply-To: <20100526015110.GA21587@gondor.apana.org.au>

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 26 May 2010 11:51:10 +1000

> cls_cgroup: Initialise classid when module is absent
> 
> When the cls_cgroup module is not loaded, task_cls_classid will
> return an uninitialised classid instead of zero.
> 
> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Applied, thanks Herbert.

^ permalink raw reply

* Re: NULL Pointer Deference: NFS & Telnet
From: David Miller @ 2010-05-26  1:52 UTC (permalink / raw)
  To: x0066660-l0cyMroinI0
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-omap-u79uwXL29TY76Z2rM5mHXA, tony-4v6yS6AI5VpBDgjK7y7TUQ,
	santosh.shilimkar-l0cyMroinI0
In-Reply-To: <27F9C60D11D683428E133F85D2BB4A53043E3EDFE6-lTKHBJngVwKIQmiDNMet8wC/G2K4zDHf@public.gmane.org>

From: "Arce, Abraham" <x0066660-l0cyMroinI0@public.gmane.org>
Date: Tue, 25 May 2010 20:48:02 -0500

> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index f8abf68..eb81f76 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -334,7 +334,7 @@ static void skb_release_data(struct sk_buff *skb)
>  	if (!skb->cloned ||
>  	    !atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
>  			       &skb_shinfo(skb)->dataref)) {
> -		if (skb_shinfo(skb)->nr_frags) {
> +		if (skb_shinfo(skb)->nr_frags && skb_has_frags(skb)) {
>  			int i;
>  			for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
>  				put_page(skb_shinfo(skb)->frags[i].page);

skb_shinfo(skb)->nr_frags counts the number of entries contained
in the skb_shinfo(skb)->frags[] array.

This has nothing to do with the frag list pointer,
skb_shinfo(skb)->frag_list, which is what skb_has_frags()
tests.

You've got some kind of memory corruption going on and it
appears to have nothing to do with the code paths you're
playing with here.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: linux-next: build warning in Linus' tree
From: Herbert Xu @ 2010-05-26  1:51 UTC (permalink / raw)
  To: Stephen Rothwell; +Cc: David Miller, netdev, linux-next, linux-kernel, Linus
In-Reply-To: <20100526114306.5fa6fa0e.sfr@canb.auug.org.au>

On Wed, May 26, 2010 at 11:43:06AM +1000, Stephen Rothwell wrote:
> Hi Dave,
> 
> Today's linux-next build (x86_64 allmodconfig) produced this warning:
> 
> net/core/sock.c: In function 'sock_update_classid':
> include/net/cls_cgroup.h:42: warning: 'classid' may be used uninitialized in this function
> include/net/cls_cgroup.h:42: note: 'classid' was declared here
> 
> In the case that rcu_dereference() returns a value < 0, classid will not
> be assigned in task_cls_classid().  I don't know if this is possible - if
> not, then why is the test there?

This is a genuine bug.  I don't know why my gcc didn't warn about
it.

cls_cgroup: Initialise classid when module is absent

When the cls_cgroup module is not loaded, task_cls_classid will
return an uninitialised classid instead of zero.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/include/net/cls_cgroup.h b/include/net/cls_cgroup.h
index 6cf4486..726cc35 100644
--- a/include/net/cls_cgroup.h
+++ b/include/net/cls_cgroup.h
@@ -39,7 +39,7 @@ extern int net_cls_subsys_id;
 static inline u32 task_cls_classid(struct task_struct *p)
 {
 	int id;
-	u32 classid;
+	u32 classid = 0;
 
 	if (in_interrupt())
 		return 0;

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related

* RE: NULL Pointer Deference: NFS & Telnet
From: Arce, Abraham @ 2010-05-26  1:48 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, David Miller
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-omap-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Tony Lindgren,
	Shilimkar, Santosh
In-Reply-To: <27F9C60D11D683428E133F85D2BB4A53043E33A997-lTKHBJngVwKIQmiDNMet8wC/G2K4zDHf@public.gmane.org>

Hi,

I am able to avoid the NULL pointer dereference but not sure if the handling
is the correct one... find the patch below...

> I have 2 scenarios in which I am getting a NULL pointer dereference:
> 
> 1) root filesystem over nfs
> 2) telnet connection
> 
> The issue appeared on this commit
> 
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-
> 2.6.git;a=commit;h=f8965467f366fd18f01feafb5db10512d7b4422c
> 
> The driver I am working with is drivers/net/ks8851.c
> Any help will be highly appreciated...
>
> ---
> 
> Scenario 1 | root filesystem over nfs
> 
> Looking up port of RPC 100005/1 on 10.87.231.229
> VFS: Mounted root (nfs filesystem) on device 0:10.
> Freeing init memory: 128K
> Unable to handle kernel NULL pointer dereference at virtual address 00000000
> [..]
> PC is at put_page+0xc/0x120
> LR is at skb_release_data+0x74/0xb8
> [..]
> Backtrace:
> [<c0086dd0>] (put_page+0x0/0x120)
> [<c01d0a48>] (skb_release_data+0x0/0xb8)
> [<c01d1044>] (skb_release_all+0x0/0x20)
> [<c01d075c>] (__kfree_skb+0x0/0xbc)
> [<c01d0818>] (consume_skb+0x0/0x58)
> [<c01d39cc>] (skb_free_datagram+0x0/0x40)
> [<c023ff74>] (xs_udp_data_ready+0x0/0x1e8)
> [<c01ce034>] (sock_queue_rcv_skb+0x0/0x1c0)
> [<c01fbba8>] (ip_queue_rcv_skb+0x0/0x58)
> [<c02176c0>] (__udp_queue_rcv_skb+0x0/0x18c)
> [<c0218e28>] (udp_queue_rcv_skb+0x0/0x348)
> [<c02195a4>] (__udp4_lib_rcv+0x0/0x564)
> [<c0219b08>] (udp_rcv+0x0/0x20)
> [<c01f5f34>] (ip_local_deliver+0x0/0x264)
> [<c01f586c>] (ip_rcv+0x0/0x6c8)
> [<c01d7ec0>] (__netif_receive_skb+0x0/0x2d0)
> [<c01d8190>] (process_backlog+0x0/0x16c)
> [<c01d8e14>] (net_rx_action+0x0/0x18c)
> [<c00521a0>] (__do_softirq+0x0/0x12c)
> [<c00522cc>] (irq_exit+0x0/0x70)
> [<c0028000>] (asm_do_IRQ+0x0/0xc8)
> 
> Complete log at http://pastebin.mozilla.org/728027
> 
> ---
> 
> Scenario 2
> 
>  1. Root filesystem booted in ram
>  2. eth0 brought up
>  3. telnetd daemon started
>  4. tried to connect through telnet
> 
> # Unable to handle kernel NULL pointer dereference at virtual address 00000000
> pgd = d98e8000
> [..]
> PC is at put_page+0xc/0x120
> LR is at skb_release_data+0x74/0xb8
> [..]
> Backtrace:
> [<c0086dd0>] (put_page+0x0/0x120)
> [<c01d0a48>] (skb_release_data+0x0/0xb8)
> [<c01d1044>] (skb_release_all+0x0/0x20)
> [<c01d075c>] (__kfree_skb+0x0/0xbc)
> [<c0202444>] (tcp_recvmsg+0x0/0x93c)
> [<c02201e8>] (inet_recvmsg+0x0/0xec)
> [<c01c7fd0>] (sock_aio_read+0x0/0xf8)
> [<c00ab3ac>] (do_sync_read+0x0/0xec)
> [<c00abfbc>] (vfs_read+0x0/0x164)
> [<c00ac1a0>] (sys_read+0x0/0x70)
> [<c0029100>] (ret_fast_syscall+0x0/0x30)
> 
> Complete log at http://pastebin.mozilla.org/728028
> 

Check for NULL data in sk_buff before sending to put_page

Signed-off-by: Abraham Arce <x0066660-l0cyMroinI0@public.gmane.org>
---
 net/core/skbuff.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f8abf68..eb81f76 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -334,7 +334,7 @@ static void skb_release_data(struct sk_buff *skb)
 	if (!skb->cloned ||
 	    !atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
 			       &skb_shinfo(skb)->dataref)) {
-		if (skb_shinfo(skb)->nr_frags) {
+		if (skb_shinfo(skb)->nr_frags && skb_has_frags(skb)) {
 			int i;
 			for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
 				put_page(skb_shinfo(skb)->frags[i].page);
-- 
1.5.4.3

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* linux-next: build warning in Linus' tree
From: Stephen Rothwell @ 2010-05-26  1:43 UTC (permalink / raw)
  To: David Miller, netdev; +Cc: linux-next, linux-kernel, Linus, Herbert Xu

[-- Attachment #1: Type: text/plain, Size: 586 bytes --]

Hi Dave,

Today's linux-next build (x86_64 allmodconfig) produced this warning:

net/core/sock.c: In function 'sock_update_classid':
include/net/cls_cgroup.h:42: warning: 'classid' may be used uninitialized in this function
include/net/cls_cgroup.h:42: note: 'classid' was declared here

In the case that rcu_dereference() returns a value < 0, classid will not
be assigned in task_cls_classid().  I don't know if this is possible - if
not, then why is the test there?
-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* [PATCH] ethtool: Fix list of hash options in manual page
From: Ben Hutchings @ 2010-05-26  1:15 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Aníbal Monsalve Salazar, netdev

'p' is not a valid option.
The 'm' option was missing a preceding 'B' for bold.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
---
--- ethtool-2.6.34.orig/ethtool.8
+++ ethtool-2.6.34/ethtool.8
@@ -47,7 +47,7 @@
 .\"
 .\"	\(*HO - hash options
 .\"
-.ds HO \fBp\fP|\fm\fP|\fBv\fP|\fBt\fP|\fBs\fP|\fBd\fP|\fBf\fP|\fBn\fP|\fBr\fP...
+.ds HO \fBm\fP|\fBv\fP|\fBt\fP|\fBs\fP|\fBd\fP|\fBf\fP|\fBn\fP|\fBr\fP...
 .TH ETHTOOL 8 "July 2007" "Ethtool version 6"
 .SH NAME
 ethtool \- Display or change ethernet card settings

-- 
Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox