Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 4/5] smsc95xx: refactor entering suspend modes
From: Bjørn Mork @ 2012-11-26 13:48 UTC (permalink / raw)
  To: Steve Glendinning
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, linux-usb-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1353607526-19307-5-git-send-email-steve.glendinning-nksJyM/082jR7s880joybQ@public.gmane.org>

[adding linux-usb to CC as this is very USB specific]

Steve Glendinning <steve.glendinning-nksJyM/082jR7s880joybQ@public.gmane.org> writes:

> +	smsc95xx_set_feature(dev, USB_DEVICE_REMOTE_WAKEUP);

That does look a bit strange to me.  This is a USB interface driver.
The USB device is handled by the generic "usb" USB device driver, which
will DTRT for you.  I don't think you need to set any USB device
features here.

Sorry for not commenting on this earlier.... It took me a while to
understand why that part surprised me.

Bjørn
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH] irda: irttp: fix memory leak in irttp_open_tsap() error path
From: Tommi Rantala @ 2012-11-26 14:16 UTC (permalink / raw)
  To: netdev; +Cc: Samuel Ortiz, David S. Miller, Dave Jones, Tommi Rantala

Cleanup the memory we allocated earlier in irttp_open_tsap() when we hit
this error path. The leak goes back to at least 1da177e4
("Linux-2.6.12-rc2").

Discovered with Trinity (the syscall fuzzer).

Signed-off-by: Tommi Rantala <tt.rantala@gmail.com>
---
 net/irda/irttp.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/net/irda/irttp.c b/net/irda/irttp.c
index 1002e33..ae43c62 100644
--- a/net/irda/irttp.c
+++ b/net/irda/irttp.c
@@ -441,6 +441,7 @@ struct tsap_cb *irttp_open_tsap(__u8 stsap_sel, int credit, notify_t *notify)
 	lsap = irlmp_open_lsap(stsap_sel, &ttp_notify, 0);
 	if (lsap == NULL) {
 		IRDA_DEBUG(0, "%s: unable to allocate LSAP!!\n", __func__);
+		__irttp_close_tsap(self);
 		return NULL;
 	}

-- 
1.7.9.5

^ permalink raw reply related

* Re: [PATCH v2] atm: br2684: Fix excessive queue bloat
From: Jesper Dangaard Brouer @ 2012-11-26 14:16 UTC (permalink / raw)
  To: David Woodhouse
  Cc: netdev, John Crispin, Dave Täht, Chas Williams (CONTRACTOR),
	Jesper Brouer
In-Reply-To: <1353881212.26346.303.camel@shinybook.infradead.org>


Nice work David Woodhouse.  What OpenWRT supported box have this 
hardware? (I want one so I can play with it ;-))

Cheers,
   Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------

^ permalink raw reply

* Re: [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems
From: Jesper Dangaard Brouer @ 2012-11-26 14:42 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Florian Westphal, netdev, Pablo Neira Ayuso,
	Thomas Graf, Cong Wang, Patrick McHardy, Paul E. McKenney,
	Herbert Xu
In-Reply-To: <1353859891.30446.634.camel@edumazet-glaptop>

On Sun, 2012-11-25 at 08:11 -0800, Eric Dumazet wrote:
> On Sun, 2012-11-25 at 09:53 +0100, Jesper Dangaard Brouer wrote:
> 
> > Yes, for the default large 64k packets size, its just a "fake"
> > benchmark.  And notice with my fixes, we are even faster than the
> > none-frag/single-UDP packet case... but its because we are getting a
> > GSO/GRO effect.
> 
> Could you elaborate on this GSO/GRO effect ?

On the big system, I saw none-frag UDP (1472 bytes) throughput of:
  7356.57 + 7351.78 + 7330.60 + 7269.26 = 29308.21 Mbit/s

While with UDP fragments size 65507 bytes I saw:
  9228.75 + 9207.81 + 9615.83 + 9615.87 = 37668.26 Mbit/s

Fragmented UDP is faster by:
 37668.26 - 29308.21 = 8360.05 Mbit/s

The 65507 bytes UDP size is just a benchmark test, and have no real-life
relevance.  As performance starts to drop (below none-frag/normal case)
when the frag size is decreased, to more realistic sizes...


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH] sctp: fix memory leak in sctp_datamsg_from_user() when copy from user space fails
From: Vlad Yasevich @ 2012-11-26 14:52 UTC (permalink / raw)
  To: Tommi Rantala
  Cc: linux-sctp, netdev, Neil Horman, Sridhar Samudrala,
	David S. Miller, Dave Jones
In-Reply-To: <1353590491-12166-1-git-send-email-tt.rantala@gmail.com>

On 11/22/2012 08:21 AM, Tommi Rantala wrote:
> Trinity (the syscall fuzzer) discovered a memory leak in SCTP,
> reproducible e.g. with the sendto() syscall by passing invalid
> user space pointer in the second argument:
>
>   #include <string.h>
>   #include <arpa/inet.h>
>   #include <sys/socket.h>
>
>   int main(void)
>   {
>           int fd;
>           struct sockaddr_in sa;
>
>           fd = socket(AF_INET, SOCK_STREAM, 132 /*IPPROTO_SCTP*/);
>           if (fd < 0)
>                   return 1;
>
>           memset(&sa, 0, sizeof(sa));
>           sa.sin_family = AF_INET;
>           sa.sin_addr.s_addr = inet_addr("127.0.0.1");
>           sa.sin_port = htons(11111);
>
>           sendto(fd, NULL, 1, 0, (struct sockaddr *)&sa, sizeof(sa));
>
>           return 0;
>   }
>
> As far as I can tell, the leak has been around since ~2003.
>
> Signed-off-by: Tommi Rantala <tt.rantala@gmail.com>
> ---
>   net/sctp/chunk.c |    7 +++++--
>   1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c
> index 7c2df9c..d241ef5 100644
> --- a/net/sctp/chunk.c
> +++ b/net/sctp/chunk.c
> @@ -284,7 +284,7 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
>   			goto errout;
>   		err = sctp_user_addto_chunk(chunk, offset, len, msgh->msg_iov);
>   		if (err < 0)
> -			goto errout;
> +			goto errout_chunk_put;
>
>   		offset += len;
>
> @@ -324,7 +324,7 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
>   		__skb_pull(chunk->skb, (__u8 *)chunk->chunk_hdr
>   			   - (__u8 *)chunk->skb->data);
>   		if (err < 0)
> -			goto errout;
> +			goto errout_chunk_put;
>
>   		sctp_datamsg_assign(msg, chunk);
>   		list_add_tail(&chunk->frag_list, &msg->chunks);
> @@ -332,6 +332,9 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
>
>   	return msg;
>
> +errout_chunk_put:
> +	sctp_chunk_put(chunk);
> +
>   errout:
>   	list_for_each_safe(pos, temp, &msg->chunks) {
>   		list_del_init(pos);
>

Should be using sctp_chunk_free().  Good find.

-vlad

^ permalink raw reply

* Re: [PATCH] sctp: fix -ENOMEM result with invalid user space pointer in sendto() syscall
From: Vlad Yasevich @ 2012-11-26 14:56 UTC (permalink / raw)
  To: Tommi Rantala
  Cc: linux-sctp, netdev, Neil Horman, Sridhar Samudrala,
	David S. Miller, Dave Jones
In-Reply-To: <1353590596-12216-1-git-send-email-tt.rantala@gmail.com>

On 11/22/2012 08:23 AM, Tommi Rantala wrote:
> Consider the following program, that sets the second argument to the
> sendto() syscall incorrectly:
>
>   #include <string.h>
>   #include <arpa/inet.h>
>   #include <sys/socket.h>
>
>   int main(void)
>   {
>           int fd;
>           struct sockaddr_in sa;
>
>           fd = socket(AF_INET, SOCK_STREAM, 132 /*IPPROTO_SCTP*/);
>           if (fd < 0)
>                   return 1;
>
>           memset(&sa, 0, sizeof(sa));
>           sa.sin_family = AF_INET;
>           sa.sin_addr.s_addr = inet_addr("127.0.0.1");
>           sa.sin_port = htons(11111);
>
>           sendto(fd, NULL, 1, 0, (struct sockaddr *)&sa, sizeof(sa));
>
>           return 0;
>   }
>
> We get -ENOMEM:
>
>   $ strace -e sendto ./demo
>   sendto(3, NULL, 1, 0, {sa_family=AF_INET, sin_port=htons(11111), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 ENOMEM (Cannot allocate memory)
>
> Propagate the error code from sctp_user_addto_chunk(), so that we will
> tell user space what actually went wrong:
>
>   $ strace -e sendto ./demo
>   sendto(3, NULL, 1, 0, {sa_family=AF_INET, sin_port=htons(11111), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EFAULT (Bad address)
>
> Noticed while running Trinity (the syscall fuzzer).
>
> Signed-off-by: Tommi Rantala <tt.rantala@gmail.com>

Looks good

Acked-by: Vlad Yasevich <vyasevich@gmail.com>

-vlad

> ---
>   net/sctp/chunk.c  |   13 +++++++++----
>   net/sctp/socket.c |    4 ++--
>   2 files changed, 11 insertions(+), 6 deletions(-)
>
> diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c
> index d241ef5..3952ca9 100644
> --- a/net/sctp/chunk.c
> +++ b/net/sctp/chunk.c
> @@ -183,7 +183,7 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
>
>   	msg = sctp_datamsg_new(GFP_KERNEL);
>   	if (!msg)
> -		return NULL;
> +		return ERR_PTR(-ENOMEM);
>
>   	/* Note: Calculate this outside of the loop, so that all fragments
>   	 * have the same expiration.
> @@ -280,8 +280,11 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
>
>   		chunk = sctp_make_datafrag_empty(asoc, sinfo, len, frag, 0);
>
> -		if (!chunk)
> +		if (!chunk) {
> +			err = -ENOMEM;
>   			goto errout;
> +		}
> +
>   		err = sctp_user_addto_chunk(chunk, offset, len, msgh->msg_iov);
>   		if (err < 0)
>   			goto errout_chunk_put;
> @@ -315,8 +318,10 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
>
>   		chunk = sctp_make_datafrag_empty(asoc, sinfo, over, frag, 0);
>
> -		if (!chunk)
> +		if (!chunk) {
> +			err = -ENOMEM;
>   			goto errout;
> +		}
>
>   		err = sctp_user_addto_chunk(chunk, offset, over,msgh->msg_iov);
>
> @@ -342,7 +347,7 @@ errout:
>   		sctp_chunk_free(chunk);
>   	}
>   	sctp_datamsg_put(msg);
> -	return NULL;
> +	return ERR_PTR(err);
>   }
>
>   /* Check whether this message has expired. */
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index a60d1f8..406d957 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -1915,8 +1915,8 @@ SCTP_STATIC int sctp_sendmsg(struct kiocb *iocb, struct sock *sk,
>
>   	/* Break the message into multiple chunks of maximum size. */
>   	datamsg = sctp_datamsg_from_user(asoc, sinfo, msg, msg_len);
> -	if (!datamsg) {
> -		err = -ENOMEM;
> +	if (IS_ERR(datamsg)) {
> +		err = PTR_ERR(datamsg);
>   		goto out_free;
>   	}
>
>

^ permalink raw reply

* Re: [PATCH net-next] gro: Handle inline VLAN tags
From: Andrew Gallatin @ 2012-11-26 15:04 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, bhutchings, netdev, linux-net-drivers, herbert
In-Reply-To: <20121119.191002.1995098917961576324.davem@davemloft.net>

On 11/19/12 19:10, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Fri, 16 Nov 2012 17:09:19 -0800
>
>> On Sat, 2012-11-17 at 00:32 +0000, Ben Hutchings wrote:
>>> On Fri, 2012-11-16 at 16:16 -0800, Eric Dumazet wrote:
>>>> On Sat, 2012-11-17 at 00:00 +0000, Ben Hutchings wrote:
>>>>
>>>>> I'm not sure what you mean by this.  Is your point that the
>>>>> copy-on-write is never needed?  It is still possible for pskb_may_pull()
>>>>> to fail.
>>>>>
>>>>
>>>> A packet sniffer should have a copy of bad frames, even if dropped later
>>>> in our stacks.
>>>>
>>>> GRO layer is not allowed to drop a frame, even if not 'correct'.
>>>
>>> What do you think the accelerated hardware does with frames that have a
>>> truncated VLAN tag?
>>
>> The hardware should send us the frame, exactly like when RX checksum is
>> wrong.
>
> I agree with Eric, and therefore will not apply this patch.
>

David,

How do you feel about the patchset I posted on 11/14/2012
([PATCH net-next 0/3] myri10ge: LRO to GRO conversion,
http://marc.info/?l=linux-netdev&m=135289838223920&w=2)
which moves myri10ge from LRO to GRO?

Specifically, if doing vlan decap in GRO is not OK, then how
about doing it in the driver?

BTW, if I have bungled something it the myri10ge patchset submission,
I do apologize. I don't frequently submit patches, and it is likely
I screwed up some convention..

Thanks,

Drew

^ permalink raw reply

* Re: [PATCH v2 net-next] sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name
From: Brian Haley @ 2012-11-26 15:10 UTC (permalink / raw)
  To: David Miller; +Cc: xemul, eric.dumazet, netdev
In-Reply-To: <20121120.135842.249477087130415954.davem@davemloft.net>

On 11/20/2012 01:58 PM, David Miller wrote:
>> v2: Added seqlock protection while copying device name.
>>
>> Signed-off-by: Brian Haley <brian.haley@hp.com>
> 
> Brian I was going to apply this, but something about how you email
> patches results in them being corrupted.
> 
> Go to:
> 
> http://patchwork.ozlabs.org/patch/199732/
> 
> Click on Download "mbox", and try to apply that to the net-next tree
> to see what I mean.

I'll take a look why that was wrapping and send a v3, been away...

-Brian

^ permalink raw reply

* Re: [PATCH net-next v2] net: clean up locking in inet_frag_find()
From: Eric Dumazet @ 2012-11-26 15:12 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev, Patrick McHardy, Pablo Neira Ayuso, David S. Miller
In-Reply-To: <1353914786-10426-1-git-send-email-amwang@redhat.com>

On Mon, 2012-11-26 at 15:26 +0800, Cong Wang wrote:
> It is weird to take the read lock outside of inet_frag_find()
> but release it inside...  This can be improved by refactoring
> the code, that is, introducing inet{4,6}_frag_find() which call
> the their own hash function, inet{4,6}_hash_frag(), hiding the
> details from their callers.
> 
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Patrick McHardy <kaber@trash.net>
> Cc: Pablo Neira Ayuso <pablo@netfilter.org>
> Cc: David S. Miller <davem@davemloft.net>
> Signed-off-by: Cong Wang <amwang@redhat.com>
> 
> ---
>  include/net/inet_frag.h                 |   17 +++++-
>  include/net/ipv6.h                      |    3 -
>  net/ipv4/inet_fragment.c                |   82 +++++++++++++++++++++++++++++--
>  net/ipv4/ip_fragment.c                  |   16 +-----
>  net/ipv6/netfilter/nf_conntrack_reasm.c |    7 +--
>  net/ipv6/reassembly.c                   |   34 +------------
>  6 files changed, 97 insertions(+), 62 deletions(-)

Not clear to me its a win, as it adds 35 LOC. Nobody really complained
of this locking schem in the past.

Also Jesper is working on this stuff, so you dont really ease its work.

^ permalink raw reply

* Re: [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems
From: Eric Dumazet @ 2012-11-26 15:15 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David S. Miller, Florian Westphal, netdev, Pablo Neira Ayuso,
	Thomas Graf, Cong Wang, Patrick McHardy, Paul E. McKenney,
	Herbert Xu
In-Reply-To: <1353940930.11754.221.camel@localhost>

On Mon, 2012-11-26 at 15:42 +0100, Jesper Dangaard Brouer wrote:
> On Sun, 2012-11-25 at 08:11 -0800, Eric Dumazet wrote:
> > On Sun, 2012-11-25 at 09:53 +0100, Jesper Dangaard Brouer wrote:
> > 
> > > Yes, for the default large 64k packets size, its just a "fake"
> > > benchmark.  And notice with my fixes, we are even faster than the
> > > none-frag/single-UDP packet case... but its because we are getting a
> > > GSO/GRO effect.
> > 
> > Could you elaborate on this GSO/GRO effect ?
> 
> On the big system, I saw none-frag UDP (1472 bytes) throughput of:
>   7356.57 + 7351.78 + 7330.60 + 7269.26 = 29308.21 Mbit/s
> 
> While with UDP fragments size 65507 bytes I saw:
>   9228.75 + 9207.81 + 9615.83 + 9615.87 = 37668.26 Mbit/s
> 
> Fragmented UDP is faster by:
>  37668.26 - 29308.21 = 8360.05 Mbit/s
> 
> The 65507 bytes UDP size is just a benchmark test, and have no real-life
> relevance.  As performance starts to drop (below none-frag/normal case)
> when the frag size is decreased, to more realistic sizes...

Yes, but I doubt GRO / GSO are the reason you get better performance.

GRO doesnt aggregate UDP frames.

^ permalink raw reply

* Re: [PATCH RFC 0/5] Containerize syslog
From: Eric W. Biederman @ 2012-11-26 15:16 UTC (permalink / raw)
  To: Rui Xiang; +Cc: Serge E. Hallyn, serge.hallyn, containers, netdev
In-Reply-To: <50ACA05F.7080005@gmail.com>

Rui Xiang <leo.ruixiang@gmail.com> writes:

> On 2012-11-19 22:37, Serge E. Hallyn wrote:

>> I understand that user namespaces aren't 100% usable yet, but looking
>> long term, is there a reason to have the syslog namespace separate
>> from user namespace?
>
> Actually we don't have strong preference. We'll think more about it. Hope we can make
> consensus with Eric.

I hope I am not hard to work with.  My primary concern is reasonable
looking code and good long term maintainable semantics.

I really don't care in which namespace where we file the kernel log
statements.

I care much more about which kernel log print statements we want filed
differently.

Eric

^ permalink raw reply

* [PATCH v3 net-next] sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name
From: Brian Haley @ 2012-11-26 15:21 UTC (permalink / raw)
  To: David Miller; +Cc: Pavel Emelyanov, Eric Dumazet, netdev@vger.kernel.org

Instead of having the getsockopt() of SO_BINDTODEVICE return an index, which
will then require another call like if_indextoname() to get the actual interface
name, have it return the name directly.

This also matches the existing man page description on socket(7) which mentions
the argument being an interface name.

If the value has not been set, zero is returned and optlen will be set to zero
to indicate there is no interface name present.

Added a seqlock to protect this code path, and dev_ifname(), from someone
changing the device name via dev_change_name().

v2: Added seqlock protection while copying device name.

v3: Fixed word wrap in patch.

Signed-off-by: Brian Haley <brian.haley@hp.com>

--

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e46c830..e9929ab 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1567,6 +1567,8 @@ extern int call_netdevice_notifiers(unsigned long val, struct net_device *dev);
 
 extern rwlock_t				dev_base_lock;		/* Device list lock */
 
+extern seqlock_t	devnet_rename_seq;	/* Device rename lock */
+
 
 #define for_each_netdev(net, d)		\
 		list_for_each_entry(d, &(net)->dev_base_head, dev_list)
diff --git a/net/core/dev.c b/net/core/dev.c
index 7304ea8..2a5f558 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -203,6 +203,8 @@ static struct list_head offload_base __read_mostly;
 DEFINE_RWLOCK(dev_base_lock);
 EXPORT_SYMBOL(dev_base_lock);
 
+DEFINE_SEQLOCK(devnet_rename_seq);
+
 static inline void dev_base_seq_inc(struct net *net)
 {
 	while (++net->dev_base_seq == 0);
@@ -1091,22 +1093,31 @@ int dev_change_name(struct net_device *dev, const char *newname)
 	if (dev->flags & IFF_UP)
 		return -EBUSY;
 
-	if (strncmp(newname, dev->name, IFNAMSIZ) == 0)
+	write_seqlock(&devnet_rename_seq);
+
+	if (strncmp(newname, dev->name, IFNAMSIZ) == 0) {
+		write_sequnlock(&devnet_rename_seq);
 		return 0;
+	}
 
 	memcpy(oldname, dev->name, IFNAMSIZ);
 
 	err = dev_get_valid_name(net, dev, newname);
-	if (err < 0)
+	if (err < 0) {
+		write_sequnlock(&devnet_rename_seq);
 		return err;
+	}
 
 rollback:
 	ret = device_rename(&dev->dev, dev->name);
 	if (ret) {
 		memcpy(dev->name, oldname, IFNAMSIZ);
+		write_sequnlock(&devnet_rename_seq);
 		return ret;
 	}
 
+	write_sequnlock(&devnet_rename_seq);
+
 	write_lock_bh(&dev_base_lock);
 	hlist_del_rcu(&dev->name_hlist);
 	write_unlock_bh(&dev_base_lock);
@@ -1124,6 +1135,7 @@ rollback:
 		/* err >= 0 after dev_alloc_name() or stores the first errno */
 		if (err >= 0) {
 			err = ret;
+			write_seqlock(&devnet_rename_seq);
 			memcpy(dev->name, oldname, IFNAMSIZ);
 			goto rollback;
 		} else {
@@ -4148,6 +4160,7 @@ static int dev_ifname(struct net *net, struct ifreq __user *arg)
 {
 	struct net_device *dev;
 	struct ifreq ifr;
+	unsigned seq;
 
 	/*
 	 *	Fetch the caller's info block.
@@ -4156,6 +4169,8 @@ static int dev_ifname(struct net *net, struct ifreq __user *arg)
 	if (copy_from_user(&ifr, arg, sizeof(struct ifreq)))
 		return -EFAULT;
 
+retry:
+	seq = read_seqbegin(&devnet_rename_seq);
 	rcu_read_lock();
 	dev = dev_get_by_index_rcu(net, ifr.ifr_ifindex);
 	if (!dev) {
@@ -4165,6 +4180,8 @@ static int dev_ifname(struct net *net, struct ifreq __user *arg)
 
 	strcpy(ifr.ifr_name, dev->name);
 	rcu_read_unlock();
+	if (read_seqretry(&devnet_rename_seq, seq))
+		goto retry;
 
 	if (copy_to_user(arg, &ifr, sizeof(struct ifreq)))
 		return -EFAULT;
diff --git a/net/core/sock.c b/net/core/sock.c
index d4f7b58..a692ef4 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -505,7 +505,8 @@ struct dst_entry *sk_dst_check(struct sock *sk, u32 cookie)
 }
 EXPORT_SYMBOL(sk_dst_check);
 
-static int sock_bindtodevice(struct sock *sk, char __user *optval, int optlen)
+static int sock_setbindtodevice(struct sock *sk, char __user *optval,
+				int optlen)
 {
 	int ret = -ENOPROTOOPT;
 #ifdef CONFIG_NETDEVICES
@@ -562,6 +563,59 @@ out:
 	return ret;
 }
 
+static int sock_getbindtodevice(struct sock *sk, char __user *optval,
+				int __user *optlen, int len)
+{
+	int ret = -ENOPROTOOPT;
+#ifdef CONFIG_NETDEVICES
+	struct net *net = sock_net(sk);
+	struct net_device *dev;
+	char devname[IFNAMSIZ];
+	unsigned seq;
+
+	if (sk->sk_bound_dev_if == 0) {
+		len = 0;
+		goto zero;
+	}
+
+	ret = -EINVAL;
+	if (len < IFNAMSIZ)
+		goto out;
+
+retry:
+	seq = read_seqbegin(&devnet_rename_seq);
+	rcu_read_lock();
+	dev = dev_get_by_index_rcu(net, sk->sk_bound_dev_if);
+	ret = -ENODEV;
+	if (!dev) {
+		rcu_read_unlock();
+		goto out;
+	}
+
+	strcpy(devname, dev->name);
+	rcu_read_unlock();
+	if (read_seqretry(&devnet_rename_seq, seq))
+		goto retry;
+
+	len = strlen(devname) + 1;
+
+	ret = -EFAULT;
+	if (copy_to_user(optval, devname, len))
+		goto out;
+
+zero:
+	ret = -EFAULT;
+	if (put_user(len, optlen))
+		goto out;
+
+	ret = 0;
+
+out:
+#endif
+
+	return ret;
+}
+
 static inline void sock_valbool_flag(struct sock *sk, int bit, int valbool)
 {
 	if (valbool)
@@ -589,7 +643,7 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 	 */
 
 	if (optname == SO_BINDTODEVICE)
-		return sock_bindtodevice(sk, optval, optlen);
+		return sock_setbindtodevice(sk, optval, optlen);
 
 	if (optlen < sizeof(int))
 		return -EINVAL;
@@ -1075,15 +1129,17 @@ int sock_getsockopt(struct socket *sock, int level, int optname,
 	case SO_NOFCS:
 		v.val = sock_flag(sk, SOCK_NOFCS);
 		break;
+
 	case SO_BINDTODEVICE:
-		v.val = sk->sk_bound_dev_if;
-		break;
+		return sock_getbindtodevice(sk, optval, optlen, len);
+
 	case SO_GET_FILTER:
 		len = sk_get_filter(sk, (struct sock_filter __user *)optval, len);
 		if (len < 0)
 			return len;
 
 		goto lenout;
+
 	default:
 		return -ENOPROTOOPT;
 	}

^ permalink raw reply related

* Re: [PATCH] sctp: fix memory leak in sctp_datamsg_from_user() when copy from user space fails
From: Neil Horman @ 2012-11-26 15:23 UTC (permalink / raw)
  To: Tommi Rantala
  Cc: linux-sctp, netdev, Vlad Yasevich, Sridhar Samudrala,
	David S. Miller, Dave Jones
In-Reply-To: <1353590491-12166-1-git-send-email-tt.rantala@gmail.com>

On Thu, Nov 22, 2012 at 03:21:31PM +0200, Tommi Rantala wrote:
> Trinity (the syscall fuzzer) discovered a memory leak in SCTP,
> reproducible e.g. with the sendto() syscall by passing invalid
> user space pointer in the second argument:
> 
>  #include <string.h>
>  #include <arpa/inet.h>
>  #include <sys/socket.h>
> 
>  int main(void)
>  {
>          int fd;
>          struct sockaddr_in sa;
> 
>          fd = socket(AF_INET, SOCK_STREAM, 132 /*IPPROTO_SCTP*/);
>          if (fd < 0)
>                  return 1;
> 
>          memset(&sa, 0, sizeof(sa));
>          sa.sin_family = AF_INET;
>          sa.sin_addr.s_addr = inet_addr("127.0.0.1");
>          sa.sin_port = htons(11111);
> 
>          sendto(fd, NULL, 1, 0, (struct sockaddr *)&sa, sizeof(sa));
> 
>          return 0;
>  }
> 
> As far as I can tell, the leak has been around since ~2003.
> 
> Signed-off-by: Tommi Rantala <tt.rantala@gmail.com>
> ---
>  net/sctp/chunk.c |    7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c
> index 7c2df9c..d241ef5 100644
> --- a/net/sctp/chunk.c
> +++ b/net/sctp/chunk.c
> @@ -284,7 +284,7 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
>  			goto errout;
>  		err = sctp_user_addto_chunk(chunk, offset, len, msgh->msg_iov);
>  		if (err < 0)
> -			goto errout;
> +			goto errout_chunk_put;
>  
>  		offset += len;
>  
> @@ -324,7 +324,7 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
>  		__skb_pull(chunk->skb, (__u8 *)chunk->chunk_hdr
>  			   - (__u8 *)chunk->skb->data);
>  		if (err < 0)
> -			goto errout;
> +			goto errout_chunk_put;
>  
>  		sctp_datamsg_assign(msg, chunk);
>  		list_add_tail(&chunk->frag_list, &msg->chunks);
> @@ -332,6 +332,9 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
>  
>  	return msg;
>  
> +errout_chunk_put:
> +	sctp_chunk_put(chunk);
> +
>  errout:
>  	list_for_each_safe(pos, temp, &msg->chunks) {
>  		list_del_init(pos);
> -- 
> 1.7.9.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

I'm fine with it the way it is, but it might be nicer if you instead just moved
the list_add_tail call up between the if (!chunk) check and the
sctp_user_addto_chunk call.  That way the unwind loop at the errout label can
just free the chunk without the need for an extra label.

Neil

^ permalink raw reply

* Re: [PATCH] sctp: fix -ENOMEM result with invalid user space pointer in sendto() syscall
From: Neil Horman @ 2012-11-26 15:25 UTC (permalink / raw)
  To: Tommi Rantala
  Cc: linux-sctp, netdev, Vlad Yasevich, Sridhar Samudrala,
	David S. Miller, Dave Jones
In-Reply-To: <1353590596-12216-1-git-send-email-tt.rantala@gmail.com>

On Thu, Nov 22, 2012 at 03:23:16PM +0200, Tommi Rantala wrote:
> Consider the following program, that sets the second argument to the
> sendto() syscall incorrectly:
> 
>  #include <string.h>
>  #include <arpa/inet.h>
>  #include <sys/socket.h>
> 
>  int main(void)
>  {
>          int fd;
>          struct sockaddr_in sa;
> 
>          fd = socket(AF_INET, SOCK_STREAM, 132 /*IPPROTO_SCTP*/);
>          if (fd < 0)
>                  return 1;
> 
>          memset(&sa, 0, sizeof(sa));
>          sa.sin_family = AF_INET;
>          sa.sin_addr.s_addr = inet_addr("127.0.0.1");
>          sa.sin_port = htons(11111);
> 
>          sendto(fd, NULL, 1, 0, (struct sockaddr *)&sa, sizeof(sa));
> 
>          return 0;
>  }
> 
> We get -ENOMEM:
> 
>  $ strace -e sendto ./demo
>  sendto(3, NULL, 1, 0, {sa_family=AF_INET, sin_port=htons(11111), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 ENOMEM (Cannot allocate memory)
> 
> Propagate the error code from sctp_user_addto_chunk(), so that we will
> tell user space what actually went wrong:
> 
>  $ strace -e sendto ./demo
>  sendto(3, NULL, 1, 0, {sa_family=AF_INET, sin_port=htons(11111), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EFAULT (Bad address)
> 
> Noticed while running Trinity (the syscall fuzzer).
> 
> Signed-off-by: Tommi Rantala <tt.rantala@gmail.com>
> ---
>  net/sctp/chunk.c  |   13 +++++++++----
>  net/sctp/socket.c |    4 ++--
>  2 files changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c
> index d241ef5..3952ca9 100644
> --- a/net/sctp/chunk.c
> +++ b/net/sctp/chunk.c
> @@ -183,7 +183,7 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
>  
>  	msg = sctp_datamsg_new(GFP_KERNEL);
>  	if (!msg)
> -		return NULL;
> +		return ERR_PTR(-ENOMEM);
>  
>  	/* Note: Calculate this outside of the loop, so that all fragments
>  	 * have the same expiration.
> @@ -280,8 +280,11 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
>  
>  		chunk = sctp_make_datafrag_empty(asoc, sinfo, len, frag, 0);
>  
> -		if (!chunk)
> +		if (!chunk) {
> +			err = -ENOMEM;
>  			goto errout;
> +		}
> +
>  		err = sctp_user_addto_chunk(chunk, offset, len, msgh->msg_iov);
>  		if (err < 0)
>  			goto errout_chunk_put;
> @@ -315,8 +318,10 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
>  
>  		chunk = sctp_make_datafrag_empty(asoc, sinfo, over, frag, 0);
>  
> -		if (!chunk)
> +		if (!chunk) {
> +			err = -ENOMEM;
>  			goto errout;
> +		}
>  
>  		err = sctp_user_addto_chunk(chunk, offset, over,msgh->msg_iov);
>  
> @@ -342,7 +347,7 @@ errout:
>  		sctp_chunk_free(chunk);
>  	}
>  	sctp_datamsg_put(msg);
> -	return NULL;
> +	return ERR_PTR(err);
>  }
>  
>  /* Check whether this message has expired. */
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index a60d1f8..406d957 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -1915,8 +1915,8 @@ SCTP_STATIC int sctp_sendmsg(struct kiocb *iocb, struct sock *sk,
>  
>  	/* Break the message into multiple chunks of maximum size. */
>  	datamsg = sctp_datamsg_from_user(asoc, sinfo, msg, msg_len);
> -	if (!datamsg) {
> -		err = -ENOMEM;
> +	if (IS_ERR(datamsg)) {
> +		err = PTR_ERR(datamsg);
>  		goto out_free;
>  	}
>  
> -- 
> 1.7.9.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
Acked-by: Neil Horman <nhorman@tuxdriver.com>

^ permalink raw reply

* performance regression on HiperSockets depending on MTU size
From: Frank Blaschka @ 2012-11-26 15:32 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, linux-s390

Hi Eric,

since kernel 3.6 we see a massive performance regression on s390
HiperSockets devices.

HiperSockets differ from normal devices by the fact they support
large MTU sizes (up to 56K). Here are some iperf numbers to show
the problem depended on MTU size:

# ifconfig hsi0 mtu 1500
# iperf -c 10.42.49.2
------------------------------------------------------------
Client connecting to 10.42.49.2, TCP port 5001
TCP window size: 47.6 KByte (default)
------------------------------------------------------------
[  3] local 10.42.49.1 port 55855 connected with 10.42.49.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec    632 MBytes    530 Mbits/sec

# ifconfig hsi0 mtu 9000
# iperf -c 10.42.49.2
------------------------------------------------------------
Client connecting to 10.42.49.2, TCP port 5001
TCP window size: 97.0 KByte (default)
------------------------------------------------------------
[  3] local 10.42.49.1 port 55856 connected with 10.42.49.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  2.26 GBytes  1.94 Gbits/sec

# ifconfig hsi0 mtu 32000
# iperf -c 10.42.49.2
------------------------------------------------------------
Client connecting to 10.42.49.2, TCP port 5001
TCP window size:   322 KByte (default)
------------------------------------------------------------
[  3] local 10.42.49.1 port 55857 connected with 10.42.49.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.3 sec  3.12 MBytes  2.53 Mbits/sec

Prior the regression throughput grows with the MTU size but now it drops
to a few Mbits if the MTU is bigger then 15000. It is interesting to see
if 2 or more connections are running in parallel the regression is gone.

# ifconfig hsi0 mtu 32000
# iperf -c 10.42.49.2 -P2
------------------------------------------------------------
Client connecting to 10.42.49.2, TCP port 5001
TCP window size:   322 KByte (default)
------------------------------------------------------------
[  4] local 10.42.49.1 port 55869 connected with 10.42.49.2 port 5001
[  3] local 10.42.49.1 port 55868 connected with 10.42.49.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  2.19 GBytes  1.88 Gbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  2.17 GBytes  1.87 Gbits/sec
[SUM]  0.0-10.0 sec  4.36 GBytes  3.75 Gbits/sec

I bisected the problem to following patch:

commit 46d3ceabd8d98ed0ad10f20c595ca784e34786c5
Author: Eric Dumazet <eric.dumazet@gmail.com>
Date:   Wed Jul 11 05:50:31 2012 +0000

    tcp: TCP Small Queues

    This introduce TSQ (TCP Small Queues)

    TSQ goal is to reduce number of TCP packets in xmit queues (qdisc &
    device queues), to reduce RTT and cwnd bias, part of the bufferbloat
    problem.

Changing sysctl net.ipv4.tcp_limit_output_bytes to a higher value
(e.g. 640000) seems to fix the problem.

How does MTU influence/effects TSQ?
Why is the problem gone if there are more connections?
Do you see any drawbacks by increasing net.ipv4.tcp_limit_output_bytes?
Finally is this expected behavior or is there a bug depending on the big
MTU? What can I do to check ... ?

Thx for your help

Frank

^ permalink raw reply

* [PATCH] vhost: fix length for cross region descriptor
From: Michael S. Tsirkin @ 2012-11-26 15:57 UTC (permalink / raw)
  To: netdev, David Miller; +Cc: Jason Wang, linux-kernel

If a single descriptor crosses a region, the
second chunk length should be decremented
by size translated so far, instead it includes
the full descriptor length.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/vhost/vhost.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index ef8f598..5a3d0f1 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1049,7 +1049,7 @@ static int translate_desc(struct vhost_dev *dev, u64 addr, u32 len,
 		}
 		_iov = iov + ret;
 		size = reg->memory_size - addr + reg->guest_phys_addr;
-		_iov->iov_len = min((u64)len, size);
+		_iov->iov_len = min((u64)len - s, size);
 		_iov->iov_base = (void __user *)(unsigned long)
 			(reg->userspace_addr + addr - reg->guest_phys_addr);
 		s += size;
-- 
MST

^ permalink raw reply related

* Re: [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems
From: Jesper Dangaard Brouer @ 2012-11-26 15:29 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Florian Westphal, netdev, Pablo Neira Ayuso,
	Thomas Graf, Cong Wang, Patrick McHardy, Paul E. McKenney,
	Herbert Xu
In-Reply-To: <1353942950.30446.1772.camel@edumazet-glaptop>

On Mon, 2012-11-26 at 07:15 -0800, Eric Dumazet wrote:
> On Mon, 2012-11-26 at 15:42 +0100, Jesper Dangaard Brouer wrote:
> > On Sun, 2012-11-25 at 08:11 -0800, Eric Dumazet wrote:
> > > On Sun, 2012-11-25 at 09:53 +0100, Jesper Dangaard Brouer wrote:
> > > 
> > > > Yes, for the default large 64k packets size, its just a "fake"
> > > > benchmark.  And notice with my fixes, we are even faster than the
> > > > none-frag/single-UDP packet case... but its because we are getting a
> > > > GSO/GRO effect.
> > > 
> > > Could you elaborate on this GSO/GRO effect ?
> > 
> > On the big system, I saw none-frag UDP (1472 bytes) throughput of:
> >   7356.57 + 7351.78 + 7330.60 + 7269.26 = 29308.21 Mbit/s
> > 
> > While with UDP fragments size 65507 bytes I saw:
> >   9228.75 + 9207.81 + 9615.83 + 9615.87 = 37668.26 Mbit/s
> > 
> > Fragmented UDP is faster by:
> >  37668.26 - 29308.21 = 8360.05 Mbit/s
> > 
> > The 65507 bytes UDP size is just a benchmark test, and have no real-life
> > relevance.  As performance starts to drop (below none-frag/normal case)
> > when the frag size is decreased, to more realistic sizes...
> 
> Yes, but I doubt GRO / GSO are the reason you get better performance.
> GRO doesnt aggregate UDP frames.

Oh, now I think I understand your question.

I don't think GRO is helping me.  Its the same "effect" as GRO.  As (I
think) that the reasm frag SKB will be a "bigger" SKB, which is passed
to the rest of the stack.  Thus, less (but) bigger SKBs get the overhead
of the rest of the stack.  It was actually Herbert that mentioned it to
me...

--Jesper

^ permalink raw reply

* Re: performance regression on HiperSockets depending on MTU size
From: Eric Dumazet @ 2012-11-26 16:12 UTC (permalink / raw)
  To: Frank Blaschka; +Cc: netdev, linux-s390
In-Reply-To: <20121126153242.GA61652@tuxmaker.boeblingen.de.ibm.com>

On Mon, 2012-11-26 at 16:32 +0100, Frank Blaschka wrote:
> Hi Eric,
> 
> since kernel 3.6 we see a massive performance regression on s390
> HiperSockets devices.
> 
> HiperSockets differ from normal devices by the fact they support
> large MTU sizes (up to 56K). Here are some iperf numbers to show
> the problem depended on MTU size:
> 
> # ifconfig hsi0 mtu 1500
> # iperf -c 10.42.49.2
> ------------------------------------------------------------
> Client connecting to 10.42.49.2, TCP port 5001
> TCP window size: 47.6 KByte (default)
> ------------------------------------------------------------
> [  3] local 10.42.49.1 port 55855 connected with 10.42.49.2 port 5001
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0-10.0 sec    632 MBytes    530 Mbits/sec
> 
> # ifconfig hsi0 mtu 9000
> # iperf -c 10.42.49.2
> ------------------------------------------------------------
> Client connecting to 10.42.49.2, TCP port 5001
> TCP window size: 97.0 KByte (default)
> ------------------------------------------------------------
> [  3] local 10.42.49.1 port 55856 connected with 10.42.49.2 port 5001
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0-10.0 sec  2.26 GBytes  1.94 Gbits/sec
> 
> # ifconfig hsi0 mtu 32000
> # iperf -c 10.42.49.2
> ------------------------------------------------------------
> Client connecting to 10.42.49.2, TCP port 5001
> TCP window size:   322 KByte (default)
> ------------------------------------------------------------
> [  3] local 10.42.49.1 port 55857 connected with 10.42.49.2 port 5001
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0-10.3 sec  3.12 MBytes  2.53 Mbits/sec
> 
> Prior the regression throughput grows with the MTU size but now it drops
> to a few Mbits if the MTU is bigger then 15000. It is interesting to see
> if 2 or more connections are running in parallel the regression is gone.
> 
> # ifconfig hsi0 mtu 32000
> # iperf -c 10.42.49.2 -P2
> ------------------------------------------------------------
> Client connecting to 10.42.49.2, TCP port 5001
> TCP window size:   322 KByte (default)
> ------------------------------------------------------------
> [  4] local 10.42.49.1 port 55869 connected with 10.42.49.2 port 5001
> [  3] local 10.42.49.1 port 55868 connected with 10.42.49.2 port 5001
> [ ID] Interval       Transfer     Bandwidth
> [  4]  0.0-10.0 sec  2.19 GBytes  1.88 Gbits/sec
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0-10.0 sec  2.17 GBytes  1.87 Gbits/sec
> [SUM]  0.0-10.0 sec  4.36 GBytes  3.75 Gbits/sec
> 
> I bisected the problem to following patch:
> 
> commit 46d3ceabd8d98ed0ad10f20c595ca784e34786c5
> Author: Eric Dumazet <eric.dumazet@gmail.com>
> Date:   Wed Jul 11 05:50:31 2012 +0000
> 
>     tcp: TCP Small Queues
> 
>     This introduce TSQ (TCP Small Queues)
> 
>     TSQ goal is to reduce number of TCP packets in xmit queues (qdisc &
>     device queues), to reduce RTT and cwnd bias, part of the bufferbloat
>     problem.
> 
> Changing sysctl net.ipv4.tcp_limit_output_bytes to a higher value
> (e.g. 640000) seems to fix the problem.
> 
> How does MTU influence/effects TSQ?
> Why is the problem gone if there are more connections?
> Do you see any drawbacks by increasing net.ipv4.tcp_limit_output_bytes?
> Finally is this expected behavior or is there a bug depending on the big
> MTU? What can I do to check ... ?
> 

Hi Frank, thanks for this report.

You could tweak tcp_limit_output_bytes, but IMO the root of the problem
is in the driver itself.

For example, I had to change mlx4 driver for the same problem : Make
sure a TX packet can be "TX completed" in a short amount of time.

In the case of mlx4, the wait time was 128 us, but I suspect on your
case its more like an infinite time or several ms.
 
The driver is delaying the free of TX skb by a fixed amount of time,
or relies on following transmits to perform the TX completion


Check for an example :

commit ecfd2ce1a9d5e6376ff5c00b366345160abdbbb7
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Nov 5 16:20:42 2012 +0000

    mlx4: change TX coalescing defaults
    
    mlx4 currently uses a too high tx coalescing setting, deferring
    TX completion interrupts by up to 128 us.
    
    With the recent skb_orphan() removal in commit 8112ec3b872,
    performance of a single TCP flow is capped to ~4 Gbps, unless
    we increase tcp_limit_output_bytes.
    
    I suggest using 16 us instead of 128 us, allowing a finer control.
    
    Performance of a single TCP flow is restored to previous levels,
    while keeping TCP small queues fully enabled with default sysctl.
    
    This patch is also a BQL prereq.
    
    Reported-by: Vimalkumar <j.vimal@gmail.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* Re: [REGRESSION] r8169: jumbo fixes caused jumbo regressions!
From: Kirill Smelkov @ 2012-11-26 16:19 UTC (permalink / raw)
  To: Hayes Wang, Realtek linux nic maintainers
  Cc: Francois Romieu, David S. Miller, Greg Kroah-Hartman, netdev
In-Reply-To: <20121114092530.GA22323@tugrik.mns.mnsspb.ru>

On Wed, Nov 14, 2012 at 01:25:30PM +0400, Kirill Smelkov wrote:
> On Tue, Nov 13, 2012 at 11:35:12PM +0100, Francois Romieu wrote:
> > Kirill Smelkov <kirr@mns.spb.ru> :
> > [...]
> > > My test is to stream raw video from 8 PAL cameras to net - 4 for 720x576@25 and
> > > 4 for 360x288@25 which for YUYV format occupies ~ 860 Mbps of bandwidth. The
> > > program to transmit/receive video is here: http://repo.or.cz/w/rawv.git

[...]
> > > (by the way, on atom system, without tx csum offload, half of cpu time
> > > is spent only to calculate checksums...)
> > 
> > :o(
> 
> yes, that large. In top, my workload is
> 
>                                 %sy     %id     %si
>     
>     default driver load         ~25     ~45     ~27
>     (ethtool -k shows
>      tx-checksumming: off)
> 
>     after                        ~8     ~81     ~11
>     `ethtool -K eth0 tx on`
>      
> 
> that's why the issue is important.
> 
> 
> > > Now I wonder, where that 6K limit came from and why they say it is now
> > > not possible to use jumbos together with tx csum offload ?
> > 
> > Here is an excerpt from a mail where Hayes explained the rules of
> > engagement back in may 2011 (John Lumby and Chris Friesen were Cced then):
> 
> Can't find that mail in gmane netdev archive and on google, to restore
> full context. Was that in private?
> 
> 
> > ! The Max tx sizes for 8168 series are as following:
> > ! 
> > ! 8168B is 4K bytes.
> > ! 8168C and 8168CP are 6K bytes.
> > ! 8168D and later are 9K bytes.
> > ! 
> > ! Note that these sizes all include head size. That is, the mtu must less than
> > ! these values.
> > ! You have to enable Jumbo frame feature when the tx size is large, otherwise the
> > ! packet would not be sent. Because the hw doesn't provide the threshold, the
> > ! checking for MTU > 1500 is just for convenience for sw.
> 
> This part is clear.
> 
> 
> > ! The TSO couldn't work with some feature which need to disable hw checksum, such
> > ! as Jumbo frame. The hw checksum have to be disabled in certain situations, so
> > ! the TSO feature should be checked in these situations, too.
> 
> I don't enable TSO nor I need it. The text indirectly says that hw
> checksum should be disabled when jumbo frames are used.

[...]

> ~~~~
> 
> Hayes, Realtek linux nic maintainers,
> 
>     could you please confirm that for all 8168C and 8168CP jumbo_max is
>     6K and that when jumbos are used, tx checksumming should be off?
> 
>     If so, how come my two chips work stable with ~7K jumbos and tx checksum
>     offload on (tested this night again for ~16 hours without any problem).
> 
>     thanks beforehand.

Dear Hayes, Realtek linux nic maintainers,

Two years ago, for current products, I've specifically choosed
motherboard with RTL8111CP, because Linux driver supported large-enough
Jumbo-frames and tx/rx offload.

Now they say that jumbo-frames should be lowered in length and tx
offload is gone, but my nics still work without problems with old ~7K
jumbos and tx checksum offload. To keep current systems working I either
have to choose another hardware, or patch the driver in contrast to what
people say was the info from the manufacturer.

Neither I like to apply risky patches nor change already proved hardware
to something else without a good reason. So please, as Realtek
representatives,

    could you please confirm that for all 8168C and 8168CP jumbo_max is
    6K and that when jumbos are used, tx checksumming should be off?


Thanks beforehand,
Kirill


P.S. If so, how come my two chips work stable with ~7K jumbos and tx
     checksum offload on (last time tested for ~16 hours without any problem)?

^ permalink raw reply

* Re: [PATCH v2] atm: br2684: Fix excessive queue bloat
From: David Woodhouse @ 2012-11-26 16:20 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, John Crispin, Dave Täht, Chas Williams (CONTRACTOR),
	Jesper Brouer
In-Reply-To: <Pine.LNX.4.64.1211261504240.17614@ask.diku.dk>

[-- Attachment #1: Type: text/plain, Size: 582 bytes --]

On Mon, 2012-11-26 at 15:16 +0100, Jesper Dangaard Brouer wrote:
> Nice work David Woodhouse.  What OpenWRT supported box have this 
> hardware? (I want one so I can play with it ;-))

I'm using a Traverse Geos:
http://www.traverse.com.au/geos11-adsl2-x86-router-appliance
http://www.traverse.com.au/geos21-dual-adsl2-x86-router-appliance

Other boxes with real ADSL (like lantiq boxes) should also work.

Note that I don't actually *use* PPPoEoA (of which this is the EoA
part). I just do PPPoA, which is far more sensible and doesn't have MTU
issues.

-- 
dwmw2


[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 6171 bytes --]

^ permalink raw reply

* RE: [E1000-devel] 82571EB: Detected Hardware Unit Hang
From: Fujinaka, Todd @ 2012-11-26 16:23 UTC (permalink / raw)
  To: Joe Jin, Dave, Tushar N
  Cc: netdev@vger.kernel.org, e1000-devel@lists.sf.net,
	linux-kernel@vger.kernel.org, Mary Mcgrath
In-Reply-To: <50AB8471.7080607@oracle.com>

On Tue, 20 Nov 2012, Joe Jin wrote:

> On 11/20/12 16:59, Dave, Tushar N wrote:
>> Have you power off the system completely after modifying eeprom? If not please do so.
>
> Hi Tushar,
>
> Seems not works for me, would you please help to check what is wrong of my operations?

...

> # lspci -s 0000:52:00.1 -vvv
> 52:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
> <--snip-->
>       Capabilities: [e0] Express (v1) Endpoint, MSI 00
>               DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
                ^^^^^^^^^^^^^^^^^^^^
>                       ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
>               DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
>                       RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
>                       MaxPayload 128 bytes, MaxReadReq 4096 bytes
>           ^^^^^^^^^^^^^^^^^^^^^
>
 <--snip-->

If you look at the previous section, DevCap, you'll see that it's
correctly advertising 256 bytes but the system is negotiating 128 for
the link to the Ethernet controller. Things on the "other" side of the
link are controlled outside of the e1000 driver.

Tushar's first suggestion was to check the PCIe payload settings in the
entire chain. Have you done that? Mismatches will cause hangs.

Todd Fujinaka
Technical Marketing Engineer
LAN Access Division (LAD)
Intel Corporation
todd.fujinaka@intel.com
(503) 712-4565

^ permalink raw reply

* Re: BQL support in gianfar causes network hickup
From: Eric Dumazet @ 2012-11-26 16:34 UTC (permalink / raw)
  To: Tino Keitel; +Cc: Paul Gortmaker, netdev, Keitel, Tino (ALC NetworX GmbH)
In-Reply-To: <20121126100111.GA3728@mac.home>

On Mon, 2012-11-26 at 11:01 +0100, Tino Keitel wrote:
> On Sat, Nov 24, 2012 at 15:43:36 -0800, Eric Dumazet wrote:
> 
> [...]
> 
> > Hmm, I wonder if BQL makes a particular bug showing more often.
> > 
> > I see gianfar uses a very small watchdog_timeo of 1 second, while many
> > drivers use 5 seconds.
> > 
> > What happens if you change this to 5 seconds ?
> 
> I still got the trace and a failing ptp client.
> 

Thanks. Is this bug easy to trigger ?

I suspect a core issue and a race, likely to happen on your (non x86)
hardware

Could you add the following debugging patch ?

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index aefc150..a8859ec 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -117,7 +117,7 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
 	int ret = NETDEV_TX_BUSY;
 
 	/* And release qdisc */
-	spin_unlock(root_lock);
+//	spin_unlock(root_lock);
 
 	HARD_TX_LOCK(dev, txq, smp_processor_id());
 	if (!netif_xmit_frozen_or_stopped(txq))
@@ -125,7 +125,7 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
 
 	HARD_TX_UNLOCK(dev, txq);
 
-	spin_lock(root_lock);
+//	spin_lock(root_lock);
 
 	if (dev_xmit_complete(ret)) {
 		/* Driver sent out skb successfully or skb was consumed */

^ permalink raw reply related

* Re: BQL support in gianfar causes network hickup
From: Keitel, Tino (ALC NetworX GmbH) @ 2012-11-26 17:08 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Tino Keitel, Paul Gortmaker, netdev@vger.kernel.org
In-Reply-To: <1353947677.7553.2.camel@edumazet-glaptop>

On Mo, 2012-11-26 at 08:34 -0800, Eric Dumazet wrote:
> On Mon, 2012-11-26 at 11:01 +0100, Tino Keitel wrote:
> > On Sat, Nov 24, 2012 at 15:43:36 -0800, Eric Dumazet wrote:
> > 
> > [...]
> > 
> > > Hmm, I wonder if BQL makes a particular bug showing more often.
> > > 
> > > I see gianfar uses a very small watchdog_timeo of 1 second, while many
> > > drivers use 5 seconds.
> > > 
> > > What happens if you change this to 5 seconds ?
> > 
> > I still got the trace and a failing ptp client.
> > 
> 
> Thanks. Is this bug easy to trigger ?
> 
> I suspect a core issue and a race, likely to happen on your (non x86)
> hardware
> 
> Could you add the following debugging patch ?

No visible difference:

NETDEV WATCHDOG: eth1 (fsl-gianfar): transmit queue 0 timed out
------------[ cut here ]------------
WARNING:
at /home/keitelt1/src/git/linux-stable/net/sched/sch_generic.c:255
Modules linked in:
NIP: c02448b0 LR: c02448b0 CTR: c01c19b8
REGS: c7ffbe40 TRAP: 0700   Not tainted  (3.7.0-rc6-dirty)
MSR: 00029032 <EE,ME,IR,DR,RI>  CR: 22002044  XER: 20000000
TASK = c03dd370[0] 'swapper' THREAD: c03fe000
GPR00: c02448b0 c7ffbef0 c03dd370 0000003f 00000001 c001aea8 00000000
00000001
GPR08: 00000001 c03e0000 00000000 0000009d 22002084 1008eb5c 07ffb000
ffffffff
GPR16: 00000004 c0362c7c c03dfbf8 00200000 c0411ed0 c0411cd0 c0411ad0
ffffffff
GPR24: 00000000 c749e1d8 00000004 c783c1b0 c0400000 c03e0000 c749e000
00000000
NIP [c02448b0] dev_watchdog+0x288/0x298
LR [c02448b0] dev_watchdog+0x288/0x298
Call Trace:
[c7ffbef0] [c02448b0] dev_watchdog+0x288/0x298 (unreliable)
[c7ffbf20] [c00267f8] call_timer_fn+0x6c/0xd8
[c7ffbf50] [c00269e4] run_timer_softirq+0x180/0x1f8
[c7ffbfa0] [c0021144] __do_softirq+0xc4/0x160
[c7ffbff0] [c000d0b8] call_do_softirq+0x14/0x24
[c03ffe00] [c00058e8] do_softirq+0x8c/0xb8
[c03ffe20] [c0021358] irq_exit+0x98/0xb4
[c03ffe30] [c0009fb0] timer_interrupt+0x158/0x170
[c03ffe50] [c000f02c] ret_from_except+0x0/0x14
--- Exception: 901 at _raw_spin_unlock_irq+0x3c/0x78
    LR = _raw_spin_unlock_irq+0x2c/0x78
[c03fff20] [c00434c8] finish_task_switch.constprop.69+0x5c/0xdc
[c03fff40] [c02d354c] __schedule+0x1e0/0x410
[c03fff90] [c02d3a78] schedule_preempt_disabled+0x18/0x30
[c03fffa0] [c000898c] cpu_idle+0xfc/0x100
[c03fffc0] [c03b37b0] start_kernel+0x2dc/0x2f0
[c03ffff0] [00003438] 0x3438
Instruction dump:
7d2903a6 4e800421 80fe01fc 4bffff74 7fc3f378 4bfecb7d 7fc4f378 7fe6fb78 
7c651b78 3c60c038 38637280 48090e51 <0fe00000> 39200001 993cc7c9
4bffffb8 
---[ end trace c170f56a0503cdd2 ]---

Regards,
Tino


^ permalink raw reply

* Re: BQL support in gianfar causes network hickup
From: Eric Dumazet @ 2012-11-26 17:17 UTC (permalink / raw)
  To: Keitel, Tino (ALC NetworX GmbH)
  Cc: Tino Keitel, Paul Gortmaker, netdev@vger.kernel.org
In-Reply-To: <9AA65D849A88EB44B5D9B6A8BA098E23040A60D6EE6F@Exchange1.lawo.de>

On Mon, 2012-11-26 at 18:08 +0100, Keitel, Tino (ALC NetworX GmbH)
wrote:
> On Mo, 2012-11-26 at 08:34 -0800, Eric Dumazet wrote:
> > On Mon, 2012-11-26 at 11:01 +0100, Tino Keitel wrote:
> > > On Sat, Nov 24, 2012 at 15:43:36 -0800, Eric Dumazet wrote:
> > > 
> > > [...]
> > > 
> > > > Hmm, I wonder if BQL makes a particular bug showing more often.
> > > > 
> > > > I see gianfar uses a very small watchdog_timeo of 1 second, while many
> > > > drivers use 5 seconds.
> > > > 
> > > > What happens if you change this to 5 seconds ?
> > > 
> > > I still got the trace and a failing ptp client.
> > > 
> > 
> > Thanks. Is this bug easy to trigger ?
> > 
> > I suspect a core issue and a race, likely to happen on your (non x86)
> > hardware
> > 
> > Could you add the following debugging patch ?
> 
> No visible difference:

OK it seems you trigger the problem fast !

Please try the following as well :

diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
index 19ac096..77190bf 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -101,7 +101,7 @@
 
 #include "gianfar.h"
 
-#define TX_TIMEOUT      (1*HZ)
+#define TX_TIMEOUT      (5*HZ)
 
 const char gfar_driver_version[] = "1.3";
 
@@ -2465,9 +2465,9 @@ static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue)
 	struct sk_buff *skb;
 	int skb_dirtytx;
 	int tx_ring_size = tx_queue->tx_ring_size;
-	int frags = 0, nr_txbds = 0;
+	int frags, nr_txbds;
 	int i;
-	int howmany = 0;
+	int howmany = 0, total_txbds = 0;
 	int tqi = tx_queue->qindex;
 	unsigned int bytes_sent = 0;
 	u32 lstatus;
@@ -2479,7 +2479,6 @@ static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue)
 	skb_dirtytx = tx_queue->skb_dirtytx;
 
 	while ((skb = tx_queue->tx_skbuff[skb_dirtytx])) {
-		unsigned long flags;
 
 		frags = skb_shinfo(skb)->nr_frags;
 
@@ -2541,20 +2540,24 @@ static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue)
 			      TX_RING_MOD_MASK(tx_ring_size);
 
 		howmany++;
-		spin_lock_irqsave(&tx_queue->txlock, flags);
-		tx_queue->num_txbdfree += nr_txbds;
-		spin_unlock_irqrestore(&tx_queue->txlock, flags);
+		total_txbds += nr_txbds;
 	}
 
-	/* If we freed a buffer, we can restart transmission, if necessary */
-	if (netif_tx_queue_stopped(txq) && tx_queue->num_txbdfree)
-		netif_wake_subqueue(dev, tqi);
+	if (howmany) {
+		unsigned long flags;
 
+		spin_lock_irqsave(&tx_queue->txlock, flags);
+		tx_queue->num_txbdfree += total_txbds;
+		/* If we freed a buffer, we can restart transmission, if necessary */
+		if (netif_tx_queue_stopped(txq))
+			netif_tx_wake_queue(txq);
+		netdev_tx_completed_queue(txq, howmany, bytes_sent);
+		spin_unlock_irqrestore(&tx_queue->txlock, flags);
+	}
 	/* Update dirty indicators */
 	tx_queue->skb_dirtytx = skb_dirtytx;
 	tx_queue->dirty_tx = bdp;
 
-	netdev_tx_completed_queue(txq, howmany, bytes_sent);
 
 	return howmany;
 }

^ permalink raw reply related

* Re: [Pv-drivers] [PATCH 086/493] net: remove use of __devexit_p
From: Dmitry Torokhov @ 2012-11-26 17:35 UTC (permalink / raw)
  To: Bill Pemberton
  Cc: xen-devel, Michael S. Tsirkin, VMware, Inc., gregkh,
	virtualization, Francois Romieu, netdev
In-Reply-To: <1353349642-3677-86-git-send-email-wfp5p@virginia.edu>

On Mon, Nov 19, 2012 at 01:20:35PM -0500, Bill Pemberton wrote:
> CONFIG_HOTPLUG is going away as an option so __devexit_p is no longer
> needed.
> 
> Signed-off-by: Bill Pemberton <wfp5p@virginia.edu>
> Cc: Rusty Russell <rusty@rustcorp.com.au> 
> Cc: "Michael S. Tsirkin" <mst@redhat.com> 
> Cc: Shreyas Bhatewara <sbhatewara@vmware.com> 
> Cc: "VMware, Inc." <pv-drivers@vmware.com> 
> Cc: Francois Romieu <romieu@fr.zoreil.com> 
> Cc: virtualization@lists.linux-foundation.org 
> Cc: netdev@vger.kernel.org 
> Cc: xen-devel@lists.xensource.com 
> ---
>  drivers/net/virtio_net.c          | 2 +-
>  drivers/net/vmxnet3/vmxnet3_drv.c | 2 +-

For VMXNET3:
 
Acked-by: Dmitry Torokhov <dtor@vmware.com>

Thanks,
Dmitry

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox