Netdev List


* Re: [REGRESSION] r8169: jumbo fixes caused jumbo regressions!
From: Kirill Smelkov @ 2012-11-26 16:19 UTC (permalink / raw)
  To: Hayes Wang, Realtek linux nic maintainers
  Cc: Francois Romieu, David S. Miller, Greg Kroah-Hartman, netdev
In-Reply-To: <20121114092530.GA22323@tugrik.mns.mnsspb.ru>

On Wed, Nov 14, 2012 at 01:25:30PM +0400, Kirill Smelkov wrote:
> On Tue, Nov 13, 2012 at 11:35:12PM +0100, Francois Romieu wrote:
> > Kirill Smelkov <kirr@mns.spb.ru> :
> > [...]
> > > My test is to stream raw video from 8 PAL cameras to net - 4 for 720x576@25 and
> > > 4 for 360x288@25 which for YUYV format occupies ~ 860 Mbps of bandwidth. The
> > > program to transmit/receive video is here: http://repo.or.cz/w/rawv.git
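
As a rough cross-check of the ~860 Mbps figure quoted above, the raw pixel
payload can be computed directly. This is only a sketch: it assumes YUYV at
2 bytes per pixel and counts no packet or protocol headers, and the helper
names are illustrative, not taken from rawv.

```c
#include <stdint.h>

/* Raw payload bit rate of one YUYV stream (assumed 2 bytes per pixel). */
static uint64_t yuyv_stream_bps(uint32_t width, uint32_t height, uint32_t fps)
{
	return (uint64_t)width * height * 2u * fps * 8u;
}

/* Total for the setup described: 4 cameras at 720x576@25, 4 at 360x288@25. */
static uint64_t total_payload_bps(void)
{
	return 4 * yuyv_stream_bps(720, 576, 25) +
	       4 * yuyv_stream_bps(360, 288, 25);
}
```

Under these assumptions the payload alone comes to 829,440,000 bit/s (about
829 Mbit/s); the remaining gap to ~860 Mbps would presumably be per-packet
overhead.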

[...]
> > > (by the way, on atom system, without tx csum offload, half of cpu time
> > > is spent only to calculate checksums...)
> > 
> > :o(
> 
> yes, that large. In top, my workload is
> 
>                                 %sy     %id     %si
>     
>     default driver load         ~25     ~45     ~27
>     (ethtool -k shows
>      tx-checksumming: off)
> 
>     after                        ~8     ~81     ~11
>     `ethtool -K eth0 tx on`
>      
> 
> that's why the issue is important.
> 
> 
> > > Now I wonder, where that 6K limit came from and why they say it is now
> > > not possible to use jumbos together with tx csum offload ?
> > 
> > Here is an excerpt from a mail where Hayes explained the rules of
> > engagement back in may 2011 (John Lumby and Chris Friesen were Cced then):
> 
> Can't find that mail in gmane netdev archive and on google, to restore
> full context. Was that in private?
> 
> 
> > ! The Max tx sizes for 8168 series are as following:
> > ! 
> > ! 8168B is 4K bytes.
> > ! 8168C and 8168CP are 6K bytes.
> > ! 8168D and later are 9K bytes.
> > ! 
> > ! Note that these sizes all include head size. That is, the mtu must less than
> > ! these values.
> > ! You have to enable Jumbo frame feature when the tx size is large, otherwise the
> > ! packet would not be sent. Because the hw doesn't provide the threshold, the
> > ! checking for MTU > 1500 is just for convenience for sw.
> 
> This part is clear.
> 
> 
> > ! The TSO couldn't work with some feature which need to disable hw checksum, such
> > ! as Jumbo frame. The hw checksum have to be disabled in certain situations, so
> > ! the TSO feature should be checked in these situations, too.
> 
> I don't enable TSO nor I need it. The text indirectly says that hw
> checksum should be disabled when jumbo frames are used.

[...]

> ~~~~
> 
> Hayes, Realtek linux nic maintainers,
> 
>     could you please confirm that for all 8168C and 8168CP jumbo_max is
>     6K and that when jumbos are used, tx checksumming should be off?
> 
>     If so, how come my two chips work stable with ~7K jumbos and tx checksum
>     offload on (tested this night again for ~16 hours without any problem).
> 
>     thanks beforehand.

Dear Hayes, Realtek linux nic maintainers,

Two years ago, for our current products, I specifically chose a
motherboard with the RTL8111CP because the Linux driver supported
large-enough jumbo frames and tx/rx offload.

Now they say that the jumbo frame size has to be lowered and tx
offload is gone, but my NICs still work without problems with the old
~7K jumbos and tx checksum offload. To keep current systems working I
either have to choose other hardware, or patch the driver contrary to
what people say was the information from the manufacturer.

I like neither applying risky patches nor replacing already proven
hardware with something else without a good reason. So please, as
Realtek representatives,

    could you please confirm that for all 8168C and 8168CP jumbo_max is
    6K and that when jumbos are used, tx checksumming should be off?


Thanks beforehand,
Kirill


P.S. If so, how come my two chips work stably with ~7K jumbos and tx
     checksum offload on (last tested for ~16 hours without any problem)?


* Re: performance regression on HiperSockets depending on MTU size
From: Eric Dumazet @ 2012-11-26 16:12 UTC (permalink / raw)
  To: Frank Blaschka; +Cc: netdev, linux-s390
In-Reply-To: <20121126153242.GA61652@tuxmaker.boeblingen.de.ibm.com>

On Mon, 2012-11-26 at 16:32 +0100, Frank Blaschka wrote:
> Hi Eric,
> 
> Since kernel 3.6 we have been seeing a massive performance regression
> on s390 HiperSockets devices.
> 
> HiperSockets differ from normal devices in that they support
> large MTU sizes (up to 56K). Here are some iperf numbers showing
> how the problem depends on MTU size:
> 
> # ifconfig hsi0 mtu 1500
> # iperf -c 10.42.49.2
> ------------------------------------------------------------
> Client connecting to 10.42.49.2, TCP port 5001
> TCP window size: 47.6 KByte (default)
> ------------------------------------------------------------
> [  3] local 10.42.49.1 port 55855 connected with 10.42.49.2 port 5001
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0-10.0 sec    632 MBytes    530 Mbits/sec
> 
> # ifconfig hsi0 mtu 9000
> # iperf -c 10.42.49.2
> ------------------------------------------------------------
> Client connecting to 10.42.49.2, TCP port 5001
> TCP window size: 97.0 KByte (default)
> ------------------------------------------------------------
> [  3] local 10.42.49.1 port 55856 connected with 10.42.49.2 port 5001
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0-10.0 sec  2.26 GBytes  1.94 Gbits/sec
> 
> # ifconfig hsi0 mtu 32000
> # iperf -c 10.42.49.2
> ------------------------------------------------------------
> Client connecting to 10.42.49.2, TCP port 5001
> TCP window size:   322 KByte (default)
> ------------------------------------------------------------
> [  3] local 10.42.49.1 port 55857 connected with 10.42.49.2 port 5001
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0-10.3 sec  3.12 MBytes  2.53 Mbits/sec
> 
> Prior to the regression, throughput grew with the MTU size, but now it
> drops to a few Mbit/s if the MTU is bigger than 15000. Interestingly,
> if 2 or more connections are running in parallel the regression is gone.
> 
> # ifconfig hsi0 mtu 32000
> # iperf -c 10.42.49.2 -P2
> ------------------------------------------------------------
> Client connecting to 10.42.49.2, TCP port 5001
> TCP window size:   322 KByte (default)
> ------------------------------------------------------------
> [  4] local 10.42.49.1 port 55869 connected with 10.42.49.2 port 5001
> [  3] local 10.42.49.1 port 55868 connected with 10.42.49.2 port 5001
> [ ID] Interval       Transfer     Bandwidth
> [  4]  0.0-10.0 sec  2.19 GBytes  1.88 Gbits/sec
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0-10.0 sec  2.17 GBytes  1.87 Gbits/sec
> [SUM]  0.0-10.0 sec  4.36 GBytes  3.75 Gbits/sec
> 
> I bisected the problem to following patch:
> 
> commit 46d3ceabd8d98ed0ad10f20c595ca784e34786c5
> Author: Eric Dumazet <eric.dumazet@gmail.com>
> Date:   Wed Jul 11 05:50:31 2012 +0000
> 
>     tcp: TCP Small Queues
> 
>     This introduce TSQ (TCP Small Queues)
> 
>     TSQ goal is to reduce number of TCP packets in xmit queues (qdisc &
>     device queues), to reduce RTT and cwnd bias, part of the bufferbloat
>     problem.
> 
> Changing sysctl net.ipv4.tcp_limit_output_bytes to a higher value
> (e.g. 640000) seems to fix the problem.
> 
> How does MTU influence/affect TSQ?
> Why is the problem gone if there are more connections?
> Do you see any drawbacks in increasing net.ipv4.tcp_limit_output_bytes?
> Finally, is this expected behavior, or is there a bug related to the big
> MTU? What can I do to check ... ?
> 

Hi Frank, thanks for this report.

You could tweak tcp_limit_output_bytes, but IMO the root of the problem
is in the driver itself.

For example, I had to change the mlx4 driver for the same problem: make
sure a TX packet can be "TX completed" in a short amount of time.

In the case of mlx4, the wait time was 128 us, but I suspect in your
case it's more like an infinite time or several ms.

The driver is either delaying the freeing of TX skbs by a fixed amount
of time, or relying on following transmits to perform the TX completion.


For an example, check:

commit ecfd2ce1a9d5e6376ff5c00b366345160abdbbb7
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Nov 5 16:20:42 2012 +0000

    mlx4: change TX coalescing defaults
    
    mlx4 currently uses a too high tx coalescing setting, deferring
    TX completion interrupts by up to 128 us.
    
    With the recent skb_orphan() removal in commit 8112ec3b872,
    performance of a single TCP flow is capped to ~4 Gbps, unless
    we increase tcp_limit_output_bytes.
    
    I suggest using 16 us instead of 128 us, allowing a finer control.
    
    Performance of a single TCP flow is restored to previous levels,
    while keeping TCP small queues fully enabled with default sysctl.
    
    This patch is also a BQL prereq.
    
    Reported-by: Vimalkumar <j.vimal@gmail.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
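
A back-of-the-envelope model of the interaction described above, as a sketch
only. It assumes a single flow can keep at most limit_bytes queued below TCP
until the driver frees the skbs, and that the driver frees them only once
every completion_delay_us. Both function names are hypothetical and this is
not kernel code.

```c
#include <stdint.h>

/* Upper bound on single-flow throughput in bit/s when at most limit_bytes
 * may sit in the qdisc/device queues and TX completion happens only every
 * completion_delay_us microseconds (must be non-zero). Simplified model. */
static uint64_t tsq_max_flow_bps(uint64_t limit_bytes,
				 uint64_t completion_delay_us)
{
	return limit_bytes * 8u * 1000000u / completion_delay_us;
}

/* How many full-MTU packets fit under the limit (mtu must be non-zero). */
static uint64_t tsq_packets_in_flight(uint64_t limit_bytes, uint64_t mtu)
{
	return limit_bytes / mtu;
}
```

With the 640000-byte limit tried in the report and a 32000-byte MTU, 20
packets fit under the limit; with an example limit of 131072 bytes only 4 do,
so a driver that completes TX late leaves the flow with very little to send.
The model is crude and ignores everything else TSQ does.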


* Re: [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems
From: Jesper Dangaard Brouer @ 2012-11-26 15:29 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Florian Westphal, netdev, Pablo Neira Ayuso,
	Thomas Graf, Cong Wang, Patrick McHardy, Paul E. McKenney,
	Herbert Xu
In-Reply-To: <1353942950.30446.1772.camel@edumazet-glaptop>

On Mon, 2012-11-26 at 07:15 -0800, Eric Dumazet wrote:
> On Mon, 2012-11-26 at 15:42 +0100, Jesper Dangaard Brouer wrote:
> > On Sun, 2012-11-25 at 08:11 -0800, Eric Dumazet wrote:
> > > On Sun, 2012-11-25 at 09:53 +0100, Jesper Dangaard Brouer wrote:
> > > 
> > > > Yes, for the default large 64k packets size, its just a "fake"
> > > > benchmark.  And notice with my fixes, we are even faster than the
> > > > none-frag/single-UDP packet case... but its because we are getting a
> > > > GSO/GRO effect.
> > > 
> > > Could you elaborate on this GSO/GRO effect ?
> > 
> > On the big system, I saw none-frag UDP (1472 bytes) throughput of:
> >   7356.57 + 7351.78 + 7330.60 + 7269.26 = 29308.21 Mbit/s
> > 
> > While with UDP fragments size 65507 bytes I saw:
> >   9228.75 + 9207.81 + 9615.83 + 9615.87 = 37668.26 Mbit/s
> > 
> > Fragmented UDP is faster by:
> >  37668.26 - 29308.21 = 8360.05 Mbit/s
> > 
> > The 65507 bytes UDP size is just a benchmark test, and have no real-life
> > relevance.  As performance starts to drop (below none-frag/normal case)
> > when the frag size is decreased, to more realistic sizes...
> 
> Yes, but I doubt GRO / GSO are the reason you get better performance.
> GRO doesn't aggregate UDP frames.

Oh, now I think I understand your question.

I don't think GRO is helping me.  It's the same "effect" as GRO: (I
think) the reassembled fragment SKB is a "bigger" SKB, which is passed
to the rest of the stack.  Thus fewer (but bigger) SKBs incur the
overhead of the rest of the stack.  It was actually Herbert who
mentioned it to me...

--Jesper
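
To put a number on the "fewer but bigger SKBs" point, here is a small sketch
that counts the fragments of one UDP datagram. It assumes a 20-byte IPv4
header without options and an 8-byte UDP header; the function name is
illustrative.

```c
#include <stdint.h>

/* Number of IPv4 fragments needed for one UDP datagram, assuming a 20-byte
 * IPv4 header (no options) and an 8-byte UDP header. */
static uint32_t ipv4_udp_fragments(uint32_t udp_payload, uint32_t mtu)
{
	uint32_t ip_payload = udp_payload + 8;      /* UDP header + data */
	uint32_t per_frag = ((mtu - 20) / 8) * 8;   /* data per fragment, multiple of 8 */

	return (ip_payload + per_frag - 1) / per_frag;  /* round up */
}
```

Assuming a 1500-byte MTU (not stated in the thread, though 1472 = 1500 - 20 - 8
suggests it), a 65507-byte datagram is carried in 45 fragments, so after
reassembly the rest of the stack handles one SKB where it would otherwise
handle 45.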


* [PATCH] vhost: fix length for cross region descriptor
From: Michael S. Tsirkin @ 2012-11-26 15:57 UTC (permalink / raw)
  To: netdev, David Miller; +Cc: Jason Wang, linux-kernel

If a single descriptor crosses a region, the
second chunk length should be decremented
by the size translated so far; instead it includes
the full descriptor length.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/vhost/vhost.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index ef8f598..5a3d0f1 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1049,7 +1049,7 @@ static int translate_desc(struct vhost_dev *dev, u64 addr, u32 len,
 		}
 		_iov = iov + ret;
 		size = reg->memory_size - addr + reg->guest_phys_addr;
-		_iov->iov_len = min((u64)len, size);
+		_iov->iov_len = min((u64)len - s, size);
 		_iov->iov_base = (void __user *)(unsigned long)
 			(reg->userspace_addr + addr - reg->guest_phys_addr);
 		s += size;
-- 
MST
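
The effect of the one-line change can be seen in a user-space model of the
translation loop. This is a simplified sketch, not the vhost code: the struct
and function names are made up for illustration, region lookup is a plain
linear scan, and error codes are arbitrary negative values.

```c
#include <stddef.h>
#include <stdint.h>

struct model_region {
	uint64_t guest_phys_addr;
	uint64_t memory_size;
	uint64_t userspace_addr;
};

struct model_chunk {
	uint64_t base;
	uint64_t len;
};

/* Linear lookup of the region containing addr (illustrative only). */
static const struct model_region *
model_find_region(const struct model_region *regs, size_t n, uint64_t addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr >= regs[i].guest_phys_addr &&
		    addr - regs[i].guest_phys_addr < regs[i].memory_size)
			return &regs[i];
	return NULL;
}

/* Split [addr, addr + len) into per-region chunks; returns the chunk count
 * or a negative value on error. */
static int model_translate(const struct model_region *regs, size_t nregs,
			   uint64_t addr, uint32_t len,
			   struct model_chunk *out, int out_size)
{
	uint64_t s = 0;	/* bytes covered so far */
	int ret = 0;

	while ((uint64_t)len > s) {
		const struct model_region *reg;
		uint64_t size, remaining;

		if (ret >= out_size)
			return -1;
		reg = model_find_region(regs, nregs, addr);
		if (!reg)
			return -2;
		/* bytes left in this region starting at addr */
		size = reg->memory_size - addr + reg->guest_phys_addr;
		/* the fix: clamp against what is still untranslated */
		remaining = (uint64_t)len - s;
		out[ret].len = remaining < size ? remaining : size;
		out[ret].base = reg->userspace_addr + addr - reg->guest_phys_addr;
		s += size;
		addr += size;
		ret++;
	}
	return ret;
}
```

For a 0x200-byte descriptor that starts 0x100 bytes before the end of one
region, the model yields two chunks of 0x100 bytes each. Clamping against the
full len instead would make the second chunk 0x200 bytes long.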


* performance regression on HiperSockets depending on MTU size
From: Frank Blaschka @ 2012-11-26 15:32 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, linux-s390

Hi Eric,

Since kernel 3.6 we have been seeing a massive performance regression
on s390 HiperSockets devices.

HiperSockets differ from normal devices in that they support
large MTU sizes (up to 56K). Here are some iperf numbers showing
how the problem depends on MTU size:

# ifconfig hsi0 mtu 1500
# iperf -c 10.42.49.2
------------------------------------------------------------
Client connecting to 10.42.49.2, TCP port 5001
TCP window size: 47.6 KByte (default)
------------------------------------------------------------
[  3] local 10.42.49.1 port 55855 connected with 10.42.49.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec    632 MBytes    530 Mbits/sec

# ifconfig hsi0 mtu 9000
# iperf -c 10.42.49.2
------------------------------------------------------------
Client connecting to 10.42.49.2, TCP port 5001
TCP window size: 97.0 KByte (default)
------------------------------------------------------------
[  3] local 10.42.49.1 port 55856 connected with 10.42.49.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  2.26 GBytes  1.94 Gbits/sec

# ifconfig hsi0 mtu 32000
# iperf -c 10.42.49.2
------------------------------------------------------------
Client connecting to 10.42.49.2, TCP port 5001
TCP window size:   322 KByte (default)
------------------------------------------------------------
[  3] local 10.42.49.1 port 55857 connected with 10.42.49.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.3 sec  3.12 MBytes  2.53 Mbits/sec

Prior to the regression, throughput grew with the MTU size, but now it
drops to a few Mbit/s if the MTU is bigger than 15000. Interestingly,
if 2 or more connections are running in parallel the regression is gone.

# ifconfig hsi0 mtu 32000
# iperf -c 10.42.49.2 -P2
------------------------------------------------------------
Client connecting to 10.42.49.2, TCP port 5001
TCP window size:   322 KByte (default)
------------------------------------------------------------
[  4] local 10.42.49.1 port 55869 connected with 10.42.49.2 port 5001
[  3] local 10.42.49.1 port 55868 connected with 10.42.49.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  2.19 GBytes  1.88 Gbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  2.17 GBytes  1.87 Gbits/sec
[SUM]  0.0-10.0 sec  4.36 GBytes  3.75 Gbits/sec

I bisected the problem to following patch:

commit 46d3ceabd8d98ed0ad10f20c595ca784e34786c5
Author: Eric Dumazet <eric.dumazet@gmail.com>
Date:   Wed Jul 11 05:50:31 2012 +0000

    tcp: TCP Small Queues

    This introduce TSQ (TCP Small Queues)

    TSQ goal is to reduce number of TCP packets in xmit queues (qdisc &
    device queues), to reduce RTT and cwnd bias, part of the bufferbloat
    problem.

Changing sysctl net.ipv4.tcp_limit_output_bytes to a higher value
(e.g. 640000) seems to fix the problem.

How does MTU influence/affect TSQ?
Why is the problem gone if there are more connections?
Do you see any drawbacks in increasing net.ipv4.tcp_limit_output_bytes?
Finally, is this expected behavior, or is there a bug related to the big
MTU? What can I do to check ... ?

Thx for your help

Frank


* Re: [PATCH] sctp: fix -ENOMEM result with invalid user space pointer in sendto() syscall
From: Neil Horman @ 2012-11-26 15:25 UTC (permalink / raw)
  To: Tommi Rantala
  Cc: linux-sctp, netdev, Vlad Yasevich, Sridhar Samudrala,
	David S. Miller, Dave Jones
In-Reply-To: <1353590596-12216-1-git-send-email-tt.rantala@gmail.com>

On Thu, Nov 22, 2012 at 03:23:16PM +0200, Tommi Rantala wrote:
> Consider the following program, that sets the second argument to the
> sendto() syscall incorrectly:
> 
>  #include <string.h>
>  #include <arpa/inet.h>
>  #include <sys/socket.h>
> 
>  int main(void)
>  {
>          int fd;
>          struct sockaddr_in sa;
> 
>          fd = socket(AF_INET, SOCK_STREAM, 132 /*IPPROTO_SCTP*/);
>          if (fd < 0)
>                  return 1;
> 
>          memset(&sa, 0, sizeof(sa));
>          sa.sin_family = AF_INET;
>          sa.sin_addr.s_addr = inet_addr("127.0.0.1");
>          sa.sin_port = htons(11111);
> 
>          sendto(fd, NULL, 1, 0, (struct sockaddr *)&sa, sizeof(sa));
> 
>          return 0;
>  }
> 
> We get -ENOMEM:
> 
>  $ strace -e sendto ./demo
>  sendto(3, NULL, 1, 0, {sa_family=AF_INET, sin_port=htons(11111), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 ENOMEM (Cannot allocate memory)
> 
> Propagate the error code from sctp_user_addto_chunk(), so that we will
> tell user space what actually went wrong:
> 
>  $ strace -e sendto ./demo
>  sendto(3, NULL, 1, 0, {sa_family=AF_INET, sin_port=htons(11111), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EFAULT (Bad address)
> 
> Noticed while running Trinity (the syscall fuzzer).
> 
> Signed-off-by: Tommi Rantala <tt.rantala@gmail.com>
> ---
>  net/sctp/chunk.c  |   13 +++++++++----
>  net/sctp/socket.c |    4 ++--
>  2 files changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c
> index d241ef5..3952ca9 100644
> --- a/net/sctp/chunk.c
> +++ b/net/sctp/chunk.c
> @@ -183,7 +183,7 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
>  
>  	msg = sctp_datamsg_new(GFP_KERNEL);
>  	if (!msg)
> -		return NULL;
> +		return ERR_PTR(-ENOMEM);
>  
>  	/* Note: Calculate this outside of the loop, so that all fragments
>  	 * have the same expiration.
> @@ -280,8 +280,11 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
>  
>  		chunk = sctp_make_datafrag_empty(asoc, sinfo, len, frag, 0);
>  
> -		if (!chunk)
> +		if (!chunk) {
> +			err = -ENOMEM;
>  			goto errout;
> +		}
> +
>  		err = sctp_user_addto_chunk(chunk, offset, len, msgh->msg_iov);
>  		if (err < 0)
>  			goto errout_chunk_put;
> @@ -315,8 +318,10 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
>  
>  		chunk = sctp_make_datafrag_empty(asoc, sinfo, over, frag, 0);
>  
> -		if (!chunk)
> +		if (!chunk) {
> +			err = -ENOMEM;
>  			goto errout;
> +		}
>  
>  		err = sctp_user_addto_chunk(chunk, offset, over,msgh->msg_iov);
>  
> @@ -342,7 +347,7 @@ errout:
>  		sctp_chunk_free(chunk);
>  	}
>  	sctp_datamsg_put(msg);
> -	return NULL;
> +	return ERR_PTR(err);
>  }
>  
>  /* Check whether this message has expired. */
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index a60d1f8..406d957 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -1915,8 +1915,8 @@ SCTP_STATIC int sctp_sendmsg(struct kiocb *iocb, struct sock *sk,
>  
>  	/* Break the message into multiple chunks of maximum size. */
>  	datamsg = sctp_datamsg_from_user(asoc, sinfo, msg, msg_len);
> -	if (!datamsg) {
> -		err = -ENOMEM;
> +	if (IS_ERR(datamsg)) {
> +		err = PTR_ERR(datamsg);
>  		goto out_free;
>  	}
>  
> -- 
> 1.7.9.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
Acked-by: Neil Horman <nhorman@tuxdriver.com>
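
For readers unfamiliar with the ERR_PTR()/IS_ERR()/PTR_ERR() idiom the patch
switches to, here is a user-space model of it. This is a sketch that mirrors
the idea (small negative errno values encoded in the top of the pointer
range), not the kernel's definitions; every model_* name is made up.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static void *model_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* Decode the errno value from an error pointer. */
static long model_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* True if p lies in the last MODEL_MAX_ERRNO values of the address range. */
static int model_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)(-MODEL_MAX_ERRNO);
}

/* Hypothetical producer: fails with -EFAULT for a NULL source, standing in
 * for a failed copy from user space. */
static void *model_make_msg(const char *src)
{
	static char msg;

	if (!src)
		return model_err_ptr(-EFAULT);
	return &msg;
}

/* Caller side: propagate the real error instead of assuming -ENOMEM. */
static long model_send(const char *src)
{
	void *m = model_make_msg(src);

	if (model_is_err(m))
		return model_ptr_err(m);
	return 0;
}
```

The point of the pattern is that the callee can report which error occurred
through its pointer return value, so the caller no longer has to guess.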


* Re: [PATCH] sctp: fix memory leak in sctp_datamsg_from_user() when copy from user space fails
From: Neil Horman @ 2012-11-26 15:23 UTC (permalink / raw)
  To: Tommi Rantala
  Cc: linux-sctp, netdev, Vlad Yasevich, Sridhar Samudrala,
	David S. Miller, Dave Jones
In-Reply-To: <1353590491-12166-1-git-send-email-tt.rantala@gmail.com>

On Thu, Nov 22, 2012 at 03:21:31PM +0200, Tommi Rantala wrote:
> Trinity (the syscall fuzzer) discovered a memory leak in SCTP,
> reproducible e.g. with the sendto() syscall by passing invalid
> user space pointer in the second argument:
> 
>  #include <string.h>
>  #include <arpa/inet.h>
>  #include <sys/socket.h>
> 
>  int main(void)
>  {
>          int fd;
>          struct sockaddr_in sa;
> 
>          fd = socket(AF_INET, SOCK_STREAM, 132 /*IPPROTO_SCTP*/);
>          if (fd < 0)
>                  return 1;
> 
>          memset(&sa, 0, sizeof(sa));
>          sa.sin_family = AF_INET;
>          sa.sin_addr.s_addr = inet_addr("127.0.0.1");
>          sa.sin_port = htons(11111);
> 
>          sendto(fd, NULL, 1, 0, (struct sockaddr *)&sa, sizeof(sa));
> 
>          return 0;
>  }
> 
> As far as I can tell, the leak has been around since ~2003.
> 
> Signed-off-by: Tommi Rantala <tt.rantala@gmail.com>
> ---
>  net/sctp/chunk.c |    7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c
> index 7c2df9c..d241ef5 100644
> --- a/net/sctp/chunk.c
> +++ b/net/sctp/chunk.c
> @@ -284,7 +284,7 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
>  			goto errout;
>  		err = sctp_user_addto_chunk(chunk, offset, len, msgh->msg_iov);
>  		if (err < 0)
> -			goto errout;
> +			goto errout_chunk_put;
>  
>  		offset += len;
>  
> @@ -324,7 +324,7 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
>  		__skb_pull(chunk->skb, (__u8 *)chunk->chunk_hdr
>  			   - (__u8 *)chunk->skb->data);
>  		if (err < 0)
> -			goto errout;
> +			goto errout_chunk_put;
>  
>  		sctp_datamsg_assign(msg, chunk);
>  		list_add_tail(&chunk->frag_list, &msg->chunks);
> @@ -332,6 +332,9 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
>  
>  	return msg;
>  
> +errout_chunk_put:
> +	sctp_chunk_put(chunk);
> +
>  errout:
>  	list_for_each_safe(pos, temp, &msg->chunks) {
>  		list_del_init(pos);
> -- 
> 1.7.9.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

I'm fine with it the way it is, but it might be nicer if you instead just moved
the list_add_tail call up between the if (!chunk) check and the
sctp_user_addto_chunk call.  That way the unwind loop at the errout label can
just free the chunk without the need for an extra label.

Neil
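
The alternative suggested above can be sketched in user space: link each
element into the list before the step that may fail, so that one unwind loop
frees everything. This is a generic illustration with made-up names, not the
SCTP code; whether the reordering is safe in sctp_datamsg_from_user() itself
is not checked here.

```c
#include <stdlib.h>

struct model_node {
	struct model_node *next;
};

struct model_msg {
	struct model_node *head;
};

static int model_live_nodes;	/* allocation counter used to spot leaks */

static struct model_node *model_node_new(void)
{
	struct model_node *n = malloc(sizeof(*n));

	if (n) {
		n->next = NULL;
		model_live_nodes++;
	}
	return n;
}

static void model_node_free(struct model_node *n)
{
	free(n);
	model_live_nodes--;
}

/* Hypothetical fallible step, standing in for the copy from user space. */
static int model_fill(struct model_node *n, int fail)
{
	(void)n;
	return fail ? -1 : 0;
}

/* Build count nodes; the step for index fail_at fails (-1: never fail). */
static int model_build(struct model_msg *m, int count, int fail_at)
{
	for (int i = 0; i < count; i++) {
		struct model_node *n = model_node_new();

		if (!n)
			goto errout;
		/* Link first: the list owns the node before anything can fail. */
		n->next = m->head;
		m->head = n;
		if (model_fill(n, i == fail_at) < 0)
			goto errout;
	}
	return 0;

errout:
	/* Single unwind path, no extra label needed. */
	while (m->head) {
		struct model_node *n = m->head;

		m->head = n->next;
		model_node_free(n);
	}
	return -1;
}
```

The trade-off is that a partially initialised element is briefly visible on
the list, which is fine only if nothing else looks at the list before the
function returns.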


* [PATCH v3 net-next] sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name
From: Brian Haley @ 2012-11-26 15:21 UTC (permalink / raw)
  To: David Miller; +Cc: Pavel Emelyanov, Eric Dumazet, netdev@vger.kernel.org

Instead of having the getsockopt() of SO_BINDTODEVICE return an index, which
will then require another call like if_indextoname() to get the actual interface
name, have it return the name directly.

This also matches the existing man page description on socket(7) which mentions
the argument being an interface name.

If the value has not been set, zero is returned and optlen will be set to zero
to indicate there is no interface name present.

Added a seqlock to protect this code path, and dev_ifname(), from someone
changing the device name via dev_change_name().

v2: Added seqlock protection while copying device name.

v3: Fixed word wrap in patch.

Signed-off-by: Brian Haley <brian.haley@hp.com>

--

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e46c830..e9929ab 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1567,6 +1567,8 @@ extern int call_netdevice_notifiers(unsigned long val, struct net_device *dev);
 
 extern rwlock_t				dev_base_lock;		/* Device list lock */
 
+extern seqlock_t	devnet_rename_seq;	/* Device rename lock */
+
 
 #define for_each_netdev(net, d)		\
 		list_for_each_entry(d, &(net)->dev_base_head, dev_list)
diff --git a/net/core/dev.c b/net/core/dev.c
index 7304ea8..2a5f558 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -203,6 +203,8 @@ static struct list_head offload_base __read_mostly;
 DEFINE_RWLOCK(dev_base_lock);
 EXPORT_SYMBOL(dev_base_lock);
 
+DEFINE_SEQLOCK(devnet_rename_seq);
+
 static inline void dev_base_seq_inc(struct net *net)
 {
 	while (++net->dev_base_seq == 0);
@@ -1091,22 +1093,31 @@ int dev_change_name(struct net_device *dev, const char *newname)
 	if (dev->flags & IFF_UP)
 		return -EBUSY;
 
-	if (strncmp(newname, dev->name, IFNAMSIZ) == 0)
+	write_seqlock(&devnet_rename_seq);
+
+	if (strncmp(newname, dev->name, IFNAMSIZ) == 0) {
+		write_sequnlock(&devnet_rename_seq);
 		return 0;
+	}
 
 	memcpy(oldname, dev->name, IFNAMSIZ);
 
 	err = dev_get_valid_name(net, dev, newname);
-	if (err < 0)
+	if (err < 0) {
+		write_sequnlock(&devnet_rename_seq);
 		return err;
+	}
 
 rollback:
 	ret = device_rename(&dev->dev, dev->name);
 	if (ret) {
 		memcpy(dev->name, oldname, IFNAMSIZ);
+		write_sequnlock(&devnet_rename_seq);
 		return ret;
 	}
 
+	write_sequnlock(&devnet_rename_seq);
+
 	write_lock_bh(&dev_base_lock);
 	hlist_del_rcu(&dev->name_hlist);
 	write_unlock_bh(&dev_base_lock);
@@ -1124,6 +1135,7 @@ rollback:
 		/* err >= 0 after dev_alloc_name() or stores the first errno */
 		if (err >= 0) {
 			err = ret;
+			write_seqlock(&devnet_rename_seq);
 			memcpy(dev->name, oldname, IFNAMSIZ);
 			goto rollback;
 		} else {
@@ -4148,6 +4160,7 @@ static int dev_ifname(struct net *net, struct ifreq __user *arg)
 {
 	struct net_device *dev;
 	struct ifreq ifr;
+	unsigned seq;
 
 	/*
 	 *	Fetch the caller's info block.
@@ -4156,6 +4169,8 @@ static int dev_ifname(struct net *net, struct ifreq __user *arg)
 	if (copy_from_user(&ifr, arg, sizeof(struct ifreq)))
 		return -EFAULT;
 
+retry:
+	seq = read_seqbegin(&devnet_rename_seq);
 	rcu_read_lock();
 	dev = dev_get_by_index_rcu(net, ifr.ifr_ifindex);
 	if (!dev) {
@@ -4165,6 +4180,8 @@ static int dev_ifname(struct net *net, struct ifreq __user *arg)
 
 	strcpy(ifr.ifr_name, dev->name);
 	rcu_read_unlock();
+	if (read_seqretry(&devnet_rename_seq, seq))
+		goto retry;
 
 	if (copy_to_user(arg, &ifr, sizeof(struct ifreq)))
 		return -EFAULT;
diff --git a/net/core/sock.c b/net/core/sock.c
index d4f7b58..a692ef4 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -505,7 +505,8 @@ struct dst_entry *sk_dst_check(struct sock *sk, u32 cookie)
 }
 EXPORT_SYMBOL(sk_dst_check);
 
-static int sock_bindtodevice(struct sock *sk, char __user *optval, int optlen)
+static int sock_setbindtodevice(struct sock *sk, char __user *optval,
+				int optlen)
 {
 	int ret = -ENOPROTOOPT;
 #ifdef CONFIG_NETDEVICES
@@ -562,6 +563,59 @@ out:
 	return ret;
 }
 
+static int sock_getbindtodevice(struct sock *sk, char __user *optval,
+				int __user *optlen, int len)
+{
+	int ret = -ENOPROTOOPT;
+#ifdef CONFIG_NETDEVICES
+	struct net *net = sock_net(sk);
+	struct net_device *dev;
+	char devname[IFNAMSIZ];
+	unsigned seq;
+
+	if (sk->sk_bound_dev_if == 0) {
+		len = 0;
+		goto zero;
+	}
+
+	ret = -EINVAL;
+	if (len < IFNAMSIZ)
+		goto out;
+
+retry:
+	seq = read_seqbegin(&devnet_rename_seq);
+	rcu_read_lock();
+	dev = dev_get_by_index_rcu(net, sk->sk_bound_dev_if);
+	ret = -ENODEV;
+	if (!dev) {
+		rcu_read_unlock();
+		goto out;
+	}
+
+	strcpy(devname, dev->name);
+	rcu_read_unlock();
+	if (read_seqretry(&devnet_rename_seq, seq))
+		goto retry;
+
+	len = strlen(devname) + 1;
+
+	ret = -EFAULT;
+	if (copy_to_user(optval, devname, len))
+		goto out;
+
+zero:
+	ret = -EFAULT;
+	if (put_user(len, optlen))
+		goto out;
+
+	ret = 0;
+
+out:
+#endif
+
+	return ret;
+}
+
 static inline void sock_valbool_flag(struct sock *sk, int bit, int valbool)
 {
 	if (valbool)
@@ -589,7 +643,7 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 	 */
 
 	if (optname == SO_BINDTODEVICE)
-		return sock_bindtodevice(sk, optval, optlen);
+		return sock_setbindtodevice(sk, optval, optlen);
 
 	if (optlen < sizeof(int))
 		return -EINVAL;
@@ -1075,15 +1129,17 @@ int sock_getsockopt(struct socket *sock, int level, int optname,
 	case SO_NOFCS:
 		v.val = sock_flag(sk, SOCK_NOFCS);
 		break;
+
 	case SO_BINDTODEVICE:
-		v.val = sk->sk_bound_dev_if;
-		break;
+		return sock_getbindtodevice(sk, optval, optlen, len);
+
 	case SO_GET_FILTER:
 		len = sk_get_filter(sk, (struct sock_filter __user *)optval, len);
 		if (len < 0)
 			return len;
 
 		goto lenout;
+
 	default:
 		return -ENOPROTOOPT;
 	}


* Re: [PATCH RFC 0/5] Containerize syslog
From: Eric W. Biederman @ 2012-11-26 15:16 UTC (permalink / raw)
  To: Rui Xiang; +Cc: Serge E. Hallyn, serge.hallyn, containers, netdev
In-Reply-To: <50ACA05F.7080005@gmail.com>

Rui Xiang <leo.ruixiang@gmail.com> writes:

> On 2012-11-19 22:37, Serge E. Hallyn wrote:

>> I understand that user namespaces aren't 100% usable yet, but looking
>> long term, is there a reason to have the syslog namespace separate
>> from user namespace?
>
> Actually we don't have strong preference. We'll think more about it. Hope we can make
> consensus with Eric.

I hope I am not hard to work with.  My primary concern is reasonable
looking code and good long term maintainable semantics.

I really don't care in which namespace we file the kernel log
statements.

I care much more about which kernel log print statements we want filed
differently.

Eric


* Re: [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems
From: Eric Dumazet @ 2012-11-26 15:15 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David S. Miller, Florian Westphal, netdev, Pablo Neira Ayuso,
	Thomas Graf, Cong Wang, Patrick McHardy, Paul E. McKenney,
	Herbert Xu
In-Reply-To: <1353940930.11754.221.camel@localhost>

On Mon, 2012-11-26 at 15:42 +0100, Jesper Dangaard Brouer wrote:
> On Sun, 2012-11-25 at 08:11 -0800, Eric Dumazet wrote:
> > On Sun, 2012-11-25 at 09:53 +0100, Jesper Dangaard Brouer wrote:
> > 
> > > Yes, for the default large 64k packets size, its just a "fake"
> > > benchmark.  And notice with my fixes, we are even faster than the
> > > none-frag/single-UDP packet case... but its because we are getting a
> > > GSO/GRO effect.
> > 
> > Could you elaborate on this GSO/GRO effect ?
> 
> On the big system, I saw none-frag UDP (1472 bytes) throughput of:
>   7356.57 + 7351.78 + 7330.60 + 7269.26 = 29308.21 Mbit/s
> 
> While with UDP fragments size 65507 bytes I saw:
>   9228.75 + 9207.81 + 9615.83 + 9615.87 = 37668.26 Mbit/s
> 
> Fragmented UDP is faster by:
>  37668.26 - 29308.21 = 8360.05 Mbit/s
> 
> The 65507 bytes UDP size is just a benchmark test, and have no real-life
> relevance.  As performance starts to drop (below none-frag/normal case)
> when the frag size is decreased, to more realistic sizes...

Yes, but I doubt GRO / GSO are the reason you get better performance.

GRO doesn't aggregate UDP frames.


* Re: [PATCH net-next v2] net: clean up locking in inet_frag_find()
From: Eric Dumazet @ 2012-11-26 15:12 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev, Patrick McHardy, Pablo Neira Ayuso, David S. Miller
In-Reply-To: <1353914786-10426-1-git-send-email-amwang@redhat.com>

On Mon, 2012-11-26 at 15:26 +0800, Cong Wang wrote:
> It is weird to take the read lock outside of inet_frag_find()
> but release it inside...  This can be improved by refactoring
> the code, that is, introducing inet{4,6}_frag_find() which call
> the their own hash function, inet{4,6}_hash_frag(), hiding the
> details from their callers.
> 
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Patrick McHardy <kaber@trash.net>
> Cc: Pablo Neira Ayuso <pablo@netfilter.org>
> Cc: David S. Miller <davem@davemloft.net>
> Signed-off-by: Cong Wang <amwang@redhat.com>
> 
> ---
>  include/net/inet_frag.h                 |   17 +++++-
>  include/net/ipv6.h                      |    3 -
>  net/ipv4/inet_fragment.c                |   82 +++++++++++++++++++++++++++++--
>  net/ipv4/ip_fragment.c                  |   16 +-----
>  net/ipv6/netfilter/nf_conntrack_reasm.c |    7 +--
>  net/ipv6/reassembly.c                   |   34 +------------
>  6 files changed, 97 insertions(+), 62 deletions(-)

It's not clear to me that it's a win, as it adds 35 LOC. Nobody has
really complained about this locking scheme in the past.

Also, Jesper is working on this stuff, so you don't really ease his work.

* Re: [PATCH v2 net-next] sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name
From: Brian Haley @ 2012-11-26 15:10 UTC (permalink / raw)
  To: David Miller; +Cc: xemul, eric.dumazet, netdev
In-Reply-To: <20121120.135842.249477087130415954.davem@davemloft.net>

On 11/20/2012 01:58 PM, David Miller wrote:
>> v2: Added seqlock protection while copying device name.
>>
>> Signed-off-by: Brian Haley <brian.haley@hp.com>
> 
> Brian I was going to apply this, but something about how you email
> patches results in them being corrupted.
> 
> Go to:
> 
> http://patchwork.ozlabs.org/patch/199732/
> 
> Click on Download "mbox", and try to apply that to the net-next tree
> to see what I mean.

I'll take a look at why that was wrapping and send a v3; I've been away...

-Brian

* Re: [PATCH net-next] gro: Handle inline VLAN tags
From: Andrew Gallatin @ 2012-11-26 15:04 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, bhutchings, netdev, linux-net-drivers, herbert
In-Reply-To: <20121119.191002.1995098917961576324.davem@davemloft.net>

On 11/19/12 19:10, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Fri, 16 Nov 2012 17:09:19 -0800
>
>> On Sat, 2012-11-17 at 00:32 +0000, Ben Hutchings wrote:
>>> On Fri, 2012-11-16 at 16:16 -0800, Eric Dumazet wrote:
>>>> On Sat, 2012-11-17 at 00:00 +0000, Ben Hutchings wrote:
>>>>
>>>>> I'm not sure what you mean by this.  Is your point that the
>>>>> copy-on-write is never needed?  It is still possible for pskb_may_pull()
>>>>> to fail.
>>>>>
>>>>
>>>> A packet sniffer should have a copy of bad frames, even if dropped later
>>>> in our stacks.
>>>>
>>>> GRO layer is not allowed to drop a frame, even if not 'correct'.
>>>
>>> What do you think the accelerated hardware does with frames that have a
>>> truncated VLAN tag?
>>
>> The hardware should send us the frame, exactly like when RX checksum is
>> wrong.
>
> I agree with Eric, and therefore will not apply this patch.
>

David,

How do you feel about the patchset I posted on 11/14/2012
([PATCH net-next 0/3] myri10ge: LRO to GRO conversion,
http://marc.info/?l=linux-netdev&m=135289838223920&w=2)
which moves myri10ge from LRO to GRO?

Specifically, if doing vlan decap in GRO is not OK, then how
about doing it in the driver?

BTW, if I have bungled something in the myri10ge patchset submission,
I do apologize. I don't frequently submit patches, and it is likely
I screwed up some convention.

Thanks,

Drew

* Re: [PATCH] sctp: fix -ENOMEM result with invalid user space pointer in sendto() syscall
From: Vlad Yasevich @ 2012-11-26 14:56 UTC (permalink / raw)
  To: Tommi Rantala
  Cc: linux-sctp, netdev, Neil Horman, Sridhar Samudrala,
	David S. Miller, Dave Jones
In-Reply-To: <1353590596-12216-1-git-send-email-tt.rantala@gmail.com>

On 11/22/2012 08:23 AM, Tommi Rantala wrote:
> Consider the following program, that sets the second argument to the
> sendto() syscall incorrectly:
>
>   #include <string.h>
>   #include <arpa/inet.h>
>   #include <sys/socket.h>
>
>   int main(void)
>   {
>           int fd;
>           struct sockaddr_in sa;
>
>           fd = socket(AF_INET, SOCK_STREAM, 132 /*IPPROTO_SCTP*/);
>           if (fd < 0)
>                   return 1;
>
>           memset(&sa, 0, sizeof(sa));
>           sa.sin_family = AF_INET;
>           sa.sin_addr.s_addr = inet_addr("127.0.0.1");
>           sa.sin_port = htons(11111);
>
>           sendto(fd, NULL, 1, 0, (struct sockaddr *)&sa, sizeof(sa));
>
>           return 0;
>   }
>
> We get -ENOMEM:
>
>   $ strace -e sendto ./demo
>   sendto(3, NULL, 1, 0, {sa_family=AF_INET, sin_port=htons(11111), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 ENOMEM (Cannot allocate memory)
>
> Propagate the error code from sctp_user_addto_chunk(), so that we will
> tell user space what actually went wrong:
>
>   $ strace -e sendto ./demo
>   sendto(3, NULL, 1, 0, {sa_family=AF_INET, sin_port=htons(11111), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EFAULT (Bad address)
>
> Noticed while running Trinity (the syscall fuzzer).
>
> Signed-off-by: Tommi Rantala <tt.rantala@gmail.com>

Looks good

Acked-by: Vlad Yasevich <vyasevich@gmail.com>

-vlad

> ---
>   net/sctp/chunk.c  |   13 +++++++++----
>   net/sctp/socket.c |    4 ++--
>   2 files changed, 11 insertions(+), 6 deletions(-)
>
> diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c
> index d241ef5..3952ca9 100644
> --- a/net/sctp/chunk.c
> +++ b/net/sctp/chunk.c
> @@ -183,7 +183,7 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
>
>   	msg = sctp_datamsg_new(GFP_KERNEL);
>   	if (!msg)
> -		return NULL;
> +		return ERR_PTR(-ENOMEM);
>
>   	/* Note: Calculate this outside of the loop, so that all fragments
>   	 * have the same expiration.
> @@ -280,8 +280,11 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
>
>   		chunk = sctp_make_datafrag_empty(asoc, sinfo, len, frag, 0);
>
> -		if (!chunk)
> +		if (!chunk) {
> +			err = -ENOMEM;
>   			goto errout;
> +		}
> +
>   		err = sctp_user_addto_chunk(chunk, offset, len, msgh->msg_iov);
>   		if (err < 0)
>   			goto errout_chunk_put;
> @@ -315,8 +318,10 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
>
>   		chunk = sctp_make_datafrag_empty(asoc, sinfo, over, frag, 0);
>
> -		if (!chunk)
> +		if (!chunk) {
> +			err = -ENOMEM;
>   			goto errout;
> +		}
>
>   		err = sctp_user_addto_chunk(chunk, offset, over,msgh->msg_iov);
>
> @@ -342,7 +347,7 @@ errout:
>   		sctp_chunk_free(chunk);
>   	}
>   	sctp_datamsg_put(msg);
> -	return NULL;
> +	return ERR_PTR(err);
>   }
>
>   /* Check whether this message has expired. */
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index a60d1f8..406d957 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -1915,8 +1915,8 @@ SCTP_STATIC int sctp_sendmsg(struct kiocb *iocb, struct sock *sk,
>
>   	/* Break the message into multiple chunks of maximum size. */
>   	datamsg = sctp_datamsg_from_user(asoc, sinfo, msg, msg_len);
> -	if (!datamsg) {
> -		err = -ENOMEM;
> +	if (IS_ERR(datamsg)) {
> +		err = PTR_ERR(datamsg);
>   		goto out_free;
>   	}
>
>

* Re: [PATCH] sctp: fix memory leak in sctp_datamsg_from_user() when copy from user space fails
From: Vlad Yasevich @ 2012-11-26 14:52 UTC (permalink / raw)
  To: Tommi Rantala
  Cc: linux-sctp, netdev, Neil Horman, Sridhar Samudrala,
	David S. Miller, Dave Jones
In-Reply-To: <1353590491-12166-1-git-send-email-tt.rantala@gmail.com>

On 11/22/2012 08:21 AM, Tommi Rantala wrote:
> Trinity (the syscall fuzzer) discovered a memory leak in SCTP,
> reproducible e.g. with the sendto() syscall by passing invalid
> user space pointer in the second argument:
>
>   #include <string.h>
>   #include <arpa/inet.h>
>   #include <sys/socket.h>
>
>   int main(void)
>   {
>           int fd;
>           struct sockaddr_in sa;
>
>           fd = socket(AF_INET, SOCK_STREAM, 132 /*IPPROTO_SCTP*/);
>           if (fd < 0)
>                   return 1;
>
>           memset(&sa, 0, sizeof(sa));
>           sa.sin_family = AF_INET;
>           sa.sin_addr.s_addr = inet_addr("127.0.0.1");
>           sa.sin_port = htons(11111);
>
>           sendto(fd, NULL, 1, 0, (struct sockaddr *)&sa, sizeof(sa));
>
>           return 0;
>   }
>
> As far as I can tell, the leak has been around since ~2003.
>
> Signed-off-by: Tommi Rantala <tt.rantala@gmail.com>
> ---
>   net/sctp/chunk.c |    7 +++++--
>   1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c
> index 7c2df9c..d241ef5 100644
> --- a/net/sctp/chunk.c
> +++ b/net/sctp/chunk.c
> @@ -284,7 +284,7 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
>   			goto errout;
>   		err = sctp_user_addto_chunk(chunk, offset, len, msgh->msg_iov);
>   		if (err < 0)
> -			goto errout;
> +			goto errout_chunk_put;
>
>   		offset += len;
>
> @@ -324,7 +324,7 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
>   		__skb_pull(chunk->skb, (__u8 *)chunk->chunk_hdr
>   			   - (__u8 *)chunk->skb->data);
>   		if (err < 0)
> -			goto errout;
> +			goto errout_chunk_put;
>
>   		sctp_datamsg_assign(msg, chunk);
>   		list_add_tail(&chunk->frag_list, &msg->chunks);
> @@ -332,6 +332,9 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
>
>   	return msg;
>
> +errout_chunk_put:
> +	sctp_chunk_put(chunk);
> +
>   errout:
>   	list_for_each_safe(pos, temp, &msg->chunks) {
>   		list_del_init(pos);
>

Should be using sctp_chunk_free().  Good find.

-vlad

* Re: [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems
From: Jesper Dangaard Brouer @ 2012-11-26 14:42 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Florian Westphal, netdev, Pablo Neira Ayuso,
	Thomas Graf, Cong Wang, Patrick McHardy, Paul E. McKenney,
	Herbert Xu
In-Reply-To: <1353859891.30446.634.camel@edumazet-glaptop>

On Sun, 2012-11-25 at 08:11 -0800, Eric Dumazet wrote:
> On Sun, 2012-11-25 at 09:53 +0100, Jesper Dangaard Brouer wrote:
> 
> > Yes, for the default large 64k packets size, its just a "fake"
> > benchmark.  And notice with my fixes, we are even faster than the
> > none-frag/single-UDP packet case... but its because we are getting a
> > GSO/GRO effect.
> 
> Could you elaborate on this GSO/GRO effect ?

On the big system, I saw none-frag UDP (1472 bytes) throughput of:
  7356.57 + 7351.78 + 7330.60 + 7269.26 = 29308.21 Mbit/s

While with UDP fragments size 65507 bytes I saw:
  9228.75 + 9207.81 + 9615.83 + 9615.87 = 37668.26 Mbit/s

Fragmented UDP is faster by:
 37668.26 - 29308.21 = 8360.05 Mbit/s

The 65507 bytes UDP size is just a benchmark test, and has no real-life
relevance, as performance starts to drop (below the none-frag/normal case)
when the frag size is decreased to more realistic sizes...


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

* Re: [PATCH v2] atm: br2684: Fix excessive queue bloat
From: Jesper Dangaard Brouer @ 2012-11-26 14:16 UTC (permalink / raw)
  To: David Woodhouse
  Cc: netdev, John Crispin, Dave Täht, Chas Williams (CONTRACTOR),
	Jesper Brouer
In-Reply-To: <1353881212.26346.303.camel@shinybook.infradead.org>


Nice work, David Woodhouse.  Which OpenWRT-supported box has this
hardware? (I want one so I can play with it ;-))

Cheers,
   Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------

* [PATCH] irda: irttp: fix memory leak in irttp_open_tsap() error path
From: Tommi Rantala @ 2012-11-26 14:16 UTC (permalink / raw)
  To: netdev; +Cc: Samuel Ortiz, David S. Miller, Dave Jones, Tommi Rantala

Clean up the memory we allocated earlier in irttp_open_tsap() when we hit
this error path. The leak goes back to at least 1da177e4
("Linux-2.6.12-rc2").

Discovered with Trinity (the syscall fuzzer).

Signed-off-by: Tommi Rantala <tt.rantala@gmail.com>
---
 net/irda/irttp.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/net/irda/irttp.c b/net/irda/irttp.c
index 1002e33..ae43c62 100644
--- a/net/irda/irttp.c
+++ b/net/irda/irttp.c
@@ -441,6 +441,7 @@ struct tsap_cb *irttp_open_tsap(__u8 stsap_sel, int credit, notify_t *notify)
 	lsap = irlmp_open_lsap(stsap_sel, &ttp_notify, 0);
 	if (lsap == NULL) {
 		IRDA_DEBUG(0, "%s: unable to allocate LSAP!!\n", __func__);
+		__irttp_close_tsap(self);
 		return NULL;
 	}

-- 
1.7.9.5

* Re: [PATCH 4/5] smsc95xx: refactor entering suspend modes
From: Bjørn Mork @ 2012-11-26 13:48 UTC (permalink / raw)
  To: Steve Glendinning
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, linux-usb-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1353607526-19307-5-git-send-email-steve.glendinning-nksJyM/082jR7s880joybQ@public.gmane.org>

[adding linux-usb to CC as this is very USB specific]

Steve Glendinning <steve.glendinning-nksJyM/082jR7s880joybQ@public.gmane.org> writes:

> +	smsc95xx_set_feature(dev, USB_DEVICE_REMOTE_WAKEUP);

That does look a bit strange to me.  This is a USB interface driver.
The USB device is handled by the generic "usb" USB device driver, which
will DTRT for you.  I don't think you need to set any USB device
features here.

Sorry for not commenting on this earlier.... It took me a while to
understand why that part surprised me.

Bjørn

* Re: [PATCH net-next v2] net: clean up locking in inet_frag_find()
From: Jesper Dangaard Brouer @ 2012-11-26 13:42 UTC (permalink / raw)
  To: Cong Wang, David Miller; +Cc: netdev, Eric Dumazet
In-Reply-To: <1353914786-10426-1-git-send-email-amwang@redhat.com>


Could we please hold back on this cleanup patch, as I have a stack of 9
patches modifying this area.

If people find this cleanup useful and correct, I can integrate it into my
patch stack...

--Jesper

On Mon, 2012-11-26 at 15:26 +0800, Cong Wang wrote:
> It is weird to take the read lock outside of inet_frag_find()
> but release it inside...  This can be improved by refactoring
> the code, that is, introducing inet{4,6}_frag_find() which call
> their own hash functions, inet{4,6}_hash_frag(), hiding the
> details from their callers.

* Re: linux-next: manual merge of the akpm tree with Linus' tree
From: Xiaotian Feng @ 2012-11-26 13:25 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: Andrew Morton, linux-next, linux-kernel, David Miller, netdev
In-Reply-To: <20121126234844.1952cdd84c3ba041cfe7a9af@canb.auug.org.au>

On Mon, Nov 26, 2012 at 8:48 PM, Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> Hi Andrew,
>
> Today's linux-next merge of the akpm tree got a conflict in
> drivers/net/ethernet/jme.c between commit 71c6c837a0fe ("drivers/net: fix
> tasklet misuse issue") from Linus' tree and commit  "tasklet: ignore
> disabled tasklet in tasklet_action()" from the akpm tree.
>

You can simply remove the following part of the patch


@@ -1862,8 +1862,8 @@ jme_open(struct net_device *netdev)

         tasklet_enable(&jme->linkch_task);
         tasklet_enable(&jme->txclean_task);
-       tasklet_hi_enable(&jme->rxclean_task);
-       tasklet_hi_enable(&jme->rxempty_task);
+       tasklet_enable(&jme->rxclean_task);
+       tasklet_enable(&jme->rxempty_task);

         rc = jme_request_irq(jme);
         if (rc)

Do you want me to re-generate a patch for you?


> I am not sure what to do here, so I have dropped the akpm patch.
>
> --
> Cheers,
> Stephen Rothwell                    sfr@canb.auug.org.au

* linux-next: manual merge of the akpm tree with Linus' tree
From: Stephen Rothwell @ 2012-11-26 12:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-next, linux-kernel, Xiaotian Feng, David Miller, netdev

Hi Andrew,

Today's linux-next merge of the akpm tree got a conflict in
drivers/net/ethernet/jme.c between commit 71c6c837a0fe ("drivers/net: fix
tasklet misuse issue") from Linus' tree and commit  "tasklet: ignore
disabled tasklet in tasklet_action()" from the akpm tree.

I am not sure what to do here, so I have dropped the akpm patch.

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au

* iputils-s20121126
From: YOSHIFUJI Hideaki @ 2012-11-26 12:13 UTC (permalink / raw)
  To: 'netdev@vger.kernel.org'; +Cc: YOSHIFUJI Hideaki

Hello.

iputils-s20121126 comes with a lot of bug fixes and improvements.

New features and bug fixes (selected) since s20121114:
        - static link support for libidn.
        - arping: select default interface.
        - ninfod: rejects queries from global addresses.
        - ping: do not free uninitialized value (bug since s20121112).
        - ping6: -N subject-ipv6 and subject-ipv4 sub-options fixed.
        - ping6: Randomize nonce field in NI Queries.
        - ping6: source routing deprecated.
        - tracepath: broken if port was omitted (bug since s20121112).

Files:
        https://sourceforge.net/projects/iputils/files/
        http://www.skbuff.net/iputils/
Tree:
        http://www.linux-ipv6.org/gitweb/gitweb.cgi?p=gitroot/iputils.git
        https://sourceforge.net/p/iputils/code/ci/HEAD/tree/

Regards,

--yoshfuji
----------
Changelogs:

Jan Synacek (2):
      ping,ping6: Add newline to error message.
      ping: Don't free an unintialized value.

YOSHIFUJI Hideaki (69):
      arping,clockdiff,ping,rarpd,rdisc,traceroute6 doc: s/CAP_NET_RAWIO/CAP_NET_RAW/.
      ping,ping6: Do not assume radix point is denoted by '.' (-i option).
      arping,ping,ping6,rdisc,traceroute6: Fix version string.
      makefile: Give -fno-strict-aliasing to compiler by default.
      ping6: Use SCOPE_DELIMITER.
      Makefile: Remove -lm from ADDLIB.
      rdisc_srv,Makefile: Fix build.
      rdisc_srv,Makefile: Build rdisc_srv with make all.
      arping: set_device_broadcast() does not need to store return value of sub-functions.
      arping,Makefile: Make default interface configurable.
      arping: Do not allow empty device name (-I option).
      arping: Introduce check_ifflags() helper function.
      arping: Introduce device structure to hold output device information.
      arping: ALlow no default interface and select one by getifaddrs().
      arping: Introduce 2nd (legacy) method to select interface by ioctls.
      arping,Makefile: Allow build without getifaddrs() with WITHOUT_IFADDRS=yes.
      Makefile: Use $< instead of $^ to complile C source code.
      ping,ping6: Reorder command-line options in alphabetical order.
      ping6: Show suboptions for Node Information Queries if -N suboption is invalid.
      ping,ping6 doc: Readability for TOS (-Q) option.
      rdisc: Missing new line after usage.
      rdisc: Make rdisc with responder support if configured.
      Makefile: distclean depends on clean.
      Makefile: Default to -O3.
      Makefile: Minimize options to gcc.
      Makefile: Add rule to build assembly files.
      arping,Makefile: 3rd legacy implementation to check network devices.
      arping: Less ifdefs.
      rdisc doc: Document -r, -p and -T options.
      ping6: NI Subjecet address did not work (-N subject-{ipv6,ipv4] suboptions).
      ping6: Ensure to detect subject type conflicts.
      iputils-s20121121
      ping6: Use IN6_IS_ADDR_UNSPECIFIED() instead of our own helper function.
      ping6 doc: Explicitly describe ping6 is IPv6 version if ping.
      ping6: Deprecate source routing by default (RFC5095).
      ping6: Use RFC3542 functions and definition for source routing.
      ping6: Introduce niquery_is_enabled() for readability.
      arping doc: interface is optional (-I option).
      ping: Eliminate dirty hack to cope with ancient egcs bug.
      Makefile: Fix missing right parenthese in comment.
      arping: Fix build failure with USE_SYSFS=yes and/or WITHOUT_IFADDRS=yes
      arping: Unify source files.
      arping: Reorder functions and comment out unsued code.
      arping,ping,ping6,tracepath,traceroute6 Makefile: Support static link of libidn by USE_IDN=static.
      Makefile: Minimize statically linked libraries.
      ping6: Do not clear seq check array twice for NI.
      ping6: Use MD5_DIGEST_LENGTH instead of magic value 16.
      ping6: Introduce helper functions for nonce in NI.
      ping6: Introduce NI_NONCE_SIZE macro instead of magic value 8.
      ping6: Ensure to call srand() to get some randomness in NI Nonce.
      ping6: Generate different NI Nonce in each NI Query (Memory version).
      ping6: Generate different NI Nonce in each NI Query (MD5 version).
      ping6: Cache NI Nonce.
      ping6: Print 'sequence number' embedded in NI Nonce.
      ninfod: Do noy try to memcpy to self.
      ninfod Makefile: More precise dependencies.
      ninfod: Discard multicat packet outside linklocal scope.
      ninfod: Apply default policy to refuse queries from global addresses.
      ninfod: Normalize timespec for delay.
      ninfod: Fix double-free without pthreads.
      ninfod: Do not mix output from multiple threads.
      ninfod: Employ internal buffer in stderrlog() for common case.
      iputils-s20121125
      tracepath: Repair tracepath without -p option.
      tracepath,tracepath6: -p option in usage.
      ping,ping6: Use MAX_DUP_CHK directly, not using mx_dup_chk variable.
      ping,ping6: Abstract received bitmap macros/definitions.
      ping,ping6: Use __u64 or __u32 for bitmap.
      iputils-s20121126

* [PATCH net-next] cpts: add missing kconfig dependency
From: Richard Cochran @ 2012-11-26 12:07 UTC (permalink / raw)
  To: netdev; +Cc: linux-arm-kernel, David Miller, Cyril Chemparathy, Mugunthan V N

The Common Platform Time Sync function of the CPSW does not depend on the
CPSW configuration option as it should. This patch fixes the issue by
adding the dependency.

Signed-off-by: Richard Cochran <richardcochran@gmail.com>
---
 drivers/net/ethernet/ti/Kconfig |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/ti/Kconfig b/drivers/net/ethernet/ti/Kconfig
index 48fcb5e..4426151 100644
--- a/drivers/net/ethernet/ti/Kconfig
+++ b/drivers/net/ethernet/ti/Kconfig
@@ -62,6 +62,7 @@ config TI_CPSW
 
 config TI_CPTS
 	boolean "TI Common Platform Time Sync (CPTS) Support"
+	depends on TI_CPSW
 	select PTP_1588_CLOCK
 	---help---
 	  This driver supports the Common Platform Time Sync unit of
-- 
1.7.2.5

* [PATCH net-next 2/2] ptp: reduce stack usage when measuring the system time offset
From: Richard Cochran @ 2012-11-26 11:44 UTC (permalink / raw)
  To: netdev; +Cc: David Miller
In-Reply-To: <1146e32bcb835ebf394bab91db3161600ab5213a.1353929829.git.richardcochran@gmail.com>

This patch removes the large buffer from the stack of the system
offset ioctl and replaces it with a kmalloced buffer.

Signed-off-by: Richard Cochran <richardcochran@gmail.com>
---
 drivers/ptp/ptp_chardev.c |   21 ++++++++++++++-------
 1 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/drivers/ptp/ptp_chardev.c b/drivers/ptp/ptp_chardev.c
index 9d7542e..34a0c60 100644
--- a/drivers/ptp/ptp_chardev.c
+++ b/drivers/ptp/ptp_chardev.c
@@ -34,7 +34,7 @@ long ptp_ioctl(struct posix_clock *pc, unsigned int cmd, unsigned long arg)
 {
 	struct ptp_clock_caps caps;
 	struct ptp_clock_request req;
-	struct ptp_sys_offset sysoff;
+	struct ptp_sys_offset *sysoff = NULL;
 	struct ptp_clock *ptp = container_of(pc, struct ptp_clock, clock);
 	struct ptp_clock_info *ops = ptp->info;
 	struct ptp_clock_time *pct;
@@ -94,17 +94,22 @@ long ptp_ioctl(struct posix_clock *pc, unsigned int cmd, unsigned long arg)
 		break;
 
 	case PTP_SYS_OFFSET:
-		if (copy_from_user(&sysoff, (void __user *)arg,
-				   sizeof(sysoff))) {
+		sysoff = kmalloc(sizeof(*sysoff), GFP_KERNEL);
+		if (!sysoff) {
+			err = -ENOMEM;
+			break;
+		}
+		if (copy_from_user(sysoff, (void __user *)arg,
+				   sizeof(*sysoff))) {
 			err = -EFAULT;
 			break;
 		}
-		if (sysoff.n_samples > PTP_MAX_SAMPLES) {
+		if (sysoff->n_samples > PTP_MAX_SAMPLES) {
 			err = -EINVAL;
 			break;
 		}
-		pct = &sysoff.ts[0];
-		for (i = 0; i < sysoff.n_samples; i++) {
+		pct = &sysoff->ts[0];
+		for (i = 0; i < sysoff->n_samples; i++) {
 			getnstimeofday(&ts);
 			pct->sec = ts.tv_sec;
 			pct->nsec = ts.tv_nsec;
@@ -117,7 +122,7 @@ long ptp_ioctl(struct posix_clock *pc, unsigned int cmd, unsigned long arg)
 		getnstimeofday(&ts);
 		pct->sec = ts.tv_sec;
 		pct->nsec = ts.tv_nsec;
-		if (copy_to_user((void __user *)arg, &sysoff, sizeof(sysoff)))
+		if (copy_to_user((void __user *)arg, sysoff, sizeof(*sysoff)))
 			err = -EFAULT;
 		break;
 
@@ -125,6 +130,8 @@ long ptp_ioctl(struct posix_clock *pc, unsigned int cmd, unsigned long arg)
 		err = -ENOTTY;
 		break;
 	}
+
+	kfree(sysoff);
 	return err;
 }
 
-- 
1.7.2.5
