* Re: [PATCH] sctp: fix memory leak in sctp_datamsg_from_user() when copy from user space fails
From: Vlad Yasevich @ 2012-11-26 14:52 UTC (permalink / raw)
To: Tommi Rantala
Cc: linux-sctp, netdev, Neil Horman, Sridhar Samudrala,
David S. Miller, Dave Jones
In-Reply-To: <1353590491-12166-1-git-send-email-tt.rantala@gmail.com>
On 11/22/2012 08:21 AM, Tommi Rantala wrote:
> Trinity (the syscall fuzzer) discovered a memory leak in SCTP,
> reproducible e.g. with the sendto() syscall by passing invalid
> user space pointer in the second argument:
>
> #include <string.h>
> #include <arpa/inet.h>
> #include <sys/socket.h>
>
> int main(void)
> {
> int fd;
> struct sockaddr_in sa;
>
> fd = socket(AF_INET, SOCK_STREAM, 132 /*IPPROTO_SCTP*/);
> if (fd < 0)
> return 1;
>
> memset(&sa, 0, sizeof(sa));
> sa.sin_family = AF_INET;
> sa.sin_addr.s_addr = inet_addr("127.0.0.1");
> sa.sin_port = htons(11111);
>
> sendto(fd, NULL, 1, 0, (struct sockaddr *)&sa, sizeof(sa));
>
> return 0;
> }
>
> As far as I can tell, the leak has been around since ~2003.
>
> Signed-off-by: Tommi Rantala <tt.rantala@gmail.com>
> ---
> net/sctp/chunk.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c
> index 7c2df9c..d241ef5 100644
> --- a/net/sctp/chunk.c
> +++ b/net/sctp/chunk.c
> @@ -284,7 +284,7 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
> goto errout;
> err = sctp_user_addto_chunk(chunk, offset, len, msgh->msg_iov);
> if (err < 0)
> - goto errout;
> + goto errout_chunk_put;
>
> offset += len;
>
> @@ -324,7 +324,7 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
> __skb_pull(chunk->skb, (__u8 *)chunk->chunk_hdr
> - (__u8 *)chunk->skb->data);
> if (err < 0)
> - goto errout;
> + goto errout_chunk_put;
>
> sctp_datamsg_assign(msg, chunk);
> list_add_tail(&chunk->frag_list, &msg->chunks);
> @@ -332,6 +332,9 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
>
> return msg;
>
> +errout_chunk_put:
> + sctp_chunk_put(chunk);
> +
> errout:
> list_for_each_safe(pos, temp, &msg->chunks) {
> list_del_init(pos);
>
Should be using sctp_chunk_free(). Good find.
-vlad
^ permalink raw reply
* Re: [PATCH] sctp: fix -ENOMEM result with invalid user space pointer in sendto() syscall
From: Vlad Yasevich @ 2012-11-26 14:56 UTC (permalink / raw)
To: Tommi Rantala
Cc: linux-sctp, netdev, Neil Horman, Sridhar Samudrala,
David S. Miller, Dave Jones
In-Reply-To: <1353590596-12216-1-git-send-email-tt.rantala@gmail.com>
On 11/22/2012 08:23 AM, Tommi Rantala wrote:
> Consider the following program, that sets the second argument to the
> sendto() syscall incorrectly:
>
> #include <string.h>
> #include <arpa/inet.h>
> #include <sys/socket.h>
>
> int main(void)
> {
> int fd;
> struct sockaddr_in sa;
>
> fd = socket(AF_INET, SOCK_STREAM, 132 /*IPPROTO_SCTP*/);
> if (fd < 0)
> return 1;
>
> memset(&sa, 0, sizeof(sa));
> sa.sin_family = AF_INET;
> sa.sin_addr.s_addr = inet_addr("127.0.0.1");
> sa.sin_port = htons(11111);
>
> sendto(fd, NULL, 1, 0, (struct sockaddr *)&sa, sizeof(sa));
>
> return 0;
> }
>
> We get -ENOMEM:
>
> $ strace -e sendto ./demo
> sendto(3, NULL, 1, 0, {sa_family=AF_INET, sin_port=htons(11111), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 ENOMEM (Cannot allocate memory)
>
> Propagate the error code from sctp_user_addto_chunk(), so that we will
> tell user space what actually went wrong:
>
> $ strace -e sendto ./demo
> sendto(3, NULL, 1, 0, {sa_family=AF_INET, sin_port=htons(11111), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EFAULT (Bad address)
>
> Noticed while running Trinity (the syscall fuzzer).
>
> Signed-off-by: Tommi Rantala <tt.rantala@gmail.com>
Looks good
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
-vlad
> ---
> net/sctp/chunk.c | 13 +++++++++----
> net/sctp/socket.c | 4 ++--
> 2 files changed, 11 insertions(+), 6 deletions(-)
>
> diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c
> index d241ef5..3952ca9 100644
> --- a/net/sctp/chunk.c
> +++ b/net/sctp/chunk.c
> @@ -183,7 +183,7 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
>
> msg = sctp_datamsg_new(GFP_KERNEL);
> if (!msg)
> - return NULL;
> + return ERR_PTR(-ENOMEM);
>
> /* Note: Calculate this outside of the loop, so that all fragments
> * have the same expiration.
> @@ -280,8 +280,11 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
>
> chunk = sctp_make_datafrag_empty(asoc, sinfo, len, frag, 0);
>
> - if (!chunk)
> + if (!chunk) {
> + err = -ENOMEM;
> goto errout;
> + }
> +
> err = sctp_user_addto_chunk(chunk, offset, len, msgh->msg_iov);
> if (err < 0)
> goto errout_chunk_put;
> @@ -315,8 +318,10 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
>
> chunk = sctp_make_datafrag_empty(asoc, sinfo, over, frag, 0);
>
> - if (!chunk)
> + if (!chunk) {
> + err = -ENOMEM;
> goto errout;
> + }
>
> err = sctp_user_addto_chunk(chunk, offset, over,msgh->msg_iov);
>
> @@ -342,7 +347,7 @@ errout:
> sctp_chunk_free(chunk);
> }
> sctp_datamsg_put(msg);
> - return NULL;
> + return ERR_PTR(err);
> }
>
> /* Check whether this message has expired. */
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index a60d1f8..406d957 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -1915,8 +1915,8 @@ SCTP_STATIC int sctp_sendmsg(struct kiocb *iocb, struct sock *sk,
>
> /* Break the message into multiple chunks of maximum size. */
> datamsg = sctp_datamsg_from_user(asoc, sinfo, msg, msg_len);
> - if (!datamsg) {
> - err = -ENOMEM;
> + if (IS_ERR(datamsg)) {
> + err = PTR_ERR(datamsg);
> goto out_free;
> }
>
>
^ permalink raw reply
* Re: [PATCH net-next] gro: Handle inline VLAN tags
From: Andrew Gallatin @ 2012-11-26 15:04 UTC (permalink / raw)
To: David Miller; +Cc: eric.dumazet, bhutchings, netdev, linux-net-drivers, herbert
In-Reply-To: <20121119.191002.1995098917961576324.davem@davemloft.net>
On 11/19/12 19:10, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Fri, 16 Nov 2012 17:09:19 -0800
>
>> On Sat, 2012-11-17 at 00:32 +0000, Ben Hutchings wrote:
>>> On Fri, 2012-11-16 at 16:16 -0800, Eric Dumazet wrote:
>>>> On Sat, 2012-11-17 at 00:00 +0000, Ben Hutchings wrote:
>>>>
>>>>> I'm not sure what you mean by this. Is your point that the
>>>>> copy-on-write is never needed? It is still possible for pskb_may_pull()
>>>>> to fail.
>>>>>
>>>>
>>>> A packet sniffer should have a copy of bad frames, even if dropped later
>>>> in our stacks.
>>>>
>>>> GRO layer is not allowed to drop a frame, even if not 'correct'.
>>>
>>> What do you think the accelerated hardware does with frames that have a
>>> truncated VLAN tag?
>>
>> The hardware should send us the frame, exactly like when RX checksum is
>> wrong.
>
> I agree with Eric, and therefore will not apply this patch.
>
David,
How do you feel about the patchset I posted on 11/14/2012
([PATCH net-next 0/3] myri10ge: LRO to GRO conversion,
http://marc.info/?l=linux-netdev&m=135289838223920&w=2)
which moves myri10ge from LRO to GRO?
Specifically, if doing vlan decap in GRO is not OK, then how
about doing it in the driver?
BTW, if I have bungled something it the myri10ge patchset submission,
I do apologize. I don't frequently submit patches, and it is likely
I screwed up some convention..
Thanks,
Drew
^ permalink raw reply
* Re: [PATCH v2 net-next] sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name
From: Brian Haley @ 2012-11-26 15:10 UTC (permalink / raw)
To: David Miller; +Cc: xemul, eric.dumazet, netdev
In-Reply-To: <20121120.135842.249477087130415954.davem@davemloft.net>
On 11/20/2012 01:58 PM, David Miller wrote:
>> v2: Added seqlock protection while copying device name.
>>
>> Signed-off-by: Brian Haley <brian.haley@hp.com>
>
> Brian I was going to apply this, but something about how you email
> patches results in them being corrupted.
>
> Go to:
>
> http://patchwork.ozlabs.org/patch/199732/
>
> Click on Download "mbox", and try to apply that to the net-next tree
> to see what I mean.
I'll take a look why that was wrapping and send a v3, been away...
-Brian
^ permalink raw reply
* Re: [PATCH net-next v2] net: clean up locking in inet_frag_find()
From: Eric Dumazet @ 2012-11-26 15:12 UTC (permalink / raw)
To: Cong Wang; +Cc: netdev, Patrick McHardy, Pablo Neira Ayuso, David S. Miller
In-Reply-To: <1353914786-10426-1-git-send-email-amwang@redhat.com>
On Mon, 2012-11-26 at 15:26 +0800, Cong Wang wrote:
> It is weird to take the read lock outside of inet_frag_find()
> but release it inside... This can be improved by refactoring
> the code, that is, introducing inet{4,6}_frag_find() which call
> the their own hash function, inet{4,6}_hash_frag(), hiding the
> details from their callers.
>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Patrick McHardy <kaber@trash.net>
> Cc: Pablo Neira Ayuso <pablo@netfilter.org>
> Cc: David S. Miller <davem@davemloft.net>
> Signed-off-by: Cong Wang <amwang@redhat.com>
>
> ---
> include/net/inet_frag.h | 17 +++++-
> include/net/ipv6.h | 3 -
> net/ipv4/inet_fragment.c | 82 +++++++++++++++++++++++++++++--
> net/ipv4/ip_fragment.c | 16 +-----
> net/ipv6/netfilter/nf_conntrack_reasm.c | 7 +--
> net/ipv6/reassembly.c | 34 +------------
> 6 files changed, 97 insertions(+), 62 deletions(-)
Not clear to me its a win, as it adds 35 LOC. Nobody really complained
of this locking schem in the past.
Also Jesper is working on this stuff, so you dont really ease its work.
^ permalink raw reply
* Re: [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems
From: Eric Dumazet @ 2012-11-26 15:15 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: David S. Miller, Florian Westphal, netdev, Pablo Neira Ayuso,
Thomas Graf, Cong Wang, Patrick McHardy, Paul E. McKenney,
Herbert Xu
In-Reply-To: <1353940930.11754.221.camel@localhost>
On Mon, 2012-11-26 at 15:42 +0100, Jesper Dangaard Brouer wrote:
> On Sun, 2012-11-25 at 08:11 -0800, Eric Dumazet wrote:
> > On Sun, 2012-11-25 at 09:53 +0100, Jesper Dangaard Brouer wrote:
> >
> > > Yes, for the default large 64k packets size, its just a "fake"
> > > benchmark. And notice with my fixes, we are even faster than the
> > > none-frag/single-UDP packet case... but its because we are getting a
> > > GSO/GRO effect.
> >
> > Could you elaborate on this GSO/GRO effect ?
>
> On the big system, I saw none-frag UDP (1472 bytes) throughput of:
> 7356.57 + 7351.78 + 7330.60 + 7269.26 = 29308.21 Mbit/s
>
> While with UDP fragments size 65507 bytes I saw:
> 9228.75 + 9207.81 + 9615.83 + 9615.87 = 37668.26 Mbit/s
>
> Fragmented UDP is faster by:
> 37668.26 - 29308.21 = 8360.05 Mbit/s
>
> The 65507 bytes UDP size is just a benchmark test, and have no real-life
> relevance. As performance starts to drop (below none-frag/normal case)
> when the frag size is decreased, to more realistic sizes...
Yes, but I doubt GRO / GSO are the reason you get better performance.
GRO doesnt aggregate UDP frames.
^ permalink raw reply
* Re: [PATCH RFC 0/5] Containerize syslog
From: Eric W. Biederman @ 2012-11-26 15:16 UTC (permalink / raw)
To: Rui Xiang; +Cc: Serge E. Hallyn, serge.hallyn, containers, netdev
In-Reply-To: <50ACA05F.7080005@gmail.com>
Rui Xiang <leo.ruixiang@gmail.com> writes:
> On 2012-11-19 22:37, Serge E. Hallyn wrote:
>> I understand that user namespaces aren't 100% usable yet, but looking
>> long term, is there a reason to have the syslog namespace separate
>> from user namespace?
>
> Actually we don't have strong preference. We'll think more about it. Hope we can make
> consensus with Eric.
I hope I am not hard to work with. My primary concern is reasonable
looking code and good long term maintainable semantics.
I really don't care in which namespace where we file the kernel log
statements.
I care much more about which kernel log print statements we want filed
differently.
Eric
^ permalink raw reply
* [PATCH v3 net-next] sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name
From: Brian Haley @ 2012-11-26 15:21 UTC (permalink / raw)
To: David Miller; +Cc: Pavel Emelyanov, Eric Dumazet, netdev@vger.kernel.org
Instead of having the getsockopt() of SO_BINDTODEVICE return an index, which
will then require another call like if_indextoname() to get the actual interface
name, have it return the name directly.
This also matches the existing man page description on socket(7) which mentions
the argument being an interface name.
If the value has not been set, zero is returned and optlen will be set to zero
to indicate there is no interface name present.
Added a seqlock to protect this code path, and dev_ifname(), from someone
changing the device name via dev_change_name().
v2: Added seqlock protection while copying device name.
v3: Fixed word wrap in patch.
Signed-off-by: Brian Haley <brian.haley@hp.com>
--
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e46c830..e9929ab 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1567,6 +1567,8 @@ extern int call_netdevice_notifiers(unsigned long val, struct net_device *dev);
extern rwlock_t dev_base_lock; /* Device list lock */
+extern seqlock_t devnet_rename_seq; /* Device rename lock */
+
#define for_each_netdev(net, d) \
list_for_each_entry(d, &(net)->dev_base_head, dev_list)
diff --git a/net/core/dev.c b/net/core/dev.c
index 7304ea8..2a5f558 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -203,6 +203,8 @@ static struct list_head offload_base __read_mostly;
DEFINE_RWLOCK(dev_base_lock);
EXPORT_SYMBOL(dev_base_lock);
+DEFINE_SEQLOCK(devnet_rename_seq);
+
static inline void dev_base_seq_inc(struct net *net)
{
while (++net->dev_base_seq == 0);
@@ -1091,22 +1093,31 @@ int dev_change_name(struct net_device *dev, const char *newname)
if (dev->flags & IFF_UP)
return -EBUSY;
- if (strncmp(newname, dev->name, IFNAMSIZ) == 0)
+ write_seqlock(&devnet_rename_seq);
+
+ if (strncmp(newname, dev->name, IFNAMSIZ) == 0) {
+ write_sequnlock(&devnet_rename_seq);
return 0;
+ }
memcpy(oldname, dev->name, IFNAMSIZ);
err = dev_get_valid_name(net, dev, newname);
- if (err < 0)
+ if (err < 0) {
+ write_sequnlock(&devnet_rename_seq);
return err;
+ }
rollback:
ret = device_rename(&dev->dev, dev->name);
if (ret) {
memcpy(dev->name, oldname, IFNAMSIZ);
+ write_sequnlock(&devnet_rename_seq);
return ret;
}
+ write_sequnlock(&devnet_rename_seq);
+
write_lock_bh(&dev_base_lock);
hlist_del_rcu(&dev->name_hlist);
write_unlock_bh(&dev_base_lock);
@@ -1124,6 +1135,7 @@ rollback:
/* err >= 0 after dev_alloc_name() or stores the first errno */
if (err >= 0) {
err = ret;
+ write_seqlock(&devnet_rename_seq);
memcpy(dev->name, oldname, IFNAMSIZ);
goto rollback;
} else {
@@ -4148,6 +4160,7 @@ static int dev_ifname(struct net *net, struct ifreq __user *arg)
{
struct net_device *dev;
struct ifreq ifr;
+ unsigned seq;
/*
* Fetch the caller's info block.
@@ -4156,6 +4169,8 @@ static int dev_ifname(struct net *net, struct ifreq __user *arg)
if (copy_from_user(&ifr, arg, sizeof(struct ifreq)))
return -EFAULT;
+retry:
+ seq = read_seqbegin(&devnet_rename_seq);
rcu_read_lock();
dev = dev_get_by_index_rcu(net, ifr.ifr_ifindex);
if (!dev) {
@@ -4165,6 +4180,8 @@ static int dev_ifname(struct net *net, struct ifreq __user *arg)
strcpy(ifr.ifr_name, dev->name);
rcu_read_unlock();
+ if (read_seqretry(&devnet_rename_seq, seq))
+ goto retry;
if (copy_to_user(arg, &ifr, sizeof(struct ifreq)))
return -EFAULT;
diff --git a/net/core/sock.c b/net/core/sock.c
index d4f7b58..a692ef4 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -505,7 +505,8 @@ struct dst_entry *sk_dst_check(struct sock *sk, u32 cookie)
}
EXPORT_SYMBOL(sk_dst_check);
-static int sock_bindtodevice(struct sock *sk, char __user *optval, int optlen)
+static int sock_setbindtodevice(struct sock *sk, char __user *optval,
+ int optlen)
{
int ret = -ENOPROTOOPT;
#ifdef CONFIG_NETDEVICES
@@ -562,6 +563,59 @@ out:
return ret;
}
+static int sock_getbindtodevice(struct sock *sk, char __user *optval,
+ int __user *optlen, int len)
+{
+ int ret = -ENOPROTOOPT;
+#ifdef CONFIG_NETDEVICES
+ struct net *net = sock_net(sk);
+ struct net_device *dev;
+ char devname[IFNAMSIZ];
+ unsigned seq;
+
+ if (sk->sk_bound_dev_if == 0) {
+ len = 0;
+ goto zero;
+ }
+
+ ret = -EINVAL;
+ if (len < IFNAMSIZ)
+ goto out;
+
+retry:
+ seq = read_seqbegin(&devnet_rename_seq);
+ rcu_read_lock();
+ dev = dev_get_by_index_rcu(net, sk->sk_bound_dev_if);
+ ret = -ENODEV;
+ if (!dev) {
+ rcu_read_unlock();
+ goto out;
+ }
+
+ strcpy(devname, dev->name);
+ rcu_read_unlock();
+ if (read_seqretry(&devnet_rename_seq, seq))
+ goto retry;
+
+ len = strlen(devname) + 1;
+
+ ret = -EFAULT;
+ if (copy_to_user(optval, devname, len))
+ goto out;
+
+zero:
+ ret = -EFAULT;
+ if (put_user(len, optlen))
+ goto out;
+
+ ret = 0;
+
+out:
+#endif
+
+ return ret;
+}
+
static inline void sock_valbool_flag(struct sock *sk, int bit, int valbool)
{
if (valbool)
@@ -589,7 +643,7 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
*/
if (optname == SO_BINDTODEVICE)
- return sock_bindtodevice(sk, optval, optlen);
+ return sock_setbindtodevice(sk, optval, optlen);
if (optlen < sizeof(int))
return -EINVAL;
@@ -1075,15 +1129,17 @@ int sock_getsockopt(struct socket *sock, int level, int optname,
case SO_NOFCS:
v.val = sock_flag(sk, SOCK_NOFCS);
break;
+
case SO_BINDTODEVICE:
- v.val = sk->sk_bound_dev_if;
- break;
+ return sock_getbindtodevice(sk, optval, optlen, len);
+
case SO_GET_FILTER:
len = sk_get_filter(sk, (struct sock_filter __user *)optval, len);
if (len < 0)
return len;
goto lenout;
+
default:
return -ENOPROTOOPT;
}
^ permalink raw reply related
* Re: [PATCH] sctp: fix memory leak in sctp_datamsg_from_user() when copy from user space fails
From: Neil Horman @ 2012-11-26 15:23 UTC (permalink / raw)
To: Tommi Rantala
Cc: linux-sctp, netdev, Vlad Yasevich, Sridhar Samudrala,
David S. Miller, Dave Jones
In-Reply-To: <1353590491-12166-1-git-send-email-tt.rantala@gmail.com>
On Thu, Nov 22, 2012 at 03:21:31PM +0200, Tommi Rantala wrote:
> Trinity (the syscall fuzzer) discovered a memory leak in SCTP,
> reproducible e.g. with the sendto() syscall by passing invalid
> user space pointer in the second argument:
>
> #include <string.h>
> #include <arpa/inet.h>
> #include <sys/socket.h>
>
> int main(void)
> {
> int fd;
> struct sockaddr_in sa;
>
> fd = socket(AF_INET, SOCK_STREAM, 132 /*IPPROTO_SCTP*/);
> if (fd < 0)
> return 1;
>
> memset(&sa, 0, sizeof(sa));
> sa.sin_family = AF_INET;
> sa.sin_addr.s_addr = inet_addr("127.0.0.1");
> sa.sin_port = htons(11111);
>
> sendto(fd, NULL, 1, 0, (struct sockaddr *)&sa, sizeof(sa));
>
> return 0;
> }
>
> As far as I can tell, the leak has been around since ~2003.
>
> Signed-off-by: Tommi Rantala <tt.rantala@gmail.com>
> ---
> net/sctp/chunk.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c
> index 7c2df9c..d241ef5 100644
> --- a/net/sctp/chunk.c
> +++ b/net/sctp/chunk.c
> @@ -284,7 +284,7 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
> goto errout;
> err = sctp_user_addto_chunk(chunk, offset, len, msgh->msg_iov);
> if (err < 0)
> - goto errout;
> + goto errout_chunk_put;
>
> offset += len;
>
> @@ -324,7 +324,7 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
> __skb_pull(chunk->skb, (__u8 *)chunk->chunk_hdr
> - (__u8 *)chunk->skb->data);
> if (err < 0)
> - goto errout;
> + goto errout_chunk_put;
>
> sctp_datamsg_assign(msg, chunk);
> list_add_tail(&chunk->frag_list, &msg->chunks);
> @@ -332,6 +332,9 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
>
> return msg;
>
> +errout_chunk_put:
> + sctp_chunk_put(chunk);
> +
> errout:
> list_for_each_safe(pos, temp, &msg->chunks) {
> list_del_init(pos);
> --
> 1.7.9.5
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
I'm fine with it the way it is, but it might be nicer if you instead just moved
the list_add_tail call up between the if (!chunk) check and the
sctp_user_addto_chunk call. That way the unwind loop at the errout label can
just free the chunk without the need for an extra label.
Neil
^ permalink raw reply
* Re: [PATCH] sctp: fix -ENOMEM result with invalid user space pointer in sendto() syscall
From: Neil Horman @ 2012-11-26 15:25 UTC (permalink / raw)
To: Tommi Rantala
Cc: linux-sctp, netdev, Vlad Yasevich, Sridhar Samudrala,
David S. Miller, Dave Jones
In-Reply-To: <1353590596-12216-1-git-send-email-tt.rantala@gmail.com>
On Thu, Nov 22, 2012 at 03:23:16PM +0200, Tommi Rantala wrote:
> Consider the following program, that sets the second argument to the
> sendto() syscall incorrectly:
>
> #include <string.h>
> #include <arpa/inet.h>
> #include <sys/socket.h>
>
> int main(void)
> {
> int fd;
> struct sockaddr_in sa;
>
> fd = socket(AF_INET, SOCK_STREAM, 132 /*IPPROTO_SCTP*/);
> if (fd < 0)
> return 1;
>
> memset(&sa, 0, sizeof(sa));
> sa.sin_family = AF_INET;
> sa.sin_addr.s_addr = inet_addr("127.0.0.1");
> sa.sin_port = htons(11111);
>
> sendto(fd, NULL, 1, 0, (struct sockaddr *)&sa, sizeof(sa));
>
> return 0;
> }
>
> We get -ENOMEM:
>
> $ strace -e sendto ./demo
> sendto(3, NULL, 1, 0, {sa_family=AF_INET, sin_port=htons(11111), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 ENOMEM (Cannot allocate memory)
>
> Propagate the error code from sctp_user_addto_chunk(), so that we will
> tell user space what actually went wrong:
>
> $ strace -e sendto ./demo
> sendto(3, NULL, 1, 0, {sa_family=AF_INET, sin_port=htons(11111), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EFAULT (Bad address)
>
> Noticed while running Trinity (the syscall fuzzer).
>
> Signed-off-by: Tommi Rantala <tt.rantala@gmail.com>
> ---
> net/sctp/chunk.c | 13 +++++++++----
> net/sctp/socket.c | 4 ++--
> 2 files changed, 11 insertions(+), 6 deletions(-)
>
> diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c
> index d241ef5..3952ca9 100644
> --- a/net/sctp/chunk.c
> +++ b/net/sctp/chunk.c
> @@ -183,7 +183,7 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
>
> msg = sctp_datamsg_new(GFP_KERNEL);
> if (!msg)
> - return NULL;
> + return ERR_PTR(-ENOMEM);
>
> /* Note: Calculate this outside of the loop, so that all fragments
> * have the same expiration.
> @@ -280,8 +280,11 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
>
> chunk = sctp_make_datafrag_empty(asoc, sinfo, len, frag, 0);
>
> - if (!chunk)
> + if (!chunk) {
> + err = -ENOMEM;
> goto errout;
> + }
> +
> err = sctp_user_addto_chunk(chunk, offset, len, msgh->msg_iov);
> if (err < 0)
> goto errout_chunk_put;
> @@ -315,8 +318,10 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
>
> chunk = sctp_make_datafrag_empty(asoc, sinfo, over, frag, 0);
>
> - if (!chunk)
> + if (!chunk) {
> + err = -ENOMEM;
> goto errout;
> + }
>
> err = sctp_user_addto_chunk(chunk, offset, over,msgh->msg_iov);
>
> @@ -342,7 +347,7 @@ errout:
> sctp_chunk_free(chunk);
> }
> sctp_datamsg_put(msg);
> - return NULL;
> + return ERR_PTR(err);
> }
>
> /* Check whether this message has expired. */
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index a60d1f8..406d957 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -1915,8 +1915,8 @@ SCTP_STATIC int sctp_sendmsg(struct kiocb *iocb, struct sock *sk,
>
> /* Break the message into multiple chunks of maximum size. */
> datamsg = sctp_datamsg_from_user(asoc, sinfo, msg, msg_len);
> - if (!datamsg) {
> - err = -ENOMEM;
> + if (IS_ERR(datamsg)) {
> + err = PTR_ERR(datamsg);
> goto out_free;
> }
>
> --
> 1.7.9.5
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
^ permalink raw reply
* performance regression on HiperSockets depending on MTU size
From: Frank Blaschka @ 2012-11-26 15:32 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev, linux-s390
Hi Eric,
since kernel 3.6 we see a massive performance regression on s390
HiperSockets devices.
HiperSockets differ from normal devices by the fact they support
large MTU sizes (up to 56K). Here are some iperf numbers to show
the problem depended on MTU size:
# ifconfig hsi0 mtu 1500
# iperf -c 10.42.49.2
------------------------------------------------------------
Client connecting to 10.42.49.2, TCP port 5001
TCP window size: 47.6 KByte (default)
------------------------------------------------------------
[ 3] local 10.42.49.1 port 55855 connected with 10.42.49.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 632 MBytes 530 Mbits/sec
# ifconfig hsi0 mtu 9000
# iperf -c 10.42.49.2
------------------------------------------------------------
Client connecting to 10.42.49.2, TCP port 5001
TCP window size: 97.0 KByte (default)
------------------------------------------------------------
[ 3] local 10.42.49.1 port 55856 connected with 10.42.49.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 2.26 GBytes 1.94 Gbits/sec
# ifconfig hsi0 mtu 32000
# iperf -c 10.42.49.2
------------------------------------------------------------
Client connecting to 10.42.49.2, TCP port 5001
TCP window size: 322 KByte (default)
------------------------------------------------------------
[ 3] local 10.42.49.1 port 55857 connected with 10.42.49.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.3 sec 3.12 MBytes 2.53 Mbits/sec
Prior the regression throughput grows with the MTU size but now it drops
to a few Mbits if the MTU is bigger then 15000. It is interesting to see
if 2 or more connections are running in parallel the regression is gone.
# ifconfig hsi0 mtu 32000
# iperf -c 10.42.49.2 -P2
------------------------------------------------------------
Client connecting to 10.42.49.2, TCP port 5001
TCP window size: 322 KByte (default)
------------------------------------------------------------
[ 4] local 10.42.49.1 port 55869 connected with 10.42.49.2 port 5001
[ 3] local 10.42.49.1 port 55868 connected with 10.42.49.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.0 sec 2.19 GBytes 1.88 Gbits/sec
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 2.17 GBytes 1.87 Gbits/sec
[SUM] 0.0-10.0 sec 4.36 GBytes 3.75 Gbits/sec
I bisected the problem to following patch:
commit 46d3ceabd8d98ed0ad10f20c595ca784e34786c5
Author: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed Jul 11 05:50:31 2012 +0000
tcp: TCP Small Queues
This introduce TSQ (TCP Small Queues)
TSQ goal is to reduce number of TCP packets in xmit queues (qdisc &
device queues), to reduce RTT and cwnd bias, part of the bufferbloat
problem.
Changing sysctl net.ipv4.tcp_limit_output_bytes to a higher value
(e.g. 640000) seems to fix the problem.
How does MTU influence/effects TSQ?
Why is the problem gone if there are more connections?
Do you see any drawbacks by increasing net.ipv4.tcp_limit_output_bytes?
Finally is this expected behavior or is there a bug depending on the big
MTU? What can I do to check ... ?
Thx for your help
Frank
^ permalink raw reply
* [PATCH] vhost: fix length for cross region descriptor
From: Michael S. Tsirkin @ 2012-11-26 15:57 UTC (permalink / raw)
To: netdev, David Miller; +Cc: Jason Wang, linux-kernel
If a single descriptor crosses a region, the
second chunk length should be decremented
by size translated so far, instead it includes
the full descriptor length.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
drivers/vhost/vhost.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index ef8f598..5a3d0f1 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1049,7 +1049,7 @@ static int translate_desc(struct vhost_dev *dev, u64 addr, u32 len,
}
_iov = iov + ret;
size = reg->memory_size - addr + reg->guest_phys_addr;
- _iov->iov_len = min((u64)len, size);
+ _iov->iov_len = min((u64)len - s, size);
_iov->iov_base = (void __user *)(unsigned long)
(reg->userspace_addr + addr - reg->guest_phys_addr);
s += size;
--
MST
^ permalink raw reply related
* Re: [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems
From: Jesper Dangaard Brouer @ 2012-11-26 15:29 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S. Miller, Florian Westphal, netdev, Pablo Neira Ayuso,
Thomas Graf, Cong Wang, Patrick McHardy, Paul E. McKenney,
Herbert Xu
In-Reply-To: <1353942950.30446.1772.camel@edumazet-glaptop>
On Mon, 2012-11-26 at 07:15 -0800, Eric Dumazet wrote:
> On Mon, 2012-11-26 at 15:42 +0100, Jesper Dangaard Brouer wrote:
> > On Sun, 2012-11-25 at 08:11 -0800, Eric Dumazet wrote:
> > > On Sun, 2012-11-25 at 09:53 +0100, Jesper Dangaard Brouer wrote:
> > >
> > > > Yes, for the default large 64k packets size, its just a "fake"
> > > > benchmark. And notice with my fixes, we are even faster than the
> > > > none-frag/single-UDP packet case... but its because we are getting a
> > > > GSO/GRO effect.
> > >
> > > Could you elaborate on this GSO/GRO effect ?
> >
> > On the big system, I saw none-frag UDP (1472 bytes) throughput of:
> > 7356.57 + 7351.78 + 7330.60 + 7269.26 = 29308.21 Mbit/s
> >
> > While with UDP fragments size 65507 bytes I saw:
> > 9228.75 + 9207.81 + 9615.83 + 9615.87 = 37668.26 Mbit/s
> >
> > Fragmented UDP is faster by:
> > 37668.26 - 29308.21 = 8360.05 Mbit/s
> >
> > The 65507 bytes UDP size is just a benchmark test, and have no real-life
> > relevance. As performance starts to drop (below none-frag/normal case)
> > when the frag size is decreased, to more realistic sizes...
>
> Yes, but I doubt GRO / GSO are the reason you get better performance.
> GRO doesnt aggregate UDP frames.
Oh, now I think I understand your question.
I don't think GRO is helping me. Its the same "effect" as GRO. As (I
think) that the reasm frag SKB will be a "bigger" SKB, which is passed
to the rest of the stack. Thus, less (but) bigger SKBs get the overhead
of the rest of the stack. It was actually Herbert that mentioned it to
me...
--Jesper
^ permalink raw reply
* Re: performance regression on HiperSockets depending on MTU size
From: Eric Dumazet @ 2012-11-26 16:12 UTC (permalink / raw)
To: Frank Blaschka; +Cc: netdev, linux-s390
In-Reply-To: <20121126153242.GA61652@tuxmaker.boeblingen.de.ibm.com>
On Mon, 2012-11-26 at 16:32 +0100, Frank Blaschka wrote:
> Hi Eric,
>
> since kernel 3.6 we see a massive performance regression on s390
> HiperSockets devices.
>
> HiperSockets differ from normal devices by the fact they support
> large MTU sizes (up to 56K). Here are some iperf numbers to show
> the problem depended on MTU size:
>
> # ifconfig hsi0 mtu 1500
> # iperf -c 10.42.49.2
> ------------------------------------------------------------
> Client connecting to 10.42.49.2, TCP port 5001
> TCP window size: 47.6 KByte (default)
> ------------------------------------------------------------
> [ 3] local 10.42.49.1 port 55855 connected with 10.42.49.2 port 5001
> [ ID] Interval Transfer Bandwidth
> [ 3] 0.0-10.0 sec 632 MBytes 530 Mbits/sec
>
> # ifconfig hsi0 mtu 9000
> # iperf -c 10.42.49.2
> ------------------------------------------------------------
> Client connecting to 10.42.49.2, TCP port 5001
> TCP window size: 97.0 KByte (default)
> ------------------------------------------------------------
> [ 3] local 10.42.49.1 port 55856 connected with 10.42.49.2 port 5001
> [ ID] Interval Transfer Bandwidth
> [ 3] 0.0-10.0 sec 2.26 GBytes 1.94 Gbits/sec
>
> # ifconfig hsi0 mtu 32000
> # iperf -c 10.42.49.2
> ------------------------------------------------------------
> Client connecting to 10.42.49.2, TCP port 5001
> TCP window size: 322 KByte (default)
> ------------------------------------------------------------
> [ 3] local 10.42.49.1 port 55857 connected with 10.42.49.2 port 5001
> [ ID] Interval Transfer Bandwidth
> [ 3] 0.0-10.3 sec 3.12 MBytes 2.53 Mbits/sec
>
> Prior the regression throughput grows with the MTU size but now it drops
> to a few Mbits if the MTU is bigger then 15000. It is interesting to see
> if 2 or more connections are running in parallel the regression is gone.
>
> # ifconfig hsi0 mtu 32000
> # iperf -c 10.42.49.2 -P2
> ------------------------------------------------------------
> Client connecting to 10.42.49.2, TCP port 5001
> TCP window size: 322 KByte (default)
> ------------------------------------------------------------
> [ 4] local 10.42.49.1 port 55869 connected with 10.42.49.2 port 5001
> [ 3] local 10.42.49.1 port 55868 connected with 10.42.49.2 port 5001
> [ ID] Interval Transfer Bandwidth
> [ 4] 0.0-10.0 sec 2.19 GBytes 1.88 Gbits/sec
> [ ID] Interval Transfer Bandwidth
> [ 3] 0.0-10.0 sec 2.17 GBytes 1.87 Gbits/sec
> [SUM] 0.0-10.0 sec 4.36 GBytes 3.75 Gbits/sec
>
> I bisected the problem to following patch:
>
> commit 46d3ceabd8d98ed0ad10f20c595ca784e34786c5
> Author: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed Jul 11 05:50:31 2012 +0000
>
> tcp: TCP Small Queues
>
> This introduce TSQ (TCP Small Queues)
>
> TSQ goal is to reduce number of TCP packets in xmit queues (qdisc &
> device queues), to reduce RTT and cwnd bias, part of the bufferbloat
> problem.
>
> Changing sysctl net.ipv4.tcp_limit_output_bytes to a higher value
> (e.g. 640000) seems to fix the problem.
>
> How does MTU influence/effects TSQ?
> Why is the problem gone if there are more connections?
> Do you see any drawbacks by increasing net.ipv4.tcp_limit_output_bytes?
> Finally is this expected behavior or is there a bug depending on the big
> MTU? What can I do to check ... ?
>
Hi Frank, thanks for this report.
You could tweak tcp_limit_output_bytes, but IMO the root of the problem
is in the driver itself.
For example, I had to change mlx4 driver for the same problem : Make
sure a TX packet can be "TX completed" in a short amount of time.
In the case of mlx4, the wait time was 128 us, but I suspect on your
case its more like an infinite time or several ms.
The driver is delaying the free of TX skb by a fixed amount of time,
or relies on following transmits to perform the TX completion
Check for an example :
commit ecfd2ce1a9d5e6376ff5c00b366345160abdbbb7
Author: Eric Dumazet <edumazet@google.com>
Date: Mon Nov 5 16:20:42 2012 +0000
mlx4: change TX coalescing defaults
mlx4 currently uses a too high tx coalescing setting, deferring
TX completion interrupts by up to 128 us.
With the recent skb_orphan() removal in commit 8112ec3b872,
performance of a single TCP flow is capped to ~4 Gbps, unless
we increase tcp_limit_output_bytes.
I suggest using 16 us instead of 128 us, allowing a finer control.
Performance of a single TCP flow is restored to previous levels,
while keeping TCP small queues fully enabled with default sysctl.
This patch is also a BQL prereq.
Reported-by: Vimalkumar <j.vimal@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
^ permalink raw reply
* Re: [REGRESSION] r8169: jumbo fixes caused jumbo regressions!
From: Kirill Smelkov @ 2012-11-26 16:19 UTC (permalink / raw)
To: Hayes Wang, Realtek linux nic maintainers
Cc: Francois Romieu, David S. Miller, Greg Kroah-Hartman, netdev
In-Reply-To: <20121114092530.GA22323@tugrik.mns.mnsspb.ru>
On Wed, Nov 14, 2012 at 01:25:30PM +0400, Kirill Smelkov wrote:
> On Tue, Nov 13, 2012 at 11:35:12PM +0100, Francois Romieu wrote:
> > Kirill Smelkov <kirr@mns.spb.ru> :
> > [...]
> > > My test is to stream raw video from 8 PAL cameras to net - 4 for 720x576@25 and
> > > 4 for 360x288@25 which for YUYV format occupies ~ 860 Mbps of bandwidth. The
> > > program to transmit/receive video is here: http://repo.or.cz/w/rawv.git
[...]
> > > (by the way, on atom system, without tx csum offload, half of cpu time
> > > is spent only to calculate checksums...)
> >
> > :o(
>
> yes, that large. In top, my workload is
>
> %sy %id %si
>
> default driver load ~25 ~45 ~27
> (ethtool -k shows
> tx-checksumming: off)
>
> after ~8 ~81 ~11
> `ethtool -K eth0 tx on`
>
>
> that's why the issue is important.
>
>
> > > Now I wonder, where that 6K limit came from and why they say it is now
> > > not possible to use jumbos together with tx csum offload ?
> >
> > Here is an excerpt from a mail where Hayes explained the rules of
> > engagement back in may 2011 (John Lumby and Chris Friesen were Cced then):
>
> Can't find that mail in gmane netdev archive and on google, to restore
> full context. Was that in private?
>
>
> > ! The Max tx sizes for 8168 series are as following:
> > !
> > ! 8168B is 4K bytes.
> > ! 8168C and 8168CP are 6K bytes.
> > ! 8168D and later are 9K bytes.
> > !
> > ! Note that these sizes all include head size. That is, the mtu must less than
> > ! these values.
> > ! You have to enable Jumbo frame feature when the tx size is large, otherwise the
> > ! packet would not be sent. Because the hw doesn't provide the threshold, the
> > ! checking for MTU > 1500 is just for convenience for sw.
>
> This part is clear.
>
>
> > ! The TSO couldn't work with some feature which need to disable hw checksum, such
> > ! as Jumbo frame. The hw checksum have to be disabled in certain situations, so
> > ! the TSO feature should be checked in these situations, too.
>
> I don't enable TSO nor I need it. The text indirectly says that hw
> checksum should be disabled when jumbo frames are used.
[...]
> ~~~~
>
> Hayes, Realtek linux nic maintainers,
>
> could you please confirm that for all 8168C and 8168CP jumbo_max is
> 6K and that when jumbos are used, tx checksumming should be off?
>
> If so, how come my two chips work stable with ~7K jumbos and tx checksum
> offload on (tested this night again for ~16 hours without any problem).
>
> thanks beforehand.
Dear Hayes, Realtek linux nic maintainers,
Two years ago, for current products, I've specifically choosed
motherboard with RTL8111CP, because Linux driver supported large-enough
Jumbo-frames and tx/rx offload.
Now they say that jumbo-frames should be lowered in length and tx
offload is gone, but my nics still work without problems with old ~7K
jumbos and tx checksum offload. To keep current systems working I either
have to choose another hardware, or patch the driver in contrast to what
people say was the info from the manufacturer.
Neither I like to apply risky patches nor change already proved hardware
to something else without a good reason. So please, as Realtek
representatives,
could you please confirm that for all 8168C and 8168CP jumbo_max is
6K and that when jumbos are used, tx checksumming should be off?
Thanks beforehand,
Kirill
P.S. If so, how come my two chips work stable with ~7K jumbos and tx
checksum offload on (last time tested for ~16 hours without any problem)?
^ permalink raw reply
* Re: [PATCH v2] atm: br2684: Fix excessive queue bloat
From: David Woodhouse @ 2012-11-26 16:20 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: netdev, John Crispin, Dave Täht, Chas Williams (CONTRACTOR),
Jesper Brouer
In-Reply-To: <Pine.LNX.4.64.1211261504240.17614@ask.diku.dk>
[-- Attachment #1: Type: text/plain, Size: 582 bytes --]
On Mon, 2012-11-26 at 15:16 +0100, Jesper Dangaard Brouer wrote:
> Nice work David Woodhouse. What OpenWRT supported box have this
> hardware? (I want one so I can play with it ;-))
I'm using a Traverse Geos:
http://www.traverse.com.au/geos11-adsl2-x86-router-appliance
http://www.traverse.com.au/geos21-dual-adsl2-x86-router-appliance
Other boxes with real ADSL (like lantiq boxes) should also work.
Note that I don't actually *use* PPPoEoA (of which this is the EoA
part). I just do PPPoA, which is far more sensible and doesn't have MTU
issues.
--
dwmw2
[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 6171 bytes --]
^ permalink raw reply
* RE: [E1000-devel] 82571EB: Detected Hardware Unit Hang
From: Fujinaka, Todd @ 2012-11-26 16:23 UTC (permalink / raw)
To: Joe Jin, Dave, Tushar N
Cc: netdev@vger.kernel.org, e1000-devel@lists.sf.net,
linux-kernel@vger.kernel.org, Mary Mcgrath
In-Reply-To: <50AB8471.7080607@oracle.com>
On Tue, 20 Nov 2012, Joe Jin wrote:
> On 11/20/12 16:59, Dave, Tushar N wrote:
>> Have you power off the system completely after modifying eeprom? If not please do so.
>
> Hi Tushar,
>
> Seems not works for me, would you please help to check what is wrong of my operations?
...
> # lspci -s 0000:52:00.1 -vvv
> 52:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
> <--snip-->
> Capabilities: [e0] Express (v1) Endpoint, MSI 00
> DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
^^^^^^^^^^^^^^^^^^^^
> ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
> DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
> RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
> MaxPayload 128 bytes, MaxReadReq 4096 bytes
> ^^^^^^^^^^^^^^^^^^^^^
>
<--snip-->
If you look at the previous section, DevCap, you'll see that it's
correctly advertising 256 bytes but the system is negotiating 128 for
the link to the Ethernet controller. Things on the "other" side of the
link are controlled outside of the e1000 driver.
Tushar's first suggestion was to check the PCIe payload settings in the
entire chain. Have you done that? Mismatches will cause hangs.
Todd Fujinaka
Technical Marketing Engineer
LAN Access Division (LAD)
Intel Corporation
todd.fujinaka@intel.com
(503) 712-4565
^ permalink raw reply
* Re: BQL support in gianfar causes network hickup
From: Eric Dumazet @ 2012-11-26 16:34 UTC (permalink / raw)
To: Tino Keitel; +Cc: Paul Gortmaker, netdev, Keitel, Tino (ALC NetworX GmbH)
In-Reply-To: <20121126100111.GA3728@mac.home>
On Mon, 2012-11-26 at 11:01 +0100, Tino Keitel wrote:
> On Sat, Nov 24, 2012 at 15:43:36 -0800, Eric Dumazet wrote:
>
> [...]
>
> > Hmm, I wonder if BQL makes a particular bug showing more often.
> >
> > I see gianfar uses a very small watchdog_timeo of 1 second, while many
> > drivers use 5 seconds.
> >
> > What happens if you change this to 5 seconds ?
>
> I still got the trace and a failing ptp client.
>
Thanks. Is this bug easy to trigger ?
I suspect a core issue and a race, likely to happen on your (non x86)
hardware
Could you add the following debugging patch ?
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index aefc150..a8859ec 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -117,7 +117,7 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
int ret = NETDEV_TX_BUSY;
/* And release qdisc */
- spin_unlock(root_lock);
+// spin_unlock(root_lock);
HARD_TX_LOCK(dev, txq, smp_processor_id());
if (!netif_xmit_frozen_or_stopped(txq))
@@ -125,7 +125,7 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
HARD_TX_UNLOCK(dev, txq);
- spin_lock(root_lock);
+// spin_lock(root_lock);
if (dev_xmit_complete(ret)) {
/* Driver sent out skb successfully or skb was consumed */
^ permalink raw reply related
* Re: BQL support in gianfar causes network hickup
From: Keitel, Tino (ALC NetworX GmbH) @ 2012-11-26 17:08 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Tino Keitel, Paul Gortmaker, netdev@vger.kernel.org
In-Reply-To: <1353947677.7553.2.camel@edumazet-glaptop>
On Mo, 2012-11-26 at 08:34 -0800, Eric Dumazet wrote:
> On Mon, 2012-11-26 at 11:01 +0100, Tino Keitel wrote:
> > On Sat, Nov 24, 2012 at 15:43:36 -0800, Eric Dumazet wrote:
> >
> > [...]
> >
> > > Hmm, I wonder if BQL makes a particular bug showing more often.
> > >
> > > I see gianfar uses a very small watchdog_timeo of 1 second, while many
> > > drivers use 5 seconds.
> > >
> > > What happens if you change this to 5 seconds ?
> >
> > I still got the trace and a failing ptp client.
> >
>
> Thanks. Is this bug easy to trigger ?
>
> I suspect a core issue and a race, likely to happen on your (non x86)
> hardware
>
> Could you add the following debugging patch ?
No visible difference:
NETDEV WATCHDOG: eth1 (fsl-gianfar): transmit queue 0 timed out
------------[ cut here ]------------
WARNING:
at /home/keitelt1/src/git/linux-stable/net/sched/sch_generic.c:255
Modules linked in:
NIP: c02448b0 LR: c02448b0 CTR: c01c19b8
REGS: c7ffbe40 TRAP: 0700 Not tainted (3.7.0-rc6-dirty)
MSR: 00029032 <EE,ME,IR,DR,RI> CR: 22002044 XER: 20000000
TASK = c03dd370[0] 'swapper' THREAD: c03fe000
GPR00: c02448b0 c7ffbef0 c03dd370 0000003f 00000001 c001aea8 00000000
00000001
GPR08: 00000001 c03e0000 00000000 0000009d 22002084 1008eb5c 07ffb000
ffffffff
GPR16: 00000004 c0362c7c c03dfbf8 00200000 c0411ed0 c0411cd0 c0411ad0
ffffffff
GPR24: 00000000 c749e1d8 00000004 c783c1b0 c0400000 c03e0000 c749e000
00000000
NIP [c02448b0] dev_watchdog+0x288/0x298
LR [c02448b0] dev_watchdog+0x288/0x298
Call Trace:
[c7ffbef0] [c02448b0] dev_watchdog+0x288/0x298 (unreliable)
[c7ffbf20] [c00267f8] call_timer_fn+0x6c/0xd8
[c7ffbf50] [c00269e4] run_timer_softirq+0x180/0x1f8
[c7ffbfa0] [c0021144] __do_softirq+0xc4/0x160
[c7ffbff0] [c000d0b8] call_do_softirq+0x14/0x24
[c03ffe00] [c00058e8] do_softirq+0x8c/0xb8
[c03ffe20] [c0021358] irq_exit+0x98/0xb4
[c03ffe30] [c0009fb0] timer_interrupt+0x158/0x170
[c03ffe50] [c000f02c] ret_from_except+0x0/0x14
--- Exception: 901 at _raw_spin_unlock_irq+0x3c/0x78
LR = _raw_spin_unlock_irq+0x2c/0x78
[c03fff20] [c00434c8] finish_task_switch.constprop.69+0x5c/0xdc
[c03fff40] [c02d354c] __schedule+0x1e0/0x410
[c03fff90] [c02d3a78] schedule_preempt_disabled+0x18/0x30
[c03fffa0] [c000898c] cpu_idle+0xfc/0x100
[c03fffc0] [c03b37b0] start_kernel+0x2dc/0x2f0
[c03ffff0] [00003438] 0x3438
Instruction dump:
7d2903a6 4e800421 80fe01fc 4bffff74 7fc3f378 4bfecb7d 7fc4f378 7fe6fb78
7c651b78 3c60c038 38637280 48090e51 <0fe00000> 39200001 993cc7c9
4bffffb8
---[ end trace c170f56a0503cdd2 ]---
Regards,
Tino
^ permalink raw reply
* Re: BQL support in gianfar causes network hickup
From: Eric Dumazet @ 2012-11-26 17:17 UTC (permalink / raw)
To: Keitel, Tino (ALC NetworX GmbH)
Cc: Tino Keitel, Paul Gortmaker, netdev@vger.kernel.org
In-Reply-To: <9AA65D849A88EB44B5D9B6A8BA098E23040A60D6EE6F@Exchange1.lawo.de>
On Mon, 2012-11-26 at 18:08 +0100, Keitel, Tino (ALC NetworX GmbH)
wrote:
> On Mo, 2012-11-26 at 08:34 -0800, Eric Dumazet wrote:
> > On Mon, 2012-11-26 at 11:01 +0100, Tino Keitel wrote:
> > > On Sat, Nov 24, 2012 at 15:43:36 -0800, Eric Dumazet wrote:
> > >
> > > [...]
> > >
> > > > Hmm, I wonder if BQL makes a particular bug showing more often.
> > > >
> > > > I see gianfar uses a very small watchdog_timeo of 1 second, while many
> > > > drivers use 5 seconds.
> > > >
> > > > What happens if you change this to 5 seconds ?
> > >
> > > I still got the trace and a failing ptp client.
> > >
> >
> > Thanks. Is this bug easy to trigger ?
> >
> > I suspect a core issue and a race, likely to happen on your (non x86)
> > hardware
> >
> > Could you add the following debugging patch ?
>
> No visible difference:
OK it seems you trigger the problem fast !
Please try the following as well :
diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
index 19ac096..77190bf 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -101,7 +101,7 @@
#include "gianfar.h"
-#define TX_TIMEOUT (1*HZ)
+#define TX_TIMEOUT (5*HZ)
const char gfar_driver_version[] = "1.3";
@@ -2465,9 +2465,9 @@ static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue)
struct sk_buff *skb;
int skb_dirtytx;
int tx_ring_size = tx_queue->tx_ring_size;
- int frags = 0, nr_txbds = 0;
+ int frags, nr_txbds;
int i;
- int howmany = 0;
+ int howmany = 0, total_txbds = 0;
int tqi = tx_queue->qindex;
unsigned int bytes_sent = 0;
u32 lstatus;
@@ -2479,7 +2479,6 @@ static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue)
skb_dirtytx = tx_queue->skb_dirtytx;
while ((skb = tx_queue->tx_skbuff[skb_dirtytx])) {
- unsigned long flags;
frags = skb_shinfo(skb)->nr_frags;
@@ -2541,20 +2540,24 @@ static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue)
TX_RING_MOD_MASK(tx_ring_size);
howmany++;
- spin_lock_irqsave(&tx_queue->txlock, flags);
- tx_queue->num_txbdfree += nr_txbds;
- spin_unlock_irqrestore(&tx_queue->txlock, flags);
+ total_txbds += nr_txbds;
}
- /* If we freed a buffer, we can restart transmission, if necessary */
- if (netif_tx_queue_stopped(txq) && tx_queue->num_txbdfree)
- netif_wake_subqueue(dev, tqi);
+ if (howmany) {
+ unsigned long flags;
+ spin_lock_irqsave(&tx_queue->txlock, flags);
+ tx_queue->num_txbdfree += total_txbds;
+ /* If we freed a buffer, we can restart transmission, if necessary */
+ if (netif_tx_queue_stopped(txq))
+ netif_tx_wake_queue(txq);
+ netdev_tx_completed_queue(txq, howmany, bytes_sent);
+ spin_unlock_irqrestore(&tx_queue->txlock, flags);
+ }
/* Update dirty indicators */
tx_queue->skb_dirtytx = skb_dirtytx;
tx_queue->dirty_tx = bdp;
- netdev_tx_completed_queue(txq, howmany, bytes_sent);
return howmany;
}
^ permalink raw reply related
* Re: [Pv-drivers] [PATCH 086/493] net: remove use of __devexit_p
From: Dmitry Torokhov @ 2012-11-26 17:35 UTC (permalink / raw)
To: Bill Pemberton
Cc: xen-devel, Michael S. Tsirkin, VMware, Inc., gregkh,
virtualization, Francois Romieu, netdev
In-Reply-To: <1353349642-3677-86-git-send-email-wfp5p@virginia.edu>
On Mon, Nov 19, 2012 at 01:20:35PM -0500, Bill Pemberton wrote:
> CONFIG_HOTPLUG is going away as an option so __devexit_p is no longer
> needed.
>
> Signed-off-by: Bill Pemberton <wfp5p@virginia.edu>
> Cc: Rusty Russell <rusty@rustcorp.com.au>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Shreyas Bhatewara <sbhatewara@vmware.com>
> Cc: "VMware, Inc." <pv-drivers@vmware.com>
> Cc: Francois Romieu <romieu@fr.zoreil.com>
> Cc: virtualization@lists.linux-foundation.org
> Cc: netdev@vger.kernel.org
> Cc: xen-devel@lists.xensource.com
> ---
> drivers/net/virtio_net.c | 2 +-
> drivers/net/vmxnet3/vmxnet3_drv.c | 2 +-
For VMXNET3:
Acked-by: Dmitry Torokhov <dtor@vmware.com>
Thanks,
Dmitry
^ permalink raw reply
* Re: [net-next RFC] pktgen: don't wait for the device who doesn't free skb immediately after sent
From: Stephen Hemminger @ 2012-11-26 17:37 UTC (permalink / raw)
To: Jason Wang; +Cc: mst, netdev, linux-kernel, virtualization, davem
In-Reply-To: <1353916612-6357-1-git-send-email-jasowang@redhat.com>
On Mon, 26 Nov 2012 15:56:52 +0800
Jason Wang <jasowang@redhat.com> wrote:
> Some deivces do not free the old tx skbs immediately after it has been sent
> (usually in tx interrupt). One such example is virtio-net which optimizes for
> virt and only free the possible old tx skbs during the next packet sending. This
> would lead the pktgen to wait forever in the refcount of the skb if no other
> pakcet will be sent afterwards.
>
> Solving this issue by introducing a new flag IFF_TX_SKB_FREE_DELAY which could
> notify the pktgen that the device does not free skb immediately after it has
> been sent and let it not to wait for the refcount to be one.
>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
Another alternative would be using skb_orphan() and skb->destructor.
There are other cases where skb's are not freed right away.
^ permalink raw reply
* Re: [Pv-drivers] [PATCH 203/493] net: remove use of __devinit
From: Dmitry Torokhov @ 2012-11-26 17:38 UTC (permalink / raw)
To: Bill Pemberton
Cc: xen-devel, VMware, Inc., gregkh, Francois Romieu, virtualization,
linux-hippi, Maciej W. Rozycki, netdev, Krzysztof Halasa
In-Reply-To: <1353349642-3677-203-git-send-email-wfp5p@virginia.edu>
On Mon, Nov 19, 2012 at 01:22:32PM -0500, Bill Pemberton wrote:
> CONFIG_HOTPLUG is going away as an option so __devinit is no longer
> needed.
...
> drivers/net/vmxnet3/vmxnet3_drv.c | 2 +-
For VMXNET3:
Acked-by: Dmitry Torokhov <dtor@vmware.com>
Thanks,
Dmitry
^ permalink raw reply
* Re: [Pv-drivers] [PATCH 471/493] net: remove use of __devexit
From: Dmitry Torokhov @ 2012-11-26 17:39 UTC (permalink / raw)
To: Bill Pemberton
Cc: Jiri Slaby, Michael S. Tsirkin, VMware, Inc., virtualization,
Maciej W. Rozycki, Christian Lamparter, Hin-Tak Leung,
Luciano Coelho, Stefano Brivio, libertas-dev, Dan Williams,
ath5k-devel, Luis R. Rodriguez, linux-hippi, Nick Kossifidis,
Francois Romieu, Herton Ronaldo Krzesinski, Stanislav Yakovlev,
b43-dev, Samuel Ortiz, John W. Linville, linux-can, xen-devel,
Marc
In-Reply-To: <1353349642-3677-471-git-send-email-wfp5p@virginia.edu>
On Mon, Nov 19, 2012 at 01:27:00PM -0500, Bill Pemberton wrote:
> CONFIG_HOTPLUG is going away as an option so __devexit is no
> longer needed.
...
> drivers/net/vmxnet3/vmxnet3_drv.c | 2 +-
For VMXNET3:
Acked-by: Dmitry Torokhov <dtor@vmware.com>
Thanks,
Dmitry
^ permalink raw reply
* Re: [PATCH] bonding: in balance-rr mode, set curr_active_slave only if it is up
From: Jay Vosburgh @ 2012-11-26 18:36 UTC (permalink / raw)
To: Michal Kubecek; +Cc: netdev, Andy Gospodarek, linux-kernel
In-Reply-To: <20121122125202.14252C35B8@unicorn.suse.cz>
Michal Kubecek <mkubecek@suse.cz> wrote:
>If all slaves of a balance-rr bond with ARP monitor are enslaved
>with down link state, bond keeps down state even after slaves
>go up.
>
>This is caused by bond_enslave() setting curr_active_slave to
>first slave not taking into account its link state. As
>bond_loadbalance_arp_mon() uses curr_active_slave to identify
>whether slave's down->up transition should update bond's link
>state, bond stays down even if slaves are up (until first slave
>goes from up to down at least once).
The bond_loadbalance_arp_mon function actually uses
curr_active_slave to determine whether or not to do a "failover" (select
a new active slave), which in turn will call bond_set_carrier() from
within bond_select_active_slave().
Other than that nitpick about the description, I see how setting
curr_active_slave to a down slave would cause loadbalance_arp_mon to
skip the "failover" step (because it presumes that an active slave is
always up, and therefore no new one needs to be selected), and thus skip
setting the master's carrier state.
-J
>Before commit f31c7937 "bonding: start slaves with link down for
>ARP monitor", this was masked by slaves always starting in UP
>state with ARP monitor (and MII monitor not relying on
>curr_active_slave being NULL if there is no slave up).
>
>Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
> drivers/net/bonding/bond_main.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 5f5b69f..c8bff3e 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -1838,7 +1838,7 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
> * anyway (it holds no special properties of the bond device),
> * so we can change it without calling change_active_interface()
> */
>- if (!bond->curr_active_slave)
>+ if (!bond->curr_active_slave && new_slave->link == BOND_LINK_UP)
> bond->curr_active_slave = new_slave;
>
> break;
>--
>1.7.10.4
>
---
-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox