* Re: kernel BUG at kernel/timer.c:748!
From: Eric Dumazet @ 2012-09-19 22:01 UTC (permalink / raw)
To: Dave Jones; +Cc: Yuchung Cheng, Julian Anastasov, netdev
In-Reply-To: <20120919211059.GA10985@redhat.com>
On Wed, 2012-09-19 at 17:10 -0400, Dave Jones wrote:
> On Sat, Sep 15, 2012 at 11:16:52AM -0700, Yuchung Cheng wrote:
> > On Fri, Sep 14, 2012 at 2:29 PM, Dave Jones <davej@redhat.com> wrote:
> > > On Wed, Sep 05, 2012 at 11:48:29PM +0300, Julian Anastasov wrote:
> > >
> > > > > kernel BUG at kernel/timer.c:748!
> > > > > Call Trace:
> > > > > ? lock_sock_nested+0x8d/0xa0
> > > > > sk_reset_timer+0x1c/0x30
> > > > > ? sock_setsockopt+0x8c/0x960
> > > > > inet_csk_reset_keepalive_timer+0x20/0x30
> > > > > tcp_set_keepalive+0x3d/0x50
> > > > > sock_setsockopt+0x923/0x960
> > > > > ? trace_hardirqs_on_caller+0x16/0x1e0
> > > > > ? fget_light+0x24c/0x520
> > > > > sys_setsockopt+0xc6/0xe0
> > > > > system_call_fastpath+0x1a/0x1f
> > > >
> > > > Can this help? In case you see ICMPV6_PKT_TOOBIG...
> > > >
> > > > [PATCH] tcp: fix possible socket refcount problem for ipv6
> > >
> > > I just managed to reproduce this bug on rc5 with this patch,
> > > so it doesn't seem to help.
> > Could you post some tcpdump traces?
>
> It's likely that there aren't any packets. The fuzzer isn't smart
> enough (yet) to do anything too clever to the sockets it creates.
>
> More likely is that this is some race where thread A is doing a setsockopt
> while thread B is doing a tear-down of the same socket.
I spent some time trying to track this bug, but found nothing so far.
The timer->function are never cleared by TCP stack at tear down, and
should be set before fd is installed and can be caught by other threads.
Most likely its a refcounting issue...
Following debugging patch might trigger a bug sooner ?
diff --git a/include/net/sock.h b/include/net/sock.h
index 84bdaec..5d3ad5b 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -511,18 +511,18 @@ static inline void sock_hold(struct sock *sk)
*/
static inline void __sock_put(struct sock *sk)
{
- atomic_dec(&sk->sk_refcnt);
+ int newcnt = atomic_dec_return(&sk->sk_refcnt);
+
+ WARN_ON(newcnt <= 0);
}
static inline bool sk_del_node_init(struct sock *sk)
{
bool rc = __sk_del_node_init(sk);
- if (rc) {
- /* paranoid for a while -acme */
- WARN_ON(atomic_read(&sk->sk_refcnt) == 1);
+ if (rc)
__sock_put(sk);
- }
+
return rc;
}
#define sk_del_node_init_rcu(sk) sk_del_node_init(sk)
@@ -1620,7 +1620,10 @@ static inline void sk_filter_charge(struct sock *sk, struct sk_filter *fp)
/* Ungrab socket and destroy it, if it was the last reference. */
static inline void sock_put(struct sock *sk)
{
- if (atomic_dec_and_test(&sk->sk_refcnt))
+ int newcnt = atomic_dec_return(&sk->sk_refcnt);
+
+ WARN_ON(newcnt < 0);
+ if (newcnt == 0)
sk_free(sk);
}
^ permalink raw reply related
* [PATCH 2/6] xfrm_user: fix info leak in copy_to_user_state()
From: Mathias Krause @ 2012-09-19 21:33 UTC (permalink / raw)
To: David S. Miller; +Cc: Steffen Klassert, netdev, linux-kernel, Mathias Krause
In-Reply-To: <1348090423-32665-1-git-send-email-minipli@googlemail.com>
The memory reserved to dump the xfrm state includes the padding bytes of
struct xfrm_usersa_info added by the compiler for alignment (7 for
amd64, 3 for i386). Add an explicit memset(0) before filling the buffer
to avoid the info leak.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
---
net/xfrm/xfrm_user.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index ba26423..c1f15b5 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -689,6 +689,7 @@ out:
static void copy_to_user_state(struct xfrm_state *x, struct xfrm_usersa_info *p)
{
+ memset(p, 0, sizeof(*p));
memcpy(&p->id, &x->id, sizeof(p->id));
memcpy(&p->sel, &x->sel, sizeof(p->sel));
memcpy(&p->lft, &x->lft, sizeof(p->lft));
--
1.7.10.4
^ permalink raw reply related
* [PATCH 6/6] xfrm_user: don't copy esn replay window twice for new states
From: Mathias Krause @ 2012-09-19 21:33 UTC (permalink / raw)
To: David S. Miller; +Cc: Steffen Klassert, netdev, linux-kernel, Mathias Krause
In-Reply-To: <1348090423-32665-1-git-send-email-minipli@googlemail.com>
The ESN replay window was already fully initialized in
xfrm_alloc_replay_state_esn(). No need to copy it again.
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Mathias Krause <minipli@googlemail.com>
---
net/xfrm/xfrm_user.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 7fd92b8..fa072de 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -457,10 +457,11 @@ static void copy_from_user_state(struct xfrm_state *x, struct xfrm_usersa_info *
* somehow made shareable and move it to xfrm_state.c - JHS
*
*/
-static void xfrm_update_ae_params(struct xfrm_state *x, struct nlattr **attrs)
+static void xfrm_update_ae_params(struct xfrm_state *x, struct nlattr **attrs,
+ int update_esn)
{
struct nlattr *rp = attrs[XFRMA_REPLAY_VAL];
- struct nlattr *re = attrs[XFRMA_REPLAY_ESN_VAL];
+ struct nlattr *re = update_esn ? attrs[XFRMA_REPLAY_ESN_VAL] : NULL;
struct nlattr *lt = attrs[XFRMA_LTIME_VAL];
struct nlattr *et = attrs[XFRMA_ETIMER_THRESH];
struct nlattr *rt = attrs[XFRMA_REPLAY_THRESH];
@@ -570,7 +571,7 @@ static struct xfrm_state *xfrm_state_construct(struct net *net,
goto error;
/* override default values from above */
- xfrm_update_ae_params(x, attrs);
+ xfrm_update_ae_params(x, attrs, 0);
return x;
@@ -1840,7 +1841,7 @@ static int xfrm_new_ae(struct sk_buff *skb, struct nlmsghdr *nlh,
goto out;
spin_lock_bh(&x->lock);
- xfrm_update_ae_params(x, attrs);
+ xfrm_update_ae_params(x, attrs, 1);
spin_unlock_bh(&x->lock);
c.event = nlh->nlmsg_type;
--
1.7.10.4
^ permalink raw reply related
* [PATCH 5/6] xfrm_user: ensure user supplied esn replay window is valid
From: Mathias Krause @ 2012-09-19 21:33 UTC (permalink / raw)
To: David S. Miller
Cc: Steffen Klassert, netdev, linux-kernel, Mathias Krause,
Martin Willi
In-Reply-To: <1348090423-32665-1-git-send-email-minipli@googlemail.com>
The current code fails to ensure that the netlink message actually
contains as many bytes as the header indicates. If a user creates a new
state or updates an existing one but does not supply the bytes for the
whole ESN replay window, the kernel copies random heap bytes into the
replay bitmap, the ones happen to follow the XFRMA_REPLAY_ESN_VAL
netlink attribute. This leads to following issues:
1. The replay window has random bits set confusing the replay handling
code later on.
2. A malicious user could use this flaw to leak up to ~3.5kB of heap
memory when she has access to the XFRM netlink interface (requires
CAP_NET_ADMIN).
Known users of the ESN replay window are strongSwan and Steffen's
iproute2 patch (<http://patchwork.ozlabs.org/patch/85962/>). The latter
uses the interface with a bitmap supplied while the former does not.
strongSwan is therefore prone to run into issue 1.
To fix both issues without breaking existing userland allow using the
XFRMA_REPLAY_ESN_VAL netlink attribute with either an empty bitmap or a
fully specified one. For the former case we initialize the in-kernel
bitmap with zero, for the latter we copy the user supplied bitmap.
For state updates the full bitmap must be supplied.
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Martin Willi <martin@revosec.ch>
Signed-off-by: Mathias Krause <minipli@googlemail.com>
---
net/xfrm/xfrm_user.c | 27 +++++++++++++++++++++------
1 file changed, 21 insertions(+), 6 deletions(-)
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 9f1e749..7fd92b8 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -123,9 +123,17 @@ static inline int verify_replay(struct xfrm_usersa_info *p,
struct nlattr **attrs)
{
struct nlattr *rt = attrs[XFRMA_REPLAY_ESN_VAL];
+ struct xfrm_replay_state_esn *rs;
- if ((p->flags & XFRM_STATE_ESN) && !rt)
- return -EINVAL;
+ if (p->flags & XFRM_STATE_ESN) {
+ if (!rt)
+ return -EINVAL;
+
+ rs = nla_data(rt);
+ if (nla_len(rt) < xfrm_replay_state_esn_len(rs) &&
+ nla_len(rt) != sizeof(*rs))
+ return -EINVAL;
+ }
if (!rt)
return 0;
@@ -370,14 +378,15 @@ static inline int xfrm_replay_verify_len(struct xfrm_replay_state_esn *replay_es
struct nlattr *rp)
{
struct xfrm_replay_state_esn *up;
+ size_t ulen;
if (!replay_esn || !rp)
return 0;
up = nla_data(rp);
+ ulen = xfrm_replay_state_esn_len(up);
- if (xfrm_replay_state_esn_len(replay_esn) !=
- xfrm_replay_state_esn_len(up))
+ if (nla_len(rp) < ulen || xfrm_replay_state_esn_len(replay_esn) != ulen)
return -EINVAL;
return 0;
@@ -388,22 +397,28 @@ static int xfrm_alloc_replay_state_esn(struct xfrm_replay_state_esn **replay_esn
struct nlattr *rta)
{
struct xfrm_replay_state_esn *p, *pp, *up;
+ size_t klen, ulen;
if (!rta)
return 0;
up = nla_data(rta);
+ klen = xfrm_replay_state_esn_len(up);
+ ulen = nla_len(rta) > sizeof(*up) ? klen : sizeof(*up);
- p = kmemdup(up, xfrm_replay_state_esn_len(up), GFP_KERNEL);
+ p = kzalloc(klen, GFP_KERNEL);
if (!p)
return -ENOMEM;
- pp = kmemdup(up, xfrm_replay_state_esn_len(up), GFP_KERNEL);
+ pp = kzalloc(klen, GFP_KERNEL);
if (!pp) {
kfree(p);
return -ENOMEM;
}
+ memcpy(p, up, ulen);
+ memcpy(pp, up, ulen);
+
*replay_esn = p;
*preplay_esn = pp;
--
1.7.10.4
^ permalink raw reply related
* [PATCH 4/6] xfrm_user: fix info leak in copy_to_user_tmpl()
From: Mathias Krause @ 2012-09-19 21:33 UTC (permalink / raw)
To: David S. Miller
Cc: Steffen Klassert, netdev, linux-kernel, Mathias Krause,
Brad Spengler
In-Reply-To: <1348090423-32665-1-git-send-email-minipli@googlemail.com>
The memory used for the template copy is a local stack variable. As
struct xfrm_user_tmpl contains multiple holes added by the compiler for
alignment, not initializing the memory will lead to leaking stack bytes
to userland. Add an explicit memset(0) to avoid the info leak.
Initial version of the patch by Brad Spengler.
Cc: Brad Spengler <spender@grsecurity.net>
Signed-off-by: Mathias Krause <minipli@googlemail.com>
---
net/xfrm/xfrm_user.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 7511427..9f1e749 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -1423,6 +1423,7 @@ static int copy_to_user_tmpl(struct xfrm_policy *xp, struct sk_buff *skb)
struct xfrm_user_tmpl *up = &vec[i];
struct xfrm_tmpl *kp = &xp->xfrm_vec[i];
+ memset(up, 0, sizeof(*up));
memcpy(&up->id, &kp->id, sizeof(up->id));
up->family = kp->encap_family;
memcpy(&up->saddr, &kp->saddr, sizeof(up->saddr));
--
1.7.10.4
^ permalink raw reply related
* [PATCH 3/6] xfrm_user: fix info leak in copy_to_user_policy()
From: Mathias Krause @ 2012-09-19 21:33 UTC (permalink / raw)
To: David S. Miller; +Cc: Steffen Klassert, netdev, linux-kernel, Mathias Krause
In-Reply-To: <1348090423-32665-1-git-send-email-minipli@googlemail.com>
The memory reserved to dump the xfrm policy includes multiple padding
bytes added by the compiler for alignment (padding bytes in struct
xfrm_selector and struct xfrm_userpolicy_info). Add an explicit
memset(0) before filling the buffer to avoid the heap info leak.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
---
net/xfrm/xfrm_user.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index c1f15b5..7511427 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -1318,6 +1318,7 @@ static void copy_from_user_policy(struct xfrm_policy *xp, struct xfrm_userpolicy
static void copy_to_user_policy(struct xfrm_policy *xp, struct xfrm_userpolicy_info *p, int dir)
{
+ memset(p, 0, sizeof(*p));
memcpy(&p->sel, &xp->selector, sizeof(p->sel));
memcpy(&p->lft, &xp->lft, sizeof(p->lft));
memcpy(&p->curlft, &xp->curlft, sizeof(p->curlft));
--
1.7.10.4
^ permalink raw reply related
* [PATCH 1/6] xfrm_user: fix info leak in copy_to_user_auth()
From: Mathias Krause @ 2012-09-19 21:33 UTC (permalink / raw)
To: David S. Miller; +Cc: Steffen Klassert, netdev, linux-kernel, Mathias Krause
In-Reply-To: <1348090423-32665-1-git-send-email-minipli@googlemail.com>
copy_to_user_auth() fails to initialize the remainder of alg_name and
therefore discloses up to 54 bytes of heap memory via netlink to
userland.
Use strncpy() instead of strcpy() to fill the trailing bytes of alg_name
with null bytes.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
---
net/xfrm/xfrm_user.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index e75d8e4..ba26423 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -742,7 +742,7 @@ static int copy_to_user_auth(struct xfrm_algo_auth *auth, struct sk_buff *skb)
return -EMSGSIZE;
algo = nla_data(nla);
- strcpy(algo->alg_name, auth->alg_name);
+ strncpy(algo->alg_name, auth->alg_name, sizeof(algo->alg_name));
memcpy(algo->alg_key, auth->alg_key, (auth->alg_key_len + 7) / 8);
algo->alg_key_len = auth->alg_key_len;
--
1.7.10.4
^ permalink raw reply related
* [PATCH 0/6] xfrm_user info leaks
From: Mathias Krause @ 2012-09-19 21:33 UTC (permalink / raw)
To: David S. Miller; +Cc: Steffen Klassert, netdev, linux-kernel, Mathias Krause
Hi David,
the following series fixes various info leaks in the xfrm netlink
interface. As always, a test case can be supplied on request.
Patches 1 to 5 are probably material for stable, too. Patch 6 is just a
minor optimization I stumbled across while auditing the code.
Please apply!
P.S.: What's still missing IMHO, is some upper limit for the ESN replay
window size to prevent malicious users from making the kernel allocate
huge states. One could not retrieve those via netlink anyway (as the
current code allocates just a single page for the state dump). Maybe 4k
is a sane default upper limit for the replay window size. But I'll leave
implementing this to someone else. ;)
Mathias Krause (6):
xfrm_user: fix info leak in copy_to_user_auth()
xfrm_user: fix info leak in copy_to_user_state()
xfrm_user: fix info leak in copy_to_user_policy()
xfrm_user: fix info leak in copy_to_user_tmpl()
xfrm_user: ensure user supplied esn replay window is valid
xfrm_user: don't copy esn replay window twice for new states
net/xfrm/xfrm_user.c | 41 ++++++++++++++++++++++++++++++-----------
1 file changed, 30 insertions(+), 11 deletions(-)
--
1.7.10.4
^ permalink raw reply
* Re: [PATCH] net/core: fix comment in skb_try_coalesce
From: David Miller @ 2012-09-19 21:29 UTC (permalink / raw)
To: roy.qing.li; +Cc: netdev
In-Reply-To: <1348023201-7727-1-git-send-email-roy.qing.li@gmail.com>
From: roy.qing.li@gmail.com
Date: Wed, 19 Sep 2012 10:53:21 +0800
> From: Li RongQing <roy.qing.li@gmail.com>
>
> It should be the skb which is not cloned
>
> Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Applied, thanks.
^ permalink raw reply
* Re: [PATCH v4 net-next 0/4] ipv6: fix the reassembly expire code in nf_conntrack
From: David Miller @ 2012-09-19 21:28 UTC (permalink / raw)
To: amwang; +Cc: netdev, netfilter-devel, herbert, pablo
In-Reply-To: <1348023011-16195-1-git-send-email-amwang@redhat.com>
From: Cong Wang <amwang@redhat.com>
Date: Wed, 19 Sep 2012 10:50:07 +0800
> V4: some coding style fix
>
> V3: rename struct netns_nf_ct to struct netns_nf_frag
>
> V2: use IS_ENABLED(CONFIG_IPV6) to fix a build error
> rebase to latest net-next
>
> ipv6: add a new namespace for nf_conntrack_reasm
> ipv6: unify conntrack reassembly expire code with standard one
> ipv6: make ip6_frag_nqueues() and ip6_frag_mem() static
> ipv6: unify fragment thresh handling code
>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Pablo Neira Ayuso <pablo@netfilter.org>
> Signed-off-by: Cong Wang <amwang@redhat.com>
All applied to net-next, thanks.
^ permalink raw reply
* Re: [PATCH] asix: Support DLink DUB-E100 H/W Ver C1
From: David Miller @ 2012-09-19 21:21 UTC (permalink / raw)
To: sgh; +Cc: netdev, stable
In-Reply-To: <1347954657-8975-1-git-send-email-sgh@sgh.dk>
From: Søren Holm <sgh@sgh.dk>
Date: Tue, 18 Sep 2012 09:50:57 +0200
> Signed-off-by: Søren Holm <sgh@sgh.dk>
Applied, thanks.
^ permalink raw reply
* Re: [PATCH net-next] mlx4: use dev_kfree_skb() instead of dev_kfree_skb_any()
From: Ying Cai @ 2012-09-19 21:21 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Yevgeny Petrilin, David Miller, netdev, Or Gerlitz
In-Reply-To: <1348056777.26523.750.camel@edumazet-glaptop>
Hi Yevgeny,
It seems all the TxQs are sharing the same interrupt for Tx
completions. Will it be better to have separate interrupt per
num_tx_rings_p_up (8) queues? E.g. for a 16 core system, with 16 * 8
Tx queues, to have 16 interrupts for Tx completions of those 128 Tx
queues?
Also I'm looking at mlx4_en_select_queue(), it is using
__skb_tx_hash(). Use something to achieve XPS may bring better
performances.
Thanks.
On Wed, Sep 19, 2012 at 5:12 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> On Wed, 2012-09-19 at 07:58 +0000, Yevgeny Petrilin wrote:
> > >
> > > Since commit e22979d96a5 (mlx4_en: Moving to Interrupts for TX
> > > completions), we no longer can free TX skb from hard IRQ, but only
> > > from
> > > normal softirq or process context.
> > >
> > > Therefore, we can directly call dev_kfree_skb() from
> > > mlx4_en_free_tx_desc() like other conventional NAPI drivers.
> > >
> >
> > Hi Eric,
> > At the moment the TX completion processing is done from IRQ context.
> > So I think we need to change the driver to work with NAPI for TX
> > completions
> > before making this change.
> >
> > I'll send the patch in a few days.
>
> Oops you're right, it seems I misread e22979d96 commit.
>
> irq term is a bit generic, you might add soft/hard qualifiers to
> distinguish the variant.
>
> Thanks
>
>
^ permalink raw reply
* Re: [Patch net-next] netpoll: call ->ndo_select_queue() in tx path
From: David Miller @ 2012-09-19 21:21 UTC (permalink / raw)
To: amwang; +Cc: netdev, s.munaut, edumazet
In-Reply-To: <1347948991-20860-1-git-send-email-amwang@redhat.com>
From: Cong Wang <amwang@redhat.com>
Date: Tue, 18 Sep 2012 14:16:31 +0800
> In netpoll tx path, we miss the chance of calling ->ndo_select_queue(),
> thus could cause problems when bonding is involved.
>
> This patch makes dev_pick_tx() extern (and rename it to netdev_pick_tx())
> to let netpoll call it in netpoll_send_skb_on_dev().
>
> Reported-by: Sylvain Munaut <s.munaut@whatever-company.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Cong Wang <amwang@redhat.com>
> Tested-by: Sylvain Munaut <s.munaut@whatever-company.com>
Applied, thanks.
^ permalink raw reply
* Re: [PATCH v3] net-tcp: TCP/IP stack bypass for loopback connections
From: Bruce Curtis @ 2012-09-19 21:19 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, edumazet, netdev
In-Reply-To: <1348088634.31352.26.camel@edumazet-glaptop>
On Wed, Sep 19, 2012 at 2:03 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> On Wed, 2012-09-19 at 16:34 -0400, David Miller wrote:
>
> > I have an idea on how to handle this.
> >
> > In drivers/net/loopback.c:loopback_tx(), skip the SKB orphan operation
> > if there is a friend socket at skb->friend.
> >
> > When sending such friend SKBs out at connection startup, arrange it
> > such that the skb->destructor will zap the skb->friend pointer to
> > NULL.
> >
> > Also, in skb_orphan*(), if necessary, set skb->friend to NULL.
> >
> > skb->sk will hold a reference to the socket, and since skb->friend
> > will be equal, this will make sure a pointer to an unreferenced
> > socket does not escape.
>
> I now am wondering if we still need skb->friend field.
>
> If skb->sk is not zeroed by a premature skb_orphan(), then
>
> skb->sk->sk_friend gives the friend ?
>
>
Good question, in the friends data path it's only used to
differentiate a friend directly Q'd skb so should work, I'll look into
it.
^ permalink raw reply
* Re: [PATCH v3] net-tcp: TCP/IP stack bypass for loopback connections
From: David Miller @ 2012-09-19 21:17 UTC (permalink / raw)
To: eric.dumazet; +Cc: brutus, edumazet, netdev
In-Reply-To: <1348088634.31352.26.camel@edumazet-glaptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 19 Sep 2012 23:03:54 +0200
> On Wed, 2012-09-19 at 16:34 -0400, David Miller wrote:
>
>> I have an idea on how to handle this.
>>
>> In drivers/net/loopback.c:loopback_tx(), skip the SKB orphan operation
>> if there is a friend socket at skb->friend.
>>
>> When sending such friend SKBs out at connection startup, arrange it
>> such that the skb->destructor will zap the skb->friend pointer to
>> NULL.
>>
>> Also, in skb_orphan*(), if necessary, set skb->friend to NULL.
>>
>> skb->sk will hold a reference to the socket, and since skb->friend
>> will be equal, this will make sure a pointer to an unreferenced
>> socket does not escape.
>
> I now am wondering if we still need skb->friend field.
>
> If skb->sk is not zeroed by a premature skb_orphan(), then
>
> skb->sk->sk_friend gives the friend ?
Good point.
^ permalink raw reply
* Re: kernel BUG at kernel/timer.c:748!
From: Dave Jones @ 2012-09-19 21:10 UTC (permalink / raw)
To: Yuchung Cheng; +Cc: Julian Anastasov, netdev
In-Reply-To: <CAK6E8=cAOTqOzrzWCG=_wcQkWKS4bKh1HJ0UAH2vJNqvDPh+GQ@mail.gmail.com>
On Sat, Sep 15, 2012 at 11:16:52AM -0700, Yuchung Cheng wrote:
> On Fri, Sep 14, 2012 at 2:29 PM, Dave Jones <davej@redhat.com> wrote:
> > On Wed, Sep 05, 2012 at 11:48:29PM +0300, Julian Anastasov wrote:
> >
> > > > kernel BUG at kernel/timer.c:748!
> > > > Call Trace:
> > > > ? lock_sock_nested+0x8d/0xa0
> > > > sk_reset_timer+0x1c/0x30
> > > > ? sock_setsockopt+0x8c/0x960
> > > > inet_csk_reset_keepalive_timer+0x20/0x30
> > > > tcp_set_keepalive+0x3d/0x50
> > > > sock_setsockopt+0x923/0x960
> > > > ? trace_hardirqs_on_caller+0x16/0x1e0
> > > > ? fget_light+0x24c/0x520
> > > > sys_setsockopt+0xc6/0xe0
> > > > system_call_fastpath+0x1a/0x1f
> > >
> > > Can this help? In case you see ICMPV6_PKT_TOOBIG...
> > >
> > > [PATCH] tcp: fix possible socket refcount problem for ipv6
> >
> > I just managed to reproduce this bug on rc5 with this patch,
> > so it doesn't seem to help.
> Could you post some tcpdump traces?
It's likely that there aren't any packets. The fuzzer isn't smart
enough (yet) to do anything too clever to the sockets it creates.
More likely is that this is some race where thread A is doing a setsockopt
while thread B is doing a tear-down of the same socket.
Dave
^ permalink raw reply
* Re: [PATCH v3] net-tcp: TCP/IP stack bypass for loopback connections
From: Eric Dumazet @ 2012-09-19 21:03 UTC (permalink / raw)
To: David Miller; +Cc: brutus, edumazet, netdev
In-Reply-To: <20120919.163406.487082174277409074.davem@davemloft.net>
On Wed, 2012-09-19 at 16:34 -0400, David Miller wrote:
> I have an idea on how to handle this.
>
> In drivers/net/loopback.c:loopback_tx(), skip the SKB orphan operation
> if there is a friend socket at skb->friend.
>
> When sending such friend SKBs out at connection startup, arrange it
> such that the skb->destructor will zap the skb->friend pointer to
> NULL.
>
> Also, in skb_orphan*(), if necessary, set skb->friend to NULL.
>
> skb->sk will hold a reference to the socket, and since skb->friend
> will be equal, this will make sure a pointer to an unreferenced
> socket does not escape.
I now am wondering if we still need skb->friend field.
If skb->sk is not zeroed by a premature skb_orphan(), then
skb->sk->sk_friend gives the friend ?
^ permalink raw reply
* Re: Netfilter lacks ability to filter packets via Application-origin
From: John Fastabend @ 2012-09-19 20:50 UTC (permalink / raw)
To: Chad Gray; +Cc: Ben Hutchings, netdev@vger.kernel.org
In-Reply-To: <1348086252.2636.58.camel@bwh-desktop.uk.solarflarecom.com>
On 9/19/2012 1:24 PM, Ben Hutchings wrote:
> On Wed, 2012-09-19 at 15:40 -0400, Chad Gray wrote:
>> Users need the ability for Linux firewall to filter packets based on what
>> Application they are originating from. This ability is present in Mac and
>> Windows firewalls, but not Linux.
>>
>> For example, users would like ability to open Port 80 for Firefox, but keep
>> Port 80 closed for other applications.
>>
>> This ability enhances Privacy & Security of the user but also helps to better
>> inform the user about the comings and goings of internet traffic and what
>> application/s are causing the traffic.
>
> Most of the Linux Security Modules seem to support this sort of network
> policy.
>
> Ben.
>
Another approach might be to use the net_cls cgroups and set the
classid matching against it with tc or netfilters.
.John
^ permalink raw reply
* Re: Opportunity to backport pch_gbe patches to stable 3.2.x
From: Ben Hutchings @ 2012-09-19 20:49 UTC (permalink / raw)
To: Erwan Velu; +Cc: netdev
In-Reply-To: <505A11E2.8000403@gmail.com>
On Wed, Sep 19, 2012 at 08:41:38PM +0200, Erwan Velu wrote:
> Hi there,
>
> I'm using on a daily use and in production 3.2.x kernels (.29 currently) on a Kontron Nano-etx board that features the pch_gbe network driver.
>
> I've been experiencing lots of troubles on this kernel series but since I've backported the following commits back to my 3.2.x, it works far better.
>
> So from the history of next the following commits apply perfectly and runs very well.
>
> I do think that other users of pch_gbe will appreciate this.
>
> 1a0bdadb4e36abac63b0a9787f372aac30c11a9e
[...]
This first one is adding support for an extra hardware feature, which
is not suitable material for stable. Please pick out only the
important bug fixes, per Documentation/stable_kernel_rules.txt.
If you want the driver to be completely backported, look at the
compat-drivers project
<https://backports.wiki.kernel.org/index.php/Documentation/compat-drivers>
or distribution kernels instead of stable.
Ben.
--
Ben Hutchings
We get into the habit of living before acquiring the habit of thinking.
- Albert Camus
^ permalink raw reply
* Re: [Patch net-next] l2tp: fix compile error when CONFIG_IPV6=m and CONFIG_L2TP=y
From: David Miller @ 2012-09-19 20:44 UTC (permalink / raw)
To: amwang; +Cc: netdev
In-Reply-To: <1347947642-20330-1-git-send-email-amwang@redhat.com>
From: Cong Wang <amwang@redhat.com>
Date: Tue, 18 Sep 2012 13:54:02 +0800
> When CONFIG_IPV6=m and CONFIG_L2TP=y, I got the following compile error:
...
> This is due to l2tp uses symbols from IPV6, so when l2tp is
> builtin, IPV6 has to be builtin too.
>
> Cc: David Miller <davem@davemloft.net>
> Signed-off-by: Cong Wang <amwang@redhat.com>
The correct way to express this is:
depends on (IPV6 || IPV6=n)
Which results in the KCONFIG option only being offers in modes
compatible with the dependency. Using a 'select' doesn't work
properly in these kinds of cases.
Anyways, grep for that string to see how it is used in other similar
situations.
^ permalink raw reply
* Re: [PATCH net-next] netdev: make address const in device address management
From: David Miller @ 2012-09-19 20:38 UTC (permalink / raw)
To: shemminger; +Cc: jeffrey.t.kirsher, john.r.fastabend, netdev
In-Reply-To: <20120917130326.028dac29@nehalam.linuxnetplumber.net>
From: Stephen Hemminger <shemminger@vyatta.com>
Date: Mon, 17 Sep 2012 13:03:26 -0700
> The internal functions for add/deleting addresses don't change
> their argument.
>
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Applied, thanks Stephen.
^ permalink raw reply
* Re: [patch net] sky2: fix rx filter setup on link up
From: Jiri Pirko @ 2012-09-19 20:38 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: netdev, davem, mlindner, linux-kernel
In-Reply-To: <20120919132248.5fced1c7@s6510.linuxnetplumber.net>
Wed, Sep 19, 2012 at 10:22:48PM CEST, shemminger@vyatta.com wrote:
>Rather than saving and restoring values, why not just redo the
>full setup? This would also determine if the change was a result
>of something outside the driver.
You cannot call sky2_set_multicast() directly here. It is called from
__dev_set_rx_mode(). You would have to take at lease netif_addr_lock()
here. I think that clearer is to remember computed value....
>
>
>--- a/drivers/net/ethernet/marvell/sky2.c 2012-09-18 21:12:01.156438131 -0700
>+++ b/drivers/net/ethernet/marvell/sky2.c 2012-09-19 13:20:40.373620276 -0700
>@@ -2201,6 +2201,8 @@ static void sky2_link_up(struct sky2_por
>
> sky2_enable_rx_tx(sky2);
>
>+ sky2_set_multicast(sky2->netdev);
>+
> gm_phy_write(hw, port, PHY_MARV_INT_MASK, PHY_M_DEF_MSK);
>
> netif_carrier_on(sky2->netdev);
^ permalink raw reply
* Re: [PATCH v3] net-tcp: TCP/IP stack bypass for loopback connections
From: David Miller @ 2012-09-19 20:34 UTC (permalink / raw)
To: eric.dumazet; +Cc: brutus, edumazet, netdev
In-Reply-To: <1347913239.26523.173.camel@edumazet-glaptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 17 Sep 2012 22:20:39 +0200
> On Mon, 2012-09-17 at 11:58 -0700, Bruce "Brutus" Curtis wrote:
>> From: "Bruce \"Brutus\" Curtis" <brutus@google.com>
>>
>> TCP/IP loopback socket pair stack bypass, based on an idea by, and
>> rough upstream patch from, David Miller <davem@davemloft.net> called
>> "friends", the data structure modifcations and connection scheme are
>> reused with extensive data-path changes.
>
> ...
>
>>
>> + if (skb->friend) {
>> + /*
>> + * If friends haven't been made yet, our sk_friend
>> + * still == NULL, then update with the ACK's friend
>> + * value (the listen()er's sock addr) which is used
>> + * as a place holder.
>> + */
>> + cmpxchg(&sk->sk_friend, NULL, skb->friend);
>> + }
>
>
> There is a fundamental issue with this patch
>
> Setting skb->friend to a socket structure, without holding a reference
> on it is going to add subtle races and bugs.
>
> In this code, we have no guarantee the socket pointed by skb->friend was
> eventually freed and/or reused.
>
> But adding references might be overkill, as we need to unref them in
> some places, in hot path.
I have an idea on how to handle this.
In drivers/net/loopback.c:loopback_tx(), skip the SKB orphan operation
if there is a friend socket at skb->friend.
When sending such friend SKBs out at connection startup, arrange it
such that the skb->destructor will zap the skb->friend pointer to
NULL.
Also, in skb_orphan*(), if necessary, set skb->friend to NULL.
skb->sk will hold a reference to the socket, and since skb->friend
will be equal, this will make sure a pointer to an unreferenced
socket does not escape.
^ permalink raw reply
* Re: [PATCH] fix ZOMBIE state bug in PPPOE driver
From: David Miller @ 2012-09-19 20:26 UTC (permalink / raw)
To: stid.smth; +Cc: linux-kernel, netdev
In-Reply-To: <CANEcBPQnsV26UkyGxms0vKM9wK1NxfKjN4Z0qGipQtV4UPZgGA@mail.gmail.com>
From: Xiaodong Xu <stid.smth@gmail.com>
Date: Sun, 16 Sep 2012 10:30:53 +0800
> Hi All,
>
> I found a bug in kernel PPPOE driver.
> When PPPOE is running over a virtual ethernet interface (e.g., a
> bonding interface) and the user tries to delete the interface in case
> the PPPOE state is ZOMBIE, the kernel will loop infinitely while
> unregistering net_device for the reference count is not reset to zero
> which should be done by dev_put().
>
> The following patch could fix this issue:
>
> $ git diff
Please read Documentation/SubmittingPatches in the kernel source tree
to learn how to properly submit a patch.
In particular your patch was missing a proper signoff and your Subject
line is incorrectly specified.
Thanks.
^ permalink raw reply
* Re: [PATCH] sched: fix virtual-start-time update in QFQ
From: David Miller @ 2012-09-19 20:25 UTC (permalink / raw)
To: paolo.valente; +Cc: shemminger, jhs, fchecconi, rizzo, netdev, linux-kernel
In-Reply-To: <20120915104134.GA29862@paolo-ThinkPad-W520>
From: Paolo Valente <paolo.valente@unimore.it>
Date: Sat, 15 Sep 2012 12:41:35 +0200
> If the old timestamps of a class, say cl, are stale when the class
> becomes active, then QFQ may assign to cl a much higher start time
> than the maximum value allowed. This may happen when QFQ assigns to
> the start time of cl the finish time of a group whose classes are
> characterized by a higher value of the ratio
> max_class_pkt/weight_of_the_class with respect to that of
> cl. Inserting a class with a too high start time into the bucket list
> corrupts the data structure and may eventually lead to crashes.
> This patch limits the maximum start time assigned to a class.
>
> Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Applied and queued up for -stable, thanks.
Please use "pkt_sched" instead of just plain "sched" in the prefixes
of your subject lines for packet scheduler changes so such changes
can be distinguished from process scheduler changes.
I fixed it up for you this time.
Thanks.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox