From: Joe Stringer <joe@wand.net.nz>
To: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Joe Stringer <joe@wand.net.nz>, Florian Westphal <fw@strlen.de>,
netdev <netdev@vger.kernel.org>,
john fastabend <john.fastabend@gmail.com>,
Daniel Borkmann <daniel@iogearbox.net>,
Lorenz Bauer <lmb@cloudflare.com>,
Jakub Sitnicki <jakub@cloudflare.com>,
Paolo Abeni <pabeni@redhat.com>
Subject: Re: Removing skb_orphan() from ip_rcv_core()
Date: Tue, 25 Jun 2019 11:20:46 -0700
Message-ID: <CAOftzPgOOy_jDXgBO2dJFGUU9cnAVCaXtD66R8VH3yXe7NpM7g@mail.gmail.com>
In-Reply-To: <b6baadcb-29af-82f1-bebe-56d5f45b12e6@gmail.com>
On Mon, Jun 24, 2019 at 11:37 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On 6/24/19 8:17 PM, Joe Stringer wrote:
> > On Fri, Jun 21, 2019 at 1:59 PM Florian Westphal <fw@strlen.de> wrote:
> >> Joe Stringer <joe@wand.net.nz> wrote:
> >>> However, if I drop these lines then I end up causing sockets to
> >>> release references too many times. Seems like if we don't orphan the
> >>> skb here, then later logic assumes that we have one more reference
> >>> than we actually have, and decrements the count when it shouldn't
> >>> (perhaps the skb_steal_sock() call in __inet_lookup_skb() which seems
> >>> to assume we always have a reference to the socket?)
> >>
> >> We might be calling the wrong destructor (i.e., the one set by tcp
> >> receive instead of the one set at tx time)?
> >
> > Hmm, interesting thought. Sure enough, with a bit of bpftrace
> > debugging we find it's tcp_wfree():
> >
> > $ cat ip_rcv.bt
> > #include <linux/skbuff.h>
> >
> > kprobe:ip_rcv {
> > $sk = ((struct sk_buff *)arg0)->sk;
> > $des = ((struct sk_buff *)arg0)->destructor;
> > if ($sk) {
> > if ($des) {
> > printf("received %s on %s with sk destructor %s set\n", str(arg0), str(arg1), ksym($des));
> > @ip4_stacks[kstack] = count();
> > }
> > }
> > }
> > $ sudo bpftrace ip_rcv.bt
> > Attaching 1 probe...
> > received on eth0 with sk destructor tcp_wfree set
> > ^C
> >
> > @ip4_stacks[
> > ip_rcv+1
> > __netif_receive_skb+24
> > process_backlog+179
> > net_rx_action+304
> > __do_softirq+220
> > do_softirq_own_stack+42
> > do_softirq.part.17+70
> > __local_bh_enable_ip+101
> > ip_finish_output2+421
> > __ip_finish_output+187
> > ip_finish_output+44
> > ip_output+109
> > ip_local_out+59
> > __ip_queue_xmit+368
> > ip_queue_xmit+16
> > __tcp_transmit_skb+1303
> > tcp_connect+2758
> > tcp_v4_connect+1135
> > __inet_stream_connect+214
> > inet_stream_connect+59
> > __sys_connect+237
> > __x64_sys_connect+26
> > do_syscall_64+90
> > entry_SYSCALL_64_after_hwframe+68
> > ]: 1
> >
> > Is there a solution here where we call the destructor if it's not
> > sock_efree()? When the socket is later stolen, it will only return the
> > reference via a call to sock_put(), so presumably at that point in the
> > stack we already assume that the skb->destructor is not one of these
> > other destructors (otherwise we wouldn't release the resources
> > correctly).
> >
>
> What was the driver here ? In any case, the following patch should help.
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index eeacebd7debbe6a55daedb92f00afd48051ebaf8..5075b4b267af7057f69fcb935226fce097a920e2 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3699,6 +3699,7 @@ static __always_inline int ____dev_forward_skb(struct net_device *dev,
> return NET_RX_DROP;
> }
>
> + skb_orphan(skb);
> skb_scrub_packet(skb, true);
> skb->priority = 0;
> return 0;
Looks like it was bridge in the end, found by attaching a similar
bpftrace program to __dev_forward_skb(). Interestingly enough, the
device attached to the skb reported its name as "eth0", even though I
couldn't find any link or bridge with that name via "ip link" or
"brctl show".
__dev_forward_skb+1
dev_hard_start_xmit+151
__dev_queue_xmit+1928
dev_queue_xmit+16
br_dev_queue_push_xmit+123
br_forward_finish+69
__br_forward+327
br_forward+204
br_dev_xmit+598
dev_hard_start_xmit+151
__dev_queue_xmit+1928
dev_queue_xmit+16
neigh_resolve_output+339
ip_finish_output2+402
__ip_finish_output+187
ip_finish_output+44
ip_output+109
ip_local_out+59
__ip_queue_xmit+368
ip_queue_xmit+16
__tcp_transmit_skb+1303
tcp_connect+2758
tcp_v4_connect+1135
__inet_stream_connect+214
inet_stream_connect+59
__sys_connect+237
__x64_sys_connect+26
do_syscall_64+90
entry_SYSCALL_64_after_hwframe+68
So I guess something like this could be another alternative:
diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index 82225b8b54f5..c2de2bb35080 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -65,6 +65,7 @@ EXPORT_SYMBOL_GPL(br_dev_queue_push_xmit);
int br_forward_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
{
+ skb_orphan(skb);
skb->tstamp = 0;
return NF_HOOK(NFPROTO_BRIDGE, NF_BR_POST_ROUTING,
net, sk, skb, NULL, skb->dev,