Netdev List
 help / color / mirror / Atom feed
* [PATCH] net: Don't default Cavium PTP driver to 'y'
From: Bjorn Helgaas @ 2019-02-05 20:47 UTC (permalink / raw)
  To: David S. Miller
  Cc: Aleksey Makarov, Radoslaw Biernacki, Philippe Ombredanne, netdev,
	linux-kernel, linux-arm-kernel

From: Bjorn Helgaas <bhelgaas@google.com>

8c56df372bc1 ("net: add support for Cavium PTP coprocessor") added the
Cavium PTP coprocessor driver and enabled it by default.  Remove the
"default y" because the driver only applies to Cavium ThunderX processors.

Fixes: 8c56df372bc1 ("net: add support for Cavium PTP coprocessor")
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/net/ethernet/cavium/Kconfig |    1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/ethernet/cavium/Kconfig b/drivers/net/ethernet/cavium/Kconfig
index 5f03199a3acf..05f4a3b21e29 100644
--- a/drivers/net/ethernet/cavium/Kconfig
+++ b/drivers/net/ethernet/cavium/Kconfig
@@ -54,7 +54,6 @@ config CAVIUM_PTP
 	tristate "Cavium PTP coprocessor as PTP clock"
 	depends on 64BIT && PCI
 	imply PTP_1588_CLOCK
-	default y
 	---help---
 	  This driver adds support for the Precision Time Protocol Clocks and
 	  Timestamping coprocessor (PTP) found on Cavium processors.


^ permalink raw reply related

* Re: [RFC bpf-next 4/7] net: flow_dissector: handle no-skb use case
From: Stanislav Fomichev @ 2019-02-05 20:45 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Stanislav Fomichev, Network Development, David Miller,
	Alexei Starovoitov, Daniel Borkmann, simon.horman,
	Willem de Bruijn
In-Reply-To: <CAF=yD-JaWQG=5mA3bBU1+as9hkTNT=+3aJB1uwN_U4qFz44btQ@mail.gmail.com>

On 02/05, Willem de Bruijn wrote:
> On Tue, Feb 5, 2019 at 12:57 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > When flow_dissector is called without skb (with only data and hlen),
> > construct on-stack skb (which has a linear chunk of data passed
> > to the flow dissector). This should let us handle eth_get_headlen
> > case where only data is provided and we don't want to (yet) allocate
> > an skb.
> >
> > Since this on-stack skb doesn't allocate its own data, we can't
> > add shinfo and need to be careful to avoid any code paths that use
> > it. Flow dissector BPF programs can only call bpf_skb_load_bytes helper,
> > which doesn't touch shinfo in our case (skb->len is the length of the
> > linear header so it exits early).
> >
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >  include/linux/skbuff.h    |  5 +++
> >  net/core/flow_dissector.c | 95 +++++++++++++++++++++++++++++----------
> >  2 files changed, 76 insertions(+), 24 deletions(-)
> >
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index aa9a9983de80..5f1c085cb34c 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -1227,6 +1227,11 @@ bool __skb_flow_bpf_dissect(struct bpf_prog *prog,
> >                             const struct sk_buff *skb,
> >                             struct flow_dissector *flow_dissector,
> >                             struct bpf_flow_keys *flow_keys);
> > +bool __flow_bpf_dissect(struct bpf_prog *prog,
> > +                       void *data, __be16 proto,
> > +                       int nhoff, int hlen,
> > +                       struct flow_dissector *flow_dissector,
> > +                       struct bpf_flow_keys *flow_keys);
> 
> nit: please use more descriptive name. Perhaps bpf_flow_dissect_raw
> and rename __skb_flow_bpf_dissect to bpf_flow_dissect_skb.
Agreed.

> > +bool __flow_bpf_dissect(struct bpf_prog *prog,
> > +                       void *data, __be16 proto,
> > +                       int nhoff, int hlen,
> > +                       struct flow_dissector *flow_dissector,
> > +                       struct bpf_flow_keys *flow_keys)
> > +{
> > +       struct bpf_skb_data_end *cb;
> > +       struct sk_buff skb;
> > +       u32 result;
> > +
> > +       __init_skb(&skb, data, hlen);
> > +       skb_put(&skb, hlen);
> > +       skb.protocol = proto;
> > +
> > +       init_flow_keys(flow_keys, &skb, nhoff);
> > +
> > +       cb = (struct bpf_skb_data_end *)skb.cb;
> > +       cb->data_meta = skb.data;
> > +       cb->data_end  = skb.data + skb_headlen(&skb);
> > +
> > +       result = BPF_PROG_RUN(prog, &skb);
> > +
> > +       clamp_flow_keys(flow_keys, hlen);
> >
> >         return result == BPF_OK;
> >  }
> 
> Can__flow_bpf_dissect just construct an skb and then call
> __skb_flow_bpf_dissect?
__skb_flow_bpf_dissect calls bpf_compute_data_pointers which calls
skb_metadata_len which touches shinfo. And I don't think I have a
clever way to handle that.

> 
> It will unnecessarily save and restore the control block, but that is
> a relatively small cost (compared to, say, zeroing the entire skb).

^ permalink raw reply

* Re: [RFC bpf-next 2/7] net: introduce skb_net helper
From: Stanislav Fomichev @ 2019-02-05 20:42 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Stanislav Fomichev, Network Development, David Miller,
	Alexei Starovoitov, Daniel Borkmann, simon.horman,
	Willem de Bruijn
In-Reply-To: <CAF=yD-KR+=h94ccMeeB-CBwQQQvp+BLv9H=FawNE_nE3e4uZgg@mail.gmail.com>

On 02/05, Willem de Bruijn wrote:
> On Tue, Feb 5, 2019 at 12:57 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > skb_net returns network namespace from the associated device or socket.
> >
> > This will be used in the next commit.
> >
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >  include/linux/skbuff.h |  2 ++
> >  net/core/skbuff.c      | 10 ++++++++++
> >  2 files changed, 12 insertions(+)
> >
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index ad883ab2762c..28723a86efdf 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -4343,5 +4343,7 @@ static inline __wsum lco_csum(struct sk_buff *skb)
> >         return csum_partial(l4_hdr, csum_start - l4_hdr, partial);
> >  }
> >
> > +struct net *skb_net(const struct sk_buff *skb);
> > +
> >  #endif /* __KERNEL__ */
> >  #endif /* _LINUX_SKBUFF_H */
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 23c9cf100bd4..016db13fa2b6 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -5585,6 +5585,16 @@ void skb_condense(struct sk_buff *skb)
> >         skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
> >  }
> >
> > +struct net *skb_net(const struct sk_buff *skb)
> > +{
> > +       if (skb->dev)
> > +               return dev_net(skb->dev);
> > +       else if (skb->sk)
> > +               return sock_net(skb->sk);
> > +       return NULL;
> > +}
> > +EXPORT_SYMBOL(skb_net);
> 
> If this needs a helper it is probably better static inline in the header.
I did that initially in skbuff.h as inline, but it is missing some
struct definitions, decided to go with this for RFC. Will look into
inlining.

^ permalink raw reply

* Re: [RFC bpf-next 3/7] net: plumb network namespace into __skb_flow_dissect
From: Stanislav Fomichev @ 2019-02-05 20:40 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Stanislav Fomichev, Network Development, David Miller,
	Alexei Starovoitov, Daniel Borkmann, simon.horman,
	Willem de Bruijn
In-Reply-To: <CAF=yD-JOpmA-ahiyH07kpNXmUO7z-V3oqWRQ6D5S+R1TSgwFPA@mail.gmail.com>

On 02/05, Willem de Bruijn wrote:
> On Tue, Feb 5, 2019 at 12:57 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > This new argument will be used in the next patches for the
> > eth_get_headlen use case. eth_get_headlen calls flow dissector
> > with only data (without skb) so there is currently no way to
> > pull attached BPF flow dissector program. With this new argument,
> > we can amend the callers to explicitly pass network namespace
> > so we can use attached BPF program.
> >
> > Note: WARN_ON_ONCE(!net) will now trigger for eth_get_headlen users.
> >
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> 
> >  /**
> >   * __skb_flow_dissect - extract the flow_keys struct and return it
> > + * @net: associated network namespace, if NULL pulled from skb
> >   * @skb: sk_buff to extract the flow from, can be NULL if the rest are specified
> >   * @flow_dissector: list of keys to dissect
> >   * @target_container: target structure to put dissected values into
> > @@ -739,7 +740,8 @@ bool __skb_flow_bpf_dissect(struct bpf_prog *prog,
> >   *
> >   * Caller must take care of zeroing target container memory.
> >   */
> > -bool __skb_flow_dissect(const struct sk_buff *skb,
> > +bool __skb_flow_dissect(struct net *net,
> > +                       const struct sk_buff *skb,
> >                         struct flow_dissector *flow_dissector,
> >                         void *target_container,
> >                         void *data, __be16 proto, int nhoff, int hlen,
> > @@ -799,12 +801,11 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
> >
> >                 rcu_read_lock();
> >
> > -               if (skb->dev)
> > -                       attached = rcu_dereference(dev_net(skb->dev)->flow_dissector_prog);
> > -               else if (skb->sk)
> > -                       attached = rcu_dereference(sock_net(skb->sk)->flow_dissector_prog);
> > -               else
> > -                       WARN_ON_ONCE(1);
> > +               if (!net && skb)
> > +                       net = skb_net(skb);
> > +               if (net)
> > +                       attached = rcu_dereference(net->flow_dissector_prog);
> > +               WARN_ON_ONCE(!net);
> 
> Instead of this just call skb_net(skb) in all callers of
> __skb_flow_dissect that are called with an skb argument directly?
> 
> It may have to be able to handle skb == NULL args.
Ack, will look into it.

^ permalink raw reply

* Re: [RFC bpf-next 0/7] net: flow_dissector: trigger BPF hook when called from eth_get_headlen
From: Stanislav Fomichev @ 2019-02-05 20:40 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Stanislav Fomichev, Network Development, David Miller,
	Alexei Starovoitov, Daniel Borkmann, simon.horman,
	Willem de Bruijn
In-Reply-To: <CAF=yD-KJK7R5RwW9JphxRBG-=UTapgeAygT76_2HNJWPxvW0iA@mail.gmail.com>

On 02/05, Willem de Bruijn wrote:
> On Tue, Feb 5, 2019 at 12:57 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > Currently, when eth_get_headlen calls flow dissector, it doesn't pass any
> > skb. Because we use passed skb to lookup associated networking namespace
> > to find whether we have a BPF program attached or not, we always use
> > C-based flow dissector in this case.
> >
> > The goal of this patch series is to add new networking namespace argument
> > to the eth_get_headlen and make BPF flow dissector programs be able to
> > work in the skb-less case.
> >
> > The series goes like this:
> > 1. introduce __init_skb and __init_skb_shinfo; those will be used to
> >    initialize temporary skb
> > 2. introduce skb_net which can be used to get networking namespace
> >    associated with an skb
> > 3. add new optional network namespace argument to __skb_flow_dissect and
> >    plumb through the callers
> > 4. add new __flow_bpf_dissect which constructs temporary on-stack skb
> >    (using __init_skb) and calls BPF flow dissector program
> 
> The main concern I see with this series is this cost of skb zeroing
> for every packet in the device driver receive routine, *independent*
> from the real skb allocation and zeroing which will likely happen
> later.
Yes, plus ~200 bytes on the stack for the callers.

Not sure how visible this zeroing though, I can probably try to get some
numbers from BPF_PROG_TEST_RUN (running current version vs running with
on-stack skb).

^ permalink raw reply

* Re: Kernel panic in eth_header
From: Eric Dumazet @ 2019-02-05 20:28 UTC (permalink / raw)
  To: Andrew, Netdev
In-Reply-To: <189be8e7-7126-06bf-67bf-53d56ea0723c@seti.kr.ua>



On 02/05/2019 12:21 PM, Andrew wrote:

> I think that backport will be trivial - at least patch lays smoothly on 4.9 (just with offsets difference).
> 
> I'll test it.
> 
> Btw, maybe there's a some test conditions to quickly check if patch helps? Crash is reproducible with unpredictable interval (tens of hours of quite heavy load).
>

Build your kernel with CONFIG_KASAN=y

Then run the tests Peter wrote.

4c3510483d26420d2c2c7cc075ad872286cc5932 selftests: net: ip_defrag: cover new IPv6 defrag behavior
3271a4821882a64214acc1bd7b173900ec70c9bf selftests: net: fix/improve ip_defrag selftest
bccc17118bcf3c62c947361d51760334f6602f43 selftests/net: add ipv6 tests to ip_defrag selftest
02c7f38b7ace9f1b2ddb7a88139127eef4cf8706 selftests/net: add ip_defrag selftest



^ permalink raw reply

* Re: [PATCH v2 2/2] r8169: Avoid pointer aliasing
From: Heiner Kallweit @ 2019-02-05 20:23 UTC (permalink / raw)
  To: Joe Perches, Eric Dumazet, David Miller, thierry.reding
  Cc: andrew, nic_swsd, netdev, linux-kernel
In-Reply-To: <4aea523a3bc5be0d944f7ed9fadac276b7115002.camel@perches.com>

On 05.02.2019 21:18, Joe Perches wrote:
> On Tue, 2019-02-05 at 12:04 -0800, Eric Dumazet wrote:
>>
>> On 02/05/2019 10:42 AM, Joe Perches wrote:
>>> It's declared after a pointer so it is already is 2 byte aligned.
>>>
>>> A lot of drivers wouldn't work otherwise.
>>
>> Maybe these drivers are only used on arches where this does not matter.
> 
> Possible.
> 
> I had only grepped through the sources looking for
> declarations using:
> 
> $ git grep -B1 '\[ETH_ALEN\];' -- '*.c' | grep -A1 '\*'
> 
> It's quite a few files in net/ too btw.
> 
> I still think adding __align(<even#>) is unnecessary here unless
> it follows something like a bool or a u8.
> 
> 
I there's such a controversy, then it may be better to stay with
the current code, or?

^ permalink raw reply

* Re: [ovs-dev] [PATCH net-next V2 1/1] openvswitch: Declare ovs key structures using macros
From: Ben Pfaff @ 2019-02-05 20:23 UTC (permalink / raw)
  To: Gregory Rose
  Cc: Eli Britstein, David Miller, yihung.wei@gmail.com,
	dev@openvswitch.org, netdev@vger.kernel.org,
	simon.horman@netronome.com
In-Reply-To: <c3455573-5dad-e7a7-fc89-15312c0b7c88@gmail.com>

On Tue, Feb 05, 2019 at 10:22:09AM -0800, Gregory Rose wrote:
> 
> On 2/5/2019 4:02 AM, Eli Britstein wrote:
> > On 2/4/2019 10:07 PM, David Miller wrote:
> > > From: Yi-Hung Wei <yihung.wei@gmail.com>
> > > Date: Mon, 4 Feb 2019 10:47:18 -0800
> > > 
> > > > For example, to see how 'struct ovs_key_ipv6' is defined, now we need
> > > > to trace how OVS_KEY_IPV6_FIELDS is defined, and how OVS_KEY_FIELD_ARR
> > > > and OVS_KEY_FIELD defined.  I think it makes the header file to be
> > > > more complicated.
> > > I completely agree.
> > > 
> > > Unless this is totally unavoidable, I do not want to apply a patch
> > > which makes reading and auditing the networking code more difficult.
> > This technique is discussed for example in
> > https://stackoverflow.com/questions/6635851/real-world-use-of-x-macros,
> > and I found existing examples of using it in the kernel tree:
> > 
> > __BPF_FUNC_MAPPER in commit ebb676daa1a34 ("bpf: Print function name in
> > addition to function id")
> > 
> > __AAL_STAT_ITEMS and __SONET_ITEMS in commit 607ca46e97a1b ("UAPI:
> > (Scripted) Disintegrate include/linux"), the successor of commit
> > 1da177e4c3f4 ("Linux-2.6.12-rc2"). I didn't dig deeper to the past.
> > 
> > I can agree it makes that H file a bit more complicated, but for sure
> > less than ## macros that are widely used.
> > 
> > However, I think the alternatives of generating such defines by some
> > scripts, or having the fields in more than one place are even worse, so
> > it is a kind of unavoidable.
> 
> Why is using a script to generate defines for the requirements of your
> application or driver so bad?  Your patch
> turns openvswitch.h into some hardly readable code - using a script and
> having a machine output the macros
> your application or driver needs seems like a much better alternative to me.

In addition, one of the reasons that developers prefer to avoid
duplication is because of the need to synchronize copies when one of
them changes.  This doesn't apply in the same way to these structures,
because they are ABIs that will not change.

^ permalink raw reply

* Re: [PATCH v2 2/2] r8169: Avoid pointer aliasing
From: Eric Dumazet @ 2019-02-05 20:23 UTC (permalink / raw)
  To: Joe Perches, David Miller, thierry.reding
  Cc: hkallweit1, andrew, nic_swsd, netdev, linux-kernel
In-Reply-To: <4aea523a3bc5be0d944f7ed9fadac276b7115002.camel@perches.com>



On 02/05/2019 12:18 PM, Joe Perches wrote:

> I still think adding __align(<even#>) is unnecessary here unless
> it follows something like a bool or a u8.

This would be some historical side effect, and we do not want to rely on that.

A security feature could in fact ask a compiler to perform random
shuffling of automatic variables.

( a la __randomize_layout )



^ permalink raw reply

* Re: Kernel panic in eth_header
From: Andrew @ 2019-02-05 20:21 UTC (permalink / raw)
  To: Netdev
In-Reply-To: <16f5a810-f183-2874-c67d-d490f70f7bf6@gmail.com>

On 05.02.2019 21:34, Florian Fainelli wrote:
> On 2/5/19 8:57 AM, Eric Dumazet wrote:
>>
>> On 02/05/2019 08:29 AM, Andrew wrote:
>>> Hi all.
>>>
>>> After upgrade on PPPoE BRAS to kernel 4.9.153 I've got an kernel panic after a 3 days of uptime.
>>>
>>> Unfortunately kernel is compiled w/o debug info; I rebuilt kernel with debug info enabled (kernel is compiled with same function addresses - I compare vmlinux symbol maps) - it says that panic is in net/ethernet/eth.c:88
>>>
>>> Below there is a kernel panic trace. igb is from vendor, ver. 5.3.5.4. What extra info is needed?
>>>
>>> [263565.106441] BUG: unable to handle kernel paging request at ffff88015a4d2dd4
>>> [263565.113527] IP: [<ffffffff8158e48b>] eth_header+0x3b/0xc0
>>> [263565.119030] PGD 1e8f067 [263565.121474] PUD 0
>>> [263565.123580]
>>> [263565.125166] Oops: 0002 [#1] SMP
>>> [263565.128398] Modules linked in: xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 iptable_filter xt_length xt_TCPMSS xt_tcpudp xt_mark xt_dscp iptable_mangle ip_tables x_tables nf_nat_pptp nf_conntrack_pptp nf_conntrack_proto_gre nf_nat_proto_gre nf_nat nf_conntrack sch_sfq sch_htb cls_u32 sch_ingress sch_prio sch_tbf cls_flow cls_fw act_police ifb 8021q mrp garp stp llc softdog pppoe pppox ppp_generic slhc i2c_nforce2 i2c_core igb(O) parport_pc dca parport thermal asus_atk0110 fan ptp k10temp hwmon pps_core nv_tco
>>> [263565.176083] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G           O    4.9.153-x86_64 #1
>>> [263565.183996] Hardware name: System manufacturer System Product Name/M2N-E, BIOS ASUS M2N-E ACPI BIOS Revision 5001 03/23/2010
>>> [263565.195289] task: ffff88007d0f5200 task.stack: ffffc9000006c000
>>> [263565.201295] RIP: 0010:[<ffffffff8158e48b>] [<ffffffff8158e48b>] eth_header+0x3b/0xc0
>>> [263565.209225] RSP: 0018:ffff88007fa83c58  EFLAGS: 00010286
>>> [263565.214622] RAX: ffff88015a4d2dc8 RBX: 0000000000000008 RCX: ffff8800682434a0
>>> [263565.221843] RDX: ffff88015a4d2dc8 RSI: ffff88015a4d2dc8 RDI: ffff880077aab000
>>> [263565.229062] RBP: ffff88007b663d90 R08: ffff88007b663d90 R09: 0000000000000574
>>> [263565.236281] R10: ffff88007d1fa000 R11: 0000000000000000 R12: ffff8800682434a0
>>> [263565.243501] R13: ffff88007d1fa000 R14: 0000000000000574 R15: 0000000000000008
>>> [263565.250719] FS:  0000000000000000(0000) GS:ffff88007fa80000(0000) knlGS:0000000000000000
>>> [263565.258894] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [263565.264725] CR2: ffff88015a4d2dd4 CR3: 000000007ad73000 CR4: 00000000000006f0
>>> [263565.271944] Stack:
>>> [263565.274041]  ffff880077aab000 ffff880068243400 ffff88007a745000 ffff8800682434a0
>>> [263565.281582]  0000000000000002 ffffffff81571d09 ffff880068243400 ffff88007fa83d00
>>> [263565.289121]  ffff88007a745000 ffff880077aab000 ffff88007a712000 ffffffff815a8c61
>>> [263565.296661] Call Trace:
>>> [263565.299193]  <IRQ> [263565.301205] [<ffffffff81571d09>] ? neigh_connected_output+0xa9/0x100
>>> [263565.307740]  [<ffffffff815a8c61>] ? ip_finish_output2+0x221/0x400
>>> [263565.313920]  [<ffffffff8159e144>] ? nf_iterate+0x54/0x60
>>> [263565.319319]  [<ffffffff815ab2fa>] ? ip_output+0x6a/0xf0
>>> [263565.324631]  [<ffffffff8159e102>] ? nf_iterate+0x12/0x60
>>> [263565.330030]  [<ffffffff815aa6e0>] ? ip_fragment.constprop.5+0x80/0x80
>>> [263565.336556]  [<ffffffff815a73b6>] ? ip_forward+0x396/0x480
>>> [263565.342128]  [<ffffffff815a6fb0>] ? ip_check_defrag+0x1e0/0x1e0
>>> [263565.348134]  [<ffffffff815a5a2e>] ? ip_rcv+0x2ae/0x370
>>> [263565.353361]  [<ffffffffa0107c02>] ? pppoe_rcv_core+0xd2/0x160 [pppoe]
>>> [263565.359888]  [<ffffffff815a5170>] ? ip_local_deliver_finish+0x1d0/0x1d0
>>> [263565.366586]  [<ffffffff81562a57>] ? __netif_receive_skb_core+0x527/0xa80
>>> [263565.373373]  [<ffffffff81567632>] ? process_backlog+0x92/0x130
>>> [263565.379291]  [<ffffffff8156745d>] ? net_rx_action+0x24d/0x390
>>> [263565.385124]  [<ffffffff81628374>] ? __do_softirq+0xf4/0x2a0
>>> [263565.390784]  [<ffffffff8107136c>] ? irq_exit+0xbc/0xd0
>>> [263565.396008]  [<ffffffff81626cd6>] ? call_function_single_interrupt+0x96/0xa0
>>> [263565.403141]  <EOI> [263565.405153] [<ffffffff81623eb0>] ? __sched_text_end+0x2/0x2
>>> [263565.410907]  [<ffffffff81624182>] ? native_safe_halt+0x2/0x10
>>> [263565.416741]  [<ffffffff81623ec8>] ? default_idle+0x18/0xd0
>>> [263565.422314]  [<ffffffff810a7a46>] ? cpu_startup_entry+0x126/0x220
>>> [263565.428492]  [<ffffffff8104c261>] ? start_secondary+0x161/0x180
>>> [263565.434496] Code: 0e 00 00 00 53 89 d3 49 89 cc 4c 89 c5 45 89 ce e8 bb 8a fc ff 66 83 fb 01 48 89 c6 74 44 66 83 fb 04 74 3e 66 c1 c3 08 48 85 ed <66> 89 58 0c 74 40 8b 45 00 4d 85 e4 89 46 06 0f b7 45 04 66 89
>>> [263565.454534] RIP  [<ffffffff8158e48b>] eth_header+0x3b/0xc0
>>> [263565.460124]  RSP <ffff88007fa83c58>
>>> [263565.463696] CR2: ffff88015a4d2dd4
>>> [263565.467104] ---[ end trace a1bcaf3618724adf ]---
>>> [263565.471807] Kernel panic - not syncing: Fatal exception in interrupt
>>> [263565.478245] Kernel Offset: disabled
>>> [263565.481818] Rebooting in 5 seconds..
>>>
>>
>> This is a well known issue, a fix should come shortly in stable branches
> Is Peter or yourself doing the backport? David would only take care of
> the most two recent stable kernels.
>
> Sorry about missing that change as part of the fragmenstack backport to
> 4.9...

I think that backport will be trivial - at least patch lays smoothly on 
4.9 (just with offsets difference).

I'll test it.

Btw, maybe there's a some test conditions to quickly check if patch 
helps? Crash is reproducible with unpredictable interval (tens of hours 
of quite heavy load).


>> diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
>> index f8bbd693c19c247e41839c2d0b5318ca51b23ee8..d95b32af4a0e3f552405c9e61cc372729834160c 100644
>> --- a/net/ipv4/ip_fragment.c
>> +++ b/net/ipv4/ip_fragment.c
>> @@ -425,6 +425,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
>>           * fragment.
>>           */
>>   
>> +       err = -EINVAL;
>>          /* Find out where to put this fragment.  */
>>          prev_tail = qp->q.fragments_tail;
>>          if (!prev_tail)
>> @@ -501,7 +502,6 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
>>   
>>   discard_qp:
>>          inet_frag_kill(&qp->q);
>> -       err = -EINVAL;
>>          __IP_INC_STATS(net, IPSTATS_MIB_REASM_OVERLAPS);
>>   err:
>>          kfree_skb(skb);
>>
>>
>>
>


^ permalink raw reply

* [net 1/3] net/mlx5e: FPGA, fix Innova IPsec TX offload data path performance
From: Saeed Mahameed @ 2019-02-05 20:20 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Raed Salem, Tariq Toukan, Saeed Mahameed
In-Reply-To: <20190205202011.24023-1-saeedm@mellanox.com>

From: Raed Salem <raeds@mellanox.com>

At Innova IPsec TX offload data path a special software parser metadata
is used to pass some packet attributes to the hardware, this metadata
is passed using the Ethernet control segment of a WQE (a HW descriptor)
header.

The cited commit might nullify this header, hence the metadata is lost,
this caused a significant performance drop during hw offloading
operation.

Fix by restoring the metadata at the Ethernet control segment in case
it was nullified.

Fixes: 37fdffb217a4 ("net/mlx5: WQ, fixes for fragmented WQ buffers API")
Signed-off-by: Raed Salem <raeds@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index 598ad7e4d5c9..0e55cd1f2e98 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -387,8 +387,14 @@ netdev_tx_t mlx5e_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 	num_wqebbs = DIV_ROUND_UP(ds_cnt, MLX5_SEND_WQEBB_NUM_DS);
 	contig_wqebbs_room = mlx5_wq_cyc_get_contig_wqebbs(wq, pi);
 	if (unlikely(contig_wqebbs_room < num_wqebbs)) {
+#ifdef CONFIG_MLX5_EN_IPSEC
+		struct mlx5_wqe_eth_seg cur_eth = wqe->eth;
+#endif
 		mlx5e_fill_sq_frag_edge(sq, wq, pi, contig_wqebbs_room);
 		mlx5e_sq_fetch_wqe(sq, &wqe, &pi);
+#ifdef CONFIG_MLX5_EN_IPSEC
+		wqe->eth = cur_eth;
+#endif
 	}
 
 	/* fill wqe */
-- 
2.20.1


^ permalink raw reply related

* [net 2/3] net/mlx5e: Properly set steering match levels for offloaded TC decap rules
From: Saeed Mahameed @ 2019-02-05 20:20 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Or Gerlitz, Slava Ovsiienko, Jianbo Liu, Saeed Mahameed
In-Reply-To: <20190205202011.24023-1-saeedm@mellanox.com>

From: Or Gerlitz <ogerlitz@mellanox.com>

The match level computed by the driver gets to be wrong for decap
rules with wildcarded inner packet match such as:

tc filter add dev vxlan_sys_4789 protocol all parent ffff: prio 2 flower
       enc_dst_ip 192.168.0.9 enc_key_id 100 enc_dst_port 4789
       action tunnel_key unset
       action mirred egress redirect dev eth1

The FW errs for a missing matching meta-data indicator for the outer
headers (where we do have a match), and a wrong matching meta-data
indicator for the inner headers (where we don't have a match).

Fix that by taking into account the matching on the tunnel info and
relating the match level of the encapsulated packet to the firmware
inner headers indicator in case of decap.

As for vxlan we mandate a match on the tunnel udp dst port, and in general
we practically madndate a match on the source or dest ip for any IP tunnel,
the fix was done in a minimal manner around the tunnel match parsing code.

Fixes: d708f902989b ('net/mlx5e: Get the required HW match level while parsing TC flow matches')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reported-by: Slava Ovsiienko <viacheslavo@mellanox.com>
Reviewed-by: Jianbo Liu <jianbol@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 .../ethernet/mellanox/mlx5/core/en/tc_tun.c    |  4 +++-
 .../ethernet/mellanox/mlx5/core/en/tc_tun.h    |  2 +-
 .../net/ethernet/mellanox/mlx5/core/en_tc.c    | 18 ++++++++++--------
 .../net/ethernet/mellanox/mlx5/core/eswitch.h  |  1 +
 .../mellanox/mlx5/core/eswitch_offloads.c      | 17 +++++++++--------
 5 files changed, 24 insertions(+), 18 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.c b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.c
index 046948ead152..a3750af074a4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.c
@@ -612,16 +612,18 @@ int mlx5e_tc_tun_parse(struct net_device *filter_dev,
 		       struct mlx5_flow_spec *spec,
 		       struct tc_cls_flower_offload *f,
 		       void *headers_c,
-		       void *headers_v)
+		       void *headers_v, u8 *match_level)
 {
 	int tunnel_type;
 	int err = 0;
 
 	tunnel_type = mlx5e_tc_tun_get_type(filter_dev);
 	if (tunnel_type == MLX5E_TC_TUNNEL_TYPE_VXLAN) {
+		*match_level = MLX5_MATCH_L4;
 		err = mlx5e_tc_tun_parse_vxlan(priv, spec, f,
 					       headers_c, headers_v);
 	} else if (tunnel_type == MLX5E_TC_TUNNEL_TYPE_GRETAP) {
+		*match_level = MLX5_MATCH_L3;
 		err = mlx5e_tc_tun_parse_gretap(priv, spec, f,
 						headers_c, headers_v);
 	} else {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.h b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.h
index 706ce7bf15e7..b63f15de899d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.h
@@ -39,6 +39,6 @@ int mlx5e_tc_tun_parse(struct net_device *filter_dev,
 		       struct mlx5_flow_spec *spec,
 		       struct tc_cls_flower_offload *f,
 		       void *headers_c,
-		       void *headers_v);
+		       void *headers_v, u8 *match_level);
 
 #endif //__MLX5_EN_TC_TUNNEL_H__
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index cae6c6d48984..043896e13ffa 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -1302,7 +1302,7 @@ static void mlx5e_tc_del_flow(struct mlx5e_priv *priv,
 static int parse_tunnel_attr(struct mlx5e_priv *priv,
 			     struct mlx5_flow_spec *spec,
 			     struct tc_cls_flower_offload *f,
-			     struct net_device *filter_dev)
+			     struct net_device *filter_dev, u8 *match_level)
 {
 	struct netlink_ext_ack *extack = f->common.extack;
 	void *headers_c = MLX5_ADDR_OF(fte_match_param, spec->match_criteria,
@@ -1317,7 +1317,7 @@ static int parse_tunnel_attr(struct mlx5e_priv *priv,
 	int err = 0;
 
 	err = mlx5e_tc_tun_parse(filter_dev, priv, spec, f,
-				 headers_c, headers_v);
+				 headers_c, headers_v, match_level);
 	if (err) {
 		NL_SET_ERR_MSG_MOD(extack,
 				   "failed to parse tunnel attributes");
@@ -1426,7 +1426,7 @@ static int __parse_cls_flower(struct mlx5e_priv *priv,
 			      struct mlx5_flow_spec *spec,
 			      struct tc_cls_flower_offload *f,
 			      struct net_device *filter_dev,
-			      u8 *match_level)
+			      u8 *match_level, u8 *tunnel_match_level)
 {
 	struct netlink_ext_ack *extack = f->common.extack;
 	void *headers_c = MLX5_ADDR_OF(fte_match_param, spec->match_criteria,
@@ -1477,7 +1477,7 @@ static int __parse_cls_flower(struct mlx5e_priv *priv,
 		switch (key->addr_type) {
 		case FLOW_DISSECTOR_KEY_IPV4_ADDRS:
 		case FLOW_DISSECTOR_KEY_IPV6_ADDRS:
-			if (parse_tunnel_attr(priv, spec, f, filter_dev))
+			if (parse_tunnel_attr(priv, spec, f, filter_dev, tunnel_match_level))
 				return -EOPNOTSUPP;
 			break;
 		default:
@@ -1826,11 +1826,11 @@ static int parse_cls_flower(struct mlx5e_priv *priv,
 	struct mlx5_core_dev *dev = priv->mdev;
 	struct mlx5_eswitch *esw = dev->priv.eswitch;
 	struct mlx5e_rep_priv *rpriv = priv->ppriv;
+	u8 match_level, tunnel_match_level = MLX5_MATCH_NONE;
 	struct mlx5_eswitch_rep *rep;
-	u8 match_level;
 	int err;
 
-	err = __parse_cls_flower(priv, spec, f, filter_dev, &match_level);
+	err = __parse_cls_flower(priv, spec, f, filter_dev, &match_level, &tunnel_match_level);
 
 	if (!err && (flow->flags & MLX5E_TC_FLOW_ESWITCH)) {
 		rep = rpriv->rep;
@@ -1846,10 +1846,12 @@ static int parse_cls_flower(struct mlx5e_priv *priv,
 		}
 	}
 
-	if (flow->flags & MLX5E_TC_FLOW_ESWITCH)
+	if (flow->flags & MLX5E_TC_FLOW_ESWITCH) {
 		flow->esw_attr->match_level = match_level;
-	else
+		flow->esw_attr->tunnel_match_level = tunnel_match_level;
+	} else {
 		flow->nic_attr->match_level = match_level;
+	}
 
 	return err;
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index 9c89eea9b2c3..748ff178a1d6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -312,6 +312,7 @@ struct mlx5_esw_flow_attr {
 	} dests[MLX5_MAX_FLOW_FWD_VPORTS];
 	u32	mod_hdr_id;
 	u8	match_level;
+	u8	tunnel_match_level;
 	struct mlx5_fc *counter;
 	u32	chain;
 	u16	prio;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 53065b6ae593..d4e6fe5b9300 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -160,14 +160,15 @@ mlx5_eswitch_add_offloaded_rule(struct mlx5_eswitch *esw,
 		MLX5_SET_TO_ONES(fte_match_set_misc, misc,
 				 source_eswitch_owner_vhca_id);
 
-	if (attr->match_level == MLX5_MATCH_NONE)
-		spec->match_criteria_enable = MLX5_MATCH_MISC_PARAMETERS;
-	else
-		spec->match_criteria_enable = MLX5_MATCH_OUTER_HEADERS |
-					      MLX5_MATCH_MISC_PARAMETERS;
-
-	if (flow_act.action & MLX5_FLOW_CONTEXT_ACTION_DECAP)
-		spec->match_criteria_enable |= MLX5_MATCH_INNER_HEADERS;
+	spec->match_criteria_enable = MLX5_MATCH_MISC_PARAMETERS;
+	if (flow_act.action & MLX5_FLOW_CONTEXT_ACTION_DECAP) {
+		if (attr->tunnel_match_level != MLX5_MATCH_NONE)
+			spec->match_criteria_enable |= MLX5_MATCH_OUTER_HEADERS;
+		if (attr->match_level != MLX5_MATCH_NONE)
+			spec->match_criteria_enable |= MLX5_MATCH_INNER_HEADERS;
+	} else if (attr->match_level != MLX5_MATCH_NONE) {
+		spec->match_criteria_enable |= MLX5_MATCH_OUTER_HEADERS;
+	}
 
 	if (flow_act.action & MLX5_FLOW_CONTEXT_ACTION_MOD_HDR)
 		flow_act.modify_id = attr->mod_hdr_id;
-- 
2.20.1


^ permalink raw reply related

* [net 3/3] net/mlx5e: Use the inner headers to determine tc/pedit offload limitation on decap flows
From: Saeed Mahameed @ 2019-02-05 20:20 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Guy Shattah, Or Gerlitz, Saeed Mahameed
In-Reply-To: <20190205202011.24023-1-saeedm@mellanox.com>

From: Guy Shattah <sguy@mellanox.com>

In packets that need to be decaped the internal headers
have to be checked, not the external ones.

Fixes: bdd66ac0aeed ("net/mlx5e: Disallow TC offloading of unsupported match/action combinations")
Signed-off-by: Guy Shattah <sguy@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 043896e13ffa..1c3c9fa26b55 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -2181,6 +2181,7 @@ static bool csum_offload_supported(struct mlx5e_priv *priv,
 
 static bool modify_header_match_supported(struct mlx5_flow_spec *spec,
 					  struct tcf_exts *exts,
+					  u32 actions,
 					  struct netlink_ext_ack *extack)
 {
 	const struct tc_action *a;
@@ -2190,7 +2191,11 @@ static bool modify_header_match_supported(struct mlx5_flow_spec *spec,
 	u16 ethertype;
 	int nkeys, i;
 
-	headers_v = MLX5_ADDR_OF(fte_match_param, spec->match_value, outer_headers);
+	if (actions & MLX5_FLOW_CONTEXT_ACTION_DECAP)
+		headers_v = MLX5_ADDR_OF(fte_match_param, spec->match_value, inner_headers);
+	else
+		headers_v = MLX5_ADDR_OF(fte_match_param, spec->match_value, outer_headers);
+
 	ethertype = MLX5_GET(fte_match_set_lyr_2_4, headers_v, ethertype);
 
 	/* for non-IP we only re-write MACs, so we're okay */
@@ -2247,7 +2252,7 @@ static bool actions_match_supported(struct mlx5e_priv *priv,
 
 	if (actions & MLX5_FLOW_CONTEXT_ACTION_MOD_HDR)
 		return modify_header_match_supported(&parse_attr->spec, exts,
-						     extack);
+						     actions, extack);
 
 	return true;
 }
-- 
2.20.1


^ permalink raw reply related

* [pull request][net 0/3] Mellanox, mlx5 fixes 2019-02-05
From: Saeed Mahameed @ 2019-02-05 20:20 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Saeed Mahameed

Hi Dave,

This series introduces some fixes to mlx5 driver.

Please pull and let me know if there is any problem.

For -stable v4.19
('net/mlx5e: FPGA, fix Innova IPsec TX offload data path performance')

For -stable v4.20
('net/mlx5e: Use the inner headers to determine tc/pedit offload limitation on decap flows')

Thanks,
Saeed.

---
The following changes since commit f09bef61f1ed72869b231e5cff16e73a06505cfb:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf (2019-02-05 11:23:23 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5-fixes-2019-02-05

for you to fetch changes up to 1651925d403e077e3fc86f961905e27c6810e132:

  net/mlx5e: Use the inner headers to determine tc/pedit offload limitation on decap flows (2019-02-05 12:10:19 -0800)

----------------------------------------------------------------
mlx5-fixes-2019-02-05

----------------------------------------------------------------
Guy Shattah (1):
      net/mlx5e: Use the inner headers to determine tc/pedit offload limitation on decap flows

Or Gerlitz (1):
      net/mlx5e: Properly set steering match levels for offloaded TC decap rules

Raed Salem (1):
      net/mlx5e: FPGA, fix Innova IPsec TX offload data path performance

 .../net/ethernet/mellanox/mlx5/core/en/tc_tun.c    |  4 +++-
 .../net/ethernet/mellanox/mlx5/core/en/tc_tun.h    |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c    | 27 ++++++++++++++--------
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c    |  6 +++++
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.h  |  1 +
 .../ethernet/mellanox/mlx5/core/eswitch_offloads.c | 17 +++++++-------
 6 files changed, 37 insertions(+), 20 deletions(-)

^ permalink raw reply

* Re: [RFC bpf-next 5/7] bpf: when doing BPF_PROG_TEST_RUN for flow dissector use no-skb mode
From: Willem de Bruijn @ 2019-02-05 20:19 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Network Development, David Miller, Alexei Starovoitov,
	Daniel Borkmann, simon.horman, Willem de Bruijn
In-Reply-To: <20190205173629.160717-6-sdf@google.com>

On Tue, Feb 5, 2019 at 12:58 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> Now that we have __flow_bpf_dissect which works on raw data (by
> constructing temporary on-stack skb), use it when doing
> BPF_PROG_TEST_RUN for flow dissector.
>
> This should help us catch any possible bugs due to missing shinfo on
> the on-stack skb.
>
> Note that existing __skb_flow_bpf_dissect swallows L2 headers and returns
> nhoff=0, we need to preserve the existing behavior.
>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>  net/bpf/test_run.c | 52 +++++++++++++++-------------------------------
>  1 file changed, 17 insertions(+), 35 deletions(-)
>

>         ret = bpf_test_finish(kattr, uattr, &flow_keys, sizeof(flow_keys),
>                               retval, duration);
> -
> -       kfree_skb(skb);
> -       kfree(sk);
> +       kfree(data);
>         return ret;
> +

nit: unnecessary whitespace line

>  }

^ permalink raw reply

* Re: [RFC bpf-next 4/7] net: flow_dissector: handle no-skb use case
From: Willem de Bruijn @ 2019-02-05 20:19 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Network Development, David Miller, Alexei Starovoitov,
	Daniel Borkmann, simon.horman, Willem de Bruijn
In-Reply-To: <20190205173629.160717-5-sdf@google.com>

On Tue, Feb 5, 2019 at 12:57 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> When flow_dissector is called without skb (with only data and hlen),
> construct on-stack skb (which has a linear chunk of data passed
> to the flow dissector). This should let us handle eth_get_headlen
> case where only data is provided and we don't want to (yet) allocate
> an skb.
>
> Since this on-stack skb doesn't allocate its own data, we can't
> add shinfo and need to be careful to avoid any code paths that use
> it. Flow dissector BPF programs can only call bpf_skb_load_bytes helper,
> which doesn't touch shinfo in our case (skb->len is the length of the
> linear header so it exits early).
>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>  include/linux/skbuff.h    |  5 +++
>  net/core/flow_dissector.c | 95 +++++++++++++++++++++++++++++----------
>  2 files changed, 76 insertions(+), 24 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index aa9a9983de80..5f1c085cb34c 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1227,6 +1227,11 @@ bool __skb_flow_bpf_dissect(struct bpf_prog *prog,
>                             const struct sk_buff *skb,
>                             struct flow_dissector *flow_dissector,
>                             struct bpf_flow_keys *flow_keys);
> +bool __flow_bpf_dissect(struct bpf_prog *prog,
> +                       void *data, __be16 proto,
> +                       int nhoff, int hlen,
> +                       struct flow_dissector *flow_dissector,
> +                       struct bpf_flow_keys *flow_keys);

nit: please use more descriptive name. Perhaps bpf_flow_dissect_raw
and rename __skb_flow_bpf_dissect to bpf_flow_dissect_skb.

> +bool __flow_bpf_dissect(struct bpf_prog *prog,
> +                       void *data, __be16 proto,
> +                       int nhoff, int hlen,
> +                       struct flow_dissector *flow_dissector,
> +                       struct bpf_flow_keys *flow_keys)
> +{
> +       struct bpf_skb_data_end *cb;
> +       struct sk_buff skb;
> +       u32 result;
> +
> +       __init_skb(&skb, data, hlen);
> +       skb_put(&skb, hlen);
> +       skb.protocol = proto;
> +
> +       init_flow_keys(flow_keys, &skb, nhoff);
> +
> +       cb = (struct bpf_skb_data_end *)skb.cb;
> +       cb->data_meta = skb.data;
> +       cb->data_end  = skb.data + skb_headlen(&skb);
> +
> +       result = BPF_PROG_RUN(prog, &skb);
> +
> +       clamp_flow_keys(flow_keys, hlen);
>
>         return result == BPF_OK;
>  }

Can__flow_bpf_dissect just construct an skb and then call
__skb_flow_bpf_dissect?

It will unnecessarily save and restore the control block, but that is
a relatively small cost (compared to, say, zeroing the entire skb).

^ permalink raw reply

* Re: [RFC bpf-next 2/7] net: introduce skb_net helper
From: Willem de Bruijn @ 2019-02-05 20:19 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Network Development, David Miller, Alexei Starovoitov,
	Daniel Borkmann, simon.horman, Willem de Bruijn
In-Reply-To: <20190205173629.160717-3-sdf@google.com>

On Tue, Feb 5, 2019 at 12:57 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> skb_net returns network namespace from the associated device or socket.
>
> This will be used in the next commit.
>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>  include/linux/skbuff.h |  2 ++
>  net/core/skbuff.c      | 10 ++++++++++
>  2 files changed, 12 insertions(+)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index ad883ab2762c..28723a86efdf 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -4343,5 +4343,7 @@ static inline __wsum lco_csum(struct sk_buff *skb)
>         return csum_partial(l4_hdr, csum_start - l4_hdr, partial);
>  }
>
> +struct net *skb_net(const struct sk_buff *skb);
> +
>  #endif /* __KERNEL__ */
>  #endif /* _LINUX_SKBUFF_H */
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 23c9cf100bd4..016db13fa2b6 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -5585,6 +5585,16 @@ void skb_condense(struct sk_buff *skb)
>         skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
>  }
>
> +struct net *skb_net(const struct sk_buff *skb)
> +{
> +       if (skb->dev)
> +               return dev_net(skb->dev);
> +       else if (skb->sk)
> +               return sock_net(skb->sk);
> +       return NULL;
> +}
> +EXPORT_SYMBOL(skb_net);

If this needs a helper it is probably better static inline in the header.

^ permalink raw reply

* Re: [RFC bpf-next 3/7] net: plumb network namespace into __skb_flow_dissect
From: Willem de Bruijn @ 2019-02-05 20:19 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Network Development, David Miller, Alexei Starovoitov,
	Daniel Borkmann, simon.horman, Willem de Bruijn
In-Reply-To: <20190205173629.160717-4-sdf@google.com>

On Tue, Feb 5, 2019 at 12:57 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> This new argument will be used in the next patches for the
> eth_get_headlen use case. eth_get_headlen calls flow dissector
> with only data (without skb) so there is currently no way to
> pull attached BPF flow dissector program. With this new argument,
> we can amend the callers to explicitly pass network namespace
> so we can use attached BPF program.
>
> Note: WARN_ON_ONCE(!net) will now trigger for eth_get_headlen users.
>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>

>  /**
>   * __skb_flow_dissect - extract the flow_keys struct and return it
> + * @net: associated network namespace, if NULL pulled from skb
>   * @skb: sk_buff to extract the flow from, can be NULL if the rest are specified
>   * @flow_dissector: list of keys to dissect
>   * @target_container: target structure to put dissected values into
> @@ -739,7 +740,8 @@ bool __skb_flow_bpf_dissect(struct bpf_prog *prog,
>   *
>   * Caller must take care of zeroing target container memory.
>   */
> -bool __skb_flow_dissect(const struct sk_buff *skb,
> +bool __skb_flow_dissect(struct net *net,
> +                       const struct sk_buff *skb,
>                         struct flow_dissector *flow_dissector,
>                         void *target_container,
>                         void *data, __be16 proto, int nhoff, int hlen,
> @@ -799,12 +801,11 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
>
>                 rcu_read_lock();
>
> -               if (skb->dev)
> -                       attached = rcu_dereference(dev_net(skb->dev)->flow_dissector_prog);
> -               else if (skb->sk)
> -                       attached = rcu_dereference(sock_net(skb->sk)->flow_dissector_prog);
> -               else
> -                       WARN_ON_ONCE(1);
> +               if (!net && skb)
> +                       net = skb_net(skb);
> +               if (net)
> +                       attached = rcu_dereference(net->flow_dissector_prog);
> +               WARN_ON_ONCE(!net);

Instead of this just call skb_net(skb) in all callers of
__skb_flow_dissect that are called with an skb argument directly?

It may have to be able to handle skb == NULL args.

^ permalink raw reply

* Re: [RFC bpf-next 1/7] net: introduce __init_skb and __init_skb_shinfo helpers
From: Willem de Bruijn @ 2019-02-05 20:18 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Network Development, David Miller, Alexei Starovoitov,
	Daniel Borkmann, simon.horman, Willem de Bruijn
In-Reply-To: <20190205173629.160717-2-sdf@google.com>

On Tue, Feb 5, 2019 at 12:57 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> __init_skb is essentially a version of __build_skb which accepts skb as
> an argument (instead of doing kmem_cache_alloc to allocate it).
>
> __init_skb_shinfo initializes shinfo.
>
> No functional changes.
>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>

Nice code deduplication :-)

^ permalink raw reply

* Re: [RFC bpf-next 0/7] net: flow_dissector: trigger BPF hook when called from eth_get_headlen
From: Willem de Bruijn @ 2019-02-05 20:18 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Network Development, David Miller, Alexei Starovoitov,
	Daniel Borkmann, simon.horman, Willem de Bruijn
In-Reply-To: <20190205173629.160717-1-sdf@google.com>

On Tue, Feb 5, 2019 at 12:57 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> Currently, when eth_get_headlen calls flow dissector, it doesn't pass any
> skb. Because we use passed skb to lookup associated networking namespace
> to find whether we have a BPF program attached or not, we always use
> C-based flow dissector in this case.
>
> The goal of this patch series is to add new networking namespace argument
> to the eth_get_headlen and make BPF flow dissector programs be able to
> work in the skb-less case.
>
> The series goes like this:
> 1. introduce __init_skb and __init_skb_shinfo; those will be used to
>    initialize temporary skb
> 2. introduce skb_net which can be used to get networking namespace
>    associated with an skb
> 3. add new optional network namespace argument to __skb_flow_dissect and
>    plumb through the callers
> 4. add new __flow_bpf_dissect which constructs temporary on-stack skb
>    (using __init_skb) and calls BPF flow dissector program

The main concern I see with this series is this cost of skb zeroing
for every packet in the device driver receive routine, *independent*
from the real skb allocation and zeroing which will likely happen
later.

^ permalink raw reply

* Re: [PATCH v2 2/2] r8169: Avoid pointer aliasing
From: Joe Perches @ 2019-02-05 20:18 UTC (permalink / raw)
  To: Eric Dumazet, David Miller, thierry.reding
  Cc: hkallweit1, andrew, nic_swsd, netdev, linux-kernel
In-Reply-To: <ca6ca816-af03-e1f9-739a-a06cc4837b80@gmail.com>

On Tue, 2019-02-05 at 12:04 -0800, Eric Dumazet wrote:
> 
> On 02/05/2019 10:42 AM, Joe Perches wrote:
> > It's declared after a pointer so it is already is 2 byte aligned.
> > 
> > A lot of drivers wouldn't work otherwise.
> 
> Maybe these drivers are only used on arches where this does not matter.

Possible.

I had only grepped through the sources looking for
declarations using:

$ git grep -B1 '\[ETH_ALEN\];' -- '*.c' | grep -A1 '\*'

It's quite a few files in net/ too btw.

I still think adding __align(<even#>) is unnecessary here unless
it follows something like a bool or a u8.



^ permalink raw reply

* Re: Kernel panic in eth_header
From: Eric Dumazet @ 2019-02-05 20:13 UTC (permalink / raw)
  To: Florian Fainelli, Eric Dumazet, Andrew, Netdev
In-Reply-To: <3ed6d5b9-f2ef-8bd1-f7b4-c4e1d8a540fd@gmail.com>



On 02/05/2019 11:34 AM, Florian Fainelli wrote:
> On 2/5/19 8:57 AM, Eric Dumazet wrote:
>>
>>
>> On 02/05/2019 08:29 AM, Andrew wrote:
>>> Hi all.
>>>
>>> After upgrade on PPPoE BRAS to kernel 4.9.153 I've got an kernel panic after a 3 days of uptime.
>>>
>>> Unfortunately kernel is compiled w/o debug info; I rebuilt kernel with debug info enabled (kernel is compiled with same function addresses - I compare vmlinux symbol maps) - it says that panic is in net/ethernet/eth.c:88
>>>
>>> Below there is a kernel panic trace. igb is from vendor, ver. 5.3.5.4. What extra info is needed?
>>>
>>> [263565.106441] BUG: unable to handle kernel paging request at ffff88015a4d2dd4
>>> [263565.113527] IP: [<ffffffff8158e48b>] eth_header+0x3b/0xc0
>>> [263565.119030] PGD 1e8f067 [263565.121474] PUD 0
>>> [263565.123580]
>>> [263565.125166] Oops: 0002 [#1] SMP
>>> [263565.128398] Modules linked in: xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 iptable_filter xt_length xt_TCPMSS xt_tcpudp xt_mark xt_dscp iptable_mangle ip_tables x_tables nf_nat_pptp nf_conntrack_pptp nf_conntrack_proto_gre nf_nat_proto_gre nf_nat nf_conntrack sch_sfq sch_htb cls_u32 sch_ingress sch_prio sch_tbf cls_flow cls_fw act_police ifb 8021q mrp garp stp llc softdog pppoe pppox ppp_generic slhc i2c_nforce2 i2c_core igb(O) parport_pc dca parport thermal asus_atk0110 fan ptp k10temp hwmon pps_core nv_tco
>>> [263565.176083] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G           O    4.9.153-x86_64 #1
>>> [263565.183996] Hardware name: System manufacturer System Product Name/M2N-E, BIOS ASUS M2N-E ACPI BIOS Revision 5001 03/23/2010
>>> [263565.195289] task: ffff88007d0f5200 task.stack: ffffc9000006c000
>>> [263565.201295] RIP: 0010:[<ffffffff8158e48b>] [<ffffffff8158e48b>] eth_header+0x3b/0xc0
>>> [263565.209225] RSP: 0018:ffff88007fa83c58  EFLAGS: 00010286
>>> [263565.214622] RAX: ffff88015a4d2dc8 RBX: 0000000000000008 RCX: ffff8800682434a0
>>> [263565.221843] RDX: ffff88015a4d2dc8 RSI: ffff88015a4d2dc8 RDI: ffff880077aab000
>>> [263565.229062] RBP: ffff88007b663d90 R08: ffff88007b663d90 R09: 0000000000000574
>>> [263565.236281] R10: ffff88007d1fa000 R11: 0000000000000000 R12: ffff8800682434a0
>>> [263565.243501] R13: ffff88007d1fa000 R14: 0000000000000574 R15: 0000000000000008
>>> [263565.250719] FS:  0000000000000000(0000) GS:ffff88007fa80000(0000) knlGS:0000000000000000
>>> [263565.258894] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [263565.264725] CR2: ffff88015a4d2dd4 CR3: 000000007ad73000 CR4: 00000000000006f0
>>> [263565.271944] Stack:
>>> [263565.274041]  ffff880077aab000 ffff880068243400 ffff88007a745000 ffff8800682434a0
>>> [263565.281582]  0000000000000002 ffffffff81571d09 ffff880068243400 ffff88007fa83d00
>>> [263565.289121]  ffff88007a745000 ffff880077aab000 ffff88007a712000 ffffffff815a8c61
>>> [263565.296661] Call Trace:
>>> [263565.299193]  <IRQ> [263565.301205] [<ffffffff81571d09>] ? neigh_connected_output+0xa9/0x100
>>> [263565.307740]  [<ffffffff815a8c61>] ? ip_finish_output2+0x221/0x400
>>> [263565.313920]  [<ffffffff8159e144>] ? nf_iterate+0x54/0x60
>>> [263565.319319]  [<ffffffff815ab2fa>] ? ip_output+0x6a/0xf0
>>> [263565.324631]  [<ffffffff8159e102>] ? nf_iterate+0x12/0x60
>>> [263565.330030]  [<ffffffff815aa6e0>] ? ip_fragment.constprop.5+0x80/0x80
>>> [263565.336556]  [<ffffffff815a73b6>] ? ip_forward+0x396/0x480
>>> [263565.342128]  [<ffffffff815a6fb0>] ? ip_check_defrag+0x1e0/0x1e0
>>> [263565.348134]  [<ffffffff815a5a2e>] ? ip_rcv+0x2ae/0x370
>>> [263565.353361]  [<ffffffffa0107c02>] ? pppoe_rcv_core+0xd2/0x160 [pppoe]
>>> [263565.359888]  [<ffffffff815a5170>] ? ip_local_deliver_finish+0x1d0/0x1d0
>>> [263565.366586]  [<ffffffff81562a57>] ? __netif_receive_skb_core+0x527/0xa80
>>> [263565.373373]  [<ffffffff81567632>] ? process_backlog+0x92/0x130
>>> [263565.379291]  [<ffffffff8156745d>] ? net_rx_action+0x24d/0x390
>>> [263565.385124]  [<ffffffff81628374>] ? __do_softirq+0xf4/0x2a0
>>> [263565.390784]  [<ffffffff8107136c>] ? irq_exit+0xbc/0xd0
>>> [263565.396008]  [<ffffffff81626cd6>] ? call_function_single_interrupt+0x96/0xa0
>>> [263565.403141]  <EOI> [263565.405153] [<ffffffff81623eb0>] ? __sched_text_end+0x2/0x2
>>> [263565.410907]  [<ffffffff81624182>] ? native_safe_halt+0x2/0x10
>>> [263565.416741]  [<ffffffff81623ec8>] ? default_idle+0x18/0xd0
>>> [263565.422314]  [<ffffffff810a7a46>] ? cpu_startup_entry+0x126/0x220
>>> [263565.428492]  [<ffffffff8104c261>] ? start_secondary+0x161/0x180
>>> [263565.434496] Code: 0e 00 00 00 53 89 d3 49 89 cc 4c 89 c5 45 89 ce e8 bb 8a fc ff 66 83 fb 01 48 89 c6 74 44 66 83 fb 04 74 3e 66 c1 c3 08 48 85 ed <66> 89 58 0c 74 40 8b 45 00 4d 85 e4 89 46 06 0f b7 45 04 66 89
>>> [263565.454534] RIP  [<ffffffff8158e48b>] eth_header+0x3b/0xc0
>>> [263565.460124]  RSP <ffff88007fa83c58>
>>> [263565.463696] CR2: ffff88015a4d2dd4
>>> [263565.467104] ---[ end trace a1bcaf3618724adf ]---
>>> [263565.471807] Kernel panic - not syncing: Fatal exception in interrupt
>>> [263565.478245] Kernel Offset: disabled
>>> [263565.481818] Rebooting in 5 seconds..
>>>
>>
>>
>> This is a well known issue, a fix should come shortly in stable branches
> 
> Is Peter or yourself doing the backport? David would only take care of
> the most two recent stable kernels.
> 
> Sorry about missing that change as part of the fragmenstack backport to
> 4.9...


Greg took care of this for the trees he manages.




^ permalink raw reply

* Re: [PATCH v2 2/2] r8169: Avoid pointer aliasing
From: Eric Dumazet @ 2019-02-05 20:04 UTC (permalink / raw)
  To: Joe Perches, David Miller, thierry.reding
  Cc: hkallweit1, andrew, nic_swsd, netdev, linux-kernel
In-Reply-To: <8553086eaec167846992f1cff12aa388cee81767.camel@perches.com>



On 02/05/2019 10:42 AM, Joe Perches wrote:
> 
> It's declared after a pointer so it is already is 2 byte aligned.
> 
> A lot of drivers wouldn't work otherwise.

Maybe these drivers are only used on arches where this does not matter.



^ permalink raw reply

* RE: lan78xx: WARNING: irq 79 handler enabled interrupts
From: Stefan Wahren @ 2019-02-05 19:57 UTC (permalink / raw)
  To: netdev; +Cc: eric, UNGLinuxDriver, marc.zyngier, linux-arm-kernel, Woojung.Huh
In-Reply-To: <958035754.310420.1546378307673@email.ionos.de>

Hi,

> Stefan Wahren <stefan.wahren@i2se.com> hat am 1. Januar 2019 um 22:31 geschrieben:
> 
> 
> Hi Woojung,
> 
> > Woojung.Huh@microchip.com hat am 30. Dezember 2018 um 04:25 geschrieben:
> > 
> > 
> > HI Marc & Stephen,
> > 
> > Most of engineers are out until New Year's Day.
> 
> thanks. I didn't expect a reply that fast.
> 
> >  
> > LAN78xx driver uses irq_domain for phy interrupt, but smsc95xx uses polling.
> > Need to check flow again, you can try that comment out "lan78xx_setup_irq_domain" to
> > make dev->domain_data.phyirq = 0 which forces PHY polling.
> 
> I tested your suggestion with multi_v7_defconfig (32 bit) and arm64/defconfig.
> The warning disappeared and Ethernet is still working.
> 
> Only the old issue that we can't receive until a first packet has been send out reappear. But this should be manageable.
> 

i got informed that the engineers are busy with other issues and come back later to this :-(

Since i'm getting requests to provide my PHY polling patch, here it is:

diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index e96bc0c..a5bb292 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -2982,13 +2982,6 @@ static int lan78xx_bind(struct lan78xx_net *dev, struct usb_interface *intf)
 
 	dev->net->hw_features = dev->net->features;
 
-	ret = lan78xx_setup_irq_domain(dev);
-	if (ret < 0) {
-		netdev_warn(dev->net,
-			    "lan78xx_setup_irq_domain() failed : %d", ret);
-		goto out1;
-	}
-
 	dev->net->hard_header_len += TX_OVERHEAD;
 	dev->hard_mtu = dev->net->mtu + dev->net->hard_header_len;
 
@@ -2996,13 +2989,13 @@ static int lan78xx_bind(struct lan78xx_net *dev, struct usb_interface *intf)
 	ret = lan78xx_reset(dev);
 	if (ret) {
 		netdev_warn(dev->net, "Registers INIT FAILED....");
-		goto out2;
+		goto out1;
 	}
 
 	ret = lan78xx_mdio_init(dev);
 	if (ret) {
 		netdev_warn(dev->net, "MDIO INIT FAILED.....");
-		goto out2;
+		goto out1;
 	}
 
 	dev->net->flags |= IFF_MULTICAST;
@@ -3011,9 +3004,6 @@ static int lan78xx_bind(struct lan78xx_net *dev, struct usb_interface *intf)
 
 	return ret;
 
-out2:
-	lan78xx_remove_irq_domain(dev);
-
 out1:
 	netdev_warn(dev->net, "Bind routine FAILED");
 	cancel_work_sync(&pdata->set_multicast);

^ permalink raw reply related

* Re: [PATCH net] net: defxx: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
From: Maciej W. Rozycki @ 2019-02-05 19:57 UTC (permalink / raw)
  To: Yang Wei; +Cc: netdev, David S. Miller, yang.wei9
In-Reply-To: <1549382464-5138-1-git-send-email-albin_yang@163.com>

On Wed, 6 Feb 2019, Yang Wei wrote:

> From: Yang Wei <yang.wei9@zte.com.cn>
> 
> dev_consume_skb_irq() should be called in dfx_xmt_done() when skb
> xmit done. It makes drop profiles(dropwatch, perf) more friendly.
> 
> Signed-off-by: Yang Wei <yang.wei9@zte.com.cn>

Reviewed-by: Maciej W. Rozycki <macro@linux-mips.org>

 It looks to me the driver has to be reviewed WRT `dev_kfree_skb' vs 
`kfree_skb' use too.  I'll have a look into it unless you are happy to do 
that.

 Thanks for your contribution!

  Maciej

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox