Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next v2 16/21] net/ipv6: Add gfp_flags to route add functions
From: Eric Dumazet @ 2018-04-18 17:06 UTC (permalink / raw)
  To: David Ahern, netdev
  Cc: davem, idosch, roopa, eric.dumazet, weiwan, kafai, yoshfuji
In-Reply-To: <20180418003327.19992-17-dsahern@gmail.com>



On 04/17/2018 05:33 PM, David Ahern wrote:
> Most FIB entries can be added using memory allocated with GFP_KERNEL.
> Add gfp_flags to ip6_route_add and addrconf_dst_alloc. Code paths that
> can be reached from the packet path (e.g., ndisc and autoconfig) or
> atomic notifiers use GFP_ATOMIC; paths from user context (adding
> addresses and routes) use GFP_KERNEL.
> 
> Signed-off-by: David Ahern <dsahern@gmail.com>
> ---


Hmmm....

ipv6_ifa_notify() calls __ipv6_ifa_notify() under rcu_read_lock_bh()/rcu_read_unlock_bh()

So using GFP_KERNEL in __ipv6_ifa_notify() is certainly not allowed.

^ permalink raw reply

* RE: SRIOV switchdev mode BoF minutes
From: Parikh, Neerav @ 2018-04-18 17:07 UTC (permalink / raw)
  To: Andy Gospodarek, Jakub Kicinski
  Cc: Or Gerlitz, Samudrala, Sridhar, David Miller, Singhai, Anjali,
	Michael Chan, Simon Horman, John Fastabend, Saeed Mahameed,
	Jiri Pirko, Rony Efraim, Linux Netdev List
In-Reply-To: <20180418151529.GL33938@C02RW35GFVH8.dhcp.broadcom.net>



> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org]
> On Behalf Of Andy Gospodarek
> Sent: Wednesday, April 18, 2018 8:15 AM
> To: Jakub Kicinski <jakub.kicinski@netronome.com>
> Cc: Andy Gospodarek <andrew.gospodarek@broadcom.com>; Or Gerlitz
> <gerlitz.or@gmail.com>; Samudrala, Sridhar <sridhar.samudrala@intel.com>;
> David Miller <davem@davemloft.net>; Singhai, Anjali
> <anjali.singhai@intel.com>; Michael Chan <michael.chan@broadcom.com>;
> Simon Horman <simon.horman@netronome.com>; John Fastabend
> <john.fastabend@gmail.com>; Saeed Mahameed <saeedm@mellanox.com>;
> Jiri Pirko <jiri@mellanox.com>; Rony Efraim <ronye@mellanox.com>; Linux
> Netdev List <netdev@vger.kernel.org>
> Subject: Re: SRIOV switchdev mode BoF minutes
> 
> On Tue, Apr 17, 2018 at 04:19:15PM -0700, Jakub Kicinski wrote:
> > On Tue, 17 Apr 2018 10:47:00 -0400, Andy Gospodarek wrote:
> > > There is also a school of thought that the VF reps could be
> > > pre-allocated on the SmartNIC so that any application processing that
> > > traffic would sit idle when no traffic arrives on the rep, but could
> > > process frames that do arrive when the VFs were created on the host.
> > > This implementation will depend on how resources are allocated on a
> > > given bit of hardware, but can really work well.
> >
> > +1 if there is no FW resource allocation issues IMHO it's okay to
> > just show all reprs for "remote PCIes (PFs and VFs)" on the SmartNIC/
> > controller.  The reprs should just show link down as if PCIe cable
> > was unpluged until host actually enables them.
> 
> Yes we are on the same page on this.
> 
> > A similar issue exists on multi-host for PFs, right?  If one of the
> > hosts is down do we still show their PF repr?  IMHO yes.
> 
> I would agree with that as well.  With today's model the VF reps are
> created once a PF is put into switchdev mode, but I'm still working out
> how we want to consider whether or not a PF rep for the other domains is
> created locally or not and also how one can determine which domain is in
> control.
> 
> Permanent config options (like NVRAM settings) could easily handle which
> domain is in control, but that still does not mean that PF reps must be
> created automatically, does it?
> 
> > That makes the thing looks more like a switch with cables being plugged
> > in and out.
> 
> Yes, that's exactly how I view it as well.

If we need to behave like a switch or emulate that mode then is there a 
thought around the usability model?

So, while whichever domain is in control the implication above is that the
max number of vports supported (VFs and PFs) will need to be represented
regardless of whether they're "Enabled" or not in a given "Switch".
By "Enabled" I mean SRIOV VFs may not have been enabled but still the
representor exists.
For example if there are 256 VFs supported on a given PF when someone
switches into the switchdev mode there will be ~257 representor netdevs
added into the system. And if you've multi-port, multi-function devices
the number of representor netdevs will increase accordingly.
While representor netdevs' naming may help a bit here but a user will 
need to determine and differentiate between the sprawl of representors
netdevs and data netdevs to identify which all can be added into an 
OVS bridge (or vSwitch).
And switching to the "switchdev mode" becomes a pre-requisite before
any of the vSwitch bridges that uses these representor netdevs.

While, in a SmartNIC where users are not managing the devices this may
be deployed based on the NIC FW/SW capabilities. 
But, I'm not sure how the same model applied on a standard host running 
Linux will work across devices.

^ permalink raw reply

* Re: [PATCH net-next v2 16/21] net/ipv6: Add gfp_flags to route add functions
From: David Ahern @ 2018-04-18 17:10 UTC (permalink / raw)
  To: Eric Dumazet, netdev; +Cc: davem, idosch, roopa, weiwan, kafai, yoshfuji
In-Reply-To: <7dd4a7ae-f940-70fd-fab8-3376d4662f70@gmail.com>

On 4/18/18 11:06 AM, Eric Dumazet wrote:
> 
> 
> On 04/17/2018 05:33 PM, David Ahern wrote:
>> Most FIB entries can be added using memory allocated with GFP_KERNEL.
>> Add gfp_flags to ip6_route_add and addrconf_dst_alloc. Code paths that
>> can be reached from the packet path (e.g., ndisc and autoconfig) or
>> atomic notifiers use GFP_ATOMIC; paths from user context (adding
>> addresses and routes) use GFP_KERNEL.
>>
>> Signed-off-by: David Ahern <dsahern@gmail.com>
>> ---
> 
> 
> Hmmm....
> 
> ipv6_ifa_notify() calls __ipv6_ifa_notify() under rcu_read_lock_bh()/rcu_read_unlock_bh()
> 
> So using GFP_KERNEL in __ipv6_ifa_notify() is certainly not allowed.
> 
> 

Thanks for catching that. Will add fix to my followup set.

^ permalink raw reply

* Re: [PATCH net-next 2/2] udp: implement and use per cpu rx skbs cache
From: Paolo Abeni @ 2018-04-18 17:15 UTC (permalink / raw)
  To: Eric Dumazet, netdev; +Cc: David S. Miller
In-Reply-To: <890db004-4dfe-7f77-61ee-1ac0d7d2a24c@gmail.com>

Hi,

On Wed, 2018-04-18 at 09:56 -0700, Eric Dumazet wrote:
> 
> On 04/18/2018 03:22 AM, Paolo Abeni wrote:
> > This changeset extends the idea behind commit c8c8b127091b ("udp:
> > under rx pressure, try to condense skbs"), trading more BH cpu
> > time and memory bandwidth to decrease the load on the user space
> > receiver.
> > 
> > At boot time we allocate a limited amount of skbs with small
> > data buffer, storing them in per cpu arrays. Such skbs are never
> > freed.
> > 
> > At run time, under rx pressure, the BH tries to copy the current
> > skb contents into the cache - if the current cache skb is available,
> > and the ingress skb is small enough and without any head states.
> > 
> > When using the cache skb, the ingress skb is dropped by the BH
> > - while still hot on cache - and the cache skb is inserted into
> > the rx queue, after increasing its usage count. Also, the cache
> > array index is moved to the next entry.
> > 
> > The receive side is unmodified: in udp_rcvmsg() the usage skb
> > usage count is decreased and the skb is _not_ freed - since the
> > cache keeps usage > 0. Since skb->usage is hot in the cache of the
> > receiver at consume time - the receiver has just read skb->data,
> > which lies in the same cacheline - the whole skb_consume_udp() becomes
> > really cheap.
> > 
> > UDP receive performances under flood improve as follow:
> > 
> > NR RX queues	Kpps	Kpps	Delta (%)
> > 		Before	After
> > 
> > 1		2252	2305	2
> > 2		2151	2569	19
> > 4		2033	2396	17
> > 8		1969	2329	18
> > 
> > Overall performances of knotd DNS server under real traffic flood
> > improves as follow:
> > 
> > 		Kpps	Kpps	Delta (%)
> > 		Before	After
> > 
> > 		3777	3981	5
> 
> 
> It might be time for knotd DNS server to finally use SO_REUSEPORT instead of
> adding this bloat to the kernel ?
> 
> Sorry, 5% improvement while you easily can get 300% improvement with no kernel change
> is not appealing to me :/

Thank you for the feedback.
Sorry for not being clear about it, but knotd is using SO_REUSEPORT and
the above tests are leveraging it.

That 5% is on top of that 300%.

Cheers,

Paolo

^ permalink raw reply

* Re: [PATCH bpf-next v4 03/10] bpf: btf: Check members of struct/union
From: Jakub Kicinski @ 2018-04-18 17:22 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: netdev, Alexei Starovoitov, Daniel Borkmann, kernel-team
In-Reply-To: <20180417204243.4028831-4-kafai@fb.com>

On Tue, 17 Apr 2018 13:42:36 -0700, Martin KaFai Lau wrote:
> This patch checks a few things of struct's members:
> 
> 1) It has a valid size (e.g. a "const void" is invalid)
> 2) A member's size (+ its member's offset) does not exceed
>    the containing struct's size.
> 3) The member's offset satisfies the alignment requirement

Could we also introduce a requirement for members to have different
names?  Maybe it's there but I missed it.  Would BTF with duplicated
member names be considered valid?

> The above can only be done after the needs_resolve member's type
> is resolved.  Hence, the above is done together in
> btf_struct_resolve().
> 
> Each possible member's type (e.g. int, enum, modifier...) implements
> the check_member() ops which will be called from btf_struct_resolve().
> 
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> Acked-by: Alexei Starovoitov <ast@fb.com>

^ permalink raw reply

* Re: [PATCH RFC net-next 00/11] udp gso
From: David Miller @ 2018-04-18 17:24 UTC (permalink / raw)
  To: pabeni; +Cc: willemdebruijn.kernel, netdev, willemb
In-Reply-To: <1524050274.2599.21.camel@redhat.com>

From: Paolo Abeni <pabeni@redhat.com>
Date: Wed, 18 Apr 2018 13:17:54 +0200

> When testing with Spectre/Meltdown mitigation in places, I expect
> that the most relevant part of the gain is due to the single syscall
> per burst.

I was going to say exactly this.

Batching schemes that were borderline beneficial before spectre/meltdown
are more worthwhile now.

^ permalink raw reply

* Re: SRIOV switchdev mode BoF minutes
From: Andy Gospodarek @ 2018-04-18 17:25 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Andy Gospodarek, Or Gerlitz, Samudrala, Sridhar, David Miller,
	Anjali Singhai Jain, Michael Chan, Simon Horman, John Fastabend,
	Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List
In-Reply-To: <20180418092634.13488a33@cakuba.netronome.com>

On Wed, Apr 18, 2018 at 09:26:34AM -0700, Jakub Kicinski wrote:
> On Wed, 18 Apr 2018 11:15:29 -0400, Andy Gospodarek wrote:
> > > A similar issue exists on multi-host for PFs, right?  If one of the
> > > hosts is down do we still show their PF repr?  IMHO yes.  
> > 
> > I would agree with that as well.  With today's model the VF reps are
> > created once a PF is put into switchdev mode, but I'm still working out
> > how we want to consider whether or not a PF rep for the other domains is
> > created locally or not and also how one can determine which domain is in
> > control.
> > 
> > Permanent config options (like NVRAM settings) could easily handle which
> > domain is in control, but that still does not mean that PF reps must be
> > created automatically, does it?
> 
> The control domain is tricky.  I'm not sure I understand how you could
> not have a PF rep for remote domains, though.  How do you configure
> switching to the PF netdev if there is no rep?

Yes, for complete control of all traffic using standard Linux APIs a PF
rep is a requirement.

^ permalink raw reply

* Re: [PATCH RFC net-next 00/11] udp gso
From: David Miller @ 2018-04-18 17:28 UTC (permalink / raw)
  To: sowmini.varadhan
  Cc: willemdebruijn.kernel, sridhar.samudrala, netdev, willemb
In-Reply-To: <20180418123103.GC19633@oracle.com>

From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Wed, 18 Apr 2018 08:31:03 -0400

> However, I share Sridhar's concerns about the very fundamental change
> to UDP message boundary semantics here.  There is actually no such thing
> as a "segment" in udp, so in general this feature makes me a little
> uneasy.  Well behaved udp applications should already be sending mtu
> sized datagrams. And the not-so-well-behaved ones are probably relying
> on IP fragmentation/reassembly to take care of datagram boundary semantics
> for them?
> 
> As Sridhar points out, the feature is not really "negotiated" - one side
> unilaterally sets the option. If the receiver is a classic/POSIX UDP
> implementation, it will have no way of knowing that message boundaries
> have been re-adjusted at the sender.  

There are no "semantics".

What ends up on the wire is the same before the kernel/app changes as
afterwards.

The only difference is that instead of the application doing N - 1
sendmsg() calls with mtu sized writes, it's giving everything all at
once and asking the kernel to segment.

It even gives the application control over the size of the packets,
which I think is completely prudent in this situation.

^ permalink raw reply

* Re: [PATCH 1/2] net: netsec: enable tx-irq during open callback
From: David Miller @ 2018-04-18 17:30 UTC (permalink / raw)
  To: jassisinghbrar; +Cc: netdev, masahisa.kojima, ard.biesheuvel, jaswinder.singh
In-Reply-To: <CABb+yY3ikHT54JTH0eFDjTEpV=z1aNNrTeagC6dWbRvODCPg2w@mail.gmail.com>

From: Jassi Brar <jassisinghbrar@gmail.com>
Date: Wed, 18 Apr 2018 18:27:59 +0530

> Just to make sure, let me please mention that c009f413b79de52 and
> 9a00b697ce31e are very much needed in stable kernel. Without these we
> couldn't install any OS over network.

Ok, both patches in this series queued up for -stable.

^ permalink raw reply

* Re: [PATCH net-next v2 16/21] net/ipv6: Add gfp_flags to route add functions
From: Eric Dumazet @ 2018-04-18 17:30 UTC (permalink / raw)
  To: David Ahern, Eric Dumazet, netdev
  Cc: davem, idosch, roopa, weiwan, kafai, yoshfuji
In-Reply-To: <77bf271b-7a66-8b53-ea6f-be86bb65590c@gmail.com>



On 04/18/2018 10:10 AM, David Ahern wrote:
> On 4/18/18 11:06 AM, Eric Dumazet wrote:
>>
>>
>> On 04/17/2018 05:33 PM, David Ahern wrote:
>>> Most FIB entries can be added using memory allocated with GFP_KERNEL.
>>> Add gfp_flags to ip6_route_add and addrconf_dst_alloc. Code paths that
>>> can be reached from the packet path (e.g., ndisc and autoconfig) or
>>> atomic notifiers use GFP_ATOMIC; paths from user context (adding
>>> addresses and routes) use GFP_KERNEL.
>>>
>>> Signed-off-by: David Ahern <dsahern@gmail.com>
>>> ---
>>
>>
>> Hmmm....
>>
>> ipv6_ifa_notify() calls __ipv6_ifa_notify() under rcu_read_lock_bh()/rcu_read_unlock_bh()
>>
>> So using GFP_KERNEL in __ipv6_ifa_notify() is certainly not allowed.
>>
>>
> 
> Thanks for catching that. Will add fix to my followup set.
> 
BTW, I am not sure why we use rcu_read_lock_bh()/rcu_read_unlock_bh() there :/

Maybe it is no longer needed.

^ permalink raw reply

* Re: [PATCH RFC net-next 00/11] udp gso
From: David Miller @ 2018-04-18 17:34 UTC (permalink / raw)
  To: sowmini.varadhan
  Cc: eric.dumazet, willemdebruijn.kernel, sridhar.samudrala, netdev,
	willemb
In-Reply-To: <20180418134706.GD19633@oracle.com>

From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Wed, 18 Apr 2018 09:47:06 -0400

> - in the "GSO" proposal my 2000  bytes of data are sent as *two*
>   udp packets, each of them with a unique udp header, and uh_len set
>   to 1476 (for first) and 526 (for second). The receiver has no clue
>   that they are both part of the same UDP datagram, So wire format
>   is not the same, am I mistaken?

The sender is asking for two packets and this exact packetization,
that's why they explicitly set the socket option or provide the
cmsg value.

As Eric said, this is just exchanging N sendmsg() calls with 1.

^ permalink raw reply

* Re: [PATCH v3] net: davicom: dm9000: Avoid spinlock recursion during dm9000_timeout routine
From: David Miller @ 2018-04-18 17:35 UTC (permalink / raw)
  To: liuxiang_1999; +Cc: liu.xiang6, netdev, linux-kernel
In-Reply-To: <72770aae.bb31.162d9036874.Coremail.liuxiang_1999@126.com>

From: liuxiang  <liuxiang_1999@126.com>
Date: Wed, 18 Apr 2018 21:48:22 +0800 (CST)

> Because the timeout task gets the main spinlock and disable the
> current cpu's irq, there is no other task on the same cpu can run,
> and tasks on the other cpus can not enter the dm9000_timeout()
> again. So in the whole dm9000_timeout() routine, db->timeout_cpu can
> not be changed by other tasks. Although smp_processor_id() may
> change after preempt_enable(), these tasks always get the false
> result when call dm9000_current_in_timeout.  Only the timeout task
> get the true result. And if there is no timeout, all the tasks that
> want to do asynchronous phy operation get the false result. So I
> think this can avoid racy.

This is a very shaky foundation upon which to base the correctness
of your driver's synchronization in my opinion.

^ permalink raw reply

* Re: [PATCH bpf-next v3 04/11] bpf: make bnxt compatible w/ bpf_xdp_adjust_tail
From: Michael Chan @ 2018-04-18 17:39 UTC (permalink / raw)
  To: Nikita V. Shirokov; +Cc: Alexei Starovoitov, Daniel Borkmann, Netdev
In-Reply-To: <20180418044223.17685-5-tehnerd@tehnerd.com>

On Tue, Apr 17, 2018 at 9:42 PM, Nikita V. Shirokov <tehnerd@tehnerd.com> wrote:
> w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
> well (only "decrease" of pointer's location is going to be supported).
> changing of this pointer will change packet's size.
> for bnxt driver we will just calculate packet's length unconditionally
>
> Acked-by: Alexei Starovoitov <ast@kernel.org>
> Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>

Acked-by: Michael Chan <michael.chan@broadcom.com>

^ permalink raw reply

* Re: [PATCH RFC net-next 00/11] udp gso
From: David Miller @ 2018-04-18 17:40 UTC (permalink / raw)
  To: willemdebruijn.kernel
  Cc: sowmini.varadhan, eric.dumazet, sridhar.samudrala, netdev,
	willemb
In-Reply-To: <CAF=yD-LkTpVQ_8F2oTLeEDKcNWYuvn3QoSrSa0y5j7zgB6Em9A@mail.gmail.com>

From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Wed, 18 Apr 2018 09:51:50 -0400

> Eric is correct. If the application sets a segment size with UDP_SEGMENT
> this is an instruction to the kernel to split the payload along that border into
> separate discrete datagrams.
> 
> It does not matter what the behavior is without setting this option. If a
> process wants to send a larger than MTU datagram and rely on the
> kernel to fragment, then it should not set the option.

+1

^ permalink raw reply

* Re: [PATCH] net: don't use kvzalloc for DMA memory
From: David Miller @ 2018-04-18 17:45 UTC (permalink / raw)
  To: eric.dumazet
  Cc: mpatocka, edumazet, joby.poriyath, bhutchings, netdev,
	linux-kernel
In-Reply-To: <3e65977e-53cd-bf09-bc4b-0ce40e9091fe@gmail.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 18 Apr 2018 09:05:54 -0700

> Each virtio_net should probably allocate the exact amount of
> DMA-memory it wants, instead of expecting core networking stack to
> have a huge chunk of DMA-memory for everything.

Yes, if you need DMA'able memory, allocate DMA'able memory separately
and hang it off of the netdev instead of assuming you can just DMA
in/out of it.

^ permalink raw reply

* Re: [PATCH] net: don't use kvzalloc for DMA memory
From: David Miller @ 2018-04-18 17:46 UTC (permalink / raw)
  To: mpatocka
  Cc: eric.dumazet, edumazet, joby.poriyath, bhutchings, netdev,
	linux-kernel, mst, jasowang, virtualization
In-Reply-To: <alpine.LRH.2.02.1804181218270.19136@file01.intranet.prod.int.rdu2.redhat.com>

From: Mikulas Patocka <mpatocka@redhat.com>
Date: Wed, 18 Apr 2018 12:44:25 -0400 (EDT)

> The structure net_device is followed by arbitrary driver-specific data 
> (accessible with the function netdev_priv). And for virtio-net, these 
> driver-specific data must be in DMA memory.

And we are saying that this assumption is wrong and needs to be
corrected.

^ permalink raw reply

* Re: [PATCH bpf-next v4 07/10] bpf: btf: Add pretty print support to the basic arraymap
From: Martin KaFai Lau @ 2018-04-18 17:46 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: netdev, Alexei Starovoitov, kernel-team, jakub.kicinski
In-Reply-To: <ecbcec5b-6e74-333a-de4d-3114aec27833@iogearbox.net>

On Wed, Apr 18, 2018 at 05:20:11PM +0200, Daniel Borkmann wrote:
> Hi Martin,
> 
> first of all great work on the set! One issue that puzzled me
> while digesting it further below.
> 
> On 04/17/2018 10:42 PM, Martin KaFai Lau wrote:
> > This patch adds pretty print support to the basic arraymap.
> > Support for other bpf maps can be added later.
> > 
> > This patch adds new attrs to the BPF_MAP_CREATE command to allow
> > specifying the btf_fd, btf_key_id and btf_value_id.  The
> > BPF_MAP_CREATE can then associate the btf to the map if
> > the creating map supports BTF.
> 
> Feels like this patch is doing two things at once, meaning i)
> attaching btf object to map object through bpf syscall at map
> creation time, and ...
> 
> > A BTF supported map needs to implement two new map ops,
> > map_seq_show_elem() and map_check_btf().  This patch has
> > implemented these new map ops for the basic arraymap.
> > 
> > It also adds file_operations to the pinned map
> > such that the pinned map can be opened and read.
> 
> ... ii) pretty print map dump via bpf fs for array map.
> 
> > Here is a sample output when reading a pinned arraymap
> > with the following map's value:
> > 
> > struct map_value {
> > 	int count_a;
> > 	int count_b;
> > };
> > 
> > cat /sys/fs/bpf/pinned_array_map:
> > 
> > 0: {1,2}
> > 1: {3,4}
> > 2: {5,6}
> > ...
> 
> Rather than adding this to the bpf fs itself, why not add full BTF
> support for the main debugging and introspection tool we have and
> ship with the kernel for BPF, namely bpftool? You can already dump
> the whole map's key/value pairs via the following command from a
> pinned file:
> 
>   bpftool map dump pinned /sys/fs/bpf/pinned_array_map
> 
> And given we already export the BTF info in your earlier patch through
> the BPF_OBJ_GET_INFO_BY_FD, this would fit perfectly for bpftool
> integration instead where the pretty-print which is done through the
> extra cb map_seq_show_elem (which only does a map lookup and print
> anyway) in this patch can simply all be done in user space w/o any
> additional kernel complexity.
> 
> Aside that this would be very valuable there it would also nicely
> demonstrate usage of it, but more importantly we could avoid implementing
> such pretty-print callback in the kernel for every other map type and
> then having two locations where a user now needs to go for debugging
> (bpftool being one, and cat of pinned file the other; this split seems
> confusing from a user perspective, imho, but also single key lookup +
> pretty-print cannot be realized with the latter whereas it's trivial
> with bpftool).
> 
> The same could be done for bpftool map lookup, updates, deletions, etc
> where the key resp. key/value pair can be specified through a struct
> like initializer from cmdline. (But dump/lookup should be good enough
> starting point initially.) Thoughts?
Right, I have plan on the kernel side to allow that in bpftool.
There are a couple steps to be done first to allow getting the
BTF from a map-fd.

My idea is to have both (bpf fs, and userspace tool like bpftool).
If the user pinned it, it is more intuitive for cat/bash like
user to be able to cat it directly.  This 'cat bpffs/pathto/map'
would eventually be available to all map types with or without
map_lookup_elem support, as long as the map-type makes sense
for pretty print.

Going forward, bpf tool can do a more demanding jobs you mentioned.

> 
> Thanks again,
> Daniel
> 
> > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > Acked-by: Alexei Starovoitov <ast@fb.com>
> > ---
> >  include/linux/bpf.h      |  20 ++++++-
> >  include/uapi/linux/bpf.h |   3 +
> >  kernel/bpf/arraymap.c    |  50 ++++++++++++++++
> >  kernel/bpf/inode.c       | 146 ++++++++++++++++++++++++++++++++++++++++++++++-
> >  kernel/bpf/syscall.c     |  32 ++++++++++-
> >  5 files changed, 244 insertions(+), 7 deletions(-)
> 

^ permalink raw reply

* Re: [PATCH] net: don't use kvzalloc for DMA memory
From: David Miller @ 2018-04-18 17:47 UTC (permalink / raw)
  To: eric.dumazet
  Cc: mst, netdev, linux-kernel, virtualization, edumazet, mpatocka,
	joby.poriyath, bhutchings
In-Reply-To: <5f4e1286-b79f-0b9f-9a30-47d7654f3889@gmail.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 18 Apr 2018 09:51:25 -0700

> I suggest that virtio_net clearly identifies which part needs a specific allocation
> and does its itself, instead of abusing the netdev_priv storage.
> 
> Ie use a pointer to a block of memory, allocated by virtio_net, for virtio_net.

+1

^ permalink raw reply

* Re: [net 1/1] tipc: fix use-after-free in tipc_nametbl_stop
From: David Miller @ 2018-04-18 17:48 UTC (permalink / raw)
  To: jon.maloy
  Cc: netdev, mohan.krishna.ghanta.krishnamurthy, tung.q.nguyen,
	hoang.h.le, canh.d.luu, ying.xue, tipc-discussion
In-Reply-To: <1523993142-7202-1-git-send-email-jon.maloy@ericsson.com>

From: Jon Maloy <jon.maloy@ericsson.com>
Date: Tue, 17 Apr 2018 21:25:42 +0200

> When we delete a service item in tipc_nametbl_stop() we loop over
> all service ranges in the service's RB tree, and for each service
> range we loop over its pertaining publications while calling
> tipc_service_remove_publ() for each of them.
> 
> However, tipc_service_remove_publ() has the side effect that it also
> removes the comprising service range item when there are no publications
> left. This leads to a "use-after-free" access when the inner loop
> continues to the next iteration, since the range item holding the list
> we are looping no longer exists.
> 
> We fix this by moving the delete of the service range item outside
> the said function. Instead, we now let the two functions calling it
> test if the list is empty and perform the removal when that is the
> case.
> 
> Reported-by: syzbot+d64b64afc55660106556@syzkaller.appspotmail.com
> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>

Applied.

^ permalink raw reply

* Re: [net 1/1] tipc: fix infinite loop when dumping link monitor summary
From: David Miller @ 2018-04-18 17:49 UTC (permalink / raw)
  To: jon.maloy
  Cc: netdev, mohan.krishna.ghanta.krishnamurthy, tung.q.nguyen,
	hoang.h.le, canh.d.luu, ying.xue, tipc-discussion
In-Reply-To: <1523995107-7325-1-git-send-email-jon.maloy@ericsson.com>

From: Jon Maloy <jon.maloy@ericsson.com>
Date: Tue, 17 Apr 2018 21:58:27 +0200

> From: Tung Nguyen <tung.q.nguyen@dektech.com.au>
> 
> When configuring the number of used bearers to MAX_BEARER and issuing
> command "tipc link monitor summary", the command enters infinite loop
> in user space.
> 
> This issue happens because function tipc_nl_node_dump_monitor() returns
> the wrong 'prev_bearer' value when all potential monitors have been
> scanned.
> 
> The correct behavior is to always try to scan all monitors until either
> the netlink message is full, in which case we return the bearer identity
> of the affected monitor, or we continue through the whole bearer array
> until we can return MAX_BEARERS. This solution also caters for the case
> where there may be gaps in the bearer array.
> 
> Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>

Applied.

^ permalink raw reply

* [PATCH bpf] tools/bpf: fix test_sock and test_sock_addr.sh failure
From: Yonghong Song @ 2018-04-18 17:49 UTC (permalink / raw)
  To: ast, daniel, rdna, netdev; +Cc: kernel-team

The bpf selftests test_sock and test_sock_addr.sh failed
in my test machine. The failure looks like:
    $ ./test_sock
    Test case: bind4 load with invalid access: src_ip6 .. [PASS]
    Test case: bind4 load with invalid access: mark .. [PASS]
    Test case: bind6 load with invalid access: src_ip4 .. [PASS]
    Test case: sock_create load with invalid access: src_port .. [PASS]
    Test case: sock_create load w/o expected_attach_type (compat mode) .. [FAIL]
    Test case: sock_create load w/ expected_attach_type .. [FAIL]
    Test case: attach type mismatch bind4 vs bind6 .. [FAIL]
    ...
    Summary: 4 PASSED, 12 FAILED
    $ ./test_sock_addr.sh
    Wait for testing IPv4/IPv6 to become available .....
    ERROR: Timeout waiting for test IP to become available.

In test_sock, bpf program loads failed due to hitting memlock limits.
In test_sock_addr.sh, my test machine is a ipv6 only test box and using
"ping" without specifying address family for an ipv6 address does not work.

This patch fixed the issue by including header bpf_rlimit.h in test_sock.c
and test_sock_addr.c, and specifying address family for ping command.

Cc: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/testing/selftests/bpf/test_sock.c       | 1 +
 tools/testing/selftests/bpf/test_sock_addr.c  | 1 +
 tools/testing/selftests/bpf/test_sock_addr.sh | 4 ++--
 3 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_sock.c b/tools/testing/selftests/bpf/test_sock.c
index 73bb20c..f4d99fa 100644
--- a/tools/testing/selftests/bpf/test_sock.c
+++ b/tools/testing/selftests/bpf/test_sock.c
@@ -13,6 +13,7 @@
 #include <bpf/bpf.h>
 
 #include "cgroup_helpers.h"
+#include "bpf_rlimit.h"
 
 #ifndef ARRAY_SIZE
 # define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
diff --git a/tools/testing/selftests/bpf/test_sock_addr.c b/tools/testing/selftests/bpf/test_sock_addr.c
index d488f20..2950f80 100644
--- a/tools/testing/selftests/bpf/test_sock_addr.c
+++ b/tools/testing/selftests/bpf/test_sock_addr.c
@@ -15,6 +15,7 @@
 #include <bpf/libbpf.h>
 
 #include "cgroup_helpers.h"
+#include "bpf_rlimit.h"
 
 #define CG_PATH	"/foo"
 #define CONNECT4_PROG_PATH	"./connect4_prog.o"
diff --git a/tools/testing/selftests/bpf/test_sock_addr.sh b/tools/testing/selftests/bpf/test_sock_addr.sh
index c6e1dcf..9832a87 100755
--- a/tools/testing/selftests/bpf/test_sock_addr.sh
+++ b/tools/testing/selftests/bpf/test_sock_addr.sh
@@ -4,7 +4,7 @@ set -eu
 
 ping_once()
 {
-	ping -q -c 1 -W 1 ${1%%/*} >/dev/null 2>&1
+	ping -${1} -q -c 1 -W 1 ${2%%/*} >/dev/null 2>&1
 }
 
 wait_for_ip()
@@ -13,7 +13,7 @@ wait_for_ip()
 	echo -n "Wait for testing IPv4/IPv6 to become available "
 	for _i in $(seq ${MAX_PING_TRIES}); do
 		echo -n "."
-		if ping_once ${TEST_IPv4} && ping_once ${TEST_IPv6}; then
+		if ping_once 4 ${TEST_IPv4} && ping_once 6 ${TEST_IPv6}; then
 			echo " OK"
 			return
 		fi
-- 
2.9.5

^ permalink raw reply related

* Re: [PATCH] net: don't use kvzalloc for DMA memory
From: Mikulas Patocka @ 2018-04-18 17:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Eric Dumazet, Ben Hutchings, netdev,
	linux-kernel, Michael S. Tsirkin, Jason Wang, virtualization
In-Reply-To: <5f4e1286-b79f-0b9f-9a30-47d7654f3889@gmail.com>



On Wed, 18 Apr 2018, Eric Dumazet wrote:

> 
> 
> On 04/18/2018 09:44 AM, Mikulas Patocka wrote:
> > 
> > 
> > On Wed, 18 Apr 2018, Eric Dumazet wrote:
> > 
> >>
> >>
> >> On 04/18/2018 07:34 AM, Mikulas Patocka wrote:
> >>> The patch 74d332c13b21 changes alloc_netdev_mqs to use vzalloc if kzalloc
> >>> fails (later patches change it to kvzalloc).
> >>>
> >>> The problem with this is that if the vzalloc function is actually used, 
> >>> virtio_net doesn't work (because it expects that the extra memory should 
> >>> be accessible with DMA-API and memory allocated with vzalloc isn't).
> >>>
> >>> This patch changes it back to kzalloc and adds a warning if the allocated
> >>> size is too large (the allocation is unreliable in this case).
> >>>
> >>> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> >>> Fixes: 74d332c13b21 ("net: extend net_device allocation to vmalloc()")
> >>>
> >>> ---
> >>>  net/core/dev.c |    3 ++-
> >>>  1 file changed, 2 insertions(+), 1 deletion(-)
> >>>
> >>> Index: linux-2.6/net/core/dev.c
> >>> ===================================================================
> >>> --- linux-2.6.orig/net/core/dev.c	2018-04-16 21:08:36.000000000 +0200
> >>> +++ linux-2.6/net/core/dev.c	2018-04-18 16:24:43.000000000 +0200
> >>> @@ -8366,7 +8366,8 @@ struct net_device *alloc_netdev_mqs(int
> >>>  	/* ensure 32-byte alignment of whole construct */
> >>>  	alloc_size += NETDEV_ALIGN - 1;
> >>>  
> >>> -	p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
> >>> +	WARN_ON(alloc_size > PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER);
> >>> +	p = kzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
> >>>  	if (!p)
> >>>  		return NULL;
> >>>  
> >>>
> >>
> >> Since when a net_device needs to be in DMA zone ???
> >>
> >> I would rather fix virtio_net, this looks very suspect to me.
> >>
> >> Each virtio_net should probably allocate the exact amount of DMA-memory it wants,
> >> instead of expecting core networking stack to have a huge chunk of DMA-memory for everything.
> > 
> > The structure net_device is followed by arbitrary driver-specific data 
> > (accessible with the function netdev_priv). And for virtio-net, these 
> > driver-specific data must be in DMA memory.
> 
> I get that, but how is the original xenvif problem will be solved ?
> 
> Your patch would add a bug in some other driver(s)
> 
> I suggest that virtio_net clearly identifies which part needs a specific allocation
> and does its itself, instead of abusing the netdev_priv storage.
> 
> Ie use a pointer to a block of memory, allocated by virtio_net, for virtio_net.

There are drivers that need to do DMA to driver-specific area.
And there are drivers that need driver-specific area larger than kmalloc 
limit.

These are conflicting requirements and one of those drivers must be 
changed.

I suggest to change the drivers that need large driver-specific area. 
That's why I added the WARN_ON, so that they can be identified.

Mikulas

^ permalink raw reply

* Re: [PATCH RFC net-next 00/11] udp gso
From: David Miller @ 2018-04-18 17:50 UTC (permalink / raw)
  To: willemdebruijn.kernel; +Cc: netdev, willemb
In-Reply-To: <20180417200059.30154-1-willemdebruijn.kernel@gmail.com>

From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Tue, 17 Apr 2018 16:00:50 -0400

> Segmentation offload reduces cycles/byte for large packets by
> amortizing the cost of protocol stack traversal.
> 
> This patchset implements GSO for UDP.

This looks great.

And as mentioned in other emails, the interface looks good too by letting
the app choose the segmentation point.

It's a real shame we lose checksumming with MSG_MORE, perhaps we can find
a way to lift that restriction?

^ permalink raw reply

* Re: [PATCH v3 net,stable] tun: fix vlan packet truncation
From: David Miller @ 2018-04-18 17:52 UTC (permalink / raw)
  To: bjorn; +Cc: netdev, jasowang
In-Reply-To: <20180417204638.11800-1-bjorn@mork.no>

From: Bjørn Mork <bjorn@mork.no>
Date: Tue, 17 Apr 2018 22:46:38 +0200

> Bogus trimming in tun_net_xmit() causes truncated vlan packets.
> 
> skb->len is correct whether or not skb_vlan_tag_present() is true. There
> is no more reason to adjust the skb length on xmit in this driver than
> any other driver. tun_put_user() adds 4 bytes to the total for tagged
> packets because it transmits the tag inline to userspace.  This is
> similar to a nic transmitting the tag inline on the wire.
> 
> Reproducing the bug by sending any tagged packet through back-to-back
> connected tap interfaces:
 ...
> Fixes: aff3d70a07ff ("tun: allow to attach ebpf socket filter")
> Cc: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Bjørn Mork <bjorn@mork.no>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [RFC net-next PATCH 2/2] bpf: disallow XDP data_meta to overlap with xdp_frame area
From: Jesper Dangaard Brouer @ 2018-04-18 17:53 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Daniel Borkmann, Alexei Starovoitov, netdev, brouer
In-Reply-To: <51af68d8-3c02-b828-12b5-f2dc73406a65@iogearbox.net>

On Wed, 18 Apr 2018 18:21:21 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 04/18/2018 02:10 PM, Jesper Dangaard Brouer wrote:
> > If combining xdp_adjust_head and xdp_adjust_meta, then it is possible
> > to make data_meta overlap with area used by xdp_frame.  And another
> > invocation of xdp_adjust_head can then clear that area, due to
> > clearing of xdp_frame area.
> > 
> > The easiest solution I found was to simply not allow
> > xdp_buff->data_meta to overlap with area used by xdp_frame.  
> 
> Thanks Jesper! Trying to answer both emails in one. :) More below.
> 
> > Fixes: 6dfb970d3dbd ("xdp: avoid leaking info stored in frame data on page reuse")
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
> >  net/core/filter.c |   11 +++++++++++
> >  1 file changed, 11 insertions(+)
> > 
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 15e9b5477360..e3623e741181 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -2701,6 +2701,11 @@ BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset)
> >  		     data > xdp->data_end - ETH_HLEN))
> >  		return -EINVAL;
> >  
> > +	/* Disallow data_meta to use xdp_frame area */
> > +	if (metalen > 0 &&
> > +	    unlikely((data - metalen) < xdp_frame_end))
> > +		return -EINVAL;
> > +
> >  	/* Avoid info leak, when reusing area prev used by xdp_frame */
> >  	if (data < xdp_frame_end) {  
> 
> Effectively, when metalen > 0, then data_meta < data pointer, so above test
> on new data_meta might be better, but feels like a bit of a workaround to
> handle moving data pointer but disallowing moving data_meta pointer whereas
> both could be handled if we wanted to go that path.
> 
> >  		unsigned long clearlen = xdp_frame_end - data;
> > @@ -2734,6 +2739,7 @@ static const struct bpf_func_proto bpf_xdp_adjust_head_proto = {
> >  
> >  BPF_CALL_2(bpf_xdp_adjust_meta, struct xdp_buff *, xdp, int, offset)
> >  {
> > +	void *xdp_frame_end = xdp->data_hard_start + sizeof(struct xdp_frame);
> >  	void *meta = xdp->data_meta + offset;
> >  	unsigned long metalen = xdp->data - meta;
> >  
> > @@ -2742,6 +2748,11 @@ BPF_CALL_2(bpf_xdp_adjust_meta, struct xdp_buff *, xdp, int, offset)
> >  	if (unlikely(meta < xdp->data_hard_start ||
> >  		     meta > xdp->data))
> >  		return -EINVAL;
> > +
> > +	/* Disallow data_meta to use xdp_frame area */
> > +	if (unlikely(meta < xdp_frame_end))
> > +		return -EINVAL;  
> 
> (Ditto.)
> 
> >  	if (unlikely((metalen & (sizeof(__u32) - 1)) ||
> >  		     (metalen > 32)))
> >  		return -EACCES;  
> 
> The other, perhaps less invasive/complex option would be to just disallow
> moving anything into previous sizeof(struct xdp_frame) area. My original
> concern was that not all drivers use 256 bytes of headroom, e.g. afaik the
> i40e and ixgbe have around 192 bytes of headroom available, but that should
> actually still be plenty of space for encap + meta data, and potentially
> with meta data use I would expect that at best custom decap would be
> happening when pushing the packet up the stack. So might as well disallow
> going into that region and not worry about it. Thus, reverting e9e9545e10d3
> ("xdp: avoid leaking info stored in frame data on page reuse") and adding
> something like the below (uncompiled), should just do it:
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 3bb0cb9..ad98ddd 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2692,8 +2692,9 @@ static unsigned long xdp_get_metalen(const struct xdp_buff *xdp)
> 
>  BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset)
>  {
> +	void *frame_end = xdp->data_hard_start + sizeof(struct xdp_frame);
>  	unsigned long metalen = xdp_get_metalen(xdp);
> -	void *data_start = xdp->data_hard_start + metalen;
> +	void *data_start = frame_end + metalen;
>  	void *data = xdp->data + offset;
> 
>  	if (unlikely(data < data_start ||
> @@ -2719,13 +2720,13 @@ static const struct bpf_func_proto bpf_xdp_adjust_head_proto = {
> 
>  BPF_CALL_2(bpf_xdp_adjust_meta, struct xdp_buff *, xdp, int, offset)
>  {
> +	void *frame_end = xdp->data_hard_start + sizeof(struct xdp_frame);
>  	void *meta = xdp->data_meta + offset;
>  	unsigned long metalen = xdp->data - meta;
> 
>  	if (xdp_data_meta_unsupported(xdp))
>  		return -ENOTSUPP;
> -	if (unlikely(meta < xdp->data_hard_start ||
> -		     meta > xdp->data))
> +	if (unlikely(meta < frame_end || meta > xdp->data))
>  		return -EINVAL;
>  	if (unlikely((metalen & (sizeof(__u32) - 1)) ||
>  		     (metalen > 32)))

Okay, so you say just disallow using xdp_frame area in general.  It
would be simpler.  

The advantage it that we don't run into strange situations, where the
user/bpf_prog increased headroom too much, such that convert_to_xdp_frame()
fails and thus XDP_REDIRECT action fails.  (That will be confusing to
users to debug/troubleshoot).


> On top of that, we could even store a bool in struct xdp_rxq_info whether
> the driver actually is able to participate in resp. has the XDP_REDIRECT
> support and if not do something like:
> 
> void *frame_end = xdp->data_hard_start + xdp->rxq->has_redir ? sizeof(struct xdp_frame) : 0;
> 
> But the latter is merely a small optimization. Eventually we want all native XDP
> drivers to support it. Thoughts?

I would _really_ like see all drivers support XDP_REDIRECT, but to be
realistic, this is happening way too slow...

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox