* Re: Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Siwei Liu @ 2018-06-26 23:38 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Samudrala, Sridhar, Cornelia Huck, Alexander Duyck, virtio-dev,
aaron.f.brown, Jiri Pirko, Jakub Kicinski, Netdev, qemu-devel,
virtualization, konrad.wilk, boris.ostrovsky, Joao Martins,
Venu Busireddy, vijay.balakrishna
In-Reply-To: <20180626044650-mutt-send-email-mst@kernel.org>
On Mon, Jun 25, 2018 at 6:50 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Mon, Jun 25, 2018 at 10:54:09AM -0700, Samudrala, Sridhar wrote:
>> > > > > Might not neccessarily be something wrong, but it's very limited to
>> > > > > prohibit the MAC of VF from changing when enslaved by failover.
>> > > > You mean guest changing MAC? I'm not sure why we prohibit that.
>> > > I think Sridhar and Jiri might be better person to answer it. My
>> > > impression was that sync'ing the MAC address change between all 3
>> > > devices is challenging, as the failover driver uses MAC address to
>> > > match net_device internally.
>>
>> Yes. The MAC address is assigned by the hypervisor and it needs to manage the movement
>> of the MAC between the PF and VF. Allowing the guest to change the MAC will require
>> synchronization between the hypervisor and the PF/VF drivers. Most of the VF drivers
>> don't allow changing guest MAC unless it is a trusted VF.
>
> OK but it's a policy thing. Maybe it's a trusted VF. Who knows?
> For example I can see host just
> failing VIRTIO_NET_CTRL_MAC_ADDR_SET if it wants to block it.
> I'm not sure why VIRTIO_NET_F_STANDBY has to block it in the guest.
That's why I think pairing using MAC is fragile IMHO. When VF's MAC
got changed before virtio attempts to match and pair the device, it
ends up with no pairing found out at all. UUID is better.
-Siwei
>
> --
> MST
^ permalink raw reply
* Re: [PATCH net-next] net: preserve sock reference when scrubbing the skb.
From: Flavio Leitner @ 2018-06-26 23:33 UTC (permalink / raw)
To: Cong Wang
Cc: Eric Dumazet, Linux Kernel Network Developers, Paolo Abeni,
David Miller, Florian Westphal, NetFilter
In-Reply-To: <CAM_iQpU8E5OuXx87Dm+jbqwbkkwETNF_RZh-VnUkF5seFPvv_A@mail.gmail.com>
On Tue, Jun 26, 2018 at 03:47:31PM -0700, Cong Wang wrote:
> On Tue, Jun 26, 2018 at 3:03 PM Flavio Leitner <fbl@redhat.com> wrote:
> >
> > On Tue, Jun 26, 2018 at 02:48:47PM -0700, Cong Wang wrote:
> > > On Mon, Jun 25, 2018 at 11:41 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > > When a packet is attached to a socket, we should keep the association as much as possible.
> > >
> > > As much as possible within one stack, I agree. I still don't understand
> > > why we should keep it across the stack boundary.
> > >
> > > > Only when a new association needs to be done, skb_orphan() needs to be called.
> > > >
> > > > Doing this skb_orphan() too soon breaks back pressure in general, this is bad, since a socket
> > > > can evades SO_SNDBUF limits.
> > >
> > > Right before leaving the stack is not too soon, it is the latest
> > > actually, for veth case.
> >
> > Depends on how you view things - it's the same host/stack sharing the
> > same resources, so why should we not keep it?
>
> Because stacks are supposed to be independent, netdevices are
> isolated, iptables and route tables too. This is how netns is designed
> from the beginning. The trend today is actually more isolation instead
> of more sharing, given the popularity of containers.
It is still isolated, the sk carries the netns info and it is
orphaned when it re-enters the stack.
--
Flavio
^ permalink raw reply
* [PATCH v3 bpf-net] bpf: Change bpf_fib_lookup to return lookup status
From: dsahern @ 2018-06-26 23:21 UTC (permalink / raw)
To: netdev, borkmann, ast; +Cc: kafai, brouer, David Ahern
From: David Ahern <dsahern@gmail.com>
For ACLs implemented using either FIB rules or FIB entries, the BPF
program needs the FIB lookup status to be able to drop the packet.
Since the bpf_fib_lookup API has not reached a released kernel yet,
change the return code to contain an encoding of the FIB lookup
result and return the nexthop device index in the params struct.
In addition, inform the BPF program of any post FIB lookup reason as
to why the packet needs to go up the stack.
The fib result for unicast routes must have an egress device, so remove
the check that it is non-NULL.
Signed-off-by: David Ahern <dsahern@gmail.com>
---
v3
- fixup bpf_skb_fib_lookup for the change in API
v2
- drop BPF_FIB_LKUP_RET_NO_NHDEV; check in dev in fib result not needed
- enhance documentation of BPF_FIB_LKUP_RET_ codes
---
include/uapi/linux/bpf.h | 28 ++++++++++++---
net/core/filter.c | 86 +++++++++++++++++++++++++++++-----------------
samples/bpf/xdp_fwd_kern.c | 8 ++---
3 files changed, 81 insertions(+), 41 deletions(-)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 59b19b6a40d7..b7db3261c62d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1857,7 +1857,8 @@ union bpf_attr {
* is resolved), the nexthop address is returned in ipv4_dst
* or ipv6_dst based on family, smac is set to mac address of
* egress device, dmac is set to nexthop mac address, rt_metric
- * is set to metric from route (IPv4/IPv6 only).
+ * is set to metric from route (IPv4/IPv6 only), and ifindex
+ * is set to the device index of the nexthop from the FIB lookup.
*
* *plen* argument is the size of the passed in struct.
* *flags* argument can be a combination of one or more of the
@@ -1873,9 +1874,10 @@ union bpf_attr {
* *ctx* is either **struct xdp_md** for XDP programs or
* **struct sk_buff** tc cls_act programs.
* Return
- * Egress device index on success, 0 if packet needs to continue
- * up the stack for further processing or a negative error in case
- * of failure.
+ * * < 0 if any input argument is invalid
+ * * 0 on success (packet is forwarded, nexthop neighbor exists)
+ * * > 0 one of **BPF_FIB_LKUP_RET_** codes explaining why the
+ * * packet is not forwarded or needs assist from full stack
*
* int bpf_sock_hash_update(struct bpf_sock_ops_kern *skops, struct bpf_map *map, void *key, u64 flags)
* Description
@@ -2612,6 +2614,18 @@ struct bpf_raw_tracepoint_args {
#define BPF_FIB_LOOKUP_DIRECT BIT(0)
#define BPF_FIB_LOOKUP_OUTPUT BIT(1)
+enum {
+ BPF_FIB_LKUP_RET_SUCCESS, /* lookup successful */
+ BPF_FIB_LKUP_RET_BLACKHOLE, /* dest is blackholed; can be dropped */
+ BPF_FIB_LKUP_RET_UNREACHABLE, /* dest is unreachable; can be dropped */
+ BPF_FIB_LKUP_RET_PROHIBIT, /* dest not allowed; can be dropped */
+ BPF_FIB_LKUP_RET_NOT_FWDED, /* packet is not forwarded */
+ BPF_FIB_LKUP_RET_FWD_DISABLED, /* fwding is not enabled on ingress */
+ BPF_FIB_LKUP_RET_UNSUPP_LWT, /* fwd requires encapsulation */
+ BPF_FIB_LKUP_RET_NO_NEIGH, /* no neighbor entry for nh */
+ BPF_FIB_LKUP_RET_FRAG_NEEDED, /* fragmentation required to fwd */
+};
+
struct bpf_fib_lookup {
/* input: network family for lookup (AF_INET, AF_INET6)
* output: network family of egress nexthop
@@ -2625,7 +2639,11 @@ struct bpf_fib_lookup {
/* total length of packet from network header - used for MTU check */
__u16 tot_len;
- __u32 ifindex; /* L3 device index for lookup */
+
+ /* input: L3 device index for lookup
+ * output: device index from FIB lookup
+ */
+ __u32 ifindex;
union {
/* inputs to lookup */
diff --git a/net/core/filter.c b/net/core/filter.c
index e7f12e9f598c..0ca6907d7efe 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4073,8 +4073,9 @@ static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params,
memcpy(params->smac, dev->dev_addr, ETH_ALEN);
params->h_vlan_TCI = 0;
params->h_vlan_proto = 0;
+ params->ifindex = dev->ifindex;
- return dev->ifindex;
+ return 0;
}
#endif
@@ -4098,7 +4099,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
/* verify forwarding is enabled on this interface */
in_dev = __in_dev_get_rcu(dev);
if (unlikely(!in_dev || !IN_DEV_FORWARD(in_dev)))
- return 0;
+ return BPF_FIB_LKUP_RET_FWD_DISABLED;
if (flags & BPF_FIB_LOOKUP_OUTPUT) {
fl4.flowi4_iif = 1;
@@ -4123,7 +4124,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
tb = fib_get_table(net, tbid);
if (unlikely(!tb))
- return 0;
+ return BPF_FIB_LKUP_RET_NOT_FWDED;
err = fib_table_lookup(tb, &fl4, &res, FIB_LOOKUP_NOREF);
} else {
@@ -4135,8 +4136,20 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
err = fib_lookup(net, &fl4, &res, FIB_LOOKUP_NOREF);
}
- if (err || res.type != RTN_UNICAST)
- return 0;
+ if (err) {
+ /* map fib lookup errors to RTN_ type */
+ if (err == -EINVAL)
+ return BPF_FIB_LKUP_RET_BLACKHOLE;
+ if (err == -EHOSTUNREACH)
+ return BPF_FIB_LKUP_RET_UNREACHABLE;
+ if (err == -EACCES)
+ return BPF_FIB_LKUP_RET_PROHIBIT;
+
+ return BPF_FIB_LKUP_RET_NOT_FWDED;
+ }
+
+ if (res.type != RTN_UNICAST)
+ return BPF_FIB_LKUP_RET_NOT_FWDED;
if (res.fi->fib_nhs > 1)
fib_select_path(net, &res, &fl4, NULL);
@@ -4144,19 +4157,16 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
if (check_mtu) {
mtu = ip_mtu_from_fib_result(&res, params->ipv4_dst);
if (params->tot_len > mtu)
- return 0;
+ return BPF_FIB_LKUP_RET_FRAG_NEEDED;
}
nh = &res.fi->fib_nh[res.nh_sel];
/* do not handle lwt encaps right now */
if (nh->nh_lwtstate)
- return 0;
+ return BPF_FIB_LKUP_RET_UNSUPP_LWT;
dev = nh->nh_dev;
- if (unlikely(!dev))
- return 0;
-
if (nh->nh_gw)
params->ipv4_dst = nh->nh_gw;
@@ -4166,10 +4176,10 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
* rcu_read_lock_bh is not needed here
*/
neigh = __ipv4_neigh_lookup_noref(dev, (__force u32)params->ipv4_dst);
- if (neigh)
- return bpf_fib_set_fwd_params(params, neigh, dev);
+ if (!neigh)
+ return BPF_FIB_LKUP_RET_NO_NEIGH;
- return 0;
+ return bpf_fib_set_fwd_params(params, neigh, dev);
}
#endif
@@ -4190,7 +4200,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
/* link local addresses are never forwarded */
if (rt6_need_strict(dst) || rt6_need_strict(src))
- return 0;
+ return BPF_FIB_LKUP_RET_NOT_FWDED;
dev = dev_get_by_index_rcu(net, params->ifindex);
if (unlikely(!dev))
@@ -4198,7 +4208,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
idev = __in6_dev_get_safely(dev);
if (unlikely(!idev || !net->ipv6.devconf_all->forwarding))
- return 0;
+ return BPF_FIB_LKUP_RET_FWD_DISABLED;
if (flags & BPF_FIB_LOOKUP_OUTPUT) {
fl6.flowi6_iif = 1;
@@ -4225,7 +4235,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
tb = ipv6_stub->fib6_get_table(net, tbid);
if (unlikely(!tb))
- return 0;
+ return BPF_FIB_LKUP_RET_NOT_FWDED;
f6i = ipv6_stub->fib6_table_lookup(net, tb, oif, &fl6, strict);
} else {
@@ -4238,11 +4248,23 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
}
if (unlikely(IS_ERR_OR_NULL(f6i) || f6i == net->ipv6.fib6_null_entry))
- return 0;
+ return BPF_FIB_LKUP_RET_NOT_FWDED;
+
+ if (unlikely(f6i->fib6_flags & RTF_REJECT)) {
+ switch (f6i->fib6_type) {
+ case RTN_BLACKHOLE:
+ return BPF_FIB_LKUP_RET_BLACKHOLE;
+ case RTN_UNREACHABLE:
+ return BPF_FIB_LKUP_RET_UNREACHABLE;
+ case RTN_PROHIBIT:
+ return BPF_FIB_LKUP_RET_PROHIBIT;
+ default:
+ return BPF_FIB_LKUP_RET_NOT_FWDED;
+ }
+ }
- if (unlikely(f6i->fib6_flags & RTF_REJECT ||
- f6i->fib6_type != RTN_UNICAST))
- return 0;
+ if (f6i->fib6_type != RTN_UNICAST)
+ return BPF_FIB_LKUP_RET_NOT_FWDED;
if (f6i->fib6_nsiblings && fl6.flowi6_oif == 0)
f6i = ipv6_stub->fib6_multipath_select(net, f6i, &fl6,
@@ -4252,11 +4274,11 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
if (check_mtu) {
mtu = ipv6_stub->ip6_mtu_from_fib6(f6i, dst, src);
if (params->tot_len > mtu)
- return 0;
+ return BPF_FIB_LKUP_RET_FRAG_NEEDED;
}
if (f6i->fib6_nh.nh_lwtstate)
- return 0;
+ return BPF_FIB_LKUP_RET_UNSUPP_LWT;
if (f6i->fib6_flags & RTF_GATEWAY)
*dst = f6i->fib6_nh.nh_gw;
@@ -4270,10 +4292,10 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
*/
neigh = ___neigh_lookup_noref(ipv6_stub->nd_tbl, neigh_key_eq128,
ndisc_hashfn, dst, dev);
- if (neigh)
- return bpf_fib_set_fwd_params(params, neigh, dev);
+ if (!neigh)
+ return BPF_FIB_LKUP_RET_NO_NEIGH;
- return 0;
+ return bpf_fib_set_fwd_params(params, neigh, dev);
}
#endif
@@ -4315,7 +4337,7 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
struct bpf_fib_lookup *, params, int, plen, u32, flags)
{
struct net *net = dev_net(skb->dev);
- int index = -EAFNOSUPPORT;
+ int rc = -EAFNOSUPPORT;
if (plen < sizeof(*params))
return -EINVAL;
@@ -4326,25 +4348,25 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
switch (params->family) {
#if IS_ENABLED(CONFIG_INET)
case AF_INET:
- index = bpf_ipv4_fib_lookup(net, params, flags, false);
+ rc = bpf_ipv4_fib_lookup(net, params, flags, false);
break;
#endif
#if IS_ENABLED(CONFIG_IPV6)
case AF_INET6:
- index = bpf_ipv6_fib_lookup(net, params, flags, false);
+ rc = bpf_ipv6_fib_lookup(net, params, flags, false);
break;
#endif
}
- if (index > 0) {
+ if (!rc) {
struct net_device *dev;
- dev = dev_get_by_index_rcu(net, index);
+ dev = dev_get_by_index_rcu(net, params->ifindex);
if (!is_skb_forwardable(dev, skb))
- index = 0;
+ rc = BPF_FIB_LKUP_RET_FRAG_NEEDED;
}
- return index;
+ return rc;
}
static const struct bpf_func_proto bpf_skb_fib_lookup_proto = {
diff --git a/samples/bpf/xdp_fwd_kern.c b/samples/bpf/xdp_fwd_kern.c
index 6673cdb9f55c..a7e94e7ff87d 100644
--- a/samples/bpf/xdp_fwd_kern.c
+++ b/samples/bpf/xdp_fwd_kern.c
@@ -48,9 +48,9 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags)
struct ethhdr *eth = data;
struct ipv6hdr *ip6h;
struct iphdr *iph;
- int out_index;
u16 h_proto;
u64 nh_off;
+ int rc;
nh_off = sizeof(*eth);
if (data + nh_off > data_end)
@@ -101,7 +101,7 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags)
fib_params.ifindex = ctx->ingress_ifindex;
- out_index = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params), flags);
+ rc = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params), flags);
/* verify egress index has xdp support
* TO-DO bpf_map_lookup_elem(&tx_port, &key) fails with
@@ -109,7 +109,7 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags)
* NOTE: without verification that egress index supports XDP
* forwarding packets are dropped.
*/
- if (out_index > 0) {
+ if (rc == 0) {
if (h_proto == htons(ETH_P_IP))
ip_decrease_ttl(iph);
else if (h_proto == htons(ETH_P_IPV6))
@@ -117,7 +117,7 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags)
memcpy(eth->h_dest, fib_params.dmac, ETH_ALEN);
memcpy(eth->h_source, fib_params.smac, ETH_ALEN);
- return bpf_redirect_map(&tx_port, out_index, 0);
+ return bpf_redirect_map(&tx_port, fib_params.ifindex, 0);
}
return XDP_PASS;
--
2.11.0
^ permalink raw reply related
* Re: [net-next PATCH v4 7/7] Documentation: Add explanation for XPS using Rx-queue(s) map
From: Tom Herbert @ 2018-06-26 22:59 UTC (permalink / raw)
To: Amritha Nambiar
Cc: Linux Kernel Network Developers, David S. Miller, Alexander Duyck,
Willem de Bruijn, Sridhar Samudrala, Alexander Duyck,
Eric Dumazet, Hannes Frederic Sowa
In-Reply-To: <152994989181.9733.4112250516573561282.stgit@anamhost.jf.intel.com>
On Mon, Jun 25, 2018 at 11:04 AM, Amritha Nambiar
<amritha.nambiar@intel.com> wrote:
> Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Acked-by: Tom Herbert <tom@quantonium.net>
> ---
> Documentation/ABI/testing/sysfs-class-net-queues | 11 ++++
> Documentation/networking/scaling.txt | 57 ++++++++++++++++++----
> 2 files changed, 58 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-class-net-queues b/Documentation/ABI/testing/sysfs-class-net-queues
> index 0c0df91..978b763 100644
> --- a/Documentation/ABI/testing/sysfs-class-net-queues
> +++ b/Documentation/ABI/testing/sysfs-class-net-queues
> @@ -42,6 +42,17 @@ Description:
> network device transmit queue. Possible vaules depend on the
> number of available CPU(s) in the system.
>
> +What: /sys/class/<iface>/queues/tx-<queue>/xps_rxqs
> +Date: June 2018
> +KernelVersion: 4.18.0
> +Contact: netdev@vger.kernel.org
> +Description:
> + Mask of the receive queue(s) currently enabled to participate
> + into the Transmit Packet Steering packet processing flow for this
> + network device transmit queue. Possible values depend on the
> + number of available receive queue(s) in the network device.
> + Default is disabled.
> +
> What: /sys/class/<iface>/queues/tx-<queue>/byte_queue_limits/hold_time
> Date: November 2011
> KernelVersion: 3.3
> diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt
> index f55639d..8336116 100644
> --- a/Documentation/networking/scaling.txt
> +++ b/Documentation/networking/scaling.txt
> @@ -366,8 +366,13 @@ XPS: Transmit Packet Steering
>
> Transmit Packet Steering is a mechanism for intelligently selecting
> which transmit queue to use when transmitting a packet on a multi-queue
> -device. To accomplish this, a mapping from CPU to hardware queue(s) is
> -recorded. The goal of this mapping is usually to assign queues
> +device. This can be accomplished by recording two kinds of maps, either
> +a mapping of CPU to hardware queue(s) or a mapping of receive queue(s)
> +to hardware transmit queue(s).
> +
> +1. XPS using CPUs map
> +
> +The goal of this mapping is usually to assign queues
> exclusively to a subset of CPUs, where the transmit completions for
> these queues are processed on a CPU within this set. This choice
> provides two benefits. First, contention on the device queue lock is
> @@ -377,12 +382,35 @@ transmit queue). Secondly, cache miss rate on transmit completion is
> reduced, in particular for data cache lines that hold the sk_buff
> structures.
>
> -XPS is configured per transmit queue by setting a bitmap of CPUs that
> -may use that queue to transmit. The reverse mapping, from CPUs to
> -transmit queues, is computed and maintained for each network device.
> -When transmitting the first packet in a flow, the function
> -get_xps_queue() is called to select a queue. This function uses the ID
> -of the running CPU as a key into the CPU-to-queue lookup table. If the
> +2. XPS using receive queues map
> +
> +This mapping is used to pick transmit queue based on the receive
> +queue(s) map configuration set by the administrator. A set of receive
> +queues can be mapped to a set of transmit queues (many:many), although
> +the common use case is a 1:1 mapping. This will enable sending packets
> +on the same queue associations for transmit and receive. This is useful for
> +busy polling multi-threaded workloads where there are challenges in
> +associating a given CPU to a given application thread. The application
> +threads are not pinned to CPUs and each thread handles packets
> +received on a single queue. The receive queue number is cached in the
> +socket for the connection. In this model, sending the packets on the same
> +transmit queue corresponding to the associated receive queue has benefits
> +in keeping the CPU overhead low. Transmit completion work is locked into
> +the same queue-association that a given application is polling on. This
> +avoids the overhead of triggering an interrupt on another CPU. When the
> +application cleans up the packets during the busy poll, transmit completion
> +may be processed along with it in the same thread context and so result in
> +reduced latency.
> +
> +XPS is configured per transmit queue by setting a bitmap of
> +CPUs/receive-queues that may use that queue to transmit. The reverse
> +mapping, from CPUs to transmit queues or from receive-queues to transmit
> +queues, is computed and maintained for each network device. When
> +transmitting the first packet in a flow, the function get_xps_queue() is
> +called to select a queue. This function uses the ID of the receive queue
> +for the socket connection for a match in the receive queue-to-transmit queue
> +lookup table. Alternatively, this function can also use the ID of the
> +running CPU as a key into the CPU-to-queue lookup table. If the
> ID matches a single queue, that is used for transmission. If multiple
> queues match, one is selected by using the flow hash to compute an index
> into the set.
> @@ -404,11 +432,15 @@ acknowledged.
>
> XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
> default for SMP). The functionality remains disabled until explicitly
> -configured. To enable XPS, the bitmap of CPUs that may use a transmit
> -queue is configured using the sysfs file entry:
> +configured. To enable XPS, the bitmap of CPUs/receive-queues that may
> +use a transmit queue is configured using the sysfs file entry:
>
> +For selection based on CPUs map:
> /sys/class/net/<dev>/queues/tx-<n>/xps_cpus
>
> +For selection based on receive-queues map:
> +/sys/class/net/<dev>/queues/tx-<n>/xps_rxqs
> +
> == Suggested Configuration
>
> For a network device with a single transmission queue, XPS configuration
> @@ -421,6 +453,11 @@ best CPUs to share a given queue are probably those that share the cache
> with the CPU that processes transmit completions for that queue
> (transmit interrupts).
>
> +For transmit queue selection based on receive queue(s), XPS has to be
> +explicitly configured mapping receive-queue(s) to transmit queue(s). If the
> +user configuration for receive-queue map does not apply, then the transmit
> +queue is selected based on the CPUs map.
> +
> Per TX Queue rate limitation:
> =============================
>
>
^ permalink raw reply
* Re: [net-next PATCH v4 2/7] net: Use static_key for XPS maps
From: Tom Herbert @ 2018-06-26 22:54 UTC (permalink / raw)
To: Amritha Nambiar
Cc: Linux Kernel Network Developers, David S. Miller, Alexander Duyck,
Willem de Bruijn, Sridhar Samudrala, Alexander Duyck,
Eric Dumazet, Hannes Frederic Sowa
In-Reply-To: <152994986435.9733.17031817483990964500.stgit@anamhost.jf.intel.com>
On Mon, Jun 25, 2018 at 11:04 AM, Amritha Nambiar
<amritha.nambiar@intel.com> wrote:
> Use static_key for XPS maps to reduce the cost of extra map checks,
> similar to how it is used for RPS and RFS. This includes static_key
> 'xps_needed' for XPS and another for 'xps_rxqs_needed' for XPS using
> Rx queues map.
>
Acked-by: Tom Herbert <tom@quantonium.net>
> Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
> ---
> net/core/dev.c | 26 ++++++++++++++++++++------
> 1 file changed, 20 insertions(+), 6 deletions(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 2552556..df2a78d 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2081,6 +2081,10 @@ int netdev_txq_to_tc(struct net_device *dev, unsigned int txq)
> EXPORT_SYMBOL(netdev_txq_to_tc);
>
> #ifdef CONFIG_XPS
> +struct static_key xps_needed __read_mostly;
> +EXPORT_SYMBOL(xps_needed);
> +struct static_key xps_rxqs_needed __read_mostly;
> +EXPORT_SYMBOL(xps_rxqs_needed);
> static DEFINE_MUTEX(xps_map_mutex);
> #define xmap_dereference(P) \
> rcu_dereference_protected((P), lockdep_is_held(&xps_map_mutex))
> @@ -2170,12 +2174,14 @@ static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
>
> mutex_lock(&xps_map_mutex);
>
> - dev_maps = xmap_dereference(dev->xps_rxqs_map);
> - if (dev_maps) {
> - nr_ids = dev->num_rx_queues;
> - clean_xps_maps(dev, possible_mask, dev_maps, nr_ids, offset,
> - count, true);
> -
> + if (static_key_false(&xps_rxqs_needed)) {
> + dev_maps = xmap_dereference(dev->xps_rxqs_map);
> + if (dev_maps) {
> + nr_ids = dev->num_rx_queues;
> + clean_xps_maps(dev, possible_mask, dev_maps, nr_ids,
> + offset, count, true);
> + }
> + static_key_slow_dec(&xps_rxqs_needed);
> }
>
> dev_maps = xmap_dereference(dev->xps_cpus_map);
> @@ -2189,6 +2195,7 @@ static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
> false);
>
> out_no_maps:
> + static_key_slow_dec(&xps_needed);
> mutex_unlock(&xps_map_mutex);
> }
>
> @@ -2297,6 +2304,10 @@ int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
> if (!new_dev_maps)
> goto out_no_new_maps;
>
> + static_key_slow_inc(&xps_needed);
> + if (is_rxqs_map)
> + static_key_slow_inc(&xps_rxqs_needed);
> +
> for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
> j < nr_ids;) {
> /* copy maps belonging to foreign traffic classes */
> @@ -3450,6 +3461,9 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
> struct xps_map *map;
> int queue_index = -1;
>
> + if (!static_key_false(&xps_needed))
> + return -1;
> +
> rcu_read_lock();
> dev_maps = rcu_dereference(dev->xps_cpus_map);
> if (dev_maps) {
>
^ permalink raw reply
* set_memory_* (was: Re: BUG: unable to handle kernel paging request in bpf_int_jit_compile)
From: Daniel Borkmann @ 2018-06-26 22:53 UTC (permalink / raw)
To: Ingo Molnar, David Miller
Cc: tglx, syzbot+a4eb8c7766952a1ca872, ast, hpa, kuznet, linux-kernel,
mingo, netdev, syzkaller-bugs, x86, yoshfuji, peterz, labbott,
keescook, torvalds, edumazet
In-Reply-To: <20180624100249.GA9493@gmail.com>
On 06/24/2018 12:02 PM, Ingo Molnar wrote:
> * David Miller <davem@davemloft.net> wrote:
>> From: Thomas Gleixner <tglx@linutronix.de>
>> Date: Sun, 24 Jun 2018 09:09:09 +0200 (CEST)
>>
>>> I'm really tempted to make the BPF config switch depend on BROKEN.
>>
>> This really isn't necessary Thomas.
>>
>> Whoever wrote the code didn't understand that set ro can legitimately
>> fail.
>
> No, that's *NOT* the only thing that happened, according to the Git history.
>
> The first use of set_memory_ro() in include/linux/filter.h was added by
> this commit almost four years ago:
>
> # 2014/09
> 60a3b2253c41 ("net: bpf: make eBPF interpreter images read-only")
>
> ... and yes, that commit didn't anticipate the (in hindsight) obvious property of
> a function that changes global kernel mappings that if it is used after bootup
> without locking it 'may fail'. So that commit slipping through is 'shit happens'
> and I don't think we ever complained about such things slipping through.
Hmm, back then I adapted the code similar from 314beb9bcabf ("x86: bpf_jit_comp:
secure bpf jit against spraying attacks") for interpreter images as well, and
from grepping through the kernel code none of the callers of set_memory_{ro,rw}()
at that time (& now except bpf) did check for the return code (e.g. module_enable_ro()
and module_disable_ro() as one example which could happen late after bootup has
finished when pulling in modules on the fly).
I did made the mistake in 9facc336876f ("bpf: reject any prog that failed read-only
lock") assuming that after the set_memory_ro() call it would either succeed or
it would not, but not leaving us in a state in the middle. That was silly assumption
and I'll fix this up in bpf, very sorry about that! I've been debugging the syzkaller
BUG at [1] and noticed that even though set_memory_ro() failed with an error, doing
a probe_kernel_write() on it afterwards failed with EFAULT, meaning the module_alloc()
memory was however set to read-only at that point triggering later the BUG when
attempting to change its memory (at least on the virtual mem). From debugging output,
it was a single 4k page and on x86_64 in the __change_page_attr_set_clr() we failed
in the cpa_process_alias() where the syzkaller fault injection happened. So latter
failure from cpa_process_alias() came from call to __change_page_attr_set_clr() with
primary to 0, where it tried to split a large page in __change_page_attr() but failed
in alloc_pages() thus returning the -ENOMEM from there. Testing subsequent undoing
via set_memory_rw() made it writable again, though.
In any case, for pairs like set_memory_ro() + set_memory_rw() that are also used
outside of bpf e.g. STRICT_MODULE_RWX and friends which are mostly default these
days for some archs, is the choice to not check errors from there by design or from
historical context that it originated from 'debugging code' in that sense (DEBUG_RODATA /
DEBUG_SET_MODULE_RONX) earlier? Also if no-one checks for errors (and if that would
infact be the recommendation it is agreed upon) should the API be changed to void,
or generally should actual error checking occur on these + potential rollback; but
then question is what about restoring part from prior set_memory_ro() via set_memory_rw()?
Kees/others, do you happen to have some more context on recommended use around this
by any chance? (Would probably also help if we add some doc around assumptions into
include/linux/set_memory.h for future users.)
Thanks a lot,
Daniel
[1] https://syzkaller.appspot.com/bug?extid=a4eb8c7766952a1ca872
^ permalink raw reply
* Re: [net-next PATCH v4 1/7] net: Refactor XPS for CPUs and Rx queues
From: Tom Herbert @ 2018-06-26 22:53 UTC (permalink / raw)
To: Amritha Nambiar
Cc: Linux Kernel Network Developers, David S. Miller, Alexander Duyck,
Willem de Bruijn, Sridhar Samudrala, Alexander Duyck,
Eric Dumazet, Hannes Frederic Sowa
In-Reply-To: <152994985870.9733.538443146526394443.stgit@anamhost.jf.intel.com>
On Mon, Jun 25, 2018 at 11:04 AM, Amritha Nambiar
<amritha.nambiar@intel.com> wrote:
> Refactor XPS code to support Tx queue selection based on
> CPU(s) map or Rx queue(s) map.
>
> Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
> ---
> include/linux/cpumask.h | 11 ++
> include/linux/netdevice.h | 100 +++++++++++++++++++++
> net/core/dev.c | 211 ++++++++++++++++++++++++++++++---------------
> net/core/net-sysfs.c | 4 -
> 4 files changed, 246 insertions(+), 80 deletions(-)
>
> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> index bf53d89..57f20a0 100644
> --- a/include/linux/cpumask.h
> +++ b/include/linux/cpumask.h
> @@ -115,12 +115,17 @@ extern struct cpumask __cpu_active_mask;
> #define cpu_active(cpu) ((cpu) == 0)
> #endif
>
> -/* verify cpu argument to cpumask_* operators */
> -static inline unsigned int cpumask_check(unsigned int cpu)
> +static inline void cpu_max_bits_warn(unsigned int cpu, unsigned int bits)
> {
> #ifdef CONFIG_DEBUG_PER_CPU_MAPS
> - WARN_ON_ONCE(cpu >= nr_cpumask_bits);
> + WARN_ON_ONCE(cpu >= bits);
> #endif /* CONFIG_DEBUG_PER_CPU_MAPS */
> +}
> +
> +/* verify cpu argument to cpumask_* operators */
> +static inline unsigned int cpumask_check(unsigned int cpu)
> +{
> + cpu_max_bits_warn(cpu, nr_cpumask_bits);
> return cpu;
> }
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 3ec9850..c534f03 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -730,10 +730,15 @@ struct xps_map {
> */
> struct xps_dev_maps {
> struct rcu_head rcu;
> - struct xps_map __rcu *cpu_map[0];
> + struct xps_map __rcu *attr_map[0]; /* Either CPUs map or RXQs map */
> };
> -#define XPS_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) + \
> +
> +#define XPS_CPU_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) + \
> (nr_cpu_ids * (_tcs) * sizeof(struct xps_map *)))
> +
> +#define XPS_RXQ_DEV_MAPS_SIZE(_tcs, _rxqs) (sizeof(struct xps_dev_maps) +\
> + (_rxqs * (_tcs) * sizeof(struct xps_map *)))
> +
> #endif /* CONFIG_XPS */
>
> #define TC_MAX_QUEUE 16
> @@ -1909,7 +1914,8 @@ struct net_device {
> int watchdog_timeo;
>
> #ifdef CONFIG_XPS
> - struct xps_dev_maps __rcu *xps_maps;
> + struct xps_dev_maps __rcu *xps_cpus_map;
> + struct xps_dev_maps __rcu *xps_rxqs_map;
> #endif
> #ifdef CONFIG_NET_CLS_ACT
> struct mini_Qdisc __rcu *miniq_egress;
> @@ -3258,6 +3264,94 @@ static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
> #ifdef CONFIG_XPS
> int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
> u16 index);
> +int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
> + u16 index, bool is_rxqs_map);
> +
> +/**
> + * attr_test_mask - Test a CPU or Rx queue set in a cpumask/rx queues mask
> + * @j: CPU/Rx queue index
> + * @mask: bitmask of all cpus/rx queues
> + * @nr_bits: number of bits in the bitmask
> + *
> + * Test if a CPU or Rx queue index is set in a mask of all CPU/Rx queues.
> + */
> +static inline bool attr_test_mask(unsigned long j, const unsigned long *mask,
> + unsigned int nr_bits)
> +{
> + cpu_max_bits_warn(j, nr_bits);
> + return test_bit(j, mask);
> +}
> +
> +/**
> + * attr_test_online - Test for online CPU/Rx queue
> + * @j: CPU/Rx queue index
> + * @online_mask: bitmask for CPUs/Rx queues that are online
> + * @nr_bits: number of bits in the bitmask
> + *
> + * Returns true if a CPU/Rx queue is online.
> + */
> +static inline bool attr_test_online(unsigned long j,
> + const unsigned long *online_mask,
> + unsigned int nr_bits)
> +{
> + cpu_max_bits_warn(j, nr_bits);
> +
> + if (online_mask)
> + return test_bit(j, online_mask);
> +
> + if (j >= 0 && j < nr_bits)
j is unsigned so j >= 0 is superfluous.
> + return true;
> +
> + return false;
Could just do:
return (j < nr_bits);
> +}
> +
> +/**
> + * attrmask_next - get the next CPU/Rx queue in a cpumask/Rx queues mask
> + * @n: CPU/Rx queue index
> + * @srcp: the cpumask/Rx queue mask pointer
> + * @nr_bits: number of bits in the bitmask
> + *
> + * Returns >= nr_bits if no further CPUs/Rx queues set.
> + */
> +static inline unsigned int attrmask_next(int n, const unsigned long *srcp,
> + unsigned int nr_bits)
> +{
> + /* -1 is a legal arg here. */
> + if (n != -1)
> + cpu_max_bits_warn(n, nr_bits);
> +
> + if (srcp)
> + return find_next_bit(srcp, nr_bits, n + 1);
> +
> + return n + 1;
> +}
> +
> +/**
> + * attrmask_next_and - get the next CPU/Rx queue in *src1p & *src2p
> + * @n: CPU/Rx queue index
> + * @src1p: the first CPUs/Rx queues mask pointer
> + * @src2p: the second CPUs/Rx queues mask pointer
> + * @nr_bits: number of bits in the bitmask
> + *
> + * Returns >= nr_bits if no further CPUs/Rx queues set in both.
> + */
> +static inline int attrmask_next_and(int n, const unsigned long *src1p,
> + const unsigned long *src2p,
> + unsigned int nr_bits)
> +{
> + /* -1 is a legal arg here. */
> + if (n != -1)
> + cpu_max_bits_warn(n, nr_bits);
> +
> + if (src1p && src2p)
> + return find_next_and_bit(src1p, src2p, nr_bits, n + 1);
> + else if (src1p)
> + return find_next_bit(src1p, nr_bits, n + 1);
> + else if (src2p)
> + return find_next_bit(src2p, nr_bits, n + 1);
> +
> + return n + 1;
> +}
> #else
> static inline int netif_set_xps_queue(struct net_device *dev,
> const struct cpumask *mask,
> diff --git a/net/core/dev.c b/net/core/dev.c
> index a5aa1c7..2552556 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2092,7 +2092,7 @@ static bool remove_xps_queue(struct xps_dev_maps *dev_maps,
> int pos;
>
> if (dev_maps)
> - map = xmap_dereference(dev_maps->cpu_map[tci]);
> + map = xmap_dereference(dev_maps->attr_map[tci]);
> if (!map)
> return false;
>
> @@ -2105,7 +2105,7 @@ static bool remove_xps_queue(struct xps_dev_maps *dev_maps,
> break;
> }
>
> - RCU_INIT_POINTER(dev_maps->cpu_map[tci], NULL);
> + RCU_INIT_POINTER(dev_maps->attr_map[tci], NULL);
> kfree_rcu(map, rcu);
> return false;
> }
> @@ -2135,31 +2135,58 @@ static bool remove_xps_queue_cpu(struct net_device *dev,
> return active;
> }
>
> +static void clean_xps_maps(struct net_device *dev, const unsigned long *mask,
> + struct xps_dev_maps *dev_maps, unsigned int nr_ids,
> + u16 offset, u16 count, bool is_rxqs_map)
> +{
> + bool active = false;
> + int i, j;
> +
> + for (j = -1; j = attrmask_next(j, mask, nr_ids),
> + j < nr_ids;)
> + active |= remove_xps_queue_cpu(dev, dev_maps, j, offset,
> + count);
> + if (!active) {
> + if (is_rxqs_map) {
> + RCU_INIT_POINTER(dev->xps_rxqs_map, NULL);
> + } else {
> + RCU_INIT_POINTER(dev->xps_cpus_map, NULL);
> +
> + for (i = offset + (count - 1); count--; i--)
> + netdev_queue_numa_node_write(
> + netdev_get_tx_queue(dev, i),
> + NUMA_NO_NODE);
> + }
> + kfree_rcu(dev_maps, rcu);
> + }
> +}
> +
> static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
> u16 count)
> {
> + const unsigned long *possible_mask = NULL;
> struct xps_dev_maps *dev_maps;
> - int cpu, i;
> - bool active = false;
> + unsigned int nr_ids;
>
> mutex_lock(&xps_map_mutex);
> - dev_maps = xmap_dereference(dev->xps_maps);
>
> - if (!dev_maps)
> - goto out_no_maps;
> -
> - for_each_possible_cpu(cpu)
> - active |= remove_xps_queue_cpu(dev, dev_maps, cpu,
> - offset, count);
> + dev_maps = xmap_dereference(dev->xps_rxqs_map);
> + if (dev_maps) {
> + nr_ids = dev->num_rx_queues;
> + clean_xps_maps(dev, possible_mask, dev_maps, nr_ids, offset,
> + count, true);
>
> - if (!active) {
> - RCU_INIT_POINTER(dev->xps_maps, NULL);
> - kfree_rcu(dev_maps, rcu);
> }
>
> - for (i = offset + (count - 1); count--; i--)
> - netdev_queue_numa_node_write(netdev_get_tx_queue(dev, i),
> - NUMA_NO_NODE);
> + dev_maps = xmap_dereference(dev->xps_cpus_map);
> + if (!dev_maps)
> + goto out_no_maps;
> +
> + if (num_possible_cpus() > 1)
> + possible_mask = cpumask_bits(cpu_possible_mask);
> + nr_ids = nr_cpu_ids;
> + clean_xps_maps(dev, possible_mask, dev_maps, nr_ids, offset, count,
> + false);
>
> out_no_maps:
> mutex_unlock(&xps_map_mutex);
> @@ -2170,8 +2197,8 @@ static void netif_reset_xps_queues_gt(struct net_device *dev, u16 index)
> netif_reset_xps_queues(dev, index, dev->num_tx_queues - index);
> }
>
> -static struct xps_map *expand_xps_map(struct xps_map *map,
> - int cpu, u16 index)
> +static struct xps_map *expand_xps_map(struct xps_map *map, int attr_index,
> + u16 index, bool is_rxqs_map)
> {
> struct xps_map *new_map;
> int alloc_len = XPS_MIN_MAP_ALLOC;
> @@ -2183,7 +2210,7 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
> return map;
> }
>
> - /* Need to add queue to this CPU's existing map */
> + /* Need to add tx-queue to this CPU's/rx-queue's existing map */
> if (map) {
> if (pos < map->alloc_len)
> return map;
> @@ -2191,9 +2218,14 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
> alloc_len = map->alloc_len * 2;
> }
>
> - /* Need to allocate new map to store queue on this CPU's map */
> - new_map = kzalloc_node(XPS_MAP_SIZE(alloc_len), GFP_KERNEL,
> - cpu_to_node(cpu));
> + /* Need to allocate new map to store tx-queue on this CPU's/rx-queue's
> + * map
> + */
> + if (is_rxqs_map)
> + new_map = kzalloc(XPS_MAP_SIZE(alloc_len), GFP_KERNEL);
> + else
> + new_map = kzalloc_node(XPS_MAP_SIZE(alloc_len), GFP_KERNEL,
> + cpu_to_node(attr_index));
> if (!new_map)
> return NULL;
>
> @@ -2205,14 +2237,16 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
> return new_map;
> }
>
> -int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
> - u16 index)
> +int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
> + u16 index, bool is_rxqs_map)
> {
> + const unsigned long *online_mask = NULL, *possible_mask = NULL;
> struct xps_dev_maps *dev_maps, *new_dev_maps = NULL;
> - int i, cpu, tci, numa_node_id = -2;
> + int i, j, tci, numa_node_id = -2;
> int maps_sz, num_tc = 1, tc = 0;
> struct xps_map *map, *new_map;
> bool active = false;
> + unsigned int nr_ids;
>
> if (dev->num_tc) {
> num_tc = dev->num_tc;
> @@ -2221,16 +2255,27 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
> return -EINVAL;
> }
>
> - maps_sz = XPS_DEV_MAPS_SIZE(num_tc);
> - if (maps_sz < L1_CACHE_BYTES)
> - maps_sz = L1_CACHE_BYTES;
> -
> mutex_lock(&xps_map_mutex);
> + if (is_rxqs_map) {
> + maps_sz = XPS_RXQ_DEV_MAPS_SIZE(num_tc, dev->num_rx_queues);
> + dev_maps = xmap_dereference(dev->xps_rxqs_map);
> + nr_ids = dev->num_rx_queues;
> + } else {
> + maps_sz = XPS_CPU_DEV_MAPS_SIZE(num_tc);
> + if (num_possible_cpus() > 1) {
> + online_mask = cpumask_bits(cpu_online_mask);
> + possible_mask = cpumask_bits(cpu_possible_mask);
> + }
> + dev_maps = xmap_dereference(dev->xps_cpus_map);
> + nr_ids = nr_cpu_ids;
> + }
>
> - dev_maps = xmap_dereference(dev->xps_maps);
> + if (maps_sz < L1_CACHE_BYTES)
> + maps_sz = L1_CACHE_BYTES;
>
> /* allocate memory for queue storage */
> - for_each_cpu_and(cpu, cpu_online_mask, mask) {
> + for (j = -1; j = attrmask_next_and(j, online_mask, mask, nr_ids),
> + j < nr_ids;) {
> if (!new_dev_maps)
> new_dev_maps = kzalloc(maps_sz, GFP_KERNEL);
> if (!new_dev_maps) {
> @@ -2238,73 +2283,81 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
> return -ENOMEM;
> }
>
> - tci = cpu * num_tc + tc;
> - map = dev_maps ? xmap_dereference(dev_maps->cpu_map[tci]) :
> + tci = j * num_tc + tc;
> + map = dev_maps ? xmap_dereference(dev_maps->attr_map[tci]) :
> NULL;
>
> - map = expand_xps_map(map, cpu, index);
> + map = expand_xps_map(map, j, index, is_rxqs_map);
> if (!map)
> goto error;
>
> - RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
> + RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
> }
>
> if (!new_dev_maps)
> goto out_no_new_maps;
>
> - for_each_possible_cpu(cpu) {
> + for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
> + j < nr_ids;) {
> /* copy maps belonging to foreign traffic classes */
> - for (i = tc, tci = cpu * num_tc; dev_maps && i--; tci++) {
> + for (i = tc, tci = j * num_tc; dev_maps && i--; tci++) {
> /* fill in the new device map from the old device map */
> - map = xmap_dereference(dev_maps->cpu_map[tci]);
> - RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
> + map = xmap_dereference(dev_maps->attr_map[tci]);
> + RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
> }
>
> /* We need to explicitly update tci as prevous loop
> * could break out early if dev_maps is NULL.
> */
> - tci = cpu * num_tc + tc;
> + tci = j * num_tc + tc;
>
> - if (cpumask_test_cpu(cpu, mask) && cpu_online(cpu)) {
> - /* add queue to CPU maps */
> + if (attr_test_mask(j, mask, nr_ids) &&
> + attr_test_online(j, online_mask, nr_ids)) {
> + /* add tx-queue to CPU/rx-queue maps */
> int pos = 0;
>
> - map = xmap_dereference(new_dev_maps->cpu_map[tci]);
> + map = xmap_dereference(new_dev_maps->attr_map[tci]);
> while ((pos < map->len) && (map->queues[pos] != index))
> pos++;
>
> if (pos == map->len)
> map->queues[map->len++] = index;
> #ifdef CONFIG_NUMA
> - if (numa_node_id == -2)
> - numa_node_id = cpu_to_node(cpu);
> - else if (numa_node_id != cpu_to_node(cpu))
> - numa_node_id = -1;
Seems like there should be a comment here about meaning of -2 and -1
in NUMA node. Better yet, seems like there should be constants defined
for these special values. Maybe something to clean up in the future.
> + if (!is_rxqs_map) {
> + if (numa_node_id == -2)
> + numa_node_id = cpu_to_node(j);
> + else if (numa_node_id != cpu_to_node(j))
> + numa_node_id = -1;
> + }
> #endif
> } else if (dev_maps) {
> /* fill in the new device map from the old device map */
> - map = xmap_dereference(dev_maps->cpu_map[tci]);
> - RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
> + map = xmap_dereference(dev_maps->attr_map[tci]);
> + RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
> }
>
> /* copy maps belonging to foreign traffic classes */
> for (i = num_tc - tc, tci++; dev_maps && --i; tci++) {
> /* fill in the new device map from the old device map */
> - map = xmap_dereference(dev_maps->cpu_map[tci]);
> - RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
> + map = xmap_dereference(dev_maps->attr_map[tci]);
> + RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
> }
> }
>
> - rcu_assign_pointer(dev->xps_maps, new_dev_maps);
> + if (is_rxqs_map)
> + rcu_assign_pointer(dev->xps_rxqs_map, new_dev_maps);
> + else
> + rcu_assign_pointer(dev->xps_cpus_map, new_dev_maps);
>
> /* Cleanup old maps */
> if (!dev_maps)
> goto out_no_old_maps;
>
> - for_each_possible_cpu(cpu) {
> - for (i = num_tc, tci = cpu * num_tc; i--; tci++) {
> - new_map = xmap_dereference(new_dev_maps->cpu_map[tci]);
> - map = xmap_dereference(dev_maps->cpu_map[tci]);
> + for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
> + j < nr_ids;) {
> + for (i = num_tc, tci = j * num_tc; i--; tci++) {
> + new_map = xmap_dereference(new_dev_maps->attr_map[tci]);
> + map = xmap_dereference(dev_maps->attr_map[tci]);
> if (map && map != new_map)
> kfree_rcu(map, rcu);
> }
> @@ -2317,19 +2370,23 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
> active = true;
>
> out_no_new_maps:
> - /* update Tx queue numa node */
> - netdev_queue_numa_node_write(netdev_get_tx_queue(dev, index),
> - (numa_node_id >= 0) ? numa_node_id :
> - NUMA_NO_NODE);
> + if (!is_rxqs_map) {
> + /* update Tx queue numa node */
> + netdev_queue_numa_node_write(netdev_get_tx_queue(dev, index),
> + (numa_node_id >= 0) ?
> + numa_node_id : NUMA_NO_NODE);
> + }
>
> if (!dev_maps)
> goto out_no_maps;
>
> - /* removes queue from unused CPUs */
> - for_each_possible_cpu(cpu) {
> - for (i = tc, tci = cpu * num_tc; i--; tci++)
> + /* removes tx-queue from unused CPUs/rx-queues */
> + for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
> + j < nr_ids;) {
> + for (i = tc, tci = j * num_tc; i--; tci++)
> active |= remove_xps_queue(dev_maps, tci, index);
> - if (!cpumask_test_cpu(cpu, mask) || !cpu_online(cpu))
> + if (!attr_test_mask(j, mask, nr_ids) ||
> + !attr_test_online(j, online_mask, nr_ids))
> active |= remove_xps_queue(dev_maps, tci, index);
> for (i = num_tc - tc, tci++; --i; tci++)
> active |= remove_xps_queue(dev_maps, tci, index);
> @@ -2337,7 +2394,10 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
>
> /* free map if not active */
> if (!active) {
> - RCU_INIT_POINTER(dev->xps_maps, NULL);
> + if (is_rxqs_map)
> + RCU_INIT_POINTER(dev->xps_rxqs_map, NULL);
> + else
> + RCU_INIT_POINTER(dev->xps_cpus_map, NULL);
> kfree_rcu(dev_maps, rcu);
> }
>
> @@ -2347,11 +2407,12 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
> return 0;
> error:
> /* remove any maps that we added */
> - for_each_possible_cpu(cpu) {
> - for (i = num_tc, tci = cpu * num_tc; i--; tci++) {
> - new_map = xmap_dereference(new_dev_maps->cpu_map[tci]);
> + for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
> + j < nr_ids;) {
> + for (i = num_tc, tci = j * num_tc; i--; tci++) {
> + new_map = xmap_dereference(new_dev_maps->attr_map[tci]);
> map = dev_maps ?
> - xmap_dereference(dev_maps->cpu_map[tci]) :
> + xmap_dereference(dev_maps->attr_map[tci]) :
> NULL;
> if (new_map && new_map != map)
> kfree(new_map);
> @@ -2363,6 +2424,12 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
> kfree(new_dev_maps);
> return -ENOMEM;
> }
> +
> +int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
> + u16 index)
> +{
> + return __netif_set_xps_queue(dev, cpumask_bits(mask), index, false);
> +}
> EXPORT_SYMBOL(netif_set_xps_queue);
>
> #endif
> @@ -3384,7 +3451,7 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
> int queue_index = -1;
>
> rcu_read_lock();
> - dev_maps = rcu_dereference(dev->xps_maps);
> + dev_maps = rcu_dereference(dev->xps_cpus_map);
> if (dev_maps) {
> unsigned int tci = skb->sender_cpu - 1;
>
> @@ -3393,7 +3460,7 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
> tci += netdev_get_prio_tc_map(dev, skb->priority);
> }
>
> - map = rcu_dereference(dev_maps->cpu_map[tci]);
> + map = rcu_dereference(dev_maps->attr_map[tci]);
> if (map) {
> if (map->len == 1)
> queue_index = map->queues[0];
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index bb7e80f..b39987c 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -1227,13 +1227,13 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
> return -ENOMEM;
>
> rcu_read_lock();
> - dev_maps = rcu_dereference(dev->xps_maps);
> + dev_maps = rcu_dereference(dev->xps_cpus_map);
> if (dev_maps) {
> for_each_possible_cpu(cpu) {
> int i, tci = cpu * num_tc + tc;
> struct xps_map *map;
>
> - map = rcu_dereference(dev_maps->cpu_map[tci]);
> + map = rcu_dereference(dev_maps->attr_map[tci]);
> if (!map)
> continue;
>
>
Acked-by: Tom Herbert <tom@quantonium.net>
^ permalink raw reply
* Re: [PATCH net-next] net: preserve sock reference when scrubbing the skb.
From: Cong Wang @ 2018-06-26 22:47 UTC (permalink / raw)
To: Flavio Leitner
Cc: Eric Dumazet, Linux Kernel Network Developers, Paolo Abeni,
David Miller, Florian Westphal, NetFilter
In-Reply-To: <20180626220300.GT19565@plex.lan>
On Tue, Jun 26, 2018 at 3:03 PM Flavio Leitner <fbl@redhat.com> wrote:
>
> On Tue, Jun 26, 2018 at 02:48:47PM -0700, Cong Wang wrote:
> > On Mon, Jun 25, 2018 at 11:41 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > When a packet is attached to a socket, we should keep the association as much as possible.
> >
> > As much as possible within one stack, I agree. I still don't understand
> > why we should keep it across the stack boundary.
> >
> > > Only when a new association needs to be done, skb_orphan() needs to be called.
> > >
> > > Doing this skb_orphan() too soon breaks back pressure in general, this is bad, since a socket
> > > can evades SO_SNDBUF limits.
> >
> > Right before leaving the stack is not too soon, it is the latest
> > actually, for veth case.
>
> Depends on how you view things - it's the same host/stack sharing the
> same resources, so why should we not keep it?
Because stacks are supposed to be independent, netdevices are
isolated, iptables and route tables too. This is how netns is designed
from the beginning. The trend today is actually more isolation instead
of more sharing, given the popularity of containers.
You need to justify why you want to break the TSQ's scope here,
which is obviously not compatible with netns design.
^ permalink raw reply
* Re: [PATCH RESEND bpf-next v6 2/2] samples/bpf: Add xdp_sample_pkts example
From: Song Liu @ 2018-06-26 22:37 UTC (permalink / raw)
To: Toke Høiland-Jørgensen; +Cc: Networking
In-Reply-To: <152992950246.15897.18198690562616367907.stgit@alrua-kau>
On Mon, Jun 25, 2018 at 5:25 AM, Toke Høiland-Jørgensen <toke@toke.dk> wrote:
> Add an example program showing how to sample packets from XDP using the
> perf event buffer. The example userspace program just prints the ethernet
> header for every packet sampled.
>
> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Acked-by: Song Liu <songliubraving@fb.com>
> ---
> samples/bpf/Makefile | 4 +
> samples/bpf/xdp_sample_pkts_kern.c | 66 ++++++++++++++
> samples/bpf/xdp_sample_pkts_user.c | 169 ++++++++++++++++++++++++++++++++++++
> 3 files changed, 239 insertions(+)
> create mode 100644 samples/bpf/xdp_sample_pkts_kern.c
> create mode 100644 samples/bpf/xdp_sample_pkts_user.c
>
> diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> index 1303af10e54d..9ea2f7b64869 100644
> --- a/samples/bpf/Makefile
> +++ b/samples/bpf/Makefile
> @@ -52,6 +52,7 @@ hostprogs-y += xdp_adjust_tail
> hostprogs-y += xdpsock
> hostprogs-y += xdp_fwd
> hostprogs-y += task_fd_query
> +hostprogs-y += xdp_sample_pkts
>
> # Libbpf dependencies
> LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
> @@ -107,6 +108,7 @@ xdp_adjust_tail-objs := xdp_adjust_tail_user.o
> xdpsock-objs := bpf_load.o xdpsock_user.o
> xdp_fwd-objs := bpf_load.o xdp_fwd_user.o
> task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS)
> +xdp_sample_pkts-objs := xdp_sample_pkts_user.o $(TRACE_HELPERS)
>
> # Tell kbuild to always build the programs
> always := $(hostprogs-y)
> @@ -163,6 +165,7 @@ always += xdp_adjust_tail_kern.o
> always += xdpsock_kern.o
> always += xdp_fwd_kern.o
> always += task_fd_query_kern.o
> +always += xdp_sample_pkts_kern.o
>
> HOSTCFLAGS += -I$(objtree)/usr/include
> HOSTCFLAGS += -I$(srctree)/tools/lib/
> @@ -179,6 +182,7 @@ HOSTCFLAGS_spintest_user.o += -I$(srctree)/tools/lib/bpf/
> HOSTCFLAGS_trace_event_user.o += -I$(srctree)/tools/lib/bpf/
> HOSTCFLAGS_sampleip_user.o += -I$(srctree)/tools/lib/bpf/
> HOSTCFLAGS_task_fd_query_user.o += -I$(srctree)/tools/lib/bpf/
> +HOSTCFLAGS_xdp_sample_pkts_user.o += -I$(srctree)/tools/lib/bpf/
>
> HOST_LOADLIBES += $(LIBBPF) -lelf
> HOSTLOADLIBES_tracex4 += -lrt
> diff --git a/samples/bpf/xdp_sample_pkts_kern.c b/samples/bpf/xdp_sample_pkts_kern.c
> new file mode 100644
> index 000000000000..f7ca8b850978
> --- /dev/null
> +++ b/samples/bpf/xdp_sample_pkts_kern.c
> @@ -0,0 +1,66 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/ptrace.h>
> +#include <linux/version.h>
> +#include <uapi/linux/bpf.h>
> +#include "bpf_helpers.h"
> +
> +#define SAMPLE_SIZE 64ul
> +#define MAX_CPUS 128
> +
> +#define bpf_printk(fmt, ...) \
> +({ \
> + char ____fmt[] = fmt; \
> + bpf_trace_printk(____fmt, sizeof(____fmt), \
> + ##__VA_ARGS__); \
> +})
> +
> +struct bpf_map_def SEC("maps") my_map = {
> + .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
> + .key_size = sizeof(int),
> + .value_size = sizeof(u32),
> + .max_entries = MAX_CPUS,
> +};
> +
> +SEC("xdp_sample")
> +int xdp_sample_prog(struct xdp_md *ctx)
> +{
> + void *data_end = (void *)(long)ctx->data_end;
> + void *data = (void *)(long)ctx->data;
> +
> + /* Metadata will be in the perf event before the packet data. */
> + struct S {
> + u16 cookie;
> + u16 pkt_len;
> + } __packed metadata;
> +
> + if (data < data_end) {
> + /* The XDP perf_event_output handler will use the upper 32 bits
> + * of the flags argument as a number of bytes to include of the
> + * packet payload in the event data. If the size is too big, the
> + * call to bpf_perf_event_output will fail and return -EFAULT.
> + *
> + * See bpf_xdp_event_output in net/core/filter.c.
> + *
> + * The BPF_F_CURRENT_CPU flag means that the event output fd
> + * will be indexed by the CPU number in the event map.
> + */
> + u64 flags = BPF_F_CURRENT_CPU;
> + u16 sample_size;
> + int ret;
> +
> + metadata.cookie = 0xdead;
> + metadata.pkt_len = (u16)(data_end - data);
> + sample_size = min(metadata.pkt_len, SAMPLE_SIZE);
> + flags |= (u64)sample_size << 32;
> +
> + ret = bpf_perf_event_output(ctx, &my_map, flags,
> + &metadata, sizeof(metadata));
> + if (ret)
> + bpf_printk("perf_event_output failed: %d\n", ret);
> + }
> +
> + return XDP_PASS;
> +}
> +
> +char _license[] SEC("license") = "GPL";
> +u32 _version SEC("version") = LINUX_VERSION_CODE;
> diff --git a/samples/bpf/xdp_sample_pkts_user.c b/samples/bpf/xdp_sample_pkts_user.c
> new file mode 100644
> index 000000000000..8dd87c1eb560
> --- /dev/null
> +++ b/samples/bpf/xdp_sample_pkts_user.c
> @@ -0,0 +1,169 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <linux/perf_event.h>
> +#include <linux/bpf.h>
> +#include <net/if.h>
> +#include <errno.h>
> +#include <assert.h>
> +#include <sys/sysinfo.h>
> +#include <sys/ioctl.h>
> +#include <signal.h>
> +#include <libbpf.h>
> +#include <bpf/bpf.h>
> +
> +#include "perf-sys.h"
> +#include "trace_helpers.h"
> +
> +#define MAX_CPUS 128
> +static int pmu_fds[MAX_CPUS], if_idx;
> +static struct perf_event_mmap_page *headers[MAX_CPUS];
> +static char *if_name;
> +
> +static int do_attach(int idx, int fd, const char *name)
> +{
> + int err;
> +
> + err = bpf_set_link_xdp_fd(idx, fd, 0);
> + if (err < 0)
> + printf("ERROR: failed to attach program to %s\n", name);
> +
> + return err;
> +}
> +
> +static int do_detach(int idx, const char *name)
> +{
> + int err;
> +
> + err = bpf_set_link_xdp_fd(idx, -1, 0);
> + if (err < 0)
> + printf("ERROR: failed to detach program from %s\n", name);
> +
> + return err;
> +}
> +
> +#define SAMPLE_SIZE 64
> +
> +static int print_bpf_output(void *data, int size)
> +{
> + struct {
> + __u16 cookie;
> + __u16 pkt_len;
> + __u8 pkt_data[SAMPLE_SIZE];
> + } __packed *e = data;
> + int i;
> +
> + if (e->cookie != 0xdead) {
> + printf("BUG cookie %x sized %d\n",
> + e->cookie, size);
> + return LIBBPF_PERF_EVENT_ERROR;
> + }
> +
> + printf("Pkt len: %-5d bytes. Ethernet hdr: ", e->pkt_len);
> + for (i = 0; i < 14 && i < e->pkt_len; i++)
> + printf("%02x ", e->pkt_data[i]);
> + printf("\n");
> +
> + return LIBBPF_PERF_EVENT_CONT;
> +}
> +
> +static void test_bpf_perf_event(int map_fd, int num)
> +{
> + struct perf_event_attr attr = {
> + .sample_type = PERF_SAMPLE_RAW,
> + .type = PERF_TYPE_SOFTWARE,
> + .config = PERF_COUNT_SW_BPF_OUTPUT,
> + .wakeup_events = 1, /* get an fd notification for every event */
> + };
> + int i;
> +
> + for (i = 0; i < num; i++) {
> + int key = i;
> +
> + pmu_fds[i] = sys_perf_event_open(&attr, -1/*pid*/, i/*cpu*/,
> + -1/*group_fd*/, 0);
> +
> + assert(pmu_fds[i] >= 0);
> + assert(bpf_map_update_elem(map_fd, &key,
> + &pmu_fds[i], BPF_ANY) == 0);
> + ioctl(pmu_fds[i], PERF_EVENT_IOC_ENABLE, 0);
> + }
> +}
> +
> +static void sig_handler(int signo)
> +{
> + do_detach(if_idx, if_name);
> + exit(0);
> +}
> +
> +int main(int argc, char **argv)
> +{
> + struct bpf_prog_load_attr prog_load_attr = {
> + .prog_type = BPF_PROG_TYPE_XDP,
> + };
> + struct bpf_object *obj;
> + struct bpf_map *map;
> + int prog_fd, map_fd;
> + char filename[256];
> + int ret, err, i;
> + int numcpus;
> +
> + if (argc < 2) {
> + printf("Usage: %s <ifname>\n", argv[0]);
> + return 1;
> + }
> +
> + numcpus = get_nprocs();
> + if (numcpus > MAX_CPUS)
> + numcpus = MAX_CPUS;
> +
> + snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
> + prog_load_attr.file = filename;
> +
> + if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
> + return 1;
> +
> + if (!prog_fd) {
> + printf("load_bpf_file: %s\n", strerror(errno));
> + return 1;
> + }
> +
> + map = bpf_map__next(NULL, obj);
> + if (!map) {
> + printf("finding a map in obj file failed\n");
> + return 1;
> + }
> + map_fd = bpf_map__fd(map);
> +
> + if_idx = if_nametoindex(argv[1]);
> + if (!if_idx)
> + if_idx = strtoul(argv[1], NULL, 0);
> +
> + if (!if_idx) {
> + fprintf(stderr, "Invalid ifname\n");
> + return 1;
> + }
> + if_name = argv[1];
> + err = do_attach(if_idx, prog_fd, argv[1]);
> + if (err)
> + return err;
> +
> + if (signal(SIGINT, sig_handler) ||
> + signal(SIGHUP, sig_handler) ||
> + signal(SIGTERM, sig_handler)) {
> + perror("signal");
> + return 1;
> + }
> +
> + test_bpf_perf_event(map_fd, numcpus);
> +
> + for (i = 0; i < numcpus; i++)
> + if (perf_event_mmap_header(pmu_fds[i], &headers[i]) < 0)
> + return 1;
> +
> + ret = perf_event_poller_multi(pmu_fds, headers, numcpus,
> + print_bpf_output);
> + kill(0, SIGINT);
> + return ret;
> +}
>
^ permalink raw reply
* Re: [PATCH bpf-next 2/3] bpf: btf: add btf json print functionality
From: Jakub Kicinski @ 2018-06-26 22:35 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: Okash Khawaja, Daniel Borkmann, Alexei Starovoitov, Yonghong Song,
Quentin Monnet, David S. Miller, netdev, kernel-team,
linux-kernel
In-Reply-To: <20180626222709.fsznwqauxojhhx7v@kafai-mbp.dhcp.thefacebook.com>
On Tue, 26 Jun 2018 15:27:09 -0700, Martin KaFai Lau wrote:
> On Tue, Jun 26, 2018 at 01:31:33PM -0700, Jakub Kicinski wrote:
> > On Tue, 26 Jun 2018 17:48:22 +0100, Okash Khawaja wrote:
> > > On Fri, Jun 22, 2018 at 05:26:39PM -0700, Martin KaFai Lau wrote:
> > > > On Fri, Jun 22, 2018 at 04:32:00PM -0700, Jakub Kicinski wrote:
> > > > > On Fri, 22 Jun 2018 15:54:08 -0700, Martin KaFai Lau wrote:
> > > > > > > > > > > > > > > "value": ["0x02","0x00","0x00","0x00","0x00","0x00","0x00","0x00"
> > > > > > > > > > > > > > > ],
> > > > > > > > > > > > > > > "value_struct":{
> > > > > > > > > > > > > > > "src_ip":2,
> > > > > > > > > > If for the same map the user changes the "src_ip" to an array of int[4]
> > > > > > > > > > later (e.g. to support ipv6), it will become "src_ip": [1, 2, 3, 4].
> > > > > > > > > > Is it breaking backward compat?
> > > > > > > > > > i.e.
> > > > > > > > > > struct five_tuples {
> > > > > > > > > > - int src_ip;
> > > > > > > > > > + int src_ip[4];
> > > > > > > > > > /* ... */
> > > > > > > > > > };
> > > > > > > > >
> > > > > > > > > Well, it is breaking backward compat, but it's the program doing it,
> > > > > > > > > not bpftool :) BTF changes so does the output.
> > > > > > > > As we see, the key/value's btf-output is inherently not backward compat.
> > > > > > > > Hence, "-j" and "-p" will stay as is. The whole existing json will
> > > > > > > > be backward compat instead of only partly backward compat.
> > > > > > >
> > > > > > > No. There is a difference between user of a facility changing their
> > > > > > > input and kernel/libraries providing different output in response to
> > > > > > > that, and the libraries suddenly changing the output on their own.
> > > > > > >
> > > > > > > Your example is like saying if user started using IPv6 addresses
> > > > > > > instead of IPv4 the netlink attributes in dumps will be different so
> > > > > > > kernel didn't keep backwards compat. While what you're doing is more
> > > > > > > equivalent to dropping support for old ioctl interfaces because there
> > > > > > > is a better mechanism now.
> > > > > > Sorry, I don't follow this. I don't see netlink suffer json issue like
> > > > > > the one on "key" and "value".
> > > > > >
> > > > > > All I can grasp is, the json should normally be backward compat but now
> > > > > > we are saying anything added by btf-output is an exception because
> > > > > > the script parsing it will treat it differently than "key" and "value"
> > > > >
> > > > > Backward compatibility means that if I run *the same* program against
> > > > > different kernels/libraries it continues to work. If someone decides
> > > > > to upgrade their program to work with IPv6 (which was your example)
> > > > > obviously there is no way system as a whole will look 1:1 the same.
> > > > >
> > > > > > > BTF in JSON is very useful, and will help people who writes simple
> > > > > > > orchestration/scripts based on bpftool *a* *lot*. I really appreciate
> > > > > > Can you share what the script will do? I want to understand why
> > > > > > it cannot directly use the BTF format and the map data.
> > > > >
> > > > > Think about a python script which wants to read a counter in a map.
> > > > > Right now it would have to get the BTF, find out which bytes are the
> > > > > counter, then convert the bytes into a larger int. With JSON BTF if
> > > > > just does entry["formatted"]["value"]["counter"].
> > > > >
> > > > > Real life example from my test code (conversion of 3 element counter
> > > > > array):
> > > > >
> > > > > def str2int(strtab):
> > > > > inttab = []
> > > > > for i in strtab:
> > > > > inttab.append(int(i, 16))
> > > > > ba = bytearray(inttab)
> > > > > if len(strtab) == 4:
> > > > > fmt = "I"
> > > > > elif len(strtab) == 8:
> > > > > fmt = "Q"
> > > > > else:
> > > > > raise Exception("String array of len %d can't be unpacked to an int" %
> > > > > (len(strtab)))
> > > > > return struct.unpack(fmt, ba)[0]
> > > > >
> > > > > def convert(elems, idx):
> > > > > val = []
> > > > > for i in range(3):
> > > > > part = elems[idx]["value"][i * length:(i + 1) * length]
> > > > > val.append(str2int(part))
> > > > > return val
> > > > >
> > > > > With BTF it would be:
> > > > >
> > > > > elems[idx]["formatted"]["value"]
> > > > >
> > > > > Which is fairly awesome.
> > > > Thanks for the example. Agree that with BTF, things are easier in general.
> > > >
> > > > btw, what more awesome is,
> > > > #> bpftool map find id 100 key 1
> > > > {
> > > > "counter_x": 1,
> > > > "counter_y": 10
> > > > }
> > > >
> > > > >
> > > > > > > this addition to bpftool and will start using it myself as soon as it
> > > > > > > lands. I'm not sure why the reluctance to slightly change the output
> > > > > > > format?
> > > > > > The initial change argument is because the json has to be backward compat.
> > > > > >
> > > > > > Then we show that btf-output is inherently not backward compat, so
> > > > > > printing it in json does not make sense at all.
> > > > > >
> > > > > > However, now it is saying part of it does not have to be backward compat.
> > > > >
> > > > > Compatibility of "formatted" member is defined as -> fields broken out
> > > > > according to BTF. So it is backward compatible. The definition of
> > > > > "value" member is -> an array of unfortunately formatted array of
> > > > > ugly hex strings :(
> > > > >
> > > > > > I am fine putting it under "formatted" for "-j" or "-p" if that is the
> > > > > > case, other than the double output is still confusing. Lets wait for
> > > > > > Okash's input.
> > > > > >
> > > > > > At the same time, the same output will be used as the default plaintext
> > > > > > output when BTF is available. Then the plaintext BTF output
> > > > > > will not be limited by the json restrictions when we want
> > > > > > to improve human readability later. Apparently, the
> > > > > > improvements on plaintext will not be always applicable
> > > > > > to json output.
> > > > >
> > >
> > > hi,
> > >
> > > so i guess following is what we want:
> > >
> > > 1. a "formatted" object nested inside -p and -j switches for bpf map
> > > dump. this will be JSON and backward compatible
> > > 2. an output for humans - which is like the current output. this will
> > > not be JSON. this won't have to be backward compatible. this output will
> > > be shown when neither of -j and -p are supplied and btf info is
> > > available.
> > >
> > > i can update the patches to v2 which covers 2 above + all other comments
> > > on the patchset. later we can follow up with a patch for 1.
> >
> > Please do both at the same time. I've learnt not to trust people when
> > they say things like "we can follow up with xyz" :( In my experience it
> > _always_ backfires.
> >
> > Implementing both outputs in one series will help you structure your
> > code to best suit both of the formats up front.
> hex and "formatted" are the only things missing? As always, things
> can be refactored when new use case comes up. Lets wait for
> Okash input.
>
> Regardless, plaintext is our current use case. Having the current
> patchset in does not stop us or others from contributing other use
> cases (json, "bpftool map find"...etc), and IMO it is actually
> the opposite. Others may help us get there faster than us alone.
> We should not stop making forward progress and take this patch
> as hostage because "abc" and "xyz" are not done together.
Parity between JSON and plain text output is non negotiable.
^ permalink raw reply
* Re: [PATCH RESEND bpf-next v6 1/2] trace_helpers.c: Add helpers to poll multiple perf FDs for events
From: Song Liu @ 2018-06-26 22:35 UTC (permalink / raw)
To: Toke Høiland-Jørgensen; +Cc: Networking
In-Reply-To: <152992950237.15897.9894201421576854943.stgit@alrua-kau>
On Mon, Jun 25, 2018 at 5:25 AM, Toke Høiland-Jørgensen <toke@toke.dk> wrote:
> Add two new helper functions to trace_helpers that supports polling
> multiple perf file descriptors for events. These are used to the XDP
> perf_event_output example, which needs to work with one perf fd per CPU.
>
> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Acked-by: Song Liu <songliubraving@fb.com>
> ---
> tools/testing/selftests/bpf/trace_helpers.c | 48 ++++++++++++++++++++++++++-
> tools/testing/selftests/bpf/trace_helpers.h | 4 ++
> 2 files changed, 50 insertions(+), 2 deletions(-)
>
> diff --git a/tools/testing/selftests/bpf/trace_helpers.c b/tools/testing/selftests/bpf/trace_helpers.c
> index 3868dcb63420..cabe2a3a3b30 100644
> --- a/tools/testing/selftests/bpf/trace_helpers.c
> +++ b/tools/testing/selftests/bpf/trace_helpers.c
> @@ -88,7 +88,7 @@ static int page_size;
> static int page_cnt = 8;
> static struct perf_event_mmap_page *header;
>
> -int perf_event_mmap(int fd)
> +int perf_event_mmap_header(int fd, struct perf_event_mmap_page **header)
> {
> void *base;
> int mmap_size;
> @@ -102,10 +102,15 @@ int perf_event_mmap(int fd)
> return -1;
> }
>
> - header = base;
> + *header = base;
> return 0;
> }
>
> +int perf_event_mmap(int fd)
> +{
> + return perf_event_mmap_header(fd, &header);
> +}
> +
> static int perf_event_poll(int fd)
> {
> struct pollfd pfd = { .fd = fd, .events = POLLIN };
> @@ -163,3 +168,42 @@ int perf_event_poller(int fd, perf_event_print_fn output_fn)
>
> return ret;
> }
> +
> +int perf_event_poller_multi(int *fds, struct perf_event_mmap_page **headers,
> + int num_fds, perf_event_print_fn output_fn)
> +{
> + enum bpf_perf_event_ret ret;
> + struct pollfd *pfds;
> + void *buf = NULL;
> + size_t len = 0;
> + int i;
> +
> + pfds = calloc(num_fds, sizeof(*pfds));
> + if (!pfds)
> + return LIBBPF_PERF_EVENT_ERROR;
> +
> + for (i = 0; i < num_fds; i++) {
> + pfds[i].fd = fds[i];
> + pfds[i].events = POLLIN;
> + }
> +
> + for (;;) {
> + poll(pfds, num_fds, 1000);
> + for (i = 0; i < num_fds; i++) {
> + if (!pfds[i].revents)
> + continue;
> +
> + ret = bpf_perf_event_read_simple(headers[i],
> + page_cnt * page_size,
> + page_size, &buf, &len,
> + bpf_perf_event_print,
> + output_fn);
> + if (ret != LIBBPF_PERF_EVENT_CONT)
> + break;
> + }
> + }
> + free(buf);
> + free(pfds);
> +
> + return ret;
> +}
> diff --git a/tools/testing/selftests/bpf/trace_helpers.h b/tools/testing/selftests/bpf/trace_helpers.h
> index 3b4bcf7f5084..18924f23db1b 100644
> --- a/tools/testing/selftests/bpf/trace_helpers.h
> +++ b/tools/testing/selftests/bpf/trace_helpers.h
> @@ -3,6 +3,7 @@
> #define __TRACE_HELPER_H
>
> #include <libbpf.h>
> +#include <linux/perf_event.h>
>
> struct ksym {
> long addr;
> @@ -16,6 +17,9 @@ long ksym_get_addr(const char *name);
> typedef enum bpf_perf_event_ret (*perf_event_print_fn)(void *data, int size);
>
> int perf_event_mmap(int fd);
> +int perf_event_mmap_header(int fd, struct perf_event_mmap_page **header);
> /* return LIBBPF_PERF_EVENT_DONE or LIBBPF_PERF_EVENT_ERROR */
> int perf_event_poller(int fd, perf_event_print_fn output_fn);
> +int perf_event_poller_multi(int *fds, struct perf_event_mmap_page **headers,
> + int num_fds, perf_event_print_fn output_fn);
> #endif
>
^ permalink raw reply
* Re: [PATCH 0/6] offload Linux LAG devices to the TC datapath
From: Jakub Kicinski @ 2018-06-26 22:31 UTC (permalink / raw)
To: Or Gerlitz
Cc: John Hurley, Jiri Pirko, netdev, ASAP_Direct_Dev, simon.horman,
Andy Gospodarek
In-Reply-To: <c3e3790f-fd49-3106-718f-d87993a0c195@mellanox.com>
On Tue, 26 Jun 2018 17:57:08 +0300, Or Gerlitz wrote:
> > -------- Forwarded Message --------
> > Subject: [PATCH 0/6] offload Linux LAG devices to the TC datapath
> > Date: Thu, 21 Jun 2018 14:35:55 +0100
> > From: John Hurley <john.hurley@netronome.com>
> > To: dev@openvswitch.org, roid@mellanox.com, gavi@mellanox.com, paulb@mellanox.com, fbl@sysclose.org, simon.horman@netronome.com
> > CC: John Hurley <john.hurley@netronome.com>
> >
> > This patchset extends OvS TC and the linux-netdev implementation to
> > support the offloading of Linux Link Aggregation devices (LAG) and their
> > slaves. TC blocks are used to provide this offload. Blocks, in TC, group
> > together a series of qdiscs. If a filter is added to one of these qdiscs
> > then it applied to all. Similarly, if a packet is matched on one of the
> > grouped qdiscs then the stats for the entire block are increased. The
> > basis of the LAG offload is that the LAG master (attached to the OvS
> > bridge) and slaves that may exist outside of OvS are all added to the same
> > TC block. OvS can then control the filters and collect the stats on the
> > slaves via its interaction with the LAG master.
> >
> > The TC API is extended within OvS to allow the addition of a block id to
> > ingress qdisc adds. Block ids are then assigned to each LAG master that is
> > attached to the OvS bridge. The linux netdev netlink socket is used to
> > monitor slave devices. If a LAG slave is found whose master is on the bridge
> > then it is added to the same block as its master. If the underlying slaves
> > belong to an offloadable device then the Linux LAG device can be offloaded
> > to hardware.
>
> Guys (J/J/J),
>
> Doing this here b/c
>
> a. this has impact on the kernel side of things
>
> b. I am more of a netdev and not openvswitch citizen..
>
> some comments,
>
> 1. this + Jakub's patch for the reply are really a great design
>
> 2. re the egress side of things. Some NIC HWs can't just use LAG
> as the egress port destination of an ACL (tc rule) and the HW rule
> needs to be duplicated to both HW ports. So... in that case, you
> see the HW driver doing the duplication (:() or we can somehow
> make it happen from user-space?
It's the TC core that does the duplication. Drivers which don't need
the duplication (e.g. mlxsw) will not register a new callback for each
port on which shared block is bound. They will keep one list of rules,
and a list of ports that those rules apply to.
Drivers which need duplication (multiplication) (all NICs?) have to
register a new callback for each port bound to a shared block. And TC
will call those drivers as many times as they have callbacks registered
== as many times as they have ports bound to the block. Each time
callback is invoked the driver will figure out the ingress port based
on the cb_priv and use <ingress, cookie> as the key in its rule table
(or have a separate rule table per ingress port).
So again you just register a callback every time shared block is bound,
and then TC core will send add/remove rule commands down to the driver,
relaying existing rules as well if needed. I may be wrong, but I think
you split the rules tables per port for mlx5, so this OvS bond offload
scheme "should just work" for you?
Does that clarify things or were you asking more about the active
passive thing John mentioned or some way to save rule space?
> 3. for the case of overlay networks, e.g OVS based vxlan tunnel, the
> ingress (decap) rule is set on the vxlan device. Jakub, you mentioned
> a possible kernel patch to the HW (nfp, mlx5) drivers to have them bind
> to the tunnel device for ingress rules. If we have agreed way to identify
> uplink representors, can we do that from ovs too?
I'm not sure, there can be multiple tunnel devices. Plus we really
want to know the tunnel type, e.g. vxlan vs geneve, so simple shared
block propagation will probably not cut it. If that's what you're
referring to.
> does it matter if we are bonding + encapsulating or just
> encapsulating? note that under encap scheme the bond is typically not
> part of the OVS bridge.
I don't think that matters in general, driver doing bonding offload
should just start recognizing the bond master as "their port" and
register an egdev callback for redirects to master today (or equivalent
in the new scheme once that materializes...)
^ permalink raw reply
* Re: [PATCH bpf-next 6/7] nfp: bpf: support u32 divide using reciprocal_div.h
From: Song Liu @ 2018-06-26 22:28 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Alexei Starovoitov, Daniel Borkmann, oss-drivers, Networking,
Jiong Wang
In-Reply-To: <20180625035421.2991-7-jakub.kicinski@netronome.com>
On Sun, Jun 24, 2018 at 8:54 PM, Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
> From: Jiong Wang <jiong.wang@netronome.com>
>
> NFP doesn't have integer divide instruction, this patch use reciprocal
> algorithm (the basic one, reciprocal_div) to emulate it.
>
> For each u32 divide, we would need 11 instructions to finish the operation.
>
> 7 (for multiplication) + 4 (various ALUs) = 11
>
> Given NFP only supports multiplication no bigger than u32, we'd require
> divisor and dividend no bigger than that as well.
>
> Also eBPF doesn't support signed divide and has enforced this on C language
> level by failing compilation. However LLVM assembler hasn't enforced this,
> so it is possible for negative constant to leak in as a BPF_K operand
> through assembly code, we reject such cases as well.
>
> Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Song Liu <songliubraving@fb.com>
> ---
> drivers/net/ethernet/netronome/nfp/bpf/jit.c | 58 ++++++++++++++++++-
> drivers/net/ethernet/netronome/nfp/bpf/main.h | 5 ++
> .../net/ethernet/netronome/nfp/bpf/verifier.c | 31 ++++++++++
> 3 files changed, 93 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
> index 7d7061d93358..d732b6cfc356 100644
> --- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
> +++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
> @@ -34,10 +34,11 @@
> #define pr_fmt(fmt) "NFP net bpf: " fmt
>
> #include <linux/bug.h>
> -#include <linux/kernel.h>
> #include <linux/bpf.h>
> #include <linux/filter.h>
> +#include <linux/kernel.h>
> #include <linux/pkt_cls.h>
> +#include <linux/reciprocal_div.h>
> #include <linux/unistd.h>
>
> #include "main.h"
> @@ -1493,6 +1494,32 @@ wrp_mul(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta,
> return 0;
> }
>
> +static int wrp_div_imm(struct nfp_prog *nfp_prog, u8 dst, u64 imm)
> +{
> + swreg tmp_both = imm_both(nfp_prog), dst_both = reg_both(dst);
> + swreg dst_a = reg_a(dst), dst_b = reg_a(dst);
> + struct reciprocal_value rvalue;
> + swreg tmp_b = imm_b(nfp_prog);
> + swreg magic;
> +
> + if (imm > U32_MAX) {
> + wrp_immed(nfp_prog, dst_both, 0);
> + return 0;
> + }
> +
> + rvalue = reciprocal_value(imm);
> + magic = re_load_imm_any(nfp_prog, rvalue.m, imm_b(nfp_prog));
> + wrp_mul_u32(nfp_prog, tmp_both, tmp_both, dst_a, magic, true);
> + emit_alu(nfp_prog, dst_both, dst_a, ALU_OP_SUB, tmp_b);
> + emit_shf(nfp_prog, dst_both, reg_none(), SHF_OP_NONE, dst_b,
> + SHF_SC_R_SHF, rvalue.sh1);
> + emit_alu(nfp_prog, dst_both, dst_a, ALU_OP_ADD, tmp_b);
> + emit_shf(nfp_prog, dst_both, reg_none(), SHF_OP_NONE, dst_b,
> + SHF_SC_R_SHF, rvalue.sh2);
> +
> + return 0;
> +}
> +
> static int adjust_head(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
> {
> swreg tmp = imm_a(nfp_prog), tmp_len = imm_b(nfp_prog);
> @@ -1807,6 +1834,21 @@ static int mul_imm64(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
> return wrp_mul(nfp_prog, meta, true, false);
> }
>
> +static int div_imm64(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
> +{
> + const struct bpf_insn *insn = &meta->insn;
> +
> + return wrp_div_imm(nfp_prog, insn->dst_reg * 2, insn->imm);
> +}
> +
> +static int div_reg64(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
> +{
> + /* NOTE: verifier hook has rejected cases for which verifier doesn't
> + * know whether the source operand is constant or not.
> + */
> + return wrp_div_imm(nfp_prog, meta->insn.dst_reg * 2, meta->umin_src);
> +}
> +
> static int neg_reg64(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
> {
> const struct bpf_insn *insn = &meta->insn;
> @@ -2230,6 +2272,16 @@ static int mul_imm(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
> return wrp_mul(nfp_prog, meta, false, false);
> }
>
> +static int div_reg(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
> +{
> + return div_reg64(nfp_prog, meta);
> +}
> +
> +static int div_imm(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
> +{
> + return div_imm64(nfp_prog, meta);
> +}
> +
> static int neg_reg(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
> {
> u8 dst = meta->insn.dst_reg * 2;
> @@ -2983,6 +3035,8 @@ static const instr_cb_t instr_cb[256] = {
> [BPF_ALU64 | BPF_SUB | BPF_K] = sub_imm64,
> [BPF_ALU64 | BPF_MUL | BPF_X] = mul_reg64,
> [BPF_ALU64 | BPF_MUL | BPF_K] = mul_imm64,
> + [BPF_ALU64 | BPF_DIV | BPF_X] = div_reg64,
> + [BPF_ALU64 | BPF_DIV | BPF_K] = div_imm64,
> [BPF_ALU64 | BPF_NEG] = neg_reg64,
> [BPF_ALU64 | BPF_LSH | BPF_X] = shl_reg64,
> [BPF_ALU64 | BPF_LSH | BPF_K] = shl_imm64,
> @@ -3004,6 +3058,8 @@ static const instr_cb_t instr_cb[256] = {
> [BPF_ALU | BPF_SUB | BPF_K] = sub_imm,
> [BPF_ALU | BPF_MUL | BPF_X] = mul_reg,
> [BPF_ALU | BPF_MUL | BPF_K] = mul_imm,
> + [BPF_ALU | BPF_DIV | BPF_X] = div_reg,
> + [BPF_ALU | BPF_DIV | BPF_K] = div_imm,
> [BPF_ALU | BPF_NEG] = neg_reg,
> [BPF_ALU | BPF_LSH | BPF_K] = shl_imm,
> [BPF_ALU | BPF_END | BPF_X] = end_reg32,
> diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h b/drivers/net/ethernet/netronome/nfp/bpf/main.h
> index c10079b1a312..9845c1a2d4c2 100644
> --- a/drivers/net/ethernet/netronome/nfp/bpf/main.h
> +++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h
> @@ -399,6 +399,11 @@ static inline bool is_mbpf_mul(const struct nfp_insn_meta *meta)
> return is_mbpf_alu(meta) && mbpf_op(meta) == BPF_MUL;
> }
>
> +static inline bool is_mbpf_div(const struct nfp_insn_meta *meta)
> +{
> + return is_mbpf_alu(meta) && mbpf_op(meta) == BPF_DIV;
> +}
> +
> /**
> * struct nfp_prog - nfp BPF program
> * @bpf: backpointer to the bpf app priv structure
> diff --git a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
> index 30d4f1580693..f0f07e988c46 100644
> --- a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
> +++ b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
> @@ -558,6 +558,37 @@ nfp_bpf_check_alu(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta,
> }
> }
>
> + /* NFP doesn't have divide instructions, we support divide by constant
> + * through reciprocal multiplication. Given NFP support multiplication
> + * no bigger than u32, we'd require divisor and dividend no bigger than
> + * that as well.
> + *
> + * Also eBPF doesn't support signed divide and has enforced this on C
> + * language level by failing compilation. However LLVM assembler hasn't
> + * enforced this, so it is possible for negative constant to leak in as
> + * a BPF_K operand through assembly code, we reject such cases as well.
> + */
> + if (is_mbpf_div(meta)) {
> + if (meta->umax_dst > U32_MAX) {
> + pr_vlog(env, "divisor is not within u32 value range\n");
> + return -EINVAL;
> + }
> + if (mbpf_src(meta) == BPF_X) {
> + if (meta->umin_src != meta->umax_src) {
> + pr_vlog(env, "dividend is not constant\n");
> + return -EINVAL;
> + }
> + if (meta->umax_src > U32_MAX) {
> + pr_vlog(env, "dividend is not within u32 value range\n");
> + return -EINVAL;
> + }
> + }
> + if (mbpf_src(meta) == BPF_K && meta->insn.imm < 0) {
> + pr_vlog(env, "divide by negative constant is not supported\n");
> + return -EINVAL;
> + }
> + }
> +
> return 0;
> }
>
> --
> 2.17.1
>
^ permalink raw reply
* Re: [PATCH bpf-next 2/3] bpf: btf: add btf json print functionality
From: Martin KaFai Lau @ 2018-06-26 22:27 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Okash Khawaja, Daniel Borkmann, Alexei Starovoitov, Yonghong Song,
Quentin Monnet, David S. Miller, netdev, kernel-team,
linux-kernel
In-Reply-To: <20180626133133.618af1d3@cakuba.netronome.com>
On Tue, Jun 26, 2018 at 01:31:33PM -0700, Jakub Kicinski wrote:
> On Tue, 26 Jun 2018 17:48:22 +0100, Okash Khawaja wrote:
> > On Fri, Jun 22, 2018 at 05:26:39PM -0700, Martin KaFai Lau wrote:
> > > On Fri, Jun 22, 2018 at 04:32:00PM -0700, Jakub Kicinski wrote:
> > > > On Fri, 22 Jun 2018 15:54:08 -0700, Martin KaFai Lau wrote:
> > > > > > > > > > > > > > "value": ["0x02","0x00","0x00","0x00","0x00","0x00","0x00","0x00"
> > > > > > > > > > > > > > ],
> > > > > > > > > > > > > > "value_struct":{
> > > > > > > > > > > > > > "src_ip":2,
> > > > > > > > > If for the same map the user changes the "src_ip" to an array of int[4]
> > > > > > > > > later (e.g. to support ipv6), it will become "src_ip": [1, 2, 3, 4].
> > > > > > > > > Is it breaking backward compat?
> > > > > > > > > i.e.
> > > > > > > > > struct five_tuples {
> > > > > > > > > - int src_ip;
> > > > > > > > > + int src_ip[4];
> > > > > > > > > /* ... */
> > > > > > > > > };
> > > > > > > >
> > > > > > > > Well, it is breaking backward compat, but it's the program doing it,
> > > > > > > > not bpftool :) BTF changes so does the output.
> > > > > > > As we see, the key/value's btf-output is inherently not backward compat.
> > > > > > > Hence, "-j" and "-p" will stay as is. The whole existing json will
> > > > > > > be backward compat instead of only partly backward compat.
> > > > > >
> > > > > > No. There is a difference between user of a facility changing their
> > > > > > input and kernel/libraries providing different output in response to
> > > > > > that, and the libraries suddenly changing the output on their own.
> > > > > >
> > > > > > Your example is like saying if user started using IPv6 addresses
> > > > > > instead of IPv4 the netlink attributes in dumps will be different so
> > > > > > kernel didn't keep backwards compat. While what you're doing is more
> > > > > > equivalent to dropping support for old ioctl interfaces because there
> > > > > > is a better mechanism now.
> > > > > Sorry, I don't follow this. I don't see netlink suffer json issue like
> > > > > the one on "key" and "value".
> > > > >
> > > > > All I can grasp is, the json should normally be backward compat but now
> > > > > we are saying anything added by btf-output is an exception because
> > > > > the script parsing it will treat it differently than "key" and "value"
> > > >
> > > > Backward compatibility means that if I run *the same* program against
> > > > different kernels/libraries it continues to work. If someone decides
> > > > to upgrade their program to work with IPv6 (which was your example)
> > > > obviously there is no way system as a whole will look 1:1 the same.
> > > >
> > > > > > BTF in JSON is very useful, and will help people who writes simple
> > > > > > orchestration/scripts based on bpftool *a* *lot*. I really appreciate
> > > > > Can you share what the script will do? I want to understand why
> > > > > it cannot directly use the BTF format and the map data.
> > > >
> > > > Think about a python script which wants to read a counter in a map.
> > > > Right now it would have to get the BTF, find out which bytes are the
> > > > counter, then convert the bytes into a larger int. With JSON BTF if
> > > > just does entry["formatted"]["value"]["counter"].
> > > >
> > > > Real life example from my test code (conversion of 3 element counter
> > > > array):
> > > >
> > > > def str2int(strtab):
> > > > inttab = []
> > > > for i in strtab:
> > > > inttab.append(int(i, 16))
> > > > ba = bytearray(inttab)
> > > > if len(strtab) == 4:
> > > > fmt = "I"
> > > > elif len(strtab) == 8:
> > > > fmt = "Q"
> > > > else:
> > > > raise Exception("String array of len %d can't be unpacked to an int" %
> > > > (len(strtab)))
> > > > return struct.unpack(fmt, ba)[0]
> > > >
> > > > def convert(elems, idx):
> > > > val = []
> > > > for i in range(3):
> > > > part = elems[idx]["value"][i * length:(i + 1) * length]
> > > > val.append(str2int(part))
> > > > return val
> > > >
> > > > With BTF it would be:
> > > >
> > > > elems[idx]["formatted"]["value"]
> > > >
> > > > Which is fairly awesome.
> > > Thanks for the example. Agree that with BTF, things are easier in general.
> > >
> > > btw, what more awesome is,
> > > #> bpftool map find id 100 key 1
> > > {
> > > "counter_x": 1,
> > > "counter_y": 10
> > > }
> > >
> > > >
> > > > > > this addition to bpftool and will start using it myself as soon as it
> > > > > > lands. I'm not sure why the reluctance to slightly change the output
> > > > > > format?
> > > > > The initial change argument is because the json has to be backward compat.
> > > > >
> > > > > Then we show that btf-output is inherently not backward compat, so
> > > > > printing it in json does not make sense at all.
> > > > >
> > > > > However, now it is saying part of it does not have to be backward compat.
> > > >
> > > > Compatibility of "formatted" member is defined as -> fields broken out
> > > > according to BTF. So it is backward compatible. The definition of
> > > > "value" member is -> an array of unfortunately formatted array of
> > > > ugly hex strings :(
> > > >
> > > > > I am fine putting it under "formatted" for "-j" or "-p" if that is the
> > > > > case, other than the double output is still confusing. Lets wait for
> > > > > Okash's input.
> > > > >
> > > > > At the same time, the same output will be used as the default plaintext
> > > > > output when BTF is available. Then the plaintext BTF output
> > > > > will not be limited by the json restrictions when we want
> > > > > to improve human readability later. Apparently, the
> > > > > improvements on plaintext will not be always applicable
> > > > > to json output.
> > > >
> >
> > hi,
> >
> > so i guess following is what we want:
> >
> > 1. a "formatted" object nested inside -p and -j switches for bpf map
> > dump. this will be JSON and backward compatible
> > 2. an output for humans - which is like the current output. this will
> > not be JSON. this won't have to be backward compatible. this output will
> > be shown when neither of -j and -p are supplied and btf info is
> > available.
> >
> > i can update the patches to v2 which covers 2 above + all other comments
> > on the patchset. later we can follow up with a patch for 1.
>
> Please do both at the same time. I've learnt not to trust people when
> they say things like "we can follow up with xyz" :( In my experience it
> _always_ backfires.
>
> Implementing both outputs in one series will help you structure your
> code to best suit both of the formats up front.
hex and "formatted" are the only things missing? As always, things
can be refactored when new use case comes up. Lets wait for
Okash input.
Regardless, plaintext is our current use case. Having the current
patchset in does not stop us or others from contributing other use
cases (json, "bpftool map find"...etc), and IMO it is actually
the opposite. Others may help us get there faster than us alone.
We should not stop making forward progress and take this patch
as hostage because "abc" and "xyz" are not done together.
^ permalink raw reply
* Re: [PATCH bpf-next 5/7] nfp: bpf: support u16 and u32 multiplications
From: Song Liu @ 2018-06-26 22:23 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Alexei Starovoitov, Daniel Borkmann, oss-drivers, Networking,
Jiong Wang
In-Reply-To: <20180625035421.2991-6-jakub.kicinski@netronome.com>
On Sun, Jun 24, 2018 at 8:54 PM, Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
> From: Jiong Wang <jiong.wang@netronome.com>
>
> NFP supports u16 and u32 multiplication. Multiplication is done 8-bits per
> step, therefore we need 2 steps for u16 and 4 steps for u32.
>
> We also need one start instruction to initialize the sequence and one or
> two instructions to fetch the result depending on either you need the high
> halve of u32 multiplication.
>
> For ALU64, if either operand is beyond u32's value range, we reject it. One
> thing to note, if the source operand is BPF_K, then we need to check "imm"
> field directly, and we'd reject it if it is negative. Because for ALU64,
> "imm" (with s32 type) is expected to be sign extended to s64 which NFP mul
> doesn't support. For ALU32, it is fine for "imm" be negative though,
> because the result is 32-bits and here is no difference on the low halve
> of result for signed/unsigned mul, so we will get correct result.
>
> Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Song Liu <songliubraving@fb.com>
> ---
> drivers/net/ethernet/netronome/nfp/bpf/jit.c | 137 ++++++++++++++++++
> drivers/net/ethernet/netronome/nfp/bpf/main.h | 5 +
> .../net/ethernet/netronome/nfp/bpf/verifier.c | 58 ++++++--
> drivers/net/ethernet/netronome/nfp/nfp_asm.h | 28 ++++
> 4 files changed, 217 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
> index 4a629e9b5c0f..7d7061d93358 100644
> --- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
> +++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
> @@ -415,6 +415,60 @@ emit_alu(struct nfp_prog *nfp_prog, swreg dst,
> reg.dst_lmextn, reg.src_lmextn);
> }
>
> +static void
> +__emit_mul(struct nfp_prog *nfp_prog, enum alu_dst_ab dst_ab, u16 areg,
> + enum mul_type type, enum mul_step step, u16 breg, bool swap,
> + bool wr_both, bool dst_lmextn, bool src_lmextn)
> +{
> + u64 insn;
> +
> + insn = OP_MUL_BASE |
> + FIELD_PREP(OP_MUL_A_SRC, areg) |
> + FIELD_PREP(OP_MUL_B_SRC, breg) |
> + FIELD_PREP(OP_MUL_STEP, step) |
> + FIELD_PREP(OP_MUL_DST_AB, dst_ab) |
> + FIELD_PREP(OP_MUL_SW, swap) |
> + FIELD_PREP(OP_MUL_TYPE, type) |
> + FIELD_PREP(OP_MUL_WR_AB, wr_both) |
> + FIELD_PREP(OP_MUL_SRC_LMEXTN, src_lmextn) |
> + FIELD_PREP(OP_MUL_DST_LMEXTN, dst_lmextn);
> +
> + nfp_prog_push(nfp_prog, insn);
> +}
> +
> +static void
> +emit_mul(struct nfp_prog *nfp_prog, swreg lreg, enum mul_type type,
> + enum mul_step step, swreg rreg)
> +{
> + struct nfp_insn_ur_regs reg;
> + u16 areg;
> + int err;
> +
> + if (type == MUL_TYPE_START && step != MUL_STEP_NONE) {
> + nfp_prog->error = -EINVAL;
> + return;
> + }
> +
> + if (step == MUL_LAST || step == MUL_LAST_2) {
> + /* When type is step and step Number is LAST or LAST2, left
> + * source is used as destination.
> + */
> + err = swreg_to_unrestricted(lreg, reg_none(), rreg, ®);
> + areg = reg.dst;
> + } else {
> + err = swreg_to_unrestricted(reg_none(), lreg, rreg, ®);
> + areg = reg.areg;
> + }
> +
> + if (err) {
> + nfp_prog->error = err;
> + return;
> + }
> +
> + __emit_mul(nfp_prog, reg.dst_ab, areg, type, step, reg.breg, reg.swap,
> + reg.wr_both, reg.dst_lmextn, reg.src_lmextn);
> +}
> +
> static void
> __emit_ld_field(struct nfp_prog *nfp_prog, enum shf_sc sc,
> u8 areg, u8 bmask, u8 breg, u8 shift, bool imm8,
> @@ -1380,6 +1434,65 @@ static void wrp_end32(struct nfp_prog *nfp_prog, swreg reg_in, u8 gpr_out)
> SHF_SC_R_ROT, 16);
> }
>
> +static void
> +wrp_mul_u32(struct nfp_prog *nfp_prog, swreg dst_hi, swreg dst_lo, swreg lreg,
> + swreg rreg, bool gen_high_half)
> +{
> + emit_mul(nfp_prog, lreg, MUL_TYPE_START, MUL_STEP_NONE, rreg);
> + emit_mul(nfp_prog, lreg, MUL_TYPE_STEP_32x32, MUL_STEP_1, rreg);
> + emit_mul(nfp_prog, lreg, MUL_TYPE_STEP_32x32, MUL_STEP_2, rreg);
> + emit_mul(nfp_prog, lreg, MUL_TYPE_STEP_32x32, MUL_STEP_3, rreg);
> + emit_mul(nfp_prog, lreg, MUL_TYPE_STEP_32x32, MUL_STEP_4, rreg);
> + emit_mul(nfp_prog, dst_lo, MUL_TYPE_STEP_32x32, MUL_LAST, reg_none());
> + if (gen_high_half)
> + emit_mul(nfp_prog, dst_hi, MUL_TYPE_STEP_32x32, MUL_LAST_2,
> + reg_none());
> + else
> + wrp_immed(nfp_prog, dst_hi, 0);
> +}
> +
> +static void
> +wrp_mul_u16(struct nfp_prog *nfp_prog, swreg dst_hi, swreg dst_lo, swreg lreg,
> + swreg rreg)
> +{
> + emit_mul(nfp_prog, lreg, MUL_TYPE_START, MUL_STEP_NONE, rreg);
> + emit_mul(nfp_prog, lreg, MUL_TYPE_STEP_16x16, MUL_STEP_1, rreg);
> + emit_mul(nfp_prog, lreg, MUL_TYPE_STEP_16x16, MUL_STEP_2, rreg);
> + emit_mul(nfp_prog, dst_lo, MUL_TYPE_STEP_16x16, MUL_LAST, reg_none());
> +}
> +
> +static int
> +wrp_mul(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta,
> + bool gen_high_half, bool ropnd_from_reg)
> +{
> + swreg multiplier, multiplicand, dst_hi, dst_lo;
> + const struct bpf_insn *insn = &meta->insn;
> + u32 lopnd_max, ropnd_max;
> + u8 dst_reg;
> +
> + dst_reg = insn->dst_reg;
> + multiplicand = reg_a(dst_reg * 2);
> + dst_hi = reg_both(dst_reg * 2 + 1);
> + dst_lo = reg_both(dst_reg * 2);
> + lopnd_max = meta->umax_dst;
> + if (ropnd_from_reg) {
> + multiplier = reg_b(insn->src_reg * 2);
> + ropnd_max = meta->umax_src;
> + } else {
> + u32 imm = insn->imm;
> +
> + multiplier = re_load_imm_any(nfp_prog, imm, imm_b(nfp_prog));
> + ropnd_max = imm;
> + }
> + if (lopnd_max > U16_MAX || ropnd_max > U16_MAX)
> + wrp_mul_u32(nfp_prog, dst_hi, dst_lo, multiplicand, multiplier,
> + gen_high_half);
> + else
> + wrp_mul_u16(nfp_prog, dst_hi, dst_lo, multiplicand, multiplier);
> +
> + return 0;
> +}
> +
> static int adjust_head(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
> {
> swreg tmp = imm_a(nfp_prog), tmp_len = imm_b(nfp_prog);
> @@ -1684,6 +1797,16 @@ static int sub_imm64(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
> return 0;
> }
>
> +static int mul_reg64(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
> +{
> + return wrp_mul(nfp_prog, meta, true, true);
> +}
> +
> +static int mul_imm64(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
> +{
> + return wrp_mul(nfp_prog, meta, true, false);
> +}
> +
> static int neg_reg64(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
> {
> const struct bpf_insn *insn = &meta->insn;
> @@ -2097,6 +2220,16 @@ static int sub_imm(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
> return wrp_alu32_imm(nfp_prog, meta, ALU_OP_SUB, !meta->insn.imm);
> }
>
> +static int mul_reg(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
> +{
> + return wrp_mul(nfp_prog, meta, false, true);
> +}
> +
> +static int mul_imm(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
> +{
> + return wrp_mul(nfp_prog, meta, false, false);
> +}
> +
> static int neg_reg(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
> {
> u8 dst = meta->insn.dst_reg * 2;
> @@ -2848,6 +2981,8 @@ static const instr_cb_t instr_cb[256] = {
> [BPF_ALU64 | BPF_ADD | BPF_K] = add_imm64,
> [BPF_ALU64 | BPF_SUB | BPF_X] = sub_reg64,
> [BPF_ALU64 | BPF_SUB | BPF_K] = sub_imm64,
> + [BPF_ALU64 | BPF_MUL | BPF_X] = mul_reg64,
> + [BPF_ALU64 | BPF_MUL | BPF_K] = mul_imm64,
> [BPF_ALU64 | BPF_NEG] = neg_reg64,
> [BPF_ALU64 | BPF_LSH | BPF_X] = shl_reg64,
> [BPF_ALU64 | BPF_LSH | BPF_K] = shl_imm64,
> @@ -2867,6 +3002,8 @@ static const instr_cb_t instr_cb[256] = {
> [BPF_ALU | BPF_ADD | BPF_K] = add_imm,
> [BPF_ALU | BPF_SUB | BPF_X] = sub_reg,
> [BPF_ALU | BPF_SUB | BPF_K] = sub_imm,
> + [BPF_ALU | BPF_MUL | BPF_X] = mul_reg,
> + [BPF_ALU | BPF_MUL | BPF_K] = mul_imm,
> [BPF_ALU | BPF_NEG] = neg_reg,
> [BPF_ALU | BPF_LSH | BPF_K] = shl_imm,
> [BPF_ALU | BPF_END | BPF_X] = end_reg32,
> diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h b/drivers/net/ethernet/netronome/nfp/bpf/main.h
> index c985d0ac61a3..c10079b1a312 100644
> --- a/drivers/net/ethernet/netronome/nfp/bpf/main.h
> +++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h
> @@ -394,6 +394,11 @@ static inline bool is_mbpf_xadd(const struct nfp_insn_meta *meta)
> return (meta->insn.code & ~BPF_SIZE_MASK) == (BPF_STX | BPF_XADD);
> }
>
> +static inline bool is_mbpf_mul(const struct nfp_insn_meta *meta)
> +{
> + return is_mbpf_alu(meta) && mbpf_op(meta) == BPF_MUL;
> +}
> +
> /**
> * struct nfp_prog - nfp BPF program
> * @bpf: backpointer to the bpf app priv structure
> diff --git a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
> index 7bd9666bd8ff..30d4f1580693 100644
> --- a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
> +++ b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
> @@ -516,6 +516,51 @@ nfp_bpf_check_xadd(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta,
> return nfp_bpf_check_ptr(nfp_prog, meta, env, meta->insn.dst_reg);
> }
>
> +static int
> +nfp_bpf_check_alu(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta,
> + struct bpf_verifier_env *env)
> +{
> + const struct bpf_reg_state *sreg =
> + cur_regs(env) + meta->insn.src_reg;
> + const struct bpf_reg_state *dreg =
> + cur_regs(env) + meta->insn.dst_reg;
> +
> + meta->umin_src = min(meta->umin_src, sreg->umin_value);
> + meta->umax_src = max(meta->umax_src, sreg->umax_value);
> + meta->umin_dst = min(meta->umin_dst, dreg->umin_value);
> + meta->umax_dst = max(meta->umax_dst, dreg->umax_value);
> +
> + /* NFP supports u16 and u32 multiplication.
> + *
> + * For ALU64, if either operand is beyond u32's value range, we reject
> + * it. One thing to note, if the source operand is BPF_K, then we need
> + * to check "imm" field directly, and we'd reject it if it is negative.
> + * Because for ALU64, "imm" (with s32 type) is expected to be sign
> + * extended to s64 which NFP mul doesn't support.
> + *
> + * For ALU32, it is fine for "imm" be negative though, because the
> + * result is 32-bits and there is no difference on the low halve of
> + * the result for signed/unsigned mul, so we will get correct result.
> + */
> + if (is_mbpf_mul(meta)) {
> + if (meta->umax_dst > U32_MAX) {
> + pr_vlog(env, "multiplier is not within u32 value range\n");
> + return -EINVAL;
> + }
> + if (mbpf_src(meta) == BPF_X && meta->umax_src > U32_MAX) {
> + pr_vlog(env, "multiplicand is not within u32 value range\n");
> + return -EINVAL;
> + }
> + if (mbpf_class(meta) == BPF_ALU64 &&
> + mbpf_src(meta) == BPF_K && meta->insn.imm < 0) {
> + pr_vlog(env, "sign extended multiplicand won't be within u32 value range\n");
> + return -EINVAL;
> + }
> + }
> +
> + return 0;
> +}
> +
> static int
> nfp_verify_insn(struct bpf_verifier_env *env, int insn_idx, int prev_insn_idx)
> {
> @@ -551,17 +596,8 @@ nfp_verify_insn(struct bpf_verifier_env *env, int insn_idx, int prev_insn_idx)
> if (is_mbpf_xadd(meta))
> return nfp_bpf_check_xadd(nfp_prog, meta, env);
>
> - if (is_mbpf_alu(meta)) {
> - const struct bpf_reg_state *sreg =
> - cur_regs(env) + meta->insn.src_reg;
> - const struct bpf_reg_state *dreg =
> - cur_regs(env) + meta->insn.dst_reg;
> -
> - meta->umin_src = min(meta->umin_src, sreg->umin_value);
> - meta->umax_src = max(meta->umax_src, sreg->umax_value);
> - meta->umin_dst = min(meta->umin_dst, dreg->umin_value);
> - meta->umax_dst = max(meta->umax_dst, dreg->umax_value);
> - }
> + if (is_mbpf_alu(meta))
> + return nfp_bpf_check_alu(nfp_prog, meta, env);
>
> return 0;
> }
> diff --git a/drivers/net/ethernet/netronome/nfp/nfp_asm.h b/drivers/net/ethernet/netronome/nfp/nfp_asm.h
> index f6677bc9875a..cdc4e065f6f5 100644
> --- a/drivers/net/ethernet/netronome/nfp/nfp_asm.h
> +++ b/drivers/net/ethernet/netronome/nfp/nfp_asm.h
> @@ -426,4 +426,32 @@ static inline u32 nfp_get_ind_csr_ctx_ptr_offs(u32 read_offset)
> return (read_offset & ~NFP_IND_ME_CTX_PTR_BASE_MASK) | NFP_CSR_CTX_PTR;
> }
>
> +enum mul_type {
> + MUL_TYPE_START = 0x00,
> + MUL_TYPE_STEP_24x8 = 0x01,
> + MUL_TYPE_STEP_16x16 = 0x02,
> + MUL_TYPE_STEP_32x32 = 0x03,
> +};
> +
> +enum mul_step {
> + MUL_STEP_1 = 0x00,
> + MUL_STEP_NONE = MUL_STEP_1,
> + MUL_STEP_2 = 0x01,
> + MUL_STEP_3 = 0x02,
> + MUL_STEP_4 = 0x03,
> + MUL_LAST = 0x04,
> + MUL_LAST_2 = 0x05,
> +};
> +
> +#define OP_MUL_BASE 0x0f800000000ULL
> +#define OP_MUL_A_SRC 0x000000003ffULL
> +#define OP_MUL_B_SRC 0x000000ffc00ULL
> +#define OP_MUL_STEP 0x00000700000ULL
> +#define OP_MUL_DST_AB 0x00000800000ULL
> +#define OP_MUL_SW 0x00040000000ULL
> +#define OP_MUL_TYPE 0x00180000000ULL
> +#define OP_MUL_WR_AB 0x20000000000ULL
> +#define OP_MUL_SRC_LMEXTN 0x40000000000ULL
> +#define OP_MUL_DST_LMEXTN 0x80000000000ULL
> +
> #endif
> --
> 2.17.1
>
^ permalink raw reply
* Re: [PATCH net-next] net: preserve sock reference when scrubbing the skb.
From: Flavio Leitner @ 2018-06-26 22:03 UTC (permalink / raw)
To: Cong Wang
Cc: Eric Dumazet, Linux Kernel Network Developers, Paolo Abeni,
David Miller, Florian Westphal, NetFilter
In-Reply-To: <CAM_iQpU0jp=wvutGbSLzVYX5qQeW0W8ARvR=-gp4MYJqWeee_A@mail.gmail.com>
On Tue, Jun 26, 2018 at 02:48:47PM -0700, Cong Wang wrote:
> On Mon, Jun 25, 2018 at 11:41 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > When a packet is attached to a socket, we should keep the association as much as possible.
>
> As much as possible within one stack, I agree. I still don't understand
> why we should keep it across the stack boundary.
>
> > Only when a new association needs to be done, skb_orphan() needs to be called.
> >
> > Doing this skb_orphan() too soon breaks back pressure in general, this is bad, since a socket
> > can evades SO_SNDBUF limits.
>
> Right before leaving the stack is not too soon, it is the latest
> actually, for veth case.
Depends on how you view things - it's the same host/stack sharing the
same resources, so why should we not keep it?
--
Flavio
^ permalink raw reply
* Re: [PATCH net-next v2 1/3] net: phy: xgmiitorgmii: Check phy_driver ready before accessing
From: Florian Fainelli @ 2018-06-26 21:59 UTC (permalink / raw)
To: Brandon Maier, netdev
Cc: andrew, davem, michal.simek, clayton.shotwell, kristopher.cory,
linux-kernel
In-Reply-To: <20180626175050.71165-1-brandon.maier@rockwellcollins.com>
On 06/26/2018 10:50 AM, Brandon Maier wrote:
> Since a phy_device is added to the global mdio_bus list during
> phy_device_register(), but a phy_device's phy_driver doesn't get
> attached until phy_probe(). It's possible of_phy_find_device() in
> xgmiitorgmii will return a valid phy with a NULL phy_driver. Leading to
> a NULL pointer access during the memcpy().
>
> Fixes this Oops:
>
> Unable to handle kernel NULL pointer dereference at virtual address 00000000
> pgd = c0004000
> [00000000] *pgd=00000000
> Internal error: Oops: 5 [#1] PREEMPT SMP ARM
> Modules linked in:
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.14.40 #1
> Hardware name: Xilinx Zynq Platform
> task: ce4c8d00 task.stack: ce4ca000
> PC is at memcpy+0x48/0x330
> LR is at xgmiitorgmii_probe+0x90/0xe8
> pc : [<c074bc68>] lr : [<c0529548>] psr: 20000013
> sp : ce4cbb54 ip : 00000000 fp : ce4cbb8c
> r10: 00000000 r9 : 00000000 r8 : c0c49178
> r7 : 00000000 r6 : cdc14718 r5 : ce762800 r4 : cdc14710
> r3 : 00000000 r2 : 00000054 r1 : 00000000 r0 : cdc14718
> Flags: nzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
> Control: 18c5387d Table: 0000404a DAC: 00000051
> Process swapper/0 (pid: 1, stack limit = 0xce4ca210)
> ...
> [<c074bc68>] (memcpy) from [<c0529548>] (xgmiitorgmii_probe+0x90/0xe8)
> [<c0529548>] (xgmiitorgmii_probe) from [<c0526a94>] (mdio_probe+0x28/0x34)
> [<c0526a94>] (mdio_probe) from [<c04db98c>] (driver_probe_device+0x254/0x414)
> [<c04db98c>] (driver_probe_device) from [<c04dbd58>] (__device_attach_driver+0xac/0x10c)
> [<c04dbd58>] (__device_attach_driver) from [<c04d96f4>] (bus_for_each_drv+0x84/0xc8)
> [<c04d96f4>] (bus_for_each_drv) from [<c04db5bc>] (__device_attach+0xd0/0x134)
> [<c04db5bc>] (__device_attach) from [<c04dbdd4>] (device_initial_probe+0x1c/0x20)
> [<c04dbdd4>] (device_initial_probe) from [<c04da8fc>] (bus_probe_device+0x98/0xa0)
> [<c04da8fc>] (bus_probe_device) from [<c04d8660>] (device_add+0x43c/0x5d0)
> [<c04d8660>] (device_add) from [<c0526cb8>] (mdio_device_register+0x34/0x80)
> [<c0526cb8>] (mdio_device_register) from [<c0580b48>] (of_mdiobus_register+0x170/0x30c)
> [<c0580b48>] (of_mdiobus_register) from [<c05349c4>] (macb_probe+0x710/0xc00)
> [<c05349c4>] (macb_probe) from [<c04dd700>] (platform_drv_probe+0x44/0x80)
> [<c04dd700>] (platform_drv_probe) from [<c04db98c>] (driver_probe_device+0x254/0x414)
> [<c04db98c>] (driver_probe_device) from [<c04dbc58>] (__driver_attach+0x10c/0x118)
> [<c04dbc58>] (__driver_attach) from [<c04d9600>] (bus_for_each_dev+0x8c/0xd0)
> [<c04d9600>] (bus_for_each_dev) from [<c04db1fc>] (driver_attach+0x2c/0x30)
> [<c04db1fc>] (driver_attach) from [<c04daa98>] (bus_add_driver+0x50/0x260)
> [<c04daa98>] (bus_add_driver) from [<c04dc440>] (driver_register+0x88/0x108)
> [<c04dc440>] (driver_register) from [<c04dd6b4>] (__platform_driver_register+0x50/0x58)
> [<c04dd6b4>] (__platform_driver_register) from [<c0b31248>] (macb_driver_init+0x24/0x28)
> [<c0b31248>] (macb_driver_init) from [<c010203c>] (do_one_initcall+0x60/0x1a4)
> [<c010203c>] (do_one_initcall) from [<c0b00f78>] (kernel_init_freeable+0x15c/0x1f8)
> [<c0b00f78>] (kernel_init_freeable) from [<c0763d10>] (kernel_init+0x18/0x124)
> [<c0763d10>] (kernel_init) from [<c0112d74>] (ret_from_fork+0x14/0x20)
> Code: ba000002 f5d1f03c f5d1f05c f5d1f07c (e8b151f8)
> ---[ end trace 3e4ec21905820a1f ]---
>
> Signed-off-by: Brandon Maier <brandon.maier@rockwellcollins.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
--
Florian
^ permalink raw reply
* Re: [PATCH net-next] net: preserve sock reference when scrubbing the skb.
From: Cong Wang @ 2018-06-26 21:48 UTC (permalink / raw)
To: Eric Dumazet
Cc: Flavio Leitner, Linux Kernel Network Developers, Paolo Abeni,
David Miller, Florian Westphal, NetFilter
In-Reply-To: <48e15faf-f935-0166-e1db-18f7286e7264@gmail.com>
On Mon, Jun 25, 2018 at 11:41 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 06/25/2018 09:15 PM, Cong Wang wrote:
> > On Mon, Jun 25, 2018 at 8:59 AM Flavio Leitner <fbl@redhat.com> wrote:
> >>
> >> The sock reference is lost when scrubbing the packet and that breaks
> >> TSQ (TCP Small Queues) and XPS (Transmit Packet Steering) causing
> >> performance impacts of about 50% in a single TCP stream when crossing
> >> network namespaces.
> >>
> >> XPS breaks because the queue mapping stored in the socket is not
> >> available, so another random queue might be selected when the stack
> >> needs to transmit something like a TCP ACK, or TCP Retransmissions.
> >> That causes packet re-ordering and/or performance issues.
> >>
> >> TSQ breaks because it orphans the packet while it is still in the
> >> host, so packets are queued contributing to the buffer bloat problem.
> >
> > Why should TSQ in one stack care about buffer bloat in another stack?
> >
> > Actually, I think the current behavior is correct, once the packet leaves
> > its current stack (or netns), it should relief the backpressure on TCP
> > socket in this stack, whether it will be queued in another stack is beyond
> > its concern. This breaks the isolation between networking stacks.
> >
>
> We discussed about this during netconf Cong, nobody was against this planned removal.
I agreed to keep skb->sk, but didn't realize it actually impacts TSQ too.
>
> When a packet is attached to a socket, we should keep the association as much as possible.
As much as possible within one stack, I agree. I still don't understand
why we should keep it across the stack boundary.
>
> Only when a new association needs to be done, skb_orphan() needs to be called.
>
> Doing this skb_orphan() too soon breaks back pressure in general, this is bad, since a socket
> can evades SO_SNDBUF limits.
Right before leaving the stack is not too soon, it is the latest
actually, for veth case.
^ permalink raw reply
* Re: [PATCH] selftests: bpf: config: add config fragments
From: William Tu @ 2018-06-26 21:27 UTC (permalink / raw)
To: Anders Roxell
Cc: Daniel Borkmann, Alexei Starovoitov, Shuah Khan, Networking,
Linux Kernel Mailing List, open list:KERNEL SELFTEST FRAMEWORK
In-Reply-To: <CADYN=9JAOim7CvoZzWnbHNEF+4yG3Lzvry1rZLqGp+7hQxPdkw@mail.gmail.com>
>> >
>> > --- ::11 ping statistics ---
>> > 5 packets transmitted, 3 packets received, 40% packet loss
>> > round-trip min/avg/max = 0.139/1.857/5.293 ms
>> > + ip netns exec at_ns0 ping -c 3 -w 10 -q 10.1.1.200
>> > PING 10.1.1.200 (10.1.1.200): 56 data bytes
>> >
>> > --- 10.1.1.200 ping statistics ---
>> > 3 packets transmitted, 3 packets received, 0% packet loss
>> > round-trip min/avg/max = 0.214/0.256/0.305 ms
>> > + ping -c 3 -w 10 -q 10.1.1.100
>> > PING 10.1.1.100 (10.1.1.100): 56 data bytes
>> >
>> > --- 10.1.1.100 ping statistics ---
>> > 3 packets transmitted, 3 packets received, 0% packet loss
>> > round-trip min/avg/max = 0.210/0.211/0.213 ms
>> > + check_err 0
>> > + '[' 0 -eq 0 ']'
>> > + ret=0
So looks like the ipv4 over ipv6 gre passes.
>> > + ip netns exec at_ns0 ping6 -c 3 -w 10 -q fc80::200
>> > PING fc80::200 (fc80::200): 56 data bytes
>> >
>> > --- fc80::200 ping statistics ---
>> > 10 packets transmitted, 0 packets received, 100% packet loss
`
but the ipv6 over ipv6 gre fails.
Do you have any firewall rules that block this traffic?
or if possible, the packet might get dropped at function ip6gre_xmit_ipv6
can you print the return value of this function?
>> > + check_err 1
>> > + '[' 0 -eq 0 ']'
>> > + ret=1
>> > + ip -s link show ip6gretap11
>> > 19: ip6gretap11@NONE: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1434 qdisc
>> > pfifo_fast state UNKNOWN mode DEFAULT group default qlen 1000
>> > link/ether de:d2:0c:53:80:8c brd ff:ff:ff:ff:ff:ff
>> > RX: bytes packets errors dropped overrun mcast
>> > 2096 25 0 0 0 0
>> > TX: bytes packets errors dropped carrier collsns
>> > 5324 36 5 5 0 0
>>
>> So there are 5 errors at TX.
>
> and today when I tried it on next-20180620 I saw 8 errors at TX.
>
>> I couldn't reproduce in my local machine using 4.17-rc6.
>> How do I checkin the "next-20180613" source code?
>
> You can find the source code here [1], and I would look in the latest tag that I
> said that I was able to reproduce it on above.
>
> Cheers,
> Anders
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/
Hi Anders,
I'm still not able to reproduce the issue on next-20180620.
Below is my test.
Testing IP6GRETAP tunnel...
PING ::11(::11) 56 data bytes
--- ::11 ping statistics ---
5 packets transmitted, 3 received, 40% packet loss, time 4048ms
rtt min/avg/max/mdev = 0.040/32.118/96.235/45.337 ms
PING 10.1.1.200 (10.1.1.200) 56(84) bytes of data.
--- 10.1.1.200 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2026ms
rtt min/avg/max/mdev = 0.074/0.099/0.117/0.018 ms
PING 10.1.1.100 (10.1.1.100) 56(84) bytes of data.
--- 10.1.1.100 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2054ms
rtt min/avg/max/mdev = 0.069/0.113/0.187/0.052 ms
PING fc80::200(fc80::200) 56 data bytes
--- fc80::200 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2054ms
rtt min/avg/max/mdev = 0.069/0.104/0.142/0.031 ms
PASS: ip6gretap
root@osboxes:~/linux-next/tools/testing/selftests/bpf#
root@osboxes:~/linux-next/tools/testing/selftests/bpf# uname -a
Linux osboxes 4.18.0-rc1-next-20180620 #1 SMP Tue Jun 26 12:26:00 PDT
2018 x86_64 x86_64 x86_64 GNU/Linux
^ permalink raw reply
* Re: [PATCH v3 net-next 3/4] netdevsim: add ipsec offload testing
From: Jakub Kicinski @ 2018-06-26 21:23 UTC (permalink / raw)
To: Shannon Nelson; +Cc: davem, netdev, anders.roxell, linux-kselftest
In-Reply-To: <1530032875-30482-4-git-send-email-shannon.nelson@oracle.com>
On Tue, 26 Jun 2018 10:07:54 -0700, Shannon Nelson wrote:
> Implement the IPsec/XFRM offload API for testing.
>
> Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
> ---
> V2 - addressed formatting comments from Jakub Kicinski
> V3 - a couple more little xmas tree nits
Thank you! :)
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
^ permalink raw reply
* [PATCH bpf-next] selftests/bpf: Test sys_connect BPF hooks with TFO
From: Andrey Ignatov @ 2018-06-26 21:22 UTC (permalink / raw)
To: netdev; +Cc: Andrey Ignatov, ast, daniel, kernel-team
TCP Fast Open is triggered by sys_sendmsg with MSG_FASTOPEN flag for
SOCK_STREAM socket.
Even though it's sys_sendmsg, it eventually calls __inet_stream_connect
the same way sys_connect does for TCP. __inet_stream_connect, in turn,
already has BPF hooks for sys_connect.
That means TFO is already covered by BPF_CGROUP_INET{4,6}_CONNECT and
the only missing piece is selftest. The patch adds selftest for TFO.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
---
tools/testing/selftests/bpf/test_sock_addr.c | 37 ++++++++++++++++----
1 file changed, 31 insertions(+), 6 deletions(-)
diff --git a/tools/testing/selftests/bpf/test_sock_addr.c b/tools/testing/selftests/bpf/test_sock_addr.c
index a5e76b9219b9..2e45c92d1111 100644
--- a/tools/testing/selftests/bpf/test_sock_addr.c
+++ b/tools/testing/selftests/bpf/test_sock_addr.c
@@ -998,8 +998,9 @@ int init_pktinfo(int domain, struct cmsghdr *cmsg)
return 0;
}
-static int sendmsg_to_server(const struct sockaddr_storage *addr,
- socklen_t addr_len, int set_cmsg, int *syscall_err)
+static int sendmsg_to_server(int type, const struct sockaddr_storage *addr,
+ socklen_t addr_len, int set_cmsg, int flags,
+ int *syscall_err)
{
union {
char buf[CMSG_SPACE(sizeof(struct in6_pktinfo))];
@@ -1022,7 +1023,7 @@ static int sendmsg_to_server(const struct sockaddr_storage *addr,
goto err;
}
- fd = socket(domain, SOCK_DGRAM, 0);
+ fd = socket(domain, type, 0);
if (fd == -1) {
log_err("Failed to create client socket");
goto err;
@@ -1052,7 +1053,7 @@ static int sendmsg_to_server(const struct sockaddr_storage *addr,
}
}
- if (sendmsg(fd, &hdr, 0) != sizeof(data)) {
+ if (sendmsg(fd, &hdr, flags) != sizeof(data)) {
log_err("Fail to send message to server");
*syscall_err = errno;
goto err;
@@ -1066,6 +1067,15 @@ static int sendmsg_to_server(const struct sockaddr_storage *addr,
return fd;
}
+static int fastconnect_to_server(const struct sockaddr_storage *addr,
+ socklen_t addr_len)
+{
+ int sendmsg_err;
+
+ return sendmsg_to_server(SOCK_STREAM, addr, addr_len, /*set_cmsg*/0,
+ MSG_FASTOPEN, &sendmsg_err);
+}
+
static int recvmsg_from_client(int sockfd, struct sockaddr_storage *src_addr)
{
struct timeval tv;
@@ -1185,6 +1195,20 @@ static int run_connect_test_case(const struct sock_addr_test *test)
if (cmp_local_ip(clientfd, &expected_src_addr))
goto err;
+ if (test->type == SOCK_STREAM) {
+ /* Test TCP Fast Open scenario */
+ clientfd = fastconnect_to_server(&requested_addr, addr_len);
+ if (clientfd == -1)
+ goto err;
+
+ /* Make sure src and dst addrs were overridden properly */
+ if (cmp_peer_addr(clientfd, &expected_addr))
+ goto err;
+
+ if (cmp_local_ip(clientfd, &expected_src_addr))
+ goto err;
+ }
+
goto out;
err:
err = -1;
@@ -1222,8 +1246,9 @@ static int run_sendmsg_test_case(const struct sock_addr_test *test)
if (clientfd >= 0)
close(clientfd);
- clientfd = sendmsg_to_server(&requested_addr, addr_len,
- set_cmsg, &err);
+ clientfd = sendmsg_to_server(test->type, &requested_addr,
+ addr_len, set_cmsg, /*flags*/0,
+ &err);
if (err)
goto out;
else if (clientfd == -1)
--
2.17.1
^ permalink raw reply related
* [PATCH ipsec-next 1/1] xfrm: don't check offload_handle for nonzero
From: Shannon Nelson @ 2018-06-26 21:19 UTC (permalink / raw)
To: steffen.klassert; +Cc: netdev
The offload_handle should be an opaque data cookie for the driver
to use, much like the data cookie for a timer or alarm callback.
Thus, the XFRM stack should not be checking for non-zero, because
the driver might use that to store an array reference, which could
be zero, or some other zero but meaningful value.
We can remove the checks for non-zero because there are plenty
other attributes also being checked to see if there is an offload
in place for the SA in question.
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
---
net/ipv4/esp4_offload.c | 6 ++----
net/ipv6/esp6_offload.c | 6 ++----
net/xfrm/xfrm_device.c | 6 +++---
3 files changed, 7 insertions(+), 11 deletions(-)
diff --git a/net/ipv4/esp4_offload.c b/net/ipv4/esp4_offload.c
index bbeecd1..58834a1 100644
--- a/net/ipv4/esp4_offload.c
+++ b/net/ipv4/esp4_offload.c
@@ -135,8 +135,7 @@ static struct sk_buff *esp4_gso_segment(struct sk_buff *skb,
skb->encap_hdr_csum = 1;
- if (!(features & NETIF_F_HW_ESP) || !x->xso.offload_handle ||
- (x->xso.dev != skb->dev))
+ if (!(features & NETIF_F_HW_ESP) || x->xso.dev != skb->dev)
esp_features = features & ~(NETIF_F_SG | NETIF_F_CSUM_MASK);
else if (!(features & NETIF_F_HW_ESP_TX_CSUM))
esp_features = features & ~NETIF_F_CSUM_MASK;
@@ -179,8 +178,7 @@ static int esp_xmit(struct xfrm_state *x, struct sk_buff *skb, netdev_features_
if (!xo)
return -EINVAL;
- if (!(features & NETIF_F_HW_ESP) || !x->xso.offload_handle ||
- (x->xso.dev != skb->dev)) {
+ if (!(features & NETIF_F_HW_ESP) || x->xso.dev != skb->dev) {
xo->flags |= CRYPTO_FALLBACK;
hw_offload = false;
}
diff --git a/net/ipv6/esp6_offload.c b/net/ipv6/esp6_offload.c
index ddfa533..6177e21 100644
--- a/net/ipv6/esp6_offload.c
+++ b/net/ipv6/esp6_offload.c
@@ -162,8 +162,7 @@ static struct sk_buff *esp6_gso_segment(struct sk_buff *skb,
skb->encap_hdr_csum = 1;
- if (!(features & NETIF_F_HW_ESP) || !x->xso.offload_handle ||
- (x->xso.dev != skb->dev))
+ if (!(features & NETIF_F_HW_ESP) || x->xso.dev != skb->dev)
esp_features = features & ~(NETIF_F_SG | NETIF_F_CSUM_MASK);
else if (!(features & NETIF_F_HW_ESP_TX_CSUM))
esp_features = features & ~NETIF_F_CSUM_MASK;
@@ -207,8 +206,7 @@ static int esp6_xmit(struct xfrm_state *x, struct sk_buff *skb, netdev_features
if (!xo)
return -EINVAL;
- if (!(features & NETIF_F_HW_ESP) || !x->xso.offload_handle ||
- (x->xso.dev != skb->dev)) {
+ if (!(features & NETIF_F_HW_ESP) || x->xso.dev != skb->dev) {
xo->flags |= CRYPTO_FALLBACK;
hw_offload = false;
}
diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 175941e..9265dd6 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -56,7 +56,7 @@ struct sk_buff *validate_xmit_xfrm(struct sk_buff *skb, netdev_features_t featur
if (skb_is_gso(skb)) {
struct net_device *dev = skb->dev;
- if (unlikely(!x->xso.offload_handle || (x->xso.dev != dev))) {
+ if (unlikely(x->xso.dev != dev)) {
struct sk_buff *segs;
/* Packet got rerouted, fixup features and segment it. */
@@ -210,8 +210,8 @@ bool xfrm_dev_offload_ok(struct sk_buff *skb, struct xfrm_state *x)
if (!x->type_offload || x->encap)
return false;
- if ((!dev || (x->xso.offload_handle && (dev == xfrm_dst_path(dst)->dev))) &&
- (!xdst->child->xfrm && x->type->get_mtu)) {
+ if ((!dev || (dev == xfrm_dst_path(dst)->dev)) &&
+ (!xdst->child->xfrm && x->type->get_mtu)) {
mtu = x->type->get_mtu(x, xdst->child_mtu_cached);
if (skb->len <= mtu)
--
2.7.4
^ permalink raw reply related
* Re: [patch net-next 0/9] net: sched: introduce chain templates support with offloading to mlxsw
From: Jakub Kicinski @ 2018-06-26 21:18 UTC (permalink / raw)
To: Jiri Pirko
Cc: Linux Netdev List, David Miller, Jamal Hadi Salim, Cong Wang,
Simon Horman, John Hurley, David Ahern, mlxsw
In-Reply-To: <20180626071217.GR2161@nanopsycho>
On Tue, 26 Jun 2018 09:12:17 +0200, Jiri Pirko wrote:
> Tue, Jun 26, 2018 at 09:00:45AM CEST, jakub.kicinski@netronome.com wrote:
> >On Mon, Jun 25, 2018 at 11:43 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> >> Tue, Jun 26, 2018 at 06:58:50AM CEST, jakub.kicinski@netronome.com wrote:
> >>>On Mon, 25 Jun 2018 23:01:39 +0200, Jiri Pirko wrote:
> >>>> From: Jiri Pirko <jiri@mellanox.com>
> >>>>
> >>>> For the TC clsact offload these days, some of HW drivers need
> >>>> to hold a magic ball. The reason is, with the first inserted rule inside
> >>>> HW they need to guess what fields will be used for the matching. If
> >>>> later on this guess proves to be wrong and user adds a filter with a
> >>>> different field to match, there's a problem. Mlxsw resolves it now with
> >>>> couple of patterns. Those try to cover as many match fields as possible.
> >>>> This aproach is far from optimal, both performance-wise and scale-wise.
> >>>> Also, there is a combination of filters that in certain order won't
> >>>> succeed.
> >>>>
> >>>> Most of the time, when user inserts filters in chain, he knows right away
> >>>> how the filters are going to look like - what type and option will they
> >>>> have. For example, he knows that he will only insert filters of type
> >>>> flower matching destination IP address. He can specify a template that
> >>>> would cover all the filters in the chain.
> >>>
> >>>Perhaps it's lack of sleep, but this paragraph threw me a little off
> >>>the track. IIUC the goal of this set is to provide a way to inform the
> >>>HW about expected matches before any rule is programmed into the HW.
> >>>Not before any rule is added to a particular chain. One can just use
> >>>the first rule in the chain to make a guess about the chain, but thanks
> >>>to this set user can configure *all* chains before any rules are added.
> >>
> >> The template is per-chain. User can use template for chain x and
> >> not-use it for chain y. Up to him.
> >
> >Makes sense.
> >
> >I can't help but wonder if it'd be better to associate the
> >constraints/rules with chains instead of creating a new "template"
> >object. It seems more natural to create a chain with specific
> >constraints in place than add and delete template of which there can
> >be at most one to a chain... Perhaps that's more about the user space
> >tc command line. Anyway, not a strong objection, just a thought.
>
> Hmm. I don't think it is good idea. User should see the template in a
> "show" command per chain. We would have to have 2 show commands, one to
> list the template objects and one to list templates per chains. It makes
> things more complicated for no good reason. I think that this simple
> chain-lock is easier and serves the purpose.
Hm, I think the dump is fine, what I was thinking about was:
# tc chain add dev dummy0 ingress chain_index 22 \
^^^^^
template proto ip \
^^^^^^^^
flower dst_mac 00:00:00:00:00:00/00:00:00:00:FF:FF
instead of:
# tc filter template add dev dummy0 ingress \
^^^^^^^^^^^^^^^
proto ip chain_index 22 \
flower dst_mac 00:00:00:00:00:00/00:00:00:00:FF:FF
And then delete becomes:
# tc chain del dev dummy0 ingress chain_index 22
Error: The chain is not empty.
The fact that template is very much like a filter is sort of an
implementation detail, from user perspective it may be more intuitive
to model template as an attribute of the chain, not a filter object
added to a chain.
But I could well be the only person who feels that way :)
^ permalink raw reply
* Re: [PATCH bpf-next 7/7] nfp: bpf: migrate to advanced reciprocal divide in reciprocal_div.h
From: Jakub Kicinski @ 2018-06-26 20:59 UTC (permalink / raw)
To: Jiong Wang; +Cc: alexei.starovoitov, daniel, oss-drivers, netdev
In-Reply-To: <20180625035421.2991-8-jakub.kicinski@netronome.com>
On Sun, 24 Jun 2018 20:54:21 -0700, Jakub Kicinski wrote:
> + * NOTE: because we are using "reciprocal_value_adv" which doesn't
> + * support dividend with MSB set, so we need to JIT separate NFP
> + * sequence to handle such case. It could be a simple sequence if there
> + * is conditional move, however there isn't for NFP. So, we don't bother
> + * generating compare-if-set-branch sequence by rejecting the program
> + * straight away when the u32 dividend has MSB set. Divide by such a
> + * large constant would be rare in practice. Also, the programmer could
> + * simply rewrite it as "result = divisor >= the_const".
Thinking about this again, can we just use carry bit? The code may end
up shorter than the explanation why we don't support that case :P
immed[c, 0]
alu[--, a, -, b]
alu[c, c, +carry, 0]
Should be equivalent to:
c = a >= b
(Thanks to Edwin for double-checking the carry semantics.)
^ permalink raw reply
* Re: [PATCH bpf-next 2/7] lib: reciprocal_div: implement the improved algorithm on the paper mentioned
From: Jakub Kicinski @ 2018-06-26 20:52 UTC (permalink / raw)
To: Song Liu
Cc: Alexei Starovoitov, Daniel Borkmann, oss-drivers, Networking,
Jiong Wang
In-Reply-To: <CAPhsuW4P5J1ZgUVKGfwWs7vX_7thc5cr1x6mrY8Wx3d2dqEDTw@mail.gmail.com>
On Mon, 25 Jun 2018 23:21:10 -0700, Song Liu wrote:
> > +struct reciprocal_value_adv reciprocal_value_adv(u32 d, u8 prec)
> > +{
> > + struct reciprocal_value_adv R;
> > + u32 l, post_shift;
> > + u64 mhigh, mlow;
> > +
> > + l = fls(d - 1);
> > + post_shift = l;
> > + /* NOTE: mlow/mhigh could overflow u64 when l == 32 which means d has
> > + * MSB set. This case needs to be handled before calling
> > + * "reciprocal_value_adv", please see the comment at
> > + * include/linux/reciprocal_div.h.
> > + */
>
> Shall we handle l == 32 case better? I guess the concern here is extra
> handling may slow down the fast path? If that's the case, we should
> at least add a WARNING on the slow path.
Agreed, I think Jiong is travelling, hence no response. We'll respin.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox