* Re: [PATCH v3] powerpc: Implement csum_ipv6_magic in assembly
From: Christophe LEROY @ 2018-05-24 6:20 UTC
To: Segher Boessenkool
Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
linux-kernel, linuxppc-dev, netdev
In-Reply-To: <20180523183447.GV17342@gate.crashing.org>
On 23/05/2018 at 20:34, Segher Boessenkool wrote:
> On Tue, May 22, 2018 at 08:57:01AM +0200, Christophe Leroy wrote:
>> The generic csum_ipv6_magic() generates a pretty bad result
>
> <snip>
>
> Please try with a more recent compiler, what you used is pretty ancient.
> It's not like recent compilers do great on this either, but it's not
> *that* bad anymore ;-)
>
>> --- a/arch/powerpc/lib/checksum_32.S
>> +++ b/arch/powerpc/lib/checksum_32.S
>> @@ -293,3 +293,36 @@ dst_error:
>> EX_TABLE(51b, dst_error);
>>
>> EXPORT_SYMBOL(csum_partial_copy_generic)
>> +
>> +/*
>> + * static inline __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
>> + * const struct in6_addr *daddr,
>> + * __u32 len, __u8 proto, __wsum sum)
>> + */
>> +
>> +_GLOBAL(csum_ipv6_magic)
>> + lwz r8, 0(r3)
>> + lwz r9, 4(r3)
>> + lwz r10, 8(r3)
>> + lwz r11, 12(r3)
>> + addc r0, r5, r6
>> + adde r0, r0, r7
>> + adde r0, r0, r8
>> + adde r0, r0, r9
>> + adde r0, r0, r10
>> + adde r0, r0, r11
>> + lwz r8, 0(r4)
>> + lwz r9, 4(r4)
>> + lwz r10, 8(r4)
>> + lwz r11, 12(r4)
>> + adde r0, r0, r8
>> + adde r0, r0, r9
>> + adde r0, r0, r10
>> + adde r0, r0, r11
>> + addze r0, r0
>> + rotlwi r3, r0, 16
>> + add r3, r0, r3
>> + not r3, r3
>> + rlwinm r3, r3, 16, 16, 31
>> + blr
>> +EXPORT_SYMBOL(csum_ipv6_magic)
>
> Clustering the loads and carry insns together is pretty much the worst you
> can do on most 32-bit CPUs.
Oh, really? __csum_partial is written that way too.
Right, I have now tried interleaving the lwz and adde instructions. I get
no improvement at all on an 885, but a 15% improvement on an 8321.
Christophe
>
>
> Segher
>
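For reference, the computation performed by the assembly routine quoted above can be sketched in portable C. This is only an illustration, not the kernel's implementation: the function names are made up here, the address words are assumed to be passed already loaded as native 32-bit values (as lwz would load them on big-endian PowerPC), and a 64-bit accumulator stands in for the addc/adde/addze carry chain.

```c
#include <stdint.h>

/*
 * Fold a 32-bit ones' complement sum to 16 bits and invert it.
 * Mirrors the rotlwi / add / not / rlwinm tail of the assembly:
 * adding the sum to itself rotated by 16 leaves the folded result,
 * including the end-around carry, in the upper half-word.
 */
static uint16_t fold_and_invert(uint32_t sum)
{
	uint32_t rotated = (sum << 16) | (sum >> 16);
	uint32_t folded = sum + rotated;

	return (uint16_t)(~folded >> 16);
}

/*
 * Hypothetical C counterpart of csum_ipv6_magic(): sum len, proto, the
 * initial checksum and the eight address words with end-around carry.
 */
static uint16_t csum_ipv6_magic_sketch(const uint32_t saddr[4],
				       const uint32_t daddr[4],
				       uint32_t len, uint8_t proto,
				       uint32_t sum)
{
	uint64_t acc = (uint64_t)len + proto + sum;
	int i;

	for (i = 0; i < 4; i++)
		acc += (uint64_t)saddr[i] + daddr[i];

	/* Fold the carries back into the low 32 bits (done twice, since
	 * the first fold can itself carry). */
	acc = (acc & 0xffffffffu) + (acc >> 32);
	acc = (acc & 0xffffffffu) + (acc >> 32);

	return fold_and_invert((uint32_t)acc);
}
```

The sketch says nothing about instruction scheduling, which is what the interleaving discussion is about; it only pins down the arithmetic result that any reordering of the loads and adds has to preserve.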
* Re: STMMAC driver with TSO enabled issue
From: Bhadram Varka @ 2018-05-24 5:58 UTC
To: Jose Abreu, netdev@vger.kernel.org, Joao Pinto
In-Reply-To: <a7c17a56-dc40-2bd2-b621-cf73db50cd6e@synopsys.com>
Hi Jose,
On 5/17/2018 7:43 PM, Jose Abreu wrote:
> Hi Bhadram,
>
> On 15-05-2018 07:44, Bhadram Varka wrote:
>> Hi Jose,
>>
>> On 5/10/2018 9:15 PM, Jose Abreu wrote:
>>>
>>>
>>> On 10-05-2018 16:08, Bhadram Varka wrote:
>>>> Hi Jose,
>>>>
>>>> On 5/10/2018 7:59 PM, Jose Abreu wrote:
>>>>> Hi Bhadram,
>>>>>
>>>>> On 10-05-2018 09:55, Jose Abreu wrote:
>>>>>> ++net-dev
>>>>>>
>>>>>> Hi Bhadram,
>>>>>>
>>>>>> On 09-05-2018 12:03, Bhadram Varka wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Thanks for responding.
>>>>>>>
>>>>>>> Tried below suggested way. Still observing the issue -
>>>>>> It seems stmmac has a bug in the RX side when using TSO
>>>>>> which is
>>>>>> causing all the RX descriptors to be consumed. The stmmac_rx()
>>>>>> function will need to be refactored. I will send a fix ASAP.
>>>>>
>>>>> Are you using this patch [1] ? Because there is a problem with
>>>>> the patch. By adding the previously removed call to
>>>>> stmmac_init_rx_desc() TSO works okay in my setup.
>>>>>
>>>>
>>>> No. I don't have this change in my code base. I am using
>>>> net-next tree.
>>>>
>>>> Can you please post the change for which TSO works ? I can help
>>>> you with the testing.
>>>
>>> It should work with net-next because patch was not merged yet ...
>>> Please send me the output of "dmesg | grep -i stmmac", since boot
>>> and your full register values (from 0x0 to 0x12E4).
>>>
>>
>> [root@alarm ~]# dmesg | grep -i dwc
>> [ 6.925005] dwc-eth-dwmac 2490000.ethernet: Cannot get CSR
>> clock
>> [ 6.933657] dwc-eth-dwmac 2490000.ethernet: no reset control
>> found
>> [ 6.955325] dwc-eth-dwmac 2490000.ethernet: User ID: 0x10,
>> Synopsys ID: 0x41
>> [ 6.962379] dwc-eth-dwmac 2490000.ethernet: DWMAC4/5
>> [ 6.967434] dwc-eth-dwmac 2490000.ethernet: DMA HW
>> capability register supported
>> [ 6.974827] dwc-eth-dwmac 2490000.ethernet: RX Checksum
>> Offload Engine supported
>> [ 6.982915] dwc-eth-dwmac 2490000.ethernet: TX Checksum
>> insertion supported
>> [ 6.991235] dwc-eth-dwmac 2490000.ethernet: Wake-Up On Lan
>> supported
>> [ 6.998974] dwc-eth-dwmac 2490000.ethernet: TSO supported
>> [ 7.006422] dwc-eth-dwmac 2490000.ethernet: TSO feature enabled
>> [ 7.012581] dwc-eth-dwmac 2490000.ethernet: Enable RX
>> Mitigation via HW Watchdog Timer
>> [ 7.236391] dwc-eth-dwmac 2490000.ethernet eth0: device MAC
>> address 4a:d1:e3:58:cb:7a
>> [ 7.333414] dwc-eth-dwmac 2490000.ethernet eth0: IEEE
>> 1588-2008 Advanced Timestamp supported
>> [ 7.342441] dwc-eth-dwmac 2490000.ethernet eth0: registered
>> PTP clock
>> [ 10.157066] dwc-eth-dwmac 2490000.ethernet eth0: Link is Up
>> - 1Gbps/Full - flow control off
>> [root@alarm ~]# dmesg | grep -i stmma
>> [ 7.020567] libphy: stmmac: probed
>> [ 7.316295] Broadcom BCM89610 stmmac-0:00: attached PHY
>> driver [Broadcom BCM89610] (mii_bus:phy_addr=stmmac-0:00, irq=64)
>>
>> I will get the register details -
>>
>> FYI - TSO works fine with single channel. I see the issue only
>> if multi channel enabled (supports 4 Tx/Rx channels).
>>
>
> And normal data transfer works okay with multi channel, right? I
> will need the register details to proceed ... You could also try
> git bisect ...
>
Yes - normal data transfers work fine. The issue is observed only when the
driver gets a TSO packet. It looks like a TX DMA channel hang.
After adding a few debug logs, I observed that the TX DMA hangs while
processing the second or third descriptor.
[85788.137498] stmmac_tso_xmit: tcphdrlen 32, hdr_len 66, pay_len 1392,
mss 1448
[85788.144634] skb->len 7306, skb->data_len 5848
[..]
[85788.274876] 025 [0x82795190]: 0x0 0x0 0x5a8 0xc4000000
[85788.280020] 026 [0x827951a0]: 0xf854e000 0xf854e042 0x5700042 0xa0441c48
[85788.286730] 027 [0x827951b0]: 0xf854f000 0x0 0x16d8 0x90000000
[...]
After some time, if I check the Tx descriptor status, I see only the entry below:
[..]
[85788.286730] 027 [0x827951b0]: 0xf854f000 0x0 0x16d8 0x90000000
The descriptors at index 025 and 026 were processed, but not the one at index 027.
At this stage the Tx DMA is always in the state below -
■ 3'b011: Running (Reading Data from system memory
buffer and queuing it to the Tx buffer (Tx FIFO))
Thanks,
Bhadram.
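As a cross-check of the numbers in the debug log above, the lengths can be tied together with a little arithmetic. The sketch below is hypothetical helper code, not part of the driver: the struct and function names are invented for illustration, and the TDES2 field positions (buffer 1 length in bits 13:0, buffer 2 length in bits 29:16) are an assumption about the DWMAC4 descriptor layout.

```c
#include <stdint.h>

/* Values copied from the stmmac_tso_xmit debug line in the log. */
struct tso_log {
	uint32_t skb_len;	/* skb->len */
	uint32_t data_len;	/* skb->data_len (paged part) */
	uint32_t hdr_len;	/* Ethernet + IP + TCP headers */
	uint32_t mss;
};

/* Bytes in the linear (non-paged) part of the skb. */
static uint32_t linear_len(const struct tso_log *l)
{
	return l->skb_len - l->data_len;
}

/* Payload bytes that sit in the linear part, after the headers. */
static uint32_t linear_payload(const struct tso_log *l)
{
	return linear_len(l) - l->hdr_len;
}

/* Total TCP payload that has to be segmented. */
static uint32_t tcp_payload(const struct tso_log *l)
{
	return l->skb_len - l->hdr_len;
}

/* Number of MSS-sized segments, rounding up. */
static uint32_t segment_count(const struct tso_log *l)
{
	return (tcp_payload(l) + l->mss - 1) / l->mss;
}

/* Assumed TDES2 layout: buffer 1 length in bits 13:0. */
static uint32_t tdes2_buf1_len(uint32_t des2)
{
	return des2 & 0x3fff;
}

/* Assumed TDES2 layout: buffer 2 length in bits 29:16. */
static uint32_t tdes2_buf2_len(uint32_t des2)
{
	return (des2 >> 16) & 0x3fff;
}
```

With the logged values (skb->len 7306, data_len 5848, hdr_len 66, mss 1448) the linear payload comes out as 1392, matching pay_len, and the TCP payload is 7240 bytes, i.e. exactly 5 segments. Under the assumed layout, the third word of descriptor 026 (0x5700042) decodes to 66 and 1392, and the third word of descriptor 027 (0x16d8) equals data_len. So the descriptor contents look consistent with the skb; the sketch does not explain why the DMA stalls on index 027.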
* Re: [PATCH net-next v2 0/2] net: phy: improve PHY suspend/resume
From: Heiner Kallweit @ 2018-05-24 5:52 UTC
To: Andrew Lunn; +Cc: Florian Fainelli, David Miller, netdev@vger.kernel.org
In-Reply-To: <20180523220418.GB5128@lunn.ch>
On 24.05.2018 at 00:04, Andrew Lunn wrote:
> On Wed, May 23, 2018 at 10:15:29PM +0200, Heiner Kallweit wrote:
>> I have the issue that suspending the MAC-integrated PHY gives an
>> error during system suspend. The sequence is:
>>
>> 1. unconnected PHY/MAC are runtime-suspended already
>> 2. system suspend commences
>> 3. mdio_bus_phy_suspend is called
>> 4. suspend callback of the network driver is called (implicitly
>> MAC/PHY are runtime-resumed before)
>> 5. suspend callback suspends MAC/PHY
>>
>> The problem occurs in step 3. phy_suspend() fails because the MDIO
>> bus isn't accessible due to the chip being runtime-suspended.
>
> I think you are fixing the wrong problem. I've had the same with the
> FEC driver. I fixed it by making the MDIO operations runtime-suspend
> aware:
>
Interesting, didn't see it from that angle yet. Sounds plausible.
Thanks a lot for the feedback and I'll have a look at the FEC driver.
Heiner
> commit 8fff755e9f8d0f70a595e79f248695ce6aef5cc3
> Author: Andrew Lunn <andrew@lunn.ch>
> Date: Sat Jul 25 22:38:02 2015 +0200
>
> net: fec: Ensure clocks are enabled while using mdio bus
>
> When a switch is attached to the mdio bus, the mdio bus can be used
> while the interface is not open. If the IPG clock is not enabled, MDIO
> reads/writes will simply time out.
>
> Add support for runtime PM to control this clock. Enable/disable this
> clock using runtime PM, with open()/close() and mdio read()/write()
> function triggering runtime PM operations. Since PM is optional, the
> IPG clock is enabled at probe and is no longer modified by
> fec_enet_clk_enable(), thus if PM is not enabled in the kernel, it is
> guaranteed the clock is running when MDIO operations are performed.
>
> Don't copy this patch 1:1. I introduced a few bugs which took a while
> to be shaken out :-(
>
> Andrew
>
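The idea in the quoted commit, taking a runtime PM reference around each MDIO access so that the clock is guaranteed to be running, can be illustrated with a small userspace mock. Everything here (struct mock_dev, mock_pm_get(), mock_pm_put(), mock_mdio_read()) is hypothetical and only models the pattern; a real driver would use the kernel's runtime PM helpers, and as noted above the FEC patch itself should not be copied 1:1.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a MAC with an MDIO bus behind a gated clock. */
struct mock_dev {
	int usage_count;	/* runtime PM reference count */
	bool clk_enabled;	/* models the IPG clock */
	int phy_reg;		/* value a PHY register read would return */
};

/* Take a PM reference; the first user switches the clock on. */
static int mock_pm_get(struct mock_dev *dev)
{
	if (dev->usage_count++ == 0)
		dev->clk_enabled = true;
	return 0;
}

/* Drop a PM reference; the last user switches the clock off. */
static void mock_pm_put(struct mock_dev *dev)
{
	if (--dev->usage_count == 0)
		dev->clk_enabled = false;
}

/* Raw bus access: times out when the clock is not running. */
static int mock_mdio_read_raw(const struct mock_dev *dev)
{
	return dev->clk_enabled ? dev->phy_reg : -ETIMEDOUT;
}

/* Runtime-PM-aware read: resume, access the bus, allow suspend again. */
static int mock_mdio_read(struct mock_dev *dev)
{
	int ret = mock_pm_get(dev);

	if (ret < 0)
		return ret;

	ret = mock_mdio_read_raw(dev);
	mock_pm_put(dev);
	return ret;
}
```

In this model a raw read on a suspended device fails with -ETIMEDOUT, while the wrapped read succeeds and leaves the device suspended afterwards, which corresponds to step 3 of the sequence no longer failing.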
* Re: [PATCH bpf-next v4 3/7] tools/bpf: sync kernel header bpf.h and add bpf_task_fd_query in libbpf
From: Martin KaFai Lau @ 2018-05-24 5:12 UTC
To: Yonghong Song; +Cc: peterz, ast, daniel, netdev, kernel-team
In-Reply-To: <20180524001844.1175727-4-yhs@fb.com>
On Wed, May 23, 2018 at 05:18:43PM -0700, Yonghong Song wrote:
> Sync kernel header bpf.h to tools/include/uapi/linux/bpf.h and
> implement bpf_task_fd_query() in libbpf. The test programs
> in samples/bpf and tools/testing/selftests/bpf, and later bpftool
> will use this libbpf function to query kernel.
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
* Re: [PATCH bpf-next v4 2/7] bpf: introduce bpf subcommand BPF_TASK_FD_QUERY
From: Martin KaFai Lau @ 2018-05-24 5:07 UTC
To: Yonghong Song; +Cc: peterz, ast, daniel, netdev, kernel-team
In-Reply-To: <20180524001844.1175727-3-yhs@fb.com>
On Wed, May 23, 2018 at 05:18:42PM -0700, Yonghong Song wrote:
> Currently, suppose a userspace application has loaded a bpf program
> and attached it to a tracepoint/kprobe/uprobe, and a bpf
> introspection tool, e.g., bpftool, wants to show which bpf program
> is attached to which tracepoint/kprobe/uprobe. Such attachment
> information will be really useful to understand the overall bpf
> deployment in the system.
>
> There is a name field (16 bytes) for each program, which could
> be used to encode the attachment point. There are some drawbacks
> for this approaches. First, bpftool user (e.g., an admin) may not
> really understand the association between the name and the
> attachment point. Second, if one program is attached to multiple
> places, encoding a proper name which can imply all these
> attachments becomes difficult.
>
> This patch introduces a new bpf subcommand BPF_TASK_FD_QUERY.
> Given a pid and fd, if the <pid, fd> is associated with a
> tracepoint/kprobe/uprobe perf event, BPF_TASK_FD_QUERY will return
> . prog_id
> . tracepoint name, or
> . k[ret]probe funcname + offset or kernel addr, or
> . u[ret]probe filename + offset
> to the userspace.
> The user can use "bpftool prog" to find more information about
> bpf program itself with prog_id.
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
> include/linux/trace_events.h | 17 +++++++
> include/uapi/linux/bpf.h | 26 ++++++++++
> kernel/bpf/syscall.c | 115 +++++++++++++++++++++++++++++++++++++++++++
> kernel/trace/bpf_trace.c | 48 ++++++++++++++++++
> kernel/trace/trace_kprobe.c | 29 +++++++++++
> kernel/trace/trace_uprobe.c | 22 +++++++++
> 6 files changed, 257 insertions(+)
>
> diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> index 2bde3ef..d34144a 100644
> --- a/include/linux/trace_events.h
> +++ b/include/linux/trace_events.h
> @@ -473,6 +473,9 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info);
> int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *prog);
> int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog *prog);
> struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name);
> +int bpf_get_perf_event_info(const struct perf_event *event, u32 *prog_id,
> + u32 *fd_type, const char **buf,
> + u64 *probe_offset, u64 *probe_addr);
> #else
> static inline unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
> {
> @@ -504,6 +507,13 @@ static inline struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name
> {
> return NULL;
> }
> +static inline int bpf_get_perf_event_info(const struct perf_event *event,
> + u32 *prog_id, u32 *fd_type,
> + const char **buf, u64 *probe_offset,
> + u64 *probe_addr)
> +{
> + return -EOPNOTSUPP;
> +}
> #endif
>
> enum {
> @@ -560,10 +570,17 @@ extern void perf_trace_del(struct perf_event *event, int flags);
> #ifdef CONFIG_KPROBE_EVENTS
> extern int perf_kprobe_init(struct perf_event *event, bool is_retprobe);
> extern void perf_kprobe_destroy(struct perf_event *event);
> +extern int bpf_get_kprobe_info(const struct perf_event *event,
> + u32 *fd_type, const char **symbol,
> + u64 *probe_offset, u64 *probe_addr,
> + bool perf_type_tracepoint);
> #endif
> #ifdef CONFIG_UPROBE_EVENTS
> extern int perf_uprobe_init(struct perf_event *event, bool is_retprobe);
> extern void perf_uprobe_destroy(struct perf_event *event);
> +extern int bpf_get_uprobe_info(const struct perf_event *event,
> + u32 *fd_type, const char **filename,
> + u64 *probe_offset, bool perf_type_tracepoint);
> #endif
> extern int ftrace_profile_set_filter(struct perf_event *event, int event_id,
> char *filter_str);
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index c3e502d..0d51946 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -97,6 +97,7 @@ enum bpf_cmd {
> BPF_RAW_TRACEPOINT_OPEN,
> BPF_BTF_LOAD,
> BPF_BTF_GET_FD_BY_ID,
> + BPF_TASK_FD_QUERY,
> };
>
> enum bpf_map_type {
> @@ -379,6 +380,22 @@ union bpf_attr {
> __u32 btf_log_size;
> __u32 btf_log_level;
> };
> +
> + struct {
> + __u32 pid; /* input: pid */
> + __u32 fd; /* input: fd */
> + __u32 flags; /* input: flags */
> + __u32 buf_len; /* input/output: buf len */
> + __aligned_u64 buf; /* input/output:
> + * tp_name for tracepoint
> + * symbol for kprobe
> + * filename for uprobe
> + */
> + __u32 prog_id; /* output: prod_id */
> + __u32 fd_type; /* output: BPF_FD_TYPE_* */
> + __u64 probe_offset; /* output: probe_offset */
> + __u64 probe_addr; /* output: probe_addr */
> + } task_fd_query;
> } __attribute__((aligned(8)));
>
> /* The description below is an attempt at providing documentation to eBPF
> @@ -2458,4 +2475,13 @@ struct bpf_fib_lookup {
> __u8 dmac[6]; /* ETH_ALEN */
> };
>
> +enum bpf_task_fd_type {
> + BPF_FD_TYPE_RAW_TRACEPOINT, /* tp name */
> + BPF_FD_TYPE_TRACEPOINT, /* tp name */
> + BPF_FD_TYPE_KPROBE, /* (symbol + offset) or addr */
> + BPF_FD_TYPE_KRETPROBE, /* (symbol + offset) or addr */
> + BPF_FD_TYPE_UPROBE, /* filename + offset */
> + BPF_FD_TYPE_URETPROBE, /* filename + offset */
> +};
> +
> #endif /* _UAPI__LINUX_BPF_H__ */
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 0b4c945..7dd8c86 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -18,7 +18,9 @@
> #include <linux/vmalloc.h>
> #include <linux/mmzone.h>
> #include <linux/anon_inodes.h>
> +#include <linux/fdtable.h>
> #include <linux/file.h>
> +#include <linux/fs.h>
> #include <linux/license.h>
> #include <linux/filter.h>
> #include <linux/version.h>
> @@ -2102,6 +2104,116 @@ static int bpf_btf_get_fd_by_id(const union bpf_attr *attr)
> return btf_get_fd_by_id(attr->btf_id);
> }
>
> +static int bpf_task_fd_query_copy(const union bpf_attr *attr,
> + union bpf_attr __user *uattr,
> + u32 prog_id, u32 fd_type,
> + const char *buf, u64 probe_offset,
> + u64 probe_addr)
> +{
> + void __user *ubuf = u64_to_user_ptr(attr->task_fd_query.buf);
> + u32 len = buf ? strlen(buf) + 1 : 0, input_len;
> + int err = 0;
> +
> + if (put_user(len, &uattr->task_fd_query.buf_len))
> + return -EFAULT;
> + input_len = attr->task_fd_query.buf_len;
> + if (input_len && len && ubuf) {
When len is 0 and input_len > 0, ubuf will not be touched (and
so not null terminated).
It may be helpful to note in the uapi bpf.h that !output_buf_len has to be
checked in addition to the syscall return value. It is reasonable for
userspace to assume that ubuf can be used directly with
strlen()/printf()... as long as the syscall does not return -1/ENOSPC.
I think the comment change could be done in a follow-up patch.
Alternatively,
always null terminate ubuf as long as input_len > 0,
and make output_buf_len strlen(buf) instead of
strlen(buf) + 1 (i.e. exclude the null char from output_buf_len),
so that the !buf case has output_buf_len == 0.
The user can then rely on ENOSPC or input_buf_len <= output_buf_len
to detect truncation. This convention is
closer to how snprintf() behaves.
Other than that,
Acked-by: Martin KaFai Lau <kafai@fb.com>
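The second option can be sketched in plain userspace C. This is not the kernel code: copy_name_snprintf_style() is a hypothetical helper, memcpy() stands in for copy_to_user(), and the out-parameter plays the role of the buf_len value written back to userspace.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical snprintf()-like copy: the destination is always null
 * terminated when input_len > 0, and *output_len reports strlen(buf)
 * without the terminating null, so a missing name yields 0.
 * Returns -ENOSPC when the name had to be truncated.
 */
static int copy_name_snprintf_style(char *ubuf, uint32_t input_len,
				    const char *buf, uint32_t *output_len)
{
	uint32_t len = buf ? (uint32_t)strlen(buf) : 0;

	*output_len = len;

	if (!ubuf || !input_len)
		return 0;

	if (input_len <= len) {
		/* Not enough room: copy what fits and terminate. */
		memcpy(ubuf, buf, input_len - 1);
		ubuf[input_len - 1] = '\0';
		return -ENOSPC;
	}

	if (len)
		memcpy(ubuf, buf, len);
	ubuf[len] = '\0';
	return 0;
}
```

With this convention a caller can always pass the buffer to strlen() or printf() after the call, and can detect truncation either from the error or from input_len <= output_len.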
> + if (input_len < len) {
> + err = -ENOSPC;
> + len = input_len;
> + }
> + if (copy_to_user(ubuf, buf, len))
> + return -EFAULT;
> + }
> +
> + if (put_user(prog_id, &uattr->task_fd_query.prog_id) ||
> + put_user(fd_type, &uattr->task_fd_query.fd_type) ||
> + put_user(probe_offset, &uattr->task_fd_query.probe_offset) ||
> + put_user(probe_addr, &uattr->task_fd_query.probe_addr))
> + return -EFAULT;
> +
> + return err;
> +}
> +
> +#define BPF_TASK_FD_QUERY_LAST_FIELD task_fd_query.probe_addr
> +
> +static int bpf_task_fd_query(const union bpf_attr *attr,
> + union bpf_attr __user *uattr)
> +{
> + pid_t pid = attr->task_fd_query.pid;
> + u32 fd = attr->task_fd_query.fd;
> + const struct perf_event *event;
> + struct files_struct *files;
> + struct task_struct *task;
> + struct file *file;
> + int err;
> +
> + if (CHECK_ATTR(BPF_TASK_FD_QUERY))
> + return -EINVAL;
> +
> + if (!capable(CAP_SYS_ADMIN))
> + return -EPERM;
> +
> + if (attr->task_fd_query.flags != 0)
> + return -EINVAL;
> +
> + task = get_pid_task(find_vpid(pid), PIDTYPE_PID);
> + if (!task)
> + return -ENOENT;
> +
> + files = get_files_struct(task);
> + put_task_struct(task);
> + if (!files)
> + return -ENOENT;
> +
> + err = 0;
> + spin_lock(&files->file_lock);
> + file = fcheck_files(files, fd);
> + if (!file)
> + err = -EBADF;
> + else
> + get_file(file);
> + spin_unlock(&files->file_lock);
> + put_files_struct(files);
> +
> + if (err)
> + goto out;
> +
> + if (file->f_op == &bpf_raw_tp_fops) {
> + struct bpf_raw_tracepoint *raw_tp = file->private_data;
> + struct bpf_raw_event_map *btp = raw_tp->btp;
> +
> + err = bpf_task_fd_query_copy(attr, uattr,
> + raw_tp->prog->aux->id,
> + BPF_FD_TYPE_RAW_TRACEPOINT,
> + btp->tp->name, 0, 0);
> + goto put_file;
> + }
> +
> + event = perf_get_event(file);
> + if (!IS_ERR(event)) {
> + u64 probe_offset, probe_addr;
> + u32 prog_id, fd_type;
> + const char *buf;
> +
> + err = bpf_get_perf_event_info(event, &prog_id, &fd_type,
> + &buf, &probe_offset,
> + &probe_addr);
> + if (!err)
> + err = bpf_task_fd_query_copy(attr, uattr, prog_id,
> + fd_type, buf,
> + probe_offset,
> + probe_addr);
> + goto put_file;
> + }
> +
> + err = -ENOTSUPP;
> +put_file:
> + fput(file);
> +out:
> + return err;
> +}
> +
> SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
> {
> union bpf_attr attr = {};
> @@ -2188,6 +2300,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
> case BPF_BTF_GET_FD_BY_ID:
> err = bpf_btf_get_fd_by_id(&attr);
> break;
> + case BPF_TASK_FD_QUERY:
> + err = bpf_task_fd_query(&attr, uattr);
> + break;
> default:
> err = -EINVAL;
> break;
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index ce2cbbf..81fdf2f 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -14,6 +14,7 @@
> #include <linux/uaccess.h>
> #include <linux/ctype.h>
> #include <linux/kprobes.h>
> +#include <linux/syscalls.h>
> #include <linux/error-injection.h>
>
> #include "trace_probe.h"
> @@ -1163,3 +1164,50 @@ int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog *prog)
> mutex_unlock(&bpf_event_mutex);
> return err;
> }
> +
> +int bpf_get_perf_event_info(const struct perf_event *event, u32 *prog_id,
> + u32 *fd_type, const char **buf,
> + u64 *probe_offset, u64 *probe_addr)
> +{
> + bool is_tracepoint, is_syscall_tp;
> + struct bpf_prog *prog;
> + int flags, err = 0;
> +
> + prog = event->prog;
> + if (!prog)
> + return -ENOENT;
> +
> + /* not supporting BPF_PROG_TYPE_PERF_EVENT yet */
> + if (prog->type == BPF_PROG_TYPE_PERF_EVENT)
> + return -EOPNOTSUPP;
> +
> + *prog_id = prog->aux->id;
> + flags = event->tp_event->flags;
> + is_tracepoint = flags & TRACE_EVENT_FL_TRACEPOINT;
> + is_syscall_tp = is_syscall_trace_event(event->tp_event);
> +
> + if (is_tracepoint || is_syscall_tp) {
> + *buf = is_tracepoint ? event->tp_event->tp->name
> + : event->tp_event->name;
> + *fd_type = BPF_FD_TYPE_TRACEPOINT;
> + *probe_offset = 0x0;
> + *probe_addr = 0x0;
> + } else {
> + /* kprobe/uprobe */
> + err = -EOPNOTSUPP;
> +#ifdef CONFIG_KPROBE_EVENTS
> + if (flags & TRACE_EVENT_FL_KPROBE)
> + err = bpf_get_kprobe_info(event, fd_type, buf,
> + probe_offset, probe_addr,
> + event->attr.type == PERF_TYPE_TRACEPOINT);
> +#endif
> +#ifdef CONFIG_UPROBE_EVENTS
> + if (flags & TRACE_EVENT_FL_UPROBE)
> + err = bpf_get_uprobe_info(event, fd_type, buf,
> + probe_offset,
> + event->attr.type == PERF_TYPE_TRACEPOINT);
> +#endif
> + }
> +
> + return err;
> +}
> diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> index 02aed76..daa8157 100644
> --- a/kernel/trace/trace_kprobe.c
> +++ b/kernel/trace/trace_kprobe.c
> @@ -1287,6 +1287,35 @@ kretprobe_perf_func(struct trace_kprobe *tk, struct kretprobe_instance *ri,
> head, NULL);
> }
> NOKPROBE_SYMBOL(kretprobe_perf_func);
> +
> +int bpf_get_kprobe_info(const struct perf_event *event, u32 *fd_type,
> + const char **symbol, u64 *probe_offset,
> + u64 *probe_addr, bool perf_type_tracepoint)
> +{
> + const char *pevent = trace_event_name(event->tp_event);
> + const char *group = event->tp_event->class->system;
> + struct trace_kprobe *tk;
> +
> + if (perf_type_tracepoint)
> + tk = find_trace_kprobe(pevent, group);
> + else
> + tk = event->tp_event->data;
> + if (!tk)
> + return -EINVAL;
> +
> + *fd_type = trace_kprobe_is_return(tk) ? BPF_FD_TYPE_KRETPROBE
> + : BPF_FD_TYPE_KPROBE;
> + if (tk->symbol) {
> + *symbol = tk->symbol;
> + *probe_offset = tk->rp.kp.offset;
> + *probe_addr = 0;
> + } else {
> + *symbol = NULL;
> + *probe_offset = 0;
> + *probe_addr = (unsigned long)tk->rp.kp.addr;
> + }
> + return 0;
> +}
> #endif /* CONFIG_PERF_EVENTS */
>
> /*
> diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> index ac89287..bf89a51 100644
> --- a/kernel/trace/trace_uprobe.c
> +++ b/kernel/trace/trace_uprobe.c
> @@ -1161,6 +1161,28 @@ static void uretprobe_perf_func(struct trace_uprobe *tu, unsigned long func,
> {
> __uprobe_perf_func(tu, func, regs, ucb, dsize);
> }
> +
> +int bpf_get_uprobe_info(const struct perf_event *event, u32 *fd_type,
> + const char **filename, u64 *probe_offset,
> + bool perf_type_tracepoint)
> +{
> + const char *pevent = trace_event_name(event->tp_event);
> + const char *group = event->tp_event->class->system;
> + struct trace_uprobe *tu;
> +
> + if (perf_type_tracepoint)
> + tu = find_probe_event(pevent, group);
> + else
> + tu = event->tp_event->data;
> + if (!tu)
> + return -EINVAL;
> +
> + *fd_type = is_ret_probe(tu) ? BPF_FD_TYPE_URETPROBE
> + : BPF_FD_TYPE_UPROBE;
> + *filename = tu->filename;
> + *probe_offset = tu->offset;
> + return 0;
> +}
> #endif /* CONFIG_PERF_EVENTS */
>
> static int
> --
> 2.9.5
>
* Re: [PATCH bpf-next 0/5] fix test_sockmap
From: John Fastabend @ 2018-05-24 4:58 UTC
To: Prashant Bhole, Alexei Starovoitov, Daniel Borkmann
Cc: David S . Miller, Shuah Khan, netdev
In-Reply-To: <85c79205-5bb6-f6c8-e4a1-abed059c2619@lab.ntt.co.jp>
On 05/23/2018 09:47 PM, Prashant Bhole wrote:
>
>
> On 5/23/2018 6:44 PM, Prashant Bhole wrote:
>>
>>
>> On 5/22/2018 2:08 AM, John Fastabend wrote:
>>> On 05/20/2018 10:13 PM, Prashant Bhole wrote:
>>>>
>>>>
>>>> On 5/19/2018 1:42 AM, John Fastabend wrote:
>>>>> On 05/18/2018 12:17 AM, Prashant Bhole wrote:
>>>>>> This series fixes bugs in test_sockmap code. They weren't caught
>>>>>> previously because failure in RX/TX thread was not notified to the
>>>>>> main thread.
>>>>>>
>>>>>> Also fixed data verification logic and slightly improved test output
>>>>>> such that parameters values (cork, apply, start, end) of failed test
>>>>>> can be easily seen.
>>>>>>
>>>>>
>>>>> Great, this was on my list so thanks for taking care of it.
>>>>>
>>>>>> Note: Even after fixing above problems there are issues with tests
>>>>>> which set cork parameter. Tests fail (RX thread timeout) when cork
>>>>>> value is non-zero and overall data sent by TX thread isn't multiples
>>>>>> of cork value.
>>>>>
>>>>>
>>>>> This is expected. When 'cork' is set the sender should only xmit
>>>>> the data when 'cork' bytes are available. If the user doesn't
>>>>> provide the N bytes the data is cork'ed waiting for the bytes and
>>>>> if the socket is closed the state is cleaned up. What these tests
>>>>> are testing is the cleanup path when a user doesn't provide the
>>>>> N bytes. In practice this is used to validate headers and prevent
>>>>> users from sending partial headers. We want to keep these tests because
>>>>> they verify a tear-down path in the code.
>>>>
>>>> Ok.
>>>>
>>>>>
>>>>> After your changes do these get reported as failures? If so we
>>>>> need to account for the above in the calculations.
>>>>
>>>> Yes, cork related test are reported as failures because of RX thread
>>>> timeout.
>>>>
>>>> So with your above description, I think we need to differentiate cork
>>>> tests with partial data and full data. In partial data test we can have
>>>> something like "timeout_expected" flag. Any other way to fix it?
>>>>
>>>
>>> Adding a flag seems reasonable to me. Lets do this for now. Also I
>>> plan to add more negative tests so we can either use the same
>>> flag or a new one for those cases as well.
>>>
>>
>> John,
>> I worked on this for some time and noticed that the RX-timeout of
>> tests with cork parameter is dependent on various parameters. So we
>> can not set a flag like the way 'drop_expected' flag is set before
>> executing the test.
>>
>> So I decided to write a function which judges all parameters before
>> each test and decides whether a test with cork parameter will
>> timeout or not. Then the conditions in the function became
>> complicated. For example some tests fail if opt->rate < 17 (with
>> some other conditions). Here is 17 is related to FRAGS_PER_SKB.
>> Consider following two examples.
> I'm sorry. Correction: s/FRAGS_PER_SKB/MAX_SKB_FRAGS/
>
>>
>> ./test_sockmap --cgroup /mnt/cgroup2 -r 16 -i 1 -l 30 -t sendpage
>> --txmsg --txmsg_cork 1024 # RX timeout occurs
>>
>> ./test_sockmap --cgroup /mnt/cgroup2 -r 17 -i 1 -l 30 -t sendpage
>> --txmsg --txmsg_cork 1024 # Success!
>>
Ah yes, this hits the buffer limit and flushes the queue. The kernel
side doesn't know how to merge those specific sendpage requests, so
it gives each request its own buffer, and when the limit is reached
we flush it.
>> Do we need to keep such tests? if yes, then I will continue with
>> adding such conditions in the function.
>>
Yes, these tests are needed because they are testing the edge cases.
These are probably the most important tests, because my normal usage
will catch any issues in the "good" cases; it's these types of things
that can go unnoticed (at least for a short while) if we don't have
specific tests for them.
Thanks for doing this.
John
>> -Prashant
>>
>>
>>
>
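A rough sketch of the kind of predicate discussed in this thread, reproducing only the two invocations quoted above. The struct, the function name and the rule itself are assumptions made for illustration: the total byte count is taken as rate * iov_count * iov_length, the flush threshold is taken to be 17 requests (the value tied to MAX_SKB_FRAGS above), and the "other conditions" mentioned in the thread are not modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed threshold; corresponds to the "17" discussed in the thread. */
#define SKETCH_MAX_SKB_FRAGS 17

/* Hypothetical subset of the test options that matter for cork tests. */
struct cork_test {
	uint32_t rate;		/* -r */
	uint32_t iov_count;	/* -i */
	uint32_t iov_length;	/* -l */
	uint32_t cork;		/* --txmsg_cork */
	bool sendpage;		/* -t sendpage */
};

/* Guess whether the RX thread is expected to time out for this test. */
static bool cork_timeout_expected(const struct cork_test *t)
{
	uint64_t total = (uint64_t)t->rate * t->iov_count * t->iov_length;

	if (!t->cork)
		return false;	/* nothing is corked */

	if (total % t->cork == 0)
		return false;	/* every corked chunk gets completed */

	if (t->sendpage && t->rate >= SKETCH_MAX_SKB_FRAGS)
		return false;	/* buffer limit reached, queue is flushed */

	return true;		/* leftover bytes stay corked */
}
```

For "-r 16 -i 1 -l 30 --txmsg_cork 1024" with sendpage this returns true (RX timeout expected), and for "-r 17" with otherwise identical options it returns false, matching the two runs shown above. Whether such a rule generalises to the remaining test combinations is exactly the open question in the thread.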
* Re: [Cake] [PATCH net-next v15 4/7] sch_cake: Add NAT awareness to packet classifier
From: Kevin Darbyshire-Bryant @ 2018-05-24 4:52 UTC
To: Toke Høiland-Jørgensen
Cc: David Miller, Cake List, Linux Kernel Network Developers,
netfilter-devel@vger.kernel.org
In-Reply-To: <878t8axafk.fsf@toke.dk>
> On 23 May 2018, at 23:40, Toke Høiland-Jørgensen <toke@toke.dk> wrote:
>
<snip>
>
> Hmm, and we still have an issue with ingress filtering (where cake is
> running on an ifb interface). That runs pre-NAT in the conntrack case,
> and we can't do the RX trick. Here we do the lookup manually in
> conntrack (and this part is actually what brings in most of the
> dependencies). Any neat tricks up your sleeve for this case? :)
I wonder here if our terminology with ‘ingress’ is causing confusion. For avoidance of doubt:
The typical use case of cake on a LAN/WAN router requires two instances. One instance (the egress) is on the WAN interface itself. It is post conntrack and hence uses skb->nfct to work out the real pre-NAT source address of the LAN hosts.
Since we cannot apply this qdisc to the ingress of our WAN interface we use an IFB to mirror the ingress packets, and then use a cake instance on the ifb interface on its egress path to in essence control the ingress traffic.
Cake has two modes. The normal ‘egress’ mode is designed for controlling egress traffic, and shapes after any packets have been dropped. ‘ingress’ mode is designed to be used on the egress of our ingress IFB, where the shaper counts all packets (well, they got here!) even if we decide to drop them a bit later.
The ifb positioned cake has the additional fun factor that the conntrack field hasn’t yet been filled in, so the qdisc has to go looking in the conntrack tables itself to see if any NATting has taken place and balance LAN host fairness based on that.
As far as I understand it, the flow dissector doesn’t obviously help with working out the pre-NAT addressing as the flow has already been mangled in the egress case, and is awaiting mangling on the ingress case.
Kevin
* Re: [PATCH bpf-next 0/5] fix test_sockmap
From: Prashant Bhole @ 2018-05-24 4:47 UTC
To: John Fastabend, Alexei Starovoitov, Daniel Borkmann
Cc: David S . Miller, Shuah Khan, netdev
In-Reply-To: <67df711c-75c0-b674-e394-148645353a5a@lab.ntt.co.jp>
On 5/23/2018 6:44 PM, Prashant Bhole wrote:
>
>
> On 5/22/2018 2:08 AM, John Fastabend wrote:
>> On 05/20/2018 10:13 PM, Prashant Bhole wrote:
>>>
>>>
>>> On 5/19/2018 1:42 AM, John Fastabend wrote:
>>>> On 05/18/2018 12:17 AM, Prashant Bhole wrote:
>>>>> This series fixes bugs in test_sockmap code. They weren't caught
>>>>> previously because failure in RX/TX thread was not notified to the
>>>>> main thread.
>>>>>
>>>>> Also fixed data verification logic and slightly improved test output
>>>>> such that parameters values (cork, apply, start, end) of failed test
>>>>> can be easily seen.
>>>>>
>>>>
>>>> Great, this was on my list so thanks for taking care of it.
>>>>
>>>>> Note: Even after fixing above problems there are issues with tests
>>>>> which set cork parameter. Tests fail (RX thread timeout) when cork
>>>>> value is non-zero and overall data sent by TX thread isn't multiples
>>>>> of cork value.
>>>>
>>>>
>>>> This is expected. When 'cork' is set the sender should only xmit
>>>> the data when 'cork' bytes are available. If the user doesn't
>>>> provide the N bytes the data is cork'ed waiting for the bytes and
>>>> if the socket is closed the state is cleaned up. What these tests
>>>> are testing is the cleanup path when a user doesn't provide the
>>>> N bytes. In practice this is used to validate headers and prevent
>>>> users from sending partial headers. We want to keep these tests because
>>>> they verify a tear-down path in the code.
>>>
>>> Ok.
>>>
>>>>
>>>> After your changes do these get reported as failures? If so we
>>>> need to account for the above in the calculations.
>>>
>>> Yes, cork related test are reported as failures because of RX thread
>>> timeout.
>>>
>>> So with your above description, I think we need to differentiate cork
>>> tests with partial data and full data. In partial data test we can have
>>> something like "timeout_expected" flag. Any other way to fix it?
>>>
>>
>> Adding a flag seems reasonable to me. Lets do this for now. Also I
>> plan to add more negative tests so we can either use the same
>> flag or a new one for those cases as well.
>>
>
> John,
> I worked on this for some time and noticed that the RX-timeout of tests
> with cork parameter is dependent on various parameters. So we can not
> set a flag like the way 'drop_expected' flag is set before executing the
> test.
>
> So I decided to write a function which judges all parameters before each
> test and decides whether a test with cork parameter will timeout or not.
> Then the conditions in the function became complicated. For example some
> tests fail if opt->rate < 17 (with some other conditions). Here 17 is
> related to FRAGS_PER_SKB. Consider the following two examples.
I'm sorry. Correction: s/FRAGS_PER_SKB/MAX_SKB_FRAGS/
>
> ./test_sockmap --cgroup /mnt/cgroup2 -r 16 -i 1 -l 30 -t sendpage
> --txmsg --txmsg_cork 1024 # RX timeout occurs
>
> ./test_sockmap --cgroup /mnt/cgroup2 -r 17 -i 1 -l 30 -t sendpage
> --txmsg --txmsg_cork 1024 # Success!
>
> Do we need to keep such tests? If yes, then I will continue adding
> such conditions in the function.
>
>
> -Prashant
>
>
>
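The cork behaviour John describes in the quoted text can be modelled in a few lines of C. This is only a sketch under a simplifying assumption: all pending bytes are released as soon as the threshold is reached, and anything still held back is discarded on close. The names (cork_model, cork_model_send, cork_model_close) are hypothetical and are not part of test_sockmap or the kernel.

```c
#include <stddef.h>

/* Hypothetical model of a corked sender: bytes accumulate until at
 * least `cork` bytes are pending, then the pending data is released.
 */
struct cork_model {
	size_t cork;	/* threshold set by the policy (0 = no corking) */
	size_t pending;	/* bytes held back, waiting for the threshold */
	size_t sent;	/* bytes actually released to the receiver */
};

/* Queue `len` bytes and release them once the threshold is met. */
static void cork_model_send(struct cork_model *m, size_t len)
{
	m->pending += len;
	if (m->cork == 0 || m->pending >= m->cork) {
		m->sent += m->pending;
		m->pending = 0;
	}
}

/* Closing the socket discards whatever is still corked.
 * Returns the number of bytes that were never transmitted.
 */
static size_t cork_model_close(struct cork_model *m)
{
	size_t dropped = m->pending;

	m->pending = 0;
	return dropped;
}
```

With a cork of 1024 and three 600-byte sends, the second send crosses the threshold and 1200 bytes are released, while the last 600 bytes stay corked until close. A receiver waiting for all 1800 bytes would time out in this model, which is the tear-down path the tests exercise.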
* [PATCH net-next] bpfilter: fix build dependency
From: Alexei Starovoitov @ 2018-05-24 4:29 UTC (permalink / raw)
To: David S . Miller; +Cc: daniel, jakub.kicinski, netdev, kernel-team
BPFILTER could have been enabled without INET causing this build error:
ERROR: "bpfilter_process_sockopt" [net/bpfilter/bpfilter.ko] undefined!
Fixes: d2ba09c17a06 ("net: add skeleton of bpfilter kernel module")
Reported-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
net/bpfilter/Kconfig | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/bpfilter/Kconfig b/net/bpfilter/Kconfig
index 60725c5f79db..a948b072c28f 100644
--- a/net/bpfilter/Kconfig
+++ b/net/bpfilter/Kconfig
@@ -1,7 +1,7 @@
menuconfig BPFILTER
bool "BPF based packet filtering framework (BPFILTER)"
default n
- depends on NET && BPF
+ depends on NET && BPF && INET
help
This builds experimental bpfilter framework that is aiming to
provide netfilter compatible functionality via BPF
--
2.9.5
* [PATCH net-next 8/8] nfp: flower: compute link aggregation action
From: Jakub Kicinski @ 2018-05-24 2:22 UTC (permalink / raw)
To: davem; +Cc: netdev, oss-drivers, John Hurley
In-Reply-To: <20180524022255.18548-1-jakub.kicinski@netronome.com>
From: John Hurley <john.hurley@netronome.com>
If the egress device of an offloaded rule is a LAG port, then encode the
output port to the NFP with a LAG identifier and the offloaded group ID.
A prelag action is also offloaded which must be the first action of the
series (although may appear after other pre-actions - e.g. tunnels). This
causes the FW to check that it has the necessary information to output to
the requested LAG port. If it does not, the packet is sent to the kernel
before any other actions are applied to it.
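The "must be the first action" requirement is met in the patch by shifting the already-encoded actions toward the end of the buffer and writing the prelag action at the front. A minimal, standalone sketch of that technique follows. ACT_BUF_MAX and act_prepend() are hypothetical names chosen for illustration; the driver itself works on nfp_flow->action_data with NFP_FL_MAX_A_SIZ as the limit.

```c
#include <stddef.h>
#include <string.h>

#define ACT_BUF_MAX 32	/* hypothetical stand-in for NFP_FL_MAX_A_SIZ */

/* Insert `pre` (pre_len bytes) at the front of buf, shifting the act_len
 * bytes already present toward the end. Returns the new length, or -1 if
 * the result would not fit in the buffer.
 */
static int act_prepend(unsigned char *buf, size_t act_len,
		       const unsigned char *pre, size_t pre_len)
{
	if (act_len + pre_len > ACT_BUF_MAX)
		return -1;

	/* memmove, not memcpy: source and destination overlap. */
	if (act_len)
		memmove(buf + pre_len, buf, act_len);
	memcpy(buf, pre, pre_len);

	return (int)(act_len + pre_len);
}
```

The size check comes before the move so that a full buffer is rejected without being modified, mirroring the -EOPNOTSUPP path in nfp_fl_pre_lag().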
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
.../ethernet/netronome/nfp/flower/action.c | 131 ++++++++++++++----
.../net/ethernet/netronome/nfp/flower/cmsg.h | 13 ++
.../ethernet/netronome/nfp/flower/lag_conf.c | 42 ++++++
.../net/ethernet/netronome/nfp/flower/main.h | 9 +-
.../ethernet/netronome/nfp/flower/offload.c | 2 +-
5 files changed, 169 insertions(+), 28 deletions(-)
diff --git a/drivers/net/ethernet/netronome/nfp/flower/action.c b/drivers/net/ethernet/netronome/nfp/flower/action.c
index 80df9a5d4217..4a6d2db75071 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/action.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/action.c
@@ -72,6 +72,42 @@ nfp_fl_push_vlan(struct nfp_fl_push_vlan *push_vlan,
push_vlan->vlan_tci = cpu_to_be16(tmp_push_vlan_tci);
}
+static int
+nfp_fl_pre_lag(struct nfp_app *app, const struct tc_action *action,
+ struct nfp_fl_payload *nfp_flow, int act_len)
+{
+ size_t act_size = sizeof(struct nfp_fl_pre_lag);
+ struct nfp_fl_pre_lag *pre_lag;
+ struct net_device *out_dev;
+ int err;
+
+ out_dev = tcf_mirred_dev(action);
+ if (!out_dev || !netif_is_lag_master(out_dev))
+ return 0;
+
+ if (act_len + act_size > NFP_FL_MAX_A_SIZ)
+ return -EOPNOTSUPP;
+
+ /* Pre_lag action must be first on action list.
+ * If other actions already exist they need to be pushed forward.
+ */
+ if (act_len)
+ memmove(nfp_flow->action_data + act_size,
+ nfp_flow->action_data, act_len);
+
+ pre_lag = (struct nfp_fl_pre_lag *)nfp_flow->action_data;
+ err = nfp_flower_lag_populate_pre_action(app, out_dev, pre_lag);
+ if (err)
+ return err;
+
+ pre_lag->head.jump_id = NFP_FL_ACTION_OPCODE_PRE_LAG;
+ pre_lag->head.len_lw = act_size >> NFP_FL_LW_SIZ;
+
+ nfp_flow->meta.shortcut = cpu_to_be32(NFP_FL_SC_ACT_NULL);
+
+ return act_size;
+}
+
static bool nfp_fl_netdev_is_tunnel_type(struct net_device *out_dev,
enum nfp_flower_tun_type tun_type)
{
@@ -88,12 +124,13 @@ static bool nfp_fl_netdev_is_tunnel_type(struct net_device *out_dev,
}
static int
-nfp_fl_output(struct nfp_fl_output *output, const struct tc_action *action,
- struct nfp_fl_payload *nfp_flow, bool last,
- struct net_device *in_dev, enum nfp_flower_tun_type tun_type,
- int *tun_out_cnt)
+nfp_fl_output(struct nfp_app *app, struct nfp_fl_output *output,
+ const struct tc_action *action, struct nfp_fl_payload *nfp_flow,
+ bool last, struct net_device *in_dev,
+ enum nfp_flower_tun_type tun_type, int *tun_out_cnt)
{
size_t act_size = sizeof(struct nfp_fl_output);
+ struct nfp_flower_priv *priv = app->priv;
struct net_device *out_dev;
u16 tmp_flags;
@@ -118,6 +155,15 @@ nfp_fl_output(struct nfp_fl_output *output, const struct tc_action *action,
output->flags = cpu_to_be16(tmp_flags |
NFP_FL_OUT_FLAGS_USE_TUN);
output->port = cpu_to_be32(NFP_FL_PORT_TYPE_TUN | tun_type);
+ } else if (netif_is_lag_master(out_dev) &&
+ priv->flower_ext_feats & NFP_FL_FEATS_LAG) {
+ int gid;
+
+ output->flags = cpu_to_be16(tmp_flags);
+ gid = nfp_flower_lag_get_output_id(app, out_dev);
+ if (gid < 0)
+ return gid;
+ output->port = cpu_to_be32(NFP_FL_LAG_OUT | gid);
} else {
/* Set action output parameters. */
output->flags = cpu_to_be16(tmp_flags);
@@ -164,7 +210,7 @@ static struct nfp_fl_pre_tunnel *nfp_fl_pre_tunnel(char *act_data, int act_len)
struct nfp_fl_pre_tunnel *pre_tun_act;
/* Pre_tunnel action must be first on action list.
- * If other actions already exist they need pushed forward.
+ * If other actions already exist they need to be pushed forward.
*/
if (act_len)
memmove(act_data + act_size, act_data, act_len);
@@ -443,42 +489,73 @@ nfp_fl_pedit(const struct tc_action *action, char *nfp_action, int *a_len)
}
static int
-nfp_flower_loop_action(const struct tc_action *a,
+nfp_flower_output_action(struct nfp_app *app, const struct tc_action *a,
+ struct nfp_fl_payload *nfp_fl, int *a_len,
+ struct net_device *netdev, bool last,
+ enum nfp_flower_tun_type *tun_type, int *tun_out_cnt,
+ int *out_cnt)
+{
+ struct nfp_flower_priv *priv = app->priv;
+ struct nfp_fl_output *output;
+ int err, prelag_size;
+
+ if (*a_len + sizeof(struct nfp_fl_output) > NFP_FL_MAX_A_SIZ)
+ return -EOPNOTSUPP;
+
+ output = (struct nfp_fl_output *)&nfp_fl->action_data[*a_len];
+ err = nfp_fl_output(app, output, a, nfp_fl, last, netdev, *tun_type,
+ tun_out_cnt);
+ if (err)
+ return err;
+
+ *a_len += sizeof(struct nfp_fl_output);
+
+ if (priv->flower_ext_feats & NFP_FL_FEATS_LAG) {
+ /* nfp_fl_pre_lag returns -err or size of prelag action added.
+ * This will be 0 if it is not egressing to a lag dev.
+ */
+ prelag_size = nfp_fl_pre_lag(app, a, nfp_fl, *a_len);
+ if (prelag_size < 0)
+ return prelag_size;
+ else if (prelag_size > 0 && (!last || *out_cnt))
+ return -EOPNOTSUPP;
+
+ *a_len += prelag_size;
+ }
+ (*out_cnt)++;
+
+ return 0;
+}
+
+static int
+nfp_flower_loop_action(struct nfp_app *app, const struct tc_action *a,
struct nfp_fl_payload *nfp_fl, int *a_len,
struct net_device *netdev,
- enum nfp_flower_tun_type *tun_type, int *tun_out_cnt)
+ enum nfp_flower_tun_type *tun_type, int *tun_out_cnt,
+ int *out_cnt)
{
struct nfp_fl_set_ipv4_udp_tun *set_tun;
struct nfp_fl_pre_tunnel *pre_tun;
struct nfp_fl_push_vlan *psh_v;
struct nfp_fl_pop_vlan *pop_v;
- struct nfp_fl_output *output;
int err;
if (is_tcf_gact_shot(a)) {
nfp_fl->meta.shortcut = cpu_to_be32(NFP_FL_SC_ACT_DROP);
} else if (is_tcf_mirred_egress_redirect(a)) {
- if (*a_len + sizeof(struct nfp_fl_output) > NFP_FL_MAX_A_SIZ)
- return -EOPNOTSUPP;
-
- output = (struct nfp_fl_output *)&nfp_fl->action_data[*a_len];
- err = nfp_fl_output(output, a, nfp_fl, true, netdev, *tun_type,
- tun_out_cnt);
+ err = nfp_flower_output_action(app, a, nfp_fl, a_len, netdev,
+ true, tun_type, tun_out_cnt,
+ out_cnt);
if (err)
return err;
- *a_len += sizeof(struct nfp_fl_output);
} else if (is_tcf_mirred_egress_mirror(a)) {
- if (*a_len + sizeof(struct nfp_fl_output) > NFP_FL_MAX_A_SIZ)
- return -EOPNOTSUPP;
-
- output = (struct nfp_fl_output *)&nfp_fl->action_data[*a_len];
- err = nfp_fl_output(output, a, nfp_fl, false, netdev, *tun_type,
- tun_out_cnt);
+ err = nfp_flower_output_action(app, a, nfp_fl, a_len, netdev,
+ false, tun_type, tun_out_cnt,
+ out_cnt);
if (err)
return err;
- *a_len += sizeof(struct nfp_fl_output);
} else if (is_tcf_vlan(a) && tcf_vlan_action(a) == TCA_VLAN_ACT_POP) {
if (*a_len + sizeof(struct nfp_fl_pop_vlan) > NFP_FL_MAX_A_SIZ)
return -EOPNOTSUPP;
@@ -535,11 +612,12 @@ nfp_flower_loop_action(const struct tc_action *a,
return 0;
}
-int nfp_flower_compile_action(struct tc_cls_flower_offload *flow,
+int nfp_flower_compile_action(struct nfp_app *app,
+ struct tc_cls_flower_offload *flow,
struct net_device *netdev,
struct nfp_fl_payload *nfp_flow)
{
- int act_len, act_cnt, err, tun_out_cnt;
+ int act_len, act_cnt, err, tun_out_cnt, out_cnt;
enum nfp_flower_tun_type tun_type;
const struct tc_action *a;
LIST_HEAD(actions);
@@ -550,11 +628,12 @@ int nfp_flower_compile_action(struct tc_cls_flower_offload *flow,
act_len = 0;
act_cnt = 0;
tun_out_cnt = 0;
+ out_cnt = 0;
tcf_exts_to_list(flow->exts, &actions);
list_for_each_entry(a, &actions, list) {
- err = nfp_flower_loop_action(a, nfp_flow, &act_len, netdev,
- &tun_type, &tun_out_cnt);
+ err = nfp_flower_loop_action(app, a, nfp_flow, &act_len, netdev,
+ &tun_type, &tun_out_cnt, &out_cnt);
if (err)
return err;
act_cnt++;
diff --git a/drivers/net/ethernet/netronome/nfp/flower/cmsg.h b/drivers/net/ethernet/netronome/nfp/flower/cmsg.h
index 3a42a1fc55cb..4a7f3510a296 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/cmsg.h
+++ b/drivers/net/ethernet/netronome/nfp/flower/cmsg.h
@@ -92,6 +92,7 @@
#define NFP_FL_ACTION_OPCODE_SET_IPV6_DST 12
#define NFP_FL_ACTION_OPCODE_SET_UDP 14
#define NFP_FL_ACTION_OPCODE_SET_TCP 15
+#define NFP_FL_ACTION_OPCODE_PRE_LAG 16
#define NFP_FL_ACTION_OPCODE_PRE_TUNNEL 17
#define NFP_FL_ACTION_OPCODE_NUM 32
@@ -103,6 +104,9 @@
#define NFP_FL_PUSH_VLAN_CFI BIT(12)
#define NFP_FL_PUSH_VLAN_VID GENMASK(11, 0)
+/* LAG ports */
+#define NFP_FL_LAG_OUT 0xC0DE0000
+
/* Tunnel ports */
#define NFP_FL_PORT_TYPE_TUN 0x50000000
#define NFP_FL_IPV4_TUNNEL_TYPE GENMASK(7, 4)
@@ -177,6 +181,15 @@ struct nfp_fl_pop_vlan {
__be16 reserved;
};
+struct nfp_fl_pre_lag {
+ struct nfp_fl_act_head head;
+ __be16 group_id;
+ u8 lag_version[3];
+ u8 instance;
+};
+
+#define NFP_FL_PRE_LAG_VER_OFF 8
+
struct nfp_fl_pre_tunnel {
struct nfp_fl_act_head head;
__be16 reserved;
diff --git a/drivers/net/ethernet/netronome/nfp/flower/lag_conf.c b/drivers/net/ethernet/netronome/nfp/flower/lag_conf.c
index a09fe2778250..0c4c957717ea 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/lag_conf.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/lag_conf.c
@@ -184,6 +184,48 @@ nfp_fl_lag_find_group_for_master_with_lag(struct nfp_fl_lag *lag,
return NULL;
}
+int nfp_flower_lag_populate_pre_action(struct nfp_app *app,
+ struct net_device *master,
+ struct nfp_fl_pre_lag *pre_act)
+{
+ struct nfp_flower_priv *priv = app->priv;
+ struct nfp_fl_lag_group *group = NULL;
+ __be32 temp_vers;
+
+ mutex_lock(&priv->nfp_lag.lock);
+ group = nfp_fl_lag_find_group_for_master_with_lag(&priv->nfp_lag,
+ master);
+ if (!group) {
+ mutex_unlock(&priv->nfp_lag.lock);
+ return -ENOENT;
+ }
+
+ pre_act->group_id = cpu_to_be16(group->group_id);
+ temp_vers = cpu_to_be32(priv->nfp_lag.batch_ver <<
+ NFP_FL_PRE_LAG_VER_OFF);
+ memcpy(pre_act->lag_version, &temp_vers, 3);
+ pre_act->instance = group->group_inst;
+ mutex_unlock(&priv->nfp_lag.lock);
+
+ return 0;
+}
+
+int nfp_flower_lag_get_output_id(struct nfp_app *app, struct net_device *master)
+{
+ struct nfp_flower_priv *priv = app->priv;
+ struct nfp_fl_lag_group *group = NULL;
+ int group_id = -ENOENT;
+
+ mutex_lock(&priv->nfp_lag.lock);
+ group = nfp_fl_lag_find_group_for_master_with_lag(&priv->nfp_lag,
+ master);
+ if (group)
+ group_id = group->group_id;
+ mutex_unlock(&priv->nfp_lag.lock);
+
+ return group_id;
+}
+
static int
nfp_fl_lag_config_group(struct nfp_fl_lag *lag, struct nfp_fl_lag_group *group,
struct net_device **active_members,
diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.h b/drivers/net/ethernet/netronome/nfp/flower/main.h
index 2fd75c155ccb..bbe5764d26cb 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.h
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.h
@@ -45,6 +45,7 @@
#include <linux/workqueue.h>
#include <linux/idr.h>
+struct nfp_fl_pre_lag;
struct net_device;
struct nfp_app;
@@ -253,7 +254,8 @@ int nfp_flower_compile_flow_match(struct tc_cls_flower_offload *flow,
struct net_device *netdev,
struct nfp_fl_payload *nfp_flow,
enum nfp_flower_tun_type tun_type);
-int nfp_flower_compile_action(struct tc_cls_flower_offload *flow,
+int nfp_flower_compile_action(struct nfp_app *app,
+ struct tc_cls_flower_offload *flow,
struct net_device *netdev,
struct nfp_fl_payload *nfp_flow);
int nfp_compile_flow_metadata(struct nfp_app *app,
@@ -284,5 +286,10 @@ void nfp_flower_lag_init(struct nfp_fl_lag *lag);
void nfp_flower_lag_cleanup(struct nfp_fl_lag *lag);
int nfp_flower_lag_reset(struct nfp_fl_lag *lag);
bool nfp_flower_lag_unprocessed_msg(struct nfp_app *app, struct sk_buff *skb);
+int nfp_flower_lag_populate_pre_action(struct nfp_app *app,
+ struct net_device *master,
+ struct nfp_fl_pre_lag *pre_act);
+int nfp_flower_lag_get_output_id(struct nfp_app *app,
+ struct net_device *master);
#endif
diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index 70ec9d821b91..c42e64f32333 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -440,7 +440,7 @@ nfp_flower_add_offload(struct nfp_app *app, struct net_device *netdev,
if (err)
goto err_destroy_flow;
- err = nfp_flower_compile_action(flow, netdev, flow_pay);
+ err = nfp_flower_compile_action(app, flow, netdev, flow_pay);
if (err)
goto err_destroy_flow;
--
2.17.0
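For readers following nfp_flower_lag_populate_pre_action() above: the 3-byte lag_version field is filled by shifting the batch version left by NFP_FL_PRE_LAG_VER_OFF, converting to big endian and copying the first three bytes. The sketch below reproduces that encoding in portable user-space C. put_be32() and encode_lag_version() are hypothetical helpers written for this illustration (put_be32() stands in for cpu_to_be32() plus the byte view of the result).

```c
#include <stdint.h>
#include <string.h>

#define PRE_LAG_VER_OFF 8	/* mirrors NFP_FL_PRE_LAG_VER_OFF */

/* Portable stand-in for cpu_to_be32(): most significant byte first. */
static void put_be32(uint8_t out[4], uint32_t v)
{
	out[0] = (uint8_t)(v >> 24);
	out[1] = (uint8_t)(v >> 16);
	out[2] = (uint8_t)(v >> 8);
	out[3] = (uint8_t)v;
}

/* Pack the low 24 bits of batch_ver into three big-endian bytes. */
static void encode_lag_version(uint8_t dst[3], uint32_t batch_ver)
{
	uint8_t be[4];

	put_be32(be, batch_ver << PRE_LAG_VER_OFF);
	memcpy(dst, be, 3);
}
```

The shift moves the version into the upper three bytes of the 32-bit word, so the copy drops only the zero byte at the bottom. For example, a batch version of 0x123456 is encoded as the bytes 0x12, 0x34, 0x56.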
* [PATCH net-next 7/8] nfp: flower: implement host cmsg handler for LAG
From: Jakub Kicinski @ 2018-05-24 2:22 UTC (permalink / raw)
To: davem; +Cc: netdev, oss-drivers, John Hurley
In-Reply-To: <20180524022255.18548-1-jakub.kicinski@netronome.com>
From: John Hurley <john.hurley@netronome.com>
Adds the control message handler to synchronize offloaded group config
with that of the kernel. Such messages are sent from fw to driver and
feature the following 3 flags:
- Data: an attached cmsg could not be processed - store for retransmission
- Xon: FW can accept new messages - retransmit any stored cmsgs
- Sync: full sync requested so retransmit all kernel LAG group info
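The three flags are not mutually exclusive and are checked in order, so a message carrying both Data and Xon is stored and then immediately retransmitted with the rest of the queue. A rough model of that dispatch, using counters in place of the skb queue, might look as follows. The struct and function names are hypothetical, the group ID validation done by the real handler is omitted, and the bit positions and limit mirror the values defined in the patch.

```c
#include <stdbool.h>
#include <stddef.h>

#define LAG_DATA	(1u << 3)	/* mirrors NFP_FL_LAG_DATA */
#define LAG_XON		(1u << 4)	/* mirrors NFP_FL_LAG_XON */
#define LAG_SYNC	(1u << 5)	/* mirrors NFP_FL_LAG_SYNC */
#define RETRANS_LIMIT	100		/* mirrors NFP_FL_LAG_RETRANS_LIMIT */

struct lag_model {
	size_t queued;		/* messages stored for retransmission */
	size_t retransmitted;	/* messages sent back to the firmware */
	bool resync;		/* full resync has been scheduled */
};

/* Returns true if the message was stored, in which case the caller
 * must not free it.
 */
static bool lag_model_handle(struct lag_model *m, unsigned int flags)
{
	bool stored = false;

	/* Store, unless the retransmission queue is already full. */
	if ((flags & LAG_DATA) && m->queued < RETRANS_LIMIT) {
		m->queued++;
		stored = true;
	}

	/* Firmware can accept messages again: flush the queue. */
	if (flags & LAG_XON) {
		m->retransmitted += m->queued;
		m->queued = 0;
	}

	/* Full resync: stale messages are dropped, config is resent. */
	if (flags & LAG_SYNC) {
		m->queued = 0;
		m->resync = true;
	}

	return stored;
}
```

The return value corresponds to skb_stored in nfp_flower_cmsg_process_one_rx(): only messages that were not queued are consumed by the caller.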
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
.../net/ethernet/netronome/nfp/flower/cmsg.c | 8 +-
.../ethernet/netronome/nfp/flower/lag_conf.c | 95 +++++++++++++++++++
.../net/ethernet/netronome/nfp/flower/main.h | 4 +
3 files changed, 105 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/netronome/nfp/flower/cmsg.c b/drivers/net/ethernet/netronome/nfp/flower/cmsg.c
index 03aae2ed9983..cb8565222621 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/cmsg.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/cmsg.c
@@ -242,6 +242,7 @@ nfp_flower_cmsg_process_one_rx(struct nfp_app *app, struct sk_buff *skb)
struct nfp_flower_priv *app_priv = app->priv;
struct nfp_flower_cmsg_hdr *cmsg_hdr;
enum nfp_flower_cmsg_type_port type;
+ bool skb_stored = false;
cmsg_hdr = nfp_flower_cmsg_get_hdr(skb);
@@ -260,8 +261,10 @@ nfp_flower_cmsg_process_one_rx(struct nfp_app *app, struct sk_buff *skb)
nfp_tunnel_keep_alive(app, skb);
break;
case NFP_FLOWER_CMSG_TYPE_LAG_CONFIG:
- if (app_priv->flower_ext_feats & NFP_FL_FEATS_LAG)
+ if (app_priv->flower_ext_feats & NFP_FL_FEATS_LAG) {
+ skb_stored = nfp_flower_lag_unprocessed_msg(app, skb);
break;
+ }
/* fall through */
default:
nfp_flower_cmsg_warn(app, "Cannot handle invalid repr control type %u\n",
@@ -269,7 +272,8 @@ nfp_flower_cmsg_process_one_rx(struct nfp_app *app, struct sk_buff *skb)
goto out;
}
- dev_consume_skb_any(skb);
+ if (!skb_stored)
+ dev_consume_skb_any(skb);
return;
out:
dev_kfree_skb_any(skb);
diff --git a/drivers/net/ethernet/netronome/nfp/flower/lag_conf.c b/drivers/net/ethernet/netronome/nfp/flower/lag_conf.c
index 35a700b879d7..a09fe2778250 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/lag_conf.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/lag_conf.c
@@ -36,6 +36,9 @@
/* LAG group config flags. */
#define NFP_FL_LAG_LAST BIT(1)
#define NFP_FL_LAG_FIRST BIT(2)
+#define NFP_FL_LAG_DATA BIT(3)
+#define NFP_FL_LAG_XON BIT(4)
+#define NFP_FL_LAG_SYNC BIT(5)
#define NFP_FL_LAG_SWITCH BIT(6)
#define NFP_FL_LAG_RESET BIT(7)
@@ -108,6 +111,8 @@ struct nfp_fl_lag_group {
/* wait for more config */
#define NFP_FL_LAG_DELAY (msecs_to_jiffies(2))
+#define NFP_FL_LAG_RETRANS_LIMIT 100 /* max retrans cmsgs to store */
+
static unsigned int nfp_fl_get_next_pkt_number(struct nfp_fl_lag *lag)
{
lag->pkt_num++;
@@ -360,6 +365,92 @@ static void nfp_fl_lag_do_work(struct work_struct *work)
mutex_unlock(&lag->lock);
}
+static int
+nfp_fl_lag_put_unprocessed(struct nfp_fl_lag *lag, struct sk_buff *skb)
+{
+ struct nfp_flower_cmsg_lag_config *cmsg_payload;
+
+ cmsg_payload = nfp_flower_cmsg_get_data(skb);
+ if (be32_to_cpu(cmsg_payload->group_id) >= NFP_FL_LAG_GROUP_MAX)
+ return -EINVAL;
+
+ /* Drop cmsg retrans if storage limit is exceeded to prevent
+ * overloading. If the fw notices that expected messages have not been
+ * received in a given time block, it will request a full resync.
+ */
+ if (skb_queue_len(&lag->retrans_skbs) >= NFP_FL_LAG_RETRANS_LIMIT)
+ return -ENOSPC;
+
+ __skb_queue_tail(&lag->retrans_skbs, skb);
+
+ return 0;
+}
+
+static void nfp_fl_send_unprocessed(struct nfp_fl_lag *lag)
+{
+ struct nfp_flower_priv *priv;
+ struct sk_buff *skb;
+
+ priv = container_of(lag, struct nfp_flower_priv, nfp_lag);
+
+ while ((skb = __skb_dequeue(&lag->retrans_skbs)))
+ nfp_ctrl_tx(priv->app->ctrl, skb);
+}
+
+bool nfp_flower_lag_unprocessed_msg(struct nfp_app *app, struct sk_buff *skb)
+{
+ struct nfp_flower_cmsg_lag_config *cmsg_payload;
+ struct nfp_flower_priv *priv = app->priv;
+ struct nfp_fl_lag_group *group_entry;
+ unsigned long int flags;
+ bool store_skb = false;
+ int err;
+
+ cmsg_payload = nfp_flower_cmsg_get_data(skb);
+ flags = cmsg_payload->ctrl_flags;
+
+ /* Note the intentional fall through below. If DATA and XON are both
+ * set, the message will be stored and sent again with the rest of the
+ * unprocessed messages list.
+ */
+
+ /* Store */
+ if (flags & NFP_FL_LAG_DATA)
+ if (!nfp_fl_lag_put_unprocessed(&priv->nfp_lag, skb))
+ store_skb = true;
+
+ /* Send stored */
+ if (flags & NFP_FL_LAG_XON)
+ nfp_fl_send_unprocessed(&priv->nfp_lag);
+
+ /* Resend all */
+ if (flags & NFP_FL_LAG_SYNC) {
+ /* To resend all config:
+ * 1) Clear all unprocessed messages
+ * 2) Mark all groups dirty
+ * 3) Reset NFP group config
+ * 4) Schedule a LAG config update
+ */
+
+ __skb_queue_purge(&priv->nfp_lag.retrans_skbs);
+
+ mutex_lock(&priv->nfp_lag.lock);
+ list_for_each_entry(group_entry, &priv->nfp_lag.group_list,
+ list)
+ group_entry->dirty = true;
+
+ err = nfp_flower_lag_reset(&priv->nfp_lag);
+ if (err)
+ nfp_flower_cmsg_warn(priv->app,
+ "mem err in group reset msg\n");
+ mutex_unlock(&priv->nfp_lag.lock);
+
+ schedule_delayed_work(&priv->nfp_lag.work, 0);
+ }
+
+ return store_skb;
+}
+
static void
nfp_fl_lag_schedule_group_remove(struct nfp_fl_lag *lag,
struct nfp_fl_lag_group *group)
@@ -565,6 +656,8 @@ void nfp_flower_lag_init(struct nfp_fl_lag *lag)
mutex_init(&lag->lock);
ida_init(&lag->ida_handle);
+ __skb_queue_head_init(&lag->retrans_skbs);
+
/* 0 is a reserved batch version so increment to first valid value. */
nfp_fl_increment_version(lag);
@@ -577,6 +670,8 @@ void nfp_flower_lag_cleanup(struct nfp_fl_lag *lag)
cancel_delayed_work_sync(&lag->work);
+ __skb_queue_purge(&lag->retrans_skbs);
+
/* Remove all groups. */
mutex_lock(&lag->lock);
list_for_each_entry_safe(entry, storage, &lag->group_list, list) {
diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.h b/drivers/net/ethernet/netronome/nfp/flower/main.h
index e03efb034948..2fd75c155ccb 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.h
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.h
@@ -109,6 +109,8 @@ struct nfp_mtu_conf {
* @batch_ver: Incremented for each batch of config packets
* @global_inst: Instance allocator for groups
* @rst_cfg: Marker to reset HW LAG config
+ * @retrans_skbs: Cmsgs that could not be processed by HW and require
+ * retransmission
*/
struct nfp_fl_lag {
struct notifier_block lag_nb;
@@ -120,6 +122,7 @@ struct nfp_fl_lag {
unsigned int batch_ver;
u8 global_inst;
bool rst_cfg;
+ struct sk_buff_head retrans_skbs;
};
/**
@@ -280,5 +283,6 @@ int nfp_flower_setup_tc_egress_cb(enum tc_setup_type type, void *type_data,
void nfp_flower_lag_init(struct nfp_fl_lag *lag);
void nfp_flower_lag_cleanup(struct nfp_fl_lag *lag);
int nfp_flower_lag_reset(struct nfp_fl_lag *lag);
+bool nfp_flower_lag_unprocessed_msg(struct nfp_app *app, struct sk_buff *skb);
#endif
--
2.17.0
* [PATCH net-next 6/8] nfp: flower: monitor and offload LAG groups
From: Jakub Kicinski @ 2018-05-24 2:22 UTC (permalink / raw)
To: davem; +Cc: netdev, oss-drivers, John Hurley
In-Reply-To: <20180524022255.18548-1-jakub.kicinski@netronome.com>
From: John Hurley <john.hurley@netronome.com>
Monitor LAG events via the NETDEV_CHANGEUPPER/NETDEV_CHANGELOWERSTATE
notifiers to maintain a list of offloadable groups. Sync these groups with
HW via a delayed workqueue to prevent excessive re-configuration. When the
workqueue is triggered it may generate multiple control messages for
different groups. These messages are linked via a batch ID and flags to
indicate a new batch and the end of a batch.
Update private data in each repr to track their LAG lower state flags. The
state of a repr is used to determine the active netdevs that can be
offloaded. For example, in active-backup mode, we only offload the netdev
currently active.
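The batch ID mentioned above is advanced by nfp_fl_increment_version() in the patch below: it steps by 2 because the firmware ignores the least significant bit, wraps at 23 bits, and skips the reserved value 0. A standalone sketch of that arithmetic is shown here. lag_next_version() is a hypothetical pure-function variant written for illustration; the driver updates lag->batch_ver in place.

```c
#include <stdint.h>

#define LAG_VERSION_MASK 0x7fffffu	/* GENMASK(22, 0) */

/* Return the batch version that follows `ver`. */
static uint32_t lag_next_version(uint32_t ver)
{
	/* LSB is not considered by firmware, so step by 2. */
	ver += 2;
	ver &= LAG_VERSION_MASK;

	/* Zero is reserved by firmware, so skip it on wrap-around. */
	if (!ver)
		ver += 2;

	return ver;
}
```

Starting from 0, the sequence is 2, 4, 6 and so on; after 0x7ffffe the masked sum is 0, so the next valid version is 2 again.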
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
drivers/net/ethernet/netronome/nfp/Makefile | 1 +
.../ethernet/netronome/nfp/flower/lag_conf.c | 589 ++++++++++++++++++
.../net/ethernet/netronome/nfp/flower/main.c | 29 +-
.../net/ethernet/netronome/nfp/flower/main.h | 30 +
4 files changed, 646 insertions(+), 3 deletions(-)
create mode 100644 drivers/net/ethernet/netronome/nfp/flower/lag_conf.c
diff --git a/drivers/net/ethernet/netronome/nfp/Makefile b/drivers/net/ethernet/netronome/nfp/Makefile
index 6373f56205fd..4afb10375397 100644
--- a/drivers/net/ethernet/netronome/nfp/Makefile
+++ b/drivers/net/ethernet/netronome/nfp/Makefile
@@ -37,6 +37,7 @@ ifeq ($(CONFIG_NFP_APP_FLOWER),y)
nfp-objs += \
flower/action.o \
flower/cmsg.o \
+ flower/lag_conf.o \
flower/main.o \
flower/match.o \
flower/metadata.o \
diff --git a/drivers/net/ethernet/netronome/nfp/flower/lag_conf.c b/drivers/net/ethernet/netronome/nfp/flower/lag_conf.c
new file mode 100644
index 000000000000..35a700b879d7
--- /dev/null
+++ b/drivers/net/ethernet/netronome/nfp/flower/lag_conf.c
@@ -0,0 +1,589 @@
+/*
+ * Copyright (C) 2018 Netronome Systems, Inc.
+ *
+ * This software is dual licensed under the GNU General License Version 2,
+ * June 1991 as shown in the file COPYING in the top-level directory of this
+ * source tree or the BSD 2-Clause License provided below. You have the
+ * option to license this software under the complete terms of either license.
+ *
+ * The BSD 2-Clause License:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include "main.h"
+
+/* LAG group config flags. */
+#define NFP_FL_LAG_LAST BIT(1)
+#define NFP_FL_LAG_FIRST BIT(2)
+#define NFP_FL_LAG_SWITCH BIT(6)
+#define NFP_FL_LAG_RESET BIT(7)
+
+/* LAG port state flags. */
+#define NFP_PORT_LAG_LINK_UP BIT(0)
+#define NFP_PORT_LAG_TX_ENABLED BIT(1)
+#define NFP_PORT_LAG_CHANGED BIT(2)
+
+enum nfp_fl_lag_batch {
+ NFP_FL_LAG_BATCH_FIRST,
+ NFP_FL_LAG_BATCH_MEMBER,
+ NFP_FL_LAG_BATCH_FINISHED
+};
+
+/**
+ * struct nfp_flower_cmsg_lag_config - control message payload for LAG config
+ * @ctrl_flags: Configuration flags
+ * @reserved: Reserved for future use
+ * @ttl: Time to live of packet - host always sets to 0xff
+ * @pkt_number: Config message packet number - increment for each message
+ * @batch_ver: Batch version of messages - increment for each batch of messages
+ * @group_id: Group ID applicable
+ * @group_inst: Group instance number - increment when group is reused
+ * @members: Array of 32-bit words listing all active group members
+ */
+struct nfp_flower_cmsg_lag_config {
+ u8 ctrl_flags;
+ u8 reserved[2];
+ u8 ttl;
+ __be32 pkt_number;
+ __be32 batch_ver;
+ __be32 group_id;
+ __be32 group_inst;
+ __be32 members[];
+};
+
+/**
+ * struct nfp_fl_lag_group - list entry for each LAG group
+ * @group_id: Assigned group ID for host/kernel sync
+ * @group_inst: Group instance in case of ID reuse
+ * @list: List entry
+ * @master_ndev: Group master Netdev
+ * @dirty: Marked if the group needs to be synced to HW
+ * @offloaded: Marked if the group is currently offloaded to NIC
+ * @to_remove: Marked if the group should be removed from NIC
+ * @to_destroy: Marked if the group should be removed from driver
+ * @slave_cnt: Number of slaves in group
+ */
+struct nfp_fl_lag_group {
+ unsigned int group_id;
+ u8 group_inst;
+ struct list_head list;
+ struct net_device *master_ndev;
+ bool dirty;
+ bool offloaded;
+ bool to_remove;
+ bool to_destroy;
+ unsigned int slave_cnt;
+};
+
+#define NFP_FL_LAG_PKT_NUMBER_MASK GENMASK(30, 0)
+#define NFP_FL_LAG_VERSION_MASK GENMASK(22, 0)
+#define NFP_FL_LAG_HOST_TTL 0xff
+
+/* Use this ID with zero members to ack a batch config */
+#define NFP_FL_LAG_SYNC_ID 0
+#define NFP_FL_LAG_GROUP_MIN 1 /* ID 0 reserved */
+#define NFP_FL_LAG_GROUP_MAX 32 /* IDs 1 to 31 are valid */
+
+/* wait for more config */
+#define NFP_FL_LAG_DELAY (msecs_to_jiffies(2))
+
+static unsigned int nfp_fl_get_next_pkt_number(struct nfp_fl_lag *lag)
+{
+ lag->pkt_num++;
+ lag->pkt_num &= NFP_FL_LAG_PKT_NUMBER_MASK;
+
+ return lag->pkt_num;
+}
+
+static void nfp_fl_increment_version(struct nfp_fl_lag *lag)
+{
+ /* LSB is not considered by firmware so add 2 for each increment. */
+ lag->batch_ver += 2;
+ lag->batch_ver &= NFP_FL_LAG_VERSION_MASK;
+
+ /* Zero is reserved by firmware. */
+ if (!lag->batch_ver)
+ lag->batch_ver += 2;
+}
+
+static struct nfp_fl_lag_group *
+nfp_fl_lag_group_create(struct nfp_fl_lag *lag, struct net_device *master)
+{
+ struct nfp_fl_lag_group *group;
+ struct nfp_flower_priv *priv;
+ int id;
+
+ priv = container_of(lag, struct nfp_flower_priv, nfp_lag);
+
+ id = ida_simple_get(&lag->ida_handle, NFP_FL_LAG_GROUP_MIN,
+ NFP_FL_LAG_GROUP_MAX, GFP_KERNEL);
+ if (id < 0) {
+ nfp_flower_cmsg_warn(priv->app,
+ "No more bonding groups available\n");
+ return ERR_PTR(id);
+ }
+
+ group = kmalloc(sizeof(*group), GFP_KERNEL);
+ if (!group) {
+ ida_simple_remove(&lag->ida_handle, id);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ group->group_id = id;
+ group->master_ndev = master;
+ group->dirty = true;
+ group->offloaded = false;
+ group->to_remove = false;
+ group->to_destroy = false;
+ group->slave_cnt = 0;
+ group->group_inst = ++lag->global_inst;
+ list_add_tail(&group->list, &lag->group_list);
+
+ return group;
+}
+
+static struct nfp_fl_lag_group *
+nfp_fl_lag_find_group_for_master_with_lag(struct nfp_fl_lag *lag,
+ struct net_device *master)
+{
+ struct nfp_fl_lag_group *entry;
+
+ if (!master)
+ return NULL;
+
+ list_for_each_entry(entry, &lag->group_list, list)
+ if (entry->master_ndev == master)
+ return entry;
+
+ return NULL;
+}
+
+static int
+nfp_fl_lag_config_group(struct nfp_fl_lag *lag, struct nfp_fl_lag_group *group,
+ struct net_device **active_members,
+ unsigned int member_cnt, enum nfp_fl_lag_batch *batch)
+{
+ struct nfp_flower_cmsg_lag_config *cmsg_payload;
+ struct nfp_flower_priv *priv;
+ unsigned long int flags;
+ unsigned int size, i;
+ struct sk_buff *skb;
+
+ priv = container_of(lag, struct nfp_flower_priv, nfp_lag);
+ size = sizeof(*cmsg_payload) + sizeof(__be32) * member_cnt;
+ skb = nfp_flower_cmsg_alloc(priv->app, size,
+ NFP_FLOWER_CMSG_TYPE_LAG_CONFIG,
+ GFP_KERNEL);
+ if (!skb)
+ return -ENOMEM;
+
+ cmsg_payload = nfp_flower_cmsg_get_data(skb);
+ flags = 0;
+
+ /* Increment batch version for each new batch of config messages. */
+ if (*batch == NFP_FL_LAG_BATCH_FIRST) {
+ flags |= NFP_FL_LAG_FIRST;
+ nfp_fl_increment_version(lag);
+ *batch = NFP_FL_LAG_BATCH_MEMBER;
+ }
+
+ /* If it is a reset msg then it is also the end of the batch. */
+ if (lag->rst_cfg) {
+ flags |= NFP_FL_LAG_RESET;
+ *batch = NFP_FL_LAG_BATCH_FINISHED;
+ }
+
+ /* To signal the end of a batch, both the switch and last flags are set
+ * and the reserved SYNC group ID is used.
+ */
+ if (*batch == NFP_FL_LAG_BATCH_FINISHED) {
+ flags |= NFP_FL_LAG_SWITCH | NFP_FL_LAG_LAST;
+ lag->rst_cfg = false;
+ cmsg_payload->group_id = cpu_to_be32(NFP_FL_LAG_SYNC_ID);
+ cmsg_payload->group_inst = 0;
+ } else {
+ cmsg_payload->group_id = cpu_to_be32(group->group_id);
+ cmsg_payload->group_inst = cpu_to_be32(group->group_inst);
+ }
+
+ cmsg_payload->reserved[0] = 0;
+ cmsg_payload->reserved[1] = 0;
+ cmsg_payload->ttl = NFP_FL_LAG_HOST_TTL;
+ cmsg_payload->ctrl_flags = flags;
+ cmsg_payload->batch_ver = cpu_to_be32(lag->batch_ver);
+ cmsg_payload->pkt_number = cpu_to_be32(nfp_fl_get_next_pkt_number(lag));
+
+ for (i = 0; i < member_cnt; i++)
+ cmsg_payload->members[i] =
+ cpu_to_be32(nfp_repr_get_port_id(active_members[i]));
+
+ nfp_ctrl_tx(priv->app->ctrl, skb);
+ return 0;
+}
+
+static void nfp_fl_lag_do_work(struct work_struct *work)
+{
+ enum nfp_fl_lag_batch batch = NFP_FL_LAG_BATCH_FIRST;
+ struct nfp_fl_lag_group *entry, *storage;
+ struct delayed_work *delayed_work;
+ struct nfp_flower_priv *priv;
+ struct nfp_fl_lag *lag;
+ int err;
+
+ delayed_work = to_delayed_work(work);
+ lag = container_of(delayed_work, struct nfp_fl_lag, work);
+ priv = container_of(lag, struct nfp_flower_priv, nfp_lag);
+
+ mutex_lock(&lag->lock);
+ list_for_each_entry_safe(entry, storage, &lag->group_list, list) {
+ struct net_device *iter_netdev, **acti_netdevs;
+ struct nfp_flower_repr_priv *repr_priv;
+ int active_count = 0, slaves = 0;
+ struct nfp_repr *repr;
+ unsigned long *flags;
+
+ if (entry->to_remove) {
+ /* Active count of 0 deletes group on hw. */
+ err = nfp_fl_lag_config_group(lag, entry, NULL, 0,
+ &batch);
+ if (!err) {
+ entry->to_remove = false;
+ entry->offloaded = false;
+ } else {
+ nfp_flower_cmsg_warn(priv->app,
+ "group delete failed\n");
+ schedule_delayed_work(&lag->work,
+ NFP_FL_LAG_DELAY);
+ continue;
+ }
+
+ if (entry->to_destroy) {
+ ida_simple_remove(&lag->ida_handle,
+ entry->group_id);
+ list_del(&entry->list);
+ kfree(entry);
+ }
+ continue;
+ }
+
+ acti_netdevs = kmalloc_array(entry->slave_cnt,
+ sizeof(*acti_netdevs), GFP_KERNEL);
+
+ /* Include sanity check in the loop. It may be that a bond has
+ * changed between processing the last notification and the
+ * work queue triggering. If the number of slaves has changed
+ * or it now contains netdevs that cannot be offloaded, ignore
+ * the group until pending notifications are processed.
+ */
+ rcu_read_lock();
+ for_each_netdev_in_bond_rcu(entry->master_ndev, iter_netdev) {
+ if (!nfp_netdev_is_nfp_repr(iter_netdev)) {
+ slaves = 0;
+ break;
+ }
+
+ repr = netdev_priv(iter_netdev);
+
+ if (repr->app != priv->app) {
+ slaves = 0;
+ break;
+ }
+
+ slaves++;
+ if (slaves > entry->slave_cnt)
+ break;
+
+ /* Check the ports for state changes. */
+ repr_priv = repr->app_priv;
+ flags = &repr_priv->lag_port_flags;
+
+ if (*flags & NFP_PORT_LAG_CHANGED) {
+ *flags &= ~NFP_PORT_LAG_CHANGED;
+ entry->dirty = true;
+ }
+
+ if ((*flags & NFP_PORT_LAG_TX_ENABLED) &&
+ (*flags & NFP_PORT_LAG_LINK_UP))
+ acti_netdevs[active_count++] = iter_netdev;
+ }
+ rcu_read_unlock();
+
+ if (slaves != entry->slave_cnt || !entry->dirty) {
+ kfree(acti_netdevs);
+ continue;
+ }
+
+ err = nfp_fl_lag_config_group(lag, entry, acti_netdevs,
+ active_count, &batch);
+ if (!err) {
+ entry->offloaded = true;
+ entry->dirty = false;
+ } else {
+ nfp_flower_cmsg_warn(priv->app,
+ "group offload failed\n");
+ schedule_delayed_work(&lag->work, NFP_FL_LAG_DELAY);
+ }
+
+ kfree(acti_netdevs);
+ }
+
+ /* End the config batch if at least one packet has been batched. */
+ if (batch == NFP_FL_LAG_BATCH_MEMBER) {
+ batch = NFP_FL_LAG_BATCH_FINISHED;
+ err = nfp_fl_lag_config_group(lag, NULL, NULL, 0, &batch);
+ if (err)
+ nfp_flower_cmsg_warn(priv->app,
+ "group batch end cmsg failed\n");
+ }
+
+ mutex_unlock(&lag->lock);
+}
+
+static void
+nfp_fl_lag_schedule_group_remove(struct nfp_fl_lag *lag,
+ struct nfp_fl_lag_group *group)
+{
+ group->to_remove = true;
+
+ schedule_delayed_work(&lag->work, NFP_FL_LAG_DELAY);
+}
+
+static int
+nfp_fl_lag_schedule_group_delete(struct nfp_fl_lag *lag,
+ struct net_device *master)
+{
+ struct nfp_fl_lag_group *group;
+
+ mutex_lock(&lag->lock);
+ group = nfp_fl_lag_find_group_for_master_with_lag(lag, master);
+ if (!group) {
+ mutex_unlock(&lag->lock);
+ return -ENOENT;
+ }
+
+ group->to_remove = true;
+ group->to_destroy = true;
+ mutex_unlock(&lag->lock);
+
+ schedule_delayed_work(&lag->work, NFP_FL_LAG_DELAY);
+ return 0;
+}
+
+static int
+nfp_fl_lag_changeupper_event(struct nfp_fl_lag *lag,
+ struct netdev_notifier_changeupper_info *info)
+{
+ struct net_device *upper = info->upper_dev, *iter_netdev;
+ struct netdev_lag_upper_info *lag_upper_info;
+ struct nfp_fl_lag_group *group;
+ struct nfp_flower_priv *priv;
+ unsigned int slave_count = 0;
+ bool can_offload = true;
+ struct nfp_repr *repr;
+
+ if (!netif_is_lag_master(upper))
+ return 0;
+
+ priv = container_of(lag, struct nfp_flower_priv, nfp_lag);
+
+ rcu_read_lock();
+ for_each_netdev_in_bond_rcu(upper, iter_netdev) {
+ if (!nfp_netdev_is_nfp_repr(iter_netdev)) {
+ can_offload = false;
+ break;
+ }
+ repr = netdev_priv(iter_netdev);
+
+ /* Ensure all ports are created by the same app/on same card. */
+ if (repr->app != priv->app) {
+ can_offload = false;
+ break;
+ }
+
+ slave_count++;
+ }
+ rcu_read_unlock();
+
+ lag_upper_info = info->upper_info;
+
+ /* Firmware supports active/backup and L3/L4 hash bonds. */
+ if (lag_upper_info &&
+ lag_upper_info->tx_type != NETDEV_LAG_TX_TYPE_ACTIVEBACKUP &&
+ (lag_upper_info->tx_type != NETDEV_LAG_TX_TYPE_HASH ||
+ (lag_upper_info->hash_type != NETDEV_LAG_HASH_L34 &&
+ lag_upper_info->hash_type != NETDEV_LAG_HASH_E34))) {
+ can_offload = false;
+ nfp_flower_cmsg_warn(priv->app,
+ "Unable to offload tx_type %u hash %u\n",
+ lag_upper_info->tx_type,
+ lag_upper_info->hash_type);
+ }
+
+ mutex_lock(&lag->lock);
+ group = nfp_fl_lag_find_group_for_master_with_lag(lag, upper);
+
+ if (slave_count == 0 || !can_offload) {
+ /* Cannot offload the group - remove if previously offloaded. */
+ if (group && group->offloaded)
+ nfp_fl_lag_schedule_group_remove(lag, group);
+
+ mutex_unlock(&lag->lock);
+ return 0;
+ }
+
+ if (!group) {
+ group = nfp_fl_lag_group_create(lag, upper);
+ if (IS_ERR(group)) {
+ mutex_unlock(&lag->lock);
+ return PTR_ERR(group);
+ }
+ }
+
+ group->dirty = true;
+ group->slave_cnt = slave_count;
+
+ /* Group may have been on queue for removal but is now offloadable. */
+ group->to_remove = false;
+ mutex_unlock(&lag->lock);
+
+ schedule_delayed_work(&lag->work, NFP_FL_LAG_DELAY);
+ return 0;
+}
+
+static int
+nfp_fl_lag_changels_event(struct nfp_fl_lag *lag, struct net_device *netdev,
+ struct netdev_notifier_changelowerstate_info *info)
+{
+ struct netdev_lag_lower_state_info *lag_lower_info;
+ struct nfp_flower_repr_priv *repr_priv;
+ struct nfp_flower_priv *priv;
+ struct nfp_repr *repr;
+ unsigned long *flags;
+
+ if (!netif_is_lag_port(netdev) || !nfp_netdev_is_nfp_repr(netdev))
+ return 0;
+
+ lag_lower_info = info->lower_state_info;
+ if (!lag_lower_info)
+ return 0;
+
+ priv = container_of(lag, struct nfp_flower_priv, nfp_lag);
+ repr = netdev_priv(netdev);
+
+ /* Verify that the repr is associated with this app. */
+ if (repr->app != priv->app)
+ return 0;
+
+ repr_priv = repr->app_priv;
+ flags = &repr_priv->lag_port_flags;
+
+ mutex_lock(&lag->lock);
+ if (lag_lower_info->link_up)
+ *flags |= NFP_PORT_LAG_LINK_UP;
+ else
+ *flags &= ~NFP_PORT_LAG_LINK_UP;
+
+ if (lag_lower_info->tx_enabled)
+ *flags |= NFP_PORT_LAG_TX_ENABLED;
+ else
+ *flags &= ~NFP_PORT_LAG_TX_ENABLED;
+
+ *flags |= NFP_PORT_LAG_CHANGED;
+ mutex_unlock(&lag->lock);
+
+ schedule_delayed_work(&lag->work, NFP_FL_LAG_DELAY);
+ return 0;
+}
+
+static int
+nfp_fl_lag_netdev_event(struct notifier_block *nb, unsigned long event,
+ void *ptr)
+{
+ struct net_device *netdev;
+ struct nfp_fl_lag *lag;
+ int err;
+
+ netdev = netdev_notifier_info_to_dev(ptr);
+ lag = container_of(nb, struct nfp_fl_lag, lag_nb);
+
+ switch (event) {
+ case NETDEV_CHANGEUPPER:
+ err = nfp_fl_lag_changeupper_event(lag, ptr);
+ if (err)
+ return NOTIFY_BAD;
+ return NOTIFY_OK;
+ case NETDEV_CHANGELOWERSTATE:
+ err = nfp_fl_lag_changels_event(lag, netdev, ptr);
+ if (err)
+ return NOTIFY_BAD;
+ return NOTIFY_OK;
+ case NETDEV_UNREGISTER:
+ if (netif_is_bond_master(netdev)) {
+ err = nfp_fl_lag_schedule_group_delete(lag, netdev);
+ if (err)
+ return NOTIFY_BAD;
+ return NOTIFY_OK;
+ }
+ }
+
+ return NOTIFY_DONE;
+}
+
+int nfp_flower_lag_reset(struct nfp_fl_lag *lag)
+{
+ enum nfp_fl_lag_batch batch = NFP_FL_LAG_BATCH_FIRST;
+
+ lag->rst_cfg = true;
+ return nfp_fl_lag_config_group(lag, NULL, NULL, 0, &batch);
+}
+
+void nfp_flower_lag_init(struct nfp_fl_lag *lag)
+{
+ INIT_DELAYED_WORK(&lag->work, nfp_fl_lag_do_work);
+ INIT_LIST_HEAD(&lag->group_list);
+ mutex_init(&lag->lock);
+ ida_init(&lag->ida_handle);
+
+ /* 0 is a reserved batch version so increment to first valid value. */
+ nfp_fl_increment_version(lag);
+
+ lag->lag_nb.notifier_call = nfp_fl_lag_netdev_event;
+}
+
+void nfp_flower_lag_cleanup(struct nfp_fl_lag *lag)
+{
+ struct nfp_fl_lag_group *entry, *storage;
+
+ cancel_delayed_work_sync(&lag->work);
+
+ /* Remove all groups. */
+ mutex_lock(&lag->lock);
+ list_for_each_entry_safe(entry, storage, &lag->group_list, list) {
+ list_del(&entry->list);
+ kfree(entry);
+ }
+ mutex_unlock(&lag->lock);
+ mutex_destroy(&lag->lock);
+ ida_destroy(&lag->ida_handle);
+}
diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.c b/drivers/net/ethernet/netronome/nfp/flower/main.c
index 202284b42fd9..19cfa162ac65 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.c
@@ -575,12 +575,14 @@ static int nfp_flower_init(struct nfp_app *app)
/* Tell the firmware that the driver supports lag. */
err = nfp_rtsym_write_le(app->pf->rtbl,
"_abi_flower_balance_sync_enable", 1);
- if (!err)
+ if (!err) {
app_priv->flower_ext_feats |= NFP_FL_FEATS_LAG;
- else if (err == -ENOENT)
+ nfp_flower_lag_init(&app_priv->nfp_lag);
+ } else if (err == -ENOENT) {
nfp_warn(app->cpp, "LAG not supported by FW.\n");
- else
+ } else {
goto err_cleanup_metadata;
+ }
return 0;
@@ -599,6 +601,9 @@ static void nfp_flower_clean(struct nfp_app *app)
skb_queue_purge(&app_priv->cmsg_skbs_low);
flush_work(&app_priv->cmsg_work);
+ if (app_priv->flower_ext_feats & NFP_FL_FEATS_LAG)
+ nfp_flower_lag_cleanup(&app_priv->nfp_lag);
+
nfp_flower_metadata_cleanup(app);
vfree(app->priv);
app->priv = NULL;
@@ -665,11 +670,29 @@ nfp_flower_repr_change_mtu(struct nfp_app *app, struct net_device *netdev,
static int nfp_flower_start(struct nfp_app *app)
{
+ struct nfp_flower_priv *app_priv = app->priv;
+ int err;
+
+ if (app_priv->flower_ext_feats & NFP_FL_FEATS_LAG) {
+ err = nfp_flower_lag_reset(&app_priv->nfp_lag);
+ if (err)
+ return err;
+
+ err = register_netdevice_notifier(&app_priv->nfp_lag.lag_nb);
+ if (err)
+ return err;
+ }
+
return nfp_tunnel_config_start(app);
}
static void nfp_flower_stop(struct nfp_app *app)
{
+ struct nfp_flower_priv *app_priv = app->priv;
+
+ if (app_priv->flower_ext_feats & NFP_FL_FEATS_LAG)
+ unregister_netdevice_notifier(&app_priv->nfp_lag.lag_nb);
+
nfp_tunnel_config_stop(app);
}
diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.h b/drivers/net/ethernet/netronome/nfp/flower/main.h
index 7ce255705446..e03efb034948 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.h
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.h
@@ -43,6 +43,7 @@
#include <net/pkt_cls.h>
#include <net/tcp.h>
#include <linux/workqueue.h>
+#include <linux/idr.h>
struct net_device;
struct nfp_app;
@@ -97,6 +98,30 @@ struct nfp_mtu_conf {
spinlock_t lock;
};
+/**
+ * struct nfp_fl_lag - Flower APP priv data for link aggregation
+ * @lag_nb: Notifier to track master/slave events
+ * @work: Work queue for writing configs to the HW
+ * @lock: Lock to protect group_list
+ * @group_list: List of all master/slave groups offloaded
+ * @ida_handle: IDA to handle group ids
+ * @pkt_num: Incremented for each config packet sent
+ * @batch_ver: Incremented for each batch of config packets
+ * @global_inst: Instance allocator for groups
+ * @rst_cfg: Marker to reset HW LAG config
+ */
+struct nfp_fl_lag {
+ struct notifier_block lag_nb;
+ struct delayed_work work;
+ struct mutex lock;
+ struct list_head group_list;
+ struct ida ida_handle;
+ unsigned int pkt_num;
+ unsigned int batch_ver;
+ u8 global_inst;
+ bool rst_cfg;
+};
+
/**
* struct nfp_flower_priv - Flower APP per-vNIC priv data
* @app: Back pointer to app
@@ -129,6 +154,7 @@ struct nfp_mtu_conf {
* from firmware for repr reify
* @reify_wait_queue: wait queue for repr reify response counting
* @mtu_conf: Configuration of repr MTU value
+ * @nfp_lag: Link aggregation data block
*/
struct nfp_flower_priv {
struct nfp_app *app;
@@ -158,6 +184,7 @@ struct nfp_flower_priv {
atomic_t reify_replies;
wait_queue_head_t reify_wait_queue;
struct nfp_mtu_conf mtu_conf;
+ struct nfp_fl_lag nfp_lag;
};
/**
@@ -250,5 +277,8 @@ void nfp_tunnel_request_route(struct nfp_app *app, struct sk_buff *skb);
void nfp_tunnel_keep_alive(struct nfp_app *app, struct sk_buff *skb);
int nfp_flower_setup_tc_egress_cb(enum tc_setup_type type, void *type_data,
void *cb_priv);
+void nfp_flower_lag_init(struct nfp_fl_lag *lag);
+void nfp_flower_lag_cleanup(struct nfp_fl_lag *lag);
+int nfp_flower_lag_reset(struct nfp_fl_lag *lag);
#endif
--
2.17.0
^ permalink raw reply related
* [PATCH net-next 5/8] net: include hash policy in LAG changeupper info
From: Jakub Kicinski @ 2018-05-24 2:22 UTC (permalink / raw)
To: davem
Cc: netdev, oss-drivers, John Hurley, Jiri Pirko, Jay Vosburgh,
Veaceslav Falico, Andy Gospodarek
In-Reply-To: <20180524022255.18548-1-jakub.kicinski@netronome.com>
From: John Hurley <john.hurley@netronome.com>
LAG upper event notifiers contain the tx type used by the LAG device.
Extend this to also include the hash policy used for tx types that
utilize hashing.
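As an illustration of how a consumer of the notifier might use the new field, here is a minimal sketch of the offload decision made in patch 6. The enum and function names below (ex_tx_type, ex_hash, ex_lag_can_offload) are stand-ins invented for this example so that it compiles outside the kernel; they are not the kernel definitions.

```c
#include <stdbool.h>

/* Stand-in for the subset of enum netdev_lag_tx_type used here (assumption). */
enum ex_tx_type {
	EX_TX_UNKNOWN,
	EX_TX_ACTIVEBACKUP,
	EX_TX_HASH,
};

/* Stand-in mirroring the order of the new enum netdev_lag_hash. */
enum ex_hash {
	EX_HASH_NONE,
	EX_HASH_L2,
	EX_HASH_L34,
	EX_HASH_L23,
	EX_HASH_E23,
	EX_HASH_E34,
	EX_HASH_UNKNOWN,
};

/*
 * A bond is considered offloadable if it is active/backup, or if it is
 * hash based and the hash policy is L3/L4 (plain or encapsulated).
 */
static bool ex_lag_can_offload(enum ex_tx_type tx, enum ex_hash hash)
{
	if (tx == EX_TX_ACTIVEBACKUP)
		return true;

	if (tx != EX_TX_HASH)
		return false;

	return hash == EX_HASH_L34 || hash == EX_HASH_E34;
}
```

With this shape, a team device (which reports an unknown hash type in this patch) would only pass the check when it is in active/backup mode.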
Signed-off-by: John Hurley <john.hurley@netronome.com>
---
CC: Jiri Pirko <jiri@resnulli.us>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
drivers/net/bonding/bond_main.c | 27 ++++++++++++++++++++++++++-
drivers/net/team/team.c | 1 +
include/linux/netdevice.h | 11 +++++++++++
3 files changed, 38 insertions(+), 1 deletion(-)
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index fea17b92b1ae..bd53a71f6b00 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1218,12 +1218,37 @@ static enum netdev_lag_tx_type bond_lag_tx_type(struct bonding *bond)
}
}
+static enum netdev_lag_hash bond_lag_hash_type(struct bonding *bond,
+ enum netdev_lag_tx_type type)
+{
+ if (type != NETDEV_LAG_TX_TYPE_HASH)
+ return NETDEV_LAG_HASH_NONE;
+
+ switch (bond->params.xmit_policy) {
+ case BOND_XMIT_POLICY_LAYER2:
+ return NETDEV_LAG_HASH_L2;
+ case BOND_XMIT_POLICY_LAYER34:
+ return NETDEV_LAG_HASH_L34;
+ case BOND_XMIT_POLICY_LAYER23:
+ return NETDEV_LAG_HASH_L23;
+ case BOND_XMIT_POLICY_ENCAP23:
+ return NETDEV_LAG_HASH_E23;
+ case BOND_XMIT_POLICY_ENCAP34:
+ return NETDEV_LAG_HASH_E34;
+ default:
+ return NETDEV_LAG_HASH_UNKNOWN;
+ }
+}
+
static int bond_master_upper_dev_link(struct bonding *bond, struct slave *slave,
struct netlink_ext_ack *extack)
{
struct netdev_lag_upper_info lag_upper_info;
+ enum netdev_lag_tx_type type;
- lag_upper_info.tx_type = bond_lag_tx_type(bond);
+ type = bond_lag_tx_type(bond);
+ lag_upper_info.tx_type = type;
+ lag_upper_info.hash_type = bond_lag_hash_type(bond, type);
return netdev_master_upper_dev_link(slave->dev, bond->dev, slave,
&lag_upper_info, extack);
diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index d6ff881165d0..e6730a01d130 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -1129,6 +1129,7 @@ static int team_upper_dev_link(struct team *team, struct team_port *port,
int err;
lag_upper_info.tx_type = team->mode->lag_tx_type;
+ lag_upper_info.hash_type = NETDEV_LAG_HASH_UNKNOWN;
err = netdev_master_upper_dev_link(port->dev, team->dev, NULL,
&lag_upper_info, extack);
if (err)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 03ed492c4e14..e97ba5e885a0 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2328,8 +2328,19 @@ enum netdev_lag_tx_type {
NETDEV_LAG_TX_TYPE_HASH,
};
+enum netdev_lag_hash {
+ NETDEV_LAG_HASH_NONE,
+ NETDEV_LAG_HASH_L2,
+ NETDEV_LAG_HASH_L34,
+ NETDEV_LAG_HASH_L23,
+ NETDEV_LAG_HASH_E23,
+ NETDEV_LAG_HASH_E34,
+ NETDEV_LAG_HASH_UNKNOWN,
+};
+
struct netdev_lag_upper_info {
enum netdev_lag_tx_type tx_type;
+ enum netdev_lag_hash hash_type;
};
struct netdev_lag_lower_state_info {
--
2.17.0
^ permalink raw reply related
* [PATCH net-next 4/8] nfp: flower: add per repr private data for LAG offload
From: Jakub Kicinski @ 2018-05-24 2:22 UTC (permalink / raw)
To: davem; +Cc: netdev, oss-drivers, John Hurley
In-Reply-To: <20180524022255.18548-1-jakub.kicinski@netronome.com>
From: John Hurley <john.hurley@netronome.com>
Add a bitmap to each flower repr to track its state if it is enslaved by a
bond. This LAG state may be different to the port state - for example, the
port may be up but LAG state may be down due to the selection in an
active/backup bond.
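A rough sketch of how such a per-repr bitmap could be maintained and queried follows. The bit positions and the ex_ names are assumptions made for this example; the real flag definitions arrive with the LAG monitoring patch and may differ.

```c
#include <stdbool.h>

/* Hypothetical bit assignments for the per-repr LAG state. */
#define EX_PORT_LAG_LINK_UP	(1UL << 0)
#define EX_PORT_LAG_TX_ENABLED	(1UL << 1)
#define EX_PORT_LAG_CHANGED	(1UL << 2)

/* Record the state reported by a lower-state notification. */
static void ex_lag_update(unsigned long *flags, bool link_up, bool tx_enabled)
{
	if (link_up)
		*flags |= EX_PORT_LAG_LINK_UP;
	else
		*flags &= ~EX_PORT_LAG_LINK_UP;

	if (tx_enabled)
		*flags |= EX_PORT_LAG_TX_ENABLED;
	else
		*flags &= ~EX_PORT_LAG_TX_ENABLED;

	/* Mark the port so the worker knows the group needs a resync. */
	*flags |= EX_PORT_LAG_CHANGED;
}

/* A port is an active LAG member only if the link is up and tx is enabled. */
static bool ex_lag_is_active(unsigned long flags)
{
	return (flags & EX_PORT_LAG_LINK_UP) &&
	       (flags & EX_PORT_LAG_TX_ENABLED);
}
```

The backup port of an active/backup bond is the case the commit message describes: its link is up, but tx is disabled, so it is not reported as an active member.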
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
.../net/ethernet/netronome/nfp/flower/main.c | 26 +++++++++++++++++++
.../net/ethernet/netronome/nfp/flower/main.h | 8 ++++++
2 files changed, 34 insertions(+)
diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.c b/drivers/net/ethernet/netronome/nfp/flower/main.c
index 1910c3e2b3e5..202284b42fd9 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.c
@@ -185,6 +185,10 @@ nfp_flower_repr_netdev_init(struct nfp_app *app, struct net_device *netdev)
static void
nfp_flower_repr_netdev_clean(struct nfp_app *app, struct net_device *netdev)
{
+ struct nfp_repr *repr = netdev_priv(netdev);
+
+ kfree(repr->app_priv);
+
tc_setup_cb_egdev_unregister(netdev, nfp_flower_setup_tc_egress_cb,
netdev_priv(netdev));
}
@@ -225,7 +229,9 @@ nfp_flower_spawn_vnic_reprs(struct nfp_app *app,
u8 nfp_pcie = nfp_cppcore_pcie_unit(app->pf->cpp);
struct nfp_flower_priv *priv = app->priv;
atomic_t *replies = &priv->reify_replies;
+ struct nfp_flower_repr_priv *repr_priv;
enum nfp_port_type port_type;
+ struct nfp_repr *nfp_repr;
struct nfp_reprs *reprs;
int i, err, reify_cnt;
const u8 queue = 0;
@@ -248,6 +254,15 @@ nfp_flower_spawn_vnic_reprs(struct nfp_app *app,
goto err_reprs_clean;
}
+ repr_priv = kzalloc(sizeof(*repr_priv), GFP_KERNEL);
+ if (!repr_priv) {
+ err = -ENOMEM;
+ goto err_reprs_clean;
+ }
+
+ nfp_repr = netdev_priv(repr);
+ nfp_repr->app_priv = repr_priv;
+
/* For now we only support 1 PF */
WARN_ON(repr_type == NFP_REPR_TYPE_PF && i);
@@ -324,6 +339,8 @@ nfp_flower_spawn_phy_reprs(struct nfp_app *app, struct nfp_flower_priv *priv)
{
struct nfp_eth_table *eth_tbl = app->pf->eth_tbl;
atomic_t *replies = &priv->reify_replies;
+ struct nfp_flower_repr_priv *repr_priv;
+ struct nfp_repr *nfp_repr;
struct sk_buff *ctrl_skb;
struct nfp_reprs *reprs;
int err, reify_cnt;
@@ -351,6 +368,15 @@ nfp_flower_spawn_phy_reprs(struct nfp_app *app, struct nfp_flower_priv *priv)
goto err_reprs_clean;
}
+ repr_priv = kzalloc(sizeof(*repr_priv), GFP_KERNEL);
+ if (!repr_priv) {
+ err = -ENOMEM;
+ goto err_reprs_clean;
+ }
+
+ nfp_repr = netdev_priv(repr);
+ nfp_repr->app_priv = repr_priv;
+
port = nfp_port_alloc(app, NFP_PORT_PHYS_PORT, repr);
if (IS_ERR(port)) {
err = PTR_ERR(port);
diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.h b/drivers/net/ethernet/netronome/nfp/flower/main.h
index 6e82aa4ed84b..7ce255705446 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.h
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.h
@@ -160,6 +160,14 @@ struct nfp_flower_priv {
struct nfp_mtu_conf mtu_conf;
};
+/**
+ * struct nfp_flower_repr_priv - Flower APP per-repr priv data
+ * @lag_port_flags: Extended port flags to record lag state of repr
+ */
+struct nfp_flower_repr_priv {
+ unsigned long lag_port_flags;
+};
+
struct nfp_fl_key_ls {
u32 key_layer_two;
u8 key_layer;
--
2.17.0
^ permalink raw reply related
* [PATCH net-next 3/8] nfp: flower: check for/turn on LAG support in firmware
From: Jakub Kicinski @ 2018-05-24 2:22 UTC (permalink / raw)
To: davem; +Cc: netdev, oss-drivers, John Hurley
In-Reply-To: <20180524022255.18548-1-jakub.kicinski@netronome.com>
From: John Hurley <john.hurley@netronome.com>
Check if the fw contains the _abi_flower_balance_sync_enable symbol. If it
does then write a 1 to this indicating that the driver is willing to
receive NIC to kernel LAG related control messages.
If the write is successful, update the list of extra features supported by
the fw and add a stub to accept LAG cmsgs.
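The three outcomes of the symbol write can be sketched in isolation as below. This is only an illustration: ex_enable_lag and EX_FEATS_LAG are names made up for the example, and the write result is passed in as a plain integer instead of calling into the nfp core.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical feature bit, standing in for the driver's LAG feature flag. */
#define EX_FEATS_LAG	(1u << 31)

/*
 * Decide what to do with the result of the symbol write.
 * Returns 0 if init should continue, or a negative errno to abort.
 */
static int ex_enable_lag(int write_err, uint32_t *ext_feats)
{
	if (!write_err) {
		/* Firmware accepted the write: LAG cmsgs may now arrive. */
		*ext_feats |= EX_FEATS_LAG;
		return 0;
	}

	/* Symbol missing: firmware has no LAG support, carry on without it. */
	if (write_err == -ENOENT)
		return 0;

	/* Any other failure is treated as fatal for init. */
	return write_err;
}
```

Only a missing symbol is tolerated; every other error aborts initialisation so that the driver and firmware do not disagree about whether LAG control messages are expected.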
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
drivers/net/ethernet/netronome/nfp/flower/cmsg.c | 5 +++++
drivers/net/ethernet/netronome/nfp/flower/cmsg.h | 1 +
drivers/net/ethernet/netronome/nfp/flower/main.c | 12 ++++++++++++
drivers/net/ethernet/netronome/nfp/flower/main.h | 1 +
4 files changed, 19 insertions(+)
diff --git a/drivers/net/ethernet/netronome/nfp/flower/cmsg.c b/drivers/net/ethernet/netronome/nfp/flower/cmsg.c
index 577659f332e4..03aae2ed9983 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/cmsg.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/cmsg.c
@@ -239,6 +239,7 @@ nfp_flower_cmsg_portreify_rx(struct nfp_app *app, struct sk_buff *skb)
static void
nfp_flower_cmsg_process_one_rx(struct nfp_app *app, struct sk_buff *skb)
{
+ struct nfp_flower_priv *app_priv = app->priv;
struct nfp_flower_cmsg_hdr *cmsg_hdr;
enum nfp_flower_cmsg_type_port type;
@@ -258,6 +259,10 @@ nfp_flower_cmsg_process_one_rx(struct nfp_app *app, struct sk_buff *skb)
case NFP_FLOWER_CMSG_TYPE_ACTIVE_TUNS:
nfp_tunnel_keep_alive(app, skb);
break;
+ case NFP_FLOWER_CMSG_TYPE_LAG_CONFIG:
+ if (app_priv->flower_ext_feats & NFP_FL_FEATS_LAG)
+ break;
+ /* fall through */
default:
nfp_flower_cmsg_warn(app, "Cannot handle invalid repr control type %u\n",
type);
diff --git a/drivers/net/ethernet/netronome/nfp/flower/cmsg.h b/drivers/net/ethernet/netronome/nfp/flower/cmsg.h
index bee4367a2c38..3a42a1fc55cb 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/cmsg.h
+++ b/drivers/net/ethernet/netronome/nfp/flower/cmsg.h
@@ -366,6 +366,7 @@ struct nfp_flower_cmsg_hdr {
enum nfp_flower_cmsg_type_port {
NFP_FLOWER_CMSG_TYPE_FLOW_ADD = 0,
NFP_FLOWER_CMSG_TYPE_FLOW_DEL = 2,
+ NFP_FLOWER_CMSG_TYPE_LAG_CONFIG = 4,
NFP_FLOWER_CMSG_TYPE_PORT_REIFY = 6,
NFP_FLOWER_CMSG_TYPE_MAC_REPR = 7,
NFP_FLOWER_CMSG_TYPE_PORT_MOD = 8,
diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.c b/drivers/net/ethernet/netronome/nfp/flower/main.c
index 4e67c0cbf9f0..1910c3e2b3e5 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.c
@@ -546,8 +546,20 @@ static int nfp_flower_init(struct nfp_app *app)
else
app_priv->flower_ext_feats = features;
+ /* Tell the firmware that the driver supports lag. */
+ err = nfp_rtsym_write_le(app->pf->rtbl,
+ "_abi_flower_balance_sync_enable", 1);
+ if (!err)
+ app_priv->flower_ext_feats |= NFP_FL_FEATS_LAG;
+ else if (err == -ENOENT)
+ nfp_warn(app->cpp, "LAG not supported by FW.\n");
+ else
+ goto err_cleanup_metadata;
+
return 0;
+err_cleanup_metadata:
+ nfp_flower_metadata_cleanup(app);
err_free_app_priv:
vfree(app->priv);
return err;
diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.h b/drivers/net/ethernet/netronome/nfp/flower/main.h
index 733ff53cc601..6e82aa4ed84b 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.h
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.h
@@ -67,6 +67,7 @@ struct nfp_app;
/* Extra features bitmap. */
#define NFP_FL_FEATS_GENEVE BIT(0)
#define NFP_FL_NBI_MTU_SETTING BIT(1)
+#define NFP_FL_FEATS_LAG BIT(31)
struct nfp_fl_mask_id {
struct circ_buf mask_id_free_list;
--
2.17.0
^ permalink raw reply related
* [PATCH net-next 2/8] nfp: nfpcore: add rtsym writing function
From: Jakub Kicinski @ 2018-05-24 2:22 UTC (permalink / raw)
To: davem; +Cc: netdev, oss-drivers, John Hurley
In-Reply-To: <20180524022255.18548-1-jakub.kicinski@netronome.com>
From: John Hurley <john.hurley@netronome.com>
Add an rtsym API function that combines the lookup of a symbol and the
writing of a value to it. Values can be written as unsigned 32 or 64 bits.
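To make the size handling concrete, here is a small userspace sketch of the same idea: a 64-bit value is stored little-endian into a 4 or 8 byte symbol, and any other size is rejected. The function ex_write_le and its byte-buffer destination are assumptions for this example; the real function writes through the CPP bus, not into memory.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Store 'value' little-endian into a symbol of sym_size bytes.
 * For a 4 byte symbol only the lower 32 bits of 'value' are used.
 */
static int ex_write_le(uint8_t *dst, size_t sym_size, uint64_t value)
{
	size_t i;

	if (sym_size != 4 && sym_size != 8)
		return -EINVAL;

	/* Least significant byte goes first. */
	for (i = 0; i < sym_size; i++)
		dst[i] = (uint8_t)(value >> (8 * i));

	return 0;
}
```

Taking a u64 for both widths keeps the API to a single entry point; the caller does not need to know the symbol size in advance.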
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
.../ethernet/netronome/nfp/nfpcore/nfp_nffw.h | 2 +
.../netronome/nfp/nfpcore/nfp_rtsym.c | 43 +++++++++++++++++++
2 files changed, 45 insertions(+)
diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nffw.h b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nffw.h
index c9724fb7ea4b..df599d5b6bb3 100644
--- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nffw.h
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nffw.h
@@ -100,6 +100,8 @@ nfp_rtsym_lookup(struct nfp_rtsym_table *rtbl, const char *name);
u64 nfp_rtsym_read_le(struct nfp_rtsym_table *rtbl, const char *name,
int *error);
+int nfp_rtsym_write_le(struct nfp_rtsym_table *rtbl, const char *name,
+ u64 value);
u8 __iomem *
nfp_rtsym_map(struct nfp_rtsym_table *rtbl, const char *name, const char *id,
unsigned int min_size, struct nfp_cpp_area **area);
diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_rtsym.c b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_rtsym.c
index 46107aefad1c..9e34216578da 100644
--- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_rtsym.c
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_rtsym.c
@@ -286,6 +286,49 @@ u64 nfp_rtsym_read_le(struct nfp_rtsym_table *rtbl, const char *name,
return val;
}
+/**
+ * nfp_rtsym_write_le() - Write an unsigned scalar value to a symbol
+ * @rtbl: NFP RTsym table
+ * @name: Symbol name
+ * @value: Value to write
+ *
+ * Lookup a symbol and write a value to it. Symbol can be 4 or 8 bytes in size.
+ * If 4 bytes then the lower 32-bits of 'value' are used. Value will be
+ * written as simple little-endian unsigned value.
+ *
+ * Return: 0 on success or error code.
+ */
+int nfp_rtsym_write_le(struct nfp_rtsym_table *rtbl, const char *name,
+ u64 value)
+{
+ const struct nfp_rtsym *sym;
+ int err;
+ u32 id;
+
+ sym = nfp_rtsym_lookup(rtbl, name);
+ if (!sym)
+ return -ENOENT;
+
+ id = NFP_CPP_ISLAND_ID(sym->target, NFP_CPP_ACTION_RW, 0, sym->domain);
+
+ switch (sym->size) {
+ case 4:
+ err = nfp_cpp_writel(rtbl->cpp, id, sym->addr, value);
+ break;
+ case 8:
+ err = nfp_cpp_writeq(rtbl->cpp, id, sym->addr, value);
+ break;
+ default:
+ nfp_err(rtbl->cpp,
+ "rtsym '%s' unsupported or non-scalar size: %lld\n",
+ name, sym->size);
+ err = -EINVAL;
+ break;
+ }
+
+ return err;
+}
+
u8 __iomem *
nfp_rtsym_map(struct nfp_rtsym_table *rtbl, const char *name, const char *id,
unsigned int min_size, struct nfp_cpp_area **area)
--
2.17.0
^ permalink raw reply related
* [PATCH net-next 1/8] nfp: add ndo_set_mac_address for representors
From: Jakub Kicinski @ 2018-05-24 2:22 UTC (permalink / raw)
To: davem; +Cc: netdev, oss-drivers, John Hurley
In-Reply-To: <20180524022255.18548-1-jakub.kicinski@netronome.com>
From: John Hurley <john.hurley@netronome.com>
Adding a netdev to a bond requires that its MAC address can be modified.
The default eth_mac_addr is sufficient to satisfy this requirement.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
drivers/net/ethernet/netronome/nfp/nfp_net_repr.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_repr.c b/drivers/net/ethernet/netronome/nfp/nfp_net_repr.c
index 09e87d5f4f72..117eca6819de 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_repr.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_repr.c
@@ -277,6 +277,7 @@ const struct net_device_ops nfp_repr_netdev_ops = {
.ndo_get_vf_config = nfp_app_get_vf_config,
.ndo_set_vf_link_state = nfp_app_set_vf_link_state,
.ndo_set_features = nfp_port_set_features,
+ .ndo_set_mac_address = eth_mac_addr,
};
static void nfp_repr_clean(struct nfp_repr *repr)
--
2.17.0
^ permalink raw reply related
* [PATCH net-next 0/8] nfp: offload LAG for tc flower egress
From: Jakub Kicinski @ 2018-05-24 2:22 UTC (permalink / raw)
To: davem
Cc: netdev, oss-drivers, Jakub Kicinski, Jiri Pirko, Jay Vosburgh,
Veaceslav Falico, Andy Gospodarek
Hi!
This series from John adds bond offload to the nfp driver. Patch 5
exposes the hash type for NETDEV_LAG_TX_TYPE_HASH to make sure nfp
hashing matches that of the software LAG. This may be unnecessarily
conservative, let's see what LAG maintainers think :)
John says:
This patchset sets up the infrastructure and offloads output actions for
when a TC flower rule attempts to egress a packet to a LAG port.
Firstly it adds some of the infrastructure required to the flower app and
to the nfp core. This includes the ability to change the MAC address of a
repr, a function for combining lookup and write to a FW symbol, and the
addition of private data to a repr on a per app basis.
Patch 6 continues by implementing notifiers that track Linux bonds and
communicate to the FW those which enslave reprs, along with the current
state of reprs within the bond.
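The group configs in patch 6 are sent to the FW in batches. A simplified sketch of how the control flags for each message in a batch could be derived is given below. The flag bit values and the ex_ names are assumptions for illustration, and the batch version bump done on the first message is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Position of a config message within a batch. */
enum ex_batch {
	EX_BATCH_FIRST,
	EX_BATCH_MEMBER,
	EX_BATCH_FINISHED,
};

/* Hypothetical control flag bits. */
#define EX_FLAG_LAST	(1u << 1)
#define EX_FLAG_FIRST	(1u << 2)
#define EX_FLAG_SWITCH	(1u << 6)
#define EX_FLAG_RESET	(1u << 7)

/* Compute the flags for the next message and advance the batch state. */
static uint8_t ex_batch_flags(enum ex_batch *batch, bool reset)
{
	uint8_t flags = 0;

	/* First message of a new batch. */
	if (*batch == EX_BATCH_FIRST) {
		flags |= EX_FLAG_FIRST;
		*batch = EX_BATCH_MEMBER;
	}

	/* A reset message also ends the batch. */
	if (reset) {
		flags |= EX_FLAG_RESET;
		*batch = EX_BATCH_FINISHED;
	}

	/* End of batch: ask the FW to switch to the new config. */
	if (*batch == EX_BATCH_FINISHED)
		flags |= EX_FLAG_SWITCH | EX_FLAG_LAST;

	return flags;
}
```

A reset issued at start-up is therefore a one-message batch carrying the first, reset, switch and last flags together.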
Patch 7 ensures bonds are synchronised with FW by receiving and acting
upon cmsgs sent to the kernel. These may request that a bond message is
retransmitted when FW can process it, or may request a full sync of the
bonds defined in the kernel.
Patch 8 offloads a flower action when that action requires egressing to a
pre-defined Linux bond.
John Hurley (8):
nfp: add ndo_set_mac_address for representors
nfp: nfpcore: add rtsym writing function
nfp: flower: check for/turn on LAG support in firmware
nfp: flower: add per repr private data for LAG offload
net: include hash policy in LAG changeupper info
nfp: flower: monitor and offload LAG groups
nfp: flower: implement host cmsg handler for LAG
nfp: flower: compute link aggregation action
drivers/net/bonding/bond_main.c | 27 +-
drivers/net/ethernet/netronome/nfp/Makefile | 1 +
.../ethernet/netronome/nfp/flower/action.c | 131 +++-
.../net/ethernet/netronome/nfp/flower/cmsg.c | 11 +-
.../net/ethernet/netronome/nfp/flower/cmsg.h | 14 +
.../ethernet/netronome/nfp/flower/lag_conf.c | 726 ++++++++++++++++++
.../net/ethernet/netronome/nfp/flower/main.c | 61 ++
.../net/ethernet/netronome/nfp/flower/main.h | 52 +-
.../ethernet/netronome/nfp/flower/offload.c | 2 +-
.../net/ethernet/netronome/nfp/nfp_net_repr.c | 1 +
.../ethernet/netronome/nfp/nfpcore/nfp_nffw.h | 2 +
.../netronome/nfp/nfpcore/nfp_rtsym.c | 43 ++
drivers/net/team/team.c | 1 +
include/linux/netdevice.h | 11 +
14 files changed, 1053 insertions(+), 30 deletions(-)
create mode 100644 drivers/net/ethernet/netronome/nfp/flower/lag_conf.c
---
CC: Jiri Pirko <jiri@resnulli.us>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
^ permalink raw reply
* Re: [PATCH v3 net-next 0/2] bpfilter
From: Alexei Starovoitov @ 2018-05-24 2:09 UTC (permalink / raw)
To: Jakub Kicinski, Alexei Starovoitov
Cc: David S . Miller, daniel, torvalds, gregkh, luto, mcgrof,
keescook, netdev, linux-kernel, kernel-team
In-Reply-To: <20180523185010.0490e7ec@cakuba>
On 5/23/18 6:50 PM, Jakub Kicinski wrote:
> On Wed, 23 May 2018 18:33:52 -0700, Jakub Kicinski wrote:
>> Minor glitch with Ubuntu 18.04:
>>
>> $ gcc --version
>> gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0
>>
>> In file included from /usr/include/fcntl.h:290:0,
>> from ../net/bpfilter/main.c:7:
>> In function ‘open’,
>> inlined from ‘main’ at ../net/bpfilter/main.c:58:13:
>> /usr/include/x86_64-linux-gnu/bits/fcntl2.h:50:4: error: call to ‘__open_missing_mode’ declared with attribute error: open with O_CREAT or O_TMPFILE in second argument needs 3 arguments
>> __open_missing_mode ();
>> ^~~~~~~~~~~~~~~~~~~~~~
>> scripts/Makefile.host:107: recipe for target 'net/bpfilter/main.o' failed
>> make[3]: *** [net/bpfilter/main.o] Error 1
>>
>> I can't repro on Fedora 27 gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5),
>> perhaps the GCC is broken on that Ubuntu 18.04 box of mine. The warning/
>> /error, however, looks potentially legit?
>
> More?
>
> Kernel: arch/x86/boot/bzImage is ready (#9)
> Building modules, stage 2.
> MODPOST 1620 modules
> ERROR: "bpfilter_process_sockopt" [net/bpfilter/bpfilter.ko] undefined!
> ../scripts/Makefile.modpost:92: recipe for target '__modpost' failed
Hmm, how come the buildbot didn't yell at me about any of these?
I will take a look soon.
^ permalink raw reply
* Re: [PATCH v3 net-next 0/2] bpfilter
From: Jakub Kicinski @ 2018-05-24 1:50 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: David S . Miller, daniel, torvalds, gregkh, luto, mcgrof,
keescook, netdev, linux-kernel, kernel-team
In-Reply-To: <20180523183352.7ccc3f5d@cakuba>
[-- Attachment #1: Type: text/plain, Size: 1488 bytes --]
On Wed, 23 May 2018 18:33:52 -0700, Jakub Kicinski wrote:
> Minor glitch with Ubuntu 18.04:
>
> $ gcc --version
> gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0
>
> In file included from /usr/include/fcntl.h:290:0,
> from ../net/bpfilter/main.c:7:
> In function ‘open’,
> inlined from ‘main’ at ../net/bpfilter/main.c:58:13:
> /usr/include/x86_64-linux-gnu/bits/fcntl2.h:50:4: error: call to ‘__open_missing_mode’ declared with attribute error: open with O_CREAT or O_TMPFILE in second argument needs 3 arguments
> __open_missing_mode ();
> ^~~~~~~~~~~~~~~~~~~~~~
> scripts/Makefile.host:107: recipe for target 'net/bpfilter/main.o' failed
> make[3]: *** [net/bpfilter/main.o] Error 1
>
> I can't repro on Fedora 27 gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5),
> perhaps the GCC is broken on that Ubuntu 18.04 box of mine. The
> warning/error, however, looks potentially legit?
More?
Kernel: arch/x86/boot/bzImage is ready (#9)
Building modules, stage 2.
MODPOST 1620 modules
ERROR: "bpfilter_process_sockopt" [net/bpfilter/bpfilter.ko] undefined!
../scripts/Makefile.modpost:92: recipe for target '__modpost' failed
make[2]: *** [__modpost] Error 1
/home/jkicinski/devel/linux/Makefile:1274: recipe for target 'modules' failed
make[1]: *** [modules] Error 2
make[1]: Leaving directory '/home/jkicinski/devel/linux/build_randconfig'
Makefile:146: recipe for target 'sub-make' failed
make: *** [sub-make] Error 2
[-- Attachment #2: bpfitlter_config.bz2 --]
[-- Type: application/x-bzip, Size: 30187 bytes --]
^ permalink raw reply
* Re: [PATCH v3 net-next 0/2] bpfilter
From: Jakub Kicinski @ 2018-05-24 1:33 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: David S . Miller, daniel, torvalds, gregkh, luto, mcgrof,
keescook, netdev, linux-kernel, kernel-team
In-Reply-To: <20180522022230.2492505-1-ast@kernel.org>
Minor glitch with Ubuntu 18.04:
$ gcc --version
gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0
In file included from /usr/include/fcntl.h:290:0,
from ../net/bpfilter/main.c:7:
In function ‘open’,
inlined from ‘main’ at ../net/bpfilter/main.c:58:13:
/usr/include/x86_64-linux-gnu/bits/fcntl2.h:50:4: error: call to ‘__open_missing_mode’ declared with attribute error: open with O_CREAT or O_TMPFILE in second argument needs 3 arguments
__open_missing_mode ();
^~~~~~~~~~~~~~~~~~~~~~
scripts/Makefile.host:107: recipe for target 'net/bpfilter/main.o' failed
make[3]: *** [net/bpfilter/main.o] Error 1
I can't repro on Fedora 27 gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5),
perhaps the GCC is broken on that Ubuntu 18.04 box of mine. The
warning/error, however, looks potentially legit?
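For reference, the fortify check quoted above fires because open() only takes its third (mode) argument into account when O_CREAT or O_TMPFILE is set, so leaving it out hands over undefined permission bits. A minimal userspace sketch of a conforming call follows. The helper name open_log_file and the mode 0644 are illustrative assumptions, not what net/bpfilter/main.c actually does.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create (or open) a file for writing.
 * Because O_CREAT is in the flags, the third "mode" argument is
 * mandatory; without it the permission bits of a newly created file
 * would come from whatever happens to sit in that argument slot.
 */
static int open_log_file(const char *path)
{
	return open(path, O_WRONLY | O_CREAT, 0644);
}
```

When the flags contain neither O_CREAT nor O_TMPFILE, the two-argument form of open() is fine and the check stays silent.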
^ permalink raw reply
* [PATCH net v2] enic: set DMA mask to 47 bit
From: Govindarajulu Varadarajan @ 2018-05-23 18:17 UTC (permalink / raw)
To: davem, netdev; +Cc: benve, Govindarajulu Varadarajan
In commit 624dbf55a359b ("driver/net: enic: Try DMA 64 first, then
failover to DMA") the DMA mask was changed from 40 bits to 64 bits.
The hardware actually supports only 47 bits.
Fixes: 624dbf55a359b ("driver/net: enic: Try DMA 64 first, then failover to DMA")
Signed-off-by: Govindarajulu Varadarajan <gvaradar@cisco.com>
---
v2:
* rebase to net
* Fix space and single line for "Fixes:"
drivers/net/ethernet/cisco/enic/enic_main.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/cisco/enic/enic_main.c b/drivers/net/ethernet/cisco/enic/enic_main.c
index 81684acf52af..8a8b12b720ef 100644
--- a/drivers/net/ethernet/cisco/enic/enic_main.c
+++ b/drivers/net/ethernet/cisco/enic/enic_main.c
@@ -2747,11 +2747,11 @@ static int enic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
pci_set_master(pdev);
/* Query PCI controller on system for DMA addressing
- * limitation for the device. Try 64-bit first, and
+ * limitation for the device. Try 47-bit first, and
* fail to 32-bit.
*/
- err = pci_set_dma_mask(pdev, DMA_BIT_MASK(64));
+ err = pci_set_dma_mask(pdev, DMA_BIT_MASK(47));
if (err) {
err = pci_set_dma_mask(pdev, DMA_BIT_MASK(32));
if (err) {
@@ -2765,10 +2765,10 @@ static int enic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
goto err_out_release_regions;
}
} else {
- err = pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(64));
+ err = pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(47));
if (err) {
dev_err(dev, "Unable to obtain %u-bit DMA "
- "for consistent allocations, aborting\n", 64);
+ "for consistent allocations, aborting\n", 47);
goto err_out_release_regions;
}
using_dac = 1;
--
2.17.0
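To illustrate what the narrower mask means, here is a small userspace sketch (not driver code). dma_bit_mask() mirrors the kernel's DMA_BIT_MASK() macro; dma_addr_reachable() is a hypothetical helper added only for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace mirror of the kernel's DMA_BIT_MASK(n): a value with the
 * low n bits set (all 64 bits when n == 64).
 */
static uint64_t dma_bit_mask(unsigned int n)
{
	return n == 64 ? ~0ULL : (1ULL << n) - 1;
}

/* Hypothetical helper: can a device limited to n address bits
 * reach the bus address addr?
 */
static bool dma_addr_reachable(uint64_t addr, unsigned int n)
{
	return (addr & ~dma_bit_mask(n)) == 0;
}
```

With a 64-bit mask, an address with bit 47 set would be considered acceptable even though the hardware cannot drive it; with the 47-bit mask such an address is out of range.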
^ permalink raw reply related
* Re: [PATCH RFC net-next 00/11] udp gso
From: Willem de Bruijn @ 2018-05-24 1:15 UTC (permalink / raw)
To: Marcelo Ricardo Leitner
Cc: Paolo Abeni, Network Development, Willem de Bruijn
In-Reply-To: <20180524000230.GP5488@localhost.localdomain>
On Wed, May 23, 2018 at 8:02 PM, Marcelo Ricardo Leitner
<marcelo.leitner@gmail.com> wrote:
> On Wed, Apr 18, 2018 at 09:49:18AM -0400, Willem de Bruijn wrote:
>> I just hacked up a sendmmsg extension to the benchmark to verify.
>> Indeed that does not have nearly the same benefit as GSO:
>>
>> udp tx: 976 MB/s 695394 calls/s 16557 msg/s
>>
>> This matches the numbers seen from TCP without TSO and GSO.
>> That also has few system calls, but observes per MTU stack traversal.
>
> Reviving this old thread because it's the only place I saw sendmmsg
> being mentioned.
>
> sendmmsg shouldn't be considered as an alternative, but rather as a
> complement. Then instead of the application building one large request
> and request the stack to fragment it, it could simply build the
> sendmmsg request and the stack would group the mmsg into a gso skb. It
> seems more natural to the application. But well, both (sendmmsg and
> the option to fragment) are Linux-specific..
>
> For that we need sendmmsg to do something smarter than doing several
> sendmsg calls, yes.
I agree. See also my original point:
"An alternative implementation that would allow non-uniform
segment length is to use GSO_BY_FRAGS like SCTP. This would
likely require MSG_MORE to build the list using multiple
send calls (or one sendmmsg). The two approaches are not
mutually-exclusive, so that could be a follow-up."
Clear advantages of GSO_BY_FRAGS are that segments do
not have to be of equal length and that converting existing users
of sendmmsg is trivial.
On the other hand, this is less likely to be offloaded to hardware,
as it requires non-constant metadata in the descriptor.
Both cases also potentially apply to the GRO path to allow for
efficient forwarding. And to the udp socket rx layer to allow for
queuing batches of datagrams at a time, then carving off one
gso_size per recvmsg.
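As a rough illustration of the userspace side of the sendmmsg approach discussed above, the sketch below submits several datagrams of different lengths in one call. The helper name send_batch, the payloads, and the batch size are assumptions made for this example; it says nothing about how the kernel would group the messages into a GSO skb.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH 3

/* Hypothetical helper: send BATCH datagrams of differing length with a
 * single sendmmsg() call. "dst" may be NULL (with dstlen 0) for a
 * connected socket. Returns the number of messages sent, or -1.
 */
static int send_batch(int fd, const struct sockaddr *dst, socklen_t dstlen)
{
	static char payload[BATCH][16] = { "one", "three", "seventeen" };
	struct iovec iov[BATCH];
	struct mmsghdr msgs[BATCH];
	int i;

	memset(msgs, 0, sizeof(msgs));
	for (i = 0; i < BATCH; i++) {
		/* One iovec per datagram; lengths are non-uniform. */
		iov[i].iov_base = payload[i];
		iov[i].iov_len = strlen(payload[i]);
		msgs[i].msg_hdr.msg_name = (void *)dst;
		msgs[i].msg_hdr.msg_namelen = dstlen;
		msgs[i].msg_hdr.msg_iov = &iov[i];
		msgs[i].msg_hdr.msg_iovlen = 1;
	}
	return sendmmsg(fd, msgs, BATCH, 0);
}
```

For UDP, the caller would pass a struct sockaddr_in destination, or connect() the socket first and pass NULL. Each entry remains its own datagram, which is why an application already using sendmmsg would need little change if the stack learned to batch them.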
^ permalink raw reply
* [PATCH net-next] hv_netvsc: fix bogus ifalias on network device
From: Stephen Hemminger @ 2018-05-24 1:02 UTC (permalink / raw)
To: netdev; +Cc: Stephen Hemminger
If the guest network adapter is not configured with DeviceNaming
enabled on the host, then the query for the friendly name returns
success but with a zero-length name, which then leads to a garbage
value (stack contents) for ifalias.
The fix is simple: don't set the name if the host doesn't return one.
Fixes: 0fe554a46a0f ("hv_netvsc: propogate Hyper-V friendly name into interface alias")
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
---
drivers/net/hyperv/rndis_filter.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c
index 7f3dab4b4cbc..5428bb261102 100644
--- a/drivers/net/hyperv/rndis_filter.c
+++ b/drivers/net/hyperv/rndis_filter.c
@@ -1237,7 +1237,10 @@ static void rndis_get_friendly_name(struct net_device *net,
if (rndis_filter_query_device(rndis_device, net_device,
RNDIS_OID_GEN_FRIENDLY_NAME,
wname, &size) != 0)
- return;
+ return; /* ignore if host does not support */
+
+ if (size == 0)
+ return; /* name not set */
/* Convert Windows Unicode string to UTF-8 */
len = ucs2_as_utf8(ifalias, wname, sizeof(ifalias));
--
2.17.0
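The pattern behind this fix can be sketched in plain userspace C. Everything below is a hypothetical stand-in (query_name, get_friendly_name, the buffer sizes); it only mirrors the idea that a successful query with size 0 must not be treated as having filled the buffer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the host query: returns 0 on success and
 * reports in *size how many bytes it wrote. A host without a
 * configured name succeeds but writes nothing.
 */
static int query_name(const char *host_name, char *buf, size_t *size)
{
	size_t n = host_name ? strlen(host_name) : 0;

	if (n > *size)
		return -1;
	if (n)
		memcpy(buf, host_name, n);
	*size = n;
	return 0;
}

/* Returns 1 if "alias" was set, 0 if it was left alone.
 * alias_len is assumed to be at least 1.
 */
static int get_friendly_name(const char *host_name, char *alias,
			     size_t alias_len)
{
	char raw[64];	/* not initialised, like the on-stack buffer */
	size_t size = sizeof(raw);

	if (query_name(host_name, raw, &size) != 0)
		return 0;	/* query failed: ignore */
	if (size == 0)
		return 0;	/* success but no name: raw[] is garbage */
	if (size >= alias_len)
		size = alias_len - 1;
	memcpy(alias, raw, size);
	alias[size] = '\0';
	return 1;
}
```

Without the size == 0 check, the conversion step would read from raw[] even though nothing was ever written to it.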
^ permalink raw reply related
* [PATCH bpf] bpf: properly enforce index mask to prevent out-of-bounds speculation
From: Daniel Borkmann @ 2018-05-24 0:32 UTC (permalink / raw)
To: alexei.starovoitov; +Cc: netdev, Daniel Borkmann
While reviewing the verifier code, I recently noticed that the
following two program variants in relation to tail calls can be
loaded.
Variant 1:
# bpftool p d x i 15
0: (15) if r1 == 0x0 goto pc+3
1: (18) r2 = map[id:5]
3: (05) goto pc+2
4: (18) r2 = map[id:6]
6: (b7) r3 = 7
7: (35) if r3 >= 0xa0 goto pc+2
8: (54) (u32) r3 &= (u32) 255
9: (85) call bpf_tail_call#12
10: (b7) r0 = 1
11: (95) exit
# bpftool m s i 5
5: prog_array flags 0x0
key 4B value 4B max_entries 4 memlock 4096B
# bpftool m s i 6
6: prog_array flags 0x0
key 4B value 4B max_entries 160 memlock 4096B
Variant 2:
# bpftool p d x i 20
0: (15) if r1 == 0x0 goto pc+3
1: (18) r2 = map[id:8]
3: (05) goto pc+2
4: (18) r2 = map[id:7]
6: (b7) r3 = 7
7: (35) if r3 >= 0x4 goto pc+2
8: (54) (u32) r3 &= (u32) 3
9: (85) call bpf_tail_call#12
10: (b7) r0 = 1
11: (95) exit
# bpftool m s i 8
8: prog_array flags 0x0
key 4B value 4B max_entries 160 memlock 4096B
# bpftool m s i 7
7: prog_array flags 0x0
key 4B value 4B max_entries 4 memlock 4096B
In both cases the index masking inserted by the verifier in order
to control out-of-bounds speculation from a CPU via b2157399cc98
("bpf: prevent out-of-bounds speculation") seems to be incorrect
in what it is enforcing. In the 1st variant, the mask is taken
from the map with the significantly larger number of entries, so
we allow a certain degree of out-of-bounds speculation for the
smaller map. In the 2nd variant, the mask is taken from the map
with the smaller number of entries, so we get buggy behavior
since we truncate the index of the larger map.
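The two failure modes can be reproduced with plain arithmetic. The sketch below assumes the mask is derived as the power-of-two ceiling of max_entries minus one, which matches the 255 and 3 seen in the dumps above for 160 and 4 entries; the helper names are made up for this example.

```c
#include <stdint.h>

/* Smallest power of two >= n (for n >= 1); a userspace stand-in for
 * the kernel's roundup_pow_of_two().
 */
static uint32_t roundup_pow2(uint32_t n)
{
	uint32_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Assumed derivation of a map's index mask from its max_entries. */
static uint32_t index_mask(uint32_t max_entries)
{
	return roundup_pow2(max_entries) - 1;
}

/* Index that remains after "index &= mask" when the mask comes from
 * a map with mask_entries entries.
 */
static uint32_t masked(uint32_t index, uint32_t mask_entries)
{
	return index & index_mask(mask_entries);
}
```

Applying the 160-entry mask to the 4-entry map leaves index 7 untouched, which is past the end of that map (variant 1). Applying the 4-entry mask to the 160-entry map turns a valid index such as 100 into 0 (variant 2).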
The original intent from commit b2157399cc98 is to reject such
occasions where two or more different tail call maps are used
in the same tail call helper invocation. However, the check on
the BPF_MAP_PTR_POISON is never hit since we never poisoned the
saved pointer in the first place! We do this explicitly for map
lookups but in case of tail calls we basically used the tail
call map in insn_aux_data that was processed in the most recent
path which the verifier walked. Thus any prior path that stored
a pointer in insn_aux_data at the helper location was always
overridden.
Fix it by moving the map pointer poison logic into a small helper
that covers both BPF helpers with the same logic. After that, the
poison check in fixup_bpf_calls() is hit for tail calls and the
program is rejected. The latter only happens in the unprivileged
case since this is the *only* occasion where a rewrite needs to
happen, and where such a rewrite is specific to the map (max_entries,
index_mask). In the privileged case the rewrite is generic for
the insn->imm / insn->code update, so multiple maps from different
paths can be handled just fine since all the remaining logic
happens in the instruction processing itself.

This is similar to the case of map lookups: if there is a collision
of maps in fixup_bpf_calls(), we must skip the inlined rewrite since
it would turn the generic instruction sequence into a non-generic
one. Thus patch_call_imm simply updates the insn->imm location, and
bpf_map_lookup_elem() later takes care of the dispatch.

Given we need this 'poison' state as a check, the information of
whether a map is an unpriv_array gets lost, so enforcing it prior
to that needs additional state. In general this check is needed
since there are some complex and tail call intensive BPF programs
out there where LLVM tends to generate such code occasionally. We
therefore convert map_ptr into map_state to store all of this
w/o extra memory overhead; the bit recording whether one of the
maps involved in the collision was from an unpriv_array thus needs
to be retained there as well.
Fixes: b2157399cc98 ("bpf: prevent out-of-bounds speculation")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
[ Test cases coming via bpf-next batch of patches to avoid
merge conflicts in test_verifier with bpf. ]
include/linux/bpf_verifier.h | 2 +-
kernel/bpf/verifier.c | 86 ++++++++++++++++++++++++++++++++------------
2 files changed, 65 insertions(+), 23 deletions(-)
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 7e61c39..52fb077 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -142,7 +142,7 @@ struct bpf_verifier_state_list {
struct bpf_insn_aux_data {
union {
enum bpf_reg_type ptr_type; /* pointer type for load/store insns */
- struct bpf_map *map_ptr; /* pointer for call insn into lookup_elem */
+ unsigned long map_state; /* pointer/poison value for maps */
s32 call_imm; /* saved imm field of call insn */
};
int ctx_field_size; /* the ctx field size for load insn, maybe 0 */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5dd1dcb..dcebf3f 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -156,7 +156,29 @@ struct bpf_verifier_stack_elem {
#define BPF_COMPLEXITY_LIMIT_INSNS 131072
#define BPF_COMPLEXITY_LIMIT_STACK 1024
-#define BPF_MAP_PTR_POISON ((void *)0xeB9F + POISON_POINTER_DELTA)
+#define BPF_MAP_PTR_UNPRIV 1UL
+#define BPF_MAP_PTR_POISON ((void *)((0xeB9FUL << 1) + \
+ POISON_POINTER_DELTA))
+#define BPF_MAP_PTR(X) ((struct bpf_map *)((X) & ~BPF_MAP_PTR_UNPRIV))
+
+static bool bpf_map_ptr_poisoned(const struct bpf_insn_aux_data *aux)
+{
+ return BPF_MAP_PTR(aux->map_state) == BPF_MAP_PTR_POISON;
+}
+
+static bool bpf_map_ptr_unpriv(const struct bpf_insn_aux_data *aux)
+{
+ return aux->map_state & BPF_MAP_PTR_UNPRIV;
+}
+
+static void bpf_map_ptr_store(struct bpf_insn_aux_data *aux,
+ const struct bpf_map *map, bool unpriv)
+{
+ BUILD_BUG_ON((unsigned long)BPF_MAP_PTR_POISON & BPF_MAP_PTR_UNPRIV);
+ unpriv |= bpf_map_ptr_unpriv(aux);
+ aux->map_state = (unsigned long)map |
+ (unpriv ? BPF_MAP_PTR_UNPRIV : 0UL);
+}
struct bpf_call_arg_meta {
struct bpf_map *map_ptr;
@@ -2333,6 +2355,29 @@ static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx)
return 0;
}
+static int
+record_func_map(struct bpf_verifier_env *env, struct bpf_call_arg_meta *meta,
+ int func_id, int insn_idx)
+{
+ struct bpf_insn_aux_data *aux = &env->insn_aux_data[insn_idx];
+
+ if (func_id != BPF_FUNC_tail_call &&
+ func_id != BPF_FUNC_map_lookup_elem)
+ return 0;
+ if (meta->map_ptr == NULL) {
+ verbose(env, "kernel subsystem misconfigured verifier\n");
+ return -EINVAL;
+ }
+
+ if (!BPF_MAP_PTR(aux->map_state))
+ bpf_map_ptr_store(aux, meta->map_ptr,
+ meta->map_ptr->unpriv_array);
+ else if (BPF_MAP_PTR(aux->map_state) != meta->map_ptr)
+ bpf_map_ptr_store(aux, BPF_MAP_PTR_POISON,
+ meta->map_ptr->unpriv_array);
+ return 0;
+}
+
static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn_idx)
{
const struct bpf_func_proto *fn = NULL;
@@ -2387,13 +2432,6 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
err = check_func_arg(env, BPF_REG_2, fn->arg2_type, &meta);
if (err)
return err;
- if (func_id == BPF_FUNC_tail_call) {
- if (meta.map_ptr == NULL) {
- verbose(env, "verifier bug\n");
- return -EINVAL;
- }
- env->insn_aux_data[insn_idx].map_ptr = meta.map_ptr;
- }
err = check_func_arg(env, BPF_REG_3, fn->arg3_type, &meta);
if (err)
return err;
@@ -2404,6 +2442,10 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
if (err)
return err;
+ err = record_func_map(env, &meta, func_id, insn_idx);
+ if (err)
+ return err;
+
/* Mark slots with STACK_MISC in case of raw mode, stack offset
* is inferred from register state.
*/
@@ -2428,8 +2470,6 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
} else if (fn->ret_type == RET_VOID) {
regs[BPF_REG_0].type = NOT_INIT;
} else if (fn->ret_type == RET_PTR_TO_MAP_VALUE_OR_NULL) {
- struct bpf_insn_aux_data *insn_aux;
-
regs[BPF_REG_0].type = PTR_TO_MAP_VALUE_OR_NULL;
/* There is no offset yet applied, variable or fixed */
mark_reg_known_zero(env, regs, BPF_REG_0);
@@ -2445,11 +2485,6 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
}
regs[BPF_REG_0].map_ptr = meta.map_ptr;
regs[BPF_REG_0].id = ++env->id_gen;
- insn_aux = &env->insn_aux_data[insn_idx];
- if (!insn_aux->map_ptr)
- insn_aux->map_ptr = meta.map_ptr;
- else if (insn_aux->map_ptr != meta.map_ptr)
- insn_aux->map_ptr = BPF_MAP_PTR_POISON;
} else {
verbose(env, "unknown return type %d of func %s#%d\n",
fn->ret_type, func_id_name(func_id), func_id);
@@ -5417,6 +5452,7 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
struct bpf_insn *insn = prog->insnsi;
const struct bpf_func_proto *fn;
const int insn_cnt = prog->len;
+ struct bpf_insn_aux_data *aux;
struct bpf_insn insn_buf[16];
struct bpf_prog *new_prog;
struct bpf_map *map_ptr;
@@ -5491,19 +5527,22 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
insn->imm = 0;
insn->code = BPF_JMP | BPF_TAIL_CALL;
+ aux = &env->insn_aux_data[i + delta];
+ if (!bpf_map_ptr_unpriv(aux))
+ continue;
+
/* instead of changing every JIT dealing with tail_call
* emit two extra insns:
* if (index >= max_entries) goto out;
* index &= array->index_mask;
* to avoid out-of-bounds cpu speculation
*/
- map_ptr = env->insn_aux_data[i + delta].map_ptr;
- if (map_ptr == BPF_MAP_PTR_POISON) {
+ if (bpf_map_ptr_poisoned(aux)) {
verbose(env, "tail_call abusing map_ptr\n");
return -EINVAL;
}
- if (!map_ptr->unpriv_array)
- continue;
+
+ map_ptr = BPF_MAP_PTR(aux->map_state);
insn_buf[0] = BPF_JMP_IMM(BPF_JGE, BPF_REG_3,
map_ptr->max_entries, 2);
insn_buf[1] = BPF_ALU32_IMM(BPF_AND, BPF_REG_3,
@@ -5527,9 +5566,12 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
*/
if (prog->jit_requested && BITS_PER_LONG == 64 &&
insn->imm == BPF_FUNC_map_lookup_elem) {
- map_ptr = env->insn_aux_data[i + delta].map_ptr;
- if (map_ptr == BPF_MAP_PTR_POISON ||
- !map_ptr->ops->map_gen_lookup)
+ aux = &env->insn_aux_data[i + delta];
+ if (bpf_map_ptr_poisoned(aux))
+ goto patch_call_imm;
+
+ map_ptr = BPF_MAP_PTR(aux->map_state);
+ if (!map_ptr->ops->map_gen_lookup)
goto patch_call_imm;
cnt = map_ptr->ops->map_gen_lookup(map_ptr, insn_buf);
--
2.9.5
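The map_state encoding in the patch above is a low-bit pointer tag. A userspace sketch of the same idea follows; the names (struct map, state_record, TAG_UNPRIV, PTR_POISON) are illustrative and the poison constant is just an even sentinel, so this should be read as an approximation of the technique rather than the verifier code.

```c
#include <stdbool.h>
#include <stdint.h>

#define TAG_UNPRIV ((uintptr_t)1)
/* Even sentinel, so its low bit never collides with the tag bit. */
#define PTR_POISON ((uintptr_t)0xeB9FUL << 1)

struct map { int max_entries; };	/* stand-in; alignment keeps bit 0 clear */

static struct map *state_ptr(uintptr_t state)
{
	return (struct map *)(state & ~TAG_UNPRIV);
}

static bool state_unpriv(uintptr_t state)
{
	return state & TAG_UNPRIV;
}

static bool state_poisoned(uintptr_t state)
{
	return (state & ~TAG_UNPRIV) == PTR_POISON;
}

/* Record map "m" as seen on one verification path. A second, different
 * map poisons the slot; the unpriv bit is sticky across paths.
 */
static uintptr_t state_record(uintptr_t state, const struct map *m, bool unpriv)
{
	uintptr_t ptr = (uintptr_t)m;

	unpriv |= state_unpriv(state);
	if (state_ptr(state) && state_ptr(state) != m)
		ptr = PTR_POISON;
	return ptr | (unpriv ? TAG_UNPRIV : 0);
}
```

Because the tag lives in bit 0 of the stored word, the poisoned state can still report that one of the colliding maps was an unpriv_array, which is the information the fixup pass needs before deciding whether to reject the program.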
^ permalink raw reply related