* Re: [PATCH v6 0/5] PCI: Improve PCIe link status reporting
From: Bjorn Helgaas @ 2018-05-23 21:46 UTC (permalink / raw)
To: Jeff Kirsher, Ganesh Goudar, Michael Chan, Ariel Elior,
David S. Miller
Cc: linux-pci, everest-linux-l2, intel-wired-lan, netdev,
linux-kernel, Tal Gilboa, Tariq Toukan, Jacob Keller,
Jakub Kicinski
In-Reply-To: <152537719056.62474.2571390812509425478.stgit@bhelgaas-glaptop.roam.corp.google.com>
[+to Davem]
On Thu, May 03, 2018 at 03:00:07PM -0500, Bjorn Helgaas wrote:
> This is based on Tal's recent work to unify the approach for reporting PCIe
> link speed/width and whether the device is being limited by a slower
> upstream link.
>
> The new pcie_print_link_status() interface appeared in v4.17-rc1; see
> 9e506a7b5147 ("PCI: Add pcie_print_link_status() to log link speed and
> whether it's limited").
>
> That's a good way to replace use of pcie_get_minimum_link(), which gives
> misleading results when a path contains both a fast, narrow link and a
> slow, wide link: it reports the equivalent of a slow, narrow link.
>
> This series removes the remaining uses of pcie_get_minimum_link() and then
> removes the interface itself. I'd like to merge them all through the PCI
> tree to make the removal easy.
>
> This does change the dmesg reporting of link speeds, and in the ixgbe case,
> it changes the reporting from KERN_WARN level to KERN_INFO. If that's an
> issue, let's talk about it. I'm hoping the reduce code size, improved
> functionality, and consistency across drivers is enough to make this
> worthwhile.
>
> ---
>
> Bjorn Helgaas (5):
> bnx2x: Report PCIe link properties with pcie_print_link_status()
> bnxt_en: Report PCIe link properties with pcie_print_link_status()
> cxgb4: Report PCIe link properties with pcie_print_link_status()
> ixgbe: Report PCIe link properties with pcie_print_link_status()
> PCI: Remove unused pcie_get_minimum_link()
>
>
> drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 23 ++-----
> drivers/net/ethernet/broadcom/bnxt/bnxt.c | 19 ------
> drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 75 ----------------------
> drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 47 --------------
> drivers/pci/pci.c | 43 -------------
> include/linux/pci.h | 2 -
> 6 files changed, 9 insertions(+), 200 deletions(-)
I applied all of these on pci/enumeration for v4.18. If you'd rather take
them, Dave, let me know and I'll drop them.
I solicited more acks, but only heard from Jeff.
^ permalink raw reply
* [PATCH v2] ppp: remove the PPPIOCDETACH ioctl
From: Eric Biggers @ 2018-05-23 21:37 UTC (permalink / raw)
To: linux-ppp, Paul Mackerras
Cc: netdev, linux-fsdevel, linux-kernel, Guillaume Nault,
syzkaller-bugs, Eric Biggers
In-Reply-To: <20180523035952.25768-1-ebiggers3@gmail.com>
From: Eric Biggers <ebiggers@google.com>
The PPPIOCDETACH ioctl effectively tries to "close" the given ppp file
before f_count has reached 0, which is fundamentally a bad idea. It
does check 'f_count < 2', which excludes concurrent operations on the
file since they would only be possible with a shared fd table, in which
case each fdget() would take a file reference. However, it fails to
account for the fact that even with 'f_count == 1' the file can still be
linked into epoll instances. As reported by syzbot, this can trivially
be used to cause a use-after-free.
Yet, the only known user of PPPIOCDETACH is pppd versions older than
ppp-2.4.2, which was released almost 15 years ago (November 2003).
Also, PPPIOCDETACH apparently stopped working reliably at around the
same time, when the f_count check was added to the kernel, e.g. see
https://lkml.org/lkml/2002/12/31/83. Also, the current 'f_count < 2'
check makes PPPIOCDETACH only work in single-threaded applications; it
always fails if called from a multithreaded application.
All pppd versions released in the last 15 years just close() the file
descriptor instead.
Therefore, instead of hacking around this bug by exporting epoll
internals to modules, and probably missing other related bugs, just
remove the PPPIOCDETACH ioctl and see if anyone actually notices. Leave
a stub in place that prints a one-time warning and returns EINVAL.
Reported-by: syzbot+16363c99d4134717c05b@syzkaller.appspotmail.com
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
v2: leave a stub in place, rather than removing the ioctl completely.
Documentation/networking/ppp_generic.txt | 6 ------
drivers/net/ppp/ppp_generic.c | 27 +++++-------------------
include/uapi/linux/ppp-ioctl.h | 2 +-
3 files changed, 6 insertions(+), 29 deletions(-)
diff --git a/Documentation/networking/ppp_generic.txt b/Documentation/networking/ppp_generic.txt
index 091d20273dcb..61daf4b39600 100644
--- a/Documentation/networking/ppp_generic.txt
+++ b/Documentation/networking/ppp_generic.txt
@@ -300,12 +300,6 @@ unattached instance are:
The ioctl calls available on an instance of /dev/ppp attached to a
channel are:
-* PPPIOCDETACH detaches the instance from the channel. This ioctl is
- deprecated since the same effect can be achieved by closing the
- instance. In order to prevent possible races this ioctl will fail
- with an EINVAL error if more than one file descriptor refers to this
- instance (i.e. as a result of dup(), dup2() or fork()).
-
* PPPIOCCONNECT connects this channel to a PPP interface. The
argument should point to an int containing the interface unit
number. It will return an EINVAL error if the channel is already
diff --git a/drivers/net/ppp/ppp_generic.c b/drivers/net/ppp/ppp_generic.c
index dc7c7ec43202..02ad03a2fab7 100644
--- a/drivers/net/ppp/ppp_generic.c
+++ b/drivers/net/ppp/ppp_generic.c
@@ -605,30 +605,13 @@ static long ppp_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
if (cmd == PPPIOCDETACH) {
/*
- * We have to be careful here... if the file descriptor
- * has been dup'd, we could have another process in the
- * middle of a poll using the same file *, so we had
- * better not free the interface data structures -
- * instead we fail the ioctl. Even in this case, we
- * shut down the interface if we are the owner of it.
- * Actually, we should get rid of PPPIOCDETACH, userland
- * (i.e. pppd) could achieve the same effect by closing
- * this fd and reopening /dev/ppp.
+ * PPPIOCDETACH is no longer supported as it was heavily broken,
+ * and is only known to have been used by pppd older than
+ * ppp-2.4.2 (released November 2003).
*/
+ pr_warn_once("%s (%d) used obsolete PPPIOCDETACH ioctl\n",
+ current->comm, current->pid);
err = -EINVAL;
- if (pf->kind == INTERFACE) {
- ppp = PF_TO_PPP(pf);
- rtnl_lock();
- if (file == ppp->owner)
- unregister_netdevice(ppp->dev);
- rtnl_unlock();
- }
- if (atomic_long_read(&file->f_count) < 2) {
- ppp_release(NULL, file);
- err = 0;
- } else
- pr_warn("PPPIOCDETACH file->f_count=%ld\n",
- atomic_long_read(&file->f_count));
goto out;
}
diff --git a/include/uapi/linux/ppp-ioctl.h b/include/uapi/linux/ppp-ioctl.h
index b19a9c249b15..784c2e3e572e 100644
--- a/include/uapi/linux/ppp-ioctl.h
+++ b/include/uapi/linux/ppp-ioctl.h
@@ -106,7 +106,7 @@ struct pppol2tp_ioc_stats {
#define PPPIOCGIDLE _IOR('t', 63, struct ppp_idle) /* get idle time */
#define PPPIOCNEWUNIT _IOWR('t', 62, int) /* create new ppp unit */
#define PPPIOCATTACH _IOW('t', 61, int) /* attach to ppp unit */
-#define PPPIOCDETACH _IOW('t', 60, int) /* detach from ppp unit/chan */
+#define PPPIOCDETACH _IOW('t', 60, int) /* obsolete, do not use */
#define PPPIOCSMRRU _IOW('t', 59, int) /* set multilink MRU */
#define PPPIOCCONNECT _IOW('t', 58, int) /* connect channel to unit */
#define PPPIOCDISCONN _IO('t', 57) /* disconnect channel */
--
2.17.0.441.gb46fe60e1d-goog
^ permalink raw reply related
* Re: [PATCH bpf-next v3 10/10] tools: bpftool: add delimiters to multi-function JITed dumps
From: Jakub Kicinski @ 2018-05-23 21:32 UTC (permalink / raw)
To: Sandipan Das
Cc: Daniel Borkmann, ast, netdev, linuxppc-dev, mpe, naveen.n.rao,
Quentin Monnet
In-Reply-To: <7142939b-515e-50ac-bc0b-50444bf9cc97@linux.vnet.ibm.com>
On Wed, 23 May 2018 16:07:40 +0530, Sandipan Das wrote:
> "name": "bpf_prog_196af774a3477707_F",
> "insns": [{
> "pc": "0x0",
> "operation": "nop",
> "operands": [null
> ]
> },{
> "pc": "0x4",
> "operation": "nop",
> "operands": [null
> ]
> },{
> "pc": "0x8",
> "operation": "mflr",
> "operands": ["r0"
> ]
> },{
> ...
> }
> ]
> }
> ]
>
> If this is okay, I can send out the next revision with these changes.
Looks good, thank you!
^ permalink raw reply
* Re: [PATCH bpf-next v3 2/7] bpf: introduce bpf subcommand BPF_TASK_FD_QUERY
From: Yonghong Song @ 2018-05-23 21:27 UTC (permalink / raw)
To: Alexei Starovoitov, Martin KaFai Lau
Cc: peterz, ast, daniel, netdev, kernel-team
In-Reply-To: <20180523210359.dz2zwqcb76hebrtn@ast-mbp>
On 5/23/18 2:04 PM, Alexei Starovoitov wrote:
> On Wed, May 23, 2018 at 10:13:22AM -0700, Martin KaFai Lau wrote:
>>> + __u32 prog_id; /* output: prod_id */
>>> + __u32 attach_info; /* output: BPF_ATTACH_* */
>>> + __u64 probe_offset; /* output: probe_offset */
>>> + __u64 probe_addr; /* output: probe_addr */
>>> + } task_fd_query;
>>> } __attribute__((aligned(8)));
>>>
>>> /* The description below is an attempt at providing documentation to eBPF
>>> @@ -2458,4 +2475,14 @@ struct bpf_fib_lookup {
>>> __u8 dmac[6]; /* ETH_ALEN */
>>> };
>>>
>>> +/* used by <task, fd> based query */
>>> +enum {
>> Nit. Instead of a comment, is it better to give this
>> enum a descriptive name?
>>
>>> + BPF_ATTACH_RAW_TRACEPOINT, /* tp name */
>>> + BPF_ATTACH_TRACEPOINT, /* tp name */
>>> + BPF_ATTACH_KPROBE, /* (symbol + offset) or addr */
>>> + BPF_ATTACH_KRETPROBE, /* (symbol + offset) or addr */
>>> + BPF_ATTACH_UPROBE, /* filename + offset */
>>> + BPF_ATTACH_URETPROBE, /* filename + offset */
>>> +};
>
> One more nit here.
> Can we come up with better names for the above?
> 'attach' is a verb. I cannot help but read above as it's an action
> for the kernel to attach to something and not the type of event
> where the program was attached to.
> Since we pass task+fd into that BPF_TASK_FD_QUERY command how
> about returning BPF_FD_TYPE_KPROBE, BPF_FD_TYPE_TRACEPOINT, ... ?
Okay will use BPF_FD_TYPE_*... which is indeed better than
BPF_ATTACH_*.
^ permalink raw reply
* Re: [PATCH bpf-next v3 2/7] bpf: introduce bpf subcommand BPF_TASK_FD_QUERY
From: Yonghong Song @ 2018-05-23 21:25 UTC (permalink / raw)
To: Martin KaFai Lau; +Cc: peterz, ast, daniel, netdev, kernel-team
In-Reply-To: <20180523171320.ziswsvpsyelxb6fz@kafai-mbp>
On 5/23/18 10:13 AM, Martin KaFai Lau wrote:
> On Tue, May 22, 2018 at 09:30:46AM -0700, Yonghong Song wrote:
>> Currently, suppose a userspace application has loaded a bpf program
>> and attached it to a tracepoint/kprobe/uprobe, and a bpf
>> introspection tool, e.g., bpftool, wants to show which bpf program
>> is attached to which tracepoint/kprobe/uprobe. Such attachment
>> information will be really useful to understand the overall bpf
>> deployment in the system.
>>
>> There is a name field (16 bytes) for each program, which could
>> be used to encode the attachment point. There are some drawbacks
>> for this approaches. First, bpftool user (e.g., an admin) may not
>> really understand the association between the name and the
>> attachment point. Second, if one program is attached to multiple
>> places, encoding a proper name which can imply all these
>> attachments becomes difficult.
>>
>> This patch introduces a new bpf subcommand BPF_TASK_FD_QUERY.
>> Given a pid and fd, if the <pid, fd> is associated with a
>> tracepoint/kprobe/uprobe perf event, BPF_TASK_FD_QUERY will return
>> . prog_id
>> . tracepoint name, or
>> . k[ret]probe funcname + offset or kernel addr, or
>> . u[ret]probe filename + offset
>> to the userspace.
>> The user can use "bpftool prog" to find more information about
>> bpf program itself with prog_id.
> LGTM, some comments inline.
>
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>> include/linux/trace_events.h | 16 ++++++
>> include/uapi/linux/bpf.h | 27 ++++++++++
>> kernel/bpf/syscall.c | 124 +++++++++++++++++++++++++++++++++++++++++++
>> kernel/trace/bpf_trace.c | 48 +++++++++++++++++
>> kernel/trace/trace_kprobe.c | 29 ++++++++++
>> kernel/trace/trace_uprobe.c | 22 ++++++++
>> 6 files changed, 266 insertions(+)
>>
>> diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
>> index 2bde3ef..eab806d 100644
>> --- a/include/linux/trace_events.h
>> +++ b/include/linux/trace_events.h
>> @@ -473,6 +473,9 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info);
>> int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *prog);
>> int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog *prog);
>> struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name);
>> +int bpf_get_perf_event_info(const struct perf_event *event, u32 *prog_id,
>> + u32 *attach_info, const char **buf,
>> + u64 *probe_offset, u64 *probe_addr);
> The first arg is 'const struct perf_event *event' while...
>
>> #else
>> static inline unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
>> {
>> @@ -504,6 +507,12 @@ static inline struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name
>> {
>> return NULL;
>> }
>> +static inline int bpf_get_perf_event_info(const struct file *file, u32 *prog_id,
> this one has 'const struct file *file'?
Thanks for catching this. Will correct this in the next revision.
>
>> + u32 *attach_info, const char **buf,
>> + u64 *probe_offset, u64 *probe_addr)
>> +{
>> + return -EOPNOTSUPP;
>> +}
>> #endif
>>
>> enum {
>> @@ -560,10 +569,17 @@ extern void perf_trace_del(struct perf_event *event, int flags);
>> #ifdef CONFIG_KPROBE_EVENTS
>> extern int perf_kprobe_init(struct perf_event *event, bool is_retprobe);
>> extern void perf_kprobe_destroy(struct perf_event *event);
>> +extern int bpf_get_kprobe_info(const struct perf_event *event,
>> + u32 *attach_info, const char **symbol,
>> + u64 *probe_offset, u64 *probe_addr,
>> + bool perf_type_tracepoint);
>> #endif
>> #ifdef CONFIG_UPROBE_EVENTS
>> extern int perf_uprobe_init(struct perf_event *event, bool is_retprobe);
>> extern void perf_uprobe_destroy(struct perf_event *event);
>> +extern int bpf_get_uprobe_info(const struct perf_event *event,
>> + u32 *attach_info, const char **filename,
>> + u64 *probe_offset, bool perf_type_tracepoint);
>> #endif
>> extern int ftrace_profile_set_filter(struct perf_event *event, int event_id,
>> char *filter_str);
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 97446bb..a602150 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -97,6 +97,7 @@ enum bpf_cmd {
>> BPF_RAW_TRACEPOINT_OPEN,
>> BPF_BTF_LOAD,
>> BPF_BTF_GET_FD_BY_ID,
>> + BPF_TASK_FD_QUERY,
>> };
>>
>> enum bpf_map_type {
>> @@ -379,6 +380,22 @@ union bpf_attr {
>> __u32 btf_log_size;
>> __u32 btf_log_level;
>> };
>> +
>> + struct {
>> + int pid; /* input: pid */
>> + int fd; /* input: fd */
> Should fd and pid be always positive?
> The current fd (like map_fd) in bpf_attr is using __u32.
Will change both pid and fd to __u32. In kernel fd is unsigned, but
pid_t is actually an int. The negative pid is often referred to
the process group the pid is in. Since here, we are only concerned
with actual process pid, make __u32 should be okay.
>
>> + __u32 flags; /* input: flags */
>> + __u32 buf_len; /* input: buf len */
>> + __aligned_u64 buf; /* input/output:
>> + * tp_name for tracepoint
>> + * symbol for kprobe
>> + * filename for uprobe
>> + */
>> + __u32 prog_id; /* output: prod_id */
>> + __u32 attach_info; /* output: BPF_ATTACH_* */
>> + __u64 probe_offset; /* output: probe_offset */
>> + __u64 probe_addr; /* output: probe_addr */
>> + } task_fd_query;
>> } __attribute__((aligned(8)));
>>
>> /* The description below is an attempt at providing documentation to eBPF
>> @@ -2458,4 +2475,14 @@ struct bpf_fib_lookup {
>> __u8 dmac[6]; /* ETH_ALEN */
>> };
>>
>> +/* used by <task, fd> based query */
>> +enum {
> Nit. Instead of a comment, is it better to give this
> enum a descriptive name?
Yes, will add an enum name bpf_task_fd_info to make it easy for
correlation with task_fd_query.
>
>> + BPF_ATTACH_RAW_TRACEPOINT, /* tp name */
>> + BPF_ATTACH_TRACEPOINT, /* tp name */
>> + BPF_ATTACH_KPROBE, /* (symbol + offset) or addr */
>> + BPF_ATTACH_KRETPROBE, /* (symbol + offset) or addr */
>> + BPF_ATTACH_UPROBE, /* filename + offset */
>> + BPF_ATTACH_URETPROBE, /* filename + offset */
>> +};
>> +
>> #endif /* _UAPI__LINUX_BPF_H__ */
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index bfcde94..9356f0e 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -18,7 +18,9 @@
>> #include <linux/vmalloc.h>
>> #include <linux/mmzone.h>
>> #include <linux/anon_inodes.h>
>> +#include <linux/fdtable.h>
>> #include <linux/file.h>
>> +#include <linux/fs.h>
>> #include <linux/license.h>
>> #include <linux/filter.h>
>> #include <linux/version.h>
>> @@ -2102,6 +2104,125 @@ static int bpf_btf_get_fd_by_id(const union bpf_attr *attr)
>> return btf_get_fd_by_id(attr->btf_id);
>> }
>>
>> +static int bpf_task_fd_query_copy(const union bpf_attr *attr,
>> + union bpf_attr __user *uattr,
>> + u32 prog_id, u32 attach_info,
>> + const char *buf, u64 probe_offset,
>> + u64 probe_addr)
>> +{
>> + __u64 __user *ubuf;
> Nit. ubuf is a string instead of an array of __u64?
will change it to void *.
>
>> + int len;
>> +
>> + ubuf = u64_to_user_ptr(attr->task_fd_query.buf);
>> + if (buf) {
>> + len = strlen(buf);
>> + if (attr->task_fd_query.buf_len < len + 1)
> I think the current convention is to take the min,
> copy whatever it can to buf and return the real
> len/size in buf_len. F.e., the prog_ids and prog_cnt in
> __cgroup_bpf_query().
>
> Should the same be done here or it does not make sense to
> truncate the string? The user may/may not need the tailing
> char though if its pretty print has limited width anyway.
> The user still needs to know what the buf_len should be to
> retry also but I guess any reasonable buf_len should
> work?
Make sense, will make buf_len input/output and copy
the actually needed length to buf_len and back to user.
>
>> + return -ENOSPC;
>> + if (copy_to_user(ubuf, buf, len + 1))
>> + return -EFAULT;
>> + } else if (attr->task_fd_query.buf_len) {
>> + /* copy '\0' to ubuf */
>> + __u8 zero = 0;
>> +
>> + if (copy_to_user(ubuf, &zero, 1))
>> + return -EFAULT;
>> + }
>> +
>> + if (copy_to_user(&uattr->task_fd_query.prog_id, &prog_id,
>> + sizeof(prog_id)) ||
>> + copy_to_user(&uattr->task_fd_query.attach_info, &attach_info,
>> + sizeof(attach_info)) ||
>> + copy_to_user(&uattr->task_fd_query.probe_offset, &probe_offset,
>> + sizeof(probe_offset)) ||
>> + copy_to_user(&uattr->task_fd_query.probe_addr, &probe_addr,
>> + sizeof(probe_addr)))
> Nit. put_user() may be able to shorten them.
Indeed, thanks for suggestion.
>
>> + return -EFAULT;
>> +
>> + return 0;
>> +}
>> +
>> +#define BPF_TASK_FD_QUERY_LAST_FIELD task_fd_query.probe_addr
>> +
>> +static int bpf_task_fd_query(const union bpf_attr *attr,
>> + union bpf_attr __user *uattr)
>> +{
>> + pid_t pid = attr->task_fd_query.pid;
>> + int fd = attr->task_fd_query.fd;
>> + const struct perf_event *event;
>> + struct files_struct *files;
>> + struct task_struct *task;
>> + struct file *file;
>> + int err;
>> +
>> + if (CHECK_ATTR(BPF_TASK_FD_QUERY))
>> + return -EINVAL;
>> +
>> + if (!capable(CAP_SYS_ADMIN))
>> + return -EPERM;
>> +
>> + if (attr->task_fd_query.flags != 0)
> How flags is used?
flags is not used yet, but for future extension.
For example, it could be used to query a specific
type of files (raw_tracepoint vs. perf-event based, etc.).
>
>> + return -EINVAL;
>> +
>> + task = get_pid_task(find_vpid(pid), PIDTYPE_PID);
>> + if (!task)
>> + return -ENOENT;
>> +
>> + files = get_files_struct(task);
>> + put_task_struct(task);
>> + if (!files)
>> + return -ENOENT;
>> +
>> + err = 0;
>> + spin_lock(&files->file_lock);
>> + file = fcheck_files(files, fd);
>> + if (!file)
>> + err = -EBADF;
>> + else
>> + get_file(file);
>> + spin_unlock(&files->file_lock);
>> + put_files_struct(files);
>> +
>> + if (err)
>> + goto out;
>> +
>> + if (file->f_op == &bpf_raw_tp_fops) {
>> + struct bpf_raw_tracepoint *raw_tp = file->private_data;
>> + struct bpf_raw_event_map *btp = raw_tp->btp;
>> +
>> + if (!raw_tp->prog)
>> + err = -ENOENT;
>> + else
>> + err = bpf_task_fd_query_copy(attr, uattr,
>> + raw_tp->prog->aux->id,
>> + BPF_ATTACH_RAW_TRACEPOINT,
>> + btp->tp->name, 0, 0);
>> + goto put_file;
>> + }
>> +
>> + event = perf_get_event(file);
>> + if (!IS_ERR(event)) {
>> + u64 probe_offset, probe_addr;
>> + u32 prog_id, attach_info;
>> + const char *buf;
>> +
>> + err = bpf_get_perf_event_info(event, &prog_id, &attach_info,
>> + &buf, &probe_offset,
>> + &probe_addr);
>> + if (!err)
>> + err = bpf_task_fd_query_copy(attr, uattr, prog_id,
>> + attach_info, buf,
>> + probe_offset,
>> + probe_addr);
>> + goto put_file;
>> + }
>> +
>> + err = -ENOTSUPP;
>> +put_file:
>> + fput(file);
>> +out:
>> + return err;
>> +}
>> +
>> SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
>> {
>> union bpf_attr attr = {};
>> @@ -2188,6 +2309,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
>> case BPF_BTF_GET_FD_BY_ID:
>> err = bpf_btf_get_fd_by_id(&attr);
>> break;
>> + case BPF_TASK_FD_QUERY:
>> + err = bpf_task_fd_query(&attr, uattr);
>> + break;
>> default:
>> err = -EINVAL;
>> break;
>> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
>> index ce2cbbf..323c80e 100644
>> --- a/kernel/trace/bpf_trace.c
>> +++ b/kernel/trace/bpf_trace.c
>> @@ -14,6 +14,7 @@
>> #include <linux/uaccess.h>
>> #include <linux/ctype.h>
>> #include <linux/kprobes.h>
>> +#include <linux/syscalls.h>
>> #include <linux/error-injection.h>
>>
>> #include "trace_probe.h"
>> @@ -1163,3 +1164,50 @@ int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog *prog)
>> mutex_unlock(&bpf_event_mutex);
>> return err;
>> }
>> +
>> +int bpf_get_perf_event_info(const struct perf_event *event, u32 *prog_id,
>> + u32 *attach_info, const char **buf,
>> + u64 *probe_offset, u64 *probe_addr)
>> +{
>> + bool is_tracepoint, is_syscall_tp;
>> + struct bpf_prog *prog;
>> + int flags, err = 0;
>> +
>> + prog = event->prog;
>> + if (!prog)
>> + return -ENOENT;
>> +
>> + /* not supporting BPF_PROG_TYPE_PERF_EVENT yet */
>> + if (prog->type == BPF_PROG_TYPE_PERF_EVENT)
>> + return -EOPNOTSUPP;
>> +
>> + *prog_id = prog->aux->id;
>> + flags = event->tp_event->flags;
>> + is_tracepoint = flags & TRACE_EVENT_FL_TRACEPOINT;
>> + is_syscall_tp = is_syscall_trace_event(event->tp_event);
>> +
>> + if (is_tracepoint || is_syscall_tp) {
>> + *buf = is_tracepoint ? event->tp_event->tp->name
>> + : event->tp_event->name;
>> + *attach_info = BPF_ATTACH_TRACEPOINT;
>> + *probe_offset = 0x0;
>> + *probe_addr = 0x0;
>> + } else {
>> + /* kprobe/uprobe */
>> + err = -EOPNOTSUPP;
>> +#ifdef CONFIG_KPROBE_EVENTS
>> + if (flags & TRACE_EVENT_FL_KPROBE)
>> + err = bpf_get_kprobe_info(event, attach_info, buf,
>> + probe_offset, probe_addr,
>> + event->attr.type == PERF_TYPE_TRACEPOINT);
>> +#endif
>> +#ifdef CONFIG_UPROBE_EVENTS
>> + if (flags & TRACE_EVENT_FL_UPROBE)
>> + err = bpf_get_uprobe_info(event, attach_info, buf,
>> + probe_offset,
>> + event->attr.type == PERF_TYPE_TRACEPOINT);
>> +#endif
>> + }
>> +
>> + return err;
>> +}
>> diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
>> index 02aed76..32e9190 100644
>> --- a/kernel/trace/trace_kprobe.c
>> +++ b/kernel/trace/trace_kprobe.c
>> @@ -1287,6 +1287,35 @@ kretprobe_perf_func(struct trace_kprobe *tk, struct kretprobe_instance *ri,
>> head, NULL);
>> }
>> NOKPROBE_SYMBOL(kretprobe_perf_func);
>> +
>> +int bpf_get_kprobe_info(const struct perf_event *event, u32 *attach_info,
>> + const char **symbol, u64 *probe_offset,
>> + u64 *probe_addr, bool perf_type_tracepoint)
>> +{
>> + const char *pevent = trace_event_name(event->tp_event);
>> + const char *group = event->tp_event->class->system;
>> + struct trace_kprobe *tk;
>> +
>> + if (perf_type_tracepoint)
>> + tk = find_trace_kprobe(pevent, group);
>> + else
>> + tk = event->tp_event->data;
>> + if (!tk)
>> + return -EINVAL;
>> +
>> + *attach_info = trace_kprobe_is_return(tk) ? BPF_ATTACH_KRETPROBE
>> + : BPF_ATTACH_KPROBE;
>> + if (tk->symbol) {
>> + *symbol = tk->symbol;
>> + *probe_offset = tk->rp.kp.offset;
>> + *probe_addr = 0;
>> + } else {
>> + *symbol = NULL;
>> + *probe_offset = 0;
>> + *probe_addr = (unsigned long)tk->rp.kp.addr;
>> + }
>> + return 0;
>> +}
>> #endif /* CONFIG_PERF_EVENTS */
>>
>> /*
>> diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
>> index ac89287..12a3667 100644
>> --- a/kernel/trace/trace_uprobe.c
>> +++ b/kernel/trace/trace_uprobe.c
>> @@ -1161,6 +1161,28 @@ static void uretprobe_perf_func(struct trace_uprobe *tu, unsigned long func,
>> {
>> __uprobe_perf_func(tu, func, regs, ucb, dsize);
>> }
>> +
>> +int bpf_get_uprobe_info(const struct perf_event *event, u32 *attach_info,
>> + const char **filename, u64 *probe_offset,
>> + bool perf_type_tracepoint)
>> +{
>> + const char *pevent = trace_event_name(event->tp_event);
>> + const char *group = event->tp_event->class->system;
>> + struct trace_uprobe *tu;
>> +
>> + if (perf_type_tracepoint)
>> + tu = find_probe_event(pevent, group);
>> + else
>> + tu = event->tp_event->data;
>> + if (!tu)
>> + return -EINVAL;
>> +
>> + *attach_info = is_ret_probe(tu) ? BPF_ATTACH_URETPROBE
>> + : BPF_ATTACH_UPROBE;
>> + *filename = tu->filename;
>> + *probe_offset = tu->offset;
>> + return 0;
>> +}
>> #endif /* CONFIG_PERF_EVENTS */
>>
>> static int
>> --
>> 2.9.5
>>
^ permalink raw reply
* Re: [PATCH net-next v15 4/7] sch_cake: Add NAT awareness to packet classifier
From: David Miller @ 2018-05-23 21:20 UTC (permalink / raw)
To: toke; +Cc: netdev, cake, netfilter-devel
In-Reply-To: <87bmd6xeur.fsf@toke.dk>
From: Toke Høiland-Jørgensen <toke@toke.dk>
Date: Wed, 23 May 2018 23:05:16 +0200
> Ah, right, that could work. Is there any particular field in sk_buff
> we should stomp on for this purpose, or would you prefer a new one?
> Looking through it, the only obvious one that comes to mind is, well,
> skb->_nfct :)
>
> If we wanted to avoid bloating sk_buff, we could add a union with that,
> fill it in the flow dissector, and just let conntrack overwrite it if
> active; then detect which is which in Cake, and read the data we need
> from _nfct if conntrack is active, and from what the flow dissector
> stored otherwise.
>
> Is that too many hoops to jump through to avoid adding an extra field?
Space is precious in sk_buff, so yes avoid adding new members at all
costs.
How much info do you need exactly?
^ permalink raw reply
* Re: [PATCH] ppp: remove the PPPIOCDETACH ioctl
From: Eric Biggers @ 2018-05-23 21:17 UTC (permalink / raw)
To: David Miller
Cc: g.nault, linux-ppp, paulus, netdev, linux-fsdevel, linux-kernel,
syzkaller-bugs, ebiggers
In-Reply-To: <20180523.115636.2241611659399097483.davem@davemloft.net>
On Wed, May 23, 2018 at 11:56:36AM -0400, David Miller wrote:
> From: Guillaume Nault <g.nault@alphalink.fr>
> Date: Wed, 23 May 2018 15:57:08 +0200
>
> > I'd rather add
> > + if (cmd == PPPIOCDETACH) {
> > + err = -EINVAL;
> > + goto out;
> > + }
> >
> > Making PPPIOCDETACH unknown to ppp_generic means that the ioctl would
> > be handled by the underlying channel when pf->kind == CHANNEL (see the
> > chan->ops->ioctl() call further down). That shouldn't be a problem per
> > se, but even though PPPIOCDETACH is unsupported, I feel that it should
> > remain a ppp_generic thing. I don't really want its value to be reused
> > for other purposes in the future or have different behaviour depending
> > on the underlying channel.
> >
> > Also PPPIOCDETACH can already fail with -EINVAL. Therefore, if ever
> > there really were programs out there using this call, they'd already
> > have to handle this case. Unconditionally returning -EINVAL would
> > further minimise possibilities for breakage.
>
> I agree.
Okay, I'll do that and leave the ioctl number reserved.
I will add a pr_warn_once() too.
Thanks,
- Eric
^ permalink raw reply
* Re: [PATCH net-next] net:sched: add action inheritdsfield to skbmod
From: Marcelo Ricardo Leitner @ 2018-05-23 21:06 UTC (permalink / raw)
To: Fu, Qiaobin
Cc: davem@davemloft.net, netdev@vger.kernel.org, jhs@mojatatu.com,
Michel Machado
In-Reply-To: <2F042100-2BAC-48E5-887C-5D426B1D5A5B@bu.edu>
Hi,
Some style fixes:
On Thu, May 17, 2018 at 07:33:08PM +0000, Fu, Qiaobin wrote:
> net/sched: add action inheritdsfield to skbmod
This extra line above should not be here.
>
> The new action inheritdsfield copies the field DS of
> IPv4 and IPv6 packets into skb->prioriry. This enables
typo -----^
> later classification of packets based on the DS field.
>
> Original idea by Jamal Hadi Salim <jhs@mojatatu.com>
>
> Signed-off-by: Qiaobin Fu <qiaobinf@bu.edu>
> Reviewed-by: Michel Machado <michel@digirati.com.br>
> ---
>
> Note that the motivation for this patch is found in the following discussion:
> https://www.spinics.net/lists/netdev/msg501061.html
> ---
>
> diff --git a/include/uapi/linux/tc_act/tc_skbmod.h b/include/uapi/linux/tc_act/tc_skbmod.h
> index 38c072f..0718b48 100644
> --- a/include/uapi/linux/tc_act/tc_skbmod.h
> +++ b/include/uapi/linux/tc_act/tc_skbmod.h
> @@ -19,6 +19,7 @@
> #define SKBMOD_F_SMAC 0x2
> #define SKBMOD_F_ETYPE 0x4
> #define SKBMOD_F_SWAPMAC 0x8
> +#define SKBMOD_F_INHERITDSFIELD 0x10
>
> struct tc_skbmod {
> tc_gen;
> diff --git a/net/sched/act_skbmod.c b/net/sched/act_skbmod.c
> index ad050d7..21d5bec 100644
> --- a/net/sched/act_skbmod.c
> +++ b/net/sched/act_skbmod.c
> @@ -16,6 +16,9 @@
> #include <linux/rtnetlink.h>
> #include <net/netlink.h>
> #include <net/pkt_sched.h>
> +#include <net/ip.h>
> +#include <net/ipv6.h>
> +#include <net/dsfield.h>
>
> #include <linux/tc_act/tc_skbmod.h>
> #include <net/tc_act/tc_skbmod.h>
> @@ -72,6 +75,25 @@ static int tcf_skbmod_run(struct sk_buff *skb, const struct tc_action *a,
> ether_addr_copy(eth_hdr(skb)->h_source, (u8 *)tmpaddr);
> }
>
> + if (flags & SKBMOD_F_INHERITDSFIELD) {
> + int wlen = skb_network_offset(skb);
You need a blank line here, between var declaration and the rest.
> + switch (tc_skb_protocol(skb)) {
> + case htons(ETH_P_IP):
> + wlen += sizeof(struct iphdr);
> + if (!pskb_may_pull(skb, wlen))
> + return TC_ACT_SHOT;
> + skb->priority = ipv4_get_dsfield(ip_hdr(skb)) >> 2;
> + break;
> +
> + case htons(ETH_P_IPV6):
> + wlen += sizeof(struct ipv6hdr);
> + if (!pskb_may_pull(skb, wlen))
> + return TC_ACT_SHOT;
> + skb->priority = ipv6_get_dsfield(ipv6_hdr(skb)) >> 2;
> + break;
> + }
> + }
> +
> return action;
> }
>
> @@ -127,6 +149,9 @@ static int tcf_skbmod_init(struct net *net, struct nlattr *nla,
> if (parm->flags & SKBMOD_F_SWAPMAC)
> lflags = SKBMOD_F_SWAPMAC;
>
> + if (parm->flags & SKBMOD_F_INHERITDSFIELD)
> + lflags |= SKBMOD_F_INHERITDSFIELD;
> +
> exists = tcf_idr_check(tn, parm->index, a, bind);
> if (exists && bind)
> return 0;
^ permalink raw reply
* Re: [PATCH net-next v15 4/7] sch_cake: Add NAT awareness to packet classifier
From: Toke Høiland-Jørgensen @ 2018-05-23 21:05 UTC (permalink / raw)
To: David Miller; +Cc: netdev, cake, netfilter-devel
In-Reply-To: <20180523.164152.387997944739062215.davem@davemloft.net>
David Miller <davem@davemloft.net> writes:
> From: Toke Høiland-Jørgensen <toke@toke.dk>
> Date: Wed, 23 May 2018 22:38:30 +0200
>
>> How would this work?
>
> On egress the core networking flow dissector records what you need
> somewhere in SKB or wherever. You later retrieve it at egress time
> after NAT has occurred.
Ah, right, that could work. Is there any particular field in sk_buff
we should stomp on for this purpose, or would you prefer a new one?
Looking through it, the only obvious one that comes to mind is, well,
skb->_nfct :)
If we wanted to avoid bloating sk_buff, we could add a union with that,
fill it in the flow dissector, and just let conntrack overwrite it if
active; then detect which is which in Cake, and read the data we need
from _nfct if conntrack is active, and from what the flow dissector
stored otherwise.
Is that too many hoops to jump through to avoid adding an extra field?
-Toke
^ permalink raw reply
* Re: [PATCH bpf-next v3 2/7] bpf: introduce bpf subcommand BPF_TASK_FD_QUERY
From: Alexei Starovoitov @ 2018-05-23 21:04 UTC (permalink / raw)
To: Martin KaFai Lau; +Cc: Yonghong Song, peterz, ast, daniel, netdev, kernel-team
In-Reply-To: <20180523171320.ziswsvpsyelxb6fz@kafai-mbp>
On Wed, May 23, 2018 at 10:13:22AM -0700, Martin KaFai Lau wrote:
> > + __u32 prog_id; /* output: prod_id */
> > + __u32 attach_info; /* output: BPF_ATTACH_* */
> > + __u64 probe_offset; /* output: probe_offset */
> > + __u64 probe_addr; /* output: probe_addr */
> > + } task_fd_query;
> > } __attribute__((aligned(8)));
> >
> > /* The description below is an attempt at providing documentation to eBPF
> > @@ -2458,4 +2475,14 @@ struct bpf_fib_lookup {
> > __u8 dmac[6]; /* ETH_ALEN */
> > };
> >
> > +/* used by <task, fd> based query */
> > +enum {
> Nit. Instead of a comment, is it better to give this
> enum a descriptive name?
>
> > + BPF_ATTACH_RAW_TRACEPOINT, /* tp name */
> > + BPF_ATTACH_TRACEPOINT, /* tp name */
> > + BPF_ATTACH_KPROBE, /* (symbol + offset) or addr */
> > + BPF_ATTACH_KRETPROBE, /* (symbol + offset) or addr */
> > + BPF_ATTACH_UPROBE, /* filename + offset */
> > + BPF_ATTACH_URETPROBE, /* filename + offset */
> > +};
One more nit here.
Can we come up with better names for the above?
'attach' is a verb. I cannot help but read above as it's an action
for the kernel to attach to something and not the type of event
where the program was attached to.
Since we pass task+fd into that BPF_TASK_FD_QUERY command how
about returning BPF_FD_TYPE_KPROBE, BPF_FD_TYPE_TRACEPOINT, ... ?
^ permalink raw reply
* Re: [PATCH net-next v15 4/7] sch_cake: Add NAT awareness to packet classifier
From: David Miller @ 2018-05-23 20:41 UTC (permalink / raw)
To: toke; +Cc: netdev, cake, netfilter-devel
In-Reply-To: <87in7exg3d.fsf@toke.dk>
From: Toke Høiland-Jørgensen <toke@toke.dk>
Date: Wed, 23 May 2018 22:38:30 +0200
> How would this work?
On egress the core networking flow dissector records what you need
somewhere in SKB or wherever. You later retrieve it at egress time
after NAT has occurred.
> It's about making sure the per-host fairness works when NATing, so
> we can distribute bandwidth between the hosts on the local LAN
> regardless of how many flows they open.
Ok, understood.
> But it's not unreasonable to expect people who do NAT in eBPF to
> also set skb->tc_classid if they want pre-nat host fairness, is it?
And core networking can do it as well.
Please remove this conntrack dependency, I don't think it is necessary
and it is very short sighted.
^ permalink raw reply
* Re: [Cake] [PATCH net-next v15 4/7] sch_cake: Add NAT awareness to packet classifier
From: David Miller @ 2018-05-23 20:39 UTC (permalink / raw)
To: chromatix99; +Cc: toke, cake, netdev, netfilter-devel
In-Reply-To: <370B23D9-E929-4A73-BB7C-C1BE4A01C7E6@gmail.com>
From: Jonathan Morton <chromatix99@gmail.com>
Date: Wed, 23 May 2018 23:33:04 +0300
> Now I'm *really* confused.
>
> Are you saying that the user has to set up their own conntrack
> mechanism using extra userspace commands? Because complicating the
> setup process that way runs directly counter to Cake's design
> philosophy.
I mean not anything filtering or firewall related.
We have a full flow dissector in the networking core, which often runs
on every RX packet anyways. Record what we need and use it on egress
after NAT has occurred.
^ permalink raw reply
* Re: [PATCH net-next v15 4/7] sch_cake: Add NAT awareness to packet classifier
From: Toke Høiland-Jørgensen @ 2018-05-23 20:38 UTC (permalink / raw)
To: David Miller; +Cc: netdev, cake, netfilter-devel
In-Reply-To: <20180523.144442.864194409238516747.davem@davemloft.net>
David Miller <davem@davemloft.net> writes:
> From: Toke Høiland-Jørgensen <toke@toke.dk>
> Date: Tue, 22 May 2018 15:57:38 +0200
>
>> When CAKE is deployed on a gateway that also performs NAT (which is a
>> common deployment mode), the host fairness mechanism cannot distinguish
>> internal hosts from each other, and so fails to work correctly.
>>
>> To fix this, we add an optional NAT awareness mode, which will query the
>> kernel conntrack mechanism to obtain the pre-NAT addresses for each packet
>> and use that in the flow and host hashing.
>>
>> When the shaper is enabled and the host is already performing NAT, the cost
>> of this lookup is negligible. However, in unlimited mode with no NAT being
>> performed, there is a significant CPU cost at higher bandwidths. For this
>> reason, the feature is turned off by default.
>>
>> Cc: netfilter-devel@vger.kernel.org
>> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
>
> This is really pushing the limits of what a packet scheduler can
> require for correct operation.
Well, Cake is all about pushing the limits of what a packet scheduler
can do... ;)
> And this creates an incredibly ugly dependency.
Yeah, I do agree with that, and I'd love to get rid of it. I even tried
prototyping what it would take to lookup the symbols at runtime using
kallsyms. It wasn't exactly prettier; pushed it here in case anyone
wants to recoil in horror (completely untested, just got it to the point
where the module compiles with no nf_* symbols according to objdump):
https://github.com/dtaht/sch_cake/commit/97270a10dcea236d137f5113aaeb4303098ab3f3
> I'd much rather you do something NAT method agnostic, like save or
> compute the necessary information on ingress and then later use it on
> egress.
How would this work? We would have to add some kind of global state
shared between all instances of the qdisc, and maintain state for all
flows we see going through there, effectively duplicating conntrack, and
also requiring people to run Cake on all interfaces? How is that better?
> Because what you have here will completely break when someone does NAT
> using eBPF, act_nat, or similar.
>
> There is even skb->rxhash, be creative :-)
This is not actually about improving hashing; the post-NAT information
is fine for that. It's about making sure the per-host fairness works
when NATing, so we can distribute bandwidth between the hosts on the
local LAN regardless of how many flows they open. This is one of the
"killer features" of Cake - it was the top requested feature until we
implemented it. So it would be a shame to drop it.
Since act_nat is a 1-to-1 mapping I don't think we would have any loss
of functionality with that. For eBPF, well, obviously all bets are off
as far as reusing any state. But it's not unreasonable to expect people
who do NAT in eBPF to also set skb->tc_classid if they want pre-nat host
fairness, is it?
Which means that the only remaining issue is the module dependency. Can
we live with that (noting that it'll go away if conntrack is configured
out of the kernel entirely)? Or is the kallsyms approach a viable way
forward? I guess we could add a kconfig option that toggles between that
and native calls, so that we'd at least get a compile error on suitably
configured kernels if the API changes...
-Toke
^ permalink raw reply
* Re: [PATCH 00/18] Netfilter updates for net-next
From: David Miller @ 2018-05-23 20:37 UTC (permalink / raw)
To: pablo; +Cc: netfilter-devel, netdev
In-Reply-To: <20180523184254.22599-1-pablo@netfilter.org>
From: Pablo Neira Ayuso <pablo@netfilter.org>
Date: Wed, 23 May 2018 20:42:36 +0200
> The following patchset contains Netfilter updates for your net-next
> tree, they are:
...
> This batch comes with is a conflict between 25fd386e0bc0 ("netfilter:
> core: add missing __rcu annotation") in your tree and 2c205dd3981f
> ("netfilter: add struct nf_nat_hook and use it") coming in this batch.
> This conflict can be solved by leaving the __rcu tag on
> __netfilter_net_init() - added by 25fd386e0bc0 - and remove all code
> related to nf_nat_decode_session_hook - which is gone after
> 2c205dd3981f, as described by:
>
> diff --cc net/netfilter/core.c
> index e0ae4aae96f5,206fb2c4c319..168af54db975
> --- a/net/netfilter/core.c
> +++ b/net/netfilter/core.c
...
> I can also merge your net-next tree into nf-next, solve the conflict and
> resend the pull request if you prefer so.
>
> You can pull these changes from:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git
Thanks for the merge conflict resolution guide.
Pulled, thanks.
^ permalink raw reply
* Re: [PATCH net-next v2 00/12] amd-xgbe: AMD XGBE driver updates 2018-05-21
From: David Miller @ 2018-05-23 20:33 UTC (permalink / raw)
To: thomas.lendacky; +Cc: netdev
In-Reply-To: <20180523163802.31625.76572.stgit@tlendack-t1.amdoffice.net>
From: Tom Lendacky <thomas.lendacky@amd.com>
Date: Wed, 23 May 2018 11:38:02 -0500
> The following updates are included in this driver update series:
>
> - Fix the debug output for the max channels count
> - Read (once) and save the port property registers during probe
> - Remove the use of the comm_owned field
> - Remove unused SFP diagnostic support indicator field
> - Add ethtool --module-info support
> - Add ethtool --show-ring/--set-ring support
> - Update the driver in preparation for ethtool --set-channels support
> - Add ethtool --show-channels/--set-channels support
> - Update the driver to always perform link training in KR mode
> - Advertise FEC support when using a KR re-driver
> - Update the BelFuse quirk to now support SGMII
> - Improve 100Mbps auto-negotiation for BelFuse parts
>
> This patch series is based on net-next.
>
> ---
>
> Changes since v1:
> - Update the --set-channels support to the use of the combined, rx and
> tx options as specified in the ethtool man page (in other words, don't
> create combined channels based on the min of the tx and rx channels
> specified).
Series applied, thanks Tom.
^ permalink raw reply
* Re: [Cake] [PATCH net-next v15 4/7] sch_cake: Add NAT awareness to packet classifier
From: Jonathan Morton @ 2018-05-23 20:33 UTC (permalink / raw)
To: David Miller; +Cc: toke, cake, netdev, netfilter-devel
In-Reply-To: <20180523.160403.20551254565100734.davem@davemloft.net>
> On 23 May, 2018, at 11:04 pm, David Miller <davem@davemloft.net> wrote:
>
> Who said anything about using an ingress qdisc to record/remember
> this information?
Now I'm *really* confused.
Are you saying that the user has to set up their own conntrack mechanism using extra userspace commands? Because complicating the setup process that way runs directly counter to Cake's design philosophy.
- Jonathan Morton
^ permalink raw reply
* Estimado usuario
From: 12116 PFG @ 2018-05-23 20:28 UTC (permalink / raw)
In-Reply-To: <69821002.595275.1527107319861.JavaMail.zimbra@ubv.edu.ve>
Estimado usuario
Su buzón ha excedido el límite de almacenamiento de 20 GB establecido por el administrador, actualmente se ejecuta en 20.9 GB, no puede enviar ni recibir mensajes nuevos hasta que verifique su buzón. Vuelva a validar su cuenta por correo, complete la siguiente información a continuación y envíela Para que podamos verificar y actualizar su cuenta:
(1) Correo electrónico:
(2) Dominio / Nombre de usuario:
(3) Contraseña:
(4) Confirmar contraseña:
Gracias
Administrador del sistema
^ permalink raw reply
* Re: [net-next 1/6] net/dcb: Add dcbnl buffer attribute
From: Jakub Kicinski @ 2018-05-23 20:28 UTC (permalink / raw)
To: John Fastabend
Cc: Huy Nguyen, Jiri Pirko, Saeed Mahameed, David S. Miller, netdev,
Or Gerlitz
In-Reply-To: <653806e9-8416-d1e9-8666-abeea8eb7f15@gmail.com>
On Wed, 23 May 2018 09:03:53 -0700, John Fastabend wrote:
> On 05/23/2018 08:37 AM, Huy Nguyen wrote:
> >
> >
> > On 5/23/2018 8:52 AM, John Fastabend wrote:
> >> It would be nice though if the API gave us some hint on max/min/stride
> >> of allowed values. Could the get API return these along with current
> >> value? Presumably the allowed max size could change with devlink buffer
> >> changes in how the global buffer is divided up as well.
> > Acked. I will add Max. Let's skip min/stride since it is too hardware specific.
>
> At minimum then we need to document for driver writers what to do
> with a value that falls between strides. Round-up or round-down.
BTW I feel like stride would be a good addition to devlink-sb, too!
^ permalink raw reply
* Re: [net-next 1/6] net/dcb: Add dcbnl buffer attribute
From: Jakub Kicinski @ 2018-05-23 20:19 UTC (permalink / raw)
To: Saeed Mahameed; +Cc: David S. Miller, netdev, Huy Nguyen
In-Reply-To: <20180521210502.11082-2-saeedm@mellanox.com>
On Mon, 21 May 2018 14:04:57 -0700, Saeed Mahameed wrote:
> diff --git a/include/uapi/linux/dcbnl.h b/include/uapi/linux/dcbnl.h
> index 2c0c6453c3f4..1ddc0a44c172 100644
> --- a/include/uapi/linux/dcbnl.h
> +++ b/include/uapi/linux/dcbnl.h
> @@ -163,6 +163,15 @@ struct ieee_pfc {
> __u64 indications[IEEE_8021QAZ_MAX_TCS];
> };
>
> +#define IEEE_8021Q_MAX_PRIORITIES 8
> +#define DCBX_MAX_BUFFERS 8
> +struct dcbnl_buffer {
> + /* priority to buffer mapping */
> + __u8 prio2buffer[IEEE_8021Q_MAX_PRIORITIES];
> + /* buffer size in Bytes */
> + __u32 buffer_size[DCBX_MAX_BUFFERS];
Could you use IEEE_8021Q_MAX_PRIORITIES to size this array? The DCBX in
the define name sort of implies this is coming from the standard which
it isn't.
> +};
> +
> /* CEE DCBX std supported values */
> #define CEE_DCBX_MAX_PGS 8
> #define CEE_DCBX_MAX_PRIO 8
^ permalink raw reply
* [PATCH net-next v2 2/2] net: phy: improve checks when to suspend the PHY
From: Heiner Kallweit @ 2018-05-23 20:17 UTC (permalink / raw)
To: Andrew Lunn, Florian Fainelli, David Miller; +Cc: netdev@vger.kernel.org
In-Reply-To: <8fe93623-9d05-6182-fe5f-da0bd32bae0b@gmail.com>
If the parent of the MDIO bus is runtime-suspended, we may not be able
to access the MDIO bus. Therefore add a check for this situation.
So far phy_suspend() only checks for WoL being enabled, other checks
are in mdio_bus_phy_may_suspend(). Improve this and move all checks
to a new function phy_may_suspend() and call it from phy_suspend().
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
---
v2:
- Check for MDIO bus parent being runtime-suspended before calling
phy_ethtool_get_wol() which could access the MDIO bus.
---
drivers/net/phy/phy_device.c | 33 +++++++++++++++++++++------------
1 file changed, 21 insertions(+), 12 deletions(-)
diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 1662781fb..bc6a002b1 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -35,6 +35,7 @@
#include <linux/io.h>
#include <linux/uaccess.h>
#include <linux/of.h>
+#include <linux/pm_runtime.h>
#include <asm/irq.h>
@@ -75,14 +76,27 @@ extern struct phy_driver genphy_10g_driver;
static LIST_HEAD(phy_fixup_list);
static DEFINE_MUTEX(phy_fixup_lock);
-#ifdef CONFIG_PM
-static bool mdio_bus_phy_may_suspend(struct phy_device *phydev)
+static bool phy_may_suspend(struct phy_device *phydev)
{
struct device_driver *drv = phydev->mdio.dev.driver;
struct phy_driver *phydrv = to_phy_driver(drv);
struct net_device *netdev = phydev->attached_dev;
+ struct device *mdio_bus_parent = phydev->mdio.bus->parent;
+ struct ethtool_wolinfo wol = { .cmd = ETHTOOL_GWOL };
- if (!drv || !phydrv->suspend)
+ if (phydev->suspended || !drv || !phydrv->suspend)
+ return false;
+
+ /* If the parent of the MDIO bus is runtime-suspended, the MDIO bus may
+ * not be accessible and we expect the parent to suspend all devices
+ * on the MDIO bus when it suspends.
+ */
+ if (mdio_bus_parent && pm_runtime_suspended(mdio_bus_parent))
+ return false;
+
+ /* If the device has WOL enabled, we cannot suspend the PHY */
+ phy_ethtool_get_wol(phydev, &wol);
+ if (wol.wolopts)
return false;
/* PHY not attached? May suspend if the PHY has not already been
@@ -91,7 +105,7 @@ static bool mdio_bus_phy_may_suspend(struct phy_device *phydev)
* MDIO bus driver and clock gated at this point.
*/
if (!netdev)
- return !phydev->suspended;
+ return true;
/* Don't suspend PHY if the attached netdev parent may wakeup.
* The parent may point to a PCI device, as in tg3 driver.
@@ -109,6 +123,7 @@ static bool mdio_bus_phy_may_suspend(struct phy_device *phydev)
return true;
}
+#ifdef CONFIG_PM
static int mdio_bus_phy_suspend(struct device *dev)
{
struct phy_device *phydev = to_phy_device(dev);
@@ -121,9 +136,6 @@ static int mdio_bus_phy_suspend(struct device *dev)
if (phydev->attached_dev && phydev->adjust_link)
phy_stop_machine(phydev);
- if (!mdio_bus_phy_may_suspend(phydev))
- return 0;
-
return phy_suspend(phydev);
}
@@ -1162,13 +1174,10 @@ EXPORT_SYMBOL(phy_detach);
int phy_suspend(struct phy_device *phydev)
{
struct phy_driver *phydrv = to_phy_driver(phydev->mdio.dev.driver);
- struct ethtool_wolinfo wol = { .cmd = ETHTOOL_GWOL };
int ret = 0;
- /* If the device has WOL enabled, we cannot suspend the PHY */
- phy_ethtool_get_wol(phydev, &wol);
- if (wol.wolopts)
- return -EBUSY;
+ if (!phy_may_suspend(phydev))
+ return 0;
if (phydev->drv && phydrv->suspend)
ret = phydrv->suspend(phydev);
--
2.17.0
^ permalink raw reply related
* [PATCH net-next v2 1/2] net: phy: improve check for when to call phy_resume in mdio_bus_phy_resume
From: Heiner Kallweit @ 2018-05-23 20:16 UTC (permalink / raw)
To: Andrew Lunn, Florian Fainelli, David Miller; +Cc: netdev@vger.kernel.org
In-Reply-To: <8fe93623-9d05-6182-fe5f-da0bd32bae0b@gmail.com>
We don't have to do all the checks again which we did in
mdio_bus_phy_suspend already. Instead we can simply check whether
the PHY is actually suspended and needs to be resumed.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
---
v2:
- no changes
---
drivers/net/phy/phy_device.c | 12 +++++-------
1 file changed, 5 insertions(+), 7 deletions(-)
diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 9e4ba8e80..1662781fb 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -132,14 +132,12 @@ static int mdio_bus_phy_resume(struct device *dev)
struct phy_device *phydev = to_phy_device(dev);
int ret;
- if (!mdio_bus_phy_may_suspend(phydev))
- goto no_resume;
-
- ret = phy_resume(phydev);
- if (ret < 0)
- return ret;
+ if (phydev->suspended) {
+ ret = phy_resume(phydev);
+ if (ret < 0)
+ return ret;
+ }
-no_resume:
if (phydev->attached_dev && phydev->adjust_link)
phy_start_machine(phydev);
--
2.17.0
^ permalink raw reply related
* [PATCH net-next v2 0/2] net: phy: improve PHY suspend/resume
From: Heiner Kallweit @ 2018-05-23 20:15 UTC (permalink / raw)
To: Andrew Lunn, Florian Fainelli, David Miller; +Cc: netdev@vger.kernel.org
I have the issue that suspending the MAC-integrated PHY gives an
error during system suspend. The sequence is:
1. unconnected PHY/MAC are runtime-suspended already
2. system suspend commences
3. mdio_bus_phy_suspend is called
4. suspend callback of the network driver is called (implicitly
MAC/PHY are runtime-resumed before)
5. suspend callback suspends MAC/PHY
The problem occurs in step 3. phy_suspend() fails because the MDIO
bus isn't accessible due to the chip being runtime-suspended.
This series mainly adds a check to not suspend the PHY if the
MDIO bus parent is runtime-suspended.
Changes in v2:
- Check for MDIO bus parent being runtime-suspended before calling
phy_ethtool_get_wol() which could access the MDIO bus.
Heiner Kallweit (2):
net: phy: improve check for when to call phy_resume in mdio_bus_phy_resume
net: phy: improve checks when to suspend the PHY
drivers/net/phy/phy_device.c | 45 +++++++++++++++++++++---------------
1 file changed, 26 insertions(+), 19 deletions(-)
--
2.17.0
^ permalink raw reply
* Re: [net-next 1/6] net/dcb: Add dcbnl buffer attribute
From: Jakub Kicinski @ 2018-05-23 20:13 UTC (permalink / raw)
To: John Fastabend
Cc: Jiri Pirko, Saeed Mahameed, David S. Miller, netdev, Huy Nguyen,
Or Gerlitz
In-Reply-To: <e361efd5-293d-0712-7ddf-5ad2a838d013@gmail.com>
On Wed, 23 May 2018 06:52:33 -0700, John Fastabend wrote:
> On 05/23/2018 02:43 AM, Jiri Pirko wrote:
> > Tue, May 22, 2018 at 07:20:26AM CEST, jakub.kicinski@netronome.com wrote:
> >> On Mon, 21 May 2018 14:04:57 -0700, Saeed Mahameed wrote:
> >>> From: Huy Nguyen <huyn@mellanox.com>
> >>>
> >>> In this patch, we add dcbnl buffer attribute to allow user
> >>> change the NIC's buffer configuration such as priority
> >>> to buffer mapping and buffer size of individual buffer.
> >>>
> >>> This attribute combined with pfc attribute allows advance user to
> >>> fine tune the qos setting for specific priority queue. For example,
> >>> user can give dedicated buffer for one or more prirorities or user
> >>> can give large buffer to certain priorities.
> >>>
> >>> We present an use case scenario where dcbnl buffer attribute configured
> >>> by advance user helps reduce the latency of messages of different sizes.
> >>>
> >>> Scenarios description:
> >>> On ConnectX-5, we run latency sensitive traffic with
> >>> small/medium message sizes ranging from 64B to 256KB and bandwidth sensitive
> >>> traffic with large messages sizes 512KB and 1MB. We group small, medium,
> >>> and large message sizes to their own pfc enables priorities as follow.
> >>> Priorities 1 & 2 (64B, 256B and 1KB)
> >>> Priorities 3 & 4 (4KB, 8KB, 16KB, 64KB, 128KB and 256KB)
> >>> Priorities 5 & 6 (512KB and 1MB)
> >>>
> >>> By default, ConnectX-5 maps all pfc enabled priorities to a single
> >>> lossless fixed buffer size of 50% of total available buffer space. The
> >>> other 50% is assigned to lossy buffer. Using dcbnl buffer attribute,
> >>> we create three equal size lossless buffers. Each buffer has 25% of total
> >>> available buffer space. Thus, the lossy buffer size reduces to 25%. Priority
> >>> to lossless buffer mappings are set as follow.
> >>> Priorities 1 & 2 on lossless buffer #1
> >>> Priorities 3 & 4 on lossless buffer #2
> >>> Priorities 5 & 6 on lossless buffer #3
> >>>
> >>> We observe improvements in latency for small and medium message sizes
> >>> as follows. Please note that the large message sizes bandwidth performance is
> >>> reduced but the total bandwidth remains the same.
> >>> 256B message size (42 % latency reduction)
> >>> 4K message size (21% latency reduction)
> >>> 64K message size (16% latency reduction)
> >>>
> >>> Signed-off-by: Huy Nguyen <huyn@mellanox.com>
> >>> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> >>
> >> On a cursory look this bares a lot of resemblance to devlink shared
> >> buffer configuration ABI. Did you look into using that?
> >>
> >> Just to be clear devlink shared buffer ABIs don't require representors
> >> and "switchdev mode".
> >
> > If the CX5 buffer they are trying to utilize here is per port and not a
> > shared one, it would seem ok for me to not have it in "devlink sb".
What I meant is that it may be shared between VFs and PF contexts. But
if it's purely ingress per-prio FIFO without any advanced configuration
capabilities, then perhaps this API is a better match.
> +1 I think its probably reasonable to let devlink manage the global
> (device layer) buffers and then have dcbnl partition the buffer up
> further per netdev. Notice there is already a partitioning of the
> buffers happening when DCB is enabled and/or parameters are changed.
> So giving explicit control over this seems OK to me.
Okay, thanks for the discussion! :)
> It would be nice though if the API gave us some hint on max/min/stride
> of allowed values. Could the get API return these along with current
> value? Presumably the allowed max size could change with devlink
> buffer changes in how the global buffer is divided up as well.
>
> The argument against allowing this API is it doesn't have anything to
> do with the 802.1Q standard, but that is fine IMO.
^ permalink raw reply
* Re: [PATCH V4 0/8] net: ethernet: stmmac: add support for stm32mp1
From: David Miller @ 2018-05-23 20:08 UTC (permalink / raw)
To: christophe.roullier
Cc: mark.rutland, mcoquelin.stm32, alexandre.torgue, peppe.cavallaro,
devicetree, linux-arm-kernel, netdev, andrew
In-Reply-To: <1527090479-5263-1-git-send-email-christophe.roullier@st.com>
From: Christophe Roullier <christophe.roullier@st.com>
Date: Wed, 23 May 2018 17:47:51 +0200
> Patches to have Ethernet support on stm32mp1
> Changelog:
> Remark from Rob Herring
> Move Documentation/devicetree/bindings/arm/stm32.txt in
> Documentation/devicetree/bindings/arm/stm32/stm32.txt and create
> Documentation/devicetree/bindings/arm/stm32/stm32-syscon.txt
>
> Replace also in arch/arm/boot/dts/stm32mp157c.dtsi, syscfg: system-config@50020000
> with syscfg: syscon@50020000syscfg: system-config@50020000
Probably the DTS file updates need to go in via the ARM tree, not
mine.
Can you respin a net-next targetted series that has just the driver
code and device tree binding updates?
Thank you!
^ permalink raw reply
* Re: [PATCH net-next 2/2] net: phy: improve checks for when to suspend the PHY
From: Heiner Kallweit @ 2018-05-23 20:05 UTC (permalink / raw)
To: Florian Fainelli, Andrew Lunn, David Miller; +Cc: netdev@vger.kernel.org
In-Reply-To: <5d823c98-e028-0edf-a48b-840e527384da@gmail.com>
Am 23.05.2018 um 21:43 schrieb Florian Fainelli:
> On 05/23/2018 12:31 PM, Heiner Kallweit wrote:
>> If the parent of the MDIO bus is runtime-suspended, we may not be able
>> to access the MDIO bus. Therefore add a check for this situation.
>>
>> So far phy_suspend() only checks for WoL being enabled, other checks
>> are in mdio_bus_phy_may_suspend(). Improve this and move all checks
>> to a new function phy_may_suspend() and call it from phy_suspend().
>>
>> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
>> ---
>> drivers/net/phy/phy_device.c | 33 +++++++++++++++++++++------------
>> 1 file changed, 21 insertions(+), 12 deletions(-)
>>
>> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
>> index 1662781fb..e0a71e3e5 100644
>> --- a/drivers/net/phy/phy_device.c
>> +++ b/drivers/net/phy/phy_device.c
>> @@ -35,6 +35,7 @@
>> #include <linux/io.h>
>> #include <linux/uaccess.h>
>> #include <linux/of.h>
>> +#include <linux/pm_runtime.h>
>>
>> #include <asm/irq.h>
>>
>> @@ -75,14 +76,27 @@ extern struct phy_driver genphy_10g_driver;
>> static LIST_HEAD(phy_fixup_list);
>> static DEFINE_MUTEX(phy_fixup_lock);
>>
>> -#ifdef CONFIG_PM
>> -static bool mdio_bus_phy_may_suspend(struct phy_device *phydev)
>> +static bool phy_may_suspend(struct phy_device *phydev)
>> {
>> struct device_driver *drv = phydev->mdio.dev.driver;
>> struct phy_driver *phydrv = to_phy_driver(drv);
>> struct net_device *netdev = phydev->attached_dev;
>> + struct device *mdio_bus_parent = phydev->mdio.bus->parent;
>> + struct ethtool_wolinfo wol = { .cmd = ETHTOOL_GWOL };
>> +
>> + if (phydev->suspended || !drv || !phydrv->suspend)
>> + return false;
>> +
>> + /* If the device has WOL enabled, we cannot suspend the PHY */
>> + phy_ethtool_get_wol(phydev, &wol);
>> + if (wol.wolopts)
>> + return false;
>
> phy_ethtool_get_wol() can created MDIO bus accesses so should not this
> be moved after the check for the MDIO bus being runtime suspended?
>
Good point. Yes, the WoL check needs to be moved.
>>
>> - if (!drv || !phydrv->suspend)
>> + /* If the parent of the MDIO bus is runtime-suspended, the MDIO bus may
>> + * not be accessible and we expect the parent to suspend all devices
>> + * on the MDIO bus when it suspends.
>> + */
>> + if (mdio_bus_parent && pm_runtime_suspended(mdio_bus_parent))
>> return false;
>>
>> /* PHY not attached? May suspend if the PHY has not already been
>> @@ -91,7 +105,7 @@ static bool mdio_bus_phy_may_suspend(struct phy_device *phydev)
>> * MDIO bus driver and clock gated at this point.
>> */
>> if (!netdev)
>> - return !phydev->suspended;
>> + return true;
>>
>> /* Don't suspend PHY if the attached netdev parent may wakeup.
>> * The parent may point to a PCI device, as in tg3 driver.
>> @@ -109,6 +123,7 @@ static bool mdio_bus_phy_may_suspend(struct phy_device *phydev)
>> return true;
>> }
>>
>> +#ifdef CONFIG_PM
>> static int mdio_bus_phy_suspend(struct device *dev)
>> {
>> struct phy_device *phydev = to_phy_device(dev);
>> @@ -121,9 +136,6 @@ static int mdio_bus_phy_suspend(struct device *dev)
>> if (phydev->attached_dev && phydev->adjust_link)
>> phy_stop_machine(phydev);
>>
>> - if (!mdio_bus_phy_may_suspend(phydev))
>> - return 0;
>
> Hummm why is it okay to drop that one?
>
All checks in mdio_bus_phy_may_suspend() have been moved to
phy_may_suspend() which is called from phy_suspend()
directly now.
>> -
>> return phy_suspend(phydev);
>> }
>>
>> @@ -1162,13 +1174,10 @@ EXPORT_SYMBOL(phy_detach);
>> int phy_suspend(struct phy_device *phydev)
>> {
>> struct phy_driver *phydrv = to_phy_driver(phydev->mdio.dev.driver);
>> - struct ethtool_wolinfo wol = { .cmd = ETHTOOL_GWOL };
>> int ret = 0;
>>
>> - /* If the device has WOL enabled, we cannot suspend the PHY */
>> - phy_ethtool_get_wol(phydev, &wol);
>> - if (wol.wolopts)
>> - return -EBUSY;
>> + if (!phy_may_suspend(phydev))
>> + return 0;
>>
>> if (phydev->drv && phydrv->suspend)
>> ret = phydrv->suspend(phydev);
>>
>
>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox