* Re: [PATCH bpf-next] tools/bpf: fix a netlink recv issue
From: Alexei Starovoitov @ 2018-09-11 21:27 UTC (permalink / raw)
To: Yonghong Song; +Cc: ast, daniel, netdev, kernel-team
In-Reply-To: <20180911210911.3235080-1-yhs@fb.com>
On Tue, Sep 11, 2018 at 02:09:11PM -0700, Yonghong Song wrote:
> Commit f7010770fbac ("tools/bpf: move bpf/lib netlink related
> functions into a new file") introduced a while loop for the
> netlink recv path. This while loop is needed since the
> buffer in recv syscall may not be enough to hold all the
> information and in such cases multiple recv calls are needed.
>
> The above commit introduced a bug: the while loop may block
> in the recv syscall when no more messages are expected.
> The netlink message header flag NLM_F_MULTI indicates that
> more messages are expected, and this patch fixes the bug by
> issuing a further recv syscall only if a multipart message
> is expected.
>
> The patch also adds a fix for a message length of 0.
> When netlink recv returns a message length of 0, no more
> messages carrying data will follow, so the while loop
> can end.
>
> Fixes: f7010770fbac ("tools/bpf: move bpf/lib netlink related functions into a new file")
> Reported-by: Björn Töpel <bjorn.topel@intel.com>
> Tested-by: Björn Töpel <bjorn.topel@intel.com>
> Signed-off-by: Yonghong Song <yhs@fb.com>
Applied, Thanks
^ permalink raw reply
* Re: libbpf build broken on musl libc (Alpine Linux)
From: Alexei Starovoitov @ 2018-09-11 21:24 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo
Cc: Jakub Kicinski, Daniel Borkmann, Thomas Richter,
Hendrik Brueckner, Linux Kernel Mailing List,
Linux Networking Development Mailing List
In-Reply-To: <20180911121543.GB22689@kernel.org>
On Tue, Sep 11, 2018 at 09:15:43AM -0300, Arnaldo Carvalho de Melo wrote:
> Em Tue, Sep 11, 2018 at 12:22:18PM +0200, Jakub Kicinski escreveu:
> > On Mon, 10 Sep 2018 14:29:03 -0300, Arnaldo Carvalho de Melo wrote:
> > > After lunch I'll work on a patch to fix this,
>
> > Hi Arnaldo!
>
> > Any luck?
>
> Well, we need to apply the patch below and make tools/lib/str_error_r.c
> live in a library that libbpf and perf are linked to.
Do you want us to take the patch, or are you applying it yourself?
^ permalink raw reply
* [Patch net] net_sched: notify filter deletion when deleting a chain
From: Cong Wang @ 2018-09-11 21:22 UTC (permalink / raw)
To: netdev; +Cc: Cong Wang, Jiri Pirko
When we delete a chain of filters, we need to notify
user-space that each filter in this chain is being deleted
too.
Fixes: 32a4f5ecd738 ("net: sched: introduce chain object to uapi")
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
net/sched/cls_api.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 1a67af8a6e8c..0a75cb2e5e7b 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -1902,6 +1902,8 @@ static int tc_ctl_chain(struct sk_buff *skb, struct nlmsghdr *n,
RTM_NEWCHAIN, false);
break;
case RTM_DELCHAIN:
+ tfilter_notify_chain(net, skb, block, q, parent, n,
+ chain, RTM_DELTFILTER);
/* Flush the chain first as the user requested chain removal. */
tcf_chain_flush(chain);
/* In case the chain was successfully deleted, put a reference
--
2.14.4
^ permalink raw reply related
* Re: [PATCH bpf-next 0/2] bpf: add bpffs/bpftool dump for prog_array and map_in_map maps
From: Alexei Starovoitov @ 2018-09-11 21:20 UTC (permalink / raw)
To: Yonghong Song; +Cc: ast, daniel, netdev, kernel-team
In-Reply-To: <20180907002605.1408960-1-yhs@fb.com>
On Thu, Sep 06, 2018 at 05:26:03PM -0700, Yonghong Song wrote:
> The support to dump program array and map_in_map maps
> for bpffs and bpftool is added. Patch #1 added bpffs support
> and Patch #2 added bpftool support. Please see
> individual patches for example output.
Applied, Thanks
^ permalink raw reply
* [PATCH bpf-next] tools/bpf: fix a netlink recv issue
From: Yonghong Song @ 2018-09-11 21:09 UTC (permalink / raw)
To: ast, daniel, netdev; +Cc: kernel-team
Commit f7010770fbac ("tools/bpf: move bpf/lib netlink related
functions into a new file") introduced a while loop for the
netlink recv path. This while loop is needed since the
buffer in recv syscall may not be enough to hold all the
information and in such cases multiple recv calls are needed.
The above commit introduced a bug: the while loop may block
in the recv syscall when no more messages are expected.
The netlink message header flag NLM_F_MULTI indicates that
more messages are expected, and this patch fixes the bug by
issuing a further recv syscall only if a multipart message
is expected.
The patch also adds a fix for a message length of 0.
When netlink recv returns a message length of 0, no more
messages carrying data will follow, so the while loop
can end.
Fixes: f7010770fbac ("tools/bpf: move bpf/lib netlink related functions into a new file")
Reported-by: Björn Töpel <bjorn.topel@intel.com>
Tested-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
---
tools/lib/bpf/netlink.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/tools/lib/bpf/netlink.c b/tools/lib/bpf/netlink.c
index 469e068dd0c5..fde1d7bf8199 100644
--- a/tools/lib/bpf/netlink.c
+++ b/tools/lib/bpf/netlink.c
@@ -65,18 +65,23 @@ static int bpf_netlink_recv(int sock, __u32 nl_pid, int seq,
__dump_nlmsg_t _fn, dump_nlmsg_t fn,
void *cookie)
{
+ bool multipart = true;
struct nlmsgerr *err;
struct nlmsghdr *nh;
char buf[4096];
int len, ret;
- while (1) {
+ while (multipart) {
+ multipart = false;
len = recv(sock, buf, sizeof(buf), 0);
if (len < 0) {
ret = -errno;
goto done;
}
+ if (len == 0)
+ break;
+
for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
nh = NLMSG_NEXT(nh, len)) {
if (nh->nlmsg_pid != nl_pid) {
@@ -87,6 +92,8 @@ static int bpf_netlink_recv(int sock, __u32 nl_pid, int seq,
ret = -LIBBPF_ERRNO__INVSEQ;
goto done;
}
+ if (nh->nlmsg_flags & NLM_F_MULTI)
+ multipart = true;
switch (nh->nlmsg_type) {
case NLMSG_ERROR:
err = (struct nlmsgerr *)NLMSG_DATA(nh);
--
2.17.1
^ permalink raw reply related
* Re: [PATCH net-next 4/5] rds: invoke socket sg filter attached to rds socket
From: santosh.shilimkar @ 2018-09-11 21:06 UTC (permalink / raw)
To: Tushar Dave, ast, daniel, davem, jakub.kicinski, quentin.monnet,
jiong.wang, sandipan, john.fastabend, kafai, rdna, yhs, netdev,
rds-devel, sowmini.varadhan
In-Reply-To: <1536694684-3200-5-git-send-email-tushar.n.dave@oracle.com>
On 9/11/18 12:38 PM, Tushar Dave wrote:
> The RDS module sits on top of TCP (rds_tcp) and IB (rds_rdma), so
> messages arrive in the form of skbs (over TCP) and scatterlists (over
> IB/RDMA). However, because socket filters only deal with skbs (e.g.
> struct skb as bpf context), we can only use a socket filter for
> rds_tcp and not for rds_rdma.
>
> Looking for a single filtering solution for RDS, it seems that the
> common denominator between sk_buff and scatterlist is scatterlist.
> Therefore, this patch converts the skb to an sgvec and invokes
> sg_filter_run for rds_tcp, and simply invokes sg_filter_run for
> IB/rds_rdma.
>
> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
> Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---
I remember acking the earlier version. Here it is again..
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
^ permalink raw reply
* Re: [PATCH v2 net-next 11/12] net: ethernet: Add helper for set_pauseparam for Pause
From: kbuild test robot @ 2018-09-11 21:01 UTC (permalink / raw)
To: Andrew Lunn
Cc: kbuild-all, David Miller, netdev, Florian Fainelli, Andrew Lunn
In-Reply-To: <1536616350-15442-12-git-send-email-andrew@lunn.ch>
[-- Attachment #1: Type: text/plain, Size: 13820 bytes --]
Hi Andrew,
I love your patch! Perhaps something to improve:
[auto build test WARNING on net-next/master]
url: https://github.com/0day-ci/linux/commits/Andrew-Lunn/Preparing-for-phylib-limkmodes/20180911-204149
reproduce: make htmldocs
All warnings (new ones prefixed by >>):
drivers/target/target_core_device.c:1: warning: no structured comments found
drivers/usb/dwc3/gadget.c:510: warning: Excess function parameter 'dwc' description in 'dwc3_gadget_start_config'
drivers/usb/typec/class.c:1497: warning: Excess function parameter 'drvdata' description in 'typec_port_register_altmode'
drivers/usb/typec/bus.c:1: warning: no structured comments found
drivers/usb/typec/bus.c:268: warning: Function parameter or member 'mode' not described in 'typec_match_altmode'
drivers/usb/typec/class.c:1497: warning: Excess function parameter 'drvdata' description in 'typec_port_register_altmode'
drivers/usb/typec/class.c:1: warning: no structured comments found
include/linux/w1.h:281: warning: Function parameter or member 'of_match_table' not described in 'w1_family'
fs/direct-io.c:257: warning: Excess function parameter 'offset' description in 'dio_complete'
fs/file_table.c:1: warning: no structured comments found
fs/libfs.c:477: warning: Excess function parameter 'available' description in 'simple_write_end'
fs/posix_acl.c:646: warning: Function parameter or member 'inode' not described in 'posix_acl_update_mode'
fs/posix_acl.c:646: warning: Function parameter or member 'mode_p' not described in 'posix_acl_update_mode'
fs/posix_acl.c:646: warning: Function parameter or member 'acl' not described in 'posix_acl_update_mode'
drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c:183: warning: Function parameter or member 'blockable' not described in 'amdgpu_mn_read_lock'
drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c:254: warning: Function parameter or member 'blockable' not described in 'amdgpu_mn_invalidate_range_start_gfx'
drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c:302: warning: Function parameter or member 'blockable' not described in 'amdgpu_mn_invalidate_range_start_hsa'
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:3011: warning: Excess function parameter 'dev' description in 'amdgpu_vm_get_task_info'
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:3012: warning: Function parameter or member 'adev' not described in 'amdgpu_vm_get_task_info'
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:3012: warning: Excess function parameter 'dev' description in 'amdgpu_vm_get_task_info'
include/drm/drm_drv.h:610: warning: Function parameter or member 'gem_prime_pin' not described in 'drm_driver'
include/drm/drm_drv.h:610: warning: Function parameter or member 'gem_prime_unpin' not described in 'drm_driver'
include/drm/drm_drv.h:610: warning: Function parameter or member 'gem_prime_res_obj' not described in 'drm_driver'
include/drm/drm_drv.h:610: warning: Function parameter or member 'gem_prime_get_sg_table' not described in 'drm_driver'
include/drm/drm_drv.h:610: warning: Function parameter or member 'gem_prime_import_sg_table' not described in 'drm_driver'
include/drm/drm_drv.h:610: warning: Function parameter or member 'gem_prime_vmap' not described in 'drm_driver'
include/drm/drm_drv.h:610: warning: Function parameter or member 'gem_prime_vunmap' not described in 'drm_driver'
include/drm/drm_drv.h:610: warning: Function parameter or member 'gem_prime_mmap' not described in 'drm_driver'
include/drm/drm_panel.h:98: warning: Function parameter or member 'link' not described in 'drm_panel'
drivers/gpu/drm/i915/i915_vma.h:49: warning: cannot understand function prototype: 'struct i915_vma '
drivers/gpu/drm/i915/i915_vma.h:1: warning: no structured comments found
drivers/gpu/drm/i915/intel_guc_fwif.h:553: warning: cannot understand function prototype: 'struct guc_log_buffer_state '
drivers/gpu/drm/i915/i915_trace.h:1: warning: no structured comments found
include/linux/skbuff.h:860: warning: Function parameter or member 'dev_scratch' not described in 'sk_buff'
include/linux/skbuff.h:860: warning: Function parameter or member 'list' not described in 'sk_buff'
include/linux/skbuff.h:860: warning: Function parameter or member 'ip_defrag_offset' not described in 'sk_buff'
include/linux/skbuff.h:860: warning: Function parameter or member 'skb_mstamp' not described in 'sk_buff'
include/linux/skbuff.h:860: warning: Function parameter or member '__cloned_offset' not described in 'sk_buff'
include/linux/skbuff.h:860: warning: Function parameter or member 'head_frag' not described in 'sk_buff'
include/linux/skbuff.h:860: warning: Function parameter or member '__pkt_type_offset' not described in 'sk_buff'
include/linux/skbuff.h:860: warning: Function parameter or member 'encapsulation' not described in 'sk_buff'
include/linux/skbuff.h:860: warning: Function parameter or member 'encap_hdr_csum' not described in 'sk_buff'
include/linux/skbuff.h:860: warning: Function parameter or member 'csum_valid' not described in 'sk_buff'
include/linux/skbuff.h:860: warning: Function parameter or member 'csum_complete_sw' not described in 'sk_buff'
include/linux/skbuff.h:860: warning: Function parameter or member 'csum_level' not described in 'sk_buff'
include/linux/skbuff.h:860: warning: Function parameter or member 'inner_protocol_type' not described in 'sk_buff'
include/linux/skbuff.h:860: warning: Function parameter or member 'remcsum_offload' not described in 'sk_buff'
include/linux/skbuff.h:860: warning: Function parameter or member 'offload_fwd_mark' not described in 'sk_buff'
include/linux/skbuff.h:860: warning: Function parameter or member 'offload_mr_fwd_mark' not described in 'sk_buff'
include/linux/skbuff.h:860: warning: Function parameter or member 'sender_cpu' not described in 'sk_buff'
include/linux/skbuff.h:860: warning: Function parameter or member 'reserved_tailroom' not described in 'sk_buff'
include/linux/skbuff.h:860: warning: Function parameter or member 'inner_ipproto' not described in 'sk_buff'
include/net/sock.h:238: warning: Function parameter or member 'skc_addrpair' not described in 'sock_common'
include/net/sock.h:238: warning: Function parameter or member 'skc_portpair' not described in 'sock_common'
include/net/sock.h:238: warning: Function parameter or member 'skc_ipv6only' not described in 'sock_common'
include/net/sock.h:238: warning: Function parameter or member 'skc_net_refcnt' not described in 'sock_common'
include/net/sock.h:238: warning: Function parameter or member 'skc_v6_daddr' not described in 'sock_common'
include/net/sock.h:238: warning: Function parameter or member 'skc_v6_rcv_saddr' not described in 'sock_common'
include/net/sock.h:238: warning: Function parameter or member 'skc_cookie' not described in 'sock_common'
include/net/sock.h:238: warning: Function parameter or member 'skc_listener' not described in 'sock_common'
include/net/sock.h:238: warning: Function parameter or member 'skc_tw_dr' not described in 'sock_common'
include/net/sock.h:238: warning: Function parameter or member 'skc_rcv_wnd' not described in 'sock_common'
include/net/sock.h:238: warning: Function parameter or member 'skc_tw_rcv_nxt' not described in 'sock_common'
include/net/sock.h:509: warning: Function parameter or member 'sk_backlog.rmem_alloc' not described in 'sock'
include/net/sock.h:509: warning: Function parameter or member 'sk_backlog.len' not described in 'sock'
include/net/sock.h:509: warning: Function parameter or member 'sk_backlog.head' not described in 'sock'
include/net/sock.h:509: warning: Function parameter or member 'sk_backlog.tail' not described in 'sock'
include/net/sock.h:509: warning: Function parameter or member 'sk_wq_raw' not described in 'sock'
include/net/sock.h:509: warning: Function parameter or member 'tcp_rtx_queue' not described in 'sock'
include/net/sock.h:509: warning: Function parameter or member 'sk_route_forced_caps' not described in 'sock'
include/net/sock.h:509: warning: Function parameter or member 'sk_txtime_report_errors' not described in 'sock'
include/net/sock.h:509: warning: Function parameter or member 'sk_validate_xmit_skb' not described in 'sock'
include/linux/netdevice.h:2044: warning: Function parameter or member 'adj_list.upper' not described in 'net_device'
include/linux/netdevice.h:2044: warning: Function parameter or member 'adj_list.lower' not described in 'net_device'
include/linux/netdevice.h:2044: warning: Function parameter or member 'gso_partial_features' not described in 'net_device'
include/linux/netdevice.h:2044: warning: Function parameter or member 'switchdev_ops' not described in 'net_device'
include/linux/netdevice.h:2044: warning: Function parameter or member 'l3mdev_ops' not described in 'net_device'
include/linux/netdevice.h:2044: warning: Function parameter or member 'xfrmdev_ops' not described in 'net_device'
include/linux/netdevice.h:2044: warning: Function parameter or member 'tlsdev_ops' not described in 'net_device'
include/linux/netdevice.h:2044: warning: Function parameter or member 'name_assign_type' not described in 'net_device'
include/linux/netdevice.h:2044: warning: Function parameter or member 'ieee802154_ptr' not described in 'net_device'
include/linux/netdevice.h:2044: warning: Function parameter or member 'mpls_ptr' not described in 'net_device'
include/linux/netdevice.h:2044: warning: Function parameter or member 'xdp_prog' not described in 'net_device'
include/linux/netdevice.h:2044: warning: Function parameter or member 'gro_flush_timeout' not described in 'net_device'
include/linux/netdevice.h:2044: warning: Function parameter or member 'nf_hooks_ingress' not described in 'net_device'
include/linux/netdevice.h:2044: warning: Function parameter or member '____cacheline_aligned_in_smp' not described in 'net_device'
include/linux/netdevice.h:2044: warning: Function parameter or member 'qdisc_hash' not described in 'net_device'
include/linux/netdevice.h:2044: warning: Function parameter or member 'xps_cpus_map' not described in 'net_device'
include/linux/netdevice.h:2044: warning: Function parameter or member 'xps_rxqs_map' not described in 'net_device'
>> drivers/net/phy/phy_device.c:1826: warning: Function parameter or member 'tx' not described in 'phy_set_sym_pause'
include/linux/phylink.h:56: warning: Function parameter or member '__ETHTOOL_DECLARE_LINK_MODE_MASK(advertising' not described in 'phylink_link_state'
include/linux/phylink.h:56: warning: Function parameter or member '__ETHTOOL_DECLARE_LINK_MODE_MASK(lp_advertising' not described in 'phylink_link_state'
sound/soc/soc-core.c:2918: warning: Excess function parameter 'legacy_dai_naming' description in 'snd_soc_register_dais'
Documentation/admin-guide/cgroup-v2.rst:1485: WARNING: Block quote ends without a blank line; unexpected unindent.
Documentation/admin-guide/cgroup-v2.rst:1487: WARNING: Block quote ends without a blank line; unexpected unindent.
Documentation/admin-guide/cgroup-v2.rst:1488: WARNING: Block quote ends without a blank line; unexpected unindent.
Documentation/core-api/boot-time-mm.rst:78: ERROR: Error in "kernel-doc" directive:
unknown option: "nodocs".
vim +1826 drivers/net/phy/phy_device.c
1812
1813 /**
1814 * phy_set_sym_pause - Configure symmetric Pause
1815 * @phydev: target phy_device struct
1816 * @rx: Receiver Pause is supported
1817 * @autoneg: Auto neg should be used
1818 *
1819 * Description: Configure advertised Pause support depending on if
1820 * receiver pause and pause auto neg is supported. Generally called
1821 * from the set_pauseparam .ndo.
1822 */
1823 void phy_set_sym_pause(struct phy_device *phydev, bool rx, bool tx,
1824 bool autoneg)
1825 {
> 1826 phydev->supported &= ~SUPPORTED_Pause;
1827
1828 if (rx && tx && autoneg)
1829 phydev->supported |= SUPPORTED_Pause;
1830
1831 phydev->advertising = phydev->supported;
1832 }
1833 EXPORT_SYMBOL(phy_set_sym_pause);
1834
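As a rough illustration of what the flagged warning asks for, the
kernel-doc block needs an entry for every parameter, including 'tx'.
The sketch below is a user-space stand-in, not the driver code: the
demo_ names, the struct layout, and the bit value are hypothetical, and
the wording of the @tx line is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for illustration only; the bit position is arbitrary here. */
#define DEMO_SUPPORTED_PAUSE (1u << 13)

struct demo_phy_device {
	uint32_t supported;
	uint32_t advertising;
};

/**
 * demo_phy_set_sym_pause - Configure symmetric Pause
 * @phydev: target phy_device struct
 * @rx: Receiver Pause is supported
 * @tx: Transmit Pause is supported (assumed wording for the missing entry)
 * @autoneg: Auto neg should be used
 *
 * Mirrors the logic quoted above: Pause is advertised only when rx, tx
 * and autoneg are all set.
 */
static void demo_phy_set_sym_pause(struct demo_phy_device *phydev,
				   bool rx, bool tx, bool autoneg)
{
	assert(phydev != NULL);

	phydev->supported &= ~DEMO_SUPPORTED_PAUSE;

	if (rx && tx && autoneg)
		phydev->supported |= DEMO_SUPPORTED_PAUSE;

	phydev->advertising = phydev->supported;
}
```

With every parameter described, kernel-doc should have nothing left to
report for this function.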
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 6586 bytes --]
^ permalink raw reply
* Re: [PATCH v3 net-next 5/6] dt-bindings: net: dsa: Add lantiq,xrx200-gswip DT bindings
From: Hauke Mehrtens @ 2018-09-11 21:01 UTC (permalink / raw)
To: Rob Herring
Cc: davem, netdev, andrew, vivien.didelot, f.fainelli, john,
linux-mips, dev, hauke.mehrtens, devicetree
In-Reply-To: <20180910220119.GA32582@bogus>
[-- Attachment #1.1: Type: text/plain, Size: 5896 bytes --]
On 09/11/2018 12:01 AM, Rob Herring wrote:
> On Sun, Sep 09, 2018 at 10:20:27PM +0200, Hauke Mehrtens wrote:
>> This adds the binding for the GSWIP (Gigabit switch) core found in the
>> xrx200 / VR9 Lantiq / Intel SoC.
>>
>> This part takes care of the switch, MDIO bus, and loading the FW into
>> the embedded GPHYs.
>>
>> Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
>> Cc: devicetree@vger.kernel.org
>> ---
>> .../devicetree/bindings/net/dsa/lantiq-gswip.txt | 141 +++++++++++++++++++++
>> 1 file changed, 141 insertions(+)
>> create mode 100644 Documentation/devicetree/bindings/net/dsa/lantiq-gswip.txt
>>
>> diff --git a/Documentation/devicetree/bindings/net/dsa/lantiq-gswip.txt b/Documentation/devicetree/bindings/net/dsa/lantiq-gswip.txt
>> new file mode 100644
>> index 000000000000..a089f5856778
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/net/dsa/lantiq-gswip.txt
>> @@ -0,0 +1,141 @@
>> +Lantiq GSWIP Ethernet switches
>> +==================================
>> +
>> +Required properties for GSWIP core:
>> +
>> +- compatible : "lantiq,xrx200-gswip" for the embedded GSWIP in the
>> + xRX200 SoC
>> +- reg : memory range of the GSWIP core registers
>> + : memory range of the GSWIP MDIO registers
>> + : memory range of the GSWIP MII registers
>> +
>> +See Documentation/devicetree/bindings/net/dsa/dsa.txt for a list of
>> +additional required and optional properties.
>> +
>> +
>> +Required properties for MDIO bus:
>> +- compatible : "lantiq,xrx200-mdio" for the MDIO bus inside the GSWIP
>> + core of the xRX200 SoC and the PHYs connected to it.
>> +
>> +See Documentation/devicetree/bindings/net/mdio.txt for a list of additional
>> +required and optional properties.
>> +
>> +
>> +Required properties for GPHY firmware loading:
>> +- compatible : "lantiq,gphy-fw" and "lantiq,xrx200-gphy-fw",
>> + "lantiq,xrx200a1x-gphy-fw", "lantiq,xrx200a2x-gphy-fw",
>> + "lantiq,xrx300-gphy-fw", or "lantiq,xrx330-gphy-fw"
>> + for the loading of the firmware into the embedded
>> + GPHY core of the SoC.
>
> One valid combination of compatibles per line please.
Ok, I will update this.
>
>> +- lantiq,rcu : reference to the rcu syscon
>> +
>> +The GPHY firmware loader has a list of GPHY entries, one for each
>> +embedded GPHY
>> +
>> +- reg : Offset of the GPHY firmware register in the RCU
>> + register range
>
> This use of reg is strange. This node should probably be a child of
> the RCU.
The SoC designers put all registers for which they didn't want to create
a new register block into the RCU (Reset controller unit) range. The
switch itself is on the main crossbar and has its own memory range, but
the registers to load the GPHY FW are in the RCU register range. We have
to load the GPHY firmware before we can access the GPHY; after the FW is
loaded we control the GPHY through the MDIO bus of the switch.
The GPHY is now part of the switch driver, so we also moved the GPHY
node to be a sub node of the switch. If it were under the RCU, we would
somehow have to make sure it gets loaded before the switch gets loaded,
which is more complicated. The GPHY itself is also part of the switch IP
block and not the reset controller unit.
>> +- resets : list of resets of the embedded GPHY
>> +- reset-names : list of names of the resets
>> +
>> +Example:
>> +
>> +Ethernet switch on the VRX200 SoC:
>> +
>> +gswip: gswip@E108000 {
>
> switch@... or ethernet-switch@...
>
> We need a standard name here and add it to the DT spec.
Ok, I will change this.
>
>> + #address-cells = <1>;
>> + #size-cells = <0>;
>> + compatible = "lantiq,xrx200-gswip";
>> + reg = < 0xE108000 0x3000 /* switch */
>> + 0xE10B100 0x70 /* mdio */
>> + 0xE10B1D8 0x30 /* mii */
>> + >;
>> + dsa,member = <0 0>;
>
> Not documented.
This is part of the general dsa binding.
>
>> +
>> + ports {
>> + #address-cells = <1>;
>> + #size-cells = <0>;
>> +
>> + port@0 {
>> + reg = <0>;
>> + label = "lan3";
>> + phy-mode = "rgmii";
>> + phy-handle = <&phy0>;
>> + };
>> +
>> + port@1 {
>> + reg = <1>;
>> + label = "lan4";
>> + phy-mode = "rgmii";
>> + phy-handle = <&phy1>;
>> + };
>> +
>> + port@2 {
>> + reg = <2>;
>> + label = "lan2";
>> + phy-mode = "internal";
>> + phy-handle = <&phy11>;
>> + };
>> +
>> + port@4 {
>> + reg = <4>;
>> + label = "lan1";
>> + phy-mode = "internal";
>> + phy-handle = <&phy13>;
>> + };
>> +
>> + port@5 {
>> + reg = <5>;
>> + label = "wan";
>> + phy-mode = "rgmii";
>> + phy-handle = <&phy5>;
>> + };
>> +
>> + port@6 {
>> + reg = <0x6>;
>> + label = "cpu";
>> +			ethernet = <&eth0>;
>> + };
>> + };
>> +
>> + mdio@0 {
>
> What's the address 0 here?
I will remove this, there is only one MDIO bus under the switch.
>
>> + #address-cells = <1>;
>> + #size-cells = <0>;
>> + compatible = "lantiq,xrx200-mdio";
>> + reg = <0>;
>> +
>> + phy0: ethernet-phy@0 {
>> + reg = <0x0>;
>> + };
>> + phy1: ethernet-phy@1 {
>> + reg = <0x1>;
>> + };
>> + phy5: ethernet-phy@5 {
>> + reg = <0x5>;
>> + };
>> + phy11: ethernet-phy@11 {
>> + reg = <0x11>;
>> + };
>> + phy13: ethernet-phy@13 {
>> + reg = <0x13>;
>> + };
>> + };
>> +
>> + gphy-fw {
>> + compatible = "lantiq,xrx200-gphy-fw", "lantiq,gphy-fw";
>> + lantiq,rcu = <&rcu0>;
>
> Missing #size-cells and #address-cells, but this should change as I said
> above.
Ok, I will change this.
>
>> +
>> + gphy@20 {
>> + reg = <0x20>;
>> +
>> + resets = <&reset0 31 30>;
>> + reset-names = "gphy";
>> + };
>> +
>> + gphy@68 {
>> + reg = <0x68>;
>> +
>> + resets = <&reset0 29 28>;
>> + reset-names = "gphy";
>> + };
>> + };
>> +};
>> --
>> 2.11.0
>>
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply
* [PATCH net-next 4/5] rds: invoke socket sg filter attached to rds socket
From: Tushar Dave @ 2018-09-11 19:38 UTC (permalink / raw)
To: ast, daniel, davem, santosh.shilimkar, jakub.kicinski,
quentin.monnet, jiong.wang, sandipan, john.fastabend, kafai, rdna,
yhs, netdev, rds-devel, sowmini.varadhan
In-Reply-To: <1536694684-3200-1-git-send-email-tushar.n.dave@oracle.com>
The RDS module sits on top of TCP (rds_tcp) and IB (rds_rdma), so
messages arrive in the form of skbs (over TCP) and scatterlists (over
IB/RDMA). However, because socket filters only deal with skbs (e.g.
struct skb as bpf context), we can only use a socket filter for
rds_tcp and not for rds_rdma.
Looking for a single filtering solution for RDS, it seems that the
common denominator between sk_buff and scatterlist is scatterlist.
Therefore, this patch converts the skb to an sgvec and invokes
sg_filter_run for rds_tcp, and simply invokes sg_filter_run for
IB/rds_rdma.
Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
net/rds/ib.c | 1 +
net/rds/ib.h | 1 +
net/rds/ib_recv.c | 12 ++++++
net/rds/rds.h | 1 +
net/rds/recv.c | 12 ++++++
net/rds/tcp.c | 1 +
net/rds/tcp.h | 2 +
net/rds/tcp_recv.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
8 files changed, 137 insertions(+), 1 deletion(-)
diff --git a/net/rds/ib.c b/net/rds/ib.c
index eba75c1..6c40652 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -527,6 +527,7 @@ struct rds_transport rds_ib_transport = {
.conn_path_shutdown = rds_ib_conn_path_shutdown,
.inc_copy_to_user = rds_ib_inc_copy_to_user,
.inc_free = rds_ib_inc_free,
+ .inc_to_sg_get = rds_ib_inc_to_sg_get,
.cm_initiate_connect = rds_ib_cm_initiate_connect,
.cm_handle_connect = rds_ib_cm_handle_connect,
.cm_connect_complete = rds_ib_cm_connect_complete,
diff --git a/net/rds/ib.h b/net/rds/ib.h
index 73427ff..0a12b41 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -404,6 +404,7 @@ int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev,
void rds_ib_recv_free_caches(struct rds_ib_connection *ic);
void rds_ib_recv_refill(struct rds_connection *conn, int prefill, gfp_t gfp);
void rds_ib_inc_free(struct rds_incoming *inc);
+int rds_ib_inc_to_sg_get(struct rds_incoming *inc, struct scatterlist **sg);
int rds_ib_inc_copy_to_user(struct rds_incoming *inc, struct iov_iter *to);
void rds_ib_recv_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc,
struct rds_ib_ack_state *state);
diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index 2f16146..0054c7c 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -219,6 +219,18 @@ void rds_ib_inc_free(struct rds_incoming *inc)
rds_ib_recv_cache_put(&ibinc->ii_cache_entry, &ic->i_cache_incs);
}
+int rds_ib_inc_to_sg_get(struct rds_incoming *inc, struct scatterlist **sg)
+{
+ struct rds_ib_incoming *ibinc;
+ struct rds_page_frag *frag;
+
+ ibinc = container_of(inc, struct rds_ib_incoming, ii_inc);
+ frag = list_entry(ibinc->ii_frags.next, struct rds_page_frag, f_item);
+ *sg = &frag->f_sg;
+
+ return 0;
+}
+
static void rds_ib_recv_clear_one(struct rds_ib_connection *ic,
struct rds_ib_recv_work *recv)
{
diff --git a/net/rds/rds.h b/net/rds/rds.h
index 6bfaf05..9f3e4df 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -542,6 +542,7 @@ struct rds_transport {
int (*recv_path)(struct rds_conn_path *cp);
int (*inc_copy_to_user)(struct rds_incoming *inc, struct iov_iter *to);
void (*inc_free)(struct rds_incoming *inc);
+ int (*inc_to_sg_get)(struct rds_incoming *inc, struct scatterlist **sg);
int (*cm_handle_connect)(struct rdma_cm_id *cm_id,
struct rdma_cm_event *event, bool isv6);
diff --git a/net/rds/recv.c b/net/rds/recv.c
index 1271965..424042e 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -290,6 +290,8 @@ void rds_recv_incoming(struct rds_connection *conn, struct in6_addr *saddr,
struct sock *sk;
unsigned long flags;
struct rds_conn_path *cp;
+ struct sk_filter *filter;
+ int result = __SOCKSG_PASS;
inc->i_conn = conn;
inc->i_rx_jiffies = jiffies;
@@ -374,6 +376,16 @@ void rds_recv_incoming(struct rds_connection *conn, struct in6_addr *saddr,
/* We can be racing with rds_release() which marks the socket dead. */
sk = rds_rs_to_sk(rs);
+ rcu_read_lock();
+ filter = rcu_dereference(sk->sk_filter);
+ if (filter && conn->c_trans->inc_to_sg_get) {
+ struct scatterlist *sg = NULL;
+
+ if (conn->c_trans->inc_to_sg_get(inc, &sg) == 0)
+ result = sg_filter_run(sk, sg);
+ }
+ rcu_read_unlock();
+
/* serialize with rds_release -> sock_orphan */
write_lock_irqsave(&rs->rs_recv_lock, flags);
if (!sock_flag(sk, SOCK_DEAD)) {
diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index b9bbcf3..b0683e6 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -464,6 +464,7 @@ struct rds_transport rds_tcp_transport = {
.conn_path_shutdown = rds_tcp_conn_path_shutdown,
.inc_copy_to_user = rds_tcp_inc_copy_to_user,
.inc_free = rds_tcp_inc_free,
+ .inc_to_sg_get = rds_tcp_inc_to_sg_get,
.stats_info_copy = rds_tcp_stats_info_copy,
.exit = rds_tcp_exit,
.t_owner = THIS_MODULE,
diff --git a/net/rds/tcp.h b/net/rds/tcp.h
index 3c69361..e4ea16e 100644
--- a/net/rds/tcp.h
+++ b/net/rds/tcp.h
@@ -7,6 +7,7 @@
struct rds_tcp_incoming {
struct rds_incoming ti_inc;
struct sk_buff_head ti_skb_list;
+ struct scatterlist *sg;
};
struct rds_tcp_connection {
@@ -82,6 +83,7 @@ void rds_tcp_restore_callbacks(struct socket *sock,
int rds_tcp_recv_path(struct rds_conn_path *cp);
void rds_tcp_inc_free(struct rds_incoming *inc);
int rds_tcp_inc_copy_to_user(struct rds_incoming *inc, struct iov_iter *to);
+int rds_tcp_inc_to_sg_get(struct rds_incoming *inc, struct scatterlist **sg);
/* tcp_send.c */
void rds_tcp_xmit_path_prepare(struct rds_conn_path *cp);
diff --git a/net/rds/tcp_recv.c b/net/rds/tcp_recv.c
index 42c5ff1..22d84f2 100644
--- a/net/rds/tcp_recv.c
+++ b/net/rds/tcp_recv.c
@@ -50,14 +50,113 @@ static void rds_tcp_inc_purge(struct rds_incoming *inc)
void rds_tcp_inc_free(struct rds_incoming *inc)
{
struct rds_tcp_incoming *tinc;
+ int i;
+
tinc = container_of(inc, struct rds_tcp_incoming, ti_inc);
rds_tcp_inc_purge(inc);
+
+ if (tinc->sg) {
+ for (i = 0; i < sg_nents(tinc->sg); i++) {
+ struct page *page;
+
+ page = sg_page(&tinc->sg[i]);
+ put_page(page);
+ }
+ kfree(tinc->sg);
+ }
+
rdsdebug("freeing tinc %p inc %p\n", tinc, inc);
kmem_cache_free(rds_tcp_incoming_slab, tinc);
}
+#define MAX_SG MAX_SKB_FRAGS
+int rds_tcp_inc_to_sg_get(struct rds_incoming *inc, struct scatterlist **sg)
+{
+ struct rds_tcp_incoming *tinc;
+ struct sk_buff *skb;
+ int num_sg = 0;
+ int i;
+
+ tinc = container_of(inc, struct rds_tcp_incoming, ti_inc);
+
+ /* For now we are assuming that the max sg elements we need is MAX_SG.
+ * To determine actual number of sg elements we need to traverse the
+ * skb queue e.g.
+ *
+ * skb_queue_walk(&tinc->ti_skb_list, skb) {
+ * num_sg += skb_shinfo(skb)->nr_frags + 1;
+ * }
+ */
+ tinc->sg = kzalloc(sizeof(*tinc->sg) * MAX_SG, GFP_KERNEL);
+ if (!tinc->sg)
+ return -ENOMEM;
+
+ sg_init_table(tinc->sg, MAX_SG);
+ skb_queue_walk(&tinc->ti_skb_list, skb) {
+ num_sg += skb_to_sgvec_nomark(skb, &tinc->sg[num_sg], 0,
+ skb->len);
+ }
+
+ /* packet can have zero length */
+ if (num_sg <= 0) {
+ kfree(tinc->sg);
+ tinc->sg = NULL;
+ return -ENODATA;
+ }
+
+ sg_mark_end(&tinc->sg[num_sg - 1]);
+ *sg = tinc->sg;
+
+ for (i = 0; i < num_sg; i++)
+ get_page(sg_page(&tinc->sg[i]));
+
+ return 0;
+}
+
+static int rds_tcp_inc_copy_sg_to_user(struct rds_incoming *inc,
+ struct iov_iter *to)
+{
+ struct rds_tcp_incoming *tinc;
+ struct scatterlist *sg;
+ unsigned long copied = 0;
+ unsigned long len;
+ u8 i = 0;
+
+ tinc = container_of(inc, struct rds_tcp_incoming, ti_inc);
+ len = be32_to_cpu(inc->i_hdr.h_len);
+ sg = tinc->sg;
+
+ do {
+ struct page *page;
+ unsigned long n, copy, to_copy;
+
+ sg = &tinc->sg[i];
+ copy = sg->length;
+ page = sg_page(sg);
+ to_copy = iov_iter_count(to);
+ to_copy = min_t(unsigned long, to_copy, copy);
+
+ n = copy_page_to_iter(page, sg->offset, to_copy, to);
+ if (n != to_copy)
+ return -EFAULT;
+
+ rds_stats_add(s_copy_to_user, to_copy);
+ copied += to_copy;
+ sg->offset += to_copy;
+ sg->length -= to_copy;
+
+ if (!sg->length)
+ i++;
+
+ if (copied == len)
+ break;
+ } while (i != sg_nents(tinc->sg));
+ return copied;
+}
/*
- * this is pretty lame, but, whatever.
+ * This is pretty lame, but, whatever.
+ * Note: a bpf filter can change the RDS packet; if it does, the modified
+ * packet is held in the form of a scatterlist, not an skb.
*/
int rds_tcp_inc_copy_to_user(struct rds_incoming *inc, struct iov_iter *to)
{
@@ -70,6 +169,12 @@ int rds_tcp_inc_copy_to_user(struct rds_incoming *inc, struct iov_iter *to)
tinc = container_of(inc, struct rds_tcp_incoming, ti_inc);
+ /* If tinc->sg is not NULL, the bpf filter ran on the packet, so the
+ * packet is now in the form of a scatterlist.
+ */
+ if (tinc->sg)
+ return rds_tcp_inc_copy_sg_to_user(inc, to);
+
skb_queue_walk(&tinc->ti_skb_list, skb) {
unsigned long to_copy, skb_off;
for (skb_off = 0; skb_off < skb->len; skb_off += to_copy) {
@@ -176,6 +281,7 @@ static int rds_tcp_data_recv(read_descriptor_t *desc, struct sk_buff *skb,
desc->error = -ENOMEM;
goto out;
}
+ tinc->sg = NULL;
tc->t_tinc = tinc;
rdsdebug("alloced tinc %p\n", tinc);
rds_inc_path_init(&tinc->ti_inc, cp,
--
1.8.3.1
* [PATCH net-next 5/5] ebpf: Add sample ebpf program for SOCKET_SG_FILTER
From: Tushar Dave @ 2018-09-11 19:38 UTC (permalink / raw)
To: ast, daniel, davem, santosh.shilimkar, jakub.kicinski,
quentin.monnet, jiong.wang, sandipan, john.fastabend, kafai, rdna,
yhs, netdev, rds-devel, sowmini.varadhan
In-Reply-To: <1536694684-3200-1-git-send-email-tushar.n.dave@oracle.com>
Add a sample program that shows how a socksg program is loaded and
attached to a socket as a filter. The kernel-side sample program operates
on the struct scatterlist that is passed as the bpf context.
When run in server mode, the sample RDS program opens a PF_RDS socket and
attaches the eBPF program to it. The eBPF program then uses the
bpf_msg_pull_data() helper to inspect the packet data contained in the
struct scatterlist and returns an action code to the kernel.
To ease testing, RDS client functionality is also included so that users
can generate RDS packets.
Server:
[root@lab71 bpf]# ./rds_filter -s 192.168.3.71 -t tcp
running server in a loop
transport tcp
server bound to address: 192.168.3.71 port 4000
server listening on 192.168.3.71
Client:
[root@lab70 bpf]# ./rds_filter -s 192.168.3.71 -c 192.168.3.70 -t tcp
transport tcp
client bound to address: 192.168.3.70 port 25278
client sending 8192 byte message from 192.168.3.70 to 192.168.3.71 on
port 25278
payload contains:30 31 32 33 34 35 36 37 38 39 ...
Server output:
192.168.3.71 received a packet from 192.168.3.71 of len 8192 cmsg len 0,
on port 25278
payload contains:30 31 32 33 34 35 36 37 38 39 ...
server listening on 192.168.3.71
[root@lab71 tushar]# cat /sys/kernel/debug/tracing/trace_pipe
<idle>-0 [038] ..s. 146.947362: 0: 30 31 32
<idle>-0 [038] ..s. 146.947364: 0: 33 34 35
Similarly, specifying '-t ib' runs the same test over an IB link.
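One detail in the server output above: the line reports the same address
(192.168.3.71) as both receiver and sender, although the client is bound to
192.168.3.70. A likely cause, stated here as an assumption rather than
something confirmed in this thread, is that the sample passes two
inet_ntoa() results to a single printf(). inet_ntoa() may return a pointer
to a static buffer, so both arguments can end up pointing at the same
string. A minimal sketch of one way around that, using inet_ntop() with
separate buffers (format_rx_line() is a hypothetical helper, not part of
the patch):

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>

/*
 * Format "<local> received a packet from <peer>" into out.
 * Each address is converted into its own buffer, so the two strings cannot
 * alias each other. Returns 0 on success, -1 on conversion failure or if
 * out is too small. Hypothetical helper, not part of the patch.
 */
static int format_rx_line(const struct in_addr *local,
			  const struct in_addr *peer,
			  char *out, size_t outlen)
{
	char lbuf[INET_ADDRSTRLEN];
	char pbuf[INET_ADDRSTRLEN];
	int n;

	assert(local && peer && out);

	if (!inet_ntop(AF_INET, local, lbuf, sizeof(lbuf)))
		return -1;
	if (!inet_ntop(AF_INET, peer, pbuf, sizeof(pbuf)))
		return -1;

	/* The texts may only match when the addresses themselves match. */
	assert(strcmp(lbuf, pbuf) != 0 || local->s_addr == peer->s_addr);

	n = snprintf(out, outlen, "%s received a packet from %s", lbuf, pbuf);
	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

A caller would pass &sin.sin_addr and &din.sin_addr and print the resulting
line, instead of calling inet_ntoa() twice in one printf() argument list.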
Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
samples/bpf/Makefile | 3 +
samples/bpf/rds_filter_kern.c | 42 ++++++
samples/bpf/rds_filter_user.c | 339 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 384 insertions(+)
create mode 100644 samples/bpf/rds_filter_kern.c
create mode 100644 samples/bpf/rds_filter_user.c
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index be0a961..bbac5ef 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -53,6 +53,7 @@ hostprogs-y += xdpsock
hostprogs-y += xdp_fwd
hostprogs-y += task_fd_query
hostprogs-y += xdp_sample_pkts
+hostprogs-y += rds_filter
# Libbpf dependencies
LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
@@ -109,6 +110,7 @@ xdpsock-objs := xdpsock_user.o
xdp_fwd-objs := xdp_fwd_user.o
task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS)
xdp_sample_pkts-objs := xdp_sample_pkts_user.o $(TRACE_HELPERS)
+rds_filter-objs := bpf_load.o rds_filter_user.o
# Tell kbuild to always build the programs
always := $(hostprogs-y)
@@ -167,6 +169,7 @@ always += xdpsock_kern.o
always += xdp_fwd_kern.o
always += task_fd_query_kern.o
always += xdp_sample_pkts_kern.o
+always += rds_filter_kern.o
KBUILD_HOSTCFLAGS += -I$(objtree)/usr/include
KBUILD_HOSTCFLAGS += -I$(srctree)/tools/lib/
diff --git a/samples/bpf/rds_filter_kern.c b/samples/bpf/rds_filter_kern.c
new file mode 100644
index 0000000..633e687
--- /dev/null
+++ b/samples/bpf/rds_filter_kern.c
@@ -0,0 +1,42 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/filter.h>
+#include <linux/ptrace.h>
+#include <linux/version.h>
+#include <uapi/linux/bpf.h>
+#include <linux/rds.h>
+#include "bpf_helpers.h"
+
+#define bpf_printk(fmt, ...) \
+({ \
+ char ____fmt[] = fmt; \
+ bpf_trace_printk(____fmt, sizeof(____fmt), \
+ ##__VA_ARGS__); \
+})
+
+SEC("socksg")
+int main_prog(struct sk_msg_md *msg)
+{
+ int start, end, err;
+ unsigned char *d;
+
+ start = 0;
+ end = 6;
+
+ err = bpf_msg_pull_data(msg, start, end, 0);
+ if (err) {
+ bpf_printk("socksg: pull_data err %i\n", err);
+ return SOCKSG_PASS;
+ }
+
+ if (msg->data + 6 > msg->data_end)
+ return SOCKSG_PASS;
+
+ d = (unsigned char *)msg->data;
+ bpf_printk("%x %x %x\n", d[0], d[1], d[2]);
+ bpf_printk("%x %x %x\n", d[3], d[4], d[5]);
+
+ return SOCKSG_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/rds_filter_user.c b/samples/bpf/rds_filter_user.c
new file mode 100644
index 0000000..1186345
--- /dev/null
+++ b/samples/bpf/rds_filter_user.c
@@ -0,0 +1,339 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <arpa/inet.h>
+#include <assert.h>
+#include "bpf_load.h"
+#include <getopt.h>
+#include <errno.h>
+#include <netinet/in.h>
+#include <limits.h>
+#include <linux/sockios.h>
+#include <linux/rds.h>
+#include <linux/errqueue.h>
+#include <linux/bpf.h>
+#include <strings.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <string.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <unistd.h>
+
+#define TESTPORT 4000
+#define BUFSIZE 8192
+
+int transport = -1;
+
+static int str2trans(const char *trans)
+{
+ if (strcmp(trans, "tcp") == 0)
+ return RDS_TRANS_TCP;
+ if (strcmp(trans, "ib") == 0)
+ return RDS_TRANS_IB;
+ return (RDS_TRANS_NONE);
+}
+
+static const char *trans2str(int trans)
+{
+ switch (trans) {
+ case RDS_TRANS_TCP:
+ return ("tcp");
+ case RDS_TRANS_IB:
+ return ("ib");
+ case RDS_TRANS_NONE:
+ return ("none");
+ default:
+ return ("unknown");
+ }
+}
+
+static int gettransport(int sock)
+{
+ int err;
+ int val;
+ socklen_t len = sizeof(int);
+
+ err = getsockopt(sock, SOL_RDS, SO_RDS_TRANSPORT,
+ (char *)&val, &len);
+ if (err < 0) {
+ fprintf(stderr, "%s: getsockopt %s\n",
+ __func__, strerror(errno));
+ return err;
+ }
+ return (int)val;
+}
+
+static int settransport(int sock, int transport)
+{
+ int err;
+
+ err = setsockopt(sock, SOL_RDS, SO_RDS_TRANSPORT,
+ (char *)&transport, sizeof(transport));
+ if (err < 0) {
+ fprintf(stderr, "could not set transport %s, %s\n",
+ trans2str(transport), strerror(errno));
+ }
+ return err;
+}
+
+static void print_sock_local_info(int fd, char *str, struct sockaddr_in *ret)
+{
+ socklen_t sin_size = sizeof(struct sockaddr_in);
+ struct sockaddr_in sin;
+ int err;
+
+ err = getsockname(fd, (struct sockaddr *)&sin, &sin_size);
+ if (err < 0) {
+ fprintf(stderr, "%s getsockname %s\n",
+ __func__, strerror(errno));
+ return;
+ }
+ printf("%s address: %s port %d\n",
+ (str ? str : ""), inet_ntoa(sin.sin_addr), ntohs(sin.sin_port));
+
+ if (ret != NULL)
+ *ret = sin;
+}
+
+static void print_payload(char *buf)
+{
+ int i;
+
+ printf("payload contains:");
+ for (i = 0; i < 10; i++)
+ printf("%x ", buf[i]);
+ printf("...\n");
+}
+
+static void server(char *address, in_port_t port)
+{
+ struct sockaddr_in sin, din;
+ struct msghdr msg;
+ struct iovec *iov;
+ int rc, sock;
+ char *buf;
+
+ buf = calloc(BUFSIZE, sizeof(char));
+ if (!buf) {
+ fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+ return;
+ }
+
+ sock = socket(PF_RDS, SOCK_SEQPACKET, 0);
+ if (sock < 0) {
+ fprintf(stderr, "%s: socket %s\n", __func__, strerror(errno));
+ goto out;
+ }
+ if (settransport(sock, transport) < 0)
+ goto out;
+
+ printf("transport %s\n", trans2str(gettransport(sock)));
+
+ memset(&sin, 0, sizeof(sin));
+ sin.sin_family = AF_INET;
+ sin.sin_addr.s_addr = inet_addr(address);
+ sin.sin_port = htons(port);
+
+ rc = bind(sock, (struct sockaddr *)&sin, sizeof(sin));
+ if (rc < 0) {
+ fprintf(stderr, "%s: bind %s\n", __func__, strerror(errno));
+ goto out;
+ }
+
+ /* attach bpf prog */
+ assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, prog_fd,
+ sizeof(prog_fd[0])) == 0);
+
+ print_sock_local_info(sock, "server bound to", NULL);
+
+ iov = calloc(1, sizeof(struct iovec));
+ if (!iov) {
+ fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+ goto out;
+ }
+
+ while (1) {
+ memset(buf, 0, BUFSIZE);
+ iov[0].iov_base = buf;
+ iov[0].iov_len = BUFSIZE;
+
+ memset(&msg, 0, sizeof(msg));
+ msg.msg_name = &din;
+ msg.msg_namelen = sizeof(din);
+ msg.msg_iov = iov;
+ msg.msg_iovlen = 1;
+
+ printf("server listening on %s\n", inet_ntoa(sin.sin_addr));
+
+ rc = recvmsg(sock, &msg, 0);
+ if (rc < 0) {
+ fprintf(stderr, "%s: recvmsg %s\n",
+ __func__, strerror(errno));
+ break;
+ }
+
+ printf("%s received a packet from %s of len %d cmsg len %d, on port %d\n",
+ inet_ntoa(sin.sin_addr),
+ inet_ntoa(din.sin_addr),
+ (uint32_t) iov[0].iov_len,
+ (uint32_t) msg.msg_controllen,
+ ntohs(din.sin_port));
+
+ print_payload(buf);
+ }
+ free(iov);
+out:
+ free(buf);
+}
+
+static void create_message(char *buf)
+{
+ unsigned int i;
+
+ for (i = 0; i < BUFSIZE; i++) {
+ buf[i] = i + 0x30;
+ }
+}
+
+static int build_rds_packet(struct msghdr *msg, char *buf)
+{
+ struct iovec *iov;
+
+ iov = calloc(1, sizeof(struct iovec));
+ if (!iov) {
+ fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+ return -1;
+ }
+
+ msg->msg_iov = iov;
+ msg->msg_iovlen = 1;
+
+ iov[0].iov_base = buf;
+ iov[0].iov_len = BUFSIZE * sizeof(char);
+
+ return 0;
+}
+
+static void client(char *localaddr, char *remoteaddr, in_port_t server_port)
+{
+ struct sockaddr_in sin, din;
+ struct msghdr msg;
+ int rc, sock;
+ char *buf;
+
+ buf = calloc(BUFSIZE, sizeof(char));
+ if (!buf) {
+ fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+ return;
+ }
+
+ create_message(buf);
+
+ sock = socket(PF_RDS, SOCK_SEQPACKET, 0);
+ if (sock < 0) {
+ fprintf(stderr, "%s: socket %s\n", __func__, strerror(errno));
+ goto out;
+ }
+
+ if (settransport(sock, transport) < 0)
+ goto out;
+
+ printf("transport %s\n", trans2str(gettransport(sock)));
+
+ memset(&sin, 0, sizeof(sin));
+ sin.sin_family = AF_INET;
+ sin.sin_addr.s_addr = inet_addr(localaddr);
+ sin.sin_port = 0;
+
+ rc = bind(sock, (struct sockaddr *)&sin, sizeof(sin));
+ if (rc < 0) {
+ fprintf(stderr, "%s: bind %s\n", __func__, strerror(errno));
+ goto out;
+ }
+ print_sock_local_info(sock, "client bound to", &sin);
+
+ memset(&msg, 0, sizeof(msg));
+ msg.msg_name = &din;
+ msg.msg_namelen = sizeof(din);
+
+ memset(&din, 0, sizeof(din));
+ din.sin_family = AF_INET;
+ din.sin_addr.s_addr = inet_addr(remoteaddr);
+ din.sin_port = htons(server_port);
+
+ rc = build_rds_packet(&msg, buf);
+ if (rc < 0)
+ goto out;
+
+ printf("client sending %d byte message from %s to %s on port %d\n",
+ (uint32_t) msg.msg_iov->iov_len, localaddr,
+ remoteaddr, ntohs(sin.sin_port));
+
+ rc = sendmsg(sock, &msg, 0);
+ if (rc < 0)
+ fprintf(stderr, "%s: sendmsg %s\n", __func__, strerror(errno));
+
+ print_payload(buf);
+
+ if (msg.msg_control)
+ free(msg.msg_control);
+ if (msg.msg_iov)
+ free(msg.msg_iov);
+out:
+ free(buf);
+
+ return;
+}
+
+static void usage(char *progname)
+{
+ fprintf(stderr, "Usage %s [-s srvaddr] [-c clientaddr] [-t transport]"
+ "\n", progname);
+}
+
+int main(int argc, char **argv)
+{
+ in_port_t server_port = TESTPORT;
+ char *serveraddr = NULL;
+ char *clientaddr = NULL;
+ char filename[256];
+ int opt;
+
+ while ((opt = getopt(argc, argv, "s:c:t:")) != -1) {
+ switch (opt) {
+ case 's':
+ serveraddr = optarg;
+ break;
+ case 'c':
+ clientaddr = optarg;
+ break;
+ case 't':
+ transport = str2trans(optarg);
+ if (transport == RDS_TRANS_NONE) {
+ fprintf(stderr,
+ "unknown transport %s\n", optarg);
+ usage(argv[0]);
+ return (-1);
+ }
+ break;
+ default:
+ usage(argv[0]);
+ return 1;
+ }
+ }
+
+ snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+ if (load_bpf_file(filename)) {
+ fprintf(stderr, "Error: load_bpf_file %s", bpf_log_buf);
+ return 1;
+ }
+
+ if (serveraddr && !clientaddr) {
+ printf("running server in a loop\n");
+ server(serveraddr, server_port);
+ } else if (serveraddr && clientaddr) {
+ client(clientaddr, serveraddr, server_port);
+ }
+
+ return 0;
+}
--
1.8.3.1
* [PATCH net-next 3/5] ebpf: Add sg_filter_run()
From: Tushar Dave @ 2018-09-11 19:38 UTC (permalink / raw)
To: ast, daniel, davem, santosh.shilimkar, jakub.kicinski,
quentin.monnet, jiong.wang, sandipan, john.fastabend, kafai, rdna,
yhs, netdev, rds-devel, sowmini.varadhan
In-Reply-To: <1536694684-3200-1-git-send-email-tushar.n.dave@oracle.com>
sg_filter_run() runs the eBPF program of type
BPF_PROG_TYPE_SOCKET_SG_FILTER that is attached to the socket. Such a
program operates on a struct scatterlist.
Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
include/linux/filter.h | 8 ++++++++
include/uapi/linux/bpf.h | 6 ++++++
net/core/filter.c | 35 +++++++++++++++++++++++++++++++++++
tools/include/uapi/linux/bpf.h | 6 ++++++
4 files changed, 55 insertions(+)
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 6791a0a..ae664a9 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1113,4 +1113,12 @@ struct bpf_sock_ops_kern {
*/
};
+enum __socksg_action {
+ __SOCKSG_PASS = 0,
+ __SOCKSG_DROP,
+ __SOCKSG_REDIRECT,
+};
+
+int sg_filter_run(struct sock *sk, struct scatterlist *sg);
+
#endif /* __LINUX_FILTER_H__ */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 6ec1e32..1e11789 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2428,6 +2428,12 @@ enum sk_action {
SK_PASS,
};
+enum socksg_action {
+ SOCKSG_PASS = 0,
+ SOCKSG_DROP,
+ SOCKSG_REDIRECT,
+};
+
/* user accessible metadata for SK_MSG packet hook, new fields must
* be added to the end of this structure
*/
diff --git a/net/core/filter.c b/net/core/filter.c
index 469c488..a3afc61 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -121,6 +121,41 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
}
EXPORT_SYMBOL(sk_filter_trim_cap);
+int sg_filter_run(struct sock *sk, struct scatterlist *sg)
+{
+ struct sk_filter *filter;
+ int result = 0;
+
+ if (!sg)
+ return result;
+
+ rcu_read_lock();
+ filter = rcu_dereference(sk->sk_filter);
+ if (filter) {
+ struct sk_msg_buff mb = {0};
+
+ memcpy(mb.sg_data, sg, sizeof(*sg) * MAX_SKB_FRAGS);
+ mb.sg_start = 0;
+ mb.sg_end = sg_nents(sg);
+ mb.data = sg_virt(sg);
+ mb.data_end = mb.data + sg->length;
+ mb.sg_copy[mb.sg_end - 1] = true;
+
+ result = BPF_PROG_RUN(filter->prog, &mb);
+
+ /* BPF prog may have changed mb.sg_data e.g. may linearize
+ * multiple scatterlist entries into one. Therefore, make sure
+ * to update original sg and mark the sg end.
+ */
+ memcpy(sg, mb.sg_data, sizeof(*sg) * MAX_SKB_FRAGS);
+ sg_mark_end(&sg[mb.sg_end - 1]);
+ }
+ rcu_read_unlock();
+
+ return result;
+}
+EXPORT_SYMBOL(sg_filter_run);
+
BPF_CALL_1(bpf_skb_get_pay_offset, struct sk_buff *, skb)
{
return skb_get_poff(skb);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 6ec1e32..1e11789 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2428,6 +2428,12 @@ enum sk_action {
SK_PASS,
};
+enum socksg_action {
+ SOCKSG_PASS = 0,
+ SOCKSG_DROP,
+ SOCKSG_REDIRECT,
+};
+
/* user accessible metadata for SK_MSG packet hook, new fields must
* be added to the end of this structure
*/
--
1.8.3.1
* [PATCH net-next 2/5] eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER
From: Tushar Dave @ 2018-09-11 19:38 UTC (permalink / raw)
To: ast, daniel, davem, santosh.shilimkar, jakub.kicinski,
quentin.monnet, jiong.wang, sandipan, john.fastabend, kafai, rdna,
yhs, netdev, rds-devel, sowmini.varadhan
In-Reply-To: <1536694684-3200-1-git-send-email-tushar.n.dave@oracle.com>
Add a new eBPF prog type, BPF_PROG_TYPE_SOCKET_SG_FILTER, which uses the
existing socket filter infrastructure for bpf program attach and load.
A SOCKET_SG_FILTER eBPF program receives a struct scatterlist as its bpf
context, in contrast to SOCKET_FILTER, which deals with struct sk_buff.
This is useful for kernel entities that don't have an skb to represent
packet data but want to run an eBPF socket filter on packet data that is
in the form of a struct scatterlist, e.g. IB/RDMA.
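The bpf_load.c change in this patch selects the program type by matching
the ELF section name prefix ("socksg"). As a rough standalone illustration
of that prefix match, here is a small userspace sketch. The enum values and
every name prefixed with sketch_ are placeholders invented for this
example; they are not the kernel's enum bpf_prog_type.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative stand-ins only; real values live in enum bpf_prog_type. */
enum sketch_prog_type {
	SKETCH_UNKNOWN = 0,
	SKETCH_SOCKET_FILTER,
	SKETCH_SK_MSG,
	SKETCH_SOCKET_SG_FILTER,
};

struct sketch_sec_map {
	const char *prefix;
	enum sketch_prog_type type;
};

/* Section-name prefixes, modeled on the strncmp() chain in bpf_load.c. */
static const struct sketch_sec_map sketch_sec_table[] = {
	{ "socksg", SKETCH_SOCKET_SG_FILTER },
	{ "socket", SKETCH_SOCKET_FILTER },
	{ "sk_msg", SKETCH_SK_MSG },
};

/* Return the program type whose prefix matches the start of sec. */
static enum sketch_prog_type sketch_sec_to_type(const char *sec)
{
	size_t i;

	assert(sec != NULL);

	for (i = 0; i < sizeof(sketch_sec_table) / sizeof(sketch_sec_table[0]); i++) {
		const char *prefix = sketch_sec_table[i].prefix;

		if (strncmp(sec, prefix, strlen(prefix)) == 0)
			return sketch_sec_table[i].type;
	}
	return SKETCH_UNKNOWN;
}
```

Because "socksg" and "socket" are both six characters long and differ
within those six characters, the order of the two entries does not affect
the result.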
Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
include/linux/bpf_types.h | 1 +
include/uapi/linux/bpf.h | 1 +
kernel/bpf/syscall.c | 1 +
kernel/bpf/verifier.c | 1 +
net/core/filter.c | 55 ++++++++++++++++++++++++++++++++++++++++--
samples/bpf/bpf_load.c | 11 ++++++---
tools/bpf/bpftool/prog.c | 1 +
tools/include/uapi/linux/bpf.h | 1 +
tools/lib/bpf/libbpf.c | 3 +++
tools/lib/bpf/libbpf.h | 2 ++
10 files changed, 72 insertions(+), 5 deletions(-)
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index cd26c09..7dc1503 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -16,6 +16,7 @@
BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops)
BPF_PROG_TYPE(BPF_PROG_TYPE_SK_SKB, sk_skb)
BPF_PROG_TYPE(BPF_PROG_TYPE_SK_MSG, sk_msg)
+BPF_PROG_TYPE(BPF_PROG_TYPE_SOCKET_SG_FILTER, socksg_filter)
#endif
#ifdef CONFIG_BPF_EVENTS
BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 66917a4..6ec1e32 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -152,6 +152,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_LWT_SEG6LOCAL,
BPF_PROG_TYPE_LIRC_MODE2,
BPF_PROG_TYPE_SK_REUSEPORT,
+ BPF_PROG_TYPE_SOCKET_SG_FILTER,
};
enum bpf_attach_type {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 3c9636f..5f302b7 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1361,6 +1361,7 @@ static int bpf_prog_load(union bpf_attr *attr)
if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
type != BPF_PROG_TYPE_CGROUP_SKB &&
+ type != BPF_PROG_TYPE_SOCKET_SG_FILTER &&
!capable(CAP_SYS_ADMIN))
return -EPERM;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f4ff0c5..17fc4d2 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1234,6 +1234,7 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
case BPF_PROG_TYPE_LWT_XMIT:
case BPF_PROG_TYPE_SK_SKB:
case BPF_PROG_TYPE_SK_MSG:
+ case BPF_PROG_TYPE_SOCKET_SG_FILTER:
if (meta)
return meta->pkt_access;
diff --git a/net/core/filter.c b/net/core/filter.c
index 0b40f95..469c488 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1140,7 +1140,8 @@ static void bpf_release_orig_filter(struct bpf_prog *fp)
static void __bpf_prog_release(struct bpf_prog *prog)
{
- if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER) {
+ if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER ||
+ prog->type == BPF_PROG_TYPE_SOCKET_SG_FILTER) {
bpf_prog_put(prog);
} else {
bpf_release_orig_filter(prog);
@@ -1539,10 +1540,16 @@ int sk_reuseport_attach_filter(struct sock_fprog *fprog, struct sock *sk)
static struct bpf_prog *__get_bpf(u32 ufd, struct sock *sk)
{
+ struct bpf_prog *prog;
+
if (sock_flag(sk, SOCK_FILTER_LOCKED))
return ERR_PTR(-EPERM);
- return bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_FILTER);
+ prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_FILTER);
+ if (IS_ERR(prog))
+ prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_SG_FILTER);
+
+ return prog;
}
int sk_attach_bpf(u32 ufd, struct sock *sk)
@@ -4935,6 +4942,17 @@ bool bpf_helper_changes_pkt_data(void *func)
}
static const struct bpf_func_proto *
+socksg_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+ switch (func_id) {
+ case BPF_FUNC_msg_pull_data:
+ return &bpf_msg_pull_data_proto;
+ default:
+ return bpf_base_func_proto(func_id);
+ }
+}
+
+static const struct bpf_func_proto *
tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
{
switch (func_id) {
@@ -6753,6 +6771,30 @@ static u32 sk_skb_convert_ctx_access(enum bpf_access_type type,
return insn - insn_buf;
}
+static u32 socksg_filter_convert_ctx_access(enum bpf_access_type type,
+ const struct bpf_insn *si,
+ struct bpf_insn *insn_buf,
+ struct bpf_prog *prog,
+ u32 *target_size)
+{
+ struct bpf_insn *insn = insn_buf;
+
+ switch (si->off) {
+ case offsetof(struct sk_msg_md, data):
+ *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_msg_buff, data),
+ si->dst_reg, si->src_reg,
+ offsetof(struct sk_msg_buff, data));
+ break;
+ case offsetof(struct sk_msg_md, data_end):
+ *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_msg_buff, data_end),
+ si->dst_reg, si->src_reg,
+ offsetof(struct sk_msg_buff, data_end));
+ break;
+ }
+
+ return insn - insn_buf;
+}
+
static u32 sk_msg_convert_ctx_access(enum bpf_access_type type,
const struct bpf_insn *si,
struct bpf_insn *insn_buf,
@@ -6891,6 +6933,15 @@ static u32 sk_msg_convert_ctx_access(enum bpf_access_type type,
.test_run = bpf_prog_test_run_skb,
};
+const struct bpf_verifier_ops socksg_filter_verifier_ops = {
+ .get_func_proto = socksg_filter_func_proto,
+ .is_valid_access = sk_msg_is_valid_access,
+ .convert_ctx_access = socksg_filter_convert_ctx_access,
+};
+
+const struct bpf_prog_ops socksg_filter_prog_ops = {
+};
+
const struct bpf_verifier_ops tc_cls_act_verifier_ops = {
.get_func_proto = tc_cls_act_func_proto,
.is_valid_access = tc_cls_act_is_valid_access,
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 904e775..3b1697d 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -69,6 +69,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
bool is_sockops = strncmp(event, "sockops", 7) == 0;
bool is_sk_skb = strncmp(event, "sk_skb", 6) == 0;
bool is_sk_msg = strncmp(event, "sk_msg", 6) == 0;
+ bool is_socksg = strncmp(event, "socksg", 6) == 0;
+
size_t insns_cnt = size / sizeof(struct bpf_insn);
enum bpf_prog_type prog_type;
char buf[256];
@@ -102,6 +104,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
prog_type = BPF_PROG_TYPE_SK_SKB;
} else if (is_sk_msg) {
prog_type = BPF_PROG_TYPE_SK_MSG;
+ } else if (is_socksg) {
+ prog_type = BPF_PROG_TYPE_SOCKET_SG_FILTER;
} else {
printf("Unknown event '%s'\n", event);
return -1;
@@ -122,8 +126,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
if (is_xdp || is_perf_event || is_cgroup_skb || is_cgroup_sk)
return 0;
- if (is_socket || is_sockops || is_sk_skb || is_sk_msg) {
- if (is_socket)
+ if (is_socket || is_sockops || is_sk_skb || is_sk_msg || is_socksg) {
+ if (is_socket || is_socksg)
event += 6;
else
event += 7;
@@ -627,7 +631,8 @@ static int do_load_bpf_file(const char *path, fixup_map_cb fixup_map)
memcmp(shname, "cgroup/", 7) == 0 ||
memcmp(shname, "sockops", 7) == 0 ||
memcmp(shname, "sk_skb", 6) == 0 ||
- memcmp(shname, "sk_msg", 6) == 0) {
+ memcmp(shname, "sk_msg", 6) == 0 ||
+ memcmp(shname, "socksg", 6) == 0) {
ret = load_and_attach(shname, data->d_buf,
data->d_size);
if (ret != 0)
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index dce960d..9c57c4e 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -74,6 +74,7 @@
[BPF_PROG_TYPE_RAW_TRACEPOINT] = "raw_tracepoint",
[BPF_PROG_TYPE_CGROUP_SOCK_ADDR] = "cgroup_sock_addr",
[BPF_PROG_TYPE_LIRC_MODE2] = "lirc_mode2",
+ [BPF_PROG_TYPE_SOCKET_SG_FILTER] = "socket_sg_filter",
};
static void print_boot_time(__u64 nsecs, char *buf, unsigned int size)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 66917a4..6ec1e32 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -152,6 +152,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_LWT_SEG6LOCAL,
BPF_PROG_TYPE_LIRC_MODE2,
BPF_PROG_TYPE_SK_REUSEPORT,
+ BPF_PROG_TYPE_SOCKET_SG_FILTER,
};
enum bpf_attach_type {
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 2abd0f1..a7ac51c 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1502,6 +1502,7 @@ static bool bpf_prog_type__needs_kver(enum bpf_prog_type type)
case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
case BPF_PROG_TYPE_LIRC_MODE2:
case BPF_PROG_TYPE_SK_REUSEPORT:
+ case BPF_PROG_TYPE_SOCKET_SG_FILTER:
return false;
case BPF_PROG_TYPE_UNSPEC:
case BPF_PROG_TYPE_KPROBE:
@@ -2077,6 +2078,7 @@ static bool bpf_program__is_type(struct bpf_program *prog,
BPF_PROG_TYPE_FNS(raw_tracepoint, BPF_PROG_TYPE_RAW_TRACEPOINT);
BPF_PROG_TYPE_FNS(xdp, BPF_PROG_TYPE_XDP);
BPF_PROG_TYPE_FNS(perf_event, BPF_PROG_TYPE_PERF_EVENT);
+BPF_PROG_TYPE_FNS(socket_sg_filter, BPF_PROG_TYPE_SOCKET_SG_FILTER);
void bpf_program__set_expected_attach_type(struct bpf_program *prog,
enum bpf_attach_type type)
@@ -2129,6 +2131,7 @@ void bpf_program__set_expected_attach_type(struct bpf_program *prog,
BPF_SA_PROG_SEC("cgroup/sendmsg6", BPF_CGROUP_UDP6_SENDMSG),
BPF_S_PROG_SEC("cgroup/post_bind4", BPF_CGROUP_INET4_POST_BIND),
BPF_S_PROG_SEC("cgroup/post_bind6", BPF_CGROUP_INET6_POST_BIND),
+ BPF_PROG_SEC("socksg", BPF_PROG_TYPE_SOCKET_SG_FILTER),
};
#undef BPF_PROG_SEC
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 96c55fa..7527ea4 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -208,6 +208,7 @@ int bpf_program__set_prep(struct bpf_program *prog, int nr_instance,
void bpf_program__set_type(struct bpf_program *prog, enum bpf_prog_type type);
void bpf_program__set_expected_attach_type(struct bpf_program *prog,
enum bpf_attach_type type);
+int bpf_program__set_socket_sg_filter(struct bpf_program *prog);
bool bpf_program__is_socket_filter(struct bpf_program *prog);
bool bpf_program__is_tracepoint(struct bpf_program *prog);
@@ -217,6 +218,7 @@ void bpf_program__set_expected_attach_type(struct bpf_program *prog,
bool bpf_program__is_sched_act(struct bpf_program *prog);
bool bpf_program__is_xdp(struct bpf_program *prog);
bool bpf_program__is_perf_event(struct bpf_program *prog);
+bool bpf_program__is_socket_sg_filter(struct bpf_program *prog);
/*
* No need for __attribute__((packed)), all members of 'bpf_map_def'
--
1.8.3.1
* [PATCH net-next 0/5] eBPF and struct scatterlist
From: Tushar Dave @ 2018-09-11 19:37 UTC (permalink / raw)
To: ast, daniel, davem, santosh.shilimkar, jakub.kicinski,
quentin.monnet, jiong.wang, sandipan, john.fastabend, kafai, rdna,
yhs, netdev, rds-devel, sowmini.varadhan
This non-RFC patch set is a follow-up to the RFC v3 that was sent earlier.
(https://www.spinics.net/lists/netdev/msg519380.html)
The following changes have been made in this patch set:
RFC v3 -> this patch set:
- "RFC v3 patch 3" is removed; it is no longer needed because
bpf_msg_pull_data() now has all the required bug fixes. Thanks, Daniel.
- Use __GFP_COMP while allocating pages in bpf_msg_pull_data() to avoid
the page_copy_sane warning when an sg page is used in copy_page_to_iter()
(patch 1)
- In sg_filter_run(), after the BPF prog returns, mb.sg_data may have
changed, e.g. if the prog linearized multiple scatterlist entries into one.
Therefore, make sure to update the original sg and mark the sg end
correctly before returning. (patch 3)
- A BPF program can write/modify the RDS packet; in that case the
modified packet data is represented as a scatterlist. Therefore use the
scatterlist (not the skb) while copying the payload back to userspace.
Also carefully release the scatterlist and its associated pages, e.g.
get_page()/put_page() (patch 4)
Details:
--------
eBPF: Patch 1 uses __GFP_COMP while allocating pages in bpf_msg_pull_data()
to avoid the page_copy_sane warning.
eBPF: Patch 2 adds a new eBPF prog type, BPF_PROG_TYPE_SOCKET_SG_FILTER,
which uses the existing socket filter infrastructure for bpf program
attach and load. An eBPF program of type BPF_PROG_TYPE_SOCKET_SG_FILTER
receives a struct scatterlist as its bpf context, in contrast to
BPF_PROG_TYPE_SOCKET_FILTER, which deals with struct sk_buff. This new eBPF
program type allows a socket filter to run on packet data that is in the
form of a struct scatterlist.
eBPF: Patch 3 adds sg_filter_run(), which runs a
BPF_PROG_TYPE_SOCKET_SG_FILTER program.
RDS: Patch 4 allows rds_recv_incoming() to invoke a socket filter program
that deals with struct scatterlist.
bpf/samples: Patch 5 adds a socket filter eBPF sample program that uses
patches 1 to 4. The sample program opens an RDS socket, attaches an eBPF
program (socksg, i.e. BPF_PROG_TYPE_SOCKET_SG_FILTER) to the RDS socket and
uses the bpf_msg_pull_data() helper to inspect RDS packet data. As a test,
the current sample program only prints the first few bytes of packet data.
Background:
-----------
The motivation for this work is to allow eBPF based firewalling for
kernel modules that do not always get their packet as an sk_buff from
their downlink drivers. One such instance of this use-case is RDS, which
can be run both over IB (driver RDMA's a scatterlist to the RDS module)
or over TCP (TCP passes an sk_buff to the RDS module).
This patchset uses the existing socket filter infrastructure and extends it
with a new eBPF program type that deals with struct scatterlist.
The existing bpf helper bpf_msg_pull_data() is used to inspect packet data
that is in the form of struct scatterlist. For RDS, the integrated approach
treats the scatterlist as the common denominator, and allows the
application to write a filter for processing a scatterlist.
Testing:
---------
To confirm data accuracy and results, RDS packets of various sizes have
been tested with the socksg program along with various start and end values
for bpf_msg_pull_data(). All such tests show accurate results.
Thanks.
-Tushar
Tushar Dave (5):
bpf: use __GFP_COMP while allocating page
eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER
ebpf: Add sg_filter_run()
rds: invoke socket sg filter attached to rds socket
ebpf: Add sample ebpf program for SOCKET_SG_FILTER
include/linux/bpf_types.h | 1 +
include/linux/filter.h | 8 +
include/uapi/linux/bpf.h | 7 +
kernel/bpf/syscall.c | 1 +
kernel/bpf/verifier.c | 1 +
net/core/filter.c | 93 ++++++++++-
net/rds/ib.c | 1 +
net/rds/ib.h | 1 +
net/rds/ib_recv.c | 12 ++
net/rds/rds.h | 1 +
net/rds/recv.c | 12 ++
net/rds/tcp.c | 1 +
net/rds/tcp.h | 2 +
net/rds/tcp_recv.c | 108 ++++++++++++-
samples/bpf/Makefile | 3 +
samples/bpf/bpf_load.c | 11 +-
samples/bpf/rds_filter_kern.c | 42 +++++
samples/bpf/rds_filter_user.c | 339 +++++++++++++++++++++++++++++++++++++++++
tools/bpf/bpftool/prog.c | 1 +
tools/include/uapi/linux/bpf.h | 7 +
tools/lib/bpf/libbpf.c | 3 +
tools/lib/bpf/libbpf.h | 2 +
22 files changed, 650 insertions(+), 7 deletions(-)
create mode 100644 samples/bpf/rds_filter_kern.c
create mode 100644 samples/bpf/rds_filter_user.c
--
1.8.3.1
^ permalink raw reply
* [PATCH net-next 1/5] bpf: use __GFP_COMP while allocating page
From: Tushar Dave @ 2018-09-11 19:38 UTC (permalink / raw)
To: ast, daniel, davem, santosh.shilimkar, jakub.kicinski,
quentin.monnet, jiong.wang, sandipan, john.fastabend, kafai, rdna,
yhs, netdev, rds-devel, sowmini.varadhan
In-Reply-To: <1536694684-3200-1-git-send-email-tushar.n.dave@oracle.com>
Helper bpf_msg_pull_data() can allocate multiple pages while
linearizing multiple scatterlist elements into one shared page.
However, if the shared page has size > PAGE_SIZE, using
copy_page_to_iter() causes the warning below.
e.g.
[ 6367.019832] WARNING: CPU: 2 PID: 7410 at lib/iov_iter.c:825
page_copy_sane.part.8+0x0/0x8
To avoid the above warning, use __GFP_COMP while allocating multiple
contiguous pages.
Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
---
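Note (illustration only, not part of the patch): the warning concerns
allocations larger than one page. Below is a minimal userspace sketch of
the order arithmetic, assuming it mirrors what the kernel's get_order()
computes (the smallest order with PAGE_SIZE << order >= size). The names
order_for_size() and spans_multiple_pages() are made up for this example,
and the page size is passed in explicitly rather than taken from the
kernel.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Smallest order such that (page_size << order) >= size.
 * Userspace stand-in for get_order() semantics; not kernel code.
 */
unsigned int order_for_size(size_t size, size_t page_size)
{
	unsigned int order = 0;
	size_t span = page_size;

	assert(size > 0 && page_size > 0);
	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}

/* Nonzero when the allocation covers more than one page (order > 0). */
int spans_multiple_pages(size_t size, size_t page_size)
{
	return order_for_size(size, page_size) > 0;
}
```

With 4096-byte pages, a 5000-byte linearized region needs order 1, i.e.
two contiguous pages. That is the kind of allocation for which the patch
adds __GFP_COMP, so that the pages are set up as one compound page.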
net/core/filter.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/core/filter.c b/net/core/filter.c
index d301134..0b40f95 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2344,7 +2344,8 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
if (unlikely(bytes_sg_total > copy))
return -EINVAL;
- page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC, get_order(copy));
+ page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC | __GFP_COMP,
+ get_order(copy));
if (unlikely(!page))
return -ENOMEM;
p = page_address(page);
--
1.8.3.1
^ permalink raw reply related
* Re: [RFC] managing PHY carrier from user space
From: Joakim Tjernlund @ 2018-09-11 19:21 UTC (permalink / raw)
To: netdev@vger.kernel.org, f.fainelli@gmail.com, andrew@lunn.ch
In-Reply-To: <b1df2baf-4c6b-6645-4b6f-648ff22949d2@gmail.com>
On Tue, 2018-09-11 at 09:56 -0700, Florian Fainelli wrote:
> On 09/11/2018 09:41 AM, Joakim Tjernlund wrote:
> > I am looking for a way to induce carrier state from user space, primarily
> > for Fixed PHYs as these are always up. ifplugd/dhcp etc. does not behave properly
> > if the link is up when it really isn't.
>
> Was my suggestion in my email to you somehow not working? This is
> obviously not acceptable for upstream, there is no reason, even for a
> fixed PHY, to attempt to mangle with the carrier state for any
> reasonable production purposes.
Ohh, I never got that mail. Scanning the netdev archives I found it though, thanks.
I will go down the ndo_change_carrier() way and see whether I can work out what to
do w.r.t. the fixed link status callback.
Thanks
Jocke
^ permalink raw reply
* Re: kernels > v4.12 oops/crash with ipsec-traffic: bisected to b838d5e1c5b6e57b10ec8af2268824041e3ea911: ipv4: mark DST_NOGC and remove the operation of dst_free()
From: Tobias Hommel @ 2018-09-11 19:02 UTC (permalink / raw)
To: Wolfgang Walter
Cc: Steffen Klassert, Kristian Evensen, Network Development, weiwan,
edumazet
In-Reply-To: <2028376.H0yIdbXTXp@stwm.de>
> > Subject: [PATCH RFC] xfrm: Fix NULL pointer dereference when skb_dst_force
> > clears the dst_entry.
> >
> > Since commit 222d7dbd258d ("net: prevent dst uses after free")
> > skb_dst_force() might clear the dst_entry attached to the skb.
> > The xfrm code don't expect this to happen, so we crash with
> > a NULL pointer dereference in this case. Fix it by checking
> > skb_dst(skb) for NULL after skb_dst_force() and drop the packet
> > in cast the dst_entry was cleared.
> >
> > Fixes: 222d7dbd258d ("net: prevent dst uses after free")
> > Reported-by: Tobias Hommel <netdev-list@genoetigt.de>
> > Reported-by: Kristian Evensen <kristian.evensen@gmail.com>
> > Reported-by: Wolfgang Walter <linux@stwm.de>
> > Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
> > ---
> > net/xfrm/xfrm_output.c | 4 ++++
> > net/xfrm/xfrm_policy.c | 4 ++++
> > 2 files changed, 8 insertions(+)
> >
> > diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
> > index 89b178a78dc7..36d15a38ce5e 100644
> > --- a/net/xfrm/xfrm_output.c
> > +++ b/net/xfrm/xfrm_output.c
> > @@ -101,6 +101,10 @@ static int xfrm_output_one(struct sk_buff *skb, int
> > err) spin_unlock_bh(&x->lock);
> >
> > skb_dst_force(skb);
> > + if (!skb_dst(skb)) {
> > + XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTERROR);
> > + goto error_nolock;
> > + }
> >
> > if (xfrm_offload(skb)) {
> > x->type_offload->encap(x, skb);
> > diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
> > index 7c5e8978aeaa..626e0f4d1749 100644
> > --- a/net/xfrm/xfrm_policy.c
> > +++ b/net/xfrm/xfrm_policy.c
> > @@ -2548,6 +2548,10 @@ int __xfrm_route_forward(struct sk_buff *skb,
> > unsigned short family) }
> >
> > skb_dst_force(skb);
> > + if (!skb_dst(skb)) {
> > + XFRM_INC_STATS(net, LINUX_MIB_XFRMFWDHDRERROR);
> > + return 0;
> > + }
> >
> > dst = xfrm_lookup(net, skb_dst(skb), &fl, NULL, XFRM_LOOKUP_QUEUE);
> > if (IS_ERR(dst)) {
>
> This patch fixes the problem here.
>
> XfrmFwdHdrError gets around 80 at the very beginning and remains so. Probably
> this happens when some route are changed/set then.
>
> Regards and thanks,
Same here, we've now been running stable for ~6 hours, XfrmFwdHdrError is at 220.
This is less than 1 lost packet per minute, which seems to be okay for now.
^ permalink raw reply
* Re: tools/bpf regression causing samples/bpf/ to hang
From: Björn Töpel @ 2018-09-11 19:01 UTC (permalink / raw)
To: yhs; +Cc: Netdev, ast, Daniel Borkmann, Jesper Dangaard Brouer
In-Reply-To: <a5a23097-c59f-42a2-eeff-050b825cde11@fb.com>
Den tis 11 sep. 2018 kl 20:21 skrev Yonghong Song <yhs@fb.com>:
>
>
>
> On 9/11/18 10:15 AM, Björn Töpel wrote:
> > Den tis 11 sep. 2018 kl 18:47 skrev Yonghong Song <yhs@fb.com>:
> >>
> >>
> >>
> >> On 9/11/18 4:11 AM, Björn Töpel wrote:
> >>> Hi Yonghong, I tried to run the XDP samples from the bpf-next tip
> >>> today, and was hit by a regression.
> >>>
> >>> Commit f7010770fbac ("tools/bpf: move bpf/lib netlink related
> >>> functions into a new file") adds a while(1) around the recv call in
> >>> bpf_set_link_xdp_fd making that call getting stuck in an infinite
> >>> loop.
> >>>
> >>> I simply removed the loop, and that solved my problem (patch below).
> >>>
> >>> However, I don't know if removing the loop would break bpftool for
> >>> you. If not, I can submit the patch as a proper one for bpf-next.
> >>
> >> Hi, Björn, thanks for reporting the problem.
> >> The while loop is needed since the "recv" syscall buffer size
> >> may not be big enough to hold all the returned information, in
> >> which cases, multiple "recv" calls are needed.
> >>
> >> Could you try the following patch to see whether it fixed your
> >> issue? Thanks!
> >>
> >
> > Nope, it doesn't -- but if you move that hunk after the for-loop it works.
>
> Could you try this patch?
>
Works! Thanks!
Tested-by: Björn Töpel <bjorn.topel@intel.com>
> commit 9a7fb19899ce87594fe8012f8a23fc8fc7b6b764 (HEAD -> fix)
> Author: Yonghong Song <yhs@fb.com>
> Date: Tue Sep 11 08:58:20 2018 -0700
>
> tools/bpf: fix a netlink recv issue
>
> Commit f7010770fbac ("tools/bpf: move bpf/lib netlink related
> functions into a new file") introduced a while loop for the
> netlink recv path. This while loop is needed since the
> buffer in recv syscall may not be enough to hold all the
> information and in such cases multiple recv calls are needed.
>
> There is a bug introduced by the above commit as
> the while loop may block on recv syscall if there is no
> more messages are expected. The netlink message header
> flag NLM_F_MULTI is used to indicate that more messages
> are expected and this patch fixed the bug by doing
> further recv syscall only if multipart message is expected.
>
> The patch added another fix regarding to message length of 0.
> When netlink recv returns message length of 0, there will be
> no more messages for returning data so the while loop
> can end.
>
> Fixes: f7010770fbac ("tools/bpf: move bpf/lib netlink related
> functions into a new file")
> Reported-by: Björn Töpel <bjorn.topel@intel.com>
> Signed-off-by: Yonghong Song <yhs@fb.com>
>
> diff --git a/tools/lib/bpf/netlink.c b/tools/lib/bpf/netlink.c
> index 469e068dd0c5..fde1d7bf8199 100644
> --- a/tools/lib/bpf/netlink.c
> +++ b/tools/lib/bpf/netlink.c
> @@ -65,18 +65,23 @@ static int bpf_netlink_recv(int sock, __u32 nl_pid,
> int seq,
> __dump_nlmsg_t _fn, dump_nlmsg_t fn,
> void *cookie)
> {
> + bool multipart = true;
> struct nlmsgerr *err;
> struct nlmsghdr *nh;
> char buf[4096];
> int len, ret;
>
> - while (1) {
> + while (multipart) {
> + multipart = false;
> len = recv(sock, buf, sizeof(buf), 0);
> if (len < 0) {
> ret = -errno;
> goto done;
> }
>
> + if (len == 0)
> + break;
> +
> for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
> nh = NLMSG_NEXT(nh, len)) {
> if (nh->nlmsg_pid != nl_pid) {
> @@ -87,6 +92,8 @@ static int bpf_netlink_recv(int sock, __u32 nl_pid,
> int seq,
> ret = -LIBBPF_ERRNO__INVSEQ;
> goto done;
> }
> + if (nh->nlmsg_flags & NLM_F_MULTI)
> + multipart = true;
> switch (nh->nlmsg_type) {
> case NLMSG_ERROR:
> err = (struct nlmsgerr *)NLMSG_DATA(nh);
>
>
> >
> > Cheers,
> > Björn
> >
> >> commit 3eb1c0249dfc3ea4ad61aa223dce32262af7e049 (HEAD -> fix)
> >> Author: Yonghong Song <yhs@fb.com>
> >> Date: Tue Sep 11 08:58:20 2018 -0700
> >>
> >> tools/bpf: fix a netlink recv issue
> >>
> >> Commit f7010770fbac ("tools/bpf: move bpf/lib netlink related
> >> functions into a new file") introduced a while loop for the
> >> netlink recv path. This while loop is needed since the
> >> buffer in recv syscall may not be big enough to hold all the
> >> information and in such cases multiple recv calls are needed.
> >>
> >> When netlink recv returns message length of 0, there will be
> >> no more messages for returning data so the while loop
> >> can end.
> >>
> >> Fixes: f7010770fbac ("tools/bpf: move bpf/lib netlink related
> >> functions into a new file")
> >> Reported-by: Björn Töpel <bjorn.topel@intel.com>
> >> Signed-off-by: Yonghong Song <yhs@fb.com>
> >>
> >> diff --git a/tools/lib/bpf/netlink.c b/tools/lib/bpf/netlink.c
> >> index 469e068dd0c5..37827319a50a 100644
> >> --- a/tools/lib/bpf/netlink.c
> >> +++ b/tools/lib/bpf/netlink.c
> >> @@ -77,6 +77,9 @@ static int bpf_netlink_recv(int sock, __u32 nl_pid,
> >> int seq,
> >> goto done;
> >> }
> >>
> >> + if (len == 0)
> >> + break;
> >> +
> >> for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
> >> nh = NLMSG_NEXT(nh, len)) {
> >> if (nh->nlmsg_pid != nl_pid) {
> >>
> >>
> >>>
> >>> Thanks!
> >>> Björn
> >>>
> >>> From: =?UTF-8?q?Bj=C3=B6rn=20T=C3=B6pel?= <bjorn.topel@intel.com>
> >>> Date: Tue, 11 Sep 2018 12:35:44 +0200
> >>> Subject: [PATCH] tools/bpf: remove loop around netlink recv
> >>>
> >>> Commit f7010770fbac ("tools/bpf: move bpf/lib netlink related
> >>> functions into a new file") moved the bpf_set_link_xdp_fd and split it
> >>> up into multiple functions. The added receive function
> >>> bpf_netlink_recv added a loop around the recv syscall leading to
> >>> multiple recv calls. This resulted in all XDP samples in the
> >>> samples/bpf/ to stop working, since they were stuck in a blocking
> >>> recv.
> >>>
> >>> This commits removes the while (1)-statement.
> >>>
> >>> Fixes: f7010770fbac ("tools/bpf: move bpf/lib netlink related
> >>> functions into a new file")
> >>> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> >>> ---
> >>> tools/lib/bpf/netlink.c | 64 ++++++++++++++++++++---------------------
> >>> 1 file changed, 31 insertions(+), 33 deletions(-)
> >>>
> >>> diff --git a/tools/lib/bpf/netlink.c b/tools/lib/bpf/netlink.c
> >>> index 469e068dd0c5..0eae1fbf46c6 100644
> >>> --- a/tools/lib/bpf/netlink.c
> >>> +++ b/tools/lib/bpf/netlink.c
> >>> @@ -70,41 +70,39 @@ static int bpf_netlink_recv(int sock, __u32 nl_pid, int seq,
> >>> char buf[4096];
> >>> int len, ret;
> >>>
> >>> - while (1) {
> >>> - len = recv(sock, buf, sizeof(buf), 0);
> >>> - if (len < 0) {
> >>> - ret = -errno;
> >>> + len = recv(sock, buf, sizeof(buf), 0);
> >>> + if (len < 0) {
> >>> + ret = -errno;
> >>> + goto done;
> >>> + }
> >>> +
> >>> + for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
> >>> + nh = NLMSG_NEXT(nh, len)) {
> >>> + if (nh->nlmsg_pid != nl_pid) {
> >>> + ret = -LIBBPF_ERRNO__WRNGPID;
> >>> goto done;
> >>> }
> >>> -
> >>> - for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
> >>> - nh = NLMSG_NEXT(nh, len)) {
> >>> - if (nh->nlmsg_pid != nl_pid) {
> >>> - ret = -LIBBPF_ERRNO__WRNGPID;
> >>> - goto done;
> >>> - }
> >>> - if (nh->nlmsg_seq != seq) {
> >>> - ret = -LIBBPF_ERRNO__INVSEQ;
> >>> - goto done;
> >>> - }
> >>> - switch (nh->nlmsg_type) {
> >>> - case NLMSG_ERROR:
> >>> - err = (struct nlmsgerr *)NLMSG_DATA(nh);
> >>> - if (!err->error)
> >>> - continue;
> >>> - ret = err->error;
> >>> - nla_dump_errormsg(nh);
> >>> - goto done;
> >>> - case NLMSG_DONE:
> >>> - return 0;
> >>> - default:
> >>> - break;
> >>> - }
> >>> - if (_fn) {
> >>> - ret = _fn(nh, fn, cookie);
> >>> - if (ret)
> >>> - return ret;
> >>> - }
> >>> + if (nh->nlmsg_seq != seq) {
> >>> + ret = -LIBBPF_ERRNO__INVSEQ;
> >>> + goto done;
> >>> + }
> >>> + switch (nh->nlmsg_type) {
> >>> + case NLMSG_ERROR:
> >>> + err = (struct nlmsgerr *)NLMSG_DATA(nh);
> >>> + if (!err->error)
> >>> + continue;
> >>> + ret = err->error;
> >>> + nla_dump_errormsg(nh);
> >>> + goto done;
> >>> + case NLMSG_DONE:
> >>> + return 0;
> >>> + default:
> >>> + break;
> >>> + }
> >>> + if (_fn) {
> >>> + ret = _fn(nh, fn, cookie);
> >>> + if (ret)
> >>> + return ret;
> >>> }
> >>> }
> >>> ret = 0;
> >>>
^ permalink raw reply
* Re: [PATCH net-next v3 02/17] zinc: introduce minimal cryptography library
From: Jason A. Donenfeld @ 2018-09-12 0:02 UTC (permalink / raw)
To: David Miller
Cc: Andrew Lunn, Eric Biggers, Greg Kroah-Hartman, Ard Biesheuvel,
LKML, Netdev, Andrew Lutomirski, Samuel Neves,
Jean-Philippe Aumasson, Linux Crypto Mailing List
In-Reply-To: <20180911.165739.2032677219588723041.davem@davemloft.net>
On Tue, Sep 11, 2018 at 5:57 PM David Miller <davem@davemloft.net> wrote:
> Both of Andrew's statements are completely true.
>
> I'm not looking at any of the networking bits until the crypto stuff
> is fully sorted and fully supported and Ack'd by crypto folks.
Seems reasonable to me.
Jason
^ permalink raw reply
* Re: [PATCH net-next v3 02/17] zinc: introduce minimal cryptography library
From: Jason A. Donenfeld @ 2018-09-12 0:01 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Eric Biggers, Greg Kroah-Hartman, Ard Biesheuvel, LKML, Netdev,
David Miller, Andrew Lutomirski, Samuel Neves,
Jean-Philippe Aumasson, Linux Crypto Mailing List
In-Reply-To: <2E01FBB6-030A-40AB-8BEE-F8F271A57568@amacapital.net>
Hi Andy,
On Tue, Sep 11, 2018 at 5:01 PM Andy Lutomirski <luto@amacapital.net> wrote:
> I think Ard’s point is valid: in the long run we don’t want two competing software implementations of each primitive. It clearly *should* be possible to make crypto API call into zinc for synchronous software operations, but a demonstration of how this actually works and that there isn’t some change to zinc to make it would well would be in order, I think.
>
> IMO the right approach is do one conversion right away and save the rest for later.
Alright, I'll go ahead and do this for v4. Thanks for the guidance.
Jason
^ permalink raw reply
* Re: [PATCH net-next v3 02/17] zinc: introduce minimal cryptography library
From: David Miller @ 2018-09-11 23:57 UTC (permalink / raw)
To: andrew
Cc: Jason, ebiggers, gregkh, ard.biesheuvel, linux-kernel, netdev,
luto, sneves, jeanphilippe.aumasson, linux-crypto
In-Reply-To: <20180911233015.GD11474@lunn.ch>
From: Andrew Lunn <andrew@lunn.ch>
Date: Wed, 12 Sep 2018 01:30:15 +0200
> Just as an FYI:
>
> 1) I don't think anybody in netdev has taken a serious look at the
> network code yet. There is little point until the controversial part
> of the code, Zinc, has been sorted out.
>
> 2) I personally would be surprised if DaveM took this code without
> having an Acked-by from the crypto subsystem people. In the same way,
> i doubt the crypto people would take an Ethernet driver without having
> DaveM's Acked-by.
Both of Andrew's statements are completely true.
I'm not looking at any of the networking bits until the crypto stuff
is fully sorted and fully supported and Ack'd by crypto folks.
^ permalink raw reply
* [Patch net-next] llc: avoid blocking in llc_sap_close()
From: Cong Wang @ 2018-09-11 18:42 UTC (permalink / raw)
To: netdev; +Cc: Cong Wang
llc_sap_close() is called by llc_sap_put() which
could be called in BH context in llc_rcv(). We can't
block in BH.
There is no reason to block it here, kfree_rcu() should
be sufficient.
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
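Note (illustration only, not part of the patch): the change swaps
"wait, then free" for "queue the free and return". Below is a userspace
toy model of that deferred-free pattern. It is not RCU: there are no
readers and no real grace period, and every name in it (toy_rcu_head,
toy_sap, toy_kfree_deferred, toy_grace_period_end) is hypothetical. It
only shows why the caller no longer has to block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for an embedded callback head; not the kernel's rcu_head. */
struct toy_rcu_head {
	struct toy_rcu_head *next;
	size_t offset;	/* offset of this head inside its enclosing object */
};

/* Hypothetical object with an embedded head, like the new llc_sap member. */
struct toy_sap {
	int sk_count;
	struct toy_rcu_head rcu;
};

/* Objects queued for freeing once the (simulated) grace period ends. */
static struct toy_rcu_head *pending;

/* Queue the enclosing object for a later free; returns without waiting. */
void toy_kfree_deferred(struct toy_rcu_head *head, size_t offset)
{
	assert(head != NULL);
	head->offset = offset;
	head->next = pending;
	pending = head;
}

/* Simulated end of grace period: free all queued objects, return count. */
int toy_grace_period_end(void)
{
	int freed = 0;

	while (pending) {
		struct toy_rcu_head *head = pending;

		pending = head->next;
		/* Step back from the embedded head to the object start. */
		free((char *)head - head->offset);
		freed++;
	}
	return freed;
}
```

In this model the closing path only links the object into a list, which
is safe in a context that must not sleep; the actual free happens later
from whoever runs toy_grace_period_end().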
include/net/llc.h | 1 +
net/llc/llc_core.c | 4 +---
2 files changed, 2 insertions(+), 3 deletions(-)
diff --git a/include/net/llc.h b/include/net/llc.h
index 890a87318014..df282d9b4017 100644
--- a/include/net/llc.h
+++ b/include/net/llc.h
@@ -66,6 +66,7 @@ struct llc_sap {
int sk_count;
struct hlist_nulls_head sk_laddr_hash[LLC_SK_LADDR_HASH_ENTRIES];
struct hlist_head sk_dev_hash[LLC_SK_DEV_HASH_ENTRIES];
+ struct rcu_head rcu;
};
static inline
diff --git a/net/llc/llc_core.c b/net/llc/llc_core.c
index 260b3dc1b4a2..64d4bef04e73 100644
--- a/net/llc/llc_core.c
+++ b/net/llc/llc_core.c
@@ -127,9 +127,7 @@ void llc_sap_close(struct llc_sap *sap)
list_del_rcu(&sap->node);
spin_unlock_bh(&llc_sap_list_lock);
- synchronize_rcu();
-
- kfree(sap);
+ kfree_rcu(sap, rcu);
}
static struct packet_type llc_packet_type __read_mostly = {
--
2.14.4
^ permalink raw reply related
* Re: [PATCH v2 net-next 0/4] net: batched receive in GRO path
From: Edward Cree @ 2018-09-11 18:34 UTC (permalink / raw)
To: Eric Dumazet, davem; +Cc: linux-net-drivers, netdev
In-Reply-To: <01bc9bf5-1780-2650-958f-961bd24b8c26@gmail.com>
On 07/09/18 03:32, Eric Dumazet wrote:
> Adding this complexity and icache pressure needs more experimental results.
> What about RPC workloads (eg 100 concurrent netperf -t TCP_RR -- -r 8000,8000 )
>
> Thanks.
Some more results. Note that the TCP_STREAM figures given in the cover
letter were '-m 1450'; when I run that with '-m 8000' I hit line rate on
my 10G NIC on both the old and new code. Also, these tests are still all
with IRQs bound to a single core on the RX side.
A further note: the Code Under Test is running on the netserver side (RX
side for TCP_STREAM tests); the netperf side is running stock RHEL7u3
(kernel 3.10.0-514.el7.x86_64). This potentially matters more for the
TCP_RR test as both sides have to receive data.
TCP_STREAM, 8000 bytes, GRO enabled (4 streams)
old: 9.415 Gbit/s
new: 9.417 Gbit/s
(Welch p = 0.087, n₁ = n₂ = 3)
There was however a noticeable reduction in *TX* CPU usage, of about 15%.
I don't know why that should be (changes in ack timing, perhaps?)
TCP_STREAM, 8000 bytes, GRO disabled (4 streams)
old: 5.200 Gbit/s
new: 5.839 Gbit/s (12.3% faster)
(Welch p < 0.001, n₁ = n₂ = 6)
TCP_RR, 8000 bytes, GRO enabled (100 streams)
(FoM is one-way latency, 0.5 / tps)
old: 855.833 us
new: 862.033 us (0.7% slower)
(Welch p = 0.040, n₁ = n₂ = 6)
TCP_RR, 8000 bytes, GRO disabled (100 streams)
old: 962.733 us
new: 871.417 us (9.5% faster)
(Welch p < 0.001, n₁ = n₂ = 6)
Conclusion: with GRO on we pay a small but real RR penalty. With GRO off
(thus also with traffic that can't be coalesced) we get a noticeable
speed boost from being able to use netif_receive_skb_list_internal().
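For anyone re-deriving the percentages above, here is a small sketch of
the arithmetic. The helper names are made up for this example; the
latency formula is the 0.5 / tps figure of merit quoted above, with tps
in transactions per second and the result scaled to microseconds.

```c
#include <assert.h>

/* One-way latency in microseconds from a transaction rate (0.5 / tps). */
double one_way_latency_us(double tps)
{
	assert(tps > 0.0);
	return 0.5 / tps * 1e6;
}

/* Relative change from old_val to new_val in percent (positive = increase). */
double pct_change(double old_val, double new_val)
{
	assert(old_val != 0.0);
	return (new_val - old_val) / old_val * 100.0;
}
```

Note the sign convention: for the throughput rows a positive change is an
improvement, while for the latency rows a negative change is the
improvement, e.g. pct_change(962.733, 871.417) is about -9.5.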
-Ed
^ permalink raw reply
* Re: [PATCH net-next v3 02/17] zinc: introduce minimal cryptography library
From: Andrew Lunn @ 2018-09-11 23:30 UTC (permalink / raw)
To: Jason A. Donenfeld
Cc: Eric Biggers, Greg Kroah-Hartman, Ard Biesheuvel, LKML, Netdev,
David Miller, Andrew Lutomirski, Samuel Neves,
Jean-Philippe Aumasson, Linux Crypto Mailing List
In-Reply-To: <CAHmME9rFUruF-VN1pmU-k5nFsb9ppAPhPpW-5Ho9dKL2HCg4kA@mail.gmail.com>
On Tue, Sep 11, 2018 at 04:02:52PM -0600, Jason A. Donenfeld wrote:
> On Tue, Sep 11, 2018 at 3:47 PM Eric Biggers <ebiggers@kernel.org> wrote:
> > Of course, the real problem is that even after multiple revisions of this
> > patchset, there's still no actual conversions of the existing crypto API
> > algorithms over to use the new implementations. "Zinc" is still completely
> > separate from the existing crypto API.
>
> No this is not, "the real problem [...] after multiple revisions"
> because I've offered to do this and stated pretty clearly my intent to
> do so. But, as I've mentioned before, I'd really prefer to land this
> series through net-next
Hi Jason
Just as an FYI:
1) I don't think anybody in netdev has taken a serious look at the
network code yet. There is little point until the controversial part
of the code, Zinc, has been sorted out.
2) I personally would be surprised if DaveM took this code without
having an Acked-by from the crypto subsystem people. In the same way,
I doubt the crypto people would take an Ethernet driver without having
DaveM's Acked-by.
Andrew
^ permalink raw reply
* Re: [PATCH v2] neighbour: confirm neigh entries when ARP packet is received
From: Vasiliy Khoruzhick @ 2018-09-11 18:23 UTC (permalink / raw)
To: Stephen Hemminger
Cc: David S. Miller, Roopa Prabhu, Alexey Dobriyan, Eric Dumazet,
Jim Westfall, Wolfgang Bumiller, Vasily Khoruzhick, Kees Cook,
Ihar Hrachyshka, netdev, linux-kernel
In-Reply-To: <20180911111217.3e3679c5@xeon-e3>
On Tue, Sep 11, 2018 at 11:12 AM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Tue, 11 Sep 2018 11:04:06 -0700
> Vasily Khoruzhick <vasilykh@arista.com> wrote:
>
>> diff --git a/net/core/neighbour.c b/net/core/neighbour.c
>> index aa19d86937af..56a554597db5 100644
>> --- a/net/core/neighbour.c
>> +++ b/net/core/neighbour.c
>> @@ -1180,6 +1180,12 @@ int neigh_update(struct neighbour *neigh, const u8 *lladdr, u8 new,
>> lladdr = neigh->ha;
>> }
>>
>> + /* Update confirmed timestamp for neighbour entry after we
>> + * received ARP packet even if it doesn't change IP to MAC binding.
>> + */
>> + if (new & NUD_CONNECTED)
>> + neigh->confirmed = jiffies;
>
> You might want to do:
> if ((new & NUD_CONNECTED) && neigh->confirmed != jiffies)
> neigh->confirmed = jiffies;
>
> This avoid poisoning the cacheline with unnecessary write.
Sorry for the duplicate; this time in plain text, so it should get
through the lkml filter:
I don't think that it's a performance-critical path, so this
optimization is unnecessary, and it doesn't improve code readability.
^ permalink raw reply
* Re: tools/bpf regression causing samples/bpf/ to hang
From: Yonghong Song @ 2018-09-11 18:21 UTC (permalink / raw)
To: Björn Töpel
Cc: Netdev, ast, Daniel Borkmann, Jesper Dangaard Brouer
In-Reply-To: <CAJ+HfNhXJbhp3z+fL-3iXJD-5yFtEL0P1OmdsTkmZHUMviuAcQ@mail.gmail.com>
On 9/11/18 10:15 AM, Björn Töpel wrote:
> Den tis 11 sep. 2018 kl 18:47 skrev Yonghong Song <yhs@fb.com>:
>>
>>
>>
>> On 9/11/18 4:11 AM, Björn Töpel wrote:
>>> Hi Yonghong, I tried to run the XDP samples from the bpf-next tip
>>> today, and was hit by a regression.
>>>
>>> Commit f7010770fbac ("tools/bpf: move bpf/lib netlink related
>>> functions into a new file") adds a while(1) around the recv call in
>>> bpf_set_link_xdp_fd making that call getting stuck in an infinite
>>> loop.
>>>
>>> I simply removed the loop, and that solved my problem (patch below).
>>>
>>> However, I don't know if removing the loop would break bpftool for
>>> you. If not, I can submit the patch as a proper one for bpf-next.
>>
>> Hi, Björn, thanks for reporting the problem.
>> The while loop is needed since the "recv" syscall buffer size
>> may not be big enough to hold all the returned information, in
>> which cases, multiple "recv" calls are needed.
>>
>> Could you try the following patch to see whether it fixed your
>> issue? Thanks!
>>
>
> Nope, it doesn't -- but if you move that hunk after the for-loop it works.
Could you try this patch?
commit 9a7fb19899ce87594fe8012f8a23fc8fc7b6b764 (HEAD -> fix)
Author: Yonghong Song <yhs@fb.com>
Date: Tue Sep 11 08:58:20 2018 -0700
tools/bpf: fix a netlink recv issue
Commit f7010770fbac ("tools/bpf: move bpf/lib netlink related
functions into a new file") introduced a while loop for the
netlink recv path. This while loop is needed since the
buffer in the recv syscall may not be big enough to hold all the
information, and in such cases multiple recv calls are needed.
There is a bug introduced by the above commit, as
the while loop may block on the recv syscall if no
more messages are expected. The netlink message header
flag NLM_F_MULTI is used to indicate that more messages
are expected, and this patch fixes the bug by doing
a further recv syscall only if a multipart message is expected.
The patch adds another fix regarding a message length of 0.
When netlink recv returns a message length of 0, there will be
no more messages for returning data, so the while loop
can end.
Fixes: f7010770fbac ("tools/bpf: move bpf/lib netlink related functions into a new file")
Reported-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
diff --git a/tools/lib/bpf/netlink.c b/tools/lib/bpf/netlink.c
index 469e068dd0c5..fde1d7bf8199 100644
--- a/tools/lib/bpf/netlink.c
+++ b/tools/lib/bpf/netlink.c
@@ -65,18 +65,23 @@ static int bpf_netlink_recv(int sock, __u32 nl_pid, int seq,
__dump_nlmsg_t _fn, dump_nlmsg_t fn,
void *cookie)
{
+ bool multipart = true;
struct nlmsgerr *err;
struct nlmsghdr *nh;
char buf[4096];
int len, ret;
- while (1) {
+ while (multipart) {
+ multipart = false;
len = recv(sock, buf, sizeof(buf), 0);
if (len < 0) {
ret = -errno;
goto done;
}
+ if (len == 0)
+ break;
+
for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
nh = NLMSG_NEXT(nh, len)) {
if (nh->nlmsg_pid != nl_pid) {
@@ -87,6 +92,8 @@ static int bpf_netlink_recv(int sock, __u32 nl_pid, int seq,
ret = -LIBBPF_ERRNO__INVSEQ;
goto done;
}
+ if (nh->nlmsg_flags & NLM_F_MULTI)
+ multipart = true;
switch (nh->nlmsg_type) {
case NLMSG_ERROR:
err = (struct nlmsgerr *)NLMSG_DATA(nh);
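
Note (illustration only, not part of the patch and not libbpf code): the
decision "should another recv() follow?" can be looked at in isolation.
The sketch below scans one receive buffer with the standard NLMSG_OK /
NLMSG_NEXT macros and reports whether more data is expected. The helper
names nl_expect_more() and nl_put_hdr() are made up for this example, and
nl_put_hdr() only builds payload-less headers for testing.

```c
#include <assert.h>
#include <string.h>
#include <linux/netlink.h>

/*
 * Scan one recv() buffer. Returns 1 if some message carries NLM_F_MULTI
 * and no NLMSG_DONE was seen (another recv() is expected), 0 otherwise.
 */
int nl_expect_more(void *buf, int len)
{
	struct nlmsghdr *nh;
	int multipart = 0;

	assert(buf != NULL);
	for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
	     nh = NLMSG_NEXT(nh, len)) {
		if (nh->nlmsg_type == NLMSG_DONE)
			return 0;	/* end of a multipart dump */
		if (nh->nlmsg_flags & NLM_F_MULTI)
			multipart = 1;
	}
	return multipart;
}

/*
 * Test helper: write a header-only message at buf + off and return the
 * offset just past it. The caller provides a large enough, aligned buffer.
 */
int nl_put_hdr(char *buf, int off, unsigned short type, unsigned short flags)
{
	struct nlmsghdr nh;

	memset(&nh, 0, sizeof(nh));
	nh.nlmsg_len = NLMSG_HDRLEN;
	nh.nlmsg_type = type;
	nh.nlmsg_flags = flags;
	memcpy(buf + off, &nh, sizeof(nh));
	return off + NLMSG_ALIGN(nh.nlmsg_len);
}
```

A plain ack (NLMSG_ERROR with no NLM_F_MULTI flag), as returned when
setting an XDP fd, yields 0 here, so no further blocking recv() would be
issued. An empty buffer (len == 0) also yields 0.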
>
> Cheers,
> Björn
>
>> commit 3eb1c0249dfc3ea4ad61aa223dce32262af7e049 (HEAD -> fix)
>> Author: Yonghong Song <yhs@fb.com>
>> Date: Tue Sep 11 08:58:20 2018 -0700
>>
>> tools/bpf: fix a netlink recv issue
>>
>> Commit f7010770fbac ("tools/bpf: move bpf/lib netlink related
>> functions into a new file") introduced a while loop for the
>> netlink recv path. This while loop is needed since the
>> buffer in recv syscall may not be big enough to hold all the
>> information and in such cases multiple recv calls are needed.
>>
>> When netlink recv returns message length of 0, there will be
>> no more messages for returning data so the while loop
>> can end.
>>
>> Fixes: f7010770fbac ("tools/bpf: move bpf/lib netlink related
>> functions into a new file")
>> Reported-by: Björn Töpel <bjorn.topel@intel.com>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>>
>> diff --git a/tools/lib/bpf/netlink.c b/tools/lib/bpf/netlink.c
>> index 469e068dd0c5..37827319a50a 100644
>> --- a/tools/lib/bpf/netlink.c
>> +++ b/tools/lib/bpf/netlink.c
>> @@ -77,6 +77,9 @@ static int bpf_netlink_recv(int sock, __u32 nl_pid,
>> int seq,
>> goto done;
>> }
>>
>> + if (len == 0)
>> + break;
>> +
>> for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
>> nh = NLMSG_NEXT(nh, len)) {
>> if (nh->nlmsg_pid != nl_pid) {
>>
>>
>>>
>>> Thanks!
>>> Björn
>>>
>>> From: =?UTF-8?q?Bj=C3=B6rn=20T=C3=B6pel?= <bjorn.topel@intel.com>
>>> Date: Tue, 11 Sep 2018 12:35:44 +0200
>>> Subject: [PATCH] tools/bpf: remove loop around netlink recv
>>>
>>> Commit f7010770fbac ("tools/bpf: move bpf/lib netlink related
>>> functions into a new file") moved the bpf_set_link_xdp_fd and split it
>>> up into multiple functions. The added receive function
>>> bpf_netlink_recv added a loop around the recv syscall leading to
>>> multiple recv calls. This resulted in all XDP samples in the
>>> samples/bpf/ to stop working, since they were stuck in a blocking
>>> recv.
>>>
>>> This commits removes the while (1)-statement.
>>>
>>> Fixes: f7010770fbac ("tools/bpf: move bpf/lib netlink related
>>> functions into a new file")
>>> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
>>> ---
>>> tools/lib/bpf/netlink.c | 64 ++++++++++++++++++++---------------------
>>> 1 file changed, 31 insertions(+), 33 deletions(-)
>>>
>>> diff --git a/tools/lib/bpf/netlink.c b/tools/lib/bpf/netlink.c
>>> index 469e068dd0c5..0eae1fbf46c6 100644
>>> --- a/tools/lib/bpf/netlink.c
>>> +++ b/tools/lib/bpf/netlink.c
>>> @@ -70,41 +70,39 @@ static int bpf_netlink_recv(int sock, __u32 nl_pid, int seq,
>>> char buf[4096];
>>> int len, ret;
>>>
>>> - while (1) {
>>> - len = recv(sock, buf, sizeof(buf), 0);
>>> - if (len < 0) {
>>> - ret = -errno;
>>> + len = recv(sock, buf, sizeof(buf), 0);
>>> + if (len < 0) {
>>> + ret = -errno;
>>> + goto done;
>>> + }
>>> +
>>> + for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
>>> + nh = NLMSG_NEXT(nh, len)) {
>>> + if (nh->nlmsg_pid != nl_pid) {
>>> + ret = -LIBBPF_ERRNO__WRNGPID;
>>> goto done;
>>> }
>>> -
>>> - for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
>>> - nh = NLMSG_NEXT(nh, len)) {
>>> - if (nh->nlmsg_pid != nl_pid) {
>>> - ret = -LIBBPF_ERRNO__WRNGPID;
>>> - goto done;
>>> - }
>>> - if (nh->nlmsg_seq != seq) {
>>> - ret = -LIBBPF_ERRNO__INVSEQ;
>>> - goto done;
>>> - }
>>> - switch (nh->nlmsg_type) {
>>> - case NLMSG_ERROR:
>>> - err = (struct nlmsgerr *)NLMSG_DATA(nh);
>>> - if (!err->error)
>>> - continue;
>>> - ret = err->error;
>>> - nla_dump_errormsg(nh);
>>> - goto done;
>>> - case NLMSG_DONE:
>>> - return 0;
>>> - default:
>>> - break;
>>> - }
>>> - if (_fn) {
>>> - ret = _fn(nh, fn, cookie);
>>> - if (ret)
>>> - return ret;
>>> - }
>>> + if (nh->nlmsg_seq != seq) {
>>> + ret = -LIBBPF_ERRNO__INVSEQ;
>>> + goto done;
>>> + }
>>> + switch (nh->nlmsg_type) {
>>> + case NLMSG_ERROR:
>>> + err = (struct nlmsgerr *)NLMSG_DATA(nh);
>>> + if (!err->error)
>>> + continue;
>>> + ret = err->error;
>>> + nla_dump_errormsg(nh);
>>> + goto done;
>>> + case NLMSG_DONE:
>>> + return 0;
>>> + default:
>>> + break;
>>> + }
>>> + if (_fn) {
>>> + ret = _fn(nh, fn, cookie);
>>> + if (ret)
>>> + return ret;
>>> }
>>> }
>>> ret = 0;
>>>
^ permalink raw reply related