* [PATCH 0/2] ethernet: Use NULL to compare with pointer-typed value rather than 0
From: zhong jiang @ 2018-08-17 13:18 UTC (permalink / raw)
To: davem; +Cc: vz, slemieux.tyco, keescook, netdev, linux-kernel
zhong jiang (2):
ethernet: declance: Use NULL to compare with pointer-typed value rather than 0
ethernet: lpc_eth: Use NULL to compare with pointer-typed value rather than 0
drivers/net/ethernet/amd/declance.c | 2 +-
drivers/net/ethernet/nxp/lpc_eth.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
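In C a literal 0 in pointer context is already a null pointer constant, so `p == 0` and `p == NULL` generate the same check; a series like this is a readability cleanup (sparse, for example, warns "Using plain integer as NULL pointer"). A minimal sketch of the preferred style, using a hypothetical lookup helper that is not taken from either driver:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical lookup helper: returns a pointer into the table, or NULL. */
static const char *find_name(const char *const *table, size_t n, const char *key)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (strcmp(table[i], key) == 0)
			return table[i];
	return NULL;
}

/* Preferred style: compare the pointer against NULL rather than 0. */
static int has_name(const char *const *table, size_t n, const char *key)
{
	const char *p = find_name(table, n, key);

	return p != NULL;	/* same code as "p != 0", but states the intent */
}
```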
--
1.7.12.4
^ permalink raw reply
* [PATCH net-next] net: dsa: mv88e6xxx: Share main switch IRQ
From: Marek Behún @ 2018-08-17 10:09 UTC (permalink / raw)
To: Andrew Lunn; +Cc: netdev, Marek Behún
In-Reply-To: <20180817093055.3107-1-marek.behun@nic.cz>
On some boards the interrupt can be shared between multiple devices.
For example on Turris Mox the interrupt is shared between all switches.
Signed-off-by: Marek Behun <marek.behun@nic.cz>
---
drivers/net/dsa/mv88e6xxx/chip.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 8da3d39e3218..b57f5403982a 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -434,7 +434,7 @@ static int mv88e6xxx_g1_irq_setup(struct mv88e6xxx_chip *chip)
err = request_threaded_irq(chip->irq, NULL,
mv88e6xxx_g1_irq_thread_fn,
- IRQF_ONESHOT,
+ IRQF_ONESHOT | IRQF_SHARED,
dev_name(chip->dev), chip);
if (err)
mv88e6xxx_g1_irq_free_common(chip);
--
2.16.4
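With IRQF_SHARED, every handler registered on the line runs each time it fires, so each handler is expected to check whether its own device raised the interrupt and return IRQ_NONE otherwise. The kernel also requires a non-NULL, unique dev_id for shared lines (here `chip`) and expects all sharers to agree on flags such as IRQF_ONESHOT and the trigger type. The userspace model below illustrates only that dispatch contract; the enum, struct and function names are hypothetical stand-ins, not the kernel's types or the mv88e6xxx code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's IRQ_NONE / IRQ_HANDLED. */
enum demo_irqreturn { DEMO_IRQ_NONE, DEMO_IRQ_HANDLED };

/* Hypothetical switch: "pending" models its global interrupt status. */
struct demo_switch {
	bool pending;
	unsigned int handled;
};

/* Each handler on a shared line first checks its own device. */
static enum demo_irqreturn demo_switch_irq(struct demo_switch *sw)
{
	if (!sw->pending)
		return DEMO_IRQ_NONE;	/* not ours: leave it to the other sharers */
	sw->pending = false;		/* acknowledge the device */
	sw->handled++;
	return DEMO_IRQ_HANDLED;
}

/* Model of the core invoking every handler registered on one line. */
static size_t demo_dispatch_shared(struct demo_switch *sws, size_t n)
{
	size_t i, claimed = 0;

	for (i = 0; i < n; i++)
		if (demo_switch_irq(&sws[i]) == DEMO_IRQ_HANDLED)
			claimed++;
	return claimed;
}
```

For three switches wired to one GPIO where only the second asserted the line, one dispatch is claimed by exactly one handler and the other two return "none".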
^ permalink raw reply related
* Re: virtio_net failover and initramfs (was: Re: [PATCH net-next v11 2/5] netvsc: refactor notifier/event handling code to use the failover framework)
From: Harald Hoyer @ 2018-08-17 9:56 UTC (permalink / raw)
To: Siwei Liu, Samudrala, Sridhar
Cc: Jiri Pirko, initramfs, Michael S. Tsirkin, Netdev,
vijay.balakrishna, si-wei liu, liran.alon
In-Reply-To: <85bf8fce-8cdf-597c-3b50-a1ff2eb5c5c4@redhat.com>
On 17.08.2018 11:51, Harald Hoyer wrote:
> On 16.08.2018 00:17, Siwei Liu wrote:
>> On Wed, Aug 15, 2018 at 12:05 PM, Samudrala, Sridhar
>> <sridhar.samudrala@intel.com> wrote:
>>> On 8/14/2018 5:03 PM, Siwei Liu wrote:
>>>>
>>>> Are we sure all userspace apps skip and ignore slave interfaces by
>>>> just looking at the "IFLA_MASTER" attribute?
>>>>
>>>> When STANDBY is enabled on virtio-net, a failover master interface
>>>> will appear, which automatically enslaves the virtio device. But it
>>>> turns out that iSCSI (or any network boot) cannot bootstrap over the
>>>> new failover interface together with a standby virtio (without any VF
>>>> or PT device in place).
>>>>
>>>> Dracut (initramfs) ends up timing out and dropping into the emergency shell:
>>>>
>>>> [ 228.170425] dracut-initqueue[377]: Warning: dracut-initqueue
>>>> timeout - starting timeout scripts
>>>> [ 228.171788] dracut-initqueue[377]: Warning: Could not boot.
>>>> Starting Dracut Emergency Shell...
>>>> Generating "/run/initramfs/rdsosreport.txt"
>>>> Entering emergency mode. Exit the shell to continue.
>>>> Type "journalctl" to view system logs.
>>>> You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or
>>>> /boot
>>>> after mounting them and attach it to a bug report.
>>>> dracut:/# ip l sh
>>>> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
>>>> mode DEFAULT group default qlen 1000
>>>> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue
>>>> state UP mode DEFAULT group default qlen 1000
>>>> link/ether 9a:46:22:ae:33:54 brd ff:ff:ff:ff:ff:ff
>>>> 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast
>>>> master eth0 state UP mode DEFAULT group default qlen 1000
>>>> link/ether 9a:46:22:ae:33:54 brd ff:ff:ff:ff:ff:ff
>>>> dracut:/#
>>>>
>>>> If dracut is changed to ignore eth1 (which has IFLA_MASTER set),
>>>> network boot starts to work.
>>>
>>>
>>> Does dracut try to use all the interfaces that are UP by default?
>>>
>> Yes. The specific dracut cmdline in our case is "ip=dhcp
>> netroot=iscsi:... ", but it's not specific to iSCSI boot. Because the
>> failover and standby interfaces have the same MAC address, when dracut
>> runs DHCP on all interfaces that are up, it gets the same route for
>> each interface. Those conflicting route entries kill off the network
>> connection.
>>
>>>
>>>>
>>>> The reason is that dracut has its own means to differentiate virtual
>>>> interfaces for network boot: it does not look at IFLA_MASTER to
>>>> ignore slave interfaces. Instead, users have to provide an explicit
>>>> option, e.g. bond=eth0,eth1, on the boot line; then dracut knows
>>>> the config and ignores the slave interfaces.
>>>
>>>
>>> Isn't it possible to specify the interface that should be used for network
>>> boot?
>> As I understand it, one can only specify the interface name for running
>> DHCP, not select the interface for network boot. We want DHCP to run
>> on every NIC that is up (excluding the enslaved interfaces), and only
>> one of them can get a route entry to the network boot server (e.g.
>> iSCSI target).
>>
>>>
>>>
>>>>
>>>> However, with automatic creation of the failover interface, that
>>>> assumption is no longer true. Can we change dracut to ignore all slave
>>>> interfaces by checking IFLA_MASTER? I don't think so. It has a large
>>>> impact on existing configs.
>>>
>>>
>>> What is the issue with checking for IFLA_MASTER? I guess this is used with
>>> team/bonding setups.
>> That should be discussed within and determined by the dracut
>> community. But the current dracut code doesn't check IFLA_MASTER for
>> team or bonding specifically. I guess this change might have a broader
>> impact on existing userspace that might already be relying on the
>> current behaviour.
>>
>> Thanks,
>> -Siwei
>
> Is there a sysfs flag for IFF_SLAVE? Or any "ip" output I can use to detect that it is an IFF_SLAVE?
>
Oh, it's the other way around: dracut should ignore "master" (eth1).
Can the master enslave the "eth0", if it is already "UP" and busy later on?
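One way for userspace to tell that an interface is enslaved, without parsing `ip` output, is the "master" symlink the kernel creates under /sys/class/net/<ifname>/ when a device is linked to a master upper device (bonding, team, bridge and the failover master work this way); `ip link` reports the same relationship as "master eth0". A minimal sketch assuming that sysfs layout; `iface_master()` is a hypothetical helper, and the sysfs root is a parameter so it can be pointed at a test directory:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Looks up <sysfs_net>/<ifname>/master (sysfs_net is normally
 * "/sys/class/net"). Returns 1 and stores the master's name in out if the
 * link exists, 0 if there is no readable "master" link, -1 on error.
 */
static int iface_master(const char *sysfs_net, const char *ifname,
			char *out, size_t outlen)
{
	char path[512], target[512];
	const char *base;
	ssize_t len;
	int n;

	n = snprintf(path, sizeof(path), "%s/%s/master", sysfs_net, ifname);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	len = readlink(path, target, sizeof(target) - 1);
	if (len < 0)
		return 0;		/* no "master" link: treat as not enslaved */
	target[len] = '\0';		/* readlink() does not NUL-terminate */

	/* The link points at the master's device directory; keep the last component. */
	base = strrchr(target, '/');
	base = base ? base + 1 : target;
	if (strlen(base) >= outlen)
		return -1;
	strcpy(out, base);
	return 1;
}
```

On the system in the log above, `iface_master("/sys/class/net", "eth1", buf, sizeof(buf))` would be expected to report eth0, so a boot script could skip eth1 and run DHCP only on eth0.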
^ permalink raw reply
* Re: virtio_net failover and initramfs (was: Re: [PATCH net-next v11 2/5] netvsc: refactor notifier/event handling code to use the failover framework)
From: Harald Hoyer @ 2018-08-17 9:51 UTC (permalink / raw)
To: Siwei Liu, Samudrala, Sridhar
Cc: Jiri Pirko, initramfs, Michael S. Tsirkin, Netdev,
vijay.balakrishna, si-wei liu, liran.alon
In-Reply-To: <CADGSJ206hwO_3RbB8C5875updN1e=LZqw1kwNJgne_-c4=9GoQ@mail.gmail.com>
On 16.08.2018 00:17, Siwei Liu wrote:
> On Wed, Aug 15, 2018 at 12:05 PM, Samudrala, Sridhar
> <sridhar.samudrala@intel.com> wrote:
>> On 8/14/2018 5:03 PM, Siwei Liu wrote:
>>>
>>> Are we sure all userspace apps skip and ignore slave interfaces by
>>> just looking at the "IFLA_MASTER" attribute?
>>>
>>> When STANDBY is enabled on virtio-net, a failover master interface
>>> will appear, which automatically enslaves the virtio device. But it
>>> turns out that iSCSI (or any network boot) cannot bootstrap over the
>>> new failover interface together with a standby virtio (without any VF
>>> or PT device in place).
>>>
>>> Dracut (initramfs) ends up timing out and dropping into the emergency shell:
>>>
>>> [ 228.170425] dracut-initqueue[377]: Warning: dracut-initqueue
>>> timeout - starting timeout scripts
>>> [ 228.171788] dracut-initqueue[377]: Warning: Could not boot.
>>> Starting Dracut Emergency Shell...
>>> Generating "/run/initramfs/rdsosreport.txt"
>>> Entering emergency mode. Exit the shell to continue.
>>> Type "journalctl" to view system logs.
>>> You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or
>>> /boot
>>> after mounting them and attach it to a bug report.
>>> dracut:/# ip l sh
>>> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
>>> mode DEFAULT group default qlen 1000
>>> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue
>>> state UP mode DEFAULT group default qlen 1000
>>> link/ether 9a:46:22:ae:33:54 brd ff:ff:ff:ff:ff:ff
>>> 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast
>>> master eth0 state UP mode DEFAULT group default qlen 1000
>>> link/ether 9a:46:22:ae:33:54 brd ff:ff:ff:ff:ff:ff
>>> dracut:/#
>>>
>>> If dracut is changed to ignore eth1 (which has IFLA_MASTER set),
>>> network boot starts to work.
>>
>>
>> Does dracut try to use all the interfaces that are UP by default?
>>
> Yes. The specific dracut cmdline in our case is "ip=dhcp
> netroot=iscsi:... ", but it's not specific to iSCSI boot. Because the
> failover and standby interfaces have the same MAC address, when dracut
> runs DHCP on all interfaces that are up, it gets the same route for
> each interface. Those conflicting route entries kill off the network
> connection.
>
>>
>>>
>>> The reason is that dracut has its own means to differentiate virtual
>>> interfaces for network boot: it does not look at IFLA_MASTER to
>>> ignore slave interfaces. Instead, users have to provide an explicit
>>> option, e.g. bond=eth0,eth1, on the boot line; then dracut knows
>>> the config and ignores the slave interfaces.
>>
>>
>> Isn't it possible to specify the interface that should be used for network
>> boot?
> As I understand it, one can only specify the interface name for running
> DHCP, not select the interface for network boot. We want DHCP to run
> on every NIC that is up (excluding the enslaved interfaces), and only
> one of them can get a route entry to the network boot server (e.g.
> iSCSI target).
>
>>
>>
>>>
>>> However, with automatic creation of the failover interface, that
>>> assumption is no longer true. Can we change dracut to ignore all slave
>>> interfaces by checking IFLA_MASTER? I don't think so. It has a large
>>> impact on existing configs.
>>
>>
>> What is the issue with checking for IFLA_MASTER? I guess this is used with
>> team/bonding setups.
> That should be discussed within and determined by the dracut
> community. But the current dracut code doesn't check IFLA_MASTER for
> team or bonding specifically. I guess this change might have a broader
> impact on existing userspace that might already be relying on the
> current behaviour.
>
> Thanks,
> -Siwei
Is there a sysfs flag for IFF_SLAVE? Or any "ip" output I can use to detect that it is an IFF_SLAVE?
^ permalink raw reply
* mv88e6xxx: question: can switch irq be shared?
From: Marek Behún @ 2018-08-17 9:30 UTC (permalink / raw)
To: Andrew Lunn; +Cc: netdev, Marek Behún
Hello, I am wondering if the main device irq in
dsa/mv88e6xxx/chip.c can be requested as shared (see patch below).
The reason is that our board is wired so that irqs from all switches
come to the same gpio.
Marek
---
drivers/net/dsa/mv88e6xxx/chip.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index d90656e415b0..1caaa09e391e 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -424,7 +424,8 @@ static int mv88e6xxx_g1_irq_setup(struct mv88e6xxx_chip *chip)
err = request_threaded_irq(chip->irq, NULL,
mv88e6xxx_g1_irq_thread_fn,
- IRQF_ONESHOT | IRQF_TRIGGER_FALLING,
+ IRQF_ONESHOT | IRQF_TRIGGER_FALLING
+ | IRQF_SHARED,
dev_name(chip->dev), chip);
if (err)
mv88e6xxx_g1_irq_free_common(chip);
^ permalink raw reply related
* [regression] r8169 without realtek_phy
From: Jouni Mettälä @ 2018-08-17 8:33 UTC (permalink / raw)
To: netdev
There is a network regression for me: 4.18 was good, 4.18+ is bad. There
were some PHY changes in the r8169 driver. Fortunately, adding
CONFIG_REALTEK_PHY=m to the kernel config fixed the regression.
Should r8169 depend on realtek_phy? Would that break something else?
The network doesn't work with the Generic PHY (dmesg output):
Generic PHY r8169-300:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-300:00, irq=IGNORE)
When realtek_phy is built, r8169 uses it automatically:
RTL8211B Gigabit Ethernet r8169-300:00: attached PHY driver [RTL8211B Gigabit Ethernet] (mii_bus:phy_addr=r8169-300:00, irq=IGNORE)
Here is Ethernet controller's lspci for reference:
03:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd.
RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168]
(rev 01)
Subsystem: ABIT Computer Corp. RTL8111/8168/8411 PCI Express
Gigabit Ethernet Controller [147b:1078]
Flags: bus master, fast devsel, latency 0, IRQ 27
I/O ports at ce00 [size=256]
Memory at fddff000 (64-bit, non-prefetchable) [size=4K]
[virtual] Expansion ROM at fdd00000 [disabled] [size=64K]
Capabilities: [40] Power Management version 2
Capabilities: [48] Vital Product Data
Capabilities: [50] MSI: Enable+ Count=1/2 Maskable- 64bit+
Capabilities: [60] Express Endpoint, MSI 00
Capabilities: [84] Vendor Specific Information: Len=4c <?>
Capabilities: [100] Advanced Error Reporting
Capabilities: [12c] Virtual Channel
Capabilities: [148] Device Serial Number 28-00-00-00-00-00-00-00
Capabilities: [154] Power Budgeting <?>
Kernel driver in use: r8169
Kernel modules: r8169
^ permalink raw reply
* [PATCH 1/1] tap: RCU usage and comment fixes
From: Wang Jian @ 2018-08-17 8:22 UTC (permalink / raw)
To: davem, jasowang, mst, willemb, viro, wexu, netdev; +Cc: Wang Jian
The tap_queue and the 'tap_dev' are loosely coupled, not 'macvlan_dev'.
Taking rcu_read_lock() a little later, after the zerocopy handling, seems to slightly reduce the RCU read-side critical section.
Signed-off-by: Wang Jian <jianjian.wang1@gmail.com>
---
drivers/net/tap.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index f0f7cd977667..e5e5a8e4a60d 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -125,7 +125,7 @@ static struct tap_dev *tap_dev_get_rcu(const struct net_device *dev)
/*
* RCU usage:
- * The tap_queue and the macvlan_dev are loosely coupled, the
+ * The tap_queue and the tap_dev are loosely coupled, the
* pointers from one to the other can only be read while rcu_read_lock
* or rtnl is held.
*
@@ -720,8 +720,6 @@ static ssize_t tap_get_user(struct tap_queue *q, struct msghdr *m,
__vlan_get_protocol(skb, skb->protocol, &depth) != 0)
skb_set_network_header(skb, depth);
- rcu_read_lock();
- tap = rcu_dereference(q->tap);
/* copy skb_ubuf_info for callback when skb has no error */
if (zerocopy) {
skb_shinfo(skb)->destructor_arg = m->msg_control;
@@ -732,6 +730,8 @@ static ssize_t tap_get_user(struct tap_queue *q, struct msghdr *m,
uarg->callback(uarg, false);
}
+ rcu_read_lock();
+ tap = rcu_dereference(q->tap);
if (tap) {
skb->dev = tap->dev;
dev_queue_xmit(skb);
--
2.17.1
^ permalink raw reply related
* [PATCH] datapath.c: fix missing return value check of nla_nest_start()
From: Jiecheng Wu @ 2018-08-17 8:15 UTC (permalink / raw)
To: netdev
Function queue_userspace_packet() in net/openvswitch/datapath.c calls nla_nest_start() to reserve room for a struct nlattr in user_skb and then uses the returned pointer. As nla_nest_start() may return NULL on failure, this code may cause a NULL pointer dereference.
---
net/openvswitch/datapath.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 0f5ce77..ff4457d 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -460,6 +460,8 @@ static int queue_userspace_packet(struct datapath *dp, struct sk_buff *skb,
if (upcall_info->egress_tun_info) {
nla = nla_nest_start(user_skb, OVS_PACKET_ATTR_EGRESS_TUN_KEY);
+ if (!nla)
+ return -EMSGSIZE;
err = ovs_nla_put_tunnel_info(user_skb,
upcall_info->egress_tun_info);
BUG_ON(err);
@@ -468,6 +470,8 @@ static int queue_userspace_packet(struct datapath *dp, struct sk_buff *skb,
if (upcall_info->actions_len) {
nla = nla_nest_start(user_skb, OVS_PACKET_ATTR_ACTIONS);
+ if (!nla)
+ return -EMSGSIZE;
err = ovs_nla_put_actions(upcall_info->actions,
upcall_info->actions_len,
user_skb);
--
2.6.4
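Returning directly from the middle of a function skips any cleanup done at a shared exit label. If queue_userspace_packet() frees user_skb on a common error path (that part of the function is not visible in these hunks), the usual kernel idiom would be to set err and jump to that label rather than return. A generic userspace sketch of the single-exit pattern; `struct msgbuf`, `nest_start()` and `build_msg()` are hypothetical stand-ins for user_skb, nla_nest_start() and the caller, not Open vSwitch code:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for user_skb: a fixed-capacity buffer. */
struct msgbuf {
	size_t len, cap;
	unsigned char *data;
};

/* Hypothetical stand-in for nla_nest_start(): NULL when there is no room. */
static void *nest_start(struct msgbuf *m, size_t hdrlen)
{
	void *p;

	if (m->cap - m->len < hdrlen)
		return NULL;
	p = m->data + m->len;
	m->len += hdrlen;
	return p;
}

/* Adds "nests" 4-byte headers; the buffer is released on every path. */
static int build_msg(size_t cap, size_t nests)
{
	struct msgbuf m = { 0, cap, NULL };
	int err = 0;
	size_t i;

	m.data = malloc(cap);
	if (!m.data)
		return -ENOMEM;		/* nothing allocated yet: direct return is fine */

	for (i = 0; i < nests; i++) {
		if (!nest_start(&m, 4)) {
			err = -EMSGSIZE;
			goto out;	/* a bare "return" here would leak m.data */
		}
	}
	/* In this demo the buffer is simply freed below on success too. */
out:
	free(m.data);
	return err;
}
```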
^ permalink raw reply related
* Dear intended recipient
From: Malik Sanfo @ 2018-08-17 9:24 UTC (permalink / raw)
--
Good-Day!,
Can you assist me to handle this transaction? I will forward you the
full details about the transaction if you are ready.
Yours faithfully
Mr. Malik Sanfo
^ permalink raw reply
* Re: [v3, net-next, 02/12] net: stmmac: Do not keep rearming the coalesce timer in stmmac_xmit
From: Jerome Brunet @ 2018-08-17 7:32 UTC (permalink / raw)
To: Jose Abreu, netdev, open list:ARM/Amlogic Meson..., Kevin Hilman
Cc: David S. Miller, Joao Pinto, Vitor Soares, Giuseppe Cavallaro,
Alexandre Torgue, Bartosz Gołaszewski
In-Reply-To: <9d0be5db11478d00a9194065abcf137b4d537c0a.1526651009.git.joabreu@synopsys.com>
On Fri, 2018-05-18 at 14:55 +0100, Jose Abreu wrote:
> This is cutting down performance. Once the timer is armed it should run
> after the time expires for the first packet sent and not the last one.
>
> After this change, running iperf, the performance gain is +/- 24%.
Hi Guys,
Since v4.18, we are getting a serious regression on Amlogic based SoCs.
I have tested this on amlogic's:
* gxbb S905 p200 (Micrel KSZ9031 - 1GBps)
* axg A113 s400 (Realtek RTL8211F - 1GBps)
Both SoCs use the synopsys gmac with stmmac driver.
I first noticed that running an NFS root filesystem became unstable, but I could not
understand why. Then, running a download as a simple test with iperf3 (from an
initramfs) breaks the 'network' in a matter of seconds.
I don't know exactly what breaks, but bisection clearly assigns the blame to this
change. Reverting the change solves this problem.
I'll be happy to run more tests to help understand what is happening here.
In the meantime, should we consider reverting this patch?
Best Regards
Jerome
>
> Signed-off-by: Jose Abreu <joabreu@synopsys.com>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: Joao Pinto <jpinto@synopsys.com>
> Cc: Vitor Soares <soares@synopsys.com>
> Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
> Cc: Alexandre Torgue <alexandre.torgue@st.com>
> ---
> drivers/net/ethernet/stmicro/stmmac/stmmac.h | 1 +
> drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 5 ++++-
> 2 files changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
> index 42fc76e..4d425b1 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
> @@ -105,6 +105,7 @@ struct stmmac_priv {
> u32 tx_count_frames;
> u32 tx_coal_frames;
> u32 tx_coal_timer;
> + bool tx_timer_armed;
>
> int tx_coalesce;
> int hwts_tx_en;
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index d9dbe13..789bc22 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -3158,13 +3158,16 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
> * element in case of no SG.
> */
> priv->tx_count_frames += nfrags + 1;
> - if (likely(priv->tx_coal_frames > priv->tx_count_frames)) {
> + if (likely(priv->tx_coal_frames > priv->tx_count_frames) &&
> + !priv->tx_timer_armed) {
> mod_timer(&priv->txtimer,
> STMMAC_COAL_TIMER(priv->tx_coal_timer));
> + priv->tx_timer_armed = true;
> } else {
> priv->tx_count_frames = 0;
> stmmac_set_tx_ic(priv, desc);
> priv->xstats.tx_set_ic_bit++;
> + priv->tx_timer_armed = false;
> }
>
> skb_tx_timestamp(skb);
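Note that the patch folds `!priv->tx_timer_armed` into the same `if` as the frame-count check, so once the timer is armed, the next packet takes the `else` branch even though tx_coal_frames has not been reached: the IC bit is set and the counter is reset on that packet. The userspace model below reproduces only the branch logic from the hunk (single-fragment packets, a hypothetical tx_coal_frames of 25). It does not model the timer callback or the hardware, and is not a claim about the root cause of the regression.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal model of the fields touched in the hunk (names mirror stmmac_priv). */
struct coal_model {
	unsigned int tx_count_frames, tx_coal_frames;
	bool tx_timer_armed;
	unsigned int timer_arms;	/* how often mod_timer() would be called */
	unsigned int ic_bits;		/* how often the IC bit would be set */
};

/* Branch logic before the patch. */
static void xmit_before(struct coal_model *c, unsigned int nfrags)
{
	c->tx_count_frames += nfrags + 1;
	if (c->tx_coal_frames > c->tx_count_frames) {
		c->timer_arms++;
	} else {
		c->tx_count_frames = 0;
		c->ic_bits++;
	}
}

/* Branch logic after the patch. */
static void xmit_after(struct coal_model *c, unsigned int nfrags)
{
	c->tx_count_frames += nfrags + 1;
	if (c->tx_coal_frames > c->tx_count_frames && !c->tx_timer_armed) {
		c->timer_arms++;
		c->tx_timer_armed = true;
	} else {
		/* also reached whenever the timer is already armed */
		c->tx_count_frames = 0;
		c->ic_bits++;
		c->tx_timer_armed = false;
	}
}
```

Sending four single-fragment packets, the "before" model re-arms the timer four times and never sets IC, while the "after" model alternates: it arms the timer twice and sets IC twice.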
^ permalink raw reply
* [PATCH iproute2-next] iproute_lwtunnel: allow specifying 'src' for 'encap ip' / 'encap ip6'
From: Shmulik Ladkani @ 2018-08-17 7:31 UTC (permalink / raw)
To: stephen, dsahern; +Cc: netdev, shmulik.ladkani, Shmulik Ladkani
This allows the user to specify the LWTUNNEL_IP_SRC/LWTUNNEL_IP6_SRC
when setting an lwtunnel encapsulation route.
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
---
ip/iproute_lwtunnel.c | 22 ++++++++++++++++++++--
1 file changed, 20 insertions(+), 2 deletions(-)
diff --git a/ip/iproute_lwtunnel.c b/ip/iproute_lwtunnel.c
index 740da7c6..20d5545c 100644
--- a/ip/iproute_lwtunnel.c
+++ b/ip/iproute_lwtunnel.c
@@ -671,7 +671,7 @@ static int parse_encap_mpls(struct rtattr *rta, size_t len,
static int parse_encap_ip(struct rtattr *rta, size_t len,
int *argcp, char ***argvp)
{
- int id_ok = 0, dst_ok = 0, tos_ok = 0, ttl_ok = 0;
+ int id_ok = 0, dst_ok = 0, src_ok = 0, tos_ok = 0, ttl_ok = 0;
char **argv = *argvp;
int argc = *argcp;
@@ -694,6 +694,15 @@ static int parse_encap_ip(struct rtattr *rta, size_t len,
get_addr(&addr, *argv, AF_INET);
rta_addattr_l(rta, len, LWTUNNEL_IP_DST,
&addr.data, addr.bytelen);
+ } else if (strcmp(*argv, "src") == 0) {
+ inet_prefix addr;
+
+ NEXT_ARG();
+ if (src_ok++)
+ duparg2("src", *argv);
+ get_addr(&addr, *argv, AF_INET);
+ rta_addattr_l(rta, len, LWTUNNEL_IP_SRC,
+ &addr.data, addr.bytelen);
} else if (strcmp(*argv, "tos") == 0) {
__u32 tos;
@@ -805,7 +814,7 @@ static int parse_encap_ila(struct rtattr *rta, size_t len,
static int parse_encap_ip6(struct rtattr *rta, size_t len,
int *argcp, char ***argvp)
{
- int id_ok = 0, dst_ok = 0, tos_ok = 0, ttl_ok = 0;
+ int id_ok = 0, dst_ok = 0, src_ok = 0, tos_ok = 0, ttl_ok = 0;
char **argv = *argvp;
int argc = *argcp;
@@ -828,6 +837,15 @@ static int parse_encap_ip6(struct rtattr *rta, size_t len,
get_addr(&addr, *argv, AF_INET6);
rta_addattr_l(rta, len, LWTUNNEL_IP6_DST,
&addr.data, addr.bytelen);
+ } else if (strcmp(*argv, "src") == 0) {
+ inet_prefix addr;
+
+ NEXT_ARG();
+ if (src_ok++)
+ duparg2("src", *argv);
+ get_addr(&addr, *argv, AF_INET6);
+ rta_addattr_l(rta, len, LWTUNNEL_IP6_SRC,
+ &addr.data, addr.bytelen);
} else if (strcmp(*argv, "tc") == 0) {
__u32 tc;
--
2.18.0
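The new "src" branches follow the same pattern as the existing keywords: a per-keyword counter (src_ok) rejects a repeated argument via duparg2(), and the address family is fixed per encap type (AF_INET for "encap ip", AF_INET6 for "encap ip6"). The standalone sketch below mimics that pattern for the IPv4 case using inet_pton(); `parse_encap_ip_demo()` and `struct encap_ip_demo` are hypothetical and are not iproute2 code.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>

struct encap_ip_demo {
	struct in_addr dst, src;
	int have_dst, have_src;
};

/*
 * Parses "dst ADDR" / "src ADDR" pairs. Returns 0 on success, -1 on an
 * unknown keyword, a missing or invalid address, or a repeated keyword.
 */
static int parse_encap_ip_demo(int argc, char **argv, struct encap_ip_demo *e)
{
	int dst_ok = 0, src_ok = 0;

	memset(e, 0, sizeof(*e));
	while (argc > 0) {
		struct in_addr *slot;

		if (strcmp(*argv, "dst") == 0) {
			if (dst_ok++)
				return -1;	/* iproute2 calls duparg2() here */
			slot = &e->dst;
		} else if (strcmp(*argv, "src") == 0) {
			if (src_ok++)
				return -1;
			slot = &e->src;
		} else {
			return -1;
		}
		argc--, argv++;		/* like NEXT_ARG(): step to the value */
		if (argc == 0 || inet_pton(AF_INET, *argv, slot) != 1)
			return -1;
		argc--, argv++;
	}
	e->have_dst = dst_ok;
	e->have_src = src_ok;
	return 0;
}
```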
^ permalink raw reply related
* Re: [PATCH 1/1] tap: comment fix
From: Wang Jian @ 2018-08-17 8:24 UTC (permalink / raw)
To: Jason
Cc: David S . Miller, girish.moodalbail, mst, Willem de Bruijn, viro,
wexu, netdev, linux-kernel
In-Reply-To: <d3c380fc-cebd-0a91-1acc-916a453c9f11@redhat.com>
Thanks for the reminder.
Because this change is trivial, I changed the subject.
On Fri, Aug 17, 2018 at 12:29 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
>
> On 2018年08月17日 03:30, David Miller wrote:
> > From: Wang Jian <jianjian.wang1@gmail.com>
> > Date: Thu, 16 Aug 2018 21:01:27 +0800
> >
> >> The tap_queue and the "tap_dev" are loosely coupled, not "macvlan_dev".
> >>
> >> I also moved one rcu_read_lock() call, which seems to reduce the RCU
> >> critical section a little.
> >>
> >> Signed-off-by: Wang Jian <jianjian.wang1@gmail.com>
> > This patch was corrupted by your email client, for example it turned
> > TAB characters into sequences of spaces.
> >
> > Please fix this, email a test patch to yourself, and do not resend the
> > patch to this mailing list until you can successfully extract and
> > cleanly apply the test patch you email to yourself.
> >
> > Thank you.
>
> Besides this, please split it into two patches. The RCU change does not
> belong to "comment fix" for sure.
>
> Thanks
>
--
Regards,
Wang Jian
^ permalink raw reply
* [PATCH net-next] esp: remove redundant define esph
From: Haishuang Yan @ 2018-08-17 7:51 UTC (permalink / raw)
To: Steffen Klassert, Herbert Xu, David S. Miller
Cc: netdev, linux-kernel, Haishuang Yan
The pointer 'esph' is declared but only ever used as a sizeof operand, hence it is redundant
and can be removed.
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
---
net/ipv4/esp4.c | 7 +++----
net/ipv6/esp6.c | 7 +++----
2 files changed, 6 insertions(+), 8 deletions(-)
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 9768901..211caaf 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -683,12 +683,11 @@ static void esp_input_done_esn(struct crypto_async_request *base, int err)
*/
static int esp_input(struct xfrm_state *x, struct sk_buff *skb)
{
- struct ip_esp_hdr *esph;
struct crypto_aead *aead = x->data;
struct aead_request *req;
struct sk_buff *trailer;
int ivlen = crypto_aead_ivsize(aead);
- int elen = skb->len - sizeof(*esph) - ivlen;
+ int elen = skb->len - sizeof(struct ip_esp_hdr) - ivlen;
int nfrags;
int assoclen;
int seqhilen;
@@ -698,13 +697,13 @@ static int esp_input(struct xfrm_state *x, struct sk_buff *skb)
struct scatterlist *sg;
int err = -EINVAL;
- if (!pskb_may_pull(skb, sizeof(*esph) + ivlen))
+ if (!pskb_may_pull(skb, sizeof(struct ip_esp_hdr) + ivlen))
goto out;
if (elen <= 0)
goto out;
- assoclen = sizeof(*esph);
+ assoclen = sizeof(struct ip_esp_hdr);
seqhilen = 0;
if (x->props.flags & XFRM_STATE_ESN) {
diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index 88a7579..63b2b66 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -601,12 +601,11 @@ static void esp_input_done_esn(struct crypto_async_request *base, int err)
static int esp6_input(struct xfrm_state *x, struct sk_buff *skb)
{
- struct ip_esp_hdr *esph;
struct crypto_aead *aead = x->data;
struct aead_request *req;
struct sk_buff *trailer;
int ivlen = crypto_aead_ivsize(aead);
- int elen = skb->len - sizeof(*esph) - ivlen;
+ int elen = skb->len - sizeof(struct ip_esp_hdr) - ivlen;
int nfrags;
int assoclen;
int seqhilen;
@@ -616,7 +615,7 @@ static int esp6_input(struct xfrm_state *x, struct sk_buff *skb)
u8 *iv;
struct scatterlist *sg;
- if (!pskb_may_pull(skb, sizeof(*esph) + ivlen)) {
+ if (!pskb_may_pull(skb, sizeof(struct ip_esp_hdr) + ivlen)) {
ret = -EINVAL;
goto out;
}
@@ -626,7 +625,7 @@ static int esp6_input(struct xfrm_state *x, struct sk_buff *skb)
goto out;
}
- assoclen = sizeof(*esph);
+ assoclen = sizeof(struct ip_esp_hdr);
seqhilen = 0;
if (x->props.flags & XFRM_STATE_ESN) {
--
1.8.3.1
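The change is safe because the operand of sizeof is not evaluated (for non-VLA types): `sizeof(*esph)` never dereferences the uninitialised pointer and yields exactly `sizeof(struct ip_esp_hdr)`, so esph only served to name the type. A standalone check, using a hypothetical struct that mirrors the ip_esp_hdr layout (two 32-bit fields followed by a flexible array member):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of struct ip_esp_hdr: SPI, sequence number, payload. */
struct esp_hdr_demo {
	uint32_t spi;
	uint32_t seq_no;
	uint8_t enc_data[];
};

/* Old style: size taken through a pointer that is never initialised or read. */
static size_t hdr_len_via_pointer(void)
{
	struct esp_hdr_demo *esph;	/* only used as a sizeof operand */

	return sizeof(*esph);		/* unevaluated: no dereference happens */
}

/* New style, as in the patch: name the type directly. */
static size_t hdr_len_via_type(void)
{
	return sizeof(struct esp_hdr_demo);
}
```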
^ permalink raw reply related
* Re: [PATCH v1 2/3] zinc: Introduce minimal cryptography library
From: D. J. Bernstein @ 2018-08-17 7:31 UTC (permalink / raw)
To: Eric Biggers, Jason A. Donenfeld, Eric Biggers,
Linux Crypto Mailing List, LKML, Netdev, David Miller,
Andrew Lutomirski, Greg Kroah-Hartman, Samuel Neves, Tanja Lange,
Jean-Philippe Aumasson, Karthikeyan Bhargavan
In-Reply-To: <20180816194620.GA185651@gmail.com>
Eric Biggers writes:
> If (more likely) you're talking about things like "use this NEON implementation
> on Cortex-A7 but this other NEON implementation on Cortex-A53", it's up the
> developers and community to test different CPUs and make appropriate decisions,
> and yes it can be very useful to have external benchmarks like SUPERCOP to refer
> to, and I appreciate your work in that area.
You seem to be talking about a process that selects (e.g.) ChaCha20
implementations as follows: manually inspect benchmarks of various
implementations on various CPUs, manually write code to map CPUs to
implementations, manually update the code as necessary for new CPUs, and
of course manually do the same for every other primitive that can see
differences between microarchitectures (which isn't something weird---
it's the normal situation after enough optimization effort).
This is quite a bit of manual work, so the kernel often doesn't do it,
so we end up with unhappy people talking about performance regressions.
For comparison, imagine one simple central piece of code in the kernel
to automatically do the following:
    When a CPU core is booted:
        For each primitive:
            Benchmark all implementations of the primitive on the core.
            Select the fastest for subsequent use on the core.
If this is a general-purpose mechanism (as in SUPERCOP, NaCl, and
libpqcrypto) rather than something ad-hoc (as in raid6), then there's no
manual work per primitive, and no work per implementation. Each CPU, old
or new, automatically obtains the fastest available code for that CPU.
The only cost is a moment of benchmarking at boot time. _If_ this is a
noticeable cost then there are many ways to speed it up: for example,
automatically copy the results across identical cores, automatically
copy the results across boots if the cores are unchanged, automatically
copy results from a central database indexed by CPU identifiers, etc.
The SUPERCOP database is evolving towards enabling this type of sharing.
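The boot-time selection described above can be sketched in a few lines: time every implementation of a primitive on the current core and keep a pointer to the fastest one that produces correct output. This is a minimal userspace illustration, not kernel code and not SUPERCOP's mechanism; the primitive (XOR one buffer into another), both implementations and the repetition count are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

typedef void (*xor_fn)(uint8_t *dst, const uint8_t *src, size_t len);

/* Candidate 1: one byte at a time. */
static void xor_bytes(uint8_t *dst, const uint8_t *src, size_t len)
{
	for (size_t i = 0; i < len; i++)
		dst[i] ^= src[i];
}

/* Candidate 2: 8 bytes at a time via memcpy (alignment-safe), then a byte tail. */
static void xor_words(uint8_t *dst, const uint8_t *src, size_t len)
{
	size_t i = 0;

	for (; i + 8 <= len; i += 8) {
		uint64_t a, b;

		memcpy(&a, dst + i, 8);
		memcpy(&b, src + i, 8);
		a ^= b;
		memcpy(dst + i, &a, 8);
	}
	for (; i < len; i++)
		dst[i] ^= src[i];
}

static const xor_fn xor_impls[] = { xor_bytes, xor_words };
#define N_IMPLS (sizeof(xor_impls) / sizeof(xor_impls[0]))

static double now_sec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* Benchmarks every candidate on this core; returns the fastest correct one. */
static xor_fn select_xor(void)
{
	static uint8_t src[4096], ref[4096], buf[4096];
	xor_fn best = NULL;
	double best_t = 0;

	for (size_t i = 0; i < sizeof(src); i++)
		src[i] = (uint8_t)(i * 131u + 7u);
	memset(ref, 0x5a, sizeof(ref));
	xor_bytes(ref, src, sizeof(ref));	/* reference result */

	for (size_t k = 0; k < N_IMPLS; k++) {
		memset(buf, 0x5a, sizeof(buf));
		xor_impls[k](buf, src, sizeof(buf));
		if (memcmp(buf, ref, sizeof(buf)) != 0)
			continue;		/* never select a wrong implementation */

		double t0 = now_sec();
		for (int r = 0; r < 1000; r++)
			xor_impls[k](buf, src, sizeof(buf));
		double t = now_sec() - t0;
		if (!best || t < best_t) {
			best = xor_impls[k];
			best_t = t;
		}
	}
	return best;
}
```

A caller would run `select_xor()` once per core and cache the returned pointer; copying the result across identical cores or boots, as suggested above, would only change where that pointer comes from.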
> A lot of code can be shared, but in practice different environments have
> different constraints, and kernel programming in particular has some distinct
> differences from userspace programming. For example, you cannot just use the
> FPU (including SSE, AVX, NEON, etc.) registers whenever you want to, since on
> most architectures they can't be used in some contexts such as hardirq context,
> and even when they *can* be used you have to run special code before and after
> which does things like saving all the FPU registers to the task_struct,
> disabling preemption, and/or enabling the FPU.
Is there some reason that each implementor is being pestered to handle
all this? Detecting FPU usage is a simple static-analysis exercise, and
the rest sounds like straightforward boilerplate that should be handled
centrally.
> But disabling preemption for
> long periods of time hurts responsiveness, so it's also desirable to yield the
> processor occasionally, which means that assembly implementations should be
> incremental rather than having a single entry point that does everything.
Doing this rewrite automatically is a bit more of a code-analysis
challenge, but the alternative approach of doing it by hand is insanely
error-prone. See, e.g., https://eprint.iacr.org/2017/891.
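The incremental-interface point can be illustrated without any FPU handling: if a primitive exposes init/update/final steps, the caller can feed it bounded chunks and give up the CPU between chunks (in the kernel, that is where preemption could be re-enabled), and the result matches a one-shot call. A minimal sketch that uses Adler-32 purely as a short stand-in for a real primitive (it is not cryptographic); the chunk size and the optional yield hook are assumptions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Streaming Adler-32 state (stand-in for a real primitive's context). */
struct toy_sum {
	uint32_t a, b;
};

static void toy_init(struct toy_sum *s)
{
	s->a = 1;
	s->b = 0;
}

static void toy_update(struct toy_sum *s, const uint8_t *p, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		s->a = (s->a + p[i]) % 65521u;
		s->b = (s->b + s->a) % 65521u;
	}
}

static uint32_t toy_final(const struct toy_sum *s)
{
	return (s->b << 16) | s->a;
}

/* Processes data in chunks, calling yield() (if given) between chunks. */
static uint32_t toy_chunked(const uint8_t *p, size_t len, size_t chunk,
			    void (*yield)(void))
{
	struct toy_sum s;

	toy_init(&s);
	while (len > 0) {
		size_t n = len < chunk ? len : chunk;

		toy_update(&s, p, n);	/* bounded work per step */
		p += n;
		len -= n;
		if (len > 0 && yield)
			yield();	/* e.g. where the kernel could reschedule */
	}
	return toy_final(&s);
}
```

For the 9-byte input "Wikipedia", every chunk size gives the same value, 0x11E60398.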
> Many people may have contributed to SUPERCOP already, but that doesn't mean
> there aren't things you could do to make it more appealing to contributors and
> more of a community project,
The logic in this sentence is impeccable, and is already illustrated by
many SUPERCOP improvements through the years from an increasing number
of contributors, as summarized in the 87 release announcements so far on
the relevant public mailing list, which you're welcome to study in
detail along with the 400 megabytes of current code and as many previous
versions as you're interested in. That's also the mailing list where
people are told to send patches, as you'll see if you RTFM.
> So Linux distributions may not want to take on the legal risk of
> distributing it
This is a puzzling comment. A moment ago we were talking about the
possibility of useful sharing of (e.g.) ChaCha20 implementations between
SUPERCOP and the Linux kernel, avoiding pointless fracturing of the
community's development process for these implementations. This doesn't
mean that the kernel should be grabbing implementations willy-nilly from
SUPERCOP---surely the kernel should be doing security audits, and the
kernel already has various coding requirements, and the kernel requires
GPL compatibility, while putting any of these requirements into SUPERCOP
would be counterproductive.
If you mean having the entire SUPERCOP benchmarking package distributed
through Linux distributions, I have no idea what your motivation is or
how this is supposed to be connected to anything else we're discussing.
Obviously SUPERCOP's broad code-inclusion policies make this idea a
non-starter.
> nor may companies want to take on the risk of contributing.
RTFM. People who submit code are authorizing public redistribution for
benchmarking. It's up to them to decide if they want to allow more.
---Dan
^ permalink raw reply
* Re: [PATCH 1/1] tap: comment fix
From: Jason Wang @ 2018-08-17 4:28 UTC (permalink / raw)
To: David Miller, jianjian.wang1
Cc: girish.moodalbail, mst, willemb, viro, wexu, netdev, linux-kernel
In-Reply-To: <20180816.123002.844201503080913720.davem@davemloft.net>
On 2018年08月17日 03:30, David Miller wrote:
> From: Wang Jian <jianjian.wang1@gmail.com>
> Date: Thu, 16 Aug 2018 21:01:27 +0800
>
>> The tap_queue and the "tap_dev" are loosely coupled, not "macvlan_dev".
>>
>> And I also change one rcu_read_lock's place, seems can reduce rcu
>> critical section a little.
>>
>> Signed-off-by: Wang Jian <jianjian.wang1@gmail.com>
> This patch was corrupted by your email client, for example it turned
> TAB characters into sequences of spaces.
>
> Please fix this, email a test patch to yourself, and do not resend the
> patch to this mailing list until you can successfully extract and
> cleanly apply the test patch you email to yourself.
>
> Thank you.
Besides this, please split it into two patches. The RCU change certainly
does not belong in a "comment fix" patch.
Thanks
* Re: bnxt: card intermittently hanging and dropping link
From: Michael Chan @ 2018-08-17 2:37 UTC
To: Daniel Axtens, Ashwin Thiagarajan, Carl Tung; +Cc: Netdev, jay.vosburgh
In-Reply-To: <87lg9665l4.fsf@linkitivity.dja.id.au>
On Thu, Aug 16, 2018 at 2:09 AM, Daniel Axtens <dja@axtens.net> wrote:
> Hi Michael,
>
>> The main issue is the TX timeout.
>> .....
>>
>>> [ 2682.911693] bnxt_en 0000:3b:00.0 eth4: TX timeout detected, starting reset task!
>>> [ 2683.782496] bnxt_en 0000:3b:00.0 eth4: Resp cmpl intr err msg: 0x51
>>> [ 2683.783061] bnxt_en 0000:3b:00.0 eth4: hwrm_ring_free tx failed. rc:-1
>>> [ 2684.634557] bnxt_en 0000:3b:00.0 eth4: Resp cmpl intr err msg: 0x51
>>> [ 2684.635120] bnxt_en 0000:3b:00.0 eth4: hwrm_ring_free tx failed. rc:-1
>>
>> and it is not recovering.
>>
>> Please provide the output of ethtool -i eth4, which will show the
>> firmware version on the NIC. Let's see if the firmware is too old.
>
> driver: bnxt_en
> version: 1.8.0
> firmware-version: 20.6.151.0/pkg 20.06.05.11
I believe the firmware should be updated. My colleague will contact
you on how to proceed.
Thanks.
* KINDLY REPLY diplmosesd@gmail.com URGENTLY
From: MR MOSES @ 2018-08-17 2:16 UTC
To: Recipients
KINDLY REPLY diplmosesd@gmail.com URGENTLY
* [PATCH v2 net] r8169: don't use MSI-X on RTL8106e
From: Jian-Hong Pan @ 2018-08-17 5:07 UTC
To: Heiner Kallweit, David Miller, nic_swsd, netdev, linux-kernel,
linux
Cc: Jian-Hong Pan
In-Reply-To: <20180815062110.16155-1-jian-hong@endlessm.com>
We found that Ethernet on the ASUS X441UAR does not come back on resume
from suspend when using MSI-X. The chip is an RTL8106e (MAC version 39).
[ 21.848357] libphy: r8169: probed
[ 21.848473] r8169 0000:02:00.0 eth0: RTL8106e, 0c:9d:92:32:67:b4, XID 44900000, IRQ 127
[ 22.518860] r8169 0000:02:00.0 enp2s0: renamed from eth0
[ 29.458041] Generic PHY r8169-200:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE)
[ 63.227398] r8169 0000:02:00.0 enp2s0: Link is Up - 100Mbps/Full - flow control off
[ 124.514648] Generic PHY r8169-200:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE)
Here is the ethernet controller in detail:
02:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8101/2/6E PCI Express Fast/Gigabit Ethernet controller [10ec:8136] (rev 07)
Subsystem: ASUSTeK Computer Inc. RTL810xE PCI Express Fast Ethernet controller [1043:200f]
Flags: bus master, fast devsel, latency 0, IRQ 16
I/O ports at e000 [size=256]
Memory at ef100000 (64-bit, non-prefetchable) [size=4K]
Memory at e0000000 (64-bit, prefetchable) [size=16K]
Capabilities: <access denied>
Kernel driver in use: r8169
Kernel modules: r8169
Falling back to MSI fixes the issue.
Fixes: 6c6aa15fdea5 ("r8169: improve interrupt handling")
Signed-off-by: Jian-Hong Pan <jian-hong@endlessm.com>
---
Changes in v2:
- Make the commit message shorter
- Add "Fixes" tag in the commit message
drivers/net/ethernet/realtek/r8169.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 0d9c3831838f..0efa977c422d 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -7071,17 +7071,20 @@ static int rtl_alloc_irq(struct rtl8169_private *tp)
{
unsigned int flags;
- if (tp->mac_version <= RTL_GIGA_MAC_VER_06) {
+ switch (tp->mac_version) {
+ case RTL_GIGA_MAC_VER_01 ... RTL_GIGA_MAC_VER_06:
RTL_W8(tp, Cfg9346, Cfg9346_Unlock);
RTL_W8(tp, Config2, RTL_R8(tp, Config2) & ~MSIEnable);
RTL_W8(tp, Cfg9346, Cfg9346_Lock);
flags = PCI_IRQ_LEGACY;
- } else if (tp->mac_version == RTL_GIGA_MAC_VER_40) {
+ break;
+ case RTL_GIGA_MAC_VER_39 ... RTL_GIGA_MAC_VER_40:
/* This version was reported to have issues with resume
* from suspend when using MSI-X
*/
flags = PCI_IRQ_LEGACY | PCI_IRQ_MSI;
- } else {
+ break;
+ default:
flags = PCI_IRQ_ALL_TYPES;
}
--
2.11.0
* Re: [PATCH] r8169: don't use MSI-X on RTL8106e
From: Jian-Hong Pan @ 2018-08-17 2:03 UTC
To: Heiner Kallweit; +Cc: David Miller, netdev@vger.kernel.org
In-Reply-To: <ff5ea624-0bd5-ef9e-9c02-deb4c1de601b@gmail.com>
2018-08-17 2:59 GMT+08:00 Heiner Kallweit <hkallweit1@gmail.com>:
>> From: Jian-Hong Pan <jian-hong@endlessm.com>
>>
>> We found that Ethernet on the ASUS X441UAR does not come back on resume
>> from suspend when using MSI-X. The chip is an RTL8106e (MAC version 39).
>>
> The patch itself looks good; only the commit message is wrong in one
> place and a bit long.
>
>> asus@endless:~$ dmesg | grep r8169
>> [ 21.848357] libphy: r8169: probed
>> [ 21.848473] r8169 0000:02:00.0 eth0: RTL8106e, 0c:9d:92:32:67:b4, XID 44900000, IRQ 127
>> [ 22.518860] r8169 0000:02:00.0 enp2s0: renamed from eth0
>> [ 29.458041] Generic PHY r8169-200:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE)
>> [ 63.227398] r8169 0000:02:00.0 enp2s0: Link is Up - 100Mbps/Full - flow control off
>> [ 124.514648] Generic PHY r8169-200:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE)
>>
>> Here is the ethernet controller in detail:
>>
>> asus@endless:~$ sudo lspci -nnvs 02:00.0
>> [sudo] password for asus:
>> 02:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8101/2/6E PCI Express Fast/Gigabit Ethernet controller [10ec:8136] (rev 07)
>> Subsystem: ASUSTeK Computer Inc. RTL810xE PCI Express Fast Ethernet controller [1043:200f]
>> Flags: bus master, fast devsel, latency 0, IRQ 16
>> I/O ports at e000 [size=256]
>> Memory at ef100000 (64-bit, non-prefetchable) [size=4K]
>> Memory at e0000000 (64-bit, prefetchable) [size=16K]
>> Capabilities: [40] Power Management version 3
>> Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
>> Capabilities: [70] Express Endpoint, MSI 01
>> Capabilities: [b0] MSI-X: Enable+ Count=4 Masked-
>> Capabilities: [d0] Vital Product Data
>> Capabilities: [100] Advanced Error Reporting
>> Capabilities: [140] Virtual Channel
>> Capabilities: [160] Device Serial Number 01-00-00-00-36-4c-e0-00
>> Capabilities: [170] Latency Tolerance Reporting
>> Kernel driver in use: r8169
>> Kernel modules: r8169
>>
>> Here is the system interrupt table:
>>
>> asus@endless:~$ cat /proc/interrupts
>>             CPU0       CPU1       CPU2       CPU3
>>    0:         22          0          0          0  IO-APIC  2-edge       timer
>>    1:        157         42          0          0  IO-APIC  1-edge       i8042
>>    8:          0          0          1          0  IO-APIC  8-edge       rtc0
>>    9:         10         13          0          0  IO-APIC  9-fasteoi    acpi
>>   16:          0          0          0          0  IO-APIC  16-fasteoi   i2c_designware.0, i801_smbus
>>   17:       2445          0       3453          0  IO-APIC  17-fasteoi   i2c_designware.1, rtl_pci
>>  109:          2          0          0          1  IO-APIC  109-fasteoi  FTE1200:00
>>  120:          0          0          0          0  PCI-MSI  458752-edge  PCIe PME
>>  121:          0          0          0          0  PCI-MSI  466944-edge  PCIe PME
>>  122:          0          0          0          0  PCI-MSI  468992-edge  PCIe PME
>>  123:       1465          0          0      21263  PCI-MSI  376832-edge  ahci[0000:00:17.0]
>>  124:          0        530          0          0  PCI-MSI  327680-edge  xhci_hcd
>>  125:       5204          0          0          0  PCI-MSI  32768-edge   i915
>>  126:          0          0        149          0  PCI-MSI  514048-edge  snd_hda_intel:card0
>>  127:          0          0        337          0  PCI-MSI  1048576-edge enp2s0
>>  NMI:          0          0          0          0  Non-maskable interrupts
>>  LOC:      45049      39474      38978      46677  Local timer interrupts
>>  SPU:          0          0          0          0  Spurious interrupts
>>  PMI:          0          0          0          0  Performance monitoring interrupts
>>  IWI:        619          8          0          1  IRQ work interrupts
>>  RTR:          6          0          0          0  APIC ICR read retries
>>  RES:       4918       4436       3835       2943  Rescheduling interrupts
>>  CAL:       1399       1478       1598       1465  Function call interrupts
>>  TLB:        608        513        723        559  TLB shootdowns
>>  TRM:          0          0          0          0  Thermal event interrupts
>>  THR:          0          0          0          0  Threshold APIC interrupts
>>  DFR:          0          0          0          0  Deferred Error APIC interrupts
>>  MCE:          0          0          0          0  Machine check exceptions
>>  MCP:          3          4          4          4  Machine check polls
>>  ERR:          0
>>  MIS:          0
>>  PIN:          0          0          0          0  Posted-interrupt notification event
>>  NPI:          0          0          0          0  Nested posted-interrupt event
>>  PIW:          0          0          0          0  Posted-interrupt wakeup event
>>
>> IRQ 127 (PCI-MSI) is used by enp2s0. However, lspci shows MSI disabled
>> and MSI-X enabled, which conflicts with the interrupt table.
>>
> Both types of interrupts, MSI and MSI-X, are listed with the irq chip name
> "PCI-MSI", because MSI-X is treated as a sub-feature of MSI.
> Therefore the output of /proc/interrupts doesn't tell you whether an
> MSI or an MSI-X interrupt is used, and as a consequence there is no such
> conflict.
> Indeed, only lspci shows whether MSI or MSI-X is in use.
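To make Heiner's point concrete, here is a minimal user-space sketch of where that information lives: lspci's "MSI: Enable+/-" and "MSI-X: Enable+/-" flags come from PCI configuration space, specifically bit 0 of Message Control in the MSI capability (ID 0x05) and bit 15 of Message Control in the MSI-X capability (ID 0x11). The function names and the demo capability layout (copied from the lspci dump above) are hypothetical. To try it on real hardware, you would feed pci_irq_mode() a 256-byte dump of /sys/bus/pci/devices/<BDF>/config read as root, since unprivileged reads only return the first 64 bytes, which is also why lspci printed "<access denied>" without sudo.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Standard PCI config-space offsets and capability IDs. */
#define PCI_STATUS           0x06
#define PCI_STATUS_CAP_LIST  0x10   /* Status bit 4: capability list present */
#define PCI_CAPABILITY_LIST  0x34
#define PCI_CAP_ID_PM        0x01
#define PCI_CAP_ID_VPD       0x03
#define PCI_CAP_ID_MSI       0x05
#define PCI_CAP_ID_EXP       0x10
#define PCI_CAP_ID_MSIX      0x11

enum irq_mode { IRQ_MODE_INTX, IRQ_MODE_MSI, IRQ_MODE_MSIX };

/* Config space is little-endian. */
static uint16_t cfg_rd16(const uint8_t *cfg, size_t off)
{
	assert(off + 1 < 256);
	return (uint16_t)(cfg[off] | (cfg[off + 1] << 8));
}

/* Walk the capability list of a 256-byte config-space dump and report
 * which message-signalled mode, if any, has its Enable bit set. */
static enum irq_mode pci_irq_mode(const uint8_t *cfg)
{
	int guard = 48;   /* bound the walk in case the list loops */
	uint8_t pos;

	if (!(cfg_rd16(cfg, PCI_STATUS) & PCI_STATUS_CAP_LIST))
		return IRQ_MODE_INTX;

	pos = cfg[PCI_CAPABILITY_LIST] & 0xfc;
	while (pos && guard-- > 0) {
		/* For MSI and MSI-X, this word is Message Control. */
		uint16_t ctrl = cfg_rd16(cfg, pos + 2);

		if (cfg[pos] == PCI_CAP_ID_MSIX && (ctrl & 0x8000))   /* bit 15 */
			return IRQ_MODE_MSIX;
		if (cfg[pos] == PCI_CAP_ID_MSI && (ctrl & 0x0001))    /* bit 0 */
			return IRQ_MODE_MSI;
		pos = cfg[pos + 1] & 0xfc;                            /* next cap */
	}
	return IRQ_MODE_INTX;
}

/* Hypothetical demo: a capability chain laid out like the RTL8106e dump
 * above (0x40 PM -> 0x50 MSI -> 0x70 PCIe -> 0xb0 MSI-X -> 0xd0 VPD). */
static void build_demo_cfg(uint8_t *cfg, int msi_on, int msix_on)
{
	static const uint8_t chain[][2] = {
		{ 0x40, PCI_CAP_ID_PM },  { 0x50, PCI_CAP_ID_MSI },
		{ 0x70, PCI_CAP_ID_EXP }, { 0xb0, PCI_CAP_ID_MSIX },
		{ 0xd0, PCI_CAP_ID_VPD },
	};
	const size_t n = sizeof(chain) / sizeof(chain[0]);
	size_t i;

	memset(cfg, 0, 256);
	cfg[PCI_STATUS] = PCI_STATUS_CAP_LIST;
	cfg[PCI_CAPABILITY_LIST] = chain[0][0];
	for (i = 0; i < n; i++) {
		cfg[chain[i][0]] = chain[i][1];                          /* cap ID */
		cfg[chain[i][0] + 1] = i + 1 < n ? chain[i + 1][0] : 0;  /* next */
	}
	if (msi_on)
		cfg[0x50 + 2] |= 0x01;   /* MSI Message Control, bit 0 */
	if (msix_on)
		cfg[0xb0 + 3] |= 0x80;   /* MSI-X Message Control, bit 15 */
}
```

With only msix_on set, pci_irq_mode() reports MSI-X, matching the "MSI-X: Enable+" dump taken before the patch; with only msi_on set it reports MSI, matching the dump taken with the patch applied. In both cases /proc/interrupts would simply say PCI-MSI.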
Oh! Thanks for the information; I have learned it and noted it.
I am preparing a v2 patch with a modified commit message.
>> Falling back to MSI fixes the issue.
>>
>> Here is the test result with this patch in dmesg:
>>
>> asus@endless:~$ dmesg | grep r8169
>> [ 22.017477] libphy: r8169: probed
>> [ 22.017735] r8169 0000:02:00.0 eth0: RTL8106e, 0c:9d:92:32:67:b4, XID 44900000, IRQ 127
>> [ 22.041489] r8169 0000:02:00.0 enp2s0: renamed from eth0
>> [ 29.138312] Generic PHY r8169-200:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE)
>> [ 30.927359] r8169 0000:02:00.0 enp2s0: Link is Up - 100Mbps/Full - flow control off
>> [ 289.998077] r8169 0000:02:00.0 enp2s0: Link is Up - 100Mbps/Full - flow control off
>> [ 290.508084] Generic PHY r8169-200:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE)
>> [ 290.745690] r8169 0000:02:00.0 enp2s0: Link is Down
>> [ 292.367717] r8169 0000:02:00.0 enp2s0: Link is Up - 100Mbps/Full - flow control off
>>
>> With this patch, lspci shows MSI enabled and MSI-X disabled:
>>
>> asus@endless:~/linux-net$ sudo lspci -nnvs 02:00.0
>> [sudo] password for asus:
>> 02:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8101/2/6E PCI Express Fast/Gigabit Ethernet controller [10ec:8136] (rev 07)
>> Subsystem: ASUSTeK Computer Inc. RTL810xE PCI Express Fast Ethernet controller [1043:200f]
>> Flags: bus master, fast devsel, latency 0, IRQ 127
>> I/O ports at e000 [size=256]
>> Memory at ef100000 (64-bit, non-prefetchable) [size=4K]
>> Memory at e0000000 (64-bit, prefetchable) [size=16K]
>> Capabilities: [40] Power Management version 3
>> Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+
>> Capabilities: [70] Express Endpoint, MSI 01
>> Capabilities: [b0] MSI-X: Enable- Count=4 Masked-
>> Capabilities: [d0] Vital Product Data
>> Capabilities: [100] Advanced Error Reporting
>> Capabilities: [140] Virtual Channel
>> Capabilities: [160] Device Serial Number 01-00-00-00-36-4c-e0-00
>> Capabilities: [170] Latency Tolerance Reporting
>> Kernel driver in use: r8169
>> Kernel modules: r8169
>>
>> Signed-off-by: Jian-Hong Pan <jian-hong@endlessm.com>
>> ---
>> drivers/net/ethernet/realtek/r8169.c | 9 ++++++---
>> 1 file changed, 6 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
>> index 0d9c3831838f..0efa977c422d 100644
>> --- a/drivers/net/ethernet/realtek/r8169.c
>> +++ b/drivers/net/ethernet/realtek/r8169.c
>> @@ -7071,17 +7071,20 @@ static int rtl_alloc_irq(struct rtl8169_private *tp)
>> {
>> unsigned int flags;
>>
>> - if (tp->mac_version <= RTL_GIGA_MAC_VER_06) {
>> + switch (tp->mac_version) {
>> + case RTL_GIGA_MAC_VER_01 ... RTL_GIGA_MAC_VER_06:
>> RTL_W8(tp, Cfg9346, Cfg9346_Unlock);
>> RTL_W8(tp, Config2, RTL_R8(tp, Config2) & ~MSIEnable);
>> RTL_W8(tp, Cfg9346, Cfg9346_Lock);
>> flags = PCI_IRQ_LEGACY;
>> - } else if (tp->mac_version == RTL_GIGA_MAC_VER_40) {
>> + break;
>> + case RTL_GIGA_MAC_VER_39 ... RTL_GIGA_MAC_VER_40:
>> /* This version was reported to have issues with resume
>> * from suspend when using MSI-X
>> */
>> flags = PCI_IRQ_LEGACY | PCI_IRQ_MSI;
>> - } else {
>> + break;
>> + default:
>> flags = PCI_IRQ_ALL_TYPES;
>> }
>>
>>
>
>
* Re: [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)
From: maowenan @ 2018-08-17 2:48 UTC
To: Michal Kubecek, Greg KH
Cc: dwmw2, netdev, eric.dumazet, edumazet, davem, ycheng, jdw, stable,
Takashi Iwai
In-Reply-To: <20180816160616.u3refk4mqpyqagzi@unicorn.suse.cz>
On 2018/8/17 0:06, Michal Kubecek wrote:
> On Thu, Aug 16, 2018 at 05:24:09PM +0200, Greg KH wrote:
>> On Thu, Aug 16, 2018 at 02:33:56PM +0200, Michal Kubecek wrote:
>>>
>>> Anyway, even at this rate, I only get ~10% of one core (Intel E5-2697).
>>>
>>> What I can see, though, is that with current stable 4.4 code, modified
>>> testcase which sends something like
>>>
>>> 2:3, 3:4, ..., 3001:3002, 3003:3004, 3004:3005, ... 6001:6002, ...
>>>
>>> I quickly eat 6 MB of memory for receive queue of one socket while
>>> earlier 4.4 kernels only take 200-300 KB. I didn't test latest 4.4 with
>>> Takashi's follow-up yet but I'm pretty sure it will help while
>>> preserving nice performance when using the original segmentsmack
>>> testcase (with increased packet ratio).
>>
>> Ok, for now I've applied Takashi's fix to the 4.4 stable queue and will
>> push out a new 4.4-rc later tonight. Can everyone standardize on that
>> and test and let me know if it does, or does not, fix the reported
>> issues?
>
> I did repeat the tests with Takashi's fix and the CPU utilization is
> similar to what we have now, i.e. 3-5% with 10K pkt/s. I could still
> saturate one CPU somewhere around 50K pkt/s but that already requires
> 2.75 MB/s (22 Mb/s) of throughput. (My previous tests with Mao Wenan's
> changes in fact used lower speeds as the change from 128 to 1024 would
> need to be done in two places.)
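Michal's throughput figures are consistent with minimal 55-byte frames. A small sketch of that arithmetic, assuming each packet carries 1 byte of TCP payload behind a 14-byte Ethernet header and option-less 20-byte IPv4 and TCP headers (the frame size is not stated in the message, so this is one reading that happens to reproduce the numbers exactly): 50,000 pkt/s * 55 B = 2,750,000 B/s = 2.75 MB/s, and 2.75 MB/s * 8 = 22 Mb/s. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed frame layout: Ethernet header (no preamble, FCS or inter-frame
 * gap), option-less IPv4 and TCP headers, then the TCP payload. */
#define ETH_HDR_LEN   14u
#define IPV4_HDR_LEN  20u
#define TCP_HDR_LEN   20u

static uint32_t frame_bytes(uint32_t payload)
{
	return ETH_HDR_LEN + IPV4_HDR_LEN + TCP_HDR_LEN + payload;
}

/* Bytes per second for a given packet rate and payload size. */
static uint64_t bytes_per_sec(uint64_t pkts_per_sec, uint32_t payload)
{
	return pkts_per_sec * frame_bytes(payload);
}

/* The figures quoted above: 50K pkt/s of 1-byte segments. */
static void check_quoted_figures(void)
{
	uint64_t bps = bytes_per_sec(50000, 1);

	assert(frame_bytes(1) == 55);
	assert(bps == 2750000);        /* 2.75 MB/s */
	assert(bps * 8 == 22000000);   /* 22 Mb/s */
}
```

By the same assumption, the 10K pkt/s case quoted earlier corresponds to only about 550 kB/s, which shows how little bandwidth the attack needs.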
>
> Where Takashi's patch does help is that it does not prevent collapsing
> of ranges of adjacent segments with total length shorter than ~4KB. It
> took more time to verify: it cannot be checked by watching the socket
> memory consumption with ss as tcp_collapse_ofo_queue isn't called until
> we reach the limits. So I needed to trace when and how tcp_collapse() is
> called with both current stable 4.4 code and one with Takashi's fix.
By default the PoC attacks a Raspberry Pi, whose CPU performance is lower, so the
default parameters are not aggressive. We enlarged the parameters to test on our Intel
Skylake system (which has much higher performance); otherwise CPU usage is not
obviously different with and without the fix, and you can't tell whether the patch
really fixes the issue.
I have run a series of tests here, including a low-rate attack (128 B, 100 ms interval)
and a high-rate attack (1024 B, 10 ms interval), against the original 4.4 kernel, 4.4 with
only Takashi's patch, and 4.4 with only Mao Wenan's patches, measuring the CPU usage of ksoftirqd:
             original   Takashi   Mao Wenan
low rate     3%         2%        2%
high rate    50%        49%       ~10%
So I can't tell whether Takashi's patch really fixes the root issue. I think the root
cause lies in the simple (linked-list) out-of-order queue, and Eric's patch 72cd43ba
("tcp: free batches of packets in tcp_prune_ofo_queue()"), which is already included in
my patch series, can fix this completely. That patch requires changing the simple queue
to an RB tree, which makes searching and dropping packets efficient and avoids large TCP
retransmissions, so CPU usage goes down.
>
>> If not, we can go from there and evaluate this much larger patch
>> series. But let's try the simple thing first.
>
> At high packet rates (say 30K pkt/s and more), we can still saturate the
> CPU. This is also mentioned in the announcement, with the claim that switching
> to an rbtree-based queue would be necessary to fully address it. My tests
> seem to confirm that, but I'm still not sure it is worth backporting
> something this intrusive into stable 4.4.
>
> Michal Kubecek
>
> .
>
* Re: Under what conditions is phy_device "adjust_link()" called?
From: Andrew Lunn @ 2018-08-16 23:33 UTC
To: Florian Fainelli; +Cc: rpjday, netdev
In-Reply-To: <53c990a5-9b27-df58-43c3-517708e97511@gmail.com>
On Thu, Aug 16, 2018 at 11:59:19AM -0700, Florian Fainelli wrote:
> On 08/16/2018 10:26 AM, rpjday@crashcourse.ca wrote:
> >
> > I can see from the documentation that the callback adjust_link() is invoked
> > "for the enet controller to respond to changes in the link state." Is there
> > a specific list of the events that would generate such a change? Are we
> > talking initially opening the device, ifup/ifdown, physically unplugging
> > from the port, some or all of the above?
>
> adjust_link() is typically called on transitions from link UP to DOWN
> and DOWN to UP. This may include the initial configuration of the PHY
> during, e.g., phy_connect(), and then typically when an event occurs that
> requires a reconfiguration of the MAC: link parameters (speed, status,
> duplex, pause) changed.
Just adding to that: if you have a 10G PHY, the MAC-PHY interface mode
can also change, e.g. SGMII for 10/100/1000 and 10GBASE-KR for 10G,
etc.
Andrew
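A common way drivers act on the events described above is to cache the link state they last programmed into the MAC and, inside adjust_link(), compare it with the PHY's current state, reconfiguring only when something relevant changed. The sketch below models that pattern in plain user-space C; struct link_snapshot and link_needs_mac_update() are hypothetical stand-ins for fields of struct phy_device and driver-private state, not kernel API. The interface field reflects Andrew's point that the MAC-PHY mode itself can change with speed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the phy_device fields adjust_link() looks at. */
struct link_snapshot {
	int link;        /* 1 = up, 0 = down */
	int speed;       /* Mb/s */
	int duplex;      /* 1 = full, 0 = half */
	int pause;       /* 1 = pause frames negotiated */
	int interface;   /* MAC-PHY interface mode, e.g. 0 = SGMII, 1 = a 10G mode */
};

/* Compare the PHY's current state with what the MAC was last programmed
 * for. Returns true (and updates the cache) when the MAC needs to be
 * reconfigured: on any UP/DOWN transition or, while the link is up, when
 * speed, duplex, pause or interface mode changed. */
static bool link_needs_mac_update(struct link_snapshot *cached,
				  const struct link_snapshot *now)
{
	bool changed = cached->link != now->link;

	if (now->link &&
	    (cached->speed != now->speed || cached->duplex != now->duplex ||
	     cached->pause != now->pause || cached->interface != now->interface))
		changed = true;

	if (changed)
		*cached = *now;
	return changed;
}

/* Example sequence: link up at 1G over SGMII, a repeated poll with no
 * change, renegotiation to 10G on another interface mode, then link down. */
static void demo_link_sequence(void)
{
	struct link_snapshot cache = { 0 };
	const struct link_snapshot up_1g  = { 1, 1000,  1, 0, 0 };
	const struct link_snapshot up_10g = { 1, 10000, 1, 0, 1 };
	const struct link_snapshot down   = { 0, 0,     0, 0, 1 };

	assert(link_needs_mac_update(&cache, &up_1g));    /* DOWN -> UP */
	assert(!link_needs_mac_update(&cache, &up_1g));   /* nothing changed */
	assert(link_needs_mac_update(&cache, &up_10g));   /* speed + interface */
	assert(link_needs_mac_update(&cache, &down));     /* UP -> DOWN */
	assert(!link_needs_mac_update(&cache, &down));    /* still down */
}
```

Ignoring speed/duplex changes while the link is down is a design choice in this sketch: those values are usually meaningless until the link comes back up, and the UP transition triggers a full reconfiguration anyway.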
* [GIT PULL] 9p updates for 4.19
From: Dominique Martinet @ 2018-08-17 2:33 UTC
To: Linus Torvalds; +Cc: v9fs-developer, linux-kernel, netdev
In-Reply-To: <20180813012347.GA32555@nautica>
Hi Linus,
9p has seen a surge of activity recently, and I've stepped up
to help maintain 9p; please let me know if there are things to
improve in how I do things.
In particular, the gpg key I used to sign the tag is getting rather old
(1024-bit DSA is considered weak nowadays) and I will have a newer key
ready for 4.20, but I didn't think I would be able to gather signatures
for 4.19, so I used this one for now. Sorry for that.
I rebased the branch at the start of the week to add some stable Cc tags,
but all the patches have been in linux-next for two weeks, and all
transports have been tested since the rebase (thanks to Stefano
Stabellini for Xen!)
The following changes since commit 94710cac0ef4ee177a63b5227664b38c95bbf703:
Linux 4.18 (2018-08-12 13:41:04 -0700)
are available in the Git repository at:
git://github.com/martinetd/linux tags/9p-for-4.19-2
for you to fetch changes up to edcd9d977354304cb85aee61c2b96809edce41ed:
net/9p/trans_virtio.c: add null terminal for mount tag (2018-08-13 09:34:58 +0900)
----------------------------------------------------------------
Pull request for inclusion in 4.19 for 9p
This contains mostly fixes (6 to be backported to stable) and a few changes;
here is the breakdown:
* Rework how fids are allocated by replacing custom tracking in a
list with an IDR (f28cdf0430fc)
* For packet-based transports (virtio/rdma) validate that the packet
length matches what the header says (f984579a01d8)
* A few race condition fixes found by syzkaller (9f476d7c540c,
430ac66eb4c5)
* Missing argument check when NULL device is passed in sys_mount
(10aa14527f45)
* A few virtio fixes (23cba9cbde0b, 31934da81036, d28c756caee6)
* Some spelling and style fixes
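As a rough sketch of what validating "that the packet length matches what the header says" involves: every 9P message starts with size[4] type[1] tag[2], where size is little-endian and counts the whole message including the size field itself, so a packet-based transport can reject any message whose size field is smaller than the 7-byte header or differs from the number of bytes actually received. This is not the code from f984579a01d8; p9_check_pdu_len() and the other names below are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define P9_HDRSZ 7   /* size[4] type[1] tag[2] */

static uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Hypothetical check for a packet-based transport: the size field (which
 * counts the whole message, itself included) must cover at least the
 * header and must equal the number of bytes actually received. */
static int p9_check_pdu_len(const uint8_t *buf, size_t received)
{
	uint32_t size;

	if (received < P9_HDRSZ)
		return -EINVAL;           /* cannot even read the header */
	size = get_le32(buf);
	if (size < P9_HDRSZ || size != received)
		return -EINVAL;
	return 0;
}

static void demo_rclunk(void)
{
	/* Rclunk (type 121) has no body: size = 7, tag = 1. */
	const uint8_t rclunk[] = { 7, 0, 0, 0, 121, 1, 0 };
	/* Same header, but the size field claims 100 bytes. */
	const uint8_t liar[] = { 100, 0, 0, 0, 121, 1, 0 };

	assert(p9_check_pdu_len(rclunk, sizeof(rclunk)) == 0);
	assert(p9_check_pdu_len(rclunk, 6) == -EINVAL);
	assert(p9_check_pdu_len(liar, sizeof(liar)) == -EINVAL);
}
```

Checking before any further parsing means later code can trust the size field when it walks the message body.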
----------------------------------------------------------------
Chirantan Ekbote (1):
9p/net: Fix zero-copy path in the 9p virtio transport
Colin Ian King (1):
fs/9p/v9fs.c: fix spelling mistake "Uknown" -> "Unknown"
Jean-Philippe Brucker (1):
net/9p: fix error path of p9_virtio_probe
Matthew Wilcox (4):
9p: Fix comment on smp_wmb
9p: Change p9_fid_create calling convention
9p: Replace the fidlist with an IDR
9p: Embed wait_queue_head into p9_req_t
Souptick Joarder (1):
fs/9p/vfs_file.c: use new return type vm_fault_t
Stephen Hemminger (1):
9p: fix whitespace issues
Tomas Bortoli (5):
net/9p/client.c: version pointer uninitialized
net/9p/trans_fd.c: fix race-condition by flushing workqueue before the kfree()
net/9p/trans_fd.c: fix race by holding the lock
9p: validate PDU length
9p: fix multiple NULL-pointer-dereferences
jiangyiwen (2):
net/9p/virtio: Fix hard lockup in req_done
9p/virtio: fix off-by-one error in sg list bounds check
piaojun (5):
net/9p/client.c: add missing '\n' at the end of p9_debug()
9p/net/protocol.c: return -ENOMEM when kmalloc() failed
net/9p/trans_virtio.c: fix some spell mistakes in comments
fs/9p/xattr.c: catch the error of p9_client_clunk when setting xattr failed
net/9p/trans_virtio.c: add null terminal for mount tag
fs/9p/v9fs.c | 2 +-
fs/9p/vfs_file.c | 2 +-
fs/9p/xattr.c | 6 ++-
include/net/9p/client.h | 11 ++---
net/9p/client.c | 119 ++++++++++++++++++++----------------------------
net/9p/protocol.c | 2 +-
net/9p/trans_fd.c | 22 ++++++---
net/9p/trans_rdma.c | 4 ++
net/9p/trans_virtio.c | 66 ++++++++++++++++-----------
net/9p/trans_xen.c | 3 ++
net/9p/util.c | 1 -
11 files changed, 122 insertions(+), 116 deletions(-)
--
Dominique Martinet
* Re: [bpf-next RFC 1/3] flow_dissector: implements flow dissector BPF hook
From: Petar Penkov @ 2018-08-16 23:14 UTC
To: Song Liu
Cc: Petar Penkov, Networking, David S . Miller, Alexei Starovoitov,
Daniel Borkmann, simon.horman, Willem de Bruijn
In-Reply-To: <CAPhsuW5pOYdTx+w06=xW0XMPkjx62RsJ2EP5iJZCx_3QQZb=xw@mail.gmail.com>
On Thu, Aug 16, 2018 at 3:40 PM, Song Liu <liu.song.a23@gmail.com> wrote:
>
> On Thu, Aug 16, 2018 at 9:44 AM, Petar Penkov <peterpenkov96@gmail.com> wrote:
> > From: Petar Penkov <ppenkov@google.com>
> >
> > Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
> > attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
> > path. The BPF program is kept as a global variable so it is
> > accessible to all flow dissectors.
> >
> > Signed-off-by: Petar Penkov <ppenkov@google.com>
> > Signed-off-by: Willem de Bruijn <willemb@google.com>
> > ---
> > include/linux/bpf_types.h | 1 +
> > include/linux/skbuff.h | 7 +
> > include/net/flow_dissector.h | 16 +++
> > include/uapi/linux/bpf.h | 14 +-
> > kernel/bpf/syscall.c | 8 ++
> > kernel/bpf/verifier.c | 2 +
> > net/core/filter.c | 157 ++++++++++++++++++++++
> > net/core/flow_dissector.c | 76 +++++++++++
> > tools/bpf/bpftool/prog.c | 1 +
> > tools/include/uapi/linux/bpf.h | 5 +-
> > tools/lib/bpf/libbpf.c | 2 +
> > tools/testing/selftests/bpf/bpf_helpers.h | 3 +
> > 12 files changed, 290 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> > index cd26c090e7c0..22083712dd18 100644
> > --- a/include/linux/bpf_types.h
> > +++ b/include/linux/bpf_types.h
> > @@ -32,6 +32,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2)
> > #ifdef CONFIG_INET
> > BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport)
> > #endif
> > +BPF_PROG_TYPE(BPF_PROG_TYPE_FLOW_DISSECTOR, flow_dissector)
> >
> > BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
> > BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index 17a13e4785fc..ce0e863f02a2 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -243,6 +243,8 @@ struct scatterlist;
> > struct pipe_inode_info;
> > struct iov_iter;
> > struct napi_struct;
> > +struct bpf_prog;
> > +union bpf_attr;
> >
> > #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
> > struct nf_conntrack {
> > @@ -1192,6 +1194,11 @@ void skb_flow_dissector_init(struct flow_dissector *flow_dissector,
> > const struct flow_dissector_key *key,
> > unsigned int key_count);
> >
> > +int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr,
> > + struct bpf_prog *prog);
> > +
> > +int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr);
> > +
> > bool __skb_flow_dissect(const struct sk_buff *skb,
> > struct flow_dissector *flow_dissector,
> > void *target_container,
> > diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
> > index 6a4586dcdede..edb919d320c1 100644
> > --- a/include/net/flow_dissector.h
> > +++ b/include/net/flow_dissector.h
> > @@ -270,6 +270,22 @@ __be32 flow_get_u32_dst(const struct flow_keys *flow);
> > extern struct flow_dissector flow_keys_dissector;
> > extern struct flow_dissector flow_keys_basic_dissector;
> >
> > +/* struct bpf_flow_dissect_cb:
> > + *
> > + * This struct is used to pass parameters to BPF programs of type
> > + * BPF_PROG_TYPE_FLOW_DISSECTOR. Before such a program is run, the caller sets
> > + * the control block of the skb to be a struct of this type. The first field is
> > + * used to communicate the next header offset between the BPF programs and the
> > + * first value of it is passed from the kernel. The last two fields are used for
> > + * writing out flow keys.
> > + */
> > +struct bpf_flow_dissect_cb {
> > + u16 nhoff;
> > + u16 unused;
> > + void *target_container;
> > + struct flow_dissector *flow_dissector;
> > +};
> > +
> > /* struct flow_keys_digest:
> > *
> > * This structure is used to hold a digest of the full flow keys. This is a
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 66917a4eba27..8bc0fdab685d 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -152,6 +152,7 @@ enum bpf_prog_type {
> > BPF_PROG_TYPE_LWT_SEG6LOCAL,
> > BPF_PROG_TYPE_LIRC_MODE2,
> > BPF_PROG_TYPE_SK_REUSEPORT,
> > + BPF_PROG_TYPE_FLOW_DISSECTOR,
> > };
> >
> > enum bpf_attach_type {
> > @@ -172,6 +173,7 @@ enum bpf_attach_type {
> > BPF_CGROUP_UDP4_SENDMSG,
> > BPF_CGROUP_UDP6_SENDMSG,
> > BPF_LIRC_MODE2,
> > + BPF_FLOW_DISSECTOR,
> > __MAX_BPF_ATTACH_TYPE
> > };
> >
> > @@ -2141,6 +2143,15 @@ union bpf_attr {
> > * request in the skb.
> > * Return
> > * 0 on success, or a negative error in case of failure.
> > + *
> > + * int bpf_flow_dissector_write_keys(const struct sk_buff *skb, const void *from, u32 len, enum flow_dissector_key_id key_id)
> > + * Description
> > + * Try to write *len* bytes from the source pointer into the offset
> > + * of the key with id *key_id*. If *len* is different from the
> > + * size of the key, an error is returned. If the key is not used,
> > + * this function exits with no effect and code 0.
> > + * Return
> > + * 0 on success, negative error in case of failure.
> > */
> > #define __BPF_FUNC_MAPPER(FN) \
> > FN(unspec), \
> > @@ -2226,7 +2237,8 @@ union bpf_attr {
> > FN(get_current_cgroup_id), \
> > FN(get_local_storage), \
> > FN(sk_select_reuseport), \
> > - FN(skb_ancestor_cgroup_id),
> > + FN(skb_ancestor_cgroup_id), \
> > + FN(flow_dissector_write_keys),
> >
> > /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> > * function eBPF program intends to call
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 43727ed0d94a..a06568841a92 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -1616,6 +1616,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
> > case BPF_LIRC_MODE2:
> > ptype = BPF_PROG_TYPE_LIRC_MODE2;
> > break;
> > + case BPF_FLOW_DISSECTOR:
> > + ptype = BPF_PROG_TYPE_FLOW_DISSECTOR;
> > + break;
> > default:
> > return -EINVAL;
> > }
> > @@ -1637,6 +1640,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
> > case BPF_PROG_TYPE_LIRC_MODE2:
> > ret = lirc_prog_attach(attr, prog);
> > break;
> > + case BPF_PROG_TYPE_FLOW_DISSECTOR:
> > + ret = skb_flow_dissector_bpf_prog_attach(attr, prog);
> > + break;
> > default:
> > ret = cgroup_bpf_prog_attach(attr, ptype, prog);
> > }
> > @@ -1689,6 +1695,8 @@ static int bpf_prog_detach(const union bpf_attr *attr)
> > return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_SKB, NULL);
> > case BPF_LIRC_MODE2:
> > return lirc_prog_detach(attr);
> > + case BPF_FLOW_DISSECTOR:
> > + return skb_flow_dissector_bpf_prog_detach(attr);
> > default:
> > return -EINVAL;
> > }
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index ca90679a7fe5..6d3f268fa8e0 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -1321,6 +1321,7 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
> > case BPF_PROG_TYPE_LWT_XMIT:
> > case BPF_PROG_TYPE_SK_SKB:
> > case BPF_PROG_TYPE_SK_MSG:
> > + case BPF_PROG_TYPE_FLOW_DISSECTOR:
> > if (meta)
> > return meta->pkt_access;
> >
> > @@ -3976,6 +3977,7 @@ static bool may_access_skb(enum bpf_prog_type type)
> > case BPF_PROG_TYPE_SOCKET_FILTER:
> > case BPF_PROG_TYPE_SCHED_CLS:
> > case BPF_PROG_TYPE_SCHED_ACT:
> > + case BPF_PROG_TYPE_FLOW_DISSECTOR:
> > return true;
> > default:
> > return false;
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index fd423ce3da34..03d3037e6508 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -4820,6 +4820,111 @@ bool bpf_helper_changes_pkt_data(void *func)
> > return false;
> > }
> >
> > +BPF_CALL_4(bpf_flow_dissector_write_keys, const struct sk_buff *, skb,
> > + const void *, from, u32, len, enum flow_dissector_key_id, key_id)
> > +{
> > + struct bpf_flow_dissect_cb *cb;
> > + void *dest;
> > +
> > + cb = (struct bpf_flow_dissect_cb *)bpf_skb_cb(skb);
> > +
> > + /* Make sure the dissector actually uses the key. It is not an error if
> > + * it does not, but we should not continue past this point in that case
> > + */
> > + if (!dissector_uses_key(cb->flow_dissector, key_id))
> > + return 0;
> > +
> > + /* Make sure the length is correct */
> > + switch (key_id) {
> > + case FLOW_DISSECTOR_KEY_CONTROL:
> > + case FLOW_DISSECTOR_KEY_ENC_CONTROL:
> > + if (len != sizeof(struct flow_dissector_key_control))
> > + return -EINVAL;
> > + break;
> > + case FLOW_DISSECTOR_KEY_BASIC:
> > + if (len != sizeof(struct flow_dissector_key_basic))
> > + return -EINVAL;
> > + break;
> > + case FLOW_DISSECTOR_KEY_IPV4_ADDRS:
> > + case FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS:
> > + if (len != sizeof(struct flow_dissector_key_ipv4_addrs))
> > + return -EINVAL;
> > + break;
> > + case FLOW_DISSECTOR_KEY_IPV6_ADDRS:
> > + case FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS:
> > + if (len != sizeof(struct flow_dissector_key_ipv6_addrs))
> > + return -EINVAL;
> > + break;
> > + case FLOW_DISSECTOR_KEY_ICMP:
> > + if (len != sizeof(struct flow_dissector_key_icmp))
> > + return -EINVAL;
> > + break;
> > + case FLOW_DISSECTOR_KEY_PORTS:
> > + case FLOW_DISSECTOR_KEY_ENC_PORTS:
> > + if (len != sizeof(struct flow_dissector_key_ports))
> > + return -EINVAL;
> > + break;
> > + case FLOW_DISSECTOR_KEY_ETH_ADDRS:
> > + if (len != sizeof(struct flow_dissector_key_eth_addrs))
> > + return -EINVAL;
> > + break;
> > + case FLOW_DISSECTOR_KEY_TIPC:
> > + if (len != sizeof(struct flow_dissector_key_tipc))
> > + return -EINVAL;
> > + break;
> > + case FLOW_DISSECTOR_KEY_ARP:
> > + if (len != sizeof(struct flow_dissector_key_arp))
> > + return -EINVAL;
> > + break;
> > + case FLOW_DISSECTOR_KEY_VLAN:
> > + case FLOW_DISSECTOR_KEY_CVLAN:
> > + if (len != sizeof(struct flow_dissector_key_vlan))
> > + return -EINVAL;
> > + break;
> > + case FLOW_DISSECTOR_KEY_FLOW_LABEL:
> > + if (len != sizeof(struct flow_dissector_key_tags))
> > + return -EINVAL;
> > + break;
> > + case FLOW_DISSECTOR_KEY_GRE_KEYID:
> > + case FLOW_DISSECTOR_KEY_ENC_KEYID:
> > + case FLOW_DISSECTOR_KEY_MPLS_ENTROPY:
> > + if (len != sizeof(struct flow_dissector_key_keyid))
> > + return -EINVAL;
> > + break;
> > + case FLOW_DISSECTOR_KEY_MPLS:
> > + if (len != sizeof(struct flow_dissector_key_mpls))
> > + return -EINVAL;
> > + break;
> > + case FLOW_DISSECTOR_KEY_TCP:
> > + if (len != sizeof(struct flow_dissector_key_tcp))
> > + return -EINVAL;
> > + break;
> > + case FLOW_DISSECTOR_KEY_IP:
> > + case FLOW_DISSECTOR_KEY_ENC_IP:
> > + if (len != sizeof(struct flow_dissector_key_ip))
> > + return -EINVAL;
> > + break;
> > + default:
> > + return -EINVAL;
> > + }
> > +
> > + dest = skb_flow_dissector_target(cb->flow_dissector, key_id,
> > + cb->target_container);
> > +
> > + memcpy(dest, from, len);
> > + return 0;
> > +}
> > +
> > +static const struct bpf_func_proto bpf_flow_dissector_write_keys_proto = {
> > + .func = bpf_flow_dissector_write_keys,
> > + .gpl_only = false,
> > + .ret_type = RET_INTEGER,
> > + .arg1_type = ARG_PTR_TO_CTX,
> > + .arg2_type = ARG_PTR_TO_MEM,
> > + .arg3_type = ARG_CONST_SIZE,
> > + .arg4_type = ARG_ANYTHING,
> > +};
> > +
> > static const struct bpf_func_proto *
> > bpf_base_func_proto(enum bpf_func_id func_id)
> > {
> > @@ -5100,6 +5205,19 @@ sk_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> > }
> > }
> >
> > +static const struct bpf_func_proto *
> > +flow_dissector_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> > +{
> > + switch (func_id) {
> > + case BPF_FUNC_skb_load_bytes:
> > + return &bpf_skb_load_bytes_proto;
> > + case BPF_FUNC_flow_dissector_write_keys:
> > + return &bpf_flow_dissector_write_keys_proto;
> > + default:
> > + return bpf_base_func_proto(func_id);
> > + }
> > +}
> > +
> > static const struct bpf_func_proto *
> > lwt_out_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> > {
> > @@ -5738,6 +5856,35 @@ static bool sk_msg_is_valid_access(int off, int size,
> > return true;
> > }
> >
> > +static bool flow_dissector_is_valid_access(int off, int size,
> > + enum bpf_access_type type,
> > + const struct bpf_prog *prog,
> > + struct bpf_insn_access_aux *info)
> > +{
> > + if (type == BPF_WRITE) {
> > + switch (off) {
> > + case bpf_ctx_range(struct __sk_buff, cb[0]):
> > + break;
> > + default:
> > + return false;
> > + }
> > + }
> > +
> > + switch (off) {
> > + case bpf_ctx_range(struct __sk_buff, data):
> > + info->reg_type = PTR_TO_PACKET;
> > + break;
> > + case bpf_ctx_range(struct __sk_buff, data_end):
> > + info->reg_type = PTR_TO_PACKET_END;
> > + break;
> > + case bpf_ctx_range_till(struct __sk_buff, family, local_port):
> > + case bpf_ctx_range_till(struct __sk_buff, cb[1], cb[4]):
> > + return false;
> > + }
> > +
> > + return bpf_skb_is_valid_access(off, size, type, prog, info);
> > +}
> > +
> > static u32 bpf_convert_ctx_access(enum bpf_access_type type,
> > const struct bpf_insn *si,
> > struct bpf_insn *insn_buf,
> > @@ -6995,6 +7142,16 @@ const struct bpf_verifier_ops sk_msg_verifier_ops = {
> > const struct bpf_prog_ops sk_msg_prog_ops = {
> > };
> >
> > +const struct bpf_verifier_ops flow_dissector_verifier_ops = {
> > + .get_func_proto = flow_dissector_func_proto,
> > + .is_valid_access = flow_dissector_is_valid_access,
> > + .convert_ctx_access = bpf_convert_ctx_access,
> > + .gen_ld_abs = bpf_gen_ld_abs,
> > +};
> > +
> > +const struct bpf_prog_ops flow_dissector_prog_ops = {
> > +};
> > +
> > int sk_detach_filter(struct sock *sk)
> > {
> > int ret = -ENOENT;
> > diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
> > index ce9eeeb7c024..767daa231f04 100644
> > --- a/net/core/flow_dissector.c
> > +++ b/net/core/flow_dissector.c
> > @@ -25,6 +25,11 @@
> > #include <net/flow_dissector.h>
> > #include <scsi/fc/fc_fcoe.h>
> > #include <uapi/linux/batadv_packet.h>
> > +#include <linux/bpf.h>
> > +
> > +/* BPF program accessible by all flow dissectors */
> > +static struct bpf_prog __rcu *flow_dissector_prog;
> > +static DEFINE_MUTEX(flow_dissector_mutex);
> >
> > static void dissector_set_key(struct flow_dissector *flow_dissector,
> > enum flow_dissector_key_id key_id)
> > @@ -62,6 +67,40 @@ void skb_flow_dissector_init(struct flow_dissector *flow_dissector,
> > }
> > EXPORT_SYMBOL(skb_flow_dissector_init);
> >
> > +int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr,
> > + struct bpf_prog *prog)
> > +{
> > + struct bpf_prog *attached;
> > +
> > + mutex_lock(&flow_dissector_mutex);
> > + attached = rcu_dereference_protected(flow_dissector_prog,
> > + lockdep_is_held(&flow_dissector_mutex));
> > + if (attached) {
> > + /* Only one BPF program can be attached at a time */
> > + mutex_unlock(&flow_dissector_mutex);
> > + return -EEXIST;
> > + }
> > + rcu_assign_pointer(flow_dissector_prog, prog);
> > + mutex_unlock(&flow_dissector_mutex);
> > + return 0;
> > +}
> > +
> > +int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr)
> > +{
> > + struct bpf_prog *attached;
> > +
> > + mutex_lock(&flow_dissector_mutex);
> > + attached = rcu_dereference_protected(flow_dissector_prog,
> > + lockdep_is_held(&flow_dissector_mutex));
> > + if (!flow_dissector_prog) {
> > + mutex_unlock(&flow_dissector_mutex);
> > + return -EINVAL;
> > + }
> > + bpf_prog_put(attached);
> > + RCU_INIT_POINTER(flow_dissector_prog, NULL);
> > + mutex_unlock(&flow_dissector_mutex);
> > + return 0;
> > +}
> > /**
> > * skb_flow_get_be16 - extract be16 entity
> > * @skb: sk_buff to extract from
> > @@ -619,6 +658,7 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
> > struct flow_dissector_key_vlan *key_vlan;
> > enum flow_dissect_ret fdret;
> > enum flow_dissector_key_id dissector_vlan = FLOW_DISSECTOR_KEY_MAX;
> > + struct bpf_prog *attached;
> > int num_hdrs = 0;
> > u8 ip_proto = 0;
> > bool ret;
> > @@ -658,6 +698,42 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
> > FLOW_DISSECTOR_KEY_BASIC,
> > target_container);
> >
> > + rcu_read_lock();
> > + attached = rcu_dereference(flow_dissector_prog);
> > + if (attached) {
> > + /* Note that even though the const qualifier is discarded
> > + * throughout the execution of the BPF program, all changes(the
> > + * control block) are reverted after the BPF program returns.
> > + * Therefore, __skb_flow_dissect does not alter the skb.
> > + */
> > + struct bpf_flow_dissect_cb *cb;
> > + u8 cb_saved[BPF_SKB_CB_LEN];
> > + u32 result;
> > +
> > + cb = (struct bpf_flow_dissect_cb *)(bpf_skb_cb((struct sk_buff *)skb));
> > +
> > + /* Save Control Block */
> > + memcpy(cb_saved, cb, sizeof(cb_saved));
> > + memset(cb, 0, sizeof(cb_saved));
> > +
> > + /* Pass parameters to the BPF program */
> > + cb->nhoff = nhoff;
> > + cb->target_container = target_container;
> > + cb->flow_dissector = flow_dissector;
> > +
> > + bpf_compute_data_pointers((struct sk_buff *)skb);
> > + result = BPF_PROG_RUN(attached, skb);
> > +
> > + /* Restore state */
> > + memcpy(cb, cb_saved, sizeof(cb_saved));
> > +
> > + key_control->thoff = min_t(u16, key_control->thoff,
> > + skb ? skb->len : hlen);
> > + rcu_read_unlock();
> > + return result == BPF_OK;
> > + }
>
> If the BPF program cannot handle a certain protocol, should we fall back
> to the built-in logic? Otherwise, every BPF program needs to have some
> code for all protocols.
>
> Song
I believe that if we fall back to the built-in logic, we lose all the security
guarantees that BPF provides, which is why the code does not support a
fallback.
Petar
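Song's question and Petar's answer come down to what the caller does with the program's verdict. A minimal userspace sketch of the two policies follows. `VERDICT_UNHANDLED`, `builtin_dissect()`, `tcp_only_hook()` and the `allow_fallback` switch are hypothetical names for illustration only; the RFC itself just returns `result == BPF_OK` and never falls back.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical verdicts. The RFC only distinguishes BPF_OK from anything
 * else; VERDICT_UNHANDLED is an assumption used to model a fallback. */
enum verdict { VERDICT_OK, VERDICT_DROP, VERDICT_UNHANDLED };

typedef enum verdict (*dissect_hook)(int ip_proto);

/* Stand-in for the built-in C dissector: knows TCP (6) and UDP (17). */
bool builtin_dissect(int ip_proto)
{
	return ip_proto == 6 || ip_proto == 17;
}

/* Example program that only understands TCP. */
enum verdict tcp_only_hook(int ip_proto)
{
	return ip_proto == 6 ? VERDICT_OK : VERDICT_UNHANDLED;
}

bool dissect(int ip_proto, dissect_hook hook, bool allow_fallback)
{
	assert(ip_proto >= 0 && ip_proto <= 255); /* IP protocol is 8 bits */

	if (!hook)
		return builtin_dissect(ip_proto);

	enum verdict v = hook(ip_proto);

	/* Fallback policy (Song's suggestion): protocols the program does not
	 * handle are parsed by the built-in C code again, which is the
	 * concern Petar raises. */
	if (v == VERDICT_UNHANDLED && allow_fallback)
		return builtin_dissect(ip_proto);

	/* Strict policy (the RFC): the program's verdict is final. */
	return v == VERDICT_OK;
}
```

With the strict policy, a program that skips UDP makes UDP flows undissectable; with the fallback, the program only needs code for the protocols it wants to override.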
* Re: [offlist] Re: Crash in netlink/sk_filter_trim_cap on ARMv7 on 4.18rc1
From: Russell King - ARM Linux @ 2018-08-16 22:58 UTC (permalink / raw)
To: Marc Haber
Cc: Peter Robinson, linux-arm-kernel, netdev, labbott, Eric Dumazet,
Daniel Borkmann
In-Reply-To: <20180816203515.GA7688@torres.zugschlus.de>
On Thu, Aug 16, 2018 at 10:35:16PM +0200, Marc Haber wrote:
> On Mon, Jun 25, 2018 at 05:41:27PM +0100, Peter Robinson wrote:
> > So with that and the other fix there was no improvement; with those
> > and the BPF JIT disabled it works. I'm not sure if the two patches
> > have any effect with the JIT disabled, though.
>
> I can confirm the crash with the released 4.18.1 on Banana Pi, and I can
> also confirm that disabling BPF JIT makes the Banana Pi work again.
Hi,
I'm afraid that the information in the crash dumps is insufficient
to work out very much about these crashes.
We need a recipe (kernel configuration and what userspace is doing)
so that it's possible to recreate the crash, or we need responses
to requests for information: I requested the disassembly of
sk_filter_trim_cap and the BPF code dump via setting a sysctl back
in early July. Without this, as I say, I don't see how this problem
can be progressed.
If the problem is at boot, one way to set the sysctl would be to
hack the kernel and explicitly initialise the sysctl to '2', or
boot with init=/bin/sh, then manually mount /proc, set the sysctl,
and then "exec /sbin/init" from that shell. (Remember there's no
job control in that shell, so ^z, ^c, etc. do not work.)
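Assuming the sysctl in question is `net.core.bpf_jit_enable` (an assumption; a value of 2 enables the JIT and asks it to emit traces, including the generated image, to the kernel log), the "set the sysctl" step of the init=/bin/sh recipe amounts to writing "2" to its /proc file before `exec /sbin/init`. A minimal sketch of that write; `write_sysctl()` is a hypothetical helper name:

```c
#include <assert.h>
#include <stdio.h>

/* Assumed /proc location of net.core.bpf_jit_enable. */
#define BPF_JIT_SYSCTL "/proc/sys/net/core/bpf_jit_enable"

/* Write VALUE to the sysctl file at PATH, like `echo VALUE > PATH`.
 * Returns 0 on success, -1 on error. */
int write_sysctl(const char *path, const char *value)
{
	assert(path != NULL && value != NULL);

	FILE *f = fopen(path, "w");
	if (!f)
		return -1;

	int ok = fputs(value, f) != EOF;

	/* With stdio buffering the actual write(2), and any error the
	 * kernel returns for it, usually happens at fclose() time. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

In the recipe above it would be called as `write_sysctl(BPF_JIT_SYSCTL, "2\n")` once /proc is mounted, running as root.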
--
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 13.8Mbps down 630kbps up
According to speedtest.net: 13Mbps down 490kbps up
* Re: [bpf-next RFC 1/3] flow_dissector: implements flow dissector BPF hook
From: Song Liu @ 2018-08-16 22:40 UTC (permalink / raw)
To: Petar Penkov
Cc: Networking, David S . Miller, Alexei Starovoitov, Daniel Borkmann,
simon.horman, Petar Penkov, Willem de Bruijn
In-Reply-To: <20180816164423.14368-2-peterpenkov96@gmail.com>
On Thu, Aug 16, 2018 at 9:44 AM, Petar Penkov <peterpenkov96@gmail.com> wrote:
> From: Petar Penkov <ppenkov@google.com>
>
> Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
> attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
> path. The BPF program is kept as a global variable so it is
> accessible to all flow dissectors.
>
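The commit message describes a single, global program slot. Reading the attach/detach hunks further down, attach returns -EEXIST when a program is already present and detach returns -EINVAL when none is. A userspace sketch of those semantics, swapping the patch's mutex plus RCU pointer for a C11 compare-and-swap (a substitution for illustration, not the kernel mechanism); `struct prog` and the function names are hypothetical:

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bpf_prog. */
struct prog {
	int id;
};

/* One global slot shared by every caller, like flow_dissector_prog. */
static _Atomic(struct prog *) global_prog;

/* Publish P if the slot is empty; -EEXIST otherwise (one program at a time). */
int prog_attach(struct prog *p)
{
	struct prog *expected = NULL;

	assert(p != NULL);
	/* Release ordering makes *p visible before the pointer itself,
	 * the role rcu_assign_pointer() plays in the patch. */
	if (!atomic_compare_exchange_strong_explicit(&global_prog, &expected, p,
						     memory_order_release,
						     memory_order_relaxed))
		return -EEXIST;
	return 0;
}

/* Empty the slot; -EINVAL if nothing was attached. The old program is
 * handed back so the caller can release it only once no reader can still
 * be using it (in the kernel, e.g. after an RCU grace period). */
int prog_detach(struct prog **old)
{
	struct prog *prev = atomic_exchange_explicit(&global_prog, NULL,
						     memory_order_acq_rel);
	if (!prev)
		return -EINVAL;
	if (old)
		*old = prev;
	return 0;
}

/* Reader side, the counterpart of rcu_dereference(). */
struct prog *prog_current(void)
{
	return atomic_load_explicit(&global_prog, memory_order_acquire);
}
```

The sketch clears the slot before handing the old program back, so a new reader can never pick up a program that is about to be released.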
> Signed-off-by: Petar Penkov <ppenkov@google.com>
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> ---
> include/linux/bpf_types.h | 1 +
> include/linux/skbuff.h | 7 +
> include/net/flow_dissector.h | 16 +++
> include/uapi/linux/bpf.h | 14 +-
> kernel/bpf/syscall.c | 8 ++
> kernel/bpf/verifier.c | 2 +
> net/core/filter.c | 157 ++++++++++++++++++++++
> net/core/flow_dissector.c | 76 +++++++++++
> tools/bpf/bpftool/prog.c | 1 +
> tools/include/uapi/linux/bpf.h | 5 +-
> tools/lib/bpf/libbpf.c | 2 +
> tools/testing/selftests/bpf/bpf_helpers.h | 3 +
> 12 files changed, 290 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index cd26c090e7c0..22083712dd18 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -32,6 +32,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2)
> #ifdef CONFIG_INET
> BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport)
> #endif
> +BPF_PROG_TYPE(BPF_PROG_TYPE_FLOW_DISSECTOR, flow_dissector)
>
> BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
> BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 17a13e4785fc..ce0e863f02a2 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -243,6 +243,8 @@ struct scatterlist;
> struct pipe_inode_info;
> struct iov_iter;
> struct napi_struct;
> +struct bpf_prog;
> +union bpf_attr;
>
> #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
> struct nf_conntrack {
> @@ -1192,6 +1194,11 @@ void skb_flow_dissector_init(struct flow_dissector *flow_dissector,
> const struct flow_dissector_key *key,
> unsigned int key_count);
>
> +int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr,
> + struct bpf_prog *prog);
> +
> +int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr);
> +
> bool __skb_flow_dissect(const struct sk_buff *skb,
> struct flow_dissector *flow_dissector,
> void *target_container,
> diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
> index 6a4586dcdede..edb919d320c1 100644
> --- a/include/net/flow_dissector.h
> +++ b/include/net/flow_dissector.h
> @@ -270,6 +270,22 @@ __be32 flow_get_u32_dst(const struct flow_keys *flow);
> extern struct flow_dissector flow_keys_dissector;
> extern struct flow_dissector flow_keys_basic_dissector;
>
> +/* struct bpf_flow_dissect_cb:
> + *
> + * This struct is used to pass parameters to BPF programs of type
> + * BPF_PROG_TYPE_FLOW_DISSECTOR. Before such a program is run, the caller sets
> + * the control block of the skb to be a struct of this type. The first field is
> + * used to communicate the next header offset between the BPF programs and the
> + * first value of it is passed from the kernel. The last two fields are used for
> + * writing out flow keys.
> + */
> +struct bpf_flow_dissect_cb {
> + u16 nhoff;
> + u16 unused;
> + void *target_container;
> + struct flow_dissector *flow_dissector;
> +};
> +
> /* struct flow_keys_digest:
> *
> * This structure is used to hold a digest of the full flow keys. This is a
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 66917a4eba27..8bc0fdab685d 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -152,6 +152,7 @@ enum bpf_prog_type {
> BPF_PROG_TYPE_LWT_SEG6LOCAL,
> BPF_PROG_TYPE_LIRC_MODE2,
> BPF_PROG_TYPE_SK_REUSEPORT,
> + BPF_PROG_TYPE_FLOW_DISSECTOR,
> };
>
> enum bpf_attach_type {
> @@ -172,6 +173,7 @@ enum bpf_attach_type {
> BPF_CGROUP_UDP4_SENDMSG,
> BPF_CGROUP_UDP6_SENDMSG,
> BPF_LIRC_MODE2,
> + BPF_FLOW_DISSECTOR,
> __MAX_BPF_ATTACH_TYPE
> };
>
> @@ -2141,6 +2143,15 @@ union bpf_attr {
> * request in the skb.
> * Return
> * 0 on success, or a negative error in case of failure.
> + *
> + * int bpf_flow_dissector_write_keys(const struct sk_buff *skb, const void *from, u32 len, enum flow_dissector_key_id key_id)
> + * Description
> + * Try to write *len* bytes from the source pointer into the offset
> + * of the key with id *key_id*. If *len* is different from the
> + * size of the key, an error is returned. If the key is not used,
> + * this function exits with no effect and code 0.
> + * Return
> + * 0 on success, negative error in case of failure.
> */
> #define __BPF_FUNC_MAPPER(FN) \
> FN(unspec), \
> @@ -2226,7 +2237,8 @@ union bpf_attr {
> FN(get_current_cgroup_id), \
> FN(get_local_storage), \
> FN(sk_select_reuseport), \
> - FN(skb_ancestor_cgroup_id),
> + FN(skb_ancestor_cgroup_id), \
> + FN(flow_dissector_write_keys),
>
> /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> * function eBPF program intends to call
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 43727ed0d94a..a06568841a92 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1616,6 +1616,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
> case BPF_LIRC_MODE2:
> ptype = BPF_PROG_TYPE_LIRC_MODE2;
> break;
> + case BPF_FLOW_DISSECTOR:
> + ptype = BPF_PROG_TYPE_FLOW_DISSECTOR;
> + break;
> default:
> return -EINVAL;
> }
> @@ -1637,6 +1640,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
> case BPF_PROG_TYPE_LIRC_MODE2:
> ret = lirc_prog_attach(attr, prog);
> break;
> + case BPF_PROG_TYPE_FLOW_DISSECTOR:
> + ret = skb_flow_dissector_bpf_prog_attach(attr, prog);
> + break;
> default:
> ret = cgroup_bpf_prog_attach(attr, ptype, prog);
> }
> @@ -1689,6 +1695,8 @@ static int bpf_prog_detach(const union bpf_attr *attr)
> return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_SKB, NULL);
> case BPF_LIRC_MODE2:
> return lirc_prog_detach(attr);
> + case BPF_FLOW_DISSECTOR:
> + return skb_flow_dissector_bpf_prog_detach(attr);
> default:
> return -EINVAL;
> }
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index ca90679a7fe5..6d3f268fa8e0 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -1321,6 +1321,7 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
> case BPF_PROG_TYPE_LWT_XMIT:
> case BPF_PROG_TYPE_SK_SKB:
> case BPF_PROG_TYPE_SK_MSG:
> + case BPF_PROG_TYPE_FLOW_DISSECTOR:
> if (meta)
> return meta->pkt_access;
>
> @@ -3976,6 +3977,7 @@ static bool may_access_skb(enum bpf_prog_type type)
> case BPF_PROG_TYPE_SOCKET_FILTER:
> case BPF_PROG_TYPE_SCHED_CLS:
> case BPF_PROG_TYPE_SCHED_ACT:
> + case BPF_PROG_TYPE_FLOW_DISSECTOR:
> return true;
> default:
> return false;
> diff --git a/net/core/filter.c b/net/core/filter.c
> index fd423ce3da34..03d3037e6508 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -4820,6 +4820,111 @@ bool bpf_helper_changes_pkt_data(void *func)
> return false;
> }
>
> +BPF_CALL_4(bpf_flow_dissector_write_keys, const struct sk_buff *, skb,
> + const void *, from, u32, len, enum flow_dissector_key_id, key_id)
> +{
> + struct bpf_flow_dissect_cb *cb;
> + void *dest;
> +
> + cb = (struct bpf_flow_dissect_cb *)bpf_skb_cb(skb);
> +
> + /* Make sure the dissector actually uses the key. It is not an error if
> + * it does not, but we should not continue past this point in that case
> + */
> + if (!dissector_uses_key(cb->flow_dissector, key_id))
> + return 0;
> +
> + /* Make sure the length is correct */
> + switch (key_id) {
> + case FLOW_DISSECTOR_KEY_CONTROL:
> + case FLOW_DISSECTOR_KEY_ENC_CONTROL:
> + if (len != sizeof(struct flow_dissector_key_control))
> + return -EINVAL;
> + break;
> + case FLOW_DISSECTOR_KEY_BASIC:
> + if (len != sizeof(struct flow_dissector_key_basic))
> + return -EINVAL;
> + break;
> + case FLOW_DISSECTOR_KEY_IPV4_ADDRS:
> + case FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS:
> + if (len != sizeof(struct flow_dissector_key_ipv4_addrs))
> + return -EINVAL;
> + break;
> + case FLOW_DISSECTOR_KEY_IPV6_ADDRS:
> + case FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS:
> + if (len != sizeof(struct flow_dissector_key_ipv6_addrs))
> + return -EINVAL;
> + break;
> + case FLOW_DISSECTOR_KEY_ICMP:
> + if (len != sizeof(struct flow_dissector_key_icmp))
> + return -EINVAL;
> + break;
> + case FLOW_DISSECTOR_KEY_PORTS:
> + case FLOW_DISSECTOR_KEY_ENC_PORTS:
> + if (len != sizeof(struct flow_dissector_key_ports))
> + return -EINVAL;
> + break;
> + case FLOW_DISSECTOR_KEY_ETH_ADDRS:
> + if (len != sizeof(struct flow_dissector_key_eth_addrs))
> + return -EINVAL;
> + break;
> + case FLOW_DISSECTOR_KEY_TIPC:
> + if (len != sizeof(struct flow_dissector_key_tipc))
> + return -EINVAL;
> + break;
> + case FLOW_DISSECTOR_KEY_ARP:
> + if (len != sizeof(struct flow_dissector_key_arp))
> + return -EINVAL;
> + break;
> + case FLOW_DISSECTOR_KEY_VLAN:
> + case FLOW_DISSECTOR_KEY_CVLAN:
> + if (len != sizeof(struct flow_dissector_key_vlan))
> + return -EINVAL;
> + break;
> + case FLOW_DISSECTOR_KEY_FLOW_LABEL:
> + if (len != sizeof(struct flow_dissector_key_tags))
> + return -EINVAL;
> + break;
> + case FLOW_DISSECTOR_KEY_GRE_KEYID:
> + case FLOW_DISSECTOR_KEY_ENC_KEYID:
> + case FLOW_DISSECTOR_KEY_MPLS_ENTROPY:
> + if (len != sizeof(struct flow_dissector_key_keyid))
> + return -EINVAL;
> + break;
> + case FLOW_DISSECTOR_KEY_MPLS:
> + if (len != sizeof(struct flow_dissector_key_mpls))
> + return -EINVAL;
> + break;
> + case FLOW_DISSECTOR_KEY_TCP:
> + if (len != sizeof(struct flow_dissector_key_tcp))
> + return -EINVAL;
> + break;
> + case FLOW_DISSECTOR_KEY_IP:
> + case FLOW_DISSECTOR_KEY_ENC_IP:
> + if (len != sizeof(struct flow_dissector_key_ip))
> + return -EINVAL;
> + break;
> + default:
> + return -EINVAL;
> + }
> +
> + dest = skb_flow_dissector_target(cb->flow_dissector, key_id,
> + cb->target_container);
> +
> + memcpy(dest, from, len);
> + return 0;
> +}
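The length checks in `bpf_flow_dissector_write_keys` above are a long switch mapping each key id to the size of its key struct. A table-driven sketch of the same contract as the helper's documentation describes it (key not used by the dissector: return 0 and write nothing; unknown id or wrong length: -EINVAL; otherwise copy into the target container at the key's offset). The key ids, structs and `struct dissector` below are hypothetical, heavily trimmed stand-ins, not the kernel's types:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum key_id { KEY_CONTROL, KEY_BASIC, KEY_PORTS, KEY_MAX };

struct key_control { uint16_t thoff; uint16_t addr_type; uint32_t flags; };
struct key_basic   { uint16_t n_proto; uint8_t ip_proto; uint8_t padding; };
struct key_ports   { uint16_t src; uint16_t dst; };

/* Expected size per key id: one table entry replaces one switch case. */
static const size_t key_size[KEY_MAX] = {
	[KEY_CONTROL] = sizeof(struct key_control),
	[KEY_BASIC]   = sizeof(struct key_basic),
	[KEY_PORTS]   = sizeof(struct key_ports),
};

/* Trimmed dissector: which keys are used, and where each one lives. */
struct dissector {
	uint32_t used_keys;       /* bit N set => key N is used */
	uint16_t offset[KEY_MAX]; /* byte offset of key N in the target */
};

int write_key(const struct dissector *d, void *target, const void *from,
	      size_t len, unsigned int id)
{
	assert(d != NULL && target != NULL && from != NULL);

	/* The id comes from the program, so range-check it before using
	 * it as a shift count or a table index. */
	if (id >= KEY_MAX)
		return -EINVAL;
	if (!(d->used_keys & (UINT32_C(1) << id)))
		return 0; /* not an error: nothing to write */
	if (len != key_size[id])
		return -EINVAL;
	memcpy((char *)target + d->offset[id], from, len);
	return 0;
}
```

Checking the id range first matters in this sketch because a shift by 32 or more on a 32-bit value is undefined behaviour in C.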
> +
> +static const struct bpf_func_proto bpf_flow_dissector_write_keys_proto = {
> + .func = bpf_flow_dissector_write_keys,
> + .gpl_only = false,
> + .ret_type = RET_INTEGER,
> + .arg1_type = ARG_PTR_TO_CTX,
> + .arg2_type = ARG_PTR_TO_MEM,
> + .arg3_type = ARG_CONST_SIZE,
> + .arg4_type = ARG_ANYTHING,
> +};
> +
> static const struct bpf_func_proto *
> bpf_base_func_proto(enum bpf_func_id func_id)
> {
> @@ -5100,6 +5205,19 @@ sk_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> }
> }
>
> +static const struct bpf_func_proto *
> +flow_dissector_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> +{
> + switch (func_id) {
> + case BPF_FUNC_skb_load_bytes:
> + return &bpf_skb_load_bytes_proto;
> + case BPF_FUNC_flow_dissector_write_keys:
> + return &bpf_flow_dissector_write_keys_proto;
> + default:
> + return bpf_base_func_proto(func_id);
> + }
> +}
> +
> static const struct bpf_func_proto *
> lwt_out_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> {
> @@ -5738,6 +5856,35 @@ static bool sk_msg_is_valid_access(int off, int size,
> return true;
> }
>
> +static bool flow_dissector_is_valid_access(int off, int size,
> + enum bpf_access_type type,
> + const struct bpf_prog *prog,
> + struct bpf_insn_access_aux *info)
> +{
> + if (type == BPF_WRITE) {
> + switch (off) {
> + case bpf_ctx_range(struct __sk_buff, cb[0]):
> + break;
> + default:
> + return false;
> + }
> + }
> +
> + switch (off) {
> + case bpf_ctx_range(struct __sk_buff, data):
> + info->reg_type = PTR_TO_PACKET;
> + break;
> + case bpf_ctx_range(struct __sk_buff, data_end):
> + info->reg_type = PTR_TO_PACKET_END;
> + break;
> + case bpf_ctx_range_till(struct __sk_buff, family, local_port):
> + case bpf_ctx_range_till(struct __sk_buff, cb[1], cb[4]):
> + return false;
> + }
> +
> + return bpf_skb_is_valid_access(off, size, type, prog, info);
> +}
> +
> static u32 bpf_convert_ctx_access(enum bpf_access_type type,
> const struct bpf_insn *si,
> struct bpf_insn *insn_buf,
> @@ -6995,6 +7142,16 @@ const struct bpf_verifier_ops sk_msg_verifier_ops = {
> const struct bpf_prog_ops sk_msg_prog_ops = {
> };
>
> +const struct bpf_verifier_ops flow_dissector_verifier_ops = {
> + .get_func_proto = flow_dissector_func_proto,
> + .is_valid_access = flow_dissector_is_valid_access,
> + .convert_ctx_access = bpf_convert_ctx_access,
> + .gen_ld_abs = bpf_gen_ld_abs,
> +};
> +
> +const struct bpf_prog_ops flow_dissector_prog_ops = {
> +};
> +
> int sk_detach_filter(struct sock *sk)
> {
> int ret = -ENOENT;
> diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
> index ce9eeeb7c024..767daa231f04 100644
> --- a/net/core/flow_dissector.c
> +++ b/net/core/flow_dissector.c
> @@ -25,6 +25,11 @@
> #include <net/flow_dissector.h>
> #include <scsi/fc/fc_fcoe.h>
> #include <uapi/linux/batadv_packet.h>
> +#include <linux/bpf.h>
> +
> +/* BPF program accessible by all flow dissectors */
> +static struct bpf_prog __rcu *flow_dissector_prog;
> +static DEFINE_MUTEX(flow_dissector_mutex);
>
> static void dissector_set_key(struct flow_dissector *flow_dissector,
> enum flow_dissector_key_id key_id)
> @@ -62,6 +67,40 @@ void skb_flow_dissector_init(struct flow_dissector *flow_dissector,
> }
> EXPORT_SYMBOL(skb_flow_dissector_init);
>
> +int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr,
> + struct bpf_prog *prog)
> +{
> + struct bpf_prog *attached;
> +
> + mutex_lock(&flow_dissector_mutex);
> + attached = rcu_dereference_protected(flow_dissector_prog,
> + lockdep_is_held(&flow_dissector_mutex));
> + if (attached) {
> + /* Only one BPF program can be attached at a time */
> + mutex_unlock(&flow_dissector_mutex);
> + return -EEXIST;
> + }
> + rcu_assign_pointer(flow_dissector_prog, prog);
> + mutex_unlock(&flow_dissector_mutex);
> + return 0;
> +}
> +
> +int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr)
> +{
> + struct bpf_prog *attached;
> +
> + mutex_lock(&flow_dissector_mutex);
> + attached = rcu_dereference_protected(flow_dissector_prog,
> + lockdep_is_held(&flow_dissector_mutex));
> + if (!flow_dissector_prog) {
> + mutex_unlock(&flow_dissector_mutex);
> + return -EINVAL;
> + }
> + bpf_prog_put(attached);
> + RCU_INIT_POINTER(flow_dissector_prog, NULL);
> + mutex_unlock(&flow_dissector_mutex);
> + return 0;
> +}
> /**
> * skb_flow_get_be16 - extract be16 entity
> * @skb: sk_buff to extract from
> @@ -619,6 +658,7 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
> struct flow_dissector_key_vlan *key_vlan;
> enum flow_dissect_ret fdret;
> enum flow_dissector_key_id dissector_vlan = FLOW_DISSECTOR_KEY_MAX;
> + struct bpf_prog *attached;
> int num_hdrs = 0;
> u8 ip_proto = 0;
> bool ret;
> @@ -658,6 +698,42 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
> FLOW_DISSECTOR_KEY_BASIC,
> target_container);
>
> + rcu_read_lock();
> + attached = rcu_dereference(flow_dissector_prog);
> + if (attached) {
> + /* Note that even though the const qualifier is discarded
> + * throughout the execution of the BPF program, all changes(the
> + * control block) are reverted after the BPF program returns.
> + * Therefore, __skb_flow_dissect does not alter the skb.
> + */
> + struct bpf_flow_dissect_cb *cb;
> + u8 cb_saved[BPF_SKB_CB_LEN];
> + u32 result;
> +
> + cb = (struct bpf_flow_dissect_cb *)(bpf_skb_cb((struct sk_buff *)skb));
> +
> + /* Save Control Block */
> + memcpy(cb_saved, cb, sizeof(cb_saved));
> + memset(cb, 0, sizeof(cb_saved));
> +
> + /* Pass parameters to the BPF program */
> + cb->nhoff = nhoff;
> + cb->target_container = target_container;
> + cb->flow_dissector = flow_dissector;
> +
> + bpf_compute_data_pointers((struct sk_buff *)skb);
> + result = BPF_PROG_RUN(attached, skb);
> +
> + /* Restore state */
> + memcpy(cb, cb_saved, sizeof(cb_saved));
> +
> + key_control->thoff = min_t(u16, key_control->thoff,
> + skb ? skb->len : hlen);
> + rcu_read_unlock();
> + return result == BPF_OK;
> + }
If the BPF program cannot handle a certain protocol, should we fall back
to the built-in logic? Otherwise, every BPF program needs to have some
code for all protocols.
Song
> + rcu_read_unlock();
> +
> if (dissector_uses_key(flow_dissector,
> FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
> struct ethhdr *eth = eth_hdr(skb);
> diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
> index dce960d22106..b1cd3bc8db70 100644
> --- a/tools/bpf/bpftool/prog.c
> +++ b/tools/bpf/bpftool/prog.c
> @@ -74,6 +74,7 @@ static const char * const prog_type_name[] = {
> [BPF_PROG_TYPE_RAW_TRACEPOINT] = "raw_tracepoint",
> [BPF_PROG_TYPE_CGROUP_SOCK_ADDR] = "cgroup_sock_addr",
> [BPF_PROG_TYPE_LIRC_MODE2] = "lirc_mode2",
> + [BPF_PROG_TYPE_FLOW_DISSECTOR] = "flow_dissector",
> };
>
> static void print_boot_time(__u64 nsecs, char *buf, unsigned int size)
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 66917a4eba27..acd74a0dd063 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -152,6 +152,7 @@ enum bpf_prog_type {
> BPF_PROG_TYPE_LWT_SEG6LOCAL,
> BPF_PROG_TYPE_LIRC_MODE2,
> BPF_PROG_TYPE_SK_REUSEPORT,
> + BPF_PROG_TYPE_FLOW_DISSECTOR,
> };
>
> enum bpf_attach_type {
> @@ -172,6 +173,7 @@ enum bpf_attach_type {
> BPF_CGROUP_UDP4_SENDMSG,
> BPF_CGROUP_UDP6_SENDMSG,
> BPF_LIRC_MODE2,
> + BPF_FLOW_DISSECTOR,
> __MAX_BPF_ATTACH_TYPE
> };
>
> @@ -2226,7 +2228,8 @@ union bpf_attr {
> FN(get_current_cgroup_id), \
> FN(get_local_storage), \
> FN(sk_select_reuseport), \
> - FN(skb_ancestor_cgroup_id),
> + FN(skb_ancestor_cgroup_id), \
> + FN(flow_dissector_write_keys),
>
> /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> * function eBPF program intends to call
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 2abd0f112627..0c749ce1b717 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -1502,6 +1502,7 @@ static bool bpf_prog_type__needs_kver(enum bpf_prog_type type)
> case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
> case BPF_PROG_TYPE_LIRC_MODE2:
> case BPF_PROG_TYPE_SK_REUSEPORT:
> + case BPF_PROG_TYPE_FLOW_DISSECTOR:
> return false;
> case BPF_PROG_TYPE_UNSPEC:
> case BPF_PROG_TYPE_KPROBE:
> @@ -2121,6 +2122,7 @@ static const struct {
> BPF_PROG_SEC("sk_skb", BPF_PROG_TYPE_SK_SKB),
> BPF_PROG_SEC("sk_msg", BPF_PROG_TYPE_SK_MSG),
> BPF_PROG_SEC("lirc_mode2", BPF_PROG_TYPE_LIRC_MODE2),
> + BPF_PROG_SEC("flow_dissector", BPF_PROG_TYPE_FLOW_DISSECTOR),
> BPF_SA_PROG_SEC("cgroup/bind4", BPF_CGROUP_INET4_BIND),
> BPF_SA_PROG_SEC("cgroup/bind6", BPF_CGROUP_INET6_BIND),
> BPF_SA_PROG_SEC("cgroup/connect4", BPF_CGROUP_INET4_CONNECT),
> diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
> index e4be7730222d..4204c496a04f 100644
> --- a/tools/testing/selftests/bpf/bpf_helpers.h
> +++ b/tools/testing/selftests/bpf/bpf_helpers.h
> @@ -143,6 +143,9 @@ static unsigned long long (*bpf_skb_cgroup_id)(void *ctx) =
> (void *) BPF_FUNC_skb_cgroup_id;
> static unsigned long long (*bpf_skb_ancestor_cgroup_id)(void *ctx, int level) =
> (void *) BPF_FUNC_skb_ancestor_cgroup_id;
> +static int (*bpf_flow_dissector_write_keys)(void *ctx, void *src, int len,
> + int key) =
> + (void *) BPF_FUNC_flow_dissector_write_keys;
>
> /* llvm builtin functions that eBPF C program may use to
> * emit BPF_LD_ABS and BPF_LD_IND instructions
> --
> 2.18.0.865.gffc8e1a3cd6-goog
>
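The control-block handling in the `__skb_flow_dissect` hunk (save the scratch area, zero it, pass parameters through it, run the program, restore it) is a general save/run/restore pattern. A userspace sketch under assumed names: `SCRATCH_LEN` = 20 is chosen to mirror `BPF_SKB_CB_LEN` on kernels of this era (an assumption), `struct packet`, `struct hook_params` and `example_hook()` are hypothetical, and the `_Static_assert` shows one way to catch, at compile time, a parameter struct that has outgrown the scratch area:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SCRATCH_LEN 20 /* assumed to mirror BPF_SKB_CB_LEN */

/* Hypothetical packet whose scratch area stands in for the skb cb. */
struct packet {
	uint8_t scratch[SCRATCH_LEN];
};

/* Parameters handed to the hook through the scratch area. */
struct hook_params {
	uint16_t nhoff;
	uint16_t unused;
	uint32_t cookie;
};

_Static_assert(sizeof(struct hook_params) <= SCRATCH_LEN,
	       "hook parameters must fit in the scratch area");

typedef int (*hook_fn)(struct packet *pkt);

/* Example hook: reads its parameters, then scribbles over the scratch area. */
int example_hook(struct packet *pkt)
{
	struct hook_params params;

	memcpy(&params, pkt->scratch, sizeof(params));
	memset(pkt->scratch, 0xFF, sizeof(pkt->scratch));
	return params.nhoff + (int)params.cookie;
}

int run_hook(struct packet *pkt, hook_fn hook, uint16_t nhoff, uint32_t cookie)
{
	uint8_t saved[SCRATCH_LEN];
	struct hook_params params = { .nhoff = nhoff, .cookie = cookie };
	int ret;

	assert(pkt != NULL && hook != NULL);

	memcpy(saved, pkt->scratch, sizeof(saved));    /* save caller's state */
	memset(pkt->scratch, 0, sizeof(pkt->scratch)); /* hide stale contents */
	memcpy(pkt->scratch, &params, sizeof(params)); /* pass parameters */
	ret = hook(pkt);
	memcpy(pkt->scratch, saved, sizeof(saved));    /* restore */
	return ret;
}
```

The sketch copies through `memcpy` rather than casting the byte array, which keeps it free of alignment and strict-aliasing concerns in portable userspace C.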