* Re: [PATCH 4/4] cpsw: add switchdev support
From: Florian Fainelli @ 2018-06-05 21:37 UTC (permalink / raw)
To: Grygorii Strashko, Ilias Apalodimas
Cc: Andrew Lunn, netdev, ivan.khoronzhuk, nsekhar, jiri, ivecera,
francois.ozog, yogeshs, spatton
In-Reply-To: <e5832a2c-da63-d91d-e93a-51b7e84959a0@ti.com>
On 06/05/2018 02:03 PM, Grygorii Strashko wrote:
>
>
> On 06/02/2018 05:34 AM, Ilias Apalodimas wrote:
>> Hi Florian,
>>
>> Thanks for taking time to look into this
>>
>> On Fri, Jun 01, 2018 at 02:48:48PM -0700, Florian Fainelli wrote:
>>>
>>>
>>> On 05/24/2018 09:56 PM, Ilias Apalodimas wrote:
>>>> On Thu, May 24, 2018 at 06:39:04PM +0200, Andrew Lunn wrote:
>>>>> On Thu, May 24, 2018 at 04:32:34PM +0300, Ilias Apalodimas wrote:
>>>>>> On Thu, May 24, 2018 at 03:12:29PM +0200, Andrew Lunn wrote:
>>>>>>> Device tree is supposed to describe the hardware. Using that hardware
>>>>>>> in different ways is not something you should describe in DT.
>>>>>>>
>>>>>> The new switchdev mode is applied with a .config option in the kernel. What you
>>>>>> see is pre-existing code, so i am not sure if i should change it in this
>>>>>> patchset.
>>>>>
>>>>> If you break the code up into a library and two drivers, it becomes a
>>>>> moot point.
>>>> Agree
>>>>
>>>>>
>>>>> But what i don't like here is that the device tree says to do dual
>>>>> mac. But you ignore that and do sometime else. I would prefer that if
>>>>> DT says dual mac, and switchdev is compiled in, the probe fails with
>>>>> EINVAL. Rather than ignore something, make it clear it is invalid.
>>>> The switch has 3 modes of operation as is.
>>>> 1. switch mode, to enable that you don't need to add anything on
>>>> the DTS and linux registers a single netdev interface.
>>>> 2. dual mac mode, this is when you need to add dual_emac; on the DTS.
>>>> 3. switchdev mode which is controlled by a .config option, since as you
>>>> pointed out DTS was not made for controlling config options.
>>>>
>>>> I agree that this is far from beautiful. If the driver remains as in though,
>>>> i'd prefer either keeping what's there or making "switchdev" a DTS option,
>>>> following the pre-existing erroneous usage rather than making the device
>>>> unusable. If we end up returning some error and refuse to initialize, users
>>>> that remote upgrade their equipment, without taking a good look at changelog,
>>>> will loose access to their devices with no means of remotely fixing that.
>>>
>>> It seems to me that the mistake here is seeing multiple modes of
>>> operations for the cpsw. There are not actually many, there is one
>>> usage, and then there is what you can and cannot offload.
>
>> CPSW has in fact 2 modes of operation, different FIFO usage/lookup entry(it's
>> called ALE in the current driver) by-pass(which is used in dual emac for
>> example) and other features. Again Grygorii is better suited to answer the
>> exact differences.
>
> dual_mac is HW enabled mode of operation, so having DT option is pretty
> reasonable as for me.
> 1) when enabled it configures internal FIFOs in special way so both
> external Ports became equal in the direction toward to Host port 0.
>
> TRM "The intention of this mode is to allow packets from both ethernet ports
> to be written into the FIFO without one port starving the other port."
>
> 2) ALE, out of the box, does not support this mode and, as result, two
> default vlan have to be created to direct traffic P1->P0 (vlan1) and
> P2-P0 (vlan2), but any packets P0->P1 and P0->P2 have to be directed
> (and will bypass ALE). This way traffic separated on cpsw egress towards to P0,
>
>>> The basic
>>> premise with switchdev and DSA (which uses switchdev) is that each
>>> user-facing port of your switch needs to work as if it were a normal
>>> Ethernet NIC, that is what you call dual-MAC I believe. Then, when you
>>> create a bridge and you enslave those ports into the bridge, you need to
>>> have forwarding done in hardware between these two ports when the
>>> SMAC/DMAC are not for the host/CPU/management interface and you must
>>> simultaneously still have the host have the ability to send/receive
>>> traffic through the bridge device.
>
> TRM
> "When operating in dual mac mode the intention is to transfer packets between ports 0 and 1 and ports 0
> and 2, but not between ports 1 and 2. Each CPGMAC_SL appears as a single MAC with no bridging
> between MAC’s. Each CPGMAC_SL has at least one unique (not the same) mac address."
>
> So, we cant't talk about forwarding in dual_mac mode. Interfaces expected to be
> completely independent without any packet leaking between interfaces.
>
> !! BUT !! CPSW ALE (switch core) still expected to be used, but for packet filtering
> - only registered vlans
> - only registered mcast/bcast
> - ingress mcast/bcast rate limiting (it's actually more like coalescing -
> limits number of mcast/bcast packets per sec.
>
> And all offloading ALE (val/mdb) entries should always contain two ports in masks:
> p1&p0 or p2&p0. Never ever all three ports.
>
>> Yes dual emac does that. But dual emac configures the port facing VLAN to the
>> CPU port as well. So dual emac splits and uses 2 interfaces. VLAN 1 is
>> configured on port1 + CPU port and VLAN 2 is confired on port 2 + CPU port
>> That's exactly what the current RFC does as well, with the addition of
>> registering a sw0p0 (i'll explain why later on this mail)
>> A little more detail on the issue we are having. On my description
>> sw0p0 -> CPU port, sw0p1 -> port 1 sw0p2 -> port 2. sw0p1/sw0p2 are the ports
>> that have PHYs attached.
>>
>> When we start in the new switchdev mode all interfaces are added to VLAN 0
>> so CPU port + port1 + port2 are all in the same VLAN group. In that case sw0p1
>> and sw0p2 are working as you describe. So those 2 interfaces can send/receive
>> traffic normally which matches the switchdev case.
>>
>> When we add them on a bridge(let's say br0), VLAN1(or any default bridge VLAN)
>> is now configured on sw0p1 and sw0p2 but *not* on the CPU port.
>> From this point on the whole fuunctionality just collapses. The switch will
>> work and offload traffic between sw0p1/sw0p2 but the bridge won't be able to
>> get an ip address (since VLAN1 is not a member of the CPU port and the packet
>> gets dropped).
>> IGMPv2/V3 messages will never reach the br_multicast.c code to trigger
>> switchdev and configure the MDBs on the ports. i am prety sure there are other
>> fail scenarios which i haven't discovered already, but those 2 are the most
>> basic ones. If we add VLAN1 on the CPU port, everything works as intended of
>> course.
>>
>> That's the reason we registered sw0p0 as the CPU port. It can't do any "real"
>> traffic, but you can configure the CPU port independantly and not be forced to
>> do an OR on every VLAN add/delete grouping the CPU port with your port command.
>> The TL;DR version of this is that the switch is working exactly as switchdev is
>> expecting offloading traffic to the hardware when possible as long as the CPU
>> port is member of the proper VLANs
>>
>> Petr's patch solves this for us (9c86ce2c1ae337fc10568a12aea812ed03de8319).
>> We can now do "bridge vlan add dev br0 vid 100 pvid untagged self" and decide
>> when to add the CPU port or not.
>>
>> There are still a couple of cases that are not covered though, if we don't
>> register the CPU port. We cant decide when to forward multicast
>> traffic on the CPU port if a join hasn't been sent from br0.
>> So let's say you got 2 hosts doing multicast and for whatever reason the host
>> wants to see that traffic.
>> With the CPU port present you can do a
>> "bridge mdb add dev br0 port sw0p0 grp 239.1.1.1 permanent" which will offload
>> the traffic to the CPU port and thus the host. If this goes away we are forced
>> to send a join.
>>
>>> It seems to me like this is entirely doable given that the dual MAC use
>>> case is supported already.
>>>
>>> switchdev is just a stateless framework to get notified from the
>>> networking stack about what you can possibly offload in hardware, so
>>> having a DTS option gate that is unfortunately wrong because it is
>>> really implementing a SW policy in DTS which is not what it is meant for.
>> The DTS option for configuration pre-existed, i don't like that either switchdev
>> mode is activated by a .config option not DTS(it just overrides whatever config
>> you have on the DTS). Far from pretty though fair point, i am open to
>> suggestions.
>
> Again this is option describing HW mode which not expected to be changed on the fly.
> I'm honestly, propose descope dual_mac and concentrate on switch mode (switchdev) -
> right now I don't see how dual_mac can be supported with switchdev as per above.
> The same way as I do not see how we can re-use switchdev with 50% of devices which
> have "only one" user-facing external port (P1 or P2).
This is still not appropriate for Device Tree, because this is
completely orthogonal from one another. Also, I think you tend to
conflate what the switch can do, with in which mode does it start by
default, which are, again, two different things.
--
Florian
^ permalink raw reply
* Re: [PATCH 0/4] RFC CPSW switchdev mode
From: Andrew Lunn @ 2018-06-05 21:37 UTC (permalink / raw)
To: Grygorii Strashko
Cc: Ilias Apalodimas, Ivan Vecera, Jiri Pirko, netdev,
ivan.khoronzhuk, nsekhar, francois.ozog, yogeshs, spatton
In-Reply-To: <05111013-4986-b1f2-a2d1-9cab48be1ce5@ti.com>
> Right, thanky for the info, but still (sry, to be annoying) why default vlan is added by bridge
> when vlan_filtering == 0?
I have no idea!
I just made sure the Marvell driver works with this.
You might want to do a git blame and find out who added it, and it
might say why.
Andrew
^ permalink raw reply
* Re: [PATCH 1/1] net: dsa: b53: Fix for brcm tag issue in Cygnus SoC
From: Florian Fainelli @ 2018-06-05 21:41 UTC (permalink / raw)
To: Arun Parameswaran, Vivien Didelot, Andrew Lunn, David S. Miller
Cc: netdev, linux-kernel, bcm-kernel-feedback-list,
Clément Péron
In-Reply-To: <1528231092-31472-1-git-send-email-arun.parameswaran@broadcom.com>
On 06/05/2018 01:38 PM, Arun Parameswaran wrote:
> In the Broadcom Cygnus SoC, the brcm tag needs to be inserted
> in between the mac address and the ether type (should use
> 'DSA_PROTO_TAG_BRCM') for the packets sent to the internal
> b53 switch.
>
> Since the Cygnus was added with the BCM58XX device id and the
> BCM58XX uses 'DSA_PROTO_TAG_BRCM_PREPEND', the data path is
> broken, due to the incorrect brcm tag location.
>
> Add a new b53 device id (BCM583XX) for Cygnus family to fix the
> issue. Add the new device id to the BCM58XX family as Cygnus
> is similar to the BCM58XX in most other functionalities.
>
> Fixes: 11606039604c ("net: dsa: b53: Support prepended Broadcom tags")
>
> Signed-off-by: Arun Parameswaran <arun.parameswaran@broadcom.com>
Clement originally reported this to me/us:
Reported-by: Clément Péron <peron.clem@gmail.com>
I completely overlooked that when adding support for prepended Broadcom
tags, thanks for the fix Arun!
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
David, can you also queue this up for -stable? Thank you
> ---
> drivers/net/dsa/b53/b53_common.c | 15 ++++++++++++++-
> drivers/net/dsa/b53/b53_priv.h | 2 ++
> drivers/net/dsa/b53/b53_srab.c | 4 ++--
> 3 files changed, 18 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c
> index 3da5fca..bbc6cc6 100644
> --- a/drivers/net/dsa/b53/b53_common.c
> +++ b/drivers/net/dsa/b53/b53_common.c
> @@ -684,7 +684,8 @@ static int b53_switch_reset(struct b53_device *dev)
> * still use this driver as a library and need to perform the reset
> * earlier.
> */
> - if (dev->chip_id == BCM58XX_DEVICE_ID) {
> + if (dev->chip_id == BCM58XX_DEVICE_ID ||
> + dev->chip_id == BCM583XX_DEVICE_ID) {
> b53_read8(dev, B53_CTRL_PAGE, B53_SOFTRESET, ®);
> reg |= SW_RST | EN_SW_RST | EN_CH_RST;
> b53_write8(dev, B53_CTRL_PAGE, B53_SOFTRESET, reg);
> @@ -1880,6 +1881,18 @@ struct b53_chip_data {
> .jumbo_size_reg = B53_JUMBO_MAX_SIZE,
> },
> {
> + .chip_id = BCM583XX_DEVICE_ID,
> + .dev_name = "BCM583xx/11360",
> + .vlans = 4096,
> + .enabled_ports = 0x103,
> + .arl_entries = 4,
> + .cpu_port = B53_CPU_PORT,
> + .vta_regs = B53_VTA_REGS,
> + .duplex_reg = B53_DUPLEX_STAT_GE,
> + .jumbo_pm_reg = B53_JUMBO_PORT_MASK,
> + .jumbo_size_reg = B53_JUMBO_MAX_SIZE,
> + },
> + {
> .chip_id = BCM7445_DEVICE_ID,
> .dev_name = "BCM7445",
> .vlans = 4096,
> diff --git a/drivers/net/dsa/b53/b53_priv.h b/drivers/net/dsa/b53/b53_priv.h
> index 3b57f47..b232aaa 100644
> --- a/drivers/net/dsa/b53/b53_priv.h
> +++ b/drivers/net/dsa/b53/b53_priv.h
> @@ -62,6 +62,7 @@ enum {
> BCM53018_DEVICE_ID = 0x53018,
> BCM53019_DEVICE_ID = 0x53019,
> BCM58XX_DEVICE_ID = 0x5800,
> + BCM583XX_DEVICE_ID = 0x58300,
> BCM7445_DEVICE_ID = 0x7445,
> BCM7278_DEVICE_ID = 0x7278,
> };
> @@ -181,6 +182,7 @@ static inline int is5301x(struct b53_device *dev)
> static inline int is58xx(struct b53_device *dev)
> {
> return dev->chip_id == BCM58XX_DEVICE_ID ||
> + dev->chip_id == BCM583XX_DEVICE_ID ||
> dev->chip_id == BCM7445_DEVICE_ID ||
> dev->chip_id == BCM7278_DEVICE_ID;
> }
> diff --git a/drivers/net/dsa/b53/b53_srab.c b/drivers/net/dsa/b53/b53_srab.c
> index c37ffd1..8247481 100644
> --- a/drivers/net/dsa/b53/b53_srab.c
> +++ b/drivers/net/dsa/b53/b53_srab.c
> @@ -364,7 +364,7 @@ static int b53_srab_write64(struct b53_device *dev, u8 page, u8 reg,
> { .compatible = "brcm,bcm53018-srab" },
> { .compatible = "brcm,bcm53019-srab" },
> { .compatible = "brcm,bcm5301x-srab" },
> - { .compatible = "brcm,bcm11360-srab", .data = (void *)BCM58XX_DEVICE_ID },
> + { .compatible = "brcm,bcm11360-srab", .data = (void *)BCM583XX_DEVICE_ID },
> { .compatible = "brcm,bcm58522-srab", .data = (void *)BCM58XX_DEVICE_ID },
> { .compatible = "brcm,bcm58525-srab", .data = (void *)BCM58XX_DEVICE_ID },
> { .compatible = "brcm,bcm58535-srab", .data = (void *)BCM58XX_DEVICE_ID },
> @@ -372,7 +372,7 @@ static int b53_srab_write64(struct b53_device *dev, u8 page, u8 reg,
> { .compatible = "brcm,bcm58623-srab", .data = (void *)BCM58XX_DEVICE_ID },
> { .compatible = "brcm,bcm58625-srab", .data = (void *)BCM58XX_DEVICE_ID },
> { .compatible = "brcm,bcm88312-srab", .data = (void *)BCM58XX_DEVICE_ID },
> - { .compatible = "brcm,cygnus-srab", .data = (void *)BCM58XX_DEVICE_ID },
> + { .compatible = "brcm,cygnus-srab", .data = (void *)BCM583XX_DEVICE_ID },
> { .compatible = "brcm,nsp-srab", .data = (void *)BCM58XX_DEVICE_ID },
> { /* sentinel */ },
> };
>
--
Florian
^ permalink raw reply
* Re: [PATCH 0/4] RFC CPSW switchdev mode
From: Grygorii Strashko @ 2018-06-05 21:42 UTC (permalink / raw)
To: Andrew Lunn
Cc: Ilias Apalodimas, Ivan Vecera, Jiri Pirko, netdev,
ivan.khoronzhuk, nsekhar, francois.ozog, yogeshs, spatton
In-Reply-To: <20180605212840.GA3796@lunn.ch>
On 06/05/2018 04:28 PM, Andrew Lunn wrote:
>> I hope you are right - question is always in number of available options
>> and which one to select - and, most important, explain it to the end user :(
>
> The end customer being ptp4linux? At least for Marvell switches, it is
> happy about everything except that the switch is a bit slow, so we
> need to modify some of the time outs in the configuration file.
>
>> For example:
>> phc_index is returned as part of .get_ts_info() = cpsw_get_ts_info(),
>> so which intf should return phc_index?
>
> It is not a 1:1 relationship. See:
>
> https://elixir.bootlin.com/linux/latest/source/drivers/net/dsa/mv88e6xxx/hwtstamp.c#L61
>
> All interfaces return the same index.
>
> In fact, for a switch, having a PHC per port would be odd. That would
> mean you need to sync the PHCs in order to act as a boundary clock.
PHC only one, but hw timestamping blocks are per port.
--
regards,
-grygorii
^ permalink raw reply
* Re: [PATCH 0/4] RFC CPSW switchdev mode
From: Andrew Lunn @ 2018-06-05 21:55 UTC (permalink / raw)
To: Grygorii Strashko
Cc: Ilias Apalodimas, Ivan Vecera, Jiri Pirko, netdev,
ivan.khoronzhuk, nsekhar, francois.ozog, yogeshs, spatton
In-Reply-To: <3531006d-6cdf-f0a1-a0ce-042194aece45@ti.com>
> PHC only one, but hw timestamping blocks are per port.
Yes, same as the Marvell. Per port, there are two receive time stamps
and one transmit time stamp.
Andrew
^ permalink raw reply
* [PATCH net] net: ipv6: ip6_output: alloc skb with tailroom
From: Alexander Aring @ 2018-06-05 22:04 UTC (permalink / raw)
To: netdev
Cc: davem, yoshfuji, david.palma, rabinarayans0828, jhs, stefan,
linux-wpan, kernel, Alexander Aring
This patch adds care about tailroom length for allocate a skb from ipv6
level stack. In case of 6lowpan we had the problem the skb runs into a
skb_over_panic() in some special length cases. The root was there was no
tailroom allocated for the IEEE 802.15.4 checksum, although we had
the necessary tailroom specified inside the netdev structure.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=195059
Reported-by: David Palma <david.palma@ntnu.no>
Reported-by: Rabi Narayan Sahoo <rabinarayans0828@gmail.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
---
Hi,
nasty bug, I suppose this is the correct fix to my last question for
what dev->needed_tailroom is designed for.
I added two Reported-by here, David Palma reported this bug one year ago
and I didn't had time to investigate. Interesting is that he told this
bug doesn't occur (in case of 6lowpan 802.15.4) on 32 bit systems.
Maybe alignment is related to the question why it works on 32 bit.
Anyway, a week ago "Rabi Narayan Sahoo" reported the bug again and I
needed to investigate something "why", since I also use a 64 bit vm.
David Palma did a nice job for reproduce this bug and he (I think) lives
at least one year with it, so I put him at first.
Anyway, Rabi Narayan Sahoo was very very close to fix it and found the
right code part which I also found. I read his mail afterwards because
it was received messed on the linux-wpan mailinglist. So it's correct
to give him credits too. :-)
I hope there are no other cases where tailroom is missing.
The second one is not needed to fix my bug but I think we need it there.
Also hh_len is also used inside a skb_resever() in this function,
but this is for headroom only.
- Alex
net/ipv6/ip6_output.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 7b6d1689087b..b4e521cfe3cf 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1262,6 +1262,7 @@ static int __ip6_append_data(struct sock *sk,
int exthdrlen = 0;
int dst_exthdrlen = 0;
int hh_len;
+ int t_len;
int copy;
int err;
int offset = 0;
@@ -1283,6 +1284,7 @@ static int __ip6_append_data(struct sock *sk,
orig_mtu = mtu;
hh_len = LL_RESERVED_SPACE(rt->dst.dev);
+ t_len = rt->dst.dev->needed_tailroom;
fragheaderlen = sizeof(struct ipv6hdr) + rt->rt6i_nfheader_len +
(opt ? opt->opt_nflen : 0);
@@ -1425,13 +1427,13 @@ static int __ip6_append_data(struct sock *sk,
}
if (transhdrlen) {
skb = sock_alloc_send_skb(sk,
- alloclen + hh_len,
+ alloclen + hh_len + t_len,
(flags & MSG_DONTWAIT), &err);
} else {
skb = NULL;
if (refcount_read(&sk->sk_wmem_alloc) + wmem_alloc_delta <=
2 * sk->sk_sndbuf)
- skb = alloc_skb(alloclen + hh_len,
+ skb = alloc_skb(alloclen + hh_len + t_len,
sk->sk_allocation);
if (unlikely(!skb))
err = -ENOBUFS;
--
2.11.0
^ permalink raw reply related
* Re: [PATCH bpf-next v4 1/2] trace_helpers.c: Add helpers to poll multiple perf FDs for events
From: Jakub Kicinski @ 2018-06-05 22:16 UTC (permalink / raw)
To: Toke Høiland-Jørgensen; +Cc: netdev
In-Reply-To: <152821020087.23694.8231039605257373797.stgit@alrua-kau>
On Tue, 05 Jun 2018 16:50:00 +0200, Toke Høiland-Jørgensen wrote:
> Add two new helper functions to trace_helpers that supports polling
> multiple perf file descriptors for events. These are used to the XDP
> perf_event_output example, which needs to work with one perf fd per CPU.
>
> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
> ---
> tools/testing/selftests/bpf/trace_helpers.c | 47 ++++++++++++++++++++++++++-
> tools/testing/selftests/bpf/trace_helpers.h | 4 ++
> 2 files changed, 49 insertions(+), 2 deletions(-)
>
> diff --git a/tools/testing/selftests/bpf/trace_helpers.c b/tools/testing/selftests/bpf/trace_helpers.c
> index 3868dcb63420..1e62d89f34cf 100644
> --- a/tools/testing/selftests/bpf/trace_helpers.c
> +++ b/tools/testing/selftests/bpf/trace_helpers.c
> @@ -88,7 +88,7 @@ static int page_size;
> static int page_cnt = 8;
> static struct perf_event_mmap_page *header;
>
> -int perf_event_mmap(int fd)
> +int perf_event_mmap_header(int fd, struct perf_event_mmap_page **header)
> {
> void *base;
> int mmap_size;
> @@ -102,10 +102,15 @@ int perf_event_mmap(int fd)
> return -1;
> }
>
> - header = base;
> + *header = base;
> return 0;
> }
>
> +int perf_event_mmap(int fd)
> +{
> + return perf_event_mmap_header(fd, &header);
> +}
> +
> static int perf_event_poll(int fd)
> {
> struct pollfd pfd = { .fd = fd, .events = POLLIN };
> @@ -163,3 +168,41 @@ int perf_event_poller(int fd, perf_event_print_fn output_fn)
>
> return ret;
> }
> +
> +int perf_event_poller_multi(int *fds, struct perf_event_mmap_page **headers,
> + int num_fds, perf_event_print_fn output_fn)
> +{
> + enum bpf_perf_event_ret ret;
> + struct pollfd *pfds;
> + void *buf = NULL;
> + size_t len = 0;
> + int i;
> +
> + pfds = malloc(sizeof(*pfds) * num_fds);
> + if (!pfds)
> + return -1;
nit: this is correct, but better use LIBBPF_PERF_EVENT_ERROR? This
function is supposed to return enum bpf_perf_event_ret values I assume?
In case someone moves this code to libbpf later on...
> + memset(pfds, 0, sizeof(*pfds) * num_fds);
> + for (i = 0; i < num_fds; i++) {
> + pfds[i].fd = fds[i];
> + pfds[i].events = POLLIN;
> + }
> +
> + for (;;) {
> + poll(pfds, num_fds, 1000);
> + for (i = 0; i < num_fds; i++) {
> + if (pfds[i].revents) {
nit: save yourself a lever of indentation by doing:
if (!pfds[i].revents)
continue;
and you won't have to go over 80 chars.
> + ret = bpf_perf_event_read_simple(headers[i], page_cnt * page_size,
> + page_size, &buf, &len,
> + bpf_perf_event_print,
> + output_fn);
> + if (ret != LIBBPF_PERF_EVENT_CONT)
> + break;
> + }
> + }
> + }
> + free(buf);
> + free(pfds);
> +
> + return ret;
> +}
^ permalink raw reply
* Re: [PATCH bpf-next v4 2/2] samples/bpf: Add xdp_sample_pkts example
From: Jakub Kicinski @ 2018-06-05 22:24 UTC (permalink / raw)
To: Toke Høiland-Jørgensen; +Cc: netdev
In-Reply-To: <152821020092.23694.4231922206556589059.stgit@alrua-kau>
On Tue, 05 Jun 2018 16:50:00 +0200, Toke Høiland-Jørgensen wrote:
> Add an example program showing how to sample packets from XDP using the
> perf event buffer. The example userspace program just prints the ethernet
> header for every packet sampled.
>
> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
> diff --git a/samples/bpf/xdp_sample_pkts_kern.c b/samples/bpf/xdp_sample_pkts_kern.c
> new file mode 100644
> index 000000000000..4560522ca015
> --- /dev/null
> +++ b/samples/bpf/xdp_sample_pkts_kern.c
> @@ -0,0 +1,62 @@
> +#include <linux/ptrace.h>
> +#include <linux/version.h>
> +#include <uapi/linux/bpf.h>
> +#include "bpf_helpers.h"
> +
> +#define SAMPLE_SIZE 64ul
> +#define MAX_CPUS 24
That may be a lil' too few for modern HW with hyper-threading on ;)
My development machine says:
$ ncpus
28
128, maybe?
> +#define bpf_printk(fmt, ...) \
> +({ \
> + char ____fmt[] = fmt; \
> + bpf_trace_printk(____fmt, sizeof(____fmt), \
> + ##__VA_ARGS__); \
> +})
> +
> +struct bpf_map_def SEC("maps") my_map = {
> + .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
> + .key_size = sizeof(int),
> + .value_size = sizeof(u32),
> + .max_entries = MAX_CPUS,
> +};
> +
> +SEC("xdp_sample")
> +int xdp_sample_prog(struct xdp_md *ctx)
> +{
> + void *data_end = (void *)(long)ctx->data_end;
> + void *data = (void *)(long)ctx->data;
> +
> + /* Metadata will be in the perf event before the packet data. */
> + struct S {
> + u16 cookie;
> + u16 pkt_len;
> + } __attribute__((packed)) metadata;
> +
> + if (data + SAMPLE_SIZE < data_end) {
> + /* The XDP perf_event_output handler will use the upper 32 bits
> + * of the flags argument as a number of bytes to include of the
> + * packet payload in the event data. If the size is too big, the
> + * call to bpf_perf_event_output will fail and return -EFAULT.
> + *
> + * See bpf_xdp_event_output in net/core/filter.c.
> + *
> + * The BPF_F_CURRENT_CPU flag means that the event output fd
> + * will be indexed by the CPU number in the event map.
> + */
> + u64 flags = (SAMPLE_SIZE << 32) | BPF_F_CURRENT_CPU;
> + int ret;
> +
> + metadata.cookie = 0xdead;
> + metadata.pkt_len = (u16)(data_end - data);
> +
> + ret = bpf_perf_event_output(ctx, &my_map, flags,
> + &metadata, sizeof(metadata));
> + if(ret)
Please run checkpatch --strict on the samples.
> + bpf_printk("perf_event_output failed: %d\n", ret);
> + }
> +
> + return XDP_PASS;
> +}
> +
> +char _license[] SEC("license") = "GPL";
> +u32 _version SEC("version") = LINUX_VERSION_CODE;
^ permalink raw reply
* Re: umh build...
From: Alexei Starovoitov @ 2018-06-05 22:29 UTC (permalink / raw)
To: David Miller; +Cc: netdev
In-Reply-To: <20180605.170359.357501154106614131.davem@davemloft.net>
On Tue, Jun 05, 2018 at 05:03:59PM -0400, David Miller wrote:
>
> Alexei, I tried to build bpfilter on sparc64 and it shows that
> CONFIG_OUTPUT_FORMAT is an x86 specific Kconfig value.
>
> And, even if I added it, it's not clear what it should even be set to.
>
> Right now, for example, my userland builds default to 32-bit sparc
> even though I'm building a 64-bit kernel.
>
> I think what ends up happening has to be in some way in response to
> what kind of "native" binaries HOSTCC is actually building.
I guess you mean native-ness not of HOSTCC, but what target rootfs will be.
I guess it's probably equal to 32 or 64-bitness of the kernel from CC.
I completely forgot that I need to fix CONFIG_OUTPUT_FORMAT.
On my arm64 I just did
export CONFIG_OUTPUT_FORMAT=elf64-littleaarch64
before doing 'make'.
I can do a hack to extract it from objdump
similar to the hack I do for '-B ...'
Like the patch below...
and it works on x64 too, but I'm not sure about sparc.
>From e216123e54041f73ecf1676e355bc83e80c63d1d Mon Sep 17 00:00:00 2001
Date: Tue, 5 Jun 2018 15:27:07 -0700
Subject: [PATCH] bpfilter: fix OUTPUT_FORMAT
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
net/bpfilter/Makefile | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
index aafa72001fcd..e0bbe7583e58 100644
--- a/net/bpfilter/Makefile
+++ b/net/bpfilter/Makefile
@@ -21,7 +21,7 @@ endif
# which bpfilter_kern.c passes further into umh blob loader at run-time
quiet_cmd_copy_umh = GEN $@
cmd_copy_umh = echo ':' > $(obj)/.bpfilter_umh.o.cmd; \
- $(OBJCOPY) -I binary -O $(CONFIG_OUTPUT_FORMAT) \
+ $(OBJCOPY) -I binary -O `$(OBJDUMP) -f $<|grep format|cut -d' ' -f8` \
-B `$(OBJDUMP) -f $<|grep architecture|cut -d, -f1|cut -d' ' -f2` \
--rename-section .data=.init.rodata $< $@
^ permalink raw reply related
* Re: [PATCH iproute2-next v2 1/1] tc: add json support in csum action
From: David Ahern @ 2018-06-05 22:33 UTC (permalink / raw)
To: Keara Leibovitz, dsahern; +Cc: stephen, netdev, kernel
In-Reply-To: <1528231459-26546-1-git-send-email-kleib@mojatatu.com>
On 6/5/18 1:44 PM, Keara Leibovitz wrote:
> Add json output support for checksum action.
>
...
>
> Signed-off-by: Keara Leibovitz <kleib@mojatatu.com>
> ---
> tc/m_csum.c | 16 ++++++++++------
> 1 file changed, 10 insertions(+), 6 deletions(-)
>
> diff --git a/tc/m_csum.c b/tc/m_csum.c
> index 8391071d73f2..0bdbcf361a28 100644
> --- a/tc/m_csum.c
> +++ b/tc/m_csum.c
> @@ -162,6 +162,7 @@ print_csum(struct action_util *au, FILE *f, struct rtattr *arg)
> char *uflag_5 = "";
> char *uflag_6 = "";
> char *uflag_7 = "";
> + char buf[64];
changed that line to 'SPRINT_BUF(buf);' per Stephen's comment and
applied to iproute2-next.
^ permalink raw reply
* Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
From: Julian Anastasov @ 2018-06-05 22:35 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: netdev, David Ahern, Tom Herbert, Eric Dumazet, Nikita Shirokov,
kernel-team, lvs-devel
In-Reply-To: <20180605213128.tlskvjhuk6vx3m6r@kafai-mbp.dhcp.thefacebook.com>
Hello,
On Tue, 5 Jun 2018, Martin KaFai Lau wrote:
> On Sat, May 05, 2018 at 03:58:25PM +0300, Julian Anastasov wrote:
> > So, except the RTF_LOCAL check in __ip6_rt_update_pmtu
> > we should have no other issues.
> Hi Julian,
>
> Do you have a chance to work on the IPv6 side?
I was chasing other IPVS issues while preparing
to finalize these changes. I plan to do tests this weekend
and to submit my patch but without gso_size modifications.
Can post patch for this check in __ip6_rt_update_pmtu too,
separately after making sure all is fine.
Regards
^ permalink raw reply
* Re: [PATCH 0/4] RFC CPSW switchdev mode
From: Grygorii Strashko @ 2018-06-05 22:45 UTC (permalink / raw)
To: Andrew Lunn
Cc: Ilias Apalodimas, Ivan Vecera, Jiri Pirko, netdev,
ivan.khoronzhuk, nsekhar, francois.ozog, yogeshs, spatton
In-Reply-To: <20180603004937.GD14515@lunn.ch>
Hi Andrew,
Thanks a lot for you comments.
On 06/02/2018 07:49 PM, Andrew Lunn wrote:
> Hi Grygorii
>
>> Don't know howto:
>> 1) add FDB entry with "blocked" flag - ALE can discard all packets with SRC/DST
>> address = blocked MAC
>> 2) add multicast MAC address with Supervisory Packet flag set.
>> Such packets will bypass most of checks inside ALE and will be forwarded in all port's
>> states except "disabled".
>> 3) add "unknown vlan configuration" : ALE provides possibility to configure
>> default behavior for tagged packets with "unknown vlan" by configuring
>> - Unknown VLAN Force Untagged Egress ports Mask.
>> - Unknown VLAN Registered Multicast Flood Ports Mask
>> - Unknown VLAN Multicast Flood ports Mask
>> - Unknown VLAN Member ports List
>> 4) The way to detect "brctl stp br0 on/off"
>
> You are probably looking at this from the wrong direction. Yes, the
> switch can do these things. But the real question is, why would the
> network stack want to do this? As i've said before, you are
> accelerating the network stack by offloading things to the hardware.
Right. Thanks.
>
> Does the software bridge support FDB with a blocked flag? I don't
> think it does. So you first need to extend the software bridge with
> this concept. Then you can offload it to the hardware to accelerate
> it.
Sry, for possible misunderstanding: in "Don't know howto" i've listed
things I was not able to discover from code or documentation with hope
that expert opinion will help to confirm if this this a real/possible gap
or I/we've just missed smth. And if this is a real gap - get "green" or "red"
flag for future work (which need to be planned somehow).
So, my understanding for (1) "blocked FDB entry support" is reasonable
extension for bridge/switchdev ("green").
>
> Does the network stack need for forward specific multicast MAC
> addresses between bridge ports independent of the state? If there is
> no need for it, you don't need to accelerate it.
Assume this is about item 2 - this question is related to STP packets.
CPSW/ALE will drop STP packets if receiving port is in blocking/learning states
unless corresponding mcast entry exist in ALE entry with (Supervisory Packet) flag set
(Actually ALE mcast entry has two fields (TRM):
Supervisory Packet (SUPER): When set, this field indicates that the packet
with a matching multicast destination address is a supervisory packet.
Multicast Forward State (MCAST_FWD_STATE): Indicates the port state(s) required for the received port
on a destination address lookup in order for the multicast packet to be forwarded to
the transmit port(s). A transmit port must be in the Forwarding state in
order to forward the packet.)
Question 4 was asked with assumption that if (2) not supported and "red" flag
- then option (4) can be used as w/a (again if "green" flag) and STP mcast address
can be added in ALE on event "stp on".
** "unknown vlan configuration"
This is about following use case. Non static network configuration when
CPSW based device knows what traffic it can accept (Host port 0), but
it has no knowledge (or limited) about network segments attached to Port 1 and Port 2.
For example: Host 0 can accept only vlan 100 traffic coming from Port 1.
ALE entry: vid =100, port_mask 0x3
But there can be vlan 50 created in attached network segments.
Unknown VLAN Force Untagged Egress ports Mask = 0x0
Unknown VLAN Registered Multicast Flood Ports Mask = 0x6 (P1|P2)
Unknown VLAN Multicast Flood ports Mask = 0x6 (P1|P2)
Unknown VLAN Member ports List = 0x6 (P1|P2)
with above configuration packets with "unknown vlan" (no ALE entry) will
still be forwarded between port 1 and 2, but not Port 0.
So, is it reasonable to add "unknown vlan configuration" to bridge/switchdev
if not exist yet? will any other hw known benefit from it?
--
regards,
-grygorii
^ permalink raw reply
* Re: [PATCH 0/4] RFC CPSW switchdev mode
From: Grygorii Strashko @ 2018-06-05 22:59 UTC (permalink / raw)
To: Andrew Lunn
Cc: Ilias Apalodimas, netdev, ivan.khoronzhuk, nsekhar, jiri, ivecera,
francois.ozog, yogeshs, spatton
In-Reply-To: <20180602140833.GA8574@lunn.ch>
On 06/02/2018 09:08 AM, Andrew Lunn wrote:
> On Fri, Jun 01, 2018 at 04:29:08PM -0500, Grygorii Strashko wrote:
>> Hi Ilias,
>
>
>> Second, Thanks a lot for your great work. I'm still testing it with different
>> use cases and trying to consolidate my reply for all questions.
>>
>> All, thanks for your comments.
>
> Hi Grygorii
>
> Something i've said to Ilias already. I would recommend you don't try
> to cover all your uses cases with the first version. Keep it simple
> and clean, don't do anything controversial and get it merged. Then add
> more features one by one. We can then discuss any odd ball features
> while being able to look at the complete system, driver, switchdev and
> the network stack.
>
yes. It definitely no problem from my side, except basic customer use-cases
simply not working without sw0p0, at least with current LKML :(
And I just have to look a little bit in the future as selected approach
expected to be extended on future SoC (and other parts of existing SoCs - ICSS-G SW switch)
where we going to have more features, like TSN, EST and packet Policing and Classification.
And I very, very appreciated for your (and all others) time and comments.
Thank you.
--
regards,
-grygorii
^ permalink raw reply
* Re: [PATCH net] failover: eliminate callback hell
From: Stephen Hemminger @ 2018-06-05 21:52 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: kys, haiyangz, davem, sridhar.samudrala, netdev,
Stephen Hemminger
In-Reply-To: <20180605221049-mutt-send-email-mst@kernel.org>
On Tue, 5 Jun 2018 22:38:43 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:
> >
> > See:
> > https://patchwork.ozlabs.org/patch/851711/
>
> Let me try to summarize that:
>
> You wanted to speed up the delayed link up. You had an idea to
> additionally take link up when userspace renames the interface (standby
> one which is also the failover for netvsc).
>
> But userspace might not do any renames, in which case there will
> still be the delay, and so this never got applied.
>
> Is this a good summary?
>
> Davem said delay should go away completely as it's not robust, and I
> think I agree. So I don't think we should make all failover users use
> delay. IIUC failover kept a delay option especially for netvsc to
> minimize the surprise factor. Hopefully we can come up with
> something more robust and drop that option completely.
The timeout was the original solution to how to complete setup after
userspace has had a chance to rename the device. Unfortunately, the whole network
device initialization (cooperation with udev and userspace) is a a mess because
there is no well defined specification, and there are multiple ways userspace
does this in old and new distributions. The timeout has its own issues
(how long, handling errors during that window, what if userspace modifies other
device state); and open to finding a better solution.
My point was that if name change can not be relied on (or used) by netvsc,
then we can't allow it for net_failover either.
^ permalink raw reply
* [RFC PATCH ghak86 V1] audit: eliminate audit_enabled magic number comparison
From: Richard Guy Briggs @ 2018-06-05 23:20 UTC (permalink / raw)
To: Linux-Audit Mailing List, LKML,
Linux NetDev Upstream Mailing List, Netfilter Devel List,
Linux Security Module list
Cc: Eric Paris, Paul Moore, Steve Grubb, Richard Guy Briggs
Remove comparison of audit_enabled to magic numbers outside of audit.
Related: https://github.com/linux-audit/audit-kernel/issues/86
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
---
drivers/tty/tty_audit.c | 2 +-
include/linux/audit.h | 5 ++++-
include/net/xfrm.h | 2 +-
kernel/audit.c | 3 ---
net/netfilter/xt_AUDIT.c | 2 +-
net/netlabel/netlabel_user.c | 2 +-
6 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/drivers/tty/tty_audit.c b/drivers/tty/tty_audit.c
index e30aa6b..50f567b 100644
--- a/drivers/tty/tty_audit.c
+++ b/drivers/tty/tty_audit.c
@@ -92,7 +92,7 @@ static void tty_audit_buf_push(struct tty_audit_buf *buf)
{
if (buf->valid == 0)
return;
- if (audit_enabled == 0) {
+ if (audit_enabled == AUDIT_OFF) {
buf->valid = 0;
return;
}
diff --git a/include/linux/audit.h b/include/linux/audit.h
index 69c7847..9334fbe 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -117,6 +117,9 @@ struct audit_field {
extern void audit_log_session_info(struct audit_buffer *ab);
+#define AUDIT_OFF 0
+#define AUDIT_ON 1
+#define AUDIT_LOCKED 2
#ifdef CONFIG_AUDIT
/* These are defined in audit.c */
/* Public API */
@@ -202,7 +205,7 @@ static inline int audit_log_task_context(struct audit_buffer *ab)
static inline void audit_log_task_info(struct audit_buffer *ab,
struct task_struct *tsk)
{ }
-#define audit_enabled 0
+#define audit_enabled AUDIT_OFF
#endif /* CONFIG_AUDIT */
#ifdef CONFIG_AUDIT_COMPAT_GENERIC
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 7f2e31a..ce995a1 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -734,7 +734,7 @@ static inline struct audit_buffer *xfrm_audit_start(const char *op)
{
struct audit_buffer *audit_buf = NULL;
- if (audit_enabled == 0)
+ if (audit_enabled == AUDIT_OFF)
return NULL;
audit_buf = audit_log_start(audit_context(), GFP_ATOMIC,
AUDIT_MAC_IPSEC_EVENT);
diff --git a/kernel/audit.c b/kernel/audit.c
index e7478cb..8442c65 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -83,9 +83,6 @@
#define AUDIT_INITIALIZED 1
static int audit_initialized;
-#define AUDIT_OFF 0
-#define AUDIT_ON 1
-#define AUDIT_LOCKED 2
u32 audit_enabled = AUDIT_OFF;
bool audit_ever_enabled = !!AUDIT_OFF;
diff --git a/net/netfilter/xt_AUDIT.c b/net/netfilter/xt_AUDIT.c
index f368ee6..af883f1 100644
--- a/net/netfilter/xt_AUDIT.c
+++ b/net/netfilter/xt_AUDIT.c
@@ -72,7 +72,7 @@ static bool audit_ip6(struct audit_buffer *ab, struct sk_buff *skb)
struct audit_buffer *ab;
int fam = -1;
- if (audit_enabled == 0)
+ if (audit_enabled == AUDIT_OFF)
goto errout;
ab = audit_log_start(NULL, GFP_ATOMIC, AUDIT_NETFILTER_PKT);
if (ab == NULL)
diff --git a/net/netlabel/netlabel_user.c b/net/netlabel/netlabel_user.c
index 2f328af..4676f5b 100644
--- a/net/netlabel/netlabel_user.c
+++ b/net/netlabel/netlabel_user.c
@@ -101,7 +101,7 @@ struct audit_buffer *netlbl_audit_start_common(int type,
char *secctx;
u32 secctx_len;
- if (audit_enabled == 0)
+ if (audit_enabled == AUDIT_OFF)
return NULL;
audit_buf = audit_log_start(audit_context(), GFP_ATOMIC, type);
--
1.8.3.1
^ permalink raw reply related
* Re: [PATCH 0/4] RFC CPSW switchdev mode
From: Grygorii Strashko @ 2018-06-05 23:23 UTC (permalink / raw)
To: Andrew Lunn
Cc: Ilias Apalodimas, Ivan Vecera, Jiri Pirko, netdev,
ivan.khoronzhuk, nsekhar, francois.ozog, yogeshs, spatton
In-Reply-To: <20180603002618.GB14515@lunn.ch>
On 06/02/2018 07:26 PM, Andrew Lunn wrote:
>> *After this patch set*: goal keep things working the same as max as
>> possible and get rid of TI custom tool.
>
> We are happy to keep things the same, if they fit with the switchdev
> model. Anything in your customer TI tool/model which does not fit the
> switchdev model you won't be able to keep, except if we agree to
> extend the model.
Right. That's the main goal of RFC to identify those gaps.
>
> I can say now, sw0p0 is going to cause problems. I really do suggest
> you drop it for the moment in order to get a minimal driver
> accepted. sw0p0 does not fit the switchdev model.
Honestly, this is not the first patchset and we started without sw0p0,
but then.... (with current LKML)
- default vlan offloading breaks traffic reception to P0
(Ilias saying it's fixed in next - good)
- adding vlan to P1/P2 works, but not for P0 (again as per Ilias -fixed)
- mcast - no way to manually add static record and include or exclude P0.
:( above are basic functionality required.
>
>> Below I've described some tested use cases (not include full static configuration),
>> but regarding sw0p0 - there is work done by Ivan Khoronzhuk [1] which enables
>> adds MQPRIO and CBS Qdisc and targets AVB network features. It required to
>> offload MQPRIO and CBS parameters on all ports including P0. In case of P0,
>> CPDMA TX channels shapers need to be configured, and in case
>> of sw0p1/sw0p2 internal FIFOS.
>> sw0p0 also expected to be used to configure CPDMA interface in general -
>> number of tx/rx channels, rates, ring sizes.
>
> Can this be derives from the configuration on sw0p1 and sw0p2?
> sw0p1 has 1 tx channel, sw0p2 has 2 tx channels, so give p0 3 tx
> channels?
This not exactly what is required, sry I probably will need just repeat what Ivan
described in https://lkml.org/lkml/2018/5/18/1135. I'd try to provide this info tomorrow.
Unfortunately, I'm not sure if all this reworking and switchdev conversation would make sense
if we will not be able to fit Ivan's work in new CPSW driver model ;..(
and do AVB bridge.
>
>> In addition there is set of global CPSW parameters (not related to P1/P2, like
>> MAC Authorization Mode, OUI Deny Mode, crc ) which I've
>> thought can be added to sw0p0 (using ethtool -priv-flags).
>
> You should describe these features, and then we can figure out how
> best to model them. devlink might be an option if they are switch
> global.
Ok. that can be postponed.
--
regards,
-grygorii
^ permalink raw reply
* Re: [PATCH 0/4] RFC CPSW switchdev mode
From: Andrew Lunn @ 2018-06-05 23:40 UTC (permalink / raw)
To: Grygorii Strashko
Cc: Ilias Apalodimas, Ivan Vecera, Jiri Pirko, netdev,
ivan.khoronzhuk, nsekhar, francois.ozog, yogeshs, spatton
In-Reply-To: <06981861-01aa-90c8-9b30-68426cbef447@ti.com>
> So, my understanding for (1) "blocked FDB entry support" is reasonable
> extension for bridge/switchdev ("green").
You might have to justify it, but yes.
> > Does the network stack need for forward specific multicast MAC
> > addresses between bridge ports independent of the state? If there is
> > no need for it, you don't need to accelerate it.
>
> Assume this is about item 2 - this question is related to STP packets.
> CPSW/ALE will drop STP packets if receiving port is in blocking/learning states
> unless corresponding mcast entry exist in ALE entry with (Supervisory Packet) flag set
> (Actually ALE mcast entry has two fields (TRM):
> Supervisory Packet (SUPER): When set, this field indicates that the packet
> with a matching multicast destination address is a supervisory packet.
> Multicast Forward State (MCAST_FWD_STATE): Indicates the port state(s) required for the received port
> on a destination address lookup in order for the multicast packet to be forwarded to
> the transmit port(s). A transmit port must be in the Forwarding state in
> order to forward the packet.)
So i this case, i would expect your driver to just add these entries
to the ALE. No need for configuration for above. Just do it as soon as
a port is made a member of a bridge. And remove it when the port
leaves the bridge.
DSA devices are smarter. They all have hardware which looks for BPDU
and forwards them to the host independent of the state of the port.
They also tend to have hardware looking for IGMP packets, and will
forward them to the host, even if the multicast address is not being
forwarded to the host.
> ** "unknown vlan configuration"
>
> This is about following use case. Non static network configuration when
> CPSW based device knows what traffic it can accept (Host port 0), but
> it has no knowledge (or limited) about network segments attached to Port 1 and Port 2.
>
> For example: Host 0 can accept only vlan 100 traffic coming from Port 1.
> ALE entry: vid =100, port_mask 0x3
>
> But there can be vlan 50 created in attached network segments.
> Unknown VLAN Force Untagged Egress ports Mask = 0x0
> Unknown VLAN Registered Multicast Flood Ports Mask = 0x6 (P1|P2)
> Unknown VLAN Multicast Flood ports Mask = 0x6 (P1|P2)
> Unknown VLAN Member ports List = 0x6 (P1|P2)
>
> with above configuration packets with "unknown vlan" (no ALE entry) will
> still be forwarded between port 1 and 2, but not Port 0.
>
> So, is it reasonable to add "unknown vlan configuration" to bridge/switchdev
> if not exist yet? will any other hw known benefit from it?
You need to think about the case of two e1000e. How do you configure
this setup to do what you want? It probably is already possible.
Andrew
^ permalink raw reply
* wir bieten 2% Kredite
From: Ronald Bernstein @ 2018-06-05 23:25 UTC (permalink / raw)
To: Recipients
Sehr geehrte Damen und Herren,
Sie brauchen Geld? Sie sind auf der suche nach einem Darlehen? Seriös und
unkompliziert?
Dann sind Sie hier bei uns genau richtig.
Durch unsere jahrelange Erfahrung und kompetente Beratung sind wir
Europaweit tätig.
Wir bieten jedem ein GÜNSTIGES Darlehen zu TOP Konditionen an.
Darlehnen zwischen 5000 CHF/Euro bis zu 20 Millionen CHF/Euro möglich.
Wir erheben dazu 2% Zinssatz.
Lassen Sie sich von unserem kompetenten Team beraten.
Zögern Sie nicht und kontaktieren Sie mich unter für weitere Infos &
Anfragen unter der eingeblendeten Email Adresse.
Ich freue mich von Ihnen zu hören.
^ permalink raw reply
* Re: [PATCH 0/4] RFC CPSW switchdev mode
From: Andrew Lunn @ 2018-06-05 23:49 UTC (permalink / raw)
To: Grygorii Strashko
Cc: Ilias Apalodimas, Ivan Vecera, Jiri Pirko, netdev,
ivan.khoronzhuk, nsekhar, francois.ozog, yogeshs, spatton
In-Reply-To: <4908e5f8-7533-6fa5-866f-1fb32fd13857@ti.com>
On Tue, Jun 05, 2018 at 06:23:45PM -0500, Grygorii Strashko wrote:
>
>
> On 06/02/2018 07:26 PM, Andrew Lunn wrote:
> >> *After this patch set*: goal keep things working the same as max as
> >> possible and get rid of TI custom tool.
> >
> > We are happy to keep things the same, if they fit with the switchdev
> > model. Anything in your customer TI tool/model which does not fit the
> > switchdev model you won't be able to keep, except if we agree to
> > extend the model.
>
> Right. That's the main goal of RFC to identify those gaps.
>
> >
> > I can say now, sw0p0 is going to cause problems. I really do suggest
> > you drop it for the moment in order to get a minimal driver
> > accepted. sw0p0 does not fit the switchdev model.
>
> Honestly, this is not the first patchset and we started without sw0p0,
> but then.... (with current LKML)
> - default vlan offloading breaks traffic reception to P0
> (Ilias saying it's fixed in next - good)
> - adding vlan to P1/P2 works, but not for P0 (again as per Ilias -fixed)
> - mcast - no way to manually add static record and include or exclude P0.
>
>
> :( above are basic functionality required.
For a DSA driver, this is way more than basic. A basic DSA driver just
provides interfaces, and does everything in software. No offload at
all. Generally, FDB offload is next, then MDB, and then VLAN, each as
separate patch sets.
> Unfortunately, I'm not sure if all this reworking and switchdev conversation would make sense
> if we will not be able to fit Ivan's work in new CPSW driver model ;..(
> and do AVB bridge.
AVB bridge should fit the switchdev model. You can offload TC via
switchdev e.g. the b53 has mirred, mellonex has flower and a lot more.
Andrew
^ permalink raw reply
* Re: umh build...
From: David Miller @ 2018-06-05 23:42 UTC (permalink / raw)
To: alexei.starovoitov; +Cc: netdev
In-Reply-To: <20180605222932.k4nmh23ggf4pnsft@ast-mbp>
From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Date: Tue, 5 Jun 2018 15:29:34 -0700
> On Tue, Jun 05, 2018 at 05:03:59PM -0400, David Miller wrote:
>>
>> Alexei, I tried to build bpfilter on sparc64 and it shows that
>> CONFIG_OUTPUT_FORMAT is an x86 specific Kconfig value.
>>
>> And, even if I added it, it's not clear what it should even be set to.
>>
>> Right now, for example, my userland builds default to 32-bit sparc
>> even though I'm building a 64-bit kernel.
>>
>> I think what ends up happening has to be in some way in response to
>> what kind of "native" binaries HOSTCC is actually building.
>
> I guess you mean native-ness not of HOSTCC, but what target rootfs will be.
Target rootfs is where I am building the kernel in this case.
Kernel running is 64-bit, most of userland is 32-bit. Kernel is built
explicitly with "-m64" but gcc by default is building 32-bit.
> I guess it's probably equal to 32 or 64-bitness of the kernel from
> CC.
Nope, they are and can be different, even natively.
> I completely forgot that I need to fix CONFIG_OUTPUT_FORMAT.
> On my arm64 I just did
> export CONFIG_OUTPUT_FORMAT=elf64-littleaarch64
> before doing 'make'.
> I can do a hack to extract it from objdump
> similar to the hack I do for '-B ...'
> Like the patch below...
> and it works on x64 too, but I'm not sure about sparc.
...
> @@ -21,7 +21,7 @@ endif
> # which bpfilter_kern.c passes further into umh blob loader at run-time
> quiet_cmd_copy_umh = GEN $@
> cmd_copy_umh = echo ':' > $(obj)/.bpfilter_umh.o.cmd; \
> - $(OBJCOPY) -I binary -O $(CONFIG_OUTPUT_FORMAT) \
> + $(OBJCOPY) -I binary -O `$(OBJDUMP) -f $<|grep format|cut -d' ' -f8` \
> -B `$(OBJDUMP) -f $<|grep architecture|cut -d, -f1|cut -d' ' -f2` \
> --rename-section .data=.init.rodata $< $@
Yeah this doesn't do it in the 64-bit kernel 32-bit userspace scenerio.
We build UMH which ends up being a 32-bit ELF binary, then we later
try to link it:
ld -m elf64_sparc -r -o net/bpfilter/bpfilter.o net/bpfilter/bpfilter_kern.o net/bpfilter/bpfilter_umh.o ; scripts/mod/modpost net/bpfilter/bpfilter.o
ld: sparc:v8plusb architecture of input file `net/bpfilter/bpfilter_umh.o' is incompatible with sparc:v9 output
But your patch is a step in the right direction because it gets us
away from depending upon CONFIG_OUTPUT_FORMAT.
^ permalink raw reply
* Re: [PATCH net] failover: eliminate callback hell
From: Samudrala, Sridhar @ 2018-06-05 23:52 UTC (permalink / raw)
To: Stephen Hemminger, Michael S. Tsirkin
Cc: kys, haiyangz, davem, netdev, Stephen Hemminger
In-Reply-To: <20180605145222.477e5ae8@xeon-e3>
On 6/5/2018 2:52 PM, Stephen Hemminger wrote:
> On Tue, 5 Jun 2018 22:38:43 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
>>> See:
>>> https://patchwork.ozlabs.org/patch/851711/
>> Let me try to summarize that:
>>
>> You wanted to speed up the delayed link up. You had an idea to
>> additionally take link up when userspace renames the interface (standby
>> one which is also the failover for netvsc).
>>
>> But userspace might not do any renames, in which case there will
>> still be the delay, and so this never got applied.
>>
>> Is this a good summary?
>>
>> Davem said delay should go away completely as it's not robust, and I
>> think I agree. So I don't think we should make all failover users use
>> delay. IIUC failover kept a delay option especially for netvsc to
>> minimize the surprise factor. Hopefully we can come up with
>> something more robust and drop that option completely.
> The timeout was the original solution to how to complete setup after
> userspace has had a chance to rename the device. Unfortunately, the whole network
> device initialization (cooperation with udev and userspace) is a a mess because
> there is no well defined specification, and there are multiple ways userspace
> does this in old and new distributions. The timeout has its own issues
> (how long, handling errors during that window, what if userspace modifies other
> device state); and open to finding a better solution.
>
> My point was that if name change can not be relied on (or used) by netvsc,
> then we can't allow it for net_failover either.
I think the push back was with the usage of the delay, not bringing up the primary/standby
device in the name change event handler.
Can't netvsc use this mechanism instead of depending on the delay?
^ permalink raw reply
* Re: [PATCH 0/4] RFC CPSW switchdev mode
From: Andrew Lunn @ 2018-06-05 23:53 UTC (permalink / raw)
To: Grygorii Strashko
Cc: Ilias Apalodimas, netdev, ivan.khoronzhuk, nsekhar, jiri, ivecera,
francois.ozog, yogeshs, spatton
In-Reply-To: <85eb9f54-a4de-969e-4658-38878d481292@ti.com>
> And I just have to look a little bit in the future as selected approach
> expected to be extended on future SoC (and other parts of existing SoCs - ICSS-G SW switch)
> where we going to have more features, like TSN, EST and packet Policing and Classification.
You should probably took a closer look at the Mellonex driver. It has
a lot of TC offload, which is what policing and classification is.
TSN is being worked on in general, and i think the i210 is taking the
lead. So you probably want to keep an eye on that, and join the
discussion.
Andrew
^ permalink raw reply
* [PATCH net-next] enic: fix UDP rss bits
From: Govindarajulu Varadarajan @ 2018-06-05 17:14 UTC (permalink / raw)
To: davem, netdev; +Cc: benve, Govindarajulu Varadarajan
In commit 48398b6e7065 ("enic: set UDP rss flag") driver needed to set a
single bit to enable UDP rss. This is changed to two bit. One for UDP
IPv4 and other bit for UDP IPv6. The hardware which supports this is not
released yet. When released, driver should set 2 bit to enable UDP rss for
both IPv4 and IPv6.
Also add spinlock around vnic_dev_capable_rss_hash_type().
Signed-off-by: Govindarajulu Varadarajan <gvaradar@cisco.com>
---
.../net/ethernet/cisco/enic/enic_ethtool.c | 18 +++++++++++++----
drivers/net/ethernet/cisco/enic/enic_main.c | 20 +++++++++++++------
drivers/net/ethernet/cisco/enic/enic_res.c | 7 ++++++-
drivers/net/ethernet/cisco/enic/vnic_dev.c | 18 ++++++++++-------
drivers/net/ethernet/cisco/enic/vnic_dev.h | 2 +-
drivers/net/ethernet/cisco/enic/vnic_devcmd.h | 20 ++++++++++++++++++-
drivers/net/ethernet/cisco/enic/vnic_nic.h | 3 ++-
7 files changed, 67 insertions(+), 21 deletions(-)
diff --git a/drivers/net/ethernet/cisco/enic/enic_ethtool.c b/drivers/net/ethernet/cisco/enic/enic_ethtool.c
index 869006c2002d..f42f7a6e1559 100644
--- a/drivers/net/ethernet/cisco/enic/enic_ethtool.c
+++ b/drivers/net/ethernet/cisco/enic/enic_ethtool.c
@@ -476,18 +476,28 @@ static int enic_grxclsrule(struct enic *enic, struct ethtool_rxnfc *cmd)
static int enic_get_rx_flow_hash(struct enic *enic, struct ethtool_rxnfc *cmd)
{
+ u8 rss_hash_type = 0;
cmd->data = 0;
+ spin_lock_bh(&enic->devcmd_lock);
+ (void)vnic_dev_capable_rss_hash_type(enic->vdev, &rss_hash_type);
+ spin_unlock_bh(&enic->devcmd_lock);
switch (cmd->flow_type) {
case TCP_V6_FLOW:
case TCP_V4_FLOW:
- cmd->data |= RXH_L4_B_0_1 | RXH_L4_B_2_3;
- /* Fall through */
+ cmd->data |= RXH_L4_B_0_1 | RXH_L4_B_2_3 |
+ RXH_IP_SRC | RXH_IP_DST;
+ break;
case UDP_V6_FLOW:
+ cmd->data |= RXH_IP_SRC | RXH_IP_DST;
+ if (rss_hash_type & NIC_CFG_RSS_HASH_TYPE_UDP_IPV6)
+ cmd->data |= RXH_L4_B_0_1 | RXH_L4_B_2_3;
+ break;
case UDP_V4_FLOW:
- if (vnic_dev_capable_udp_rss(enic->vdev))
+ cmd->data |= RXH_IP_SRC | RXH_IP_DST;
+ if (rss_hash_type & NIC_CFG_RSS_HASH_TYPE_UDP_IPV4)
cmd->data |= RXH_L4_B_0_1 | RXH_L4_B_2_3;
- /* Fall through */
+ break;
case SCTP_V4_FLOW:
case AH_ESP_V4_FLOW:
case AH_V4_FLOW:
diff --git a/drivers/net/ethernet/cisco/enic/enic_main.c b/drivers/net/ethernet/cisco/enic/enic_main.c
index 8a8b12b720ef..30d2eaa18c04 100644
--- a/drivers/net/ethernet/cisco/enic/enic_main.c
+++ b/drivers/net/ethernet/cisco/enic/enic_main.c
@@ -2320,16 +2320,24 @@ static int enic_set_rss_nic_cfg(struct enic *enic)
{
struct device *dev = enic_get_dev(enic);
const u8 rss_default_cpu = 0;
- u8 rss_hash_type = NIC_CFG_RSS_HASH_TYPE_IPV4 |
- NIC_CFG_RSS_HASH_TYPE_TCP_IPV4 |
- NIC_CFG_RSS_HASH_TYPE_IPV6 |
- NIC_CFG_RSS_HASH_TYPE_TCP_IPV6;
const u8 rss_hash_bits = 7;
const u8 rss_base_cpu = 0;
+ u8 rss_hash_type;
+ int res;
u8 rss_enable = ENIC_SETTING(enic, RSS) && (enic->rq_count > 1);
- if (vnic_dev_capable_udp_rss(enic->vdev))
- rss_hash_type |= NIC_CFG_RSS_HASH_TYPE_UDP;
+ spin_lock_bh(&enic->devcmd_lock);
+ res = vnic_dev_capable_rss_hash_type(enic->vdev, &rss_hash_type);
+ spin_unlock_bh(&enic->devcmd_lock);
+ if (res) {
+ /* defaults for old adapters
+ */
+ rss_hash_type = NIC_CFG_RSS_HASH_TYPE_IPV4 |
+ NIC_CFG_RSS_HASH_TYPE_TCP_IPV4 |
+ NIC_CFG_RSS_HASH_TYPE_IPV6 |
+ NIC_CFG_RSS_HASH_TYPE_TCP_IPV6;
+ }
+
if (rss_enable) {
if (!enic_set_rsskey(enic)) {
if (enic_set_rsscpu(enic, rss_hash_bits)) {
diff --git a/drivers/net/ethernet/cisco/enic/enic_res.c b/drivers/net/ethernet/cisco/enic/enic_res.c
index 9c96911fb2c8..40b20817ddd5 100644
--- a/drivers/net/ethernet/cisco/enic/enic_res.c
+++ b/drivers/net/ethernet/cisco/enic/enic_res.c
@@ -149,6 +149,7 @@ int enic_set_nic_cfg(struct enic *enic, u8 rss_default_cpu, u8 rss_hash_type,
u8 rss_hash_bits, u8 rss_base_cpu, u8 rss_enable, u8 tso_ipid_split_en,
u8 ig_vlan_strip_en)
{
+ enum vnic_devcmd_cmd cmd = CMD_NIC_CFG;
u64 a0, a1;
u32 nic_cfg;
int wait = 1000;
@@ -160,7 +161,11 @@ int enic_set_nic_cfg(struct enic *enic, u8 rss_default_cpu, u8 rss_hash_type,
a0 = nic_cfg;
a1 = 0;
- return vnic_dev_cmd(enic->vdev, CMD_NIC_CFG, &a0, &a1, wait);
+ if (rss_hash_type & (NIC_CFG_RSS_HASH_TYPE_UDP_IPV4 |
+ NIC_CFG_RSS_HASH_TYPE_UDP_IPV6))
+ cmd = CMD_NIC_CFG_CHK;
+
+ return vnic_dev_cmd(enic->vdev, cmd, &a0, &a1, wait);
}
int enic_set_rss_key(struct enic *enic, dma_addr_t key_pa, u64 len)
diff --git a/drivers/net/ethernet/cisco/enic/vnic_dev.c b/drivers/net/ethernet/cisco/enic/vnic_dev.c
index 76cdd4c9d11f..e9db811df59c 100644
--- a/drivers/net/ethernet/cisco/enic/vnic_dev.c
+++ b/drivers/net/ethernet/cisco/enic/vnic_dev.c
@@ -1282,19 +1282,23 @@ int vnic_dev_get_supported_feature_ver(struct vnic_dev *vdev, u8 feature,
return ret;
}
-bool vnic_dev_capable_udp_rss(struct vnic_dev *vdev)
+int vnic_dev_capable_rss_hash_type(struct vnic_dev *vdev, u8 *rss_hash_type)
{
u64 a0 = CMD_NIC_CFG, a1 = 0;
- u64 rss_hash_type;
int wait = 1000;
int err;
err = vnic_dev_cmd(vdev, CMD_CAPABILITY, &a0, &a1, wait);
- if (err || !a0)
- return false;
+ /* rss_hash_type is valid only when a0 is 1. Adapter which does not
+ * support CMD_CAPABILITY for rss_hash_type has a0 = 0
+ */
+ if (err || (a0 != 1))
+ return -EOPNOTSUPP;
+
+ a1 = (a1 >> NIC_CFG_RSS_HASH_TYPE_SHIFT) &
+ NIC_CFG_RSS_HASH_TYPE_MASK_FIELD;
- rss_hash_type = (a1 >> NIC_CFG_RSS_HASH_TYPE_SHIFT) &
- NIC_CFG_RSS_HASH_TYPE_MASK_FIELD;
+ *rss_hash_type = (u8)a1;
- return (rss_hash_type & NIC_CFG_RSS_HASH_TYPE_UDP);
+ return 0;
}
diff --git a/drivers/net/ethernet/cisco/enic/vnic_dev.h b/drivers/net/ethernet/cisco/enic/vnic_dev.h
index 59d4cc8fbb85..714fc1ed79e3 100644
--- a/drivers/net/ethernet/cisco/enic/vnic_dev.h
+++ b/drivers/net/ethernet/cisco/enic/vnic_dev.h
@@ -184,6 +184,6 @@ int vnic_dev_overlay_offload_cfg(struct vnic_dev *vdev, u8 overlay,
u16 vxlan_udp_port_number);
int vnic_dev_get_supported_feature_ver(struct vnic_dev *vdev, u8 feature,
u64 *supported_versions, u64 *a1);
-bool vnic_dev_capable_udp_rss(struct vnic_dev *vdev);
+int vnic_dev_capable_rss_hash_type(struct vnic_dev *vdev, u8 *rss_hash_type);
#endif /* _VNIC_DEV_H_ */
diff --git a/drivers/net/ethernet/cisco/enic/vnic_devcmd.h b/drivers/net/ethernet/cisco/enic/vnic_devcmd.h
index 41de4ba622a1..fef5a0a0663d 100644
--- a/drivers/net/ethernet/cisco/enic/vnic_devcmd.h
+++ b/drivers/net/ethernet/cisco/enic/vnic_devcmd.h
@@ -148,8 +148,26 @@ enum vnic_devcmd_cmd {
/* del VLAN id in (u16)a0 */
CMD_VLAN_DEL = _CMDCNW(_CMD_DIR_WRITE, _CMD_VTYPE_ENET, 15),
- /* nic_cfg in (u32)a0 */
+ /* nic_cfg (no wait, always succeeds)
+ * in: (u32)a0
+ *
+ * Capability query:
+ * out: (u64) a0 = 1 if a1 is valid
+ * (u64) a1 = (NIC_CFG bits supported) | (flags << 32)
+ *
+ * flags are CMD_NIC_CFG_CAPF_xxx
+ */
CMD_NIC_CFG = _CMDCNW(_CMD_DIR_WRITE, _CMD_VTYPE_ALL, 16),
+ /* nic_cfg_chk (will return error if flags are invalid)
+ * in: (u32)a0
+ *
+ * Capability query:
+ * out: (u64) a0 = 1 if a1 is valid
+ * (u64) a1 = (NIC_CFG bits supported) | (flags << 32)
+ *
+ * flags are CMD_NIC_CFG_CAPF_xxx
+ */
+ CMD_NIC_CFG_CHK = _CMDC(_CMD_DIR_WRITE, _CMD_VTYPE_ALL, 16),
/* union vnic_rss_key in mem: (u64)a0=paddr, (u16)a1=len */
CMD_RSS_KEY = _CMDC(_CMD_DIR_WRITE, _CMD_VTYPE_ENET, 17),
diff --git a/drivers/net/ethernet/cisco/enic/vnic_nic.h b/drivers/net/ethernet/cisco/enic/vnic_nic.h
index 5a93db0d7afc..84ff8ca17fcb 100644
--- a/drivers/net/ethernet/cisco/enic/vnic_nic.h
+++ b/drivers/net/ethernet/cisco/enic/vnic_nic.h
@@ -41,13 +41,14 @@
#define NIC_CFG_IG_VLAN_STRIP_EN_MASK_FIELD 1UL
#define NIC_CFG_IG_VLAN_STRIP_EN_SHIFT 24
+#define NIC_CFG_RSS_HASH_TYPE_UDP_IPV4 (1 << 0)
#define NIC_CFG_RSS_HASH_TYPE_IPV4 (1 << 1)
#define NIC_CFG_RSS_HASH_TYPE_TCP_IPV4 (1 << 2)
#define NIC_CFG_RSS_HASH_TYPE_IPV6 (1 << 3)
#define NIC_CFG_RSS_HASH_TYPE_TCP_IPV6 (1 << 4)
#define NIC_CFG_RSS_HASH_TYPE_IPV6_EX (1 << 5)
#define NIC_CFG_RSS_HASH_TYPE_TCP_IPV6_EX (1 << 6)
-#define NIC_CFG_RSS_HASH_TYPE_UDP (1 << 7)
+#define NIC_CFG_RSS_HASH_TYPE_UDP_IPV6 (1 << 7)
static inline void vnic_set_nic_cfg(u32 *nic_cfg,
u8 rss_default_cpu, u8 rss_hash_type,
--
2.17.1
^ permalink raw reply related
* Re: [PATCH bpf-next v7 3/6] bpf: Add IPv6 Segment Routing helpers
From: Mathieu Xhonneux @ 2018-06-06 1:45 UTC (permalink / raw)
To: Daniel Borkmann; +Cc: netdev, David Lebrun, Alexei Starovoitov
In-Reply-To: <0555b355-be53-ed2f-2a8b-8093bf8aef2c@iogearbox.net>
2018-05-30 13:00 GMT+02:00 Daniel Borkmann <daniel@iogearbox.net>:
>> Instead of doing this inside the helper you can reject the program already
>> in the lwt_*_func_proto() by returning NULL when !CONFIG_IPV6_SEG6_BPF. That
>> way programs get rejected at verification time instead of runtime, so the
>> user can probe availability more easily.
>
My initial idea here was that we could still allow End.BPF when
!CONFIG_IPV6_SEG6_BPF, even if the helpers aren't accessible. But
indeed, having a homogeneous behavior is probably a better solution,
that would make things more clearer for users.
> Mathieu, before this gets lost in archives, plan to follow-up on this one?
Sure, I have been totally overbooked the last two weeks and will still
be for one week or two. I'm planning to send some patches ASAP, there
are a few things to improve and debug. iproute2 support has still not
been sent upstream as well.
^ permalink raw reply
* Re: general protection fault in sockfs_setattr
From: shankarapailoor @ 2018-06-06 2:19 UTC (permalink / raw)
To: Cong Wang; +Cc: David Miller, LKML, syzkaller, Linux Kernel Network Developers
In-Reply-To: <CAM_iQpUaK1xaZAKmPY0JnyAmebBJQ5GPJsiG8HgvXLrxNqphZg@mail.gmail.com>
Hi Cong,
I added that check and it seems to stop the crash. Like you said, I
don't see where the reference count for the file is increased. The
inode lock also seems to be held during this call.
Regards,
Shankara
On Tue, Jun 5, 2018 at 12:14 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Mon, Jun 4, 2018 at 9:53 PM, shankarapailoor
> <shankarapailoor@gmail.com> wrote:
>> Hi,
>>
>> I have been fuzzing Linux 4.17-rc7 with Syzkaller and found the
>> following crash: https://pastebin.com/ixX3RB9j
>>
>> Syzkaller isolated the cause of the bug to the following program:
>>
>> socketpair$unix(0x1, 0x1, 0x0,
>> &(0x7f0000000000)={<r0=>0xffffffffffffffff, <r1=>0xffffffffffffffff})
>> getresuid(&(0x7f0000000080)=<r2=>0x0, &(0x7f00000000c0),
>> &(0x7f0000000700))r3 = getegid()
>> fchownat(r0, &(0x7f0000000040)='\x00', r2, r3, 0x1000)
>> dup3(r1, r0, 0x80000)
>>
>>
>> The problematic area appears to be here:
>>
>> static int sockfs_setattr(struct dentry *dentry, struct iattr *iattr)
>> {
>> int err = simple_setattr(dentry, iattr);
>>
>> if (!err && (iattr->ia_valid & ATTR_UID)) {
>> struct socket *sock = SOCKET_I(d_inode(dentry));
>>
>> sock->sk->sk_uid = iattr->ia_uid; //KASAN GPF
>> }
>> return err;
>> }
>>
>> If dup3 is called concurrently with fchownat then can sock->sk be NULL?
>
> Although dup3() implies a close(), fd is refcnt'ted, if dup3() runs
> concurrently with fchownat() it should not be closed until whoever
> the last closes it.
>
> Or maybe fchownat() doesn't even hold refcnt of fd, since it aims
> to change the file backed.
>
>
> Not sure if the following is sufficient, inode might need to be protected
> with some lock...
>
> diff --git a/net/socket.c b/net/socket.c
> index f10f1d947c78..6294b4b3132e 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -537,7 +537,10 @@ static int sockfs_setattr(struct dentry *dentry,
> struct iattr *iattr)
> if (!err && (iattr->ia_valid & ATTR_UID)) {
> struct socket *sock = SOCKET_I(d_inode(dentry));
>
> - sock->sk->sk_uid = iattr->ia_uid;
> + if (sock->sk)
> + sock->sk->sk_uid = iattr->ia_uid;
> + else
> + err = -ENOENT;
> }
>
> return err;
--
Regards,
Shankara Pailoor
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox