Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next] drivers/net: add 6WIND SHULTI support
From: Nicolas Dichtel @ 2016-04-27 15:38 UTC (permalink / raw)
  To: Jiri Pirko, Florian Westphal; +Cc: davem, netdev
In-Reply-To: <20160427151336.GB1962@nanopsycho.orion>

Le 27/04/2016 17:14, Jiri Pirko a écrit :
> Wed, Apr 27, 2016 at 11:56:15AM CEST, fw@strlen.de wrote:
>> Nicolas Dichtel <nicolas.dichtel@6wind.com> wrote:
>>> This patch adds the support of the 6WIND SHULTI switch. It is a software
>>> switch doing L2 forwarding.
>>>
>>> This first version implements the minimum needed to get the device working.
>>> It also implements, via switchdev and rtnetlink, bridge forwarding offload,
>>> including FDB static entries, FDB learning and FDB ageing.
>>
>> How is this different from net/bridge?
>> How is this different from openvswitch?
> 
> The difference is that it this tries to allow userspace crap to mirror
> setting user does for bridge/ovs. Basically this looks to me like an
> attempt to enable userspace SDKs and such.
> 
It is software switch, allowed by the switchdev model (see
Documentation/networking/switchdev.txt), same design as mellanox spectrum.

What's wrong with that?

^ permalink raw reply

* Re: [PATCH] MAINTAINERS: net: Change maintainer for GRETH 10/100/1G Ethernet MAC device driver
From: Sam Ravnborg @ 2016-04-27 15:37 UTC (permalink / raw)
  To: Andreas Larsson; +Cc: David S. Miller, netdev, linux-kernel, software
In-Reply-To: <1461768370-3011-1-git-send-email-andreas@gaisler.com>

Hi Andreas.

Tak!

	Sam

On Wed, Apr 27, 2016 at 04:46:10PM +0200, Andreas Larsson wrote:
> Signed-off-by: Andreas Larsson <andreas@gaisler.com>
> ---
>  MAINTAINERS |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 1d5b4be..08ffebd 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -4903,7 +4903,7 @@ F:	net/ipv4/gre_offload.c
>  F:	include/net/gre.h
>  
>  GRETH 10/100/1G Ethernet MAC device driver
> -M:	Kristoffer Glembo <kristoffer@gaisler.com>
> +M:	Andreas Larsson <andreas@gaisler.com>
>  L:	netdev@vger.kernel.org
>  S:	Maintained
>  F:	drivers/net/ethernet/aeroflex/
> -- 
> 1.7.10.4
> 

^ permalink raw reply

* Re: ixgbe: cannot enable LRO
From: Alexander Duyck @ 2016-04-27 15:37 UTC (permalink / raw)
  To: Otto Sabart; +Cc: Netdev, Jirka Hladky, Adam Okuliar
In-Reply-To: <20160427093646.GA15205@redhat.com>

On Wed, Apr 27, 2016 at 2:36 AM, Otto Sabart <osabart@redhat.com> wrote:
>
> Hello everyone,
> does anybody have a problem with LRO on ixge (on latest 4.6-rc5)?
> I cannot find a way to enable it.
>
> On stable RHEL7.2 kernel everything works fine.
>
> I opened a bug report [0].
>
> [0] https://bugzilla.kernel.org/show_bug.cgi?id=117291
>
>
> Thanks!
>
> Ota

So I am able to turn on LRO without any issues.

Do you know if you have done anything that might disable LRO such as
modified the rx-usecs to a value less than 10 or enabled routing or
bridging on the device?  Also I think a stacked device might be able
to block you from enabling LRO unless all the devices stacked on the
interface can support it.

- Alex

^ permalink raw reply

* Re: [PATCH] ip: add udp_csum, udp6_csum_tx, udp6_csum_rx control flags to ip l2tp add tunnel
From: Wang Shanker @ 2016-04-27 15:33 UTC (permalink / raw)
  To: James Chapman; +Cc: netdev
In-Reply-To: <CAEwTi7SkMrgxpPi-EuC2UDiwv7L0-vdRAhtdAYLgnbnw5H-Eig@mail.gmail.com>



> 在 2016年4月27日，20:21，James Chapman <jchapman@katalix.com> 写道：
> 
> On 26 April 2016 at 15:15, Wang Shanker <shankerwangmiao@gmail.com> wrote:
>> Hi, all
>> 
>> It’s my first time to contribute to such an important open source project. Things began when I upgraded my server, called "Server A", form ubuntu 14.04 to 16.04, which is shipped with new kernel version, 4.4. After upgrade, I soon found a l2tp tunnel between this server and another linux server, called "Server B", via ipv6 broke down. Here is the network topology:
>> 
>>  +----------+                    +----------+
>>  | Server A | -- IPV6 Network -- | Server B |
>>  +----------+                    +----------+
>> 
>> The l2tp tunnel was encapsulated in udp datagrams. All the configuration was normal and could work after I reverted the kernel on Server A to original version.
>> 
>> Here is what i did to create the tunnel:
>> 
>> ```
>> on Server A:
>> 
>> ip l2tp add tunnel tunnel_id 86 peer_tunnel_id 86 remote 2001:db8::aaaa local 2001:db8::bbbb udp_sport 1086 udp_dport 1086
>> ip l2tp add session name l2tpeth0 tunnel_id 86 session_id 86 peer_session_id 86
>> ip l s l2tpeth0 up
>> 
>> on Server B:
>> 
>> ip l2tp add tunnel tunnel_id 86 peer_tunnel_id 86 local 2001:db8::aaaa remote 2001:db8::bbbb udp_sport 1086 udp_dport 1086
>> ip l2tp add session name l2tpeth0 tunnel_id 86 session_id 86 peer_session_id 86
>> ip l s l2tpeth0 up
>> 
>> ```
>> 
>> When I used tcpdump to diagnose the problem, I got such result:
>> 
>> ```
>> on Server A:
>> 
>> arping -i l2tpeth0 -0 1.2.3.4
>> 
>> on Server B:
>> 
>> tcpdump -i eth0 -n port 1086  -v
>> 
>> 21:35:57.818810 IP6 (flowlabel 0x8f028, hlim 64, next-header UDP (17) payload length: 62) 2001:db8::aaaa.1086 > 2001:db8::bbbb.1086: [bad udp cksum 0x0000 -> 0x1140!] UDP, length 54
>> 21:35:58.820572 IP6 (flowlabel 0x8f028, hlim 64, next-header UDP (17) payload length: 62) 2001:db8::aaaa.1086 > 2001:db8::bbbb.1086: [bad udp cksum 0x0000 -> 0x1140!] UDP, length 54
>> 21:35:59.822216 IP6 (flowlabel 0x8f028, hlim 64, next-header UDP (17) payload length: 62) 2001:db8::aaaa.1086 > 2001:db8::bbbb.1086: [bad udp cksum 0x0000 -> 0x1140!] UDP, length 54
>> 
>> ```
>> 
>> After looking into kernel source, I found out that in this commit a new feature to set udp6 checksum to zero in commit 6b649fe, which added `L2TP_ATTR_UDP_ZERO_CSUM6_TX` and `L2TP_ATTR_UDP_ZERO_CSUM6_TX`.
>> 
>> As a result, I added `udp_csum`, `udp6_csum_tx`, `udp6_csum_rx` control flags to `ip l2tp add tunnel` to control those attributes about checksum.
>> 
>> Using this to create the tunnel instead on Server A:
>> 
>> ```
>> ip l2tp add tunnel tunnel_id 86 peer_tunnel_id 86 remote 2001:db8::aaaa local 2001:db8::bbbb udp_sport 1086 udp_dport 1086 udp6_csum_tx on udp6_csum_rx on
>> ```
>> 
>> I finally got:
>> 
>> ```
>> on Server A:
>> 
>> arping -i l2tpeth0 -0 1.2.3.4
>> 
>> on Server B:
>> 
>> tcpdump -i eth0 -n port 1086  -v
>> 
>> 22:07:03.844297 IP6 (flowlabel 0x8f028, hlim 64, next-header UDP (17) payload length: 62) 2001:db8::aaaa.1086 > 2001:db8::bbbb.1086: [udp sum ok] UDP, length 54
>> 22:07:04.845717 IP6 (flowlabel 0x8f028, hlim 64, next-header UDP (17) payload length: 62) 2001:db8::aaaa.1086 > 2001:db8::bbbb.1086: [udp sum ok] UDP, length 54
>> 22:07:05.847965 IP6 (flowlabel 0x8f028, hlim 64, next-header UDP (17) payload length: 62) 2001:db8::aaaa.1086 > 2001:db8::bbbb.1086: [udp sum ok] UDP, length 54
>> 
>> tcpdump -i l2tpeth0 -v
>> 
>> 22:10:35.691326 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 1.2.3.4 tell 0.0.0.0, length 28
>> 22:10:36.693627 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 1.2.3.4 tell 0.0.0.0, length 28
>> 22:10:37.695010 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 1.2.3.4 tell 0.0.0.0, length 28
>> 22:10:38.697121 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 1.2.3.4 tell 0.0.0.0, length 28
>> 
>> ```
>> 
>> It seems to work. However, is it the real point that should be fixed and does my patch fix it well? I need your suggestion.
> 
> This seems reasonable to me. It is good to provide user control of
> L2TP checksum options.
> 
> However, there is a problem with your patch. The netlink attributes
> you are accessing to control checksums are flags, not u8 values.

I’m not so familiar with kernel code. However, in `<linux/l2tp.h>` : 

```

/*
 * ATTR types defined for L2TP
 */
enum {
	L2TP_ATTR_NONE,			/* no data */
// ...
	L2TP_ATTR_IP6_DADDR,		/* struct in6_addr */
	L2TP_ATTR_UDP_ZERO_CSUM6_TX,	/* u8 */
	L2TP_ATTR_UDP_ZERO_CSUM6_RX,	/* u8 */
// ...

}
```

isn’t L2TP_ATTR_UDP_ZERO_CSUM6_TX a u8 value? Or should I use `addattr` instead of `addattr8`? 

> 
> Maybe the default checksum setting for such l2tp tunnels should be
> changed in the l2tp kernel code to match the previous behaviour where
> IPv6 checksums were disabled?

I think so. However, I’m confused with those code.

From the name of the attrs `L2TP_ATTR_UDP_ZERO_CSUM6_TX` and `L2TP_ATTR_UDP_ZERO_CSUM6_RX`, I 
can tell that when those flags are set, the checksum will be zero. Also, according to the 
comment of commit 6b649fe in kernel source, “Added new L2TP configuration options to allow 
TX and RX of zero checksums in IPv6. Default is not to use them.”, checksums shouldn't have
been zero by default. However, in fact, they are. I think there may be some bugs in kernel
source. 

> 
>> 
>> 
>> ---
>> ip/ipl2tp.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 45 insertions(+)
>> 
>> diff --git a/ip/ipl2tp.c b/ip/ipl2tp.c
>> index 3c8ee93..67a6482 100644
>> --- a/ip/ipl2tp.c
>> +++ b/ip/ipl2tp.c
>> @@ -56,6 +56,8 @@ struct l2tp_parm {
>> 
>>        uint16_t pw_type;
>>        uint16_t mtu;
>> +       int udp6_csum_tx:1;
>> +       int udp6_csum_rx:1;
>>        int udp_csum:1;
>>        int recv_seq:1;
>>        int send_seq:1;
>> @@ -117,6 +119,9 @@ static int create_tunnel(struct l2tp_parm *p)
>>        if (p->encap == L2TP_ENCAPTYPE_UDP) {
>>                addattr16(&req.n, 1024, L2TP_ATTR_UDP_SPORT, p->local_udp_port);
>>                addattr16(&req.n, 1024, L2TP_ATTR_UDP_DPORT, p->peer_udp_port);
>> +               addattr8 (&req.n, 1024, L2TP_ATTR_UDP_CSUM, p->udp_csum);
>> +               addattr8 (&req.n, 1024, L2TP_ATTR_UDP_ZERO_CSUM6_TX, p->udp6_csum_tx);
>> +               addattr8 (&req.n, 1024, L2TP_ATTR_UDP_ZERO_CSUM6_RX, p->udp6_csum_rx);
>>        }
>> 
>>        if (rtnl_talk(&genl_rth, &req.n, NULL, 0) < 0)
>> @@ -282,6 +287,14 @@ static int get_response(struct nlmsghdr *n, void *arg)
>>                p->l2spec_len = rta_getattr_u8(attrs[L2TP_ATTR_L2SPEC_LEN]);
>> 
>>        p->udp_csum = !!attrs[L2TP_ATTR_UDP_CSUM];
>> +       if (attrs[L2TP_ATTR_UDP_ZERO_CSUM6_TX])
>> +         p->udp6_csum_tx = rta_getattr_u8(attrs[L2TP_ATTR_UDP_ZERO_CSUM6_TX]);
>> +       else
>> +         p->udp6_csum_tx = 1;
>> +       if (attrs[L2TP_ATTR_UDP_ZERO_CSUM6_RX])
>> +         p->udp6_csum_rx = rta_getattr_u8(attrs[L2TP_ATTR_UDP_ZERO_CSUM6_RX]);
>> +       else
>> +         p->udp6_csum_rx = 1;
>>        if (attrs[L2TP_ATTR_COOKIE])
>>                memcpy(p->cookie, RTA_DATA(attrs[L2TP_ATTR_COOKIE]),
>>                       p->cookie_len = RTA_PAYLOAD(attrs[L2TP_ATTR_COOKIE]));
>> @@ -470,6 +483,9 @@ static void usage(void)
>>        fprintf(stderr, "          tunnel_id ID peer_tunnel_id ID\n");
>>        fprintf(stderr, "          [ encap { ip | udp } ]\n");
>>        fprintf(stderr, "          [ udp_sport PORT ] [ udp_dport PORT ]\n");
>> +       fprintf(stderr, "          [ udp_csum { on | off } ]\n");
>> +       fprintf(stderr, "          [ udp6_csum_tx { on | off } ]\n");
>> +       fprintf(stderr, "          [ udp6_csum_rx { on | off } ]\n");
>>        fprintf(stderr, "Usage: ip l2tp add session [ name NAME ]\n");
>>        fprintf(stderr, "          tunnel_id ID\n");
>>        fprintf(stderr, "          session_id ID peer_session_id ID\n");
>> @@ -500,6 +516,8 @@ static int parse_args(int argc, char **argv, int cmd, struct l2tp_parm *p)
>>        /* Defaults */
>>        p->l2spec_type = L2TP_L2SPECTYPE_DEFAULT;
>>        p->l2spec_len = 4;
>> +       p->udp6_csum_rx = 1;
>> +       p->udp6_csum_tx = 1;
>> 
>>        while (argc > 0) {
>>                if (strcmp(*argv, "encap") == 0) {
>> @@ -569,6 +587,33 @@ static int parse_args(int argc, char **argv, int cmd, struct l2tp_parm *p)
>>                        if (get_u16(&uval, *argv, 0))
>>                                invarg("invalid port\n", *argv);
>>                        p->peer_udp_port = uval;
>> +               } else if (strcmp(*argv, "udp_csum") == 0) {
>> +                       NEXT_ARG();
>> +                       if (strcmp(*argv, "on") == 0) {
>> +                               p->udp_csum = 1;
>> +                       } else if (strcmp(*argv, "off") == 0) {
>> +                               p->udp_csum = 0;
>> +                       } else {
>> +                               invarg("invalid option for udp_csum\n", *argv);
>> +                       }
>> +               } else if (strcmp(*argv, "udp6_csum_rx") == 0) {
>> +                       NEXT_ARG();
>> +                       if (strcmp(*argv, "on") == 0) {
>> +                               p->udp6_csum_rx = 1;
>> +                       } else if (strcmp(*argv, "off") == 0) {
>> +                               p->udp6_csum_rx = 0;
>> +                       } else {
>> +                               invarg("invalid option for udp6_csum_rx\n", *argv);
>> +                       }
>> +               } else if (strcmp(*argv, "udp6_csum_tx") == 0) {
>> +                       NEXT_ARG();
>> +                       if (strcmp(*argv, "on") == 0) {
>> +                               p->udp6_csum_tx = 1;
>> +                       } else if (strcmp(*argv, "off") == 0) {
>> +                               p->udp6_csum_tx = 0;
>> +                       } else {
>> +                               invarg("invalid option for udp6_csum_tx\n", *argv);
>> +                       }
>>                } else if (strcmp(*argv, "offset") == 0) {
>>                        __u8 uval;
>> 
>> --
>> 2.5.2
>> 

^ permalink raw reply

* Re: [PATCH v2 net-next 11/13] Documentation: Bindings: Update DT binding for separating dsaf dev support
From: Rob Herring @ 2016-04-27 15:25 UTC (permalink / raw)
  To: Yisen Zhuang
  Cc: David Miller, devicetree@vger.kernel.org, netdev,
	linux-arm-kernel@lists.infradead.org, Pawel Moll, Mark Rutland,
	Ian Campbell, Kumar Gala, Will Deacon, Catalin Marinas, yankejian,
	huangdaode, salil.mehta, lipeng321, liguozhu, xieqianqian, Wei Xu,
	Linuxarm
In-Reply-To: <57203309.1090501@huawei.com>

On Tue, Apr 26, 2016 at 10:33 PM, Yisen Zhuang <Yisen.zhuang@huawei.com> wrote:
> Hi Rob and David,
>
> Please see my comments inline.
>
> David have merged this series to net-next, but we need to modify some codes according
> to Rob's comments. I am not sure if i need to send V3 for this series, or separate
> patches of documentation to independent series and generate a new patch for hns base
> on current net-next?

That's David's call. I'm guessing he wants follow-up patches on top of these.

> 在 2016/4/26 20:48, Rob Herring 写道:
>> On Sat, Apr 23, 2016 at 05:05:15PM +0800, Yisen Zhuang wrote:
>>> Because debug dsaf port was separated from service dsaf port, this patch
>>> updates the related information of DT binding.
>>
>> Separated when? New version of the h/w? If so, where's the new
>> compatible string? This is quite a big binding change.
>
> There isn't any change of h/w. I separated debug dsaf port from sevice dsaf
> port to make the code more simple and readability.

Okay.

[...]

>>> +  serdes-syscon rather than this address.
>>>    The third region is the PPE register base and size.
>>> -  The fourth region is dsa fabric base register and size.
>>> -  The fifth region is cpld base register and size, it is not required if do not use cpld.
>>> -- phy-handle: phy handle of physicl port, 0 if not any phy device. see ethernet.txt [1].
>>> +  The fourth region is dsa fabric base register and size. It is not required for
>>> +  single-port mode.
>>> +- reg-names: may be ppe-base and(or) dsaf-base. It is used to find the
>>> +  corresponding reg's index.
>>
>> But you have up to 5 regions.
>>
>> The variable nature of what regions you have tells me you need more
>> specific compatible strings for each chip.
>
> we didn't add support of new h/w. We added these regions to make code simple and readability.
> If we need to add support of next h/w version next time, we don't need to add many branches
> for these attributes. So we didn't add a new compatible here.

Not sure what you mean by branches. It's fine to put properties for
things that vary among h/w versions, but new compatible strings will
be needed for any new versions.


>>> +- port: subnodes of dsaf. A dsaf node may contain several port nodes(Depending
>>> +  on mode of dsaf). Port node contain some attributes listed below:
>>> +- port-id: is physical port index in one dsaf.
>>
>> Indexes should generally be avoided. What does the number correspond
>> to in h/w (if anything)?
>
> port-id is index for a port in dsaf, it is correspond to index of PHY showed below.

Okay, you should use reg property here instead.

>
>                                                          CPU
>                                                           |
>                                     -----------------------------------
>                                     |                     |           |
>     ----------------------------------------------    ---------   ---------
>     |                         |                  |    |       |   |   |   |
>     |                        PPE                 |    |  PPE  |   |  PPE  |
>     |                         |                  |    |   |   |   |   |   |
>     |                         |                  |    |   |   |   |   |   |
>     |                      crossbar              |    |   |   |   |   |   |
>     |                         |                  |    |   |   |   |   |   |
>     |   ----------------------------------       |    |   |   |   |   |   |
>     |   |     |     |      |      |      |       |    |   |   |   |   |   |
>     |   |     |     |      |      |      |       |    |   |   |   |   |   |
>     |  MAC   MAC   MAC    MAC    MAC    MAC      |    |  MAC  |   |  MAC  |
>     |   |     |     |      |      |      |       |    |   |   |   |   |   |
>     ----------------------------------------------    ---------   ---------
>         |     |     |      |      |      |    \        /  |        /  |
>        PHY   PHY   PHY    PHY    PHY    PHY    \      /  PHY      /  PHY
>                                                 \    /           /
>                                                  \  /           /
>                                              DSAF(three platform device)
>
>>
>>> +- phy-handle: phy handle of physicl port. It is not required if there isn't

Another typo here.

Rob

^ permalink raw reply

* Re: [RFC PATCH 4/5] bnxt: Add support for segmentation of tunnels with outer checksums
From: Alexander Duyck @ 2016-04-27 15:21 UTC (permalink / raw)
  To: Michael Chan
  Cc: Alexander Duyck, eugenia, Bruce W Allan, Saeed Mahameed, Netdev,
	intel-wired-lan, Ariel Elior, Michael Chan
In-Reply-To: <CACKFLi=2mHfOS1Rz5RbSdyhepdjBqM+dj7Nh=CzARcYSwUTcSw@mail.gmail.com>

On Tue, Apr 26, 2016 at 10:55 PM, Michael Chan
<michael.chan@broadcom.com> wrote:
> On Tue, Apr 19, 2016 at 12:06 PM, Alexander Duyck <aduyck@mirantis.com> wrote:
>> This patch assumes that the bnxt hardware will ignore existing IPv4/v6
>> header fields for length and checksum as well as the length and checksum
>> fields for outer UDP and GRE headers.
>>
>> I have no means of testing this as I do not have any bnx2x hardware but
>> thought I would submit it as an RFC to see if anyone out there wants to
>> test this and see if this does in fact enable this functionality allowing
>> us to to segment tunneled frames that have an outer checksum.
>>
>> Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
>
> Hi Alex, I just did a very quick test of this patch on our bnxt
> hardware and it seemed to work.
>
> I created a vxlan endpoint with udpcsum enabled and I saw TSO packets
> getting through.  I've verified that our hardware can be programmed to
> either ignore outer UDP checksum or to calculate it.  Current default
> is to ignore ipv4 UDP checksum and calculate ipv6 UDP checksum.
> Thanks.

Are you saying you can natively support UDP tunnel with outer checksum
offload then?

I'm just trying to sort out if you actually need to have the partial
segmentation offload support or if we can handle it in hardware.  Also
is there any documentation you could point me to that might help to
clarify what the hardware does/doesn't support so that I could improve
upon this patch in order to make sure we are getting the most bang for
the buck in terms of the features that can be offloaded by hardware?

Thanks.

- Alex

^ permalink raw reply

* Re: [PATCH net-next] drivers/net: add 6WIND SHULTI support
From: Jiri Pirko @ 2016-04-27 15:14 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Nicolas Dichtel, davem, netdev
In-Reply-To: <20160427095615.GA27071@breakpoint.cc>

Wed, Apr 27, 2016 at 11:56:15AM CEST, fw@strlen.de wrote:
>Nicolas Dichtel <nicolas.dichtel@6wind.com> wrote:
>> This patch adds the support of the 6WIND SHULTI switch. It is a software
>> switch doing L2 forwarding.
>> 
>> This first version implements the minimum needed to get the device working.
>> It also implements, via switchdev and rtnetlink, bridge forwarding offload,
>> including FDB static entries, FDB learning and FDB ageing.
>
>How is this different from net/bridge?
>How is this different from openvswitch?

The difference is that it this tries to allow userspace crap to mirror
setting user does for bridge/ovs. Basically this looks to me like an
attempt to enable userspace SDKs and such.

^ permalink raw reply

* [PATCH] MAINTAINERS: net: Change maintainer for GRETH 10/100/1G Ethernet MAC device driver
From: Andreas Larsson @ 2016-04-27 14:46 UTC (permalink / raw)
  To: David S. Miller, netdev; +Cc: linux-kernel, Sam Ravnborg, software

Signed-off-by: Andreas Larsson <andreas@gaisler.com>
---
 MAINTAINERS |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 1d5b4be..08ffebd 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4903,7 +4903,7 @@ F:	net/ipv4/gre_offload.c
 F:	include/net/gre.h
 
 GRETH 10/100/1G Ethernet MAC device driver
-M:	Kristoffer Glembo <kristoffer@gaisler.com>
+M:	Andreas Larsson <andreas@gaisler.com>
 L:	netdev@vger.kernel.org
 S:	Maintained
 F:	drivers/net/ethernet/aeroflex/
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH v2 2/2] net: nps_enet: bug fix - handle lost tx interrupts
From: Elad Kanfi @ 2016-04-27 13:18 UTC (permalink / raw)
  To: davem; +Cc: noamca, linux-kernel, abrodkin, eladkan, talz, netdev
In-Reply-To: <1461763110-15263-1-git-send-email-eladkan@mellanox.com>

From: Elad Kanfi <eladkan@mellanox.com>

The tx interrupt is of edge type, and in case such interrupt is triggered
while it is masked it will not be handled even after tx interrupts are
re-enabled in the end of NAPI poll.
This will cause tx network to stop in the following scenario:
 * Rx is being handled, hence interrupts are masked.
 * Tx interrupt is triggered after checking if there is some tx to handle
   and before re-enabling the interrupts.
In this situation only rx transaction will release tx requests.

In order to handle the tx that was missed( if there was one ),
a NAPI reschdule was added after enabling the interrupts.

Signed-off-by: Elad Kanfi <eladkan@mellanox.com>

Acked-by: Noam Camus <noamca@mellanox.com>
---
 drivers/net/ethernet/ezchip/nps_enet.c |   14 ++++++++++++++
 1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/ezchip/nps_enet.c b/drivers/net/ethernet/ezchip/nps_enet.c
index 1373236..31e27a7 100644
--- a/drivers/net/ethernet/ezchip/nps_enet.c
+++ b/drivers/net/ethernet/ezchip/nps_enet.c
@@ -183,6 +183,8 @@ static int nps_enet_poll(struct napi_struct *napi, int budget)
 	work_done = nps_enet_rx_handler(ndev);
 	if (work_done < budget) {
 		u32 buf_int_enable_value = 0;
+		u32 tx_ctrl_value = nps_enet_reg_get(priv, NPS_ENET_REG_TX_CTL);
+		u32 tx_ctrl_ct = (tx_ctrl_value & TX_CTL_CT_MASK) >> TX_CTL_CT_SHIFT;
 
 		napi_complete(napi);
 
@@ -192,6 +194,18 @@ static int nps_enet_poll(struct napi_struct *napi, int budget)
 
 		nps_enet_reg_set(priv, NPS_ENET_REG_BUF_INT_ENABLE,
 				 buf_int_enable_value);
+
+		/* in case we will get a tx interrupt while interrupts
+		 * are masked, we will lose it since the tx is edge interrupt.
+		 * specifically, while executing the code section above,
+		 * between nps_enet_tx_handler and the interrupts enable, all
+		 * tx requests will be stuck until we will get an rx interrupt.
+		 * the two code lines below will solve this situation by
+		 * re-adding ourselves to the poll list.
+		 */
+
+		if (priv->tx_packet_sent && !tx_ctrl_ct)
+			napi_reschedule(napi);
 	}
 
 	return work_done;
-- 
1.7.1

^ permalink raw reply related

* [PATCH iproute2 2/2] ip link gre: print only relevant info in external mode
From: Jiri Benc @ 2016-04-27 14:11 UTC (permalink / raw)
  To: netdev; +Cc: Stephen Hemminger, Paolo Abeni, Pravin Shelar
In-Reply-To: <cover.1461766016.git.jbenc@redhat.com>

Display only attributes that are relevant when a GRE interface is in
'external' mode instead of the default values (which are ignored by the
kernel even if passed back).

Fixes: 926b39e1feffd ("gre: add support for collect metadata flag")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
---
 ip/link_gre.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/ip/link_gre.c b/ip/link_gre.c
index 36ce1252675b..492c22053b89 100644
--- a/ip/link_gre.c
+++ b/ip/link_gre.c
@@ -339,7 +339,7 @@ get_failed:
 	return 0;
 }
 
-static void gre_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
+static void gre_print_direct_opt(FILE *f, struct rtattr *tb[])
 {
 	char s2[64];
 	const char *local = "any";
@@ -347,9 +347,6 @@ static void gre_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
 	unsigned int iflags = 0;
 	unsigned int oflags = 0;
 
-	if (!tb)
-		return;
-
 	if (tb[IFLA_GRE_REMOTE]) {
 		unsigned int addr = rta_getattr_u32(tb[IFLA_GRE_REMOTE]);
 
@@ -421,8 +418,16 @@ static void gre_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
 		fputs("icsum ", f);
 	if (oflags & GRE_CSUM)
 		fputs("ocsum ", f);
+}
 
-	if (tb[IFLA_GRE_COLLECT_METADATA])
+static void gre_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
+{
+	if (!tb)
+		return;
+
+	if (!tb[IFLA_GRE_COLLECT_METADATA])
+		gre_print_direct_opt(f, tb);
+	else
 		fputs("external ", f);
 
 	if (tb[IFLA_GRE_ENCAP_TYPE] &&
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH iproute2 1/2] ip link gre: create interfaces in external mode correctly
From: Jiri Benc @ 2016-04-27 14:11 UTC (permalink / raw)
  To: netdev; +Cc: Stephen Hemminger, Paolo Abeni, Pravin Shelar
In-Reply-To: <cover.1461766016.git.jbenc@redhat.com>

For GRE interfaces in 'external' mode, the kernel ignores all manual
settings like remote IP address or TTL. However, for some of those
attributes, kernel checks their value and does not allow them to be zero
(even though they're ignored later).

Currently, 'ip link' always includes all attributes in the netlink message.
This leads to problem with creating interfaces in 'external' mode. For
example, this command does not work:

ip link add gre1 type gretap external

and needs a bogus remote IP address to be specified, as the kernel enforces
remote IP address to be either not present, or not null.

Ignore the parameters that do not make sense in 'external' mode.
Unfortunately, we cannot error out, as there may be existing deployments
that workarounded the bug by specifying bogus values.

Fixes: 926b39e1feffd ("gre: add support for collect metadata flag")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
---
 ip/link_gre.c | 28 +++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/ip/link_gre.c b/ip/link_gre.c
index bcf003aaa5d7..36ce1252675b 100644
--- a/ip/link_gre.c
+++ b/ip/link_gre.c
@@ -315,24 +315,26 @@ get_failed:
 		return -1;
 	}
 
-	addattr32(n, 1024, IFLA_GRE_IKEY, ikey);
-	addattr32(n, 1024, IFLA_GRE_OKEY, okey);
-	addattr_l(n, 1024, IFLA_GRE_IFLAGS, &iflags, 2);
-	addattr_l(n, 1024, IFLA_GRE_OFLAGS, &oflags, 2);
-	addattr_l(n, 1024, IFLA_GRE_LOCAL, &saddr, 4);
-	addattr_l(n, 1024, IFLA_GRE_REMOTE, &daddr, 4);
-	addattr_l(n, 1024, IFLA_GRE_PMTUDISC, &pmtudisc, 1);
-	if (link)
-		addattr32(n, 1024, IFLA_GRE_LINK, link);
-	addattr_l(n, 1024, IFLA_GRE_TTL, &ttl, 1);
-	addattr_l(n, 1024, IFLA_GRE_TOS, &tos, 1);
+	if (!metadata) {
+		addattr32(n, 1024, IFLA_GRE_IKEY, ikey);
+		addattr32(n, 1024, IFLA_GRE_OKEY, okey);
+		addattr_l(n, 1024, IFLA_GRE_IFLAGS, &iflags, 2);
+		addattr_l(n, 1024, IFLA_GRE_OFLAGS, &oflags, 2);
+		addattr_l(n, 1024, IFLA_GRE_LOCAL, &saddr, 4);
+		addattr_l(n, 1024, IFLA_GRE_REMOTE, &daddr, 4);
+		addattr_l(n, 1024, IFLA_GRE_PMTUDISC, &pmtudisc, 1);
+		if (link)
+			addattr32(n, 1024, IFLA_GRE_LINK, link);
+		addattr_l(n, 1024, IFLA_GRE_TTL, &ttl, 1);
+		addattr_l(n, 1024, IFLA_GRE_TOS, &tos, 1);
+	} else {
+		addattr_l(n, 1024, IFLA_GRE_COLLECT_METADATA, NULL, 0);
+	}
 
 	addattr16(n, 1024, IFLA_GRE_ENCAP_TYPE, encaptype);
 	addattr16(n, 1024, IFLA_GRE_ENCAP_FLAGS, encapflags);
 	addattr16(n, 1024, IFLA_GRE_ENCAP_SPORT, htons(encapsport));
 	addattr16(n, 1024, IFLA_GRE_ENCAP_DPORT, htons(encapdport));
-	if (metadata)
-		addattr_l(n, 1024, IFLA_GRE_COLLECT_METADATA, NULL, 0);
 
 	return 0;
 }
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH iproute2 0/2] ip link gre: fix external mode handling
From: Jiri Benc @ 2016-04-27 14:11 UTC (permalink / raw)
  To: netdev; +Cc: Stephen Hemminger, Paolo Abeni, Pravin Shelar

Fix two bugs with handling of the 'external' keyword for GRE.

Jiri Benc (2):
  ip link gre: create interfaces in external mode correctly
  ip link gre: print only relevant info in external mode

 ip/link_gre.c | 43 +++++++++++++++++++++++++------------------
 1 file changed, 25 insertions(+), 18 deletions(-)

-- 
1.8.3.1

^ permalink raw reply

* Re: [PATCH v2 1/2] net: nps_enet: Sync access to packet sent flag
From: Lino Sanfilippo @ 2016-04-27 13:56 UTC (permalink / raw)
  To: Elad Kanfi, davem; +Cc: noamca, linux-kernel, abrodkin, talz, netdev
In-Reply-To: <1461763110-15263-2-git-send-email-eladkan@mellanox.com>

Hi,

On 27.04.2016 15:18, Elad Kanfi wrote:
> From: Elad Kanfi <eladkan@mellanox.com>
>
> Below is a description of a possible problematic
> sequence. CPU-A is sending a frame and CPU-B handles
> the interrupt that indicates the frame was sent. CPU-B
> reads an invalid value of tx_packet_sent.
>
> 	CPU-A				CPU-B
> 	-----				-----
> 	nps_enet_send_frame
> 	.
> 	.
> 	tx_packet_sent = true
> 	order HW to start tx
> 	.
> 	.
> 	HW complete tx
> 			    ------> 	get tx complete interrupt
> 					.
> 					.
> 					if(tx_packet_sent == true)
>
> 	end memory transaction
> 	(tx_packet_sent actually
> 	 written)
>
> Problem solution:
>
> Add a memory barrier after setting tx_packet_sent,
> in order to make sure that it is written before
> the packet is sent.

Should not those SMP memory barriers be paired? AFAIK you do not only have to make sure
that the value written by CPU-A actually is written to memory but also that CPU-B
reads that value from memory. At least this is what I have understood from memory-barriers.txt...

Regards,
Lino

^ permalink raw reply

* RE: [net-next PATCH V2 2/5] samples/bpf: Makefile verify LLVM compiler avail and bpf target is supported
From: David Laight @ 2016-04-27 13:52 UTC (permalink / raw)
  To: 'Jesper Dangaard Brouer', netdev@vger.kernel.org
  Cc: linux-kbuild@vger.kernel.org, bblanco@plumgrid.com,
	naveen.n.rao@linux.vnet.ibm.com, borkmann@iogearbox.net,
	alexei.starovoitov@gmail.com
In-Reply-To: <20160426162716.22962.36473.stgit@firesoul>

From: Jesper Dangaard Brouer
> Sent: 26 April 2016 17:27
> Make compiling samples/bpf more user friendly, by detecting if LLVM
> compiler tool 'llc' is available, and also detect if the 'bpf' target
> is available in this version of LLVM.
...
> diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> index 5bae9536f100..45859c99f573 100644
> --- a/samples/bpf/Makefile
> +++ b/samples/bpf/Makefile
> @@ -85,6 +85,24 @@ HOSTLOADLIBES_test_overhead += -lelf -lrt
>  #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc
>  LLC ?= llc
> 
> +# Verify LLVM compiler is available and bpf target is supported
> +.PHONY: verify_cmd_llc verify_target_bpf
> +
> +verify_cmd_llc:
> +	@if ! (which "${LLC}" > /dev/null 2>&1); then \

You should use 'type' not 'which'.
'type' is a posix shell builtin, 'which' is a script/program
that tries to emulate a 'csh' builtin.
You want to know whether the shell that make runs can execute ${LLC}
not whether a csh would be able to run it.

You might also want to worry about:
    LLC="/path_to_llc/llc -extra_arg" make fubar

	David


^ permalink raw reply

* [PATCH v2 0/2] Net driver bugs fix
From: Elad Kanfi @ 2016-04-27 13:18 UTC (permalink / raw)
  To: davem; +Cc: noamca, linux-kernel, abrodkin, eladkan, talz, netdev

From: Elad Kanfi <eladkan@mellanox.com>

v2:
Remove code style commit for now.
Code style commit will be added after the bugs fix will be approved.

Summary:
 1. Bug description: TX done interrupts that arrives while interrupts
    are masked, during NAPI poll, will not trigger an interrupt handling.
    Since TX interrupt is of level edge we will lose the TX done interrupt.
    As a result all pending tx frames will get no service.

    Solution: Check if there is a pending tx request after unmasking the
    interrupt and if answer is yes then re-add ourselves to
    the NAPI poll list.

 2. Bug description: CPU-A before sending a frame will set a variable
    to true. CPU-B that executes the tx done interrupt service routine
    might read a non valid value of that variable.

    Solution: Add a memory barrier at the tx sending function.

Elad Kanfi (2):
  net: nps_enet: Sync access to packet sent flag
  net: nps_enet: bug fix - handle lost tx interrupts

 drivers/net/ethernet/ezchip/nps_enet.c |   21 +++++++++++++++++++++
 1 files changed, 21 insertions(+), 0 deletions(-)

^ permalink raw reply

* Re: [PATCH] net: phy: at803x: Register 'link_change_notify' only for AT8030
From: Sebastian Frias @ 2016-04-27 13:50 UTC (permalink / raw)
  To: Daniel Mack, David S. Miller, netdev
  Cc: lkml, mason, Sergei Shtylyov, timur, Florian Fainelli
In-Reply-To: <5720A3C1.2040401@laposte.net>

Hi,

Sergei pointed out that the same patch was submitted yesterday by Timur (what are the chances?! :-) ) https://patchwork.ozlabs.org/patch/615019/

Regards,

Sebastian

On 04/27/2016 01:34 PM, Sebastian Frias wrote:
> There is no need to register the callback introduced by
> commit 13a56b449325 ("net: phy: at803x: Add support for hardware reset")
> for non faulty PHYs.
> 
> The check on the PHY ID is not necessary anymore and thus has been
> removed from the callback implementation as well.
> 
> Fixes: 13a56b449325 ("net: phy: at803x: Add support for hardware reset")
> 
> Signed-off-by: Sebastian Frias <sf84@laposte.net>
> ---
>  drivers/net/phy/at803x.c | 43 ++++++++++++++++++++-----------------------
>  1 file changed, 20 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/net/phy/at803x.c b/drivers/net/phy/at803x.c
> index b3ffaee..7fdc676 100644
> --- a/drivers/net/phy/at803x.c
> +++ b/drivers/net/phy/at803x.c
> @@ -353,33 +353,32 @@ static void at803x_link_change_notify(struct phy_device *phydev)
>  	struct at803x_priv *priv = phydev->priv;
>  
>  	/*
> -	 * Conduct a hardware reset for AT8030 every time a link loss is
> +	 * Conduct a hardware reset for AT8030 (this callback is only
> +	 * registered for AT8030 at the moment) every time a link loss is
>  	 * signalled. This is necessary to circumvent a hardware bug that
>  	 * occurs when the cable is unplugged while TX packets are pending
>  	 * in the FIFO. In such cases, the FIFO enters an error mode it
>  	 * cannot recover from by software.
>  	 */
> -	if (phydev->drv->phy_id == ATH8030_PHY_ID) {
> -		if (phydev->state == PHY_NOLINK) {
> -			if (priv->gpiod_reset && !priv->phy_reset) {
> -				struct at803x_context context;
> -
> -				at803x_context_save(phydev, &context);
> -
> -				gpiod_set_value(priv->gpiod_reset, 1);
> -				msleep(1);
> -				gpiod_set_value(priv->gpiod_reset, 0);
> -				msleep(1);
> -
> -				at803x_context_restore(phydev, &context);
> -
> -				phydev_dbg(phydev, "%s(): phy was reset\n",
> -					   __func__);
> -				priv->phy_reset = true;
> -			}
> -		} else {
> -			priv->phy_reset = false;
> +	if (phydev->state == PHY_NOLINK) {
> +		if (priv->gpiod_reset && !priv->phy_reset) {
> +			struct at803x_context context;
> +
> +			at803x_context_save(phydev, &context);
> +
> +			gpiod_set_value(priv->gpiod_reset, 1);
> +			msleep(1);
> +			gpiod_set_value(priv->gpiod_reset, 0);
> +			msleep(1);
> +
> +			at803x_context_restore(phydev, &context);
> +
> +			phydev_dbg(phydev, "%s(): phy was reset\n",
> +				   __func__);
> +			priv->phy_reset = true;
>  		}
> +	} else {
> +		priv->phy_reset = false;
>  	}
>  }
>  
> @@ -391,7 +390,6 @@ static struct phy_driver at803x_driver[] = {
>  	.phy_id_mask		= 0xffffffef,
>  	.probe			= at803x_probe,
>  	.config_init		= at803x_config_init,
> -	.link_change_notify	= at803x_link_change_notify,
>  	.set_wol		= at803x_set_wol,
>  	.get_wol		= at803x_get_wol,
>  	.suspend		= at803x_suspend,
> @@ -427,7 +425,6 @@ static struct phy_driver at803x_driver[] = {
>  	.phy_id_mask		= 0xffffffef,
>  	.probe			= at803x_probe,
>  	.config_init		= at803x_config_init,
> -	.link_change_notify	= at803x_link_change_notify,
>  	.set_wol		= at803x_set_wol,
>  	.get_wol		= at803x_get_wol,
>  	.suspend		= at803x_suspend,
> 

^ permalink raw reply

* [PATCH v2 1/2] net: nps_enet: Sync access to packet sent flag
From: Elad Kanfi @ 2016-04-27 13:18 UTC (permalink / raw)
  To: davem; +Cc: noamca, linux-kernel, abrodkin, eladkan, talz, netdev
In-Reply-To: <1461763110-15263-1-git-send-email-eladkan@mellanox.com>

From: Elad Kanfi <eladkan@mellanox.com>

Below is a description of a possible problematic
sequence. CPU-A is sending a frame and CPU-B handles
the interrupt that indicates the frame was sent. CPU-B
reads an invalid value of tx_packet_sent.

	CPU-A				CPU-B
	-----				-----
	nps_enet_send_frame
	.
	.
	tx_packet_sent = true
	order HW to start tx
	.
	.
	HW complete tx
			    ------> 	get tx complete interrupt
					.
					.
					if(tx_packet_sent == true)

	end memory transaction
	(tx_packet_sent actually
	 written)

Problem solution:

Add a memory barrier after setting tx_packet_sent,
in order to make sure that it is written before
the packet is sent.

Signed-off-by: Elad Kanfi <eladkan@mellanox.com>

Acked-by: Noam Camus <noamca@mellanox.com>
---
 drivers/net/ethernet/ezchip/nps_enet.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/ezchip/nps_enet.c b/drivers/net/ethernet/ezchip/nps_enet.c
index 1f23845..1373236 100644
--- a/drivers/net/ethernet/ezchip/nps_enet.c
+++ b/drivers/net/ethernet/ezchip/nps_enet.c
@@ -389,6 +389,13 @@ static void nps_enet_send_frame(struct net_device *ndev,
 
 	/* Indicate SW is done */
 	priv->tx_packet_sent = true;
+
+	/* before the frame is sent we have to make
+	 * sure that priv->tx_packet_sent will be valid
+	 * for the CPU'S that handles the ISR and NAPI poll
+	 */
+	smp_wmb();
+
 	tx_ctrl_value |= NPS_ENET_ENABLE << TX_CTL_CT_SHIFT;
 	/* Send Frame */
 	nps_enet_reg_set(priv, NPS_ENET_REG_TX_CTL, tx_ctrl_value);
-- 
1.7.1

^ permalink raw reply related

* RE: [net-next PATCH 4/4] samples/bpf: allow make to be run from samples/bpf/ directory
From: David Laight @ 2016-04-27 13:15 UTC (permalink / raw)
  To: 'Alexei Starovoitov', Jesper Dangaard Brouer
  Cc: netdev@vger.kernel.org, bblanco@plumgrid.com,
	borkmann@iogearbox.net, linux-kbuild@vger.kernel.org
In-Reply-To: <20160426143524.GB39797@ast-mbp.thefacebook.com>

From: Alexei Starovoitov
> Sent: 26 April 2016 15:35
> On Tue, Apr 26, 2016 at 01:09:32PM +0200, Jesper Dangaard Brouer wrote:
> > It is not intuitive that 'make' must be run from the top level
> > directory with argument "samples/bpf/" to compile these eBPF samples.
> >
> > Introduce a kbuild make file trick that allow make to be run from the
> > "samples/bpf/" directory itself.  It basically change to the top level
> > directory and call "make samples/bpf/" with the "/" slash after the
> > directory name.
> >
> > Also add a clean target that only cleans this directory, by taking
> > advantage of the kbuild external module setting M=$PWD.
> >
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
> >  samples/bpf/Makefile   |    8 ++++++++
> >  samples/bpf/README.rst |    3 +++
> >  2 files changed, 11 insertions(+)
> >
> > diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> > index 7ac66f5bbbf5..7370dfee482f 100644
> > --- a/samples/bpf/Makefile
> > +++ b/samples/bpf/Makefile
> > @@ -85,6 +85,14 @@ HOSTLOADLIBES_test_overhead += -lelf -lrt
> >  #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc
> >  LLC ?= llc
> >
> > +# Trick to allow make to be run from this directory
> > +all:
> > +	$(MAKE) -C ../../ $$PWD/
> > +
> > +clean:
> > +	$(MAKE) -C ../../ M=$$PWD clean
> > +	@rm -f *~
> > +
> 
> nice trick!

Another nice 'trick' is to write a GNUmakefie containing the local rules.
GNU make will read it in preference to Makefile and makefile.

	David

^ permalink raw reply

* Re: linux-next: manual merge of the net-next tree with the net tree
From: Saeed Mahameed @ 2016-04-27 13:13 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: David Miller, Linux Netdev List, linux-next, linux-kernel,
	Saeed Mahameed, Gal Pressman
In-Reply-To: <20160427120109.37449920@canb.auug.org.au>

On Wed, Apr 27, 2016 at 5:01 AM, Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> Hi all,
>
> Today's linux-next merge of the net-next tree got a conflict in:
>
>   drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>
> between commit:
>
>   d8edd2469ace ("et/mlx5e: Fix minimum MTU")
>
> from the net tree and commit:
>
>   0e405443e803 ("net/mlx5e: Improve set features ndo resiliency")
>
> from the net-next tree.
>
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.
>
> --
> Cheers,
> Stephen Rothwell
>
> diff --cc drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index 67d548b70e14,5bad17d37d7b..000000000000
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@@ -2025,9 -2214,49 +2240,52 @@@ static int set_feature_rx_vlan(struct n
>         return err;
>   }
>
> + static int mlx5e_handle_feature(struct net_device *netdev,
> +                               netdev_features_t wanted_features,
> +                               netdev_features_t feature,
> +                               mlx5e_feature_handler feature_handler)
> + {
> +       netdev_features_t changes = wanted_features ^ netdev->features;
> +       bool enable = !!(wanted_features & feature);
> +       int err;
> +
> +       if (!(changes & feature))
> +               return 0;
> +
> +       err = feature_handler(netdev, enable);
> +       if (err) {
> +               netdev_err(netdev, "%s feature 0x%llx failed err %d\n",
> +                          enable ? "Enable" : "Disable", feature, err);
> +               return err;
> +       }
> +
> +       MLX5E_SET_FEATURE(netdev, feature, enable);
> +       return 0;
> + }
> +
> + static int mlx5e_set_features(struct net_device *netdev,
> +                             netdev_features_t features)
> + {
> +       int err;
> +
> +       err  = mlx5e_handle_feature(netdev, features, NETIF_F_LRO,
> +                                   set_feature_lro);
> +       err |= mlx5e_handle_feature(netdev, features,
> +                                   NETIF_F_HW_VLAN_CTAG_FILTER,
> +                                   set_feature_vlan_filter);
> +       err |= mlx5e_handle_feature(netdev, features, NETIF_F_HW_TC,
> +                                   set_feature_tc_num_filters);
> +       err |= mlx5e_handle_feature(netdev, features, NETIF_F_RXALL,
> +                                   set_feature_rx_all);
> +       err |= mlx5e_handle_feature(netdev, features, NETIF_F_HW_VLAN_CTAG_RX,
> +                                   set_feature_rx_vlan);
> +
> +       return err ? -EINVAL : 0;
> + }
> +
>  +#define MXL5_HW_MIN_MTU 64
>  +#define MXL5E_MIN_MTU (MXL5_HW_MIN_MTU + ETH_FCS_LEN)
>  +
>   static int mlx5e_change_mtu(struct net_device *netdev, int new_mtu)
>   {
>         struct mlx5e_priv *priv = netdev_priv(netdev);
> @@@ -2633,18 -2966,10 +2997,19 @@@ static void mlx5e_destroy_netdev(struc
>         schedule_work(&priv->set_rx_mode_work);
>         mlx5e_disable_async_events(priv);
>         flush_scheduled_work();
>  -      unregister_netdev(netdev);
>  +      if (test_bit(MLX5_INTERFACE_STATE_SHUTDOWN, &mdev->intf_state)) {
>  +              netif_device_detach(netdev);
>  +              mutex_lock(&priv->state_lock);
>  +              if (test_bit(MLX5E_STATE_OPENED, &priv->state))
>  +                      mlx5e_close_locked(netdev);
>  +              mutex_unlock(&priv->state_lock);
>  +      } else {
>  +              unregister_netdev(netdev);
>  +      }
>  +
>         mlx5e_tc_cleanup(priv);
>         mlx5e_vxlan_cleanup(priv);
> +       mlx5e_destroy_q_counter(priv);
>         mlx5e_destroy_flow_tables(priv);
>         mlx5e_destroy_tirs(priv);
>         mlx5e_destroy_rqt(priv, MLX5E_SINGLE_RQ_RQT);

Looks good, and it is pretty straightforward.
Dave will have to take all hunks from both patches, same as you did.

Thank you Stephen.

Saeed.

^ permalink raw reply

* Re: [PATCH] wcn36xx: Set SMD timeout to 10 seconds
From: Kalle Valo @ 2016-04-27 12:55 UTC (permalink / raw)
  To: Bjorn Andersson
  Cc: Eugene Krasnikov, linux-wireless, netdev, linux-kernel,
	John Stultz
In-Reply-To: <1461272982-7233-1-git-send-email-bjorn.andersson@linaro.org>

Bjorn Andersson <bjorn.andersson@linaro.org> writes:

> After booting the wireless subsystem and uploading the NV blob to the
> WCNSS_CTRL service the remote continues to do things and will not start
> servicing wlan-requests for another 2-5 seconds (measured).
>
> The downstream code does not have any special handling for this case,
> but has a timeout of 10 seconds for the communication layer. By
> extending the wcn36xx timeout to match this we follows the same flow for
> the boot procedure and can successfully configure WiFi as wlan0 is
> registered.
>
> Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>

Applied, thanks.

-- 
Kalle Valo

^ permalink raw reply

* [PATCH nf-next] netfilter: allow logging from non-init namespaces
From: Michal Kubecek @ 2016-04-27 12:48 UTC (permalink / raw)
  To: netfilter-devel
  Cc: Pablo Neira Ayuso, Patrick McHardy, Jozsef Kadlecsik,
	Jonathan Corbet, coreteam, netdev, linux-doc, linux-kernel,
	bridge

Commit 69b34fb996b2 ("netfilter: xt_LOG: add net namespace support for
xt_LOG") disabled logging packets using the LOG target from non-init
namespaces. The motivation was to prevent containers from flooding
kernel log of the host. The plan was to keep it that way until syslog
namespace implementation allows containers to log in a safe way.

However, the work on syslog namespace seems to have hit a dead end
somewhere in 2013 and there are users who want to use xt_LOG in all
network namespaces. This patch allows to do so by setting

  /proc/sys/net/netfilter/nf_log_all_netns

to a nonzero value. This sysctl is only accessible from init_net so that
one cannot switch the behaviour from inside a container.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
---
 Documentation/networking/netfilter-sysctl.txt | 10 ++++++++++
 include/net/netfilter/nf_log.h                |  3 +++
 net/bridge/netfilter/ebt_log.c                |  2 +-
 net/ipv4/netfilter/nf_log_arp.c               |  2 +-
 net/ipv4/netfilter/nf_log_ipv4.c              |  2 +-
 net/ipv6/netfilter/nf_log_ipv6.c              |  2 +-
 net/netfilter/nf_log.c                        | 22 ++++++++++++++++++++++
 7 files changed, 39 insertions(+), 4 deletions(-)
 create mode 100644 Documentation/networking/netfilter-sysctl.txt

diff --git a/Documentation/networking/netfilter-sysctl.txt b/Documentation/networking/netfilter-sysctl.txt
new file mode 100644
index 000000000000..55791e50e169
--- /dev/null
+++ b/Documentation/networking/netfilter-sysctl.txt
@@ -0,0 +1,10 @@
+/proc/sys/net/netfilter/* Variables:
+
+nf_log_all_netns - BOOLEAN
+	0 - disabled (default)
+	not 0 - enabled
+
+	By default, only init_net namespace can log packets into kernel log
+	with LOG target; this aims to prevent containers from flooding host
+	kernel log. If enabled, this target also works in other network
+	namespaces. This variable is only accessible from init_net.
diff --git a/include/net/netfilter/nf_log.h b/include/net/netfilter/nf_log.h
index 57639fca223a..8c4b018eef72 100644
--- a/include/net/netfilter/nf_log.h
+++ b/include/net/netfilter/nf_log.h
@@ -49,6 +49,9 @@ struct nf_logger {
 	struct module		*me;
 };
 
+/* sysctl_nf_log_all_netns - allow LOG target in all network namespaces */
+extern int sysctl_nf_log_all_netns;
+
 /* Function to register/unregister log function. */
 int nf_log_register(u_int8_t pf, struct nf_logger *logger);
 void nf_log_unregister(struct nf_logger *logger);
diff --git a/net/bridge/netfilter/ebt_log.c b/net/bridge/netfilter/ebt_log.c
index 152300d164ac..735230ec0e49 100644
--- a/net/bridge/netfilter/ebt_log.c
+++ b/net/bridge/netfilter/ebt_log.c
@@ -78,7 +78,7 @@ ebt_log_packet(struct net *net, u_int8_t pf, unsigned int hooknum,
 	unsigned int bitmask;
 
 	/* FIXME: Disabled from containers until syslog ns is supported */
-	if (!net_eq(net, &init_net))
+	if (!net_eq(net, &init_net) && !sysctl_nf_log_all_netns)
 		return;
 
 	spin_lock_bh(&ebt_log_lock);
diff --git a/net/ipv4/netfilter/nf_log_arp.c b/net/ipv4/netfilter/nf_log_arp.c
index e7ad950cf9ef..39e1348dfe45 100644
--- a/net/ipv4/netfilter/nf_log_arp.c
+++ b/net/ipv4/netfilter/nf_log_arp.c
@@ -87,7 +87,7 @@ static void nf_log_arp_packet(struct net *net, u_int8_t pf,
 	struct nf_log_buf *m;
 
 	/* FIXME: Disabled from containers until syslog ns is supported */
-	if (!net_eq(net, &init_net))
+	if (!net_eq(net, &init_net) && !sysctl_nf_log_all_netns)
 		return;
 
 	m = nf_log_buf_open();
diff --git a/net/ipv4/netfilter/nf_log_ipv4.c b/net/ipv4/netfilter/nf_log_ipv4.c
index 076aadda0473..2b0083112ed8 100644
--- a/net/ipv4/netfilter/nf_log_ipv4.c
+++ b/net/ipv4/netfilter/nf_log_ipv4.c
@@ -319,7 +319,7 @@ static void nf_log_ip_packet(struct net *net, u_int8_t pf,
 	struct nf_log_buf *m;
 
 	/* FIXME: Disabled from containers until syslog ns is supported */
-	if (!net_eq(net, &init_net))
+	if (!net_eq(net, &init_net) && !sysctl_nf_log_all_netns)
 		return;
 
 	m = nf_log_buf_open();
diff --git a/net/ipv6/netfilter/nf_log_ipv6.c b/net/ipv6/netfilter/nf_log_ipv6.c
index 8dd869642f45..04960486d0e2 100644
--- a/net/ipv6/netfilter/nf_log_ipv6.c
+++ b/net/ipv6/netfilter/nf_log_ipv6.c
@@ -351,7 +351,7 @@ static void nf_log_ip6_packet(struct net *net, u_int8_t pf,
 	struct nf_log_buf *m;
 
 	/* FIXME: Disabled from containers until syslog ns is supported */
-	if (!net_eq(net, &init_net))
+	if (!net_eq(net, &init_net) && !sysctl_nf_log_all_netns)
 		return;
 
 	m = nf_log_buf_open();
diff --git a/net/netfilter/nf_log.c b/net/netfilter/nf_log.c
index a5d41dfa9f05..a5f4c57b14c5 100644
--- a/net/netfilter/nf_log.c
+++ b/net/netfilter/nf_log.c
@@ -16,6 +16,9 @@
 #define NF_LOG_PREFIXLEN		128
 #define NFLOGGER_NAME_LEN		64
 
+int sysctl_nf_log_all_netns __read_mostly;
+EXPORT_SYMBOL(sysctl_nf_log_all_netns);
+
 static struct nf_logger __rcu *loggers[NFPROTO_NUMPROTO][NF_LOG_TYPE_MAX] __read_mostly;
 static DEFINE_MUTEX(nf_log_mutex);
 
@@ -392,6 +395,18 @@ static const struct file_operations nflog_file_ops = {
 #ifdef CONFIG_SYSCTL
 static char nf_log_sysctl_fnames[NFPROTO_NUMPROTO-NFPROTO_UNSPEC][3];
 static struct ctl_table nf_log_sysctl_table[NFPROTO_NUMPROTO+1];
+static struct ctl_table_header *nf_log_sysctl_fhdr;
+
+static struct ctl_table nf_log_sysctl_ftable[] = {
+	{
+		.procname	= "nf_log_all_netns",
+		.data		= &sysctl_nf_log_all_netns,
+		.maxlen		= sizeof(sysctl_nf_log_all_netns),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{ }
+};
 
 static int nf_log_proc_dostring(struct ctl_table *table, int write,
 			 void __user *buffer, size_t *lenp, loff_t *ppos)
@@ -461,6 +476,10 @@ static int netfilter_log_sysctl_init(struct net *net)
 			nf_log_sysctl_table[i].extra1 =
 				(void *)(unsigned long) i;
 		}
+		nf_log_sysctl_fhdr = register_net_sysctl(net, "net/netfilter",
+							 nf_log_sysctl_ftable);
+		if (!nf_log_sysctl_fhdr)
+			goto err_freg;
 	}
 
 	net->nf.nf_log_dir_header = register_net_sysctl(net,
@@ -474,6 +493,7 @@ static int netfilter_log_sysctl_init(struct net *net)
 err_reg:
 	if (!net_eq(net, &init_net))
 		kfree(table);
+err_freg:
 err_alloc:
 	return -ENOMEM;
 }
@@ -486,6 +506,8 @@ static void netfilter_log_sysctl_exit(struct net *net)
 	unregister_net_sysctl_table(net->nf.nf_log_dir_header);
 	if (!net_eq(net, &init_net))
 		kfree(table);
+	else
+		unregister_net_sysctl_table(nf_log_sysctl_fhdr);
 }
 #else
 static int netfilter_log_sysctl_init(struct net *net)
-- 
2.8.1


^ permalink raw reply related

* Re: [PATCH net-next 9/9] taskstats: use the libnl API to align nlattr on 64-bit
From: Balbir Singh @ 2016-04-27 12:29 UTC (permalink / raw)
  To: nicolas.dichtel, netdev
  Cc: davem, linux-kernel, linux-wpan, aar, pablo, kaber, kadlec,
	pshelar, kuznet, jmorris, yoshfuji, netfilter-devel, dev,
	steffen.klassert, herbert
In-Reply-To: <57206A47.2010906@6wind.com>



On 27/04/16 17:29, Nicolas Dichtel wrote:
> Le 27/04/2016 03:14, Balbir Singh a écrit :
>>
>>
>> On 23/04/16 01:31, Nicolas Dichtel wrote:
>>> Goal of this patch is to use the new libnl API to align netlink attribute
>>> when needed.
>>> The layout of the netlink message will be a bit different after the patch,
>>> because the padattr (TASKSTATS_TYPE_STATS) will be inside the nested
>>> attribute instead of before it.
>>>
>>> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
>>
>> The layout will change/break user space -- I've not tested the patch though..
> Sigh.
> 
> I quote David:
> "All userspace components using netlink should always ignore attributes
> they do not recognize in dumps.
> 
> This is one of the most basic principles of netlink"
> 
> Do you have some pointers so I can made some tests?
> 

Please try

https://www.kernel.org/doc/Documentation/accounting/getdelays.c

iotop uses it as well. My concern is ABI breakage of user space.

Balbir Singh
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] ip: add udp_csum, udp6_csum_tx, udp6_csum_rx control flags to ip l2tp add tunnel
From: James Chapman @ 2016-04-27 12:21 UTC (permalink / raw)
  To: Wang Shanker; +Cc: netdev

On 26 April 2016 at 15:15, Wang Shanker <shankerwangmiao@gmail.com> wrote:
> Hi, all
>
> It’s my first time to contribute to such an important open source project. Things began when I upgraded my server, called "Server A", form ubuntu 14.04 to 16.04, which is shipped with new kernel version, 4.4. After upgrade, I soon found a l2tp tunnel between this server and another linux server, called "Server B", via ipv6 broke down. Here is the network topology:
>
>   +----------+                    +----------+
>   | Server A | -- IPV6 Network -- | Server B |
>   +----------+                    +----------+
>
> The l2tp tunnel was encapsulated in udp datagrams. All the configuration was normal and could work after I reverted the kernel on Server A to original version.
>
> Here is what i did to create the tunnel:
>
> ```
> on Server A:
>
> ip l2tp add tunnel tunnel_id 86 peer_tunnel_id 86 remote 2001:db8::aaaa local 2001:db8::bbbb udp_sport 1086 udp_dport 1086
> ip l2tp add session name l2tpeth0 tunnel_id 86 session_id 86 peer_session_id 86
> ip l s l2tpeth0 up
>
> on Server B:
>
> ip l2tp add tunnel tunnel_id 86 peer_tunnel_id 86 local 2001:db8::aaaa remote 2001:db8::bbbb udp_sport 1086 udp_dport 1086
> ip l2tp add session name l2tpeth0 tunnel_id 86 session_id 86 peer_session_id 86
> ip l s l2tpeth0 up
>
> ```
>
> When I used tcpdump to diagnose the problem, I got such result:
>
> ```
> on Server A:
>
> arping -i l2tpeth0 -0 1.2.3.4
>
> on Server B:
>
> tcpdump -i eth0 -n port 1086  -v
>
> 21:35:57.818810 IP6 (flowlabel 0x8f028, hlim 64, next-header UDP (17) payload length: 62) 2001:db8::aaaa.1086 > 2001:db8::bbbb.1086: [bad udp cksum 0x0000 -> 0x1140!] UDP, length 54
> 21:35:58.820572 IP6 (flowlabel 0x8f028, hlim 64, next-header UDP (17) payload length: 62) 2001:db8::aaaa.1086 > 2001:db8::bbbb.1086: [bad udp cksum 0x0000 -> 0x1140!] UDP, length 54
> 21:35:59.822216 IP6 (flowlabel 0x8f028, hlim 64, next-header UDP (17) payload length: 62) 2001:db8::aaaa.1086 > 2001:db8::bbbb.1086: [bad udp cksum 0x0000 -> 0x1140!] UDP, length 54
>
> ```
>
> After looking into kernel source, I found out that in this commit a new feature to set udp6 checksum to zero in commit 6b649fe, which added `L2TP_ATTR_UDP_ZERO_CSUM6_TX` and `L2TP_ATTR_UDP_ZERO_CSUM6_TX`.
>
> As a result, I added `udp_csum`, `udp6_csum_tx`, `udp6_csum_rx` control flags to `ip l2tp add tunnel` to control those attributes about checksum.
>
> Using this to create the tunnel instead on Server A:
>
> ```
> ip l2tp add tunnel tunnel_id 86 peer_tunnel_id 86 remote 2001:db8::aaaa local 2001:db8::bbbb udp_sport 1086 udp_dport 1086 udp6_csum_tx on udp6_csum_rx on
> ```
>
> I finally got:
>
> ```
> on Server A:
>
> arping -i l2tpeth0 -0 1.2.3.4
>
> on Server B:
>
> tcpdump -i eth0 -n port 1086  -v
>
> 22:07:03.844297 IP6 (flowlabel 0x8f028, hlim 64, next-header UDP (17) payload length: 62) 2001:db8::aaaa.1086 > 2001:db8::bbbb.1086: [udp sum ok] UDP, length 54
> 22:07:04.845717 IP6 (flowlabel 0x8f028, hlim 64, next-header UDP (17) payload length: 62) 2001:db8::aaaa.1086 > 2001:db8::bbbb.1086: [udp sum ok] UDP, length 54
> 22:07:05.847965 IP6 (flowlabel 0x8f028, hlim 64, next-header UDP (17) payload length: 62) 2001:db8::aaaa.1086 > 2001:db8::bbbb.1086: [udp sum ok] UDP, length 54
>
> tcpdump -i l2tpeth0 -v
>
> 22:10:35.691326 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 1.2.3.4 tell 0.0.0.0, length 28
> 22:10:36.693627 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 1.2.3.4 tell 0.0.0.0, length 28
> 22:10:37.695010 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 1.2.3.4 tell 0.0.0.0, length 28
> 22:10:38.697121 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 1.2.3.4 tell 0.0.0.0, length 28
>
> ```
>
> It seems to work. However, is it the real point that should be fixed and does my patch fix it well? I need your suggestion.

This seems reasonable to me. It is good to provide user control of
L2TP checksum options.

However, there is a problem with your patch. The netlink attributes
you are accessing to control checksums are flags, not u8 values.

Maybe the default checksum setting for such l2tp tunnels should be
changed in the l2tp kernel code to match the previous behaviour where
IPv6 checksums were disabled?

>
>
> ---
>  ip/ipl2tp.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 45 insertions(+)
>
> diff --git a/ip/ipl2tp.c b/ip/ipl2tp.c
> index 3c8ee93..67a6482 100644
> --- a/ip/ipl2tp.c
> +++ b/ip/ipl2tp.c
> @@ -56,6 +56,8 @@ struct l2tp_parm {
>
>         uint16_t pw_type;
>         uint16_t mtu;
> +       int udp6_csum_tx:1;
> +       int udp6_csum_rx:1;
>         int udp_csum:1;
>         int recv_seq:1;
>         int send_seq:1;
> @@ -117,6 +119,9 @@ static int create_tunnel(struct l2tp_parm *p)
>         if (p->encap == L2TP_ENCAPTYPE_UDP) {
>                 addattr16(&req.n, 1024, L2TP_ATTR_UDP_SPORT, p->local_udp_port);
>                 addattr16(&req.n, 1024, L2TP_ATTR_UDP_DPORT, p->peer_udp_port);
> +               addattr8 (&req.n, 1024, L2TP_ATTR_UDP_CSUM, p->udp_csum);
> +               addattr8 (&req.n, 1024, L2TP_ATTR_UDP_ZERO_CSUM6_TX, p->udp6_csum_tx);
> +               addattr8 (&req.n, 1024, L2TP_ATTR_UDP_ZERO_CSUM6_RX, p->udp6_csum_rx);
>         }
>
>         if (rtnl_talk(&genl_rth, &req.n, NULL, 0) < 0)
> @@ -282,6 +287,14 @@ static int get_response(struct nlmsghdr *n, void *arg)
>                 p->l2spec_len = rta_getattr_u8(attrs[L2TP_ATTR_L2SPEC_LEN]);
>
>         p->udp_csum = !!attrs[L2TP_ATTR_UDP_CSUM];
> +       if (attrs[L2TP_ATTR_UDP_ZERO_CSUM6_TX])
> +         p->udp6_csum_tx = rta_getattr_u8(attrs[L2TP_ATTR_UDP_ZERO_CSUM6_TX]);
> +       else
> +         p->udp6_csum_tx = 1;
> +       if (attrs[L2TP_ATTR_UDP_ZERO_CSUM6_RX])
> +         p->udp6_csum_rx = rta_getattr_u8(attrs[L2TP_ATTR_UDP_ZERO_CSUM6_RX]);
> +       else
> +         p->udp6_csum_rx = 1;
>         if (attrs[L2TP_ATTR_COOKIE])
>                 memcpy(p->cookie, RTA_DATA(attrs[L2TP_ATTR_COOKIE]),
>                        p->cookie_len = RTA_PAYLOAD(attrs[L2TP_ATTR_COOKIE]));
> @@ -470,6 +483,9 @@ static void usage(void)
>         fprintf(stderr, "          tunnel_id ID peer_tunnel_id ID\n");
>         fprintf(stderr, "          [ encap { ip | udp } ]\n");
>         fprintf(stderr, "          [ udp_sport PORT ] [ udp_dport PORT ]\n");
> +       fprintf(stderr, "          [ udp_csum { on | off } ]\n");
> +       fprintf(stderr, "          [ udp6_csum_tx { on | off } ]\n");
> +       fprintf(stderr, "          [ udp6_csum_rx { on | off } ]\n");
>         fprintf(stderr, "Usage: ip l2tp add session [ name NAME ]\n");
>         fprintf(stderr, "          tunnel_id ID\n");
>         fprintf(stderr, "          session_id ID peer_session_id ID\n");
> @@ -500,6 +516,8 @@ static int parse_args(int argc, char **argv, int cmd, struct l2tp_parm *p)
>         /* Defaults */
>         p->l2spec_type = L2TP_L2SPECTYPE_DEFAULT;
>         p->l2spec_len = 4;
> +       p->udp6_csum_rx = 1;
> +       p->udp6_csum_tx = 1;
>
>         while (argc > 0) {
>                 if (strcmp(*argv, "encap") == 0) {
> @@ -569,6 +587,33 @@ static int parse_args(int argc, char **argv, int cmd, struct l2tp_parm *p)
>                         if (get_u16(&uval, *argv, 0))
>                                 invarg("invalid port\n", *argv);
>                         p->peer_udp_port = uval;
> +               } else if (strcmp(*argv, "udp_csum") == 0) {
> +                       NEXT_ARG();
> +                       if (strcmp(*argv, "on") == 0) {
> +                               p->udp_csum = 1;
> +                       } else if (strcmp(*argv, "off") == 0) {
> +                               p->udp_csum = 0;
> +                       } else {
> +                               invarg("invalid option for udp_csum\n", *argv);
> +                       }
> +               } else if (strcmp(*argv, "udp6_csum_rx") == 0) {
> +                       NEXT_ARG();
> +                       if (strcmp(*argv, "on") == 0) {
> +                               p->udp6_csum_rx = 1;
> +                       } else if (strcmp(*argv, "off") == 0) {
> +                               p->udp6_csum_rx = 0;
> +                       } else {
> +                               invarg("invalid option for udp6_csum_rx\n", *argv);
> +                       }
> +               } else if (strcmp(*argv, "udp6_csum_tx") == 0) {
> +                       NEXT_ARG();
> +                       if (strcmp(*argv, "on") == 0) {
> +                               p->udp6_csum_tx = 1;
> +                       } else if (strcmp(*argv, "off") == 0) {
> +                               p->udp6_csum_tx = 0;
> +                       } else {
> +                               invarg("invalid option for udp6_csum_tx\n", *argv);
> +                       }
>                 } else if (strcmp(*argv, "offset") == 0) {
>                         __u8 uval;
>
> --
> 2.5.2
>

^ permalink raw reply

* [PATCH net] gre: reject GUE and FOU in collect metadata mode
From: Jiri Benc @ 2016-04-27 12:08 UTC (permalink / raw)
  To: netdev; +Cc: Pravin Shelar, Thomas Graf, Tom Herbert

The collect metadata mode does not support GUE nor FOU. This might be
implemented later; until then, we should reject such config.

I think this is okay to be changed. It's unlikely anyone has such
configuration (as it doesn't work anyway) and we may need a way to
distinguish whether it's supported or not by the kernel later.

For backwards compatibility with iproute2, it's not possible to just check
the attribute presence (iproute2 always includes the attribute), the actual
value has to be checked, too.

Fixes: 2e15ea390e6f4 ("ip_gre: Add support to collect tunnel metadata.")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
---
Discovered this only after I already sent v3 of the previous gre set.
Submitting this patch on its own, it's an indepent fix anyway (though fixing
the same commit).
---
 net/ipv4/ip_gre.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index f973e0a58993..f502d34bcb40 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -948,6 +948,11 @@ static int ipgre_tunnel_validate(struct nlattr *tb[], struct nlattr *data[])
 	if (flags & (GRE_VERSION|GRE_ROUTING))
 		return -EINVAL;

+	if (data[IFLA_GRE_COLLECT_METADATA] &&
+	    data[IFLA_GRE_ENCAP_TYPE] &&
+	    nla_get_u16(data[IFLA_GRE_ENCAP_TYPE]) != TUNNEL_ENCAP_NONE)
+		return -EINVAL;
+
 	return 0;
 }

-- 
1.8.3.1

^ permalink raw reply related

* Re: [RFC PATCH V2 2/2] vhost: device IOTLB API
From: Michael S. Tsirkin @ 2016-04-27 11:45 UTC (permalink / raw)
  To: Jason Wang
  Cc: kvm, qemu-devel, netdev, linux-kernel, peterx, virtualization,
	pbonzini
In-Reply-To: <1458873274-13961-3-git-send-email-jasowang@redhat.com>

On Fri, Mar 25, 2016 at 10:34:34AM +0800, Jason Wang wrote:
> This patch tries to implement an device IOTLB for vhost. This could be
> used with for co-operation with userspace(qemu) implementation of DMA
> remapping.
> 
> The idea is simple. When vhost meets an IOTLB miss, it will request
> the assistance of userspace to do the translation, this is done
> through:
> 
> - Fill the translation request in a preset userspace address (This
>   address is set through ioctl VHOST_SET_IOTLB_REQUEST_ENTRY).
> - Notify userspace through eventfd (This eventfd was set through ioctl
>   VHOST_SET_IOTLB_FD).

Why use an eventfd for this? We use them for interrupts because
that happens to be what kvm wants, but here - why don't we
just add a generic support for reading out events
on the vhost fd itself?

> - device IOTLB were started and stopped through VHOST_RUN_IOTLB ioctl
> 
> When userspace finishes the translation, it will update the vhost
> IOTLB through VHOST_UPDATE_IOTLB ioctl. Userspace is also in charge of
> snooping the IOTLB invalidation of IOMMU IOTLB and use
> VHOST_UPDATE_IOTLB to invalidate the possible entry in vhost.

There's one problem here, and that is that VQs still do not undergo
translation.  In theory VQ could be mapped in such a way
that it's not contigious in userspace memory.


> Signed-off-by: Jason Wang <jasowang@redhat.com>

What limits amount of entries that kernel keeps around?
Do we want at least a mod parameter for this?


> ---
>  drivers/vhost/net.c        |   6 +-
>  drivers/vhost/vhost.c      | 301 +++++++++++++++++++++++++++++++++++++++------
>  drivers/vhost/vhost.h      |  17 ++-
>  fs/eventfd.c               |   3 +-
>  include/uapi/linux/vhost.h |  35 ++++++
>  5 files changed, 320 insertions(+), 42 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 481db96..7cbdeed 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -334,7 +334,7 @@ static void handle_tx(struct vhost_net *net)
>  		head = vhost_get_vq_desc(vq, vq->iov,
>  					 ARRAY_SIZE(vq->iov),
>  					 &out, &in,
> -					 NULL, NULL);
> +					 NULL, NULL, VHOST_ACCESS_RO);
>  		/* On error, stop handling until the next kick. */
>  		if (unlikely(head < 0))
>  			break;
> @@ -470,7 +470,7 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
>  		}
>  		r = vhost_get_vq_desc(vq, vq->iov + seg,
>  				      ARRAY_SIZE(vq->iov) - seg, &out,
> -				      &in, log, log_num);
> +				      &in, log, log_num, VHOST_ACCESS_WO);
>  		if (unlikely(r < 0))
>  			goto err;
>  
> @@ -1083,7 +1083,7 @@ static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
>  		r = vhost_dev_ioctl(&n->dev, ioctl, argp);
>  		if (r == -ENOIOCTLCMD)
>  			r = vhost_vring_ioctl(&n->dev, ioctl, argp);
> -		else
> +		else if (ioctl != VHOST_UPDATE_IOTLB)
>  			vhost_net_flush(n);
>  		mutex_unlock(&n->dev.mutex);
>  		return r;
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 32c35a9..1dd43e8 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -280,6 +280,10 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>  	vq->call_ctx = NULL;
>  	vq->call = NULL;
>  	vq->log_ctx = NULL;
> +	vq->iotlb_call = NULL;
> +	vq->iotlb_call_ctx = NULL;
> +	vq->iotlb_request = NULL;
> +	vq->pending_request.flags.type = VHOST_IOTLB_INVALIDATE;
>  	vq->umem = NULL;
>  	vq->is_le = virtio_legacy_is_little_endian();
>  	vhost_vq_reset_user_be(vq);
> @@ -387,8 +391,10 @@ void vhost_dev_init(struct vhost_dev *dev,
>  	dev->log_ctx = NULL;
>  	dev->log_file = NULL;
>  	dev->umem = NULL;
> +	dev->iotlb = NULL;
>  	dev->mm = NULL;
>  	spin_lock_init(&dev->work_lock);
> +	spin_lock_init(&dev->iotlb_lock);
>  	INIT_LIST_HEAD(&dev->work_list);
>  	dev->worker = NULL;
>  
> @@ -537,6 +543,15 @@ void vhost_dev_stop(struct vhost_dev *dev)
>  }
>  EXPORT_SYMBOL_GPL(vhost_dev_stop);
>  
> +static void vhost_umem_free(struct vhost_umem *umem,
> +			    struct vhost_umem_node *node)
> +{
> +	vhost_umem_interval_tree_remove(node, &umem->umem_tree);
> +	list_del(&node->link);
> +	kfree(node);
> +	umem->numem--;
> +}
> +
>  static void vhost_umem_clean(struct vhost_umem *umem)
>  {
>  	struct vhost_umem_node *node, *tmp;
> @@ -544,11 +559,9 @@ static void vhost_umem_clean(struct vhost_umem *umem)
>  	if (!umem)
>  		return;
>  
> -	list_for_each_entry_safe(node, tmp, &umem->umem_list, link) {
> -		vhost_umem_interval_tree_remove(node, &umem->umem_tree);
> -		list_del(&node->link);
> -		kvfree(node);
> -	}
> +	list_for_each_entry_safe(node, tmp, &umem->umem_list, link)
> +		vhost_umem_free(umem, node);
> +
>  	kvfree(umem);
>  }
>  
> @@ -580,6 +593,8 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
>  	/* No one will access memory at this point */
>  	vhost_umem_clean(dev->umem);
>  	dev->umem = NULL;
> +	vhost_umem_clean(dev->iotlb);
> +	dev->iotlb = NULL;
>  	WARN_ON(!list_empty(&dev->work_list));
>  	if (dev->worker) {
>  		kthread_stop(dev->worker);
> @@ -699,11 +714,61 @@ int vhost_vq_access_ok(struct vhost_virtqueue *vq)
>  }
>  EXPORT_SYMBOL_GPL(vhost_vq_access_ok);
>  
> +static int vhost_new_umem_range(struct vhost_umem *umem,
> +				u64 start, u64 size, u64 end,
> +				u64 userspace_addr, int perm)
> +{
> +	struct vhost_umem_node *tmp, *node = kmalloc(sizeof(*node), GFP_ATOMIC);
> +
> +	if (!node)
> +		return -ENOMEM;
> +
> +	if (umem->numem == VHOST_IOTLB_SIZE) {
> +		tmp = list_last_entry(&umem->umem_list, typeof(*tmp), link);
> +		vhost_umem_free(umem, tmp);
> +	}
> +
> +	node->start = start;
> +	node->size = size;
> +	node->last = end;
> +	node->userspace_addr = userspace_addr;
> +	node->perm = perm;
> +	INIT_LIST_HEAD(&node->link);
> +	list_add_tail(&node->link, &umem->umem_list);
> +	vhost_umem_interval_tree_insert(node, &umem->umem_tree);
> +	umem->numem++;
> +
> +	return 0;
> +}
> +
> +static void vhost_del_umem_range(struct vhost_umem *umem,
> +				 u64 start, u64 end)
> +{
> +	struct vhost_umem_node *node;
> +
> +	while ((node = vhost_umem_interval_tree_iter_first(&umem->umem_tree,
> +							   start, end)))
> +		vhost_umem_free(umem, node);
> +}
> +
> +static struct vhost_umem *vhost_umem_alloc(void)
> +{
> +	struct vhost_umem *umem = vhost_kvzalloc(sizeof(*umem));
> +
> +	if (!umem)
> +		return NULL;
> +
> +	umem->umem_tree = RB_ROOT;
> +	umem->numem = 0;
> +	INIT_LIST_HEAD(&umem->umem_list);
> +
> +	return umem;
> +}
> +
>  static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
>  {
>  	struct vhost_memory mem, *newmem;
>  	struct vhost_memory_region *region;
> -	struct vhost_umem_node *node;
>  	struct vhost_umem *newumem, *oldumem;
>  	unsigned long size = offsetof(struct vhost_memory, regions);
>  	int i;
> @@ -725,28 +790,23 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
>  		return -EFAULT;
>  	}
>  
> -	newumem = vhost_kvzalloc(sizeof(*newumem));
> +	newumem = vhost_umem_alloc();
>  	if (!newumem) {
>  		kvfree(newmem);
>  		return -ENOMEM;
>  	}
>  
> -	newumem->umem_tree = RB_ROOT;
> -	INIT_LIST_HEAD(&newumem->umem_list);
> -
>  	for (region = newmem->regions;
>  	     region < newmem->regions + mem.nregions;
>  	     region++) {
> -		node = vhost_kvzalloc(sizeof(*node));
> -		if (!node)
> +		if (vhost_new_umem_range(newumem,
> +					 region->guest_phys_addr,
> +					 region->memory_size,
> +					 region->guest_phys_addr +
> +					 region->memory_size - 1,
> +					 region->userspace_addr,
> +				         VHOST_ACCESS_RW))
>  			goto err;
> -		node->start = region->guest_phys_addr;
> -		node->size = region->memory_size;
> -		node->last = node->start + node->size - 1;
> -		node->userspace_addr = region->userspace_addr;
> -		INIT_LIST_HEAD(&node->link);
> -		list_add_tail(&node->link, &newumem->umem_list);
> -		vhost_umem_interval_tree_insert(node, &newumem->umem_tree);
>  	}
>  
>  	if (!memory_access_ok(d, newumem, 0))
> @@ -782,6 +842,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp)
>  	struct vhost_vring_state s;
>  	struct vhost_vring_file f;
>  	struct vhost_vring_addr a;
> +	struct vhost_vring_iotlb_entry e;
>  	u32 idx;
>  	long r;
>  
> @@ -910,6 +971,35 @@ long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp)
>  		} else
>  			filep = eventfp;
>  		break;
> +	case VHOST_SET_VRING_IOTLB_REQUEST:
> +		r = -EFAULT;
> +		if (copy_from_user(&e, argp, sizeof e))
> +			break;
> +		if (!access_ok(VERIFY_WRITE, e.userspace_addr,
> +				sizeof(*vq->iotlb_request)))
> +			break;
> +		r = 0;
> +		vq->iotlb_request = (struct vhost_iotlb_entry __user *)e.userspace_addr;
> +		break;
> +	case VHOST_SET_VRING_IOTLB_CALL:
> +		if (copy_from_user(&f, argp, sizeof f)) {
> +			r = -EFAULT;
> +			break;
> +		}
> +		eventfp = f.fd == -1 ? NULL : eventfd_fget(f.fd);
> +		if (IS_ERR(eventfp)) {
> +			r = PTR_ERR(eventfp);
> +			break;
> +		}
> +		if (eventfp != vq->iotlb_call) {
> +			filep = vq->iotlb_call;
> +			ctx = vq->iotlb_call_ctx;
> +			vq->iotlb_call = eventfp;
> +			vq->iotlb_call_ctx = eventfp ?
> +				eventfd_ctx_fileget(eventfp) : NULL;
> +		} else
> +			filep = eventfp;
> +		break;
>  	case VHOST_SET_VRING_CALL:
>  		if (copy_from_user(&f, argp, sizeof f)) {
>  			r = -EFAULT;
> @@ -977,11 +1067,55 @@ long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp)
>  }
>  EXPORT_SYMBOL_GPL(vhost_vring_ioctl);
>  
> +static int vhost_init_device_iotlb(struct vhost_dev *d, bool enabled)
> +{
> +	struct vhost_umem *niotlb, *oiotlb;
> +
> +	if (enabled) {
> +		niotlb = vhost_umem_alloc();
> +		if (!niotlb)
> +			return -ENOMEM;
> +	} else
> +		niotlb = NULL;
> +
> +	spin_lock(&d->iotlb_lock);
> +	oiotlb = d->iotlb;
> +	d->iotlb = niotlb;
> +	spin_unlock(&d->iotlb_lock);
> +
> +	vhost_umem_clean(oiotlb);
> +
> +	return 0;
> +}
> +
> +static void vhost_complete_iotlb_update(struct vhost_dev *d,
> +					struct vhost_iotlb_entry *entry)
> +{
> +	struct vhost_iotlb_entry *req;
> +	struct vhost_virtqueue *vq;
> +	int i;
> +
> +
> +	for (i = 0; i < d->nvqs; i++) {
> +		vq = d->vqs[i];
> +		mutex_lock(&vq->mutex);
> +		req = &vq->pending_request;
> +		if (entry->iova <= req->iova &&
> +		    entry->iova + entry->size - 1 > req->iova &&
> +		    req->flags.type == VHOST_IOTLB_MISS) {
> +			*req = *entry;
> +			vhost_poll_queue(&vq->poll);
> +		}
> +		mutex_unlock(&vq->mutex);
> +	}
> +}
> +
>  /* Caller must have device mutex */
>  long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp)
>  {
>  	struct file *eventfp, *filep = NULL;
>  	struct eventfd_ctx *ctx = NULL;
> +	struct vhost_iotlb_entry entry;
>  	u64 p;
>  	long r;
>  	int i, fd;
> @@ -1050,6 +1184,52 @@ long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp)
>  		if (filep)
>  			fput(filep);
>  		break;
> +	case VHOST_RUN_IOTLB:
> +		/* FIXME: enable and disabled */
> +		vhost_init_device_iotlb(d, true);
> +		break;
> +	case VHOST_UPDATE_IOTLB:
> +		r = copy_from_user(&entry, argp, sizeof(entry));
> +		if (r < 0) {
> +			r = -EFAULT;
> +			goto done;
> +		}
> +
> +		spin_lock(&d->iotlb_lock);
> +		if (!d->iotlb) {
> +			spin_unlock(&d->iotlb_lock);
> +			r = -EFAULT;
> +			goto done;
> +		}
> +		switch (entry.flags.type) {
> +		case VHOST_IOTLB_UPDATE:
> +			if (entry.flags.valid != VHOST_IOTLB_VALID) {
> +				break;
> +			}
> +			if (vhost_new_umem_range(d->iotlb,
> +						 entry.iova,
> +						 entry.size,
> +						 entry.iova + entry.size - 1,
> +                                                 entry.userspace_addr,
> +                                                 entry.flags.perm)) {
> +				r = -ENOMEM;
> +				break;
> +			}
> +			break;
> +		case VHOST_IOTLB_INVALIDATE:
> +			vhost_del_umem_range(d->iotlb,
> +					     entry.iova,
> +					     entry.iova + entry.size - 1);
> +			break;
> +		default:
> +			r = -EINVAL;
> +		}
> +		spin_unlock(&d->iotlb_lock);
> +
> +		if (!r && entry.flags.type != VHOST_IOTLB_INVALIDATE)
> +			vhost_complete_iotlb_update(d, &entry);
> +
> +		break;
>  	default:
>  		r = -ENOIOCTLCMD;
>  		break;
> @@ -1197,27 +1377,69 @@ int vhost_init_used(struct vhost_virtqueue *vq)
>  }
>  EXPORT_SYMBOL_GPL(vhost_init_used);
>  
> +static int vhost_iotlb_miss(struct vhost_virtqueue *vq, u64 iova)
> +{
> +	struct vhost_iotlb_entry *pending = &vq->pending_request;
> +	int ret;
> +
> +	if (!vq->iotlb_call_ctx)
> +		return -EOPNOTSUPP;
> +
> +	if (!vq->iotlb_request)
> +		return -EOPNOTSUPP;
> +
> +	if (pending->flags.type == VHOST_IOTLB_MISS) {
> +		return -EEXIST;
> +	}
> +
> +	pending->iova = iova;
> +	pending->flags.type = VHOST_IOTLB_MISS;
> +
> +	ret = __copy_to_user(vq->iotlb_request, pending,
> +			     sizeof(struct vhost_iotlb_entry));
> +	if (ret) {
> +		goto err;
> +	}
> +
> +	if (vq->iotlb_call_ctx)
> +		eventfd_signal(vq->iotlb_call_ctx, 1);
> +err:
> +	return ret;
> +}
> +
>  static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
> -			  struct iovec iov[], int iov_size)
> +			  struct iovec iov[], int iov_size, int access)
>  {
>  	const struct vhost_umem_node *node;
> -	struct vhost_umem *umem = vq->umem;
> +	struct vhost_dev *dev = vq->dev;
> +	struct vhost_umem *umem = dev->iotlb ? dev->iotlb : dev->umem;
>  	struct iovec *_iov;
>  	u64 s = 0;
>  	int ret = 0;
>  
> +	spin_lock(&dev->iotlb_lock);
> +
>  	while ((u64)len > s) {
>  		u64 size;
>  		if (unlikely(ret >= iov_size)) {
>  			ret = -ENOBUFS;
>  			break;
>  		}
> +
>  		node = vhost_umem_interval_tree_iter_first(&umem->umem_tree,
>  							addr, addr + len - 1);
>  		if (node == NULL || node->start > addr) {
> -			ret = -EFAULT;
> +			if (umem != dev->iotlb) {
> +				ret = -EFAULT;
> +				break;
> +			}
> +			ret = -EAGAIN;
> +			break;
> +		} else if (!(node->perm & access)) {
> +			ret = -EPERM;
>  			break;
>  		}
> +
>  		_iov = iov + ret;
>  		size = node->size - addr + node->start;
>  		_iov->iov_len = min((u64)len - s, size);
> @@ -1228,6 +1450,10 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
>  		++ret;
>  	}
>  
> +	spin_unlock(&dev->iotlb_lock);
> +
> +	if (ret == -EAGAIN)
> +		vhost_iotlb_miss(vq, addr);
>  	return ret;
>  }
>  
> @@ -1256,7 +1482,7 @@ static int get_indirect(struct vhost_virtqueue *vq,
>  			struct iovec iov[], unsigned int iov_size,
>  			unsigned int *out_num, unsigned int *in_num,
>  			struct vhost_log *log, unsigned int *log_num,
> -			struct vring_desc *indirect)
> +			struct vring_desc *indirect, int access)
>  {
>  	struct vring_desc desc;
>  	unsigned int i = 0, count, found = 0;
> @@ -1274,9 +1500,10 @@ static int get_indirect(struct vhost_virtqueue *vq,
>  	}
>  
>  	ret = translate_desc(vq, vhost64_to_cpu(vq, indirect->addr), len, vq->indirect,
> -			     UIO_MAXIOV);
> +			     UIO_MAXIOV, access);
>  	if (unlikely(ret < 0)) {
> -		vq_err(vq, "Translation failure %d in indirect.\n", ret);
> +		if (ret != -EAGAIN)
> +			vq_err(vq, "Translation failure %d in indirect.\n", ret);
>  		return ret;
>  	}
>  	iov_iter_init(&from, READ, vq->indirect, ret, len);
> @@ -1316,10 +1543,11 @@ static int get_indirect(struct vhost_virtqueue *vq,
>  
>  		ret = translate_desc(vq, vhost64_to_cpu(vq, desc.addr),
>  				     vhost32_to_cpu(vq, desc.len), iov + iov_count,
> -				     iov_size - iov_count);
> +				     iov_size - iov_count, access);
>  		if (unlikely(ret < 0)) {
> -			vq_err(vq, "Translation failure %d indirect idx %d\n",
> -			       ret, i);
> +			if (ret != -EAGAIN)
> +				vq_err(vq, "Translation failure %d indirect idx %d\n",
> +					ret, i);
>  			return ret;
>  		}
>  		/* If this is an input descriptor, increment that count. */
> @@ -1355,7 +1583,8 @@ static int get_indirect(struct vhost_virtqueue *vq,
>  int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>  		      struct iovec iov[], unsigned int iov_size,
>  		      unsigned int *out_num, unsigned int *in_num,
> -		      struct vhost_log *log, unsigned int *log_num)
> +		      struct vhost_log *log, unsigned int *log_num,
> +		      int access)
>  {
>  	struct vring_desc desc;
>  	unsigned int i, head, found = 0;
> @@ -1433,10 +1662,11 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>  		if (desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_INDIRECT)) {
>  			ret = get_indirect(vq, iov, iov_size,
>  					   out_num, in_num,
> -					   log, log_num, &desc);
> +					   log, log_num, &desc, access);
>  			if (unlikely(ret < 0)) {
> -				vq_err(vq, "Failure detected "
> -				       "in indirect descriptor at idx %d\n", i);
> +				if (ret != -EAGAIN)
> +					vq_err(vq, "Failure detected "
> +						"in indirect descriptor at idx %d\n", i);
>  				return ret;
>  			}
>  			continue;
> @@ -1444,10 +1674,11 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>  
>  		ret = translate_desc(vq, vhost64_to_cpu(vq, desc.addr),
>  				     vhost32_to_cpu(vq, desc.len), iov + iov_count,
> -				     iov_size - iov_count);
> +				     iov_size - iov_count, access);
>  		if (unlikely(ret < 0)) {
> -			vq_err(vq, "Translation failure %d descriptor idx %d\n",
> -			       ret, i);
> +			if (ret != -EAGAIN)
> +				vq_err(vq, "Translation failure %d descriptor idx %d\n",
> +					ret, i);
>  			return ret;
>  		}
>  		if (desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_WRITE)) {
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 5d64393..4365104 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -62,13 +62,15 @@ struct vhost_umem_node {
>  	__u64 last;
>  	__u64 size;
>  	__u64 userspace_addr;
> -	__u64 flags_padding;
> +	__u32 perm;
> +	__u32 flags_padding;
>  	__u64 __subtree_last;
>  };
>  
>  struct vhost_umem {
>  	struct rb_root umem_tree;
>  	struct list_head umem_list;
> +	int numem;
>  };
>  
>  /* The virtqueue structure describes a queue attached to a device. */
> @@ -84,9 +86,13 @@ struct vhost_virtqueue {
>  	struct file *kick;
>  	struct file *call;
>  	struct file *error;
> +	struct file *iotlb_call;
>  	struct eventfd_ctx *call_ctx;
>  	struct eventfd_ctx *error_ctx;
>  	struct eventfd_ctx *log_ctx;
> +	struct eventfd_ctx *iotlb_call_ctx;
> +	struct vhost_iotlb_entry __user *iotlb_request;
> +	struct vhost_iotlb_entry pending_request;
>  
>  	struct vhost_poll poll;
>  
> @@ -135,6 +141,8 @@ struct vhost_virtqueue {
>  #endif
>  };
>  
> +#define VHOST_IOTLB_SIZE 2048
> +
>  struct vhost_dev {
>  	struct mm_struct *mm;
>  	struct mutex mutex;
> @@ -146,6 +154,8 @@ struct vhost_dev {
>  	struct list_head work_list;
>  	struct task_struct *worker;
>  	struct vhost_umem *umem;
> +	struct vhost_umem *iotlb;
> +	spinlock_t iotlb_lock;
>  };
>  
>  void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int nvqs);
> @@ -164,7 +174,8 @@ int vhost_log_access_ok(struct vhost_dev *);
>  int vhost_get_vq_desc(struct vhost_virtqueue *,
>  		      struct iovec iov[], unsigned int iov_count,
>  		      unsigned int *out_num, unsigned int *in_num,
> -		      struct vhost_log *log, unsigned int *log_num);
> +		      struct vhost_log *log, unsigned int *log_num,
> +		      int access);
>  void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
>  
>  int vhost_init_used(struct vhost_virtqueue *);
> @@ -183,7 +194,7 @@ int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
>  		    unsigned int log_num, u64 len);
>  
>  #define vq_err(vq, fmt, ...) do {                                  \
> -		pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
> +		printk(pr_fmt(fmt), ##__VA_ARGS__);       \
>  		if ((vq)->error_ctx)                               \
>  				eventfd_signal((vq)->error_ctx, 1);\
>  	} while (0)
> diff --git a/fs/eventfd.c b/fs/eventfd.c
> index 8d0c0df..5c0a22f 100644
> --- a/fs/eventfd.c
> +++ b/fs/eventfd.c
> @@ -59,8 +59,9 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
>  	if (ULLONG_MAX - ctx->count < n)
>  		n = ULLONG_MAX - ctx->count;
>  	ctx->count += n;
> -	if (waitqueue_active(&ctx->wqh))
> +	if (waitqueue_active(&ctx->wqh)) {
>  		wake_up_locked_poll(&ctx->wqh, POLLIN);
> +	}
>  	spin_unlock_irqrestore(&ctx->wqh.lock, flags);
>  
>  	return n;
> diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
> index ab373191..5c35ab4 100644
> --- a/include/uapi/linux/vhost.h
> +++ b/include/uapi/linux/vhost.h
> @@ -47,6 +47,32 @@ struct vhost_vring_addr {
>  	__u64 log_guest_addr;
>  };
>  
> +struct vhost_iotlb_entry {
> +	__u64 iova;
> +	__u64 size;
> +	__u64 userspace_addr;

Alignment requirements?

> +	struct {
> +#define VHOST_ACCESS_RO      0x1
> +#define VHOST_ACCESS_WO      0x2
> +#define VHOST_ACCESS_RW      0x3
> +		__u8  perm;
> +#define VHOST_IOTLB_MISS           1
> +#define VHOST_IOTLB_UPDATE         2
> +#define VHOST_IOTLB_INVALIDATE     3
> +		__u8  type;
> +#define VHOST_IOTLB_INVALID        0x1
> +#define VHOST_IOTLB_VALID          0x2
> +		__u8  valid;

why do we need this flag?

> +		__u8  u8_padding;
> +		__u32 padding;
> +	} flags;
> +};
> +
> +struct vhost_vring_iotlb_entry {
> +	unsigned int index;
> +	__u64 userspace_addr;
> +};
> +
>  struct vhost_memory_region {
>  	__u64 guest_phys_addr;
>  	__u64 memory_size; /* bytes */
> @@ -127,6 +153,15 @@ struct vhost_memory {
>  /* Set eventfd to signal an error */
>  #define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file)
>  
> +/* IOTLB */
> +/* Specify an eventfd file descriptor to signle on IOTLB miss */

typo

> +#define VHOST_SET_VRING_IOTLB_CALL _IOW(VHOST_VIRTIO, 0x23, struct      \
> +                                        vhost_vring_file)
> +#define VHOST_SET_VRING_IOTLB_REQUEST _IOW(VHOST_VIRTIO, 0x25, struct   \
> +                                           vhost_vring_iotlb_entry)
> +#define VHOST_UPDATE_IOTLB _IOW(VHOST_VIRTIO, 0x24, struct vhost_iotlb_entry)
> +#define VHOST_RUN_IOTLB _IOW(VHOST_VIRTIO, 0x26, int)
> +

Is the assumption that userspace must dedicate a thread to running the iotlb? 
I dislike this one.
Please support asynchronous APIs at least optionally to make
userspace make its own threading decisions.

>  /* VHOST_NET specific defines */
>  
>  /* Attach virtio net ring to a raw socket, or tap device.

Don't we need a feature bit for this?
Are we short on feature bits? If yes maybe it's time to add
something like PROTOCOL_FEATURES that we have in vhost-user.

> -- 
> 2.5.0

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox