Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH net-next] tuntap: remove unnecessary sk_receive_queue length check during xmit
From: Jason Wang @ 2016-11-23  2:26 UTC (permalink / raw)
  To: netdev, linux-kernel; +Cc: mst, Jason Wang

After commit 1576d9860599 ("tun: switch to use skb array for tx"),
sk_receive_queue was not used any more. So remove the uncessary
sk_receive_queue length check during xmit.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/tun.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 64e694c..e2af2dd 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -878,13 +878,6 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 	    sk_filter(tfile->socket.sk, skb))
 		goto drop;
 
-	/* Limit the number of packets queued by dividing txq length with the
-	 * number of queues.
-	 */
-	if (skb_queue_len(&tfile->socket.sk->sk_receive_queue) * numqueues
-			  >= dev->tx_queue_len)
-		goto drop;
-
 	if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
 		goto drop;
 
-- 
2.7.4

^ permalink raw reply related

* [PATCH net 1/1] net sched filters: fix filter handle ID in tfilter_notify_chain()
From: Roman Mashak @ 2016-11-23  1:57 UTC (permalink / raw)
  To: davem; +Cc: netdev, jhs, xiyou.wangcong, daniel, Roman Mashak

Should pass valid filter handle, not the netlink flags.

Fixes: 30a391a13ab92 ("net sched filters: pass netlink message flags in event notification")
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
 net/sched/cls_api.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 8e93d4a..b05d4a2 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -112,7 +112,7 @@ static void tfilter_notify_chain(struct net *net, struct sk_buff *oskb,
 
 	for (it_chain = chain; (tp = rtnl_dereference(*it_chain)) != NULL;
 	     it_chain = &tp->next)
-		tfilter_notify(net, oskb, n, tp, n->nlmsg_flags, event, false);
+		tfilter_notify(net, oskb, n, tp, 0, event, false);
 }
 
 /* Select new prio value from the range, managed by kernel. */
-- 
1.9.1

^ permalink raw reply related

* net/arp: ARP cache aging failed.
From: yuehaibing @ 2016-11-23  1:16 UTC (permalink / raw)
  To: davem; +Cc: netdev


 Hi,

	I've encountered a arp cache aging failed bug in 4.9 kernel.The topo is as follow:


		HOST1			   --------		  -----
	  --------------------IP1 --------| Switch |--------IP2-| HOST2 |
		 |			   --------		  -----
 	  ------Bonging----                  |
	   |		   |		     IP3
 	 MAC1            MAC2		   | HOST3 |


	HOST1 have a bonding interface which including two NICs

	IP1:192.168.1.100/24
	IP2:192.168.1.200/24
	IP2:192.168.1.300/24

	There are large numbers of TCP transaction between HOST2 and HOST3.


	The Host2 can ping HOST1 normally.However,It cannot ping after HOST1 bonding interface deactived a working NIC.
	

	on HOST2 ,use fowllow command:

		watch "ip -s neigh show|grep 192.168.1.100"
	
	I noticed the old HOST1 arp cache aging counter is gradually  increased ,then reset before it reached 30.This process is repeated,
thus the arp cache holding REACHABLE status.The new HOST1 MAC arp cache cannot been renewed ,and thus ping cannot sendto the correct HOST1 MAC.

	Then I found n->confirmed is freshed  in dst_neigh_output while dst->pending_confirm is set to 1.

include/net/dst.h

static inline void dst_confirm(struct dst_entry *dst)
{
	dst->pending_confirm = 1;
}

static inline int dst_neigh_output(struct dst_entry *dst, struct neighbour *n,
                                   struct sk_buff *skb)
{
        const struct hh_cache *hh;

        if (dst->pending_confirm) {
                unsigned long now = jiffies;

                dst->pending_confirm = 0;
                /* avoid dirtying neighbour */
                if (n->confirmed != now)
                        n->confirmed = now;
        }
	.......
}

	dst_confirm can be called by tcp_ack in net/ipv4/tcp_input.c.

/* This routine deals with incoming acks, but not outgoing ones. */
static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
{
	.....
	if ((flag & FLAG_FORWARD_PROGRESS) || !(flag & FLAG_NOT_DUP)) {
		struct dst_entry *dst = __sk_dst_get(sk);
		if (dst)
			dst_confirm(dst);
	}
	.....
}
	
	As to my topo,HOST1 and HOST3 share one route on HOST2, tcp connection between HOST2 and HOST3 may call tcp_ack to set dst->pending_confirm.
	
So dst_neigh_output may wrongly freshed  n->confirmed which stands for HOST1,however HOST1'MAC had been changed.

	The possibility of this occurred Significantly increases ,when ping and TCP transaction are set the same processor affinity on the HOST2.

	It seems that the issue is brought in commit 5110effee8fde2edfacac9cd12a9960ab2dc39ea ("net: Do delayed neigh confirmation.").

^ permalink raw reply

* Re: [RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver
From: Vishwanathapura, Niranjana @ 2016-11-23  0:53 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Jason Gunthorpe, Doug Ledford, linux-rdma, netdev,
	Dennis Dalessandro
In-Reply-To: <alpine.DEB.2.20.1611221656390.8214@east.gentwo.org>

On Tue, Nov 22, 2016 at 05:04:37PM -0600, Christoph Lameter wrote:
>On Tue, 22 Nov 2016, Vishwanathapura, Niranjana wrote:
>
>> Ok, I do understand Jason's point that we should probably not put this driver
>> under drivers/infiniband/sw/.., as this driver is not a HCA.
>> It is an ULP similar to ipoib, built on top of Omni-path irrespective of
>> whether we register a hfi_vnic_bus or a direct custom interface with HFI1.
>> This ULP will transmit and recieve Omni-path packets over the fabric, and is
>> dependent on IB MAD interface and the HFI1 driver.
>
>This is something that encapsulates IP (v4 right?) in something else.
>Would belong into
>
>	linux/net/ipv4
>
>You already have similar implementations there
>
>See f.e. ipip.c, ip_tunnel.c and lots more (try
>	ls linux/net/ipv4/*tunnel*
>
>)
>
>If this is more like a device then it would belong into
>
>linux/drivers/net/hfi or so (see also linux/drivers/net/ppp, plip,
>loopback, etc etc)
>

It is Ethernet packet encapsulated in Omni-path header by hfi_vnic driver.
The packets are sent and received over the wire by the HFI1 device driven by 
HFI1 driver. The encapsulation information is obtained via IB MAD control 
interface.

Niranjana

>
>

^ permalink raw reply

* Re: [RFC PATCH net-next] net: ethtool: add support for forward error correction modes
From: Casey Leedom @ 2016-11-22 23:30 UTC (permalink / raw)
  To: netdev@vger.kernel.org
In-Reply-To: <DM5PR12MB178630AFB893C17362A4C09BC8BB0@DM5PR12MB1786.namprd12.prod.outlook.com>

  I'm attempting to start the work necessary to implement Vidya's proposed new ethtool interface to manage Forward Error Correction settings on a link.  I'm confused by the ethtool FEC API and the degree/type of control it offers.  At the top of the patch we have:

Encoding: Types of encoding
Off    :  Turning off any encoding
RS     :  enforcing RS-FEC encoding on supported speeds
BaseR  :  enforcing Base R encoding on supported speeds
Auto   :  Default FEC settings  for  divers , and would represent
          asking the hardware to essentially go into a best effort mode.

but then later on we have:

+struct ethtool_fecparam {
+ __u32   cmd;
+ __u32   autoneg;
+     /* bitmask of FEC modes */
+     __u32   fec;
+     __u32   reserved;
+};

...

+enum ethtool_fec_config_bits {
+       ETHTOOL_FEC_NONE_BIT,
+       ETHTOOL_FEC_AUTO_BIT,
+       ETHTOOL_FEC_OFF_BIT,
+       ETHTOOL_FEC_RS_BIT,
+       ETHTOOL_FEC_BASER_BIT,
+};

...

+       ETHTOOL_LINK_MODE_FEC_NONE_BIT          = 47,
+       ETHTOOL_LINK_MODE_FEC_RS_BIT            = 48,
+       ETHTOOL_LINK_MODE_FEC_BASER_BIT           = 49,

The last ethtool Link Mode bits seem to imply a separable FEC on/off with individual control for RS and BASER.  How would the "Auto" from the top be encoded within these Link Mode bits?  And I don't see any reference to the ethtool_fec_config_bits in the kernel or ethtool patches so I'm not sure what they're supposed to reference.  Can you clarify the above?  I.e. can you offer a small template example of what a driver implementation might look like interpreting the incoming Link Mode Bits?

  And do we expect that there will be new FECs in the future?

Casey

^ permalink raw reply

* Re: [RESEND][PATCH v4] cgroup: Use CAP_SYS_RESOURCE to allow a process to migrate other tasks between cgroups
From: John Stultz @ 2016-11-23  0:57 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexei Starovoitov, Andy Lutomirski, Mickaël Salaün,
	Daniel Mack, David S. Miller, kafai-b10kYP2dOMg, Florian Westphal,
	Harald Hoyer, Network Development, Sargun Dhillon,
	Pablo Neira Ayuso, lkml, Tejun Heo, Li Zefan, Jonathan Corbet,
	open list:CONTROL GROUP (CGROUP), Android Kernel Team,
	Rom Lemarchand, Colin Cross, Dmitry Shmidt
In-Reply-To: <CALCETrW_mPHneTvcAM6HyBLCJqcwKmcYEoC+kseyuyQZzj2a=A@mail.gmail.com>

On Tue, Nov 8, 2016 at 4:12 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Tue, Nov 8, 2016 at 4:03 PM, Alexei Starovoitov
> <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> On Tue, Nov 08, 2016 at 03:51:40PM -0800, Andy Lutomirski wrote:
>>>
>>> I hate to say it, but I think I may see a problem.  Current
>>> developments are afoot to make cgroups do more than resource control.
>>> For example, there's Landlock and there's Daniel's ingress/egress
>>> filter thing.  Current cgroup controllers can mostly just DoS their
>>> controlled processes.  These new controllers (or controller-like
>>> things) can exfiltrate data and change semantics.
>>>
>>> Does anyone have a security model in mind for these controllers and
>>> the cgroups that they're attached to?  I'm reasonably confident that
>>> CAP_SYS_RESOURCE is not the answer...
>>
>> and specifically the answer is... ?
>> Also would be great if you start with specifying the question first
>> and the problem you're trying to solve.
>>
>
> I don't have a good answer right now.  Here are some constraints, though:
>
> 1. An insufficiently privileged process should not be able to move a
> victim into a dangerous cgroup.
>
> 2. An insufficiently privileged process should not be able to move
> itself into a dangerous cgroup and then use execve to gain privilege
> such that the execve'd program can be compromised.
>
> 3. An insufficiently privileged process should not be able to make an
> existing cgroup dangerous in a way that could compromise a victim in
> that cgroup.
>
> 4. An insufficiently privileged process should not be able to make a
> cgroup dangerous in a way that bypasses protections that would
> otherwise protect execve() as used by itself or some other process in
> that cgroup.
>
> Keep in mind that "dangerous" may apply to a cgroup's descendents in
> addition to the cgroup being controlled.

Sorry for taking awhile to get back to you here.  I'm a little
befuddled as to what next steps I should consider (and honestly, I'm
not totally sure I really grok your concern here, particularly what
you mean with "dangrous cgroups").

So is going back to the CAP_CGROUP_MIGRATE approach (to properly
separate "sufficiently" from "insufficiently privileged") better?

Or something closer to the original method Android used of each cgroup
having an allow_attach() check which could determine what is
sufficiently privledged for the respective level of danger the cgroup
might poise?

Or just stepping back, what method would you imagine to be reasonable
to allow a specified task to migrate other tasks between cgroups
without it having to be root/suid?

thanks
-john

^ permalink raw reply

* [PATCH net-next 1/2] samples/bpf: fix sockex2 example
From: Alexei Starovoitov @ 2016-11-23  0:52 UTC (permalink / raw)
  To: David S . Miller; +Cc: Daniel Borkmann, netdev

since llvm commit "Do not expand UNDEF SDNode during insn selection lowering"
llvm will generate code that uses uninitialized registers for cases
where C code is actually uses uninitialized data.
So this sockex2 example is technically broken.
Fix it by initializing on the stack variable fully.
Also increase verifier buffer limit, since verifier output
may not fit in 64k for this sockex2 code depending on llvm version.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 samples/bpf/libbpf.h       | 2 +-
 samples/bpf/sockex2_kern.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h
index ac6edb61b64a..de96a935068d 100644
--- a/samples/bpf/libbpf.h
+++ b/samples/bpf/libbpf.h
@@ -18,7 +18,7 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
 int bpf_obj_pin(int fd, const char *pathname);
 int bpf_obj_get(const char *pathname);

-#define LOG_BUF_SIZE 65536
+#define LOG_BUF_SIZE (256 * 1024)
 extern char bpf_log_buf[LOG_BUF_SIZE];

 /* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */
diff --git a/samples/bpf/sockex2_kern.c b/samples/bpf/sockex2_kern.c
index 44e5846c988f..f58acfc92556 100644
--- a/samples/bpf/sockex2_kern.c
+++ b/samples/bpf/sockex2_kern.c
@@ -198,7 +198,7 @@ struct bpf_map_def SEC("maps") hash_map = {
 SEC("socket2")
 int bpf_prog2(struct __sk_buff *skb)
 {
-	struct bpf_flow_keys flow;
+	struct bpf_flow_keys flow = {};
 	struct pair *value;
 	u32 key;

-- 
2.8.0

^ permalink raw reply related

* [PATCH net-next 2/2] samples/bpf: fix bpf loader
From: Alexei Starovoitov @ 2016-11-23  0:52 UTC (permalink / raw)
  To: David S . Miller; +Cc: Daniel Borkmann, netdev
In-Reply-To: <1479862329-2361912-1-git-send-email-ast@fb.com>

llvm can emit relocations into sections other than program code
(like debug info sections). Ignore them during parsing of elf file

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 samples/bpf/bpf_load.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 97913e109b14..62f54d6eb8bf 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -317,6 +317,10 @@ int load_bpf_file(char *path)
 				    &shdr_prog, &data_prog))
 				continue;
 
+			if (shdr_prog.sh_type != SHT_PROGBITS ||
+			    !(shdr_prog.sh_flags & SHF_EXECINSTR))
+				continue;
+
 			insns = (struct bpf_insn *) data_prog->d_buf;
 
 			processed_sec[shdr.sh_info] = true;
-- 
2.8.0

^ permalink raw reply related

* Re: [RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver
From: Jason Gunthorpe @ 2016-11-23  0:49 UTC (permalink / raw)
  To: ira.weiny
  Cc: Vishwanathapura, Niranjana, Doug Ledford, linux-rdma, netdev,
	Dennis Dalessandro
In-Reply-To: <20161123000502.GA27968@phlsvsds.ph.intel.com>

On Tue, Nov 22, 2016 at 07:05:05PM -0500, ira.weiny wrote:
> On Tue, Nov 22, 2016 at 10:04:07AM -0700, Jason Gunthorpe wrote:
> > On Mon, Nov 21, 2016 at 05:53:04PM -0800, Vishwanathapura, Niranjana wrote:
> > > There are many example drivers in kernel which are using bus_register() in
> > > an initcall.
> > 
> > There really are not, certainly not in major subsystems.
> 
> I see 2 drivers in the Block subsystem which do this:
> 
> 
> 19   5354  /nfs/site/home/iweiny/linux-stable/drivers/block/cciss.c <<cciss_init>>
> 	err = bus_register(&cciss_bus_type);
> 20   6447 /nfs/site/home/iweiny/linux-stable/drivers/block/rbd.c <<rbd_sysfs_init>>
> 	ret = bus_register(&rbd_bus_type);
> 
> 2 drivers in the drm subsystem which do this:
> 
> 
> 29   1155  /nfs/site/home/iweiny/linux-stable/drivers/gpu/drm/drm_mipi_dsi.c <<mipi_dsi_bus_init>>
> 	return bus_register(&mipi_dsi_bus_type);
> 30    242 /nfs/site/home/iweiny/linux-stable/drivers/gpu/host1x/dev.c <<tegra_host1x_init>>
> 	err = bus_register(&host1x_bus_type);

IMHO this is all obscure or legacy stuff (eg ccsiss lost its bus when
it was reworked into hpsa). Who knows about that SOC stuff, maybe
there really is a special on-chip bus under those drivers.

The point is using a bus as a generic interconnect between two driver
modules seems very rare, and is not what we have historically ever
done in drivers/infiniband - all our split drivers use a trivial
register scheme. eg see cxgb4_register_uld/mlx4_register_interface/etc.

Should a multi-function driver use a bus or class to connect its
parts? Who knows. Maybe Greg KH/etc has an opinion. But that is not
what we have been doing, it doesn't seem very simplifying, and
this series doesn't even make module auto-loading work...

Since doing this creates a bunch of uapis (again, from a driver, ugh) it
seems like a bad idea without more support as 'the right way'

.. and yes, it would be nice to have a lightweight mechanism to
replace those register functions that could handle module auto loading
too, and maybe that is a 'multi part driver bus/class' or somesuch
... This is really a topic for the device core maintainers, IMHO.

> > > We could add a custom Interface between HFI1 driver and hfi_vnic drivers
> > > without involving a bus.
> > 
> > hfi is already registering on the infiniband class, just use that.
> 
> I don't understand what you mean here?

Get the struct ib_device for the hfi and then do something to get hfi
specific function calls.

Or work it backwards with a _register function..

> [*] As an aside why does the ib_core not use this methodology?  It dawned on
> me that this may be a better way to fix our module load problems.  However, I
> have not looked into details.

ib_core is a class, which is appropriate. RDMA devices are not busses.

Jason

^ permalink raw reply

* Re: [RFC net-next 0/3] net: bridge: Allow CPU port configuration
From: Florian Fainelli @ 2016-11-23  0:24 UTC (permalink / raw)
  To: Jiri Pirko, Andrew Lunn
  Cc: idosch, vivien.didelot, netdev, bridge, Ido Schimmel, jiri, davem
In-Reply-To: <20161122220859.GF1819@nanopsycho>

On 11/22/2016 02:08 PM, Jiri Pirko wrote:
> Tue, Nov 22, 2016 at 06:48:29PM CET, andrew@lunn.ch wrote:
>> Hi Ido
>>
>>> First of all, I want to be sure that when we say "CPU port", we're
>>> talking about the same thing. In mlxsw, the CPU port is a pipe between
>>> the device and the host, through which all packets trapped to the host
>>> go through. So, when a packet is trapped, the driver reads its Rx
>>> descriptor, checks through which port it ingressed, resolves its netdev,
>>> sets skb->dev accordingly and injects it to the Rx path via
>>> netif_receive_skb(). The CPU port itself isn't represented using a
>>> netdev.
>>
>> With DSA, we have a real physical ethernet network interface for the
>> 'cpu' port. It connects to one of the ports of the switch. Frames on
> 
> Every port should be visible as a netdevice, including cpu port.
> Would it make sence to have representors for those?

The CPU port is kind of already visible with DSA since you need the
switch to be attached to a normal Ethernet MAC driver (later referenced
as eth0 for simplicity). Since eth0 is going to potentially receive/send
switch tagged traffic, and the model is to terminate the interfaces at
the port level, this interface does not really have any meaningful use
from a data exchange, apart from multiplexing/demultiplexing switch tags
(when enabled).

If we did create a "cpu" network device, this interface would not be
able to send/receive traffic either, because the per-port network
interfaces are terminated at their level, and the conduit interface is
just used for transmitting/receiving switch tagged traffic. It does have
value as a controlling interface only though.

As a controlling interface, this can be helpful, but we need to decide
which side of the switch this CPU interface would represent, is it the
switch's view of the CPU port, or is the Ethernet MAC view's of the
switch's CPU port, attached to it (especially true with discrete switch
chips).

If we did use eth0 as a controlling interface, we need to somehow be
able to overload (in an objected oriented fashioned) the netdev_ops,
ethtool_ops and switchdev_ops for that interface so as to make it
participate in the switch configuration (we actually do this already for
ethtool statistics, but this is ugly).
-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next 1/1] ipv6: sr: add option to control lwtunnel support
From: Alexei Starovoitov @ 2016-11-23  0:16 UTC (permalink / raw)
  To: David Miller
  Cc: david.lebrun, netdev@vger.kernel.org, Lorenzo Colitti,
	Roopa Prabhu, Eric Dumazet

On Wed, Nov 16, 2016 at 8:32 AM, David Miller <davem@davemloft.net> wrote:
> From: David Lebrun <david.lebrun@uclouvain.be>
> Date: Tue, 15 Nov 2016 16:14:04 +0100
>
>> This patch adds a new option CONFIG_IPV6_SEG6_LWTUNNEL to enable/disable
>> support of encapsulation with the lightweight tunnels. When this option
>> is enabled, CONFIG_LWTUNNEL is automatically selected.
>>
>> Fix commit 6c8702c60b88 ("ipv6: sr: add support for SRH encapsulation and injection with lwtunnels")
>>
>> Without a proper option to control lwtunnel support for SR-IPv6, if
>> CONFIG_LWTUNNEL=n then the IPv6 initialization fails as a consequence
>> of seg6_iptunnel_init() failure with EOPNOTSUPP:
>>
>> NET: Registered protocol family 10
>> IPv6: Attempt to unregister permanent protocol 6
>> IPv6: Attempt to unregister permanent protocol 136
>> IPv6: Attempt to unregister permanent protocol 17
>> NET: Unregistered protocol family 10
>>
>> Tested (compiling, booting, and loading ipv6 module when relevant)
>> with possible combinations of CONFIG_IPV6={y,m,n},
>> CONFIG_IPV6_SEG6_LWTUNNEL={y,n} and CONFIG_LWTUNNEL={y,n}.
>>
>> Reported-by: Lorenzo Colitti <lorenzo@google.com>
>> Suggested-by: Roopa Prabhu <roopa@cumulusnetworks.com>
>> Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
>
> Applied.

ipv6 seems to be still broken in the latest net-next
when CONFIG_LWTUNNEL is not set:
# ping 127.0.0.1
ping: socket: Address family not supported by protocol
# ping -4 127.0.0.1
PING localhost.localdomain (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost.localdomain (127.0.0.1): icmp_seq=1 ttl=64 time=0.067 ms

it works with CONFIG_LWTUNNEL=y

Roopa, David, please take a look.

Thanks!

^ permalink raw reply

* Re: [RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver
From: Andrew Lunn @ 2016-11-23  0:06 UTC (permalink / raw)
  To: Vishwanathapura, Niranjana
  Cc: Jason Gunthorpe, Doug Ledford, linux-rdma, netdev,
	Dennis Dalessandro
In-Reply-To: <20161122194918.GA69241@knc-06.sc.intel.com>

On Tue, Nov 22, 2016 at 11:49:18AM -0800, Vishwanathapura, Niranjana wrote:
> Ok, I do understand Jason's point that we should probably not put
> this driver under drivers/infiniband/sw/.., as this driver is not a
> HCA.
> It is an ULP similar to ipoib, built on top of Omni-path
> irrespective of whether we register a hfi_vnic_bus or a direct
> custom interface with HFI1.
> This ULP will transmit and recieve Omni-path packets over the
> fabric, and is dependent on IB MAD interface and the HFI1 driver.
> 
> Doug,
> Will it be acceptable if we put it under 'drivers/infiniband/ulp/hfi_vnic'?

How about turning this whole discussion around. 

This is a network driver. So ask the network Maintainers where he
wants it. Send the patch to David Miller <davem@davemloft.net> and
netdev with the question, where does this code belong?

	Andrew

^ permalink raw reply

* Re: [RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver
From: ira.weiny @ 2016-11-23  0:05 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Vishwanathapura, Niranjana, Doug Ledford,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	Dennis Dalessandro
In-Reply-To: <20161122170407.GE3956-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

On Tue, Nov 22, 2016 at 10:04:07AM -0700, Jason Gunthorpe wrote:
> On Mon, Nov 21, 2016 at 05:53:04PM -0800, Vishwanathapura, Niranjana wrote:
> > There are many example drivers in kernel which are using bus_register() in
> > an initcall.
> 
> There really are not, certainly not in major subsystems.

I see 2 drivers in the Block subsystem which do this:


19   5354  /nfs/site/home/iweiny/linux-stable/drivers/block/cciss.c <<cciss_init>>
	err = bus_register(&cciss_bus_type);
20   6447 /nfs/site/home/iweiny/linux-stable/drivers/block/rbd.c <<rbd_sysfs_init>>
	ret = bus_register(&rbd_bus_type);

2 drivers in the drm subsystem which do this:


29   1155  /nfs/site/home/iweiny/linux-stable/drivers/gpu/drm/drm_mipi_dsi.c <<mipi_dsi_bus_init>>
	return bus_register(&mipi_dsi_bus_type);
30    242 /nfs/site/home/iweiny/linux-stable/drivers/gpu/host1x/dev.c <<tegra_host1x_init>>
	err = bus_register(&host1x_bus_type);

And I think there are a couple others.

I'm not sure what these devices/buses do but they are registering their own bus
while being in another major subsystem.  Is what we are doing really so
crazy/wrong?


>
> > We could add a custom Interface between HFI1 driver and hfi_vnic drivers
> > without involving a bus.
> 
> hfi is already registering on the infiniband class, just use that.
> 

I don't understand what you mean here?

The bus_register provides a really clean way for the hfi1 driver and hfi_vnic
driver to find each other.  This includes being able to support hfi1 with or
without hfi_vnic being loaded.  Note that without configuration from the "EM"
Ethernet Manager the hfi_vnic does not export a net device.

Why wouldn't we use this core kernel support?[*]

> > But using the existing bus model gave a lot of in-built flexibility in
> > decoupling devices from the drivers.
> 
> If you want to have your own bus then you need your own hfi
> subsystem. drivers/infiniband is not a dumping ground..
> 

We don't consider drivers/infiniband a "dumping ground".  There is a
requirement on ib_mad from the hfi_vnic driver.

Ira

[*] As an aside why does the ib_core not use this methodology?  It dawned on
me that this may be a better way to fix our module load problems.  However, I
have not looked into details.

> Jason
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH net-next] mlx4: reorganize struct mlx4_en_tx_ring
From: Eric Dumazet @ 2016-11-22 23:56 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Tariq Toukan

From: Eric Dumazet <edumazet@google.com>

Goal is to reorganize this critical structure to increase performance.

ndo_start_xmit() should only dirty one cache line, and access as few
cache lines as possible.

Add sp_ (Slow Path) prefix to fields that are not used in fast path,
to make clear what is going on.

After this patch pahole reports something much better, as all
ndo_start_xmit() needed fields are packed into two cache lines instead
of seven or eight

struct mlx4_en_tx_ring {
	u32                        last_nr_txbb;         /*     0   0x4 */
	u32                        cons;                 /*   0x4   0x4 */
	long unsigned int          wake_queue;           /*   0x8   0x8 */
	struct netdev_queue *      tx_queue;             /*  0x10   0x8 */
	u32                        (*free_tx_desc)(struct mlx4_en_priv *, struct mlx4_en_tx_ring *, int, u8, u64, int); /*  0x18   0x8 */
	struct mlx4_en_rx_ring *   recycle_ring;         /*  0x20   0x8 */

	/* XXX 24 bytes hole, try to pack */

	/* --- cacheline 1 boundary (64 bytes) --- */
	u32                        prod;                 /*  0x40   0x4 */
	unsigned int               tx_dropped;           /*  0x44   0x4 */
	long unsigned int          bytes;                /*  0x48   0x8 */
	long unsigned int          packets;              /*  0x50   0x8 */
	long unsigned int          tx_csum;              /*  0x58   0x8 */
	long unsigned int          tso_packets;          /*  0x60   0x8 */
	long unsigned int          xmit_more;            /*  0x68   0x8 */
	struct mlx4_bf             bf;                   /*  0x70  0x18 */
	/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
	__be32                     doorbell_qpn;         /*  0x88   0x4 */
	__be32                     mr_key;               /*  0x8c   0x4 */
	u32                        size;                 /*  0x90   0x4 */
	u32                        size_mask;            /*  0x94   0x4 */
	u32                        full_size;            /*  0x98   0x4 */
	u32                        buf_size;             /*  0x9c   0x4 */
	void *                     buf;                  /*  0xa0   0x8 */
	struct mlx4_en_tx_info *   tx_info;              /*  0xa8   0x8 */
	int                        qpn;                  /*  0xb0   0x4 */
	u8                         queue_index;          /*  0xb4   0x1 */
	bool                       bf_enabled;           /*  0xb5   0x1 */
	bool                       bf_alloced;           /*  0xb6   0x1 */
	u8                         hwtstamp_tx_type;     /*  0xb7   0x1 */
	u8 *                       bounce_buf;           /*  0xb8   0x8 */
	/* --- cacheline 3 boundary (192 bytes) --- */
	long unsigned int          queue_stopped;        /*  0xc0   0x8 */
	struct mlx4_hwq_resources  sp_wqres;             /*  0xc8  0x58 */
	/* --- cacheline 4 boundary (256 bytes) was 32 bytes ago --- */
	struct mlx4_qp             sp_qp;                /* 0x120  0x30 */
	/* --- cacheline 5 boundary (320 bytes) was 16 bytes ago --- */
	struct mlx4_qp_context     sp_context;           /* 0x150  0xf8 */
	/* --- cacheline 9 boundary (576 bytes) was 8 bytes ago --- */
	cpumask_t                  sp_affinity_mask;     /* 0x248  0x20 */
	enum mlx4_qp_state         sp_qp_state;          /* 0x268   0x4 */
	u16                        sp_stride;            /* 0x26c   0x2 */
	u16                        sp_cqn;               /* 0x26e   0x2 */

	/* size: 640, cachelines: 10, members: 36 */
	/* sum members: 600, holes: 1, sum holes: 24 */
	/* padding: 16 */
};

Instead of this silly placement :

struct mlx4_en_tx_ring {
	u32                        last_nr_txbb;         /*     0   0x4 */
	u32                        cons;                 /*   0x4   0x4 */
	long unsigned int          wake_queue;           /*   0x8   0x8 */

	/* XXX 48 bytes hole, try to pack */

	/* --- cacheline 1 boundary (64 bytes) --- */
	u32                        prod;                 /*  0x40   0x4 */

	/* XXX 4 bytes hole, try to pack */

	long unsigned int          bytes;                /*  0x48   0x8 */
	long unsigned int          packets;              /*  0x50   0x8 */
	long unsigned int          tx_csum;              /*  0x58   0x8 */
	long unsigned int          tso_packets;          /*  0x60   0x8 */
	long unsigned int          xmit_more;            /*  0x68   0x8 */
	unsigned int               tx_dropped;           /*  0x70   0x4 */

	/* XXX 4 bytes hole, try to pack */

	struct mlx4_bf             bf;                   /*  0x78  0x18 */
	/* --- cacheline 2 boundary (128 bytes) was 16 bytes ago --- */
	long unsigned int          queue_stopped;        /*  0x90   0x8 */
	cpumask_t                  affinity_mask;        /*  0x98  0x10 */
	struct mlx4_qp             qp;                   /*  0xa8  0x30 */
	/* --- cacheline 3 boundary (192 bytes) was 24 bytes ago --- */
	struct mlx4_hwq_resources  wqres;                /*  0xd8  0x58 */
	/* --- cacheline 4 boundary (256 bytes) was 48 bytes ago --- */
	u32                        size;                 /* 0x130   0x4 */
	u32                        size_mask;            /* 0x134   0x4 */
	u16                        stride;               /* 0x138   0x2 */

	/* XXX 2 bytes hole, try to pack */

	u32                        full_size;            /* 0x13c   0x4 */
	/* --- cacheline 5 boundary (320 bytes) --- */
	u16                        cqn;                  /* 0x140   0x2 */

	/* XXX 2 bytes hole, try to pack */

	u32                        buf_size;             /* 0x144   0x4 */
	__be32                     doorbell_qpn;         /* 0x148   0x4 */
	__be32                     mr_key;               /* 0x14c   0x4 */
	void *                     buf;                  /* 0x150   0x8 */
	struct mlx4_en_tx_info *   tx_info;              /* 0x158   0x8 */
	struct mlx4_en_rx_ring *   recycle_ring;         /* 0x160   0x8 */
	u32                        (*free_tx_desc)(struct mlx4_en_priv *, struct mlx4_en_tx_ring *, int, u8, u64, int); /* 0x168   0x8 */
	u8 *                       bounce_buf;           /* 0x170   0x8 */
	struct mlx4_qp_context     context;              /* 0x178  0xf8 */
	/* --- cacheline 9 boundary (576 bytes) was 48 bytes ago --- */
	int                        qpn;                  /* 0x270   0x4 */
	enum mlx4_qp_state         qp_state;             /* 0x274   0x4 */
	u8                         queue_index;          /* 0x278   0x1 */
	bool                       bf_enabled;           /* 0x279   0x1 */
	bool                       bf_alloced;           /* 0x27a   0x1 */

	/* XXX 5 bytes hole, try to pack */

	/* --- cacheline 10 boundary (640 bytes) --- */
	struct netdev_queue *      tx_queue;             /* 0x280   0x8 */
	int                        hwtstamp_tx_type;     /* 0x288   0x4 */

	/* size: 704, cachelines: 11, members: 36 */
	/* sum members: 587, holes: 6, sum holes: 65 */
	/* padding: 52 */
};


Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |    2 
 drivers/net/ethernet/mellanox/mlx4/en_tx.c     |   48 +++++++--------
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |   42 +++++++------
 3 files changed, 48 insertions(+), 44 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 9a807e93c9fdd81e61e561208aa1480a244d0bdb..9018bb1b2e12142e048281a9d28ddf95e0023a61 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -1305,7 +1305,7 @@ static void mlx4_en_tx_timeout(struct net_device *dev)
 		if (!netif_tx_queue_stopped(netdev_get_tx_queue(dev, i)))
 			continue;
 		en_warn(priv, "TX timeout on queue: %d, QP: 0x%x, CQ: 0x%x, Cons: 0x%x, Prod: 0x%x\n",
-			i, tx_ring->qpn, tx_ring->cqn,
+			i, tx_ring->qpn, tx_ring->sp_cqn,
 			tx_ring->cons, tx_ring->prod);
 	}
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 5de3cbe24f2bf467f9a8f7d499e131b6d2a1844c..4b597dca5c52d114344d638895275ed0d378bd96 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -66,7 +66,7 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
 
 	ring->size = size;
 	ring->size_mask = size - 1;
-	ring->stride = stride;
+	ring->sp_stride = stride;
 	ring->full_size = ring->size - HEADROOM - MAX_DESC_TXBBS;
 
 	tmp = size * sizeof(struct mlx4_en_tx_info);
@@ -90,22 +90,22 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
 			goto err_info;
 		}
 	}
-	ring->buf_size = ALIGN(size * ring->stride, MLX4_EN_PAGE_SIZE);
+	ring->buf_size = ALIGN(size * ring->sp_stride, MLX4_EN_PAGE_SIZE);
 
 	/* Allocate HW buffers on provided NUMA node */
 	set_dev_node(&mdev->dev->persist->pdev->dev, node);
-	err = mlx4_alloc_hwq_res(mdev->dev, &ring->wqres, ring->buf_size);
+	err = mlx4_alloc_hwq_res(mdev->dev, &ring->sp_wqres, ring->buf_size);
 	set_dev_node(&mdev->dev->persist->pdev->dev, mdev->dev->numa_node);
 	if (err) {
 		en_err(priv, "Failed allocating hwq resources\n");
 		goto err_bounce;
 	}
 
-	ring->buf = ring->wqres.buf.direct.buf;
+	ring->buf = ring->sp_wqres.buf.direct.buf;
 
 	en_dbg(DRV, priv, "Allocated TX ring (addr:%p) - buf:%p size:%d buf_size:%d dma:%llx\n",
 	       ring, ring->buf, ring->size, ring->buf_size,
-	       (unsigned long long) ring->wqres.buf.direct.map);
+	       (unsigned long long) ring->sp_wqres.buf.direct.map);
 
 	err = mlx4_qp_reserve_range(mdev->dev, 1, 1, &ring->qpn,
 				    MLX4_RESERVE_ETH_BF_QP);
@@ -114,12 +114,12 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
 		goto err_hwq_res;
 	}
 
-	err = mlx4_qp_alloc(mdev->dev, ring->qpn, &ring->qp, GFP_KERNEL);
+	err = mlx4_qp_alloc(mdev->dev, ring->qpn, &ring->sp_qp, GFP_KERNEL);
 	if (err) {
 		en_err(priv, "Failed allocating qp %d\n", ring->qpn);
 		goto err_reserve;
 	}
-	ring->qp.event = mlx4_en_sqp_event;
+	ring->sp_qp.event = mlx4_en_sqp_event;
 
 	err = mlx4_bf_alloc(mdev->dev, &ring->bf, node);
 	if (err) {
@@ -141,7 +141,7 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
 	if (queue_index < priv->num_tx_rings_p_up)
 		cpumask_set_cpu(cpumask_local_spread(queue_index,
 						     priv->mdev->dev->numa_node),
-				&ring->affinity_mask);
+				&ring->sp_affinity_mask);
 
 	*pring = ring;
 	return 0;
@@ -149,7 +149,7 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
 err_reserve:
 	mlx4_qp_release_range(mdev->dev, ring->qpn, 1);
 err_hwq_res:
-	mlx4_free_hwq_res(mdev->dev, &ring->wqres, ring->buf_size);
+	mlx4_free_hwq_res(mdev->dev, &ring->sp_wqres, ring->buf_size);
 err_bounce:
 	kfree(ring->bounce_buf);
 	ring->bounce_buf = NULL;
@@ -171,10 +171,10 @@ void mlx4_en_destroy_tx_ring(struct mlx4_en_priv *priv,
 
 	if (ring->bf_alloced)
 		mlx4_bf_free(mdev->dev, &ring->bf);
-	mlx4_qp_remove(mdev->dev, &ring->qp);
-	mlx4_qp_free(mdev->dev, &ring->qp);
+	mlx4_qp_remove(mdev->dev, &ring->sp_qp);
+	mlx4_qp_free(mdev->dev, &ring->sp_qp);
 	mlx4_qp_release_range(priv->mdev->dev, ring->qpn, 1);
-	mlx4_free_hwq_res(mdev->dev, &ring->wqres, ring->buf_size);
+	mlx4_free_hwq_res(mdev->dev, &ring->sp_wqres, ring->buf_size);
 	kfree(ring->bounce_buf);
 	ring->bounce_buf = NULL;
 	kvfree(ring->tx_info);
@@ -190,7 +190,7 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv,
 	struct mlx4_en_dev *mdev = priv->mdev;
 	int err;
 
-	ring->cqn = cq;
+	ring->sp_cqn = cq;
 	ring->prod = 0;
 	ring->cons = 0xffffffff;
 	ring->last_nr_txbb = 1;
@@ -198,21 +198,21 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv,
 	memset(ring->buf, 0, ring->buf_size);
 	ring->free_tx_desc = mlx4_en_free_tx_desc;
 
-	ring->qp_state = MLX4_QP_STATE_RST;
-	ring->doorbell_qpn = cpu_to_be32(ring->qp.qpn << 8);
+	ring->sp_qp_state = MLX4_QP_STATE_RST;
+	ring->doorbell_qpn = cpu_to_be32(ring->sp_qp.qpn << 8);
 	ring->mr_key = cpu_to_be32(mdev->mr.key);
 
-	mlx4_en_fill_qp_context(priv, ring->size, ring->stride, 1, 0, ring->qpn,
-				ring->cqn, user_prio, &ring->context);
+	mlx4_en_fill_qp_context(priv, ring->size, ring->sp_stride, 1, 0, ring->qpn,
+				ring->sp_cqn, user_prio, &ring->sp_context);
 	if (ring->bf_alloced)
-		ring->context.usr_page =
+		ring->sp_context.usr_page =
 			cpu_to_be32(mlx4_to_hw_uar_index(mdev->dev,
 							 ring->bf.uar->index));
 
-	err = mlx4_qp_to_ready(mdev->dev, &ring->wqres.mtt, &ring->context,
-			       &ring->qp, &ring->qp_state);
-	if (!cpumask_empty(&ring->affinity_mask))
-		netif_set_xps_queue(priv->dev, &ring->affinity_mask,
+	err = mlx4_qp_to_ready(mdev->dev, &ring->sp_wqres.mtt, &ring->sp_context,
+			       &ring->sp_qp, &ring->sp_qp_state);
+	if (!cpumask_empty(&ring->sp_affinity_mask))
+		netif_set_xps_queue(priv->dev, &ring->sp_affinity_mask,
 				    ring->queue_index);
 
 	return err;
@@ -223,8 +223,8 @@ void mlx4_en_deactivate_tx_ring(struct mlx4_en_priv *priv,
 {
 	struct mlx4_en_dev *mdev = priv->mdev;
 
-	mlx4_qp_modify(mdev->dev, NULL, ring->qp_state,
-		       MLX4_QP_STATE_RST, NULL, 0, 0, &ring->qp);
+	mlx4_qp_modify(mdev->dev, NULL, ring->sp_qp_state,
+		       MLX4_QP_STATE_RST, NULL, 0, 0, &ring->sp_qp);
 }
 
 static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring)
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index eff21651b67308a17fe3d60d236cd0b6800a3fd2..574bcbb1b38fc4758511d8f7bd17a87b0a507a73 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -281,46 +281,50 @@ struct mlx4_en_tx_ring {
 	u32			last_nr_txbb;
 	u32			cons;
 	unsigned long		wake_queue;
+	struct netdev_queue	*tx_queue;
+	u32			(*free_tx_desc)(struct mlx4_en_priv *priv,
+						struct mlx4_en_tx_ring *ring,
+						int index, u8 owner,
+						u64 timestamp, int napi_mode);
+	struct mlx4_en_rx_ring	*recycle_ring;
 
 	/* cache line used and dirtied in mlx4_en_xmit() */
 	u32			prod ____cacheline_aligned_in_smp;
+	unsigned int		tx_dropped;
 	unsigned long		bytes;
 	unsigned long		packets;
 	unsigned long		tx_csum;
 	unsigned long		tso_packets;
 	unsigned long		xmit_more;
-	unsigned int		tx_dropped;
 	struct mlx4_bf		bf;
-	unsigned long		queue_stopped;
 
 	/* Following part should be mostly read */
-	cpumask_t		affinity_mask;
-	struct mlx4_qp		qp;
-	struct mlx4_hwq_resources wqres;
+	__be32			doorbell_qpn;
+	__be32			mr_key;
 	u32			size; /* number of TXBBs */
 	u32			size_mask;
-	u16			stride;
 	u32			full_size;
-	u16			cqn;	/* index of port CQ associated with this ring */
 	u32			buf_size;
-	__be32			doorbell_qpn;
-	__be32			mr_key;
 	void			*buf;
 	struct mlx4_en_tx_info	*tx_info;
-	struct mlx4_en_rx_ring	*recycle_ring;
-	u32			(*free_tx_desc)(struct mlx4_en_priv *priv,
-						struct mlx4_en_tx_ring *ring,
-						int index, u8 owner,
-						u64 timestamp, int napi_mode);
-	u8			*bounce_buf;
-	struct mlx4_qp_context	context;
 	int			qpn;
-	enum mlx4_qp_state	qp_state;
 	u8			queue_index;
 	bool			bf_enabled;
 	bool			bf_alloced;
-	struct netdev_queue	*tx_queue;
-	int			hwtstamp_tx_type;
+	u8			hwtstamp_tx_type;
+	u8			*bounce_buf;
+
+	/* Not used in fast path
+	 * Only queue_stopped might be used if BQL is not properly working.
+	 */
+	unsigned long		queue_stopped;
+	struct mlx4_hwq_resources sp_wqres;
+	struct mlx4_qp		sp_qp;
+	struct mlx4_qp_context	sp_context;
+	cpumask_t		sp_affinity_mask;
+	enum mlx4_qp_state	sp_qp_state;
+	u16			sp_stride;
+	u16			sp_cqn;	/* index of port CQ associated with this ring */
 } ____cacheline_aligned_in_smp;
 
 struct mlx4_en_rx_desc {

^ permalink raw reply related

* [PATCH ethtool] ethtool: Fix the "advertise" parameter logic.
From: Michael Chan @ 2016-11-22 23:55 UTC (permalink / raw)
  To: linville; +Cc: netdev

From: Michael Chan <mchan@broadcom.com>

The current code ignores the value of the advertise parameter.  For example,

ethtool -s ethx advertise 0x1000

The full_advertising_wanted parameter of 0x1000 is not passed to the kernel.
The reason is that advertising_wanted is NULL in this case, and ethtool
will think that the user has given no advertisement input and so it will
proceed to pass all supported advertisement speeds to the kernel.

The older legacy ethtool with similar logic worked because
advertising_wanted was an integer and could take on -1 and 0.  It would pass
the full_advertising_wanted value if advertising_wanted == -1.

This fix is to pass all supported advertisement speeds only when both
advertising_wanted == NULL && full_advertising_wanted == NULL.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
---
 ethtool.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/ethtool.c b/ethtool.c
index 49ac94e..7715823 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -2971,7 +2971,8 @@ static int do_sset(struct cmd_context *ctx)
 				fprintf(stderr,	"\n");
 			}
 			if (autoneg_wanted == AUTONEG_ENABLE &&
-			    advertising_wanted == NULL) {
+			    advertising_wanted == NULL &&
+			    full_advertising_wanted == NULL) {
 				unsigned int i;

 				/* Auto negotiation enabled, but with
-- 
1.8.4.5

^ permalink raw reply related

* Re: [RFC PATCH net-next] net: ethtool: add support for forward error correction modes
From: Casey Leedom @ 2016-11-22 23:41 UTC (permalink / raw)
  To: netdev@vger.kernel.org
In-Reply-To: <DM5PR12MB1786073763933EAE09E36978C8B40@DM5PR12MB1786.namprd12.prod.outlook.com>

  And by the way, we currently have two ethtool APIs which pump in an Auto-Negotiation indication -- set_link_ksettings() and set_pauseparam().  Now we're talking about adding a third, set_fecparam().  Are all of the calls to these three APIs supposed to agree on the concept of Auto-Negotiations?  I.e. what's it mean if set_link_ksettings() gets called with link_ksettings->base.autoneg == AUTONEG_ENABLE but set_pauseparam() gets called with epause->autoneg == AUTONEG_DISABLE?  And now adding set_fecparam() into the system with a similar ability to specify the state of Auto-Negotiation is even more confusing.

Casey

^ permalink raw reply

* [PATCH RFC v1] ethtool: implement helper to get flow_type value
From: Jacob Keller @ 2016-11-22 23:44 UTC (permalink / raw)
  To: netdev, Intel Wired LAN; +Cc: Jacob Keller

Often a driver wants to store the flow type and thus it must mask the
extra fields. This is a task that could grow more complex as more flags
are added in the future. Add a helper function that masks the flags for
marking additional fields.

Modify drivers in drivers/net/ethernet that currently check for FLOW_EXT
and FLOW_MAC_EXT to use the helper. Currently this is only the mellanox
drivers.

I chose not to modify other drivers as I'm actually unsure whether we
should always mask the flow type even for drivers which don't recognize
the newer flags. On the one hand, today's drivers (generally)
automatically fail when a new flag is used because they won't mask it
and their checks against flow_type will not match. On the other hand, it
means another place that you have to update when you begin implementing
a flag.

An alternative is to have the driver store a set of flags that it knows
about, and then have ethtool core do the check for us to discard frames.
I haven't implemented this quite yet.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
---
I plan on using this helper when fixing the mask code for ntuple filters
in the Intel i40e driver. I wanted to see whether this approach was
acceptable, and whether we should implement additional checks. The
primary reason is that today's drivers are "fail closed" in that a new
flag type will probably fail on drivers due to checking for flow types
they recognize. Since drivers only remove the masked bits they recognize
this works. However, this gets cumbersome if new additional flags get
added in the future. I would like some sort of helper, but if we
encourage its use, and a new flag gets added, the helper will then
unforunately make the driver "fail open" in that a new flag will get
ignored as the driver won't know to return -EINVAL.

I think the right solution will be to add some sort of checks in core
ethtool which we can basically set the recognized flags in some way for
all drivers such that the ethtool core can drop requests for flows with
unknown flag types. I'm unsure how to implement this though.

Thoughts?

 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c         |  4 ++--
 drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c |  6 +++---
 include/uapi/linux/ethtool.h                            | 11 ++++++++---
 3 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index 487a58f9c192..d8f9839ce2a3 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -1270,7 +1270,7 @@ static int mlx4_en_validate_flow(struct net_device *dev,
 			return -EINVAL;
 	}
 
-	switch (cmd->fs.flow_type & ~(FLOW_EXT | FLOW_MAC_EXT)) {
+	switch (ethtool_get_flow_spec_type(cmd->fs.flow_type)) {
 	case TCP_V4_FLOW:
 	case UDP_V4_FLOW:
 		if (cmd->fs.m_u.tcp_ip4_spec.tos)
@@ -1493,7 +1493,7 @@ static int mlx4_en_ethtool_to_net_trans_rule(struct net_device *dev,
 	if (err)
 		return err;
 
-	switch (cmd->fs.flow_type & ~(FLOW_EXT | FLOW_MAC_EXT)) {
+	switch (ethtool_get_flow_spec_type(cmd->fs.flow_type)) {
 	case ETHER_FLOW:
 		spec_l2 = kzalloc(sizeof(*spec_l2), GFP_KERNEL);
 		if (!spec_l2)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c
index 3691451c728c..066e6c5cf38b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c
@@ -63,7 +63,7 @@ static struct mlx5e_ethtool_table *get_flow_table(struct mlx5e_priv *priv,
 	int table_size;
 	int prio;
 
-	switch (fs->flow_type & ~(FLOW_EXT | FLOW_MAC_EXT)) {
+	switch (ethtool_get_flow_spec_type(fs->flow_type)) {
 	case TCP_V4_FLOW:
 	case UDP_V4_FLOW:
 		max_tuples = ETHTOOL_NUM_L3_L4_FTS;
@@ -147,7 +147,7 @@ static int set_flow_attrs(u32 *match_c, u32 *match_v,
 					     outer_headers);
 	void *outer_headers_v = MLX5_ADDR_OF(fte_match_param, match_v,
 					     outer_headers);
-	u32 flow_type = fs->flow_type & ~(FLOW_EXT | FLOW_MAC_EXT);
+	u32 flow_type = ethtool_get_flow_spec_type(fs->flow_type);
 	struct ethtool_tcpip4_spec *l4_mask;
 	struct ethtool_tcpip4_spec *l4_val;
 	struct ethtool_usrip4_spec *l3_mask;
@@ -393,7 +393,7 @@ static int validate_flow(struct mlx5e_priv *priv,
 	    fs->ring_cookie != RX_CLS_FLOW_DISC)
 		return -EINVAL;
 
-	switch (fs->flow_type & ~(FLOW_EXT | FLOW_MAC_EXT)) {
+	switch (ethtool_get_flow_spec_type(fs->flow_type)) {
 	case ETHER_FLOW:
 		eth_mask = &fs->m_u.ether_spec;
 		if (!is_zero_ether_addr(eth_mask->h_dest))
diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index f0db7788f887..e6d7d2aea56c 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -880,6 +880,14 @@ struct ethtool_rx_flow_spec {
 	__u32		location;
 };
 
+/* Flag to enable additional fields in struct ethtool_rx_flow_spec */
+#define	FLOW_EXT	0x80000000
+#define	FLOW_MAC_EXT	0x40000000
+static inline __u32 ethtool_get_flow_spec_type(__u32 flow_type)
+{
+	return flow_type & (FLOW_EXT | FLOW_MAC_EXT);
+}
+
 /* How rings are layed out when accessing virtual functions or
  * offloaded queues is device specific. To allow users to do flow
  * steering and specify these queues the ring cookie is partitioned
@@ -1579,9 +1587,6 @@ static inline int ethtool_validate_duplex(__u8 duplex)
 #define	IPV4_FLOW	0x10	/* hash only */
 #define	IPV6_FLOW	0x11	/* hash only */
 #define	ETHER_FLOW	0x12	/* spec only (ether_spec) */
-/* Flag to enable additional fields in struct ethtool_rx_flow_spec */
-#define	FLOW_EXT	0x80000000
-#define	FLOW_MAC_EXT	0x40000000
 
 /* L3-L4 network traffic flow hash options */
 #define	RXH_L2DA	(1 << 1)
-- 
2.11.0.rc2.152.g4d04e67

^ permalink raw reply related

* Re: [PATCH net-next] net/sched: cls_flower: verify root pointer before dereferncing it
From: John Fastabend @ 2016-11-22 23:36 UTC (permalink / raw)
  To: Daniel Borkmann, Cong Wang, Jiri Pirko
  Cc: Roi Dayan, David S. Miller, Linux Kernel Network Developers,
	Jiri Pirko, Or Gerlitz, Cong Wang
In-Reply-To: <5834AD97.1020600@iogearbox.net>

On 16-11-22 12:41 PM, Daniel Borkmann wrote:
> On 11/22/2016 08:28 PM, Cong Wang wrote:
>> On Tue, Nov 22, 2016 at 8:11 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Tue, Nov 22, 2016 at 05:04:11PM CET, daniel@iogearbox.net wrote:
>>>> Hmm, I don't think we want to have such an additional test in fast
>>>> path for each and every classifier. Can we think of ways to avoid that?
>>>>
>>>> My question is, since we unlink individual instances from such
>>>> tp-internal
>>>> lists through RCU and release the instance through call_rcu() as
>>>> well as
>>>> the head (tp->root) via kfree_rcu() eventually, against what are we
>>>> protecting
>>>> setting RCU_INIT_POINTER(tp->root, NULL) in ->destroy() callback?
>>>> Something
>>>> not respecting grace period?
>>>
>>> If you call tp->ops->destroy in call_rcu, you don't have to set tp->root
>>> to null.
> 
> But that's not really an answer to my question. ;)
> 
>> We do need to respect the grace period if we touch the globally visible
>> data structure tp in tcf_destroy(). Therefore Roi's patch is not
>> fixing the
>> right place.
> 
> I think there may be multiple issues actually.
> 
> At the time we go into tc_classify(), from ingress as well as egress side,
> we're under RCU, but BH variant. In cls delete()/destroy() callbacks, we
> everywhere use call_rcu() and kfree_rcu(), same as for tcf_destroy() where
> we use kfree_rcu() on tp, although we iterate tps (and implicitly inner
> filters)
> via rcu_dereference_bh() from reader side. Is there a reason why we don't
> use call_rcu_bh() variant on destruction for all this instead?

I can't think of any if its all under _bh we can convert the call_rcu to
call_rcu_bh it just needs an audit.

> 
> Just looking at cls_bpf and others, what protects
> RCU_INIT_POINTER(tp->root,
> NULL) against? The tp is unlinked in tc_ctl_tfilter() from the tp chain in
> tcf_destroy() cases. Still active readers under RCU BH can race against
> this
> (tp->root being NULL), as the commit identified. Only the get() callback
> checks
> for head against NULL, but both are serialized under rtnl, and the only
> place
> we call this is tc_ctl_tfilter(). Even if we create a new tp, head
> should not
> be NULL there, if it was assigned during the init() cb, but contains an
> empty
> list. (It's different for things like cls_cgroup, though.) So, I'm
> wondering
> if the RCU_INIT_POINTER(tp->root, NULL) can just be removed instead
> (unless I'm
> missing something obvious)?


Just took a look at this I think there are a couple possible solutions.
The easiest is likely to fix all the call sites so that 'tp' is unlinked
before calling the destroy() handlers AND not doing the NULL set. I only
see one such call site where destroy is called before unlinking at the
moment. This should enforce that after a grace period there is no path
to reach the classifiers because 'tp' is unlinked. Calling destroy
before unlinking 'tp' however could cause a small race between grace
period of 'tp' and grace period of the filter.

Another would be to only call the destroy path from the call_rcu path
of the 'tp' object so that destroy is only ever called after the object
is guaranteed to be unlinked from the tc_filter path.

I think both solutions would be fine.

Cong were you working on one of these? Or do you have another idea?


> 
>> Also I don't know why you blame my commit, this problem should already
>> exist prior to my commit, probably date back to John's RCU patches.
> 
> It seems so.

^ permalink raw reply

* Re: [net] 34fad54c25: kernel BUG at include/linux/skbuff.h:1935!
From: Linus Torvalds @ 2016-11-22 23:30 UTC (permalink / raw)
  To: Eric Dumazet, Andre Noll
  Cc: kernel test robot, David Miller, LKP, LKML, Alexei Starovoitov,
	Willem de Bruijn, Alexander Duyck, Network Development
In-Reply-To: <CANn89i+B8F8thJRCdNPHA=BddTTxHCHz+xkvT_wuZ+b8osoTVQ@mail.gmail.com>

On Tue, Nov 22, 2016 at 2:28 PM, Eric Dumazet <edumazet@google.com> wrote:
>
> This is fixed by :
> https://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=c9b8af1330198ae241cd545e1f040019010d44d9

Thanks guys. This was one of the less esoteric-looking regressions, so
I'm happy to hear it's solved.

             Linus

^ permalink raw reply

* Re: [RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver
From: Christoph Lameter @ 2016-11-22 23:04 UTC (permalink / raw)
  To: Vishwanathapura, Niranjana
  Cc: Jason Gunthorpe, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Dennis Dalessandro
In-Reply-To: <20161122194918.GA69241-wPcXA7LoDC+1XWohqUldA0EOCMrvLtNR@public.gmane.org>

On Tue, 22 Nov 2016, Vishwanathapura, Niranjana wrote:

> Ok, I do understand Jason's point that we should probably not put this driver
> under drivers/infiniband/sw/.., as this driver is not a HCA.
> It is an ULP similar to ipoib, built on top of Omni-path irrespective of
> whether we register a hfi_vnic_bus or a direct custom interface with HFI1.
> This ULP will transmit and recieve Omni-path packets over the fabric, and is
> dependent on IB MAD interface and the HFI1 driver.

This is something that encapsulates IP (v4 right?) in something else.
Would belong into

	linux/net/ipv4

You already have similar implementations there

See f.e. ipip.c, ip_tunnel.c and lots more (try
	ls linux/net/ipv4/*tunnel*

)

If this is more like a device then it would belong into

linux/drivers/net/hfi or so (see also linux/drivers/net/ppp, plip,
loopback, etc etc)



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [net] 34fad54c25: kernel BUG at include/linux/skbuff.h:1935!
From: Andre Noll @ 2016-11-22 22:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kernel test robot, David Miller, Eric Dumazet, LKP, LKML,
	Alexei Starovoitov, Willem de Bruijn, Alexander Duyck,
	Network Development
In-Reply-To: <CA+55aFxV7Bq583QOdYauuo2jY9EkAmgnceBukrN27ArjzFszYg@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 771 bytes --]

On Tue, Nov 22, 14:04, Linus Torvalds wrote
>  what's the situation on this issue? The bisection looks a bit odd,
> but the commit in question does end up changing the key_control->thoff
> value for the failure case, so maybe that in turn ends up screwing up
> a later skb_pull.
> 
> I'm not seeing anything that might fix this in the last networking
> pull, but I may have missed something.

I think that's the bug Eric has fixed today. See thread

	[PATCH net] flow_dissect: call init_default_flow_dissectors() earlier

David has queued up the fix and will send it your way shortly.

Andre
-- 
Max Planck Institute for Developmental Biology
Spemannstraße 35, 72076 Tübingen, Germany. Phone: (+49) 7071 601 829
http://people.tuebingen.mpg.de/maan/

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply

* Re: [net] 34fad54c25: kernel BUG at include/linux/skbuff.h:1935!
From: Eric Dumazet @ 2016-11-22 22:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kernel test robot, David Miller, LKP, LKML, Alexei Starovoitov,
	Willem de Bruijn, Alexander Duyck, Network Development
In-Reply-To: <CA+55aFxV7Bq583QOdYauuo2jY9EkAmgnceBukrN27ArjzFszYg@mail.gmail.com>

On Tue, Nov 22, 2016 at 2:04 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> David, Eric,
>
>  what's the situation on this issue? The bisection looks a bit odd,
> but the commit in question does end up changing the key_control->thoff
> value for the failure case, so maybe that in turn ends up screwing up
> a later skb_pull.
>
> I'm not seeing anything that might fix this in the last networking
> pull, but I may have missed something.
>
> I also noticed that the kernel test robot had screwed up the
> participants list for some reason, and had
>
>   "Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>, David S.
> Miller" <davem@davemloft.net>
>
> as one of the participants. So there's some odd commit parsing issue
> there somewhere. But Alexander seems to have seen this report despite
> that, it just never went anywhere that I can tell.
>
>                 Linus
>

This is fixed by :
https://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=c9b8af1330198ae241cd545e1f040019010d44d9

Thanks

^ permalink raw reply

* Re: [PATCH] net: dsa: mv88e6xxx: egress all frames
From: Vivien Didelot @ 2016-11-22 22:15 UTC (permalink / raw)
  To: Andrew Lunn, Stefan Eichenberger; +Cc: Stefan Eichenberger, f.fainelli, netdev
In-Reply-To: <20161122190206.GE14947@lunn.ch>

Hi Andrew, Stefan,

Andrew Lunn <andrew@lunn.ch> writes:

> What you might find useful is
>
> https://github.com/vivien/linux.git 161b96bd7d16d21b0f046c935b70c3b2d277ccc2
>
> although it might need some changes for recent commits.
>
> With that, you can see deeper into the switches registers.

FYI, I have rebased it on top of the latest net-next (f9aa9dc7d2d0):

    https://github.com/vivien/linux.git dsa/dev

Thanks,

        Vivien

^ permalink raw reply

* Re: [PATCH 2/2] net: qcom/emac: add support for the Qualcomm Technologies QDF2400
From: Timur Tabi @ 2016-11-22 22:14 UTC (permalink / raw)
  To: David Miller, alokc, netdev
In-Reply-To: <1479769102-27909-3-git-send-email-timur@codeaurora.org>

On 11/21/2016 04:58 PM, Timur Tabi wrote:
> The QDF2432 and the QDF2400 have slightly different internal PHYs,
> so there are some programming differences.  Some of the registers in
> the QDF2400 have moved, and some registers require different values
> during initialization.
>
> Because of the differences, the internal PHY on the QDF2400 has a new
> ACPI HID, QCOM8072.
>
> Signed-off-by: Timur Tabi<timur@codeaurora.org>

There seems to be some disagreement internally as to whether a new HID 
is the right approach.  Please hold off on applying patch [2/2] for now.

Patch [1/2] can be applied, however, if it passes review.

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply

* Re: [RFC net-next 0/3] net: bridge: Allow CPU port configuration
From: Jiri Pirko @ 2016-11-22 22:08 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Ido Schimmel, Florian Fainelli, netdev, davem, bridge, stephen,
	vivien.didelot, jiri, idosch
In-Reply-To: <20161122174829.GD14947@lunn.ch>

Tue, Nov 22, 2016 at 06:48:29PM CET, andrew@lunn.ch wrote:
>Hi Ido
> 
>> First of all, I want to be sure that when we say "CPU port", we're
>> talking about the same thing. In mlxsw, the CPU port is a pipe between
>> the device and the host, through which all packets trapped to the host
>> go through. So, when a packet is trapped, the driver reads its Rx
>> descriptor, checks through which port it ingressed, resolves its netdev,
>> sets skb->dev accordingly and injects it to the Rx path via
>> netif_receive_skb(). The CPU port itself isn't represented using a
>> netdev.
>
>With DSA, we have a real physical ethernet network interface for the
>'cpu' port. It connects to one of the ports of the switch. Frames on

Every port should be visible as a netdevice, including cpu port.
Would it make sence to have representors for those?

>this interface have an extra header, indicating which switch port it
>came from, and we do a similar resolving it to a slave netdev, strip
>of the header and injecting it into the receiver path via
>netif_receive_skb().
>
>	Andrew

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox