Netdev List
 help / color / mirror / Atom feed
* [PATCH net-next v2 2/2] enic: add vmname port-profile handler
From: Govindarajulu Varadarajan @ 2014-10-02 22:41 UTC (permalink / raw)
  To: davem, netdev; +Cc: ssujith, benve, Govindarajulu Varadarajan
In-Reply-To: <1412289683-8278-1-git-send-email-_govind@gmx.com>

This patch adds support to read vmname from port-profile netlink message and
send it to firmware.

Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
---
 drivers/net/ethernet/cisco/enic/enic.h      | 2 ++
 drivers/net/ethernet/cisco/enic/enic_main.c | 6 ++++++
 drivers/net/ethernet/cisco/enic/enic_pp.c   | 5 +++++
 3 files changed, 13 insertions(+)

diff --git a/drivers/net/ethernet/cisco/enic/enic.h b/drivers/net/ethernet/cisco/enic/enic.h
index 5ba5ad0..0b63227 100644
--- a/drivers/net/ethernet/cisco/enic/enic.h
+++ b/drivers/net/ethernet/cisco/enic/enic.h
@@ -88,6 +88,7 @@ struct enic_rx_coal {
 #define ENIC_SET_NAME			(1 << 2)
 #define ENIC_SET_INSTANCE		(1 << 3)
 #define ENIC_SET_HOST			(1 << 4)
+#define ENIC_SET_VMNAME			(1 << 5)
 
 struct enic_port_profile {
 	u32 set;
@@ -97,6 +98,7 @@ struct enic_port_profile {
 	u8 host_uuid[PORT_UUID_MAX];
 	u8 vf_mac[ETH_ALEN];
 	u8 mac_addr[ETH_ALEN];
+	char vmname[PORT_PROFILE_MAX];
 };
 
 /* enic_rfs_fltr_node - rfs filter node in hash table
diff --git a/drivers/net/ethernet/cisco/enic/enic_main.c b/drivers/net/ethernet/cisco/enic/enic_main.c
index 929bfe7..6c7cd1b 100644
--- a/drivers/net/ethernet/cisco/enic/enic_main.c
+++ b/drivers/net/ethernet/cisco/enic/enic_main.c
@@ -831,6 +831,12 @@ static int enic_set_vf_port(struct net_device *netdev, int vf,
 			nla_data(port[IFLA_PORT_HOST_UUID]), PORT_UUID_MAX);
 	}
 
+	if (port[IFLA_PORT_VMNAME]) {
+		pp->set |= ENIC_SET_VMNAME;
+		memcpy(pp->vmname,
+		       nla_data(port[IFLA_PORT_VMNAME]), PORT_PROFILE_MAX);
+	}
+
 	if (vf == PORT_SELF_VF) {
 		/* Special case handling: mac came from IFLA_VF_MAC */
 		if (!is_zero_ether_addr(prev_pp.vf_mac))
diff --git a/drivers/net/ethernet/cisco/enic/enic_pp.c b/drivers/net/ethernet/cisco/enic/enic_pp.c
index e6a8319..c2b620f 100644
--- a/drivers/net/ethernet/cisco/enic/enic_pp.c
+++ b/drivers/net/ethernet/cisco/enic/enic_pp.c
@@ -128,6 +128,11 @@ static int enic_set_port_profile(struct enic *enic, int vf)
 			sizeof(uuid_str), uuid_str);
 	}
 
+	if (pp->set & ENIC_SET_VMNAME) {
+		VIC_PROVINFO_ADD_TLV(vp, VIC_GENERIC_PROV_TLV_CLIENT_NAME_STR,
+				     sizeof(pp->vmname), pp->vmname);
+	}
+
 	VIC_PROVINFO_ADD_TLV(vp,
 		VIC_GENERIC_PROV_TLV_OS_TYPE,
 		sizeof(os_type), &os_type);
-- 
2.1.0

^ permalink raw reply related

* [PATCH net-next v2 1/2] if_link: add client name to port profile
From: Govindarajulu Varadarajan @ 2014-10-02 22:41 UTC (permalink / raw)
  To: davem, netdev; +Cc: ssujith, benve, Govindarajulu Varadarajan
In-Reply-To: <1412289683-8278-1-git-send-email-_govind@gmx.com>

This patch adds client name to port profile.

This is used by netlink client to send the client name in port profile.

Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
---
 include/uapi/linux/if_link.h | 1 +
 net/core/rtnetlink.c         | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 0bdb77e..6ae0b0b 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -512,6 +512,7 @@ enum {
 	IFLA_PORT_HOST_UUID,		/* binary UUID */
 	IFLA_PORT_REQUEST,		/* __u8 */
 	IFLA_PORT_RESPONSE,		/* __u16, output only */
+	IFLA_PORT_VMNAME,		/* vm-name used by port profile */
 	__IFLA_PORT_MAX,
 };
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index a688268..116d647 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1238,6 +1238,8 @@ static const struct nla_policy ifla_port_policy[IFLA_PORT_MAX+1] = {
 				    .len = PORT_UUID_MAX },
 	[IFLA_PORT_REQUEST]	= { .type = NLA_U8, },
 	[IFLA_PORT_RESPONSE]	= { .type = NLA_U16, },
+	[IFLA_PORT_VMNAME]	= { .type = NLA_STRING,
+				    .len = PORT_PROFILE_MAX },
 };
 
 static int rtnl_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
-- 
2.1.0

^ permalink raw reply related

* [PATCH net-next v2 0/2] enic: add support to send port-profile's client name to firmware
From: Govindarajulu Varadarajan @ 2014-10-02 22:41 UTC (permalink / raw)
  To: davem, netdev; +Cc: ssujith, benve, Govindarajulu Varadarajan

Firmware has support for sending client name of the port profile to the switch
it's connected to.

This series adds client name to port profile which is sent to hardware while
associating a port profile to VF.

Since port profile are defined in switch, this patch makes it easier to check
what VM is using a port profile.

v2:
Split single patch to multiple patches. First patch adds client name to netlink
facility. Second patch adds driver support.

Govindarajulu Varadarajan (2):
  if_link: add client name to port profile
  enic: add vmname port-profile handler

 drivers/net/ethernet/cisco/enic/enic.h      | 2 ++
 drivers/net/ethernet/cisco/enic/enic_main.c | 6 ++++++
 drivers/net/ethernet/cisco/enic/enic_pp.c   | 5 +++++
 include/uapi/linux/if_link.h                | 1 +
 net/core/rtnetlink.c                        | 2 ++
 5 files changed, 16 insertions(+)

-- 
2.1.0

^ permalink raw reply

* Re: macvlan: optimizing the receive path?
From: Stephen Hemminger @ 2014-10-02 21:31 UTC (permalink / raw)
  To: Jason Baron; +Cc: netdev, kaber
In-Reply-To: <542DB55D.3090601@akamai.com>

On Thu, 02 Oct 2014 16:28:13 -0400
Jason Baron <jbaron@akamai.com> wrote:

> Hi,
> 
> I was just wondering why the netif_rx(skb) call in macvlan_handle_frame()
> was necessary? IE:


It is to prevent too deep a call stack of
  netif_receive_skb
     macvlan_receive
        netif_receive_skb ...

^ permalink raw reply

* Re: [RFC PATCH linux 2/2] fs/proc: use a hash table for the directory entries
From: Stephen Hemminger @ 2014-10-02 21:27 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Alexey Dobriyan, Nicolas Dichtel, netdev, linux-kernel, davem,
	akpm, rui.xiang, viro, oleg, gorcunov, kirill.shutemov,
	grant.likely, tytso, Thierry Herbelot
In-Reply-To: <87h9zmji3q.fsf@x220.int.ebiederm.org>

On Thu, 02 Oct 2014 14:07:37 -0700
ebiederm@xmission.com (Eric W. Biederman) wrote:

> Alexey Dobriyan <adobriyan@gmail.com> writes:
> 
> > On Thu, Oct 02, 2014 at 11:01:50AM -0700, Eric W. Biederman wrote:
> >> Nicolas Dichtel <nicolas.dichtel@6wind.com> writes:
> >> 
> >> > From: Thierry Herbelot <thierry.herbelot@6wind.com>
> >> >
> >> > The current implementation for the directories in /proc is using a single
> >> > linked list. This is slow when handling directories with large numbers of
> >> > entries (eg netdevice-related entries when lots of tunnels are opened).
> >> >
> >> > This patch enables multiple linked lists. A hash based on the entry name is
> >> > used to select the linked list for one given entry.
> >> >
> >> > The speed creation of netdevices is faster as shorter linked lists must be
> >> > scanned when adding a new netdevice.
> >> 
> >> Is the directory of primary concern /proc/net/dev/snmp6 ?
> >> 
> >> Unless I have configured my networking stack weird by mistake that
> >> is the only directory under /proc/net that grows when we add an
> >> interface.
> >> 
> >> I just want to make certain I am seeing the same things that you are
> >> seeing.
> >> 
> >> I feel silly for overlooking this directory when the rest of the
> >> scalability work was done.
> >
> > Slowdown comes from "duplicate name" check:
> >
> >         for (tmp = dir->subdir; tmp; tmp = tmp->next)
> >                 if (strcmp(tmp->name, dp->name) == 0) {
> >                         WARN(1, "proc_dir_entry '%s/%s' already registered\n",
> >                                 dir->name, dp->name);
> >                         break;
> >                 }
> >
> > Removal can be made O(1) after switching to doubly-linked list.
> 
> Yes.  There is the however unfortunate fact that proc directories exist
> to be used.  If we don't switch to a better data structure than a linked
> list the actual use will then opening of the files under
> /proc/net/dev/snmp6/ will become O(N^2).  Which doesn't help much
> (assuming those files are good for something).
> 
> If those files aren't actually useful we should just make registering
> them a config option.  Deprecate them strongly and let only people who
> need extreme backwards compatibility enable them.

Net-snmp uses them (agent/mibgroup/mibII/kernel_linux.c)

^ permalink raw reply

* Re: [RFC PATCH linux 2/2] fs/proc: use a hash table for the directory entries
From: Eric W. Biederman @ 2014-10-02 21:07 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Nicolas Dichtel, netdev, linux-kernel, davem, akpm, rui.xiang,
	viro, oleg, gorcunov, kirill.shutemov, grant.likely, tytso,
	Thierry Herbelot
In-Reply-To: <20141002200639.GA3497@p183.telecom.by>

Alexey Dobriyan <adobriyan@gmail.com> writes:

> On Thu, Oct 02, 2014 at 11:01:50AM -0700, Eric W. Biederman wrote:
>> Nicolas Dichtel <nicolas.dichtel@6wind.com> writes:
>> 
>> > From: Thierry Herbelot <thierry.herbelot@6wind.com>
>> >
>> > The current implementation for the directories in /proc is using a single
>> > linked list. This is slow when handling directories with large numbers of
>> > entries (eg netdevice-related entries when lots of tunnels are opened).
>> >
>> > This patch enables multiple linked lists. A hash based on the entry name is
>> > used to select the linked list for one given entry.
>> >
>> > The speed creation of netdevices is faster as shorter linked lists must be
>> > scanned when adding a new netdevice.
>> 
>> Is the directory of primary concern /proc/net/dev/snmp6 ?
>> 
>> Unless I have configured my networking stack weird by mistake that
>> is the only directory under /proc/net that grows when we add an
>> interface.
>> 
>> I just want to make certain I am seeing the same things that you are
>> seeing.
>> 
>> I feel silly for overlooking this directory when the rest of the
>> scalability work was done.
>
> Slowdown comes from "duplicate name" check:
>
>         for (tmp = dir->subdir; tmp; tmp = tmp->next)
>                 if (strcmp(tmp->name, dp->name) == 0) {
>                         WARN(1, "proc_dir_entry '%s/%s' already registered\n",
>                                 dir->name, dp->name);
>                         break;
>                 }
>
> Removal can be made O(1) after switching to doubly-linked list.

Yes.  There is the however unfortunate fact that proc directories exist
to be used.  If we don't switch to a better data structure than a linked
list the actual use will then opening of the files under
/proc/net/dev/snmp6/ will become O(N^2).  Which doesn't help much
(assuming those files are good for something).

If those files aren't actually useful we should just make registering
them a config option.  Deprecate them strongly and let only people who
need extreme backwards compatibility enable them.

Alexey do you know that those files aren't useful?  Unless we know
otherwise we should make those files useful.

Eric

^ permalink raw reply

* Re: [PATCH net-next 0/2] sunvnet: Packet processing in non-interrupt context.
From: David Miller @ 2014-10-02 20:43 UTC (permalink / raw)
  To: sowmini.varadhan; +Cc: raghuram.kothakota, netdev
In-Reply-To: <20141002201203.GA6001@oracle.com>

From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Thu, 2 Oct 2014 16:12:03 -0400

> The patch is attached to the end of this email.

There are an explosion of simplifications and optimizations
possilbe once you've done a NAPI conversion, which I haven't
seen you perform here in this patch.

For example, you can now move everything into software IRQ context,
just disable the VIO interrupt and unconditionally go into NAPI
context from the VIO event.

No more irqsave/irqrestore.

Then the TX path even can run mostly lockless, it just needs
to hold the VIO lock for a minute period of time.  The caller
holds the xmit_lock of the network device to prevent re-entry
into the ->ndo_start_xmit() path.

Really, what you've done here as a NAPI conversion is just the
beginning.

^ permalink raw reply

* macvlan: optimizing the receive path?
From: Jason Baron @ 2014-10-02 20:28 UTC (permalink / raw)
  To: netdev; +Cc: kaber

Hi,

I was just wondering why the netif_rx(skb) call in macvlan_handle_frame()
was necessary? IE:

macvlan_handle_frame()
{
       ........

        skb->dev = dev;
        skb->pkt_type = PACKET_HOST;

****>ret = netif_rx(skb);

out:
        macvlan_count_rx(vlan, len, ret == NET_RX_SUCCESS, 0);
        return RX_HANDLER_CONSUMED;
}


I think the point of going through netif_rx() is to ensure that we throttle
incoming packets, but hasn't that already been accomplished in this path?
That is if the packets are arriving from the physical NIC, we've already
throttled them by this point. Otherwise, if they are coming via
macvlan_queue_xmit(), it calls either 'dev_forward_skb()', which ends
up calling netif_rx_internal(), or else in broadcast mode there is
to be throttling via macvlan_broadcast_enqueue().

So I suspect there is a code path that I am missing but the netif_rx() call in
question essentially re-queues packets coming from off the box. I've tried the
simple patch below to optimize this path, and obviously performs a lot better
in my limited testing.

Thanks,

-Jason

--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -321,8 +321,8 @@ static rx_handler_result_t macvlan_handle_frame(struct sk_buff **pskb)
        skb->dev = dev;
        skb->pkt_type = PACKET_HOST;
 
-       ret = netif_rx(skb);
-
+      macvlan_count_rx(vlan, len, true, 0);
+      return RX_HANDLER_ANOTHER;
 out:
        macvlan_count_rx(vlan, len, ret == NET_RX_SUCCESS, 0);
        return RX_HANDLER_CONSUMED;

^ permalink raw reply

* Re: [PATCH net-next 0/2] sunvnet: Packet processing in non-interrupt context.
From: Sowmini Varadhan @ 2014-10-02 20:12 UTC (permalink / raw)
  To: David Miller; +Cc: raghuram.kothakota, netdev
In-Reply-To: <20141001.162529.2246298941833907545.davem@davemloft.net>

On (10/01/14 16:25), David Miller wrote:
> 
> The limit is by default 64 packets, it won't matter.
> 
> I think you're overplaying the things that block use of NAPI, how
> about implementing it properly, using netif_gso_receive() and proper
> RCU accesses, and coming back with some real performance measurements?

I hope I did not give the impression that I've cut some corners and
did not do adequate testing to get here, because that is not the case.

I dont know what s/netif_skb_receive/napi_gro_receive has to do with it -  
but I resurrected my napi prototype, caught up with Jumbo MTU patc,
and replaced netif_receive_skb with napi_gro_receive.

The patch is attached to the end of this email. "Real performance
measurements" are below.

Afaict, the patch is quite "proper" - I was following
   http://www.linuxfoundation.org/collaborate/workgroups/networking/napi -
and the patch even goes to a lot of trouble to avoid sending needless
ldc messages arising from some napi-imposed budget. Here's the perf
data. Remember that packet size is 1500 bytes, so, e.g., 2 Gbps is approx
167k pps. Also the baseline perf today (without napi) is 1 - 1.3 Gbps.

     budget      iperf throughput
      64           336 Mbps
     128           556 Mbps
     512           398 Mbps

If I over-budget to 2048, and force my vnet_poll() to lie by returning
`budget', I can get an iperf throughput of approx 2.2 - 2.3 Gbps
for 1500 byte packets i.e., 167k pps. Yes, I violate NAPI rules in doing
this, and from reading the code, this forces me to a non-polling, 
pure-softirq mode. But this is also the best number I can get.

And as for mpstat output it comes out wit 100% of the softirqs on
2 cpus- something like this:

CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
all    0.00    0.00    0.57    0.06    0.00   12.67    0.00    0.00   86.70
0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
4    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
5    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
6    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
7    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
8    0.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00   99.00
9    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00
10    0.00    0.00    1.98    0.00    0.00    0.00    0.00    0.00   98.02
11    0.00    0.00    0.99    0.00    0.00    0.00    0.00    0.00   99.01
12    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00
13    0.00    0.00    3.00    0.00    0.00    0.00    0.00    0.00   97.00
14    0.00    0.00    2.56    0.00    0.00    0.00    0.00    0.00   97.44
15    0.00    0.00    1.00    1.00    0.00    0.00    0.00    0.00   98.00

Whereas with the workq, the load was spread nicely across multiple cpus.
I can share "Real performance data" for that as well, if you are curious.

Some differences between sunvnet-ldc and the typical network driver
that might be causing the perf drop here:

- The biggest benefit of NAPI is that it permits the reading of multiple
  packets in the context of a single interrupt, but the ldc/sunvnet infra
  already does that anyway. So the extra polling offered by NAPI does
  not have a significant benefit for my test- I can just as easily
  achieve load-spreading and fare-share in a non-interrupt context with
  a workq/kthread?

- But the VDC driver is also different from the typical driver in the
  "STOPPED" message- usually drivers only get signalled when the producer
  publishes data, the consumer does not send any signal back to producer,
  though the VDC driver does the latter. I've had to add more state-tracking
  code to get around this. 
 
Note that I am *not* attempting to fix the vnet race condition here-
that one is a pre-existing condition that I caught by merely reading
the code (I can easily look the other way), and my patch does not make
it worse.  Let's discuss that one later.  

NAPI Patch follows. Please tell me what's improper about it.

---------------------------------------------------------------------

diff --git a/drivers/net/ethernet/sun/sunvnet.c b/drivers/net/ethernet/sun/sunvnet.c
index 1539672..da05d68 100644
--- a/drivers/net/ethernet/sun/sunvnet.c
+++ b/drivers/net/ethernet/sun/sunvnet.c
@@ -33,6 +33,9 @@
 #define DRV_MODULE_VERSION	"1.0"
 #define DRV_MODULE_RELDATE	"June 25, 2007"
 
+/* #define	VNET_BUDGET	2048 */	 /* see comments in vnet_poll() */
+#define	VNET_BUDGET	64	 /*  typical budget */
+
 static char version[] =
 	DRV_MODULE_NAME ".c:v" DRV_MODULE_VERSION " (" DRV_MODULE_RELDATE ")\n";
 MODULE_AUTHOR("David S. Miller (davem@davemloft.net)");
@@ -274,6 +277,7 @@ static struct sk_buff *alloc_and_align_skb(struct net_device *dev,
 	return skb;
 }
 
+/* reads in exactly one sk_buff */
 static int vnet_rx_one(struct vnet_port *port, unsigned int len,
 		       struct ldc_trans_cookie *cookies, int ncookies)
 {
@@ -311,9 +315,7 @@ static int vnet_rx_one(struct vnet_port *port, unsigned int len,
 
 	dev->stats.rx_packets++;
 	dev->stats.rx_bytes += len;
-
-	netif_rx(skb);
-
+	napi_gro_receive(&port->napi, skb);
 	return 0;
 
 out_free_skb:
@@ -444,6 +446,7 @@ static int vnet_walk_rx_one(struct vnet_port *port,
 	       desc->cookies[0].cookie_addr,
 	       desc->cookies[0].cookie_size);
 
+	/* read in one packet */
 	err = vnet_rx_one(port, desc->size, desc->cookies, desc->ncookies);
 	if (err == -ECONNRESET)
 		return err;
@@ -456,10 +459,11 @@ static int vnet_walk_rx_one(struct vnet_port *port,
 }
 
 static int vnet_walk_rx(struct vnet_port *port, struct vio_dring_state *dr,
-			u32 start, u32 end)
+			u32 start, u32 end, int *npkts, int budget)
 {
 	struct vio_driver_state *vio = &port->vio;
 	int ack_start = -1, ack_end = -1;
+	int send_ack = 1;
 
 	end = (end == (u32) -1) ? prev_idx(start, dr) : next_idx(end, dr);
 
@@ -471,6 +475,7 @@ static int vnet_walk_rx(struct vnet_port *port, struct vio_dring_state *dr,
 			return err;
 		if (err != 0)
 			break;
+		(*npkts)++;
 		if (ack_start == -1)
 			ack_start = start;
 		ack_end = start;
@@ -482,13 +487,28 @@ static int vnet_walk_rx(struct vnet_port *port, struct vio_dring_state *dr,
 				return err;
 			ack_start = -1;
 		}
+		if ((*npkts) >= budget ) {
+			send_ack = 0;
+			break;
+		}
 	}
 	if (unlikely(ack_start == -1))
 		ack_start = ack_end = prev_idx(start, dr);
-	return vnet_send_ack(port, dr, ack_start, ack_end, VIO_DRING_STOPPED);
+	if (send_ack) {
+		int ret;
+		port->napi_resume = false;
+		ret = vnet_send_ack(port, dr, ack_start, ack_end,
+				     VIO_DRING_STOPPED);
+		return ret;
+	} else  {
+		port->napi_resume = true;
+		port->napi_stop_idx = ack_end;
+		return (56);
+	}
 }
 
-static int vnet_rx(struct vnet_port *port, void *msgbuf)
+static int vnet_rx(struct vnet_port *port, void *msgbuf, int *npkts,
+		   int budget)
 {
 	struct vio_dring_data *pkt = msgbuf;
 	struct vio_dring_state *dr = &port->vio.drings[VIO_DRIVER_RX_RING];
@@ -505,11 +525,13 @@ static int vnet_rx(struct vnet_port *port, void *msgbuf)
 		return 0;
 	}
 
-	dr->rcv_nxt++;
+	if (!port->napi_resume)
+		dr->rcv_nxt++;
 
 	/* XXX Validate pkt->start_idx and pkt->end_idx XXX */
 
-	return vnet_walk_rx(port, dr, pkt->start_idx, pkt->end_idx);
+	return vnet_walk_rx(port, dr, pkt->start_idx, pkt->end_idx,
+			    npkts, budget);
 }
 
 static int idx_is_pending(struct vio_dring_state *dr, u32 end)
@@ -534,7 +556,10 @@ static int vnet_ack(struct vnet_port *port, void *msgbuf)
 	struct net_device *dev;
 	struct vnet *vp;
 	u32 end;
+	unsigned long flags;
 	struct vio_net_desc *desc;
+	bool need_trigger = false;
+
 	if (unlikely(pkt->tag.stype_env != VIO_DRING_DATA))
 		return 0;
 
@@ -545,21 +570,17 @@ static int vnet_ack(struct vnet_port *port, void *msgbuf)
 	/* sync for race conditions with vnet_start_xmit() and tell xmit it
 	 * is time to send a trigger.
 	 */
+	spin_lock_irqsave(&port->vio.lock, flags);
 	dr->cons = next_idx(end, dr);
 	desc = vio_dring_entry(dr, dr->cons);
-	if (desc->hdr.state == VIO_DESC_READY && port->start_cons) {
-		/* vnet_start_xmit() just populated this dring but missed
-		 * sending the "start" LDC message to the consumer.
-		 * Send a "start" trigger on its behalf.
-		 */
-		if (__vnet_tx_trigger(port, dr->cons) > 0)
-			port->start_cons = false;
-		else
-			port->start_cons = true;
-	} else {
-		port->start_cons = true;
-	}
+	if (desc->hdr.state == VIO_DESC_READY && !port->start_cons)
+		need_trigger = true;
+	else
+		port->start_cons = true; /* vnet_start_xmit will send trigger */
+	spin_unlock_irqrestore(&port->vio.lock, flags);
 
+	if (need_trigger && __vnet_tx_trigger(port, dr->cons) <= 0)
+		port->start_cons = true;
 
 	vp = port->vp;
 	dev = vp->dev;
@@ -617,33 +638,12 @@ static void maybe_tx_wakeup(unsigned long param)
 	netif_tx_unlock(dev);
 }
 
-static void vnet_event(void *arg, int event)
+static int vnet_event_napi(struct vnet_port *port, int budget)
 {
-	struct vnet_port *port = arg;
 	struct vio_driver_state *vio = &port->vio;
-	unsigned long flags;
 	int tx_wakeup, err;
 
-	spin_lock_irqsave(&vio->lock, flags);
-
-	if (unlikely(event == LDC_EVENT_RESET ||
-		     event == LDC_EVENT_UP)) {
-		vio_link_state_change(vio, event);
-		spin_unlock_irqrestore(&vio->lock, flags);
-
-		if (event == LDC_EVENT_RESET) {
-			port->rmtu = 0;
-			vio_port_up(vio);
-		}
-		return;
-	}
-
-	if (unlikely(event != LDC_EVENT_DATA_READY)) {
-		pr_warn("Unexpected LDC event %d\n", event);
-		spin_unlock_irqrestore(&vio->lock, flags);
-		return;
-	}
-
+	int npkts = 0;
 	tx_wakeup = err = 0;
 	while (1) {
 		union {
@@ -651,6 +651,20 @@ static void vnet_event(void *arg, int event)
 			u64 raw[8];
 		} msgbuf;
 
+		if (port->napi_resume) {
+			struct vio_dring_data *pkt =
+				(struct vio_dring_data *)&msgbuf;
+			struct vio_dring_state *dr =
+				&port->vio.drings[VIO_DRIVER_RX_RING];
+
+			pkt->tag.type = VIO_TYPE_DATA;
+			pkt->tag.stype = VIO_SUBTYPE_INFO;
+			pkt->tag.stype_env = VIO_DRING_DATA;
+			pkt->seq = dr->rcv_nxt;
+			pkt->start_idx = next_idx(port->napi_stop_idx, dr);
+			pkt->end_idx = -1;
+			goto napi_resume;
+		}
 		err = ldc_read(vio->lp, &msgbuf, sizeof(msgbuf));
 		if (unlikely(err < 0)) {
 			if (err == -ECONNRESET)
@@ -667,10 +681,12 @@ static void vnet_event(void *arg, int event)
 		err = vio_validate_sid(vio, &msgbuf.tag);
 		if (err < 0)
 			break;
-
+napi_resume:
 		if (likely(msgbuf.tag.type == VIO_TYPE_DATA)) {
 			if (msgbuf.tag.stype == VIO_SUBTYPE_INFO) {
-				err = vnet_rx(port, &msgbuf);
+				err = vnet_rx(port, &msgbuf, &npkts, budget);
+				if (npkts >= budget || npkts == 0)
+					break;
 			} else if (msgbuf.tag.stype == VIO_SUBTYPE_ACK) {
 				err = vnet_ack(port, &msgbuf);
 				if (err > 0)
@@ -691,15 +707,54 @@ static void vnet_event(void *arg, int event)
 		if (err == -ECONNRESET)
 			break;
 	}
-	spin_unlock(&vio->lock);
-	/* Kick off a tasklet to wake the queue.  We cannot call
-	 * maybe_tx_wakeup directly here because we could deadlock on
-	 * netif_tx_lock() with dev_watchdog()
-	 */
 	if (unlikely(tx_wakeup && err != -ECONNRESET))
 		tasklet_schedule(&port->vp->vnet_tx_wakeup);
+	return npkts;
+}
 
+static int vnet_poll(struct napi_struct *napi, int budget)
+{
+	struct vnet_port *port = container_of(napi, struct vnet_port, napi);
+	struct vio_driver_state *vio = &port->vio;
+	int processed = vnet_event_napi(port, budget);
+
+	if (processed < budget) {
+		napi_complete(napi);
+		napi_reschedule(napi);
+		vio_set_intr(vio->vdev->rx_ino, HV_INTR_ENABLED);
+	}
+	/* return budget; */ /* liar! but better throughput! */
+	return processed;
+}
+
+static void vnet_event(void *arg, int event)
+{
+	struct vnet_port *port = arg;
+	struct vio_driver_state *vio = &port->vio;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vio->lock, flags);
+
+	if (unlikely(event == LDC_EVENT_RESET ||
+		     event == LDC_EVENT_UP)) {
+		vio_link_state_change(vio, event);
+		spin_unlock_irqrestore(&vio->lock, flags);
+
+		if (event == LDC_EVENT_RESET)
+			vio_port_up(vio);
+		return;
+	}
+
+	if (unlikely(event != LDC_EVENT_DATA_READY)) {
+		pr_warning("Unexpected LDC event %d\n", event);
+		spin_unlock_irqrestore(&vio->lock, flags);
+		return;
+	}
+
+	spin_unlock(&vio->lock);
 	local_irq_restore(flags);
+	vio_set_intr(vio->vdev->rx_ino, HV_INTR_DISABLED);
+	napi_schedule(&port->napi);
 }
 
 static int __vnet_tx_trigger(struct vnet_port *port, u32 start)
@@ -1342,6 +1397,22 @@ err_out:
 	return err;
 }
 
+#ifdef CONFIG_NET_POLL_CONTROLLER
+static void vnet_poll_controller(struct net_device *dev)
+{
+	struct vnet *vp = netdev_priv(dev);
+	struct vnet_port *port;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vp->lock, flags);
+	if (!list_empty(&vp->port_list)) {
+		port = list_entry(vp->port_list.next, struct vnet_port, list);
+		napi_schedule(&port->napi);
+	}
+	spin_unlock_irqrestore(&vp->lock, flags);
+
+}
+#endif
 static LIST_HEAD(vnet_list);
 static DEFINE_MUTEX(vnet_list_mutex);
 
@@ -1354,6 +1425,9 @@ static const struct net_device_ops vnet_ops = {
 	.ndo_tx_timeout		= vnet_tx_timeout,
 	.ndo_change_mtu		= vnet_change_mtu,
 	.ndo_start_xmit		= vnet_start_xmit,
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	.ndo_poll_controller	= vnet_poll_controller,
+#endif
 };
 
 static struct vnet *vnet_new(const u64 *local_mac)
@@ -1536,6 +1610,9 @@ static int vnet_port_probe(struct vio_dev *vdev, const struct vio_device_id *id)
 	if (err)
 		goto err_out_free_port;
 
+	netif_napi_add(port->vp->dev, &port->napi, vnet_poll, VNET_BUDGET);
+	napi_enable(&port->napi);
+
 	err = vnet_port_alloc_tx_bufs(port);
 	if (err)
 		goto err_out_free_ldc;
@@ -1592,6 +1669,8 @@ static int vnet_port_remove(struct vio_dev *vdev)
 		del_timer_sync(&port->vio.timer);
 		del_timer_sync(&port->clean_timer);
 
+		napi_disable(&port->napi);
+
 		spin_lock_irqsave(&vp->lock, flags);
 		list_del(&port->list);
 		hlist_del(&port->hash);
                hlist_del(&port->hash);
diff --git a/drivers/net/ethernet/sun/sunvnet.h b/drivers/net/ethernet/sun/sunvnet.h
index c911045..3c745d0 100644
--- a/drivers/net/ethernet/sun/sunvnet.h
+++ b/drivers/net/ethernet/sun/sunvnet.h
@@ -56,6 +56,10 @@ struct vnet_port {
        struct timer_list       clean_timer;

        u64                     rmtu;
+
+       struct napi_struct      napi;
+       u32                     napi_stop_idx;
+       bool                    napi_resume;
 };

 static inline struct vnet_port *to_vnet_port(struct vio_driver_state *vio)

^ permalink raw reply related

* Re: [PATCH net-next] net: better IFF_XMIT_DST_RELEASE support
From: Eric Dumazet @ 2014-10-02 20:06 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: David Miller, netdev
In-Reply-To: <alpine.LFD.2.11.1410022231100.3225@ja.home.ssi.bg>

On Thu, 2014-10-02 at 22:34 +0300, Julian Anastasov wrote:
> 	Hello,
> 
> On Thu, 2 Oct 2014, Eric Dumazet wrote:
> 
> > From: Eric Dumazet <edumazet@google.com>
> > 
> > Testing xmit_more support with netperf and connected UDP sockets,
> > I found strange dst refcount false sharing.
> > 
> > Current handling of IFF_XMIT_DST_RELEASE is not optimal.
> > 
> > dropping dst in validate_xmit_skb() is certainly too late in case
> > packet was queued by cpu X but dequeued by cpu Y
> > 
> > The logical point to take care of drop/force is in __dev_queue_xmit()
> > before even taking qdisc lock.
> 
> 	Does it hurt skb_dst usage in net/sched/, for example,
> dst->tclassid in net/sched/cls_route.c ?

Hmm good point...

Then, if a qdisc like that is used, I guess we can remove
IFF_XMIT_DST_RELEASE from dev->priv_flags

Maybe I should add a dev->dst_are_needed counter, instead using a flag.

Thanks

^ permalink raw reply

* Re: [RFC PATCH linux 2/2] fs/proc: use a hash table for the directory entries
From: Alexey Dobriyan @ 2014-10-02 20:06 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Nicolas Dichtel, netdev, linux-kernel, davem, akpm, rui.xiang,
	viro, oleg, gorcunov, kirill.shutemov, grant.likely, tytso,
	Thierry Herbelot
In-Reply-To: <87h9zmpcz5.fsf@x220.int.ebiederm.org>

On Thu, Oct 02, 2014 at 11:01:50AM -0700, Eric W. Biederman wrote:
> Nicolas Dichtel <nicolas.dichtel@6wind.com> writes:
> 
> > From: Thierry Herbelot <thierry.herbelot@6wind.com>
> >
> > The current implementation for the directories in /proc is using a single
> > linked list. This is slow when handling directories with large numbers of
> > entries (eg netdevice-related entries when lots of tunnels are opened).
> >
> > This patch enables multiple linked lists. A hash based on the entry name is
> > used to select the linked list for one given entry.
> >
> > The speed creation of netdevices is faster as shorter linked lists must be
> > scanned when adding a new netdevice.
> 
> Is the directory of primary concern /proc/net/dev/snmp6 ?
> 
> Unless I have configured my networking stack weird by mistake that
> is the only directory under /proc/net that grows when we add an
> interface.
> 
> I just want to make certain I am seeing the same things that you are
> seeing.
> 
> I feel silly for overlooking this directory when the rest of the
> scalability work was done.

Slowdown comes from "duplicate name" check:

        for (tmp = dir->subdir; tmp; tmp = tmp->next)
                if (strcmp(tmp->name, dp->name) == 0) {
                        WARN(1, "proc_dir_entry '%s/%s' already registered\n",
                                dir->name, dp->name);
                        break;
                }

Removal can be made O(1) after switching to doubly-linked list.

	Alexey

^ permalink raw reply

* Re: [PATCH v1 5/5] driver-core: add driver asynchronous probe support
From: Luis R. Rodriguez @ 2014-10-02 20:06 UTC (permalink / raw)
  To: Tom Gundersen, systemd-devel
  Cc: One Thousand Gnomes, Takashi Iwai, Kay Sievers, pmladek, LKML,
	Michal Hocko, Praveen Krishnamoorthy, hare, Nagalakshmi Nandigama,
	werner, Tetsuo Handa, mpt-fusionlinux.pdl, Tim Gardner,
	Benjamin Poirier, Santosh Rastapur, Casey Leedom, Hariprasad S,
	Pierre Fersing, dbueso, Sreekanth Reddy, Arjan van de Ven,
	Abhijit Mahajan, Linux SCSI List <linu
In-Reply-To: <CAG-2HqVBnXUKSRBrJE=gEKA3St5KMfdgAbx2vRfpF3qw_teLOg@mail.gmail.com>

As per Tom, adding systemd-devel for advice / review / of the request to avoid
the sigkill for kmod workers. Keeping others on Cc as its a discussion that
I think can help if both camps are involved. Specially since we've been
ping ponging back and forth on this particular topic for a long time now.

On Thu, Oct 02, 2014 at 08:12:37AM +0200, Tom Gundersen wrote:
> On Tue, Sep 30, 2014 at 5:24 PM, Luis R. Rodriguez <mcgrof@suse.com> wrote:
> >> > commit e64fae5573e566ce4fd9b23c68ac8f3096603314
> >> > Author: Kay Sievers <kay.sievers@vrfy.org>
> >> > Date:   Wed Jan 18 05:06:18 2012 +0100
> >> >
> >> >     udevd: kill hanging event processes after 30 seconds
> >> >
> >> >     Some broken kernel drivers load firmware synchronously in the module init
> >> >     path and block modprobe until the firmware request is fulfilled.
> >> >     <...>
> >>
> >> This was a workaround to avoid a deadlock between udev and the kernel.
> >> The 180 s timeout was already in place before this change, and was not
> >> motivated by firmware loading. Also note that this patch was not about
> >> "tracking device drivers", just about avoiding dead-lock.
> >
> > Thanks, can you elaborate on how a deadlock can occur if the kmod
> > worker is not at some point sigkilled?
> 
> This was only relevant whet udev did the firmware loading. modprobe
> would wait for the kernel, which would wait for the firmware loading,
> which would wait for modprobe. This is no longer a problem as udev
> does not do firmware loading any more.

Thanks for clarifying. So the deadlock concern is no longer there, therefore
it is not a reason to keep the sigkill for kmod.

> > Is the issue that if there is no extra worker available and all are
> > idling on sleep / synchronous long work boot will potentially hang
> > unless a new worker becomes available to do more work?
> 
> Correct.

Ok.

> > If so I can
> > see the sigkill helping for hanging tasks but it doesn't necessarily
> > mean its a good idea to kill modules loading taking a while. Also
> > what if the sigkill is just avoided for *just* kmod workers?
> 
> Depending on the number of devices you have, I suppose we could still
> exhaust the workers.

Ok can systemd dynamically create a worker or set of workers per device
that creeps up? Async probe for most drivers will help with this but
having it dynamic should help as well, specially since there are drivers
that will require probe synchronously -- and the fact that async probe
mechanism is not yet merged.

> >> The way I see it, the current status from systemd's side is: our
> >> short-term work-around is to increase the timeout, and at the moment
> >> it appears no long-term solution is needed (i.e., it seems like the
> >> right thing to do is to make sure insmod can be near instantaneous, it
> >> appears people are working towards this goal, and so far no examples
> >> have cropped up showing that it is fundamentally impossible (once/if
> >> they do, we should of course revisit the problem)).
> >
> > That again would be reactive behaviour, what would prevent avoiding the
> > sigkill only for kmod workers? Is it known the deadlock is immiment?
> > If the amount of workers for kmod that would hit the timeout is
> > considered low I don't see how that's possible and why not just lift
> > the sigkill.
> 
> Making kmod a special case is of course possible. However, as long as
> there is no fundamental reason why kmod should get this special
> treatment, this just looks like a work-around to me. 

I've mentioned a series of five reasons why its a bad idea right now to
sigkill modules [0], we're reviewed them each and still at least
items 2-4 remain particularly valid fundamental reasons to avoid it
specially if the deadlock is no longer possible. Running out of
workers because they are loading modules and that is taking a while
is not really a good standing reason to be killing them, specially
if the timeout already is set to a high value. All we're doing there is
limiting Linux / number of devices arbitrarily just to help free
workers, and it seems that should be dealt with differently. Killing
module loading arbitrarily in the middle is not advisable and can cause
more issue than help in any way.

Async probe mechanism will help free workers faster but this patch series is
still being evolved, we should still address the sigkill for kmod workers
separately and see what remaining reasons we have for it in light of the
possible issues highlighted that it can introduce if kept. If we want to
capture drivers taking long on probe each subsystem should handle that and WARN
/ pick up on it, we cannot however assume that this a generally bad things as
discussed before. We will also not be able to async probe *every* driver,
which is why the series allows a flag to specify for this.

[0] https://lkml.org/lkml/2014/9/26/879

> We already have a
> work-around, which is to increase the global timeout. If you still
> think we should do something different in systemd, it is probably best
> to take the discussion to systemd-devel to make sure all the relevant
> people are involved.

Sure, I've included systemd-devel. Hope is we can have a constructive
discussion on the sigkill for kmod.

  Luis

^ permalink raw reply

* Re: [RFC PATCH net-next v2 0/5] netns: allow to identify peer netns
From: Andy Lutomirski @ 2014-10-02 19:48 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Network Development, Linux Containers,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Stephen Hemminger, Cong Wang, Linux API, Nicolas Dichtel,
	David S. Miller
In-Reply-To: <877g0il0gd.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>

On Thu, Oct 2, 2014 at 12:45 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Andy Lutomirski <luto@amacapital.net> writes:
>
>> On Thu, Oct 2, 2014 at 12:20 PM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>> Nicolas Dichtel <nicolas.dichtel@6wind.com> writes:
>>>
>>>> Le 29/09/2014 20:43, Eric W. Biederman a écrit :
>>>>> Nicolas Dichtel <nicolas.dichtel@6wind.com> writes:
>>>>>
>>>>>> Le 26/09/2014 20:57, Eric W. Biederman a écrit :
>>>>>>> Andy Lutomirski <luto@amacapital.net> writes:
>>>>>>>
>>>>>>>> On Fri, Sep 26, 2014 at 11:10 AM, Eric W. Biederman
>>>>>>>> <ebiederm@xmission.com> wrote:
>>>>>>>>> I see two ways to go with this.
>>>>>>>>>
>>>>>>>>> - A per network namespace table to that you can store ids for ``peer''
>>>>>>>>>     network namespaces.  The table would need to be populated manually by
>>>>>>>>>     the likes of ip netns add.
>>>>>>>>>
>>>>>>>>>     That flips the order of assignment and makes this idea solid.
>>>>>> I have a preference for this solution, because it allows to have a full
>>>>>> broadcast messages. When you have a lot of network interfaces (> 10k),
>>>>>> it saves a lot of time to avoid another request to get all informations.
>>>>>
>>>>> My practical question is how often does it happen that we care?
>>>> In fact, I don't think that scenarii with a lot of netns have a full mesh of
>>>> x-netns interfaces. It will be more one "link" netns with the physical
>>>> interface and all other with one interface with the link part in this "link"
>>>> netns. Hence, only one nsid is needing in each netns.
>>>
>>> I will buy that a full mesh is unlikely.
>>>
>>> For people doing simulations anything physical has a limited number of
>>> links.
>>>
>>> For people wanting all to all connectivity setting up an internal
>>> macvlan (or the equivalent) is likely much simpler and more efficient
>>> that a full mesh.
>>>
>>> So the question in my mind is how do we create these identifiers at need
>>> (when we create the cross network namespace links) instead of at network
>>> namespace creation time.  I don't see an answer to that in your patches,
>>> and perhaps it obvious.
>>>
>>
>> I wonder whether part of the problem is that we're thinking about
>> scoping wrong.  What if we made the hierarchy more explicit?
>>
>> For example, we could give each netns an admin-assigned identifier
>> (e.g. a 64-bit number, maybe required to be unique, maybe not)
>> relative to its containing userns.  Then we could come up with a way
>> to identify user namespaces (i.e. inode number relative to containing
>> user ns, if that's well-defined).
>
> If as suggested we only assign ids when a tunnel (or equivalent) is
> created between two network namespaces the space cost is a non-issue.
> The ids become at worst a constant factor addition to the cost of the
> tunnel.
>
> To keep things simple we may want to assign a free id (if one does not
> exist) when we connect a tunnel to a network namespace.
>
>> From user code's perspective, netnses that are in the requester's
>> userns or its descendents are identified by a path through a (possibly
>> zero-length) sequence of userns ids followed by a netns id.  Netnses
>> outside the requester's userns hierarchy cannot be named at all.
>>
>> Would this make sense?
>
> Nope.  What happens if I migrate 2 of the 4 network namespaces in a user
> namespace?  The migration potentially fails.  Application migration does
> not require user namespace migration.

Hmm.  I guess that, as long as those network namespaces aren't
connected to anything else, migrating like that makes sense and ought
to work.  Fair enough.

--Andy
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply

* Re: [RFC PATCH net-next v2 0/5] netns: allow to identify peer netns
From: Eric W. Biederman @ 2014-10-02 19:45 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, Network Development, Linux Containers,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Stephen Hemminger, Cong Wang, Linux API, Nicolas Dichtel,
	David S. Miller
In-Reply-To: <CALCETrWxqzUF1x+TmW5G4kuHPP+sUtiRaT6dpZ0mQTJ217QB5w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Andy Lutomirski <luto@amacapital.net> writes:

> On Thu, Oct 2, 2014 at 12:20 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> Nicolas Dichtel <nicolas.dichtel@6wind.com> writes:
>>
>>> Le 29/09/2014 20:43, Eric W. Biederman a écrit :
>>>> Nicolas Dichtel <nicolas.dichtel@6wind.com> writes:
>>>>
>>>>> Le 26/09/2014 20:57, Eric W. Biederman a écrit :
>>>>>> Andy Lutomirski <luto@amacapital.net> writes:
>>>>>>
>>>>>>> On Fri, Sep 26, 2014 at 11:10 AM, Eric W. Biederman
>>>>>>> <ebiederm@xmission.com> wrote:
>>>>>>>> I see two ways to go with this.
>>>>>>>>
>>>>>>>> - A per network namespace table to that you can store ids for ``peer''
>>>>>>>>     network namespaces.  The table would need to be populated manually by
>>>>>>>>     the likes of ip netns add.
>>>>>>>>
>>>>>>>>     That flips the order of assignment and makes this idea solid.
>>>>> I have a preference for this solution, because it allows to have a full
>>>>> broadcast messages. When you have a lot of network interfaces (> 10k),
>>>>> it saves a lot of time to avoid another request to get all informations.
>>>>
>>>> My practical question is how often does it happen that we care?
>>> In fact, I don't think that scenarii with a lot of netns have a full mesh of
>>> x-netns interfaces. It will be more one "link" netns with the physical
>>> interface and all other with one interface with the link part in this "link"
>>> netns. Hence, only one nsid is needing in each netns.
>>
>> I will buy that a full mesh is unlikely.
>>
>> For people doing simulations anything physical has a limited number of
>> links.
>>
>> For people wanting all to all connectivity setting up an internal
>> macvlan (or the equivalent) is likely much simpler and more efficient
>> that a full mesh.
>>
>> So the question in my mind is how do we create these identifiers at need
>> (when we create the cross network namespace links) instead of at network
>> namespace creation time.  I don't see an answer to that in your patches,
>> and perhaps it obvious.
>>
>
> I wonder whether part of the problem is that we're thinking about
> scoping wrong.  What if we made the hierarchy more explicit?
>
> For example, we could give each netns an admin-assigned identifier
> (e.g. a 64-bit number, maybe required to be unique, maybe not)
> relative to its containing userns.  Then we could come up with a way
> to identify user namespaces (i.e. inode number relative to containing
> user ns, if that's well-defined).

If as suggested we only assign ids when a tunnel (or equivalent) is
created between two network namespaces the space cost is a non-issue.
The ids become at worst a constant factor addition to the cost of the
tunnel.

To keep things simple we may want to assign a free id (if one does not
exist) when we connect a tunnel to a network namespace.

> From user code's perspective, netnses that are in the requester's
> userns or its descendents are identified by a path through a (possibly
> zero-length) sequence of userns ids followed by a netns id.  Netnses
> outside the requester's userns hierarchy cannot be named at all.
>
> Would this make sense? 

Nope.  What happens if I migrate 2 of the 4 network namespaces in a user
namespace?  The migration potentially fails.  Application migration does
not require user namespace migration.

Eric
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply

* HELLO BELOVED IN CHRIST PLEASE I NEED YOUR HELP
From: FROM MRS GRACE MANDA @ 2014-10-02 19:31 UTC (permalink / raw)

In-Reply-To: <92064508.199599.1412278051104.JavaMail.yahoo@jws100135.mail.ne1.yahoo.com>

[-- Attachment #1: Type: text/plain, Size: 54 bytes --]



 
KINDLY VIEW  THE ATTACHED FILE AND GET BACK TO ME

[-- Attachment #2: FROM MRS GRACE MANDA.rtf --]
[-- Type: application/msword, Size: 42406 bytes --]

^ permalink raw reply

* Re: [PATCH net-next] net: better IFF_XMIT_DST_RELEASE support
From: Julian Anastasov @ 2014-10-02 19:34 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1412267647.22242.3.camel@edumazet-glaptop2.roam.corp.google.com>


	Hello,

On Thu, 2 Oct 2014, Eric Dumazet wrote:

> From: Eric Dumazet <edumazet@google.com>
> 
> Testing xmit_more support with netperf and connected UDP sockets,
> I found strange dst refcount false sharing.
> 
> Current handling of IFF_XMIT_DST_RELEASE is not optimal.
> 
> dropping dst in validate_xmit_skb() is certainly too late in case
> packet was queued by cpu X but dequeued by cpu Y
> 
> The logical point to take care of drop/force is in __dev_queue_xmit()
> before even taking qdisc lock.

	Does it hurt skb_dst usage in net/sched/, for example,
dst->tclassid in net/sched/cls_route.c ?

> @@ -2895,6 +2887,14 @@ static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
>  
>  	skb_update_prio(skb);
>  
> +	/* If device doesn't need skb->dst, release it right now while
> +	 * its hot in this cpu cache
> +	 */
> +	if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
> +		skb_dst_drop(skb);
> +	else
> +		skb_dst_force(skb);
> +
>  	txq = netdev_pick_tx(dev, skb, accel_priv);
>  	q = rcu_dereference_bh(txq->qdisc);

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply

* Re: [RFC PATCH net-next v3 1/4] netns: add genl cmd to add and get peer netns ids
From: Eric W. Biederman @ 2014-10-02 19:33 UTC (permalink / raw)
  To: Nicolas Dichtel
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, luto-kltTT9wpgjJwATOyAt5JVQ,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	cwang-xCSkyg8dI+0RB7SZvlqPiA, linux-api-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q
In-Reply-To: <1412257690-31253-2-git-send-email-nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org>

Nicolas Dichtel <nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org> writes:

> With this patch, a user can define an id for a peer netns by providing a FD or a
> PID. These ids are local to netns (ie valid only into one netns).
>
> This will be useful for netlink messages when a x-netns interface is
> dumped.

You have a "id -> struct net *" table but you don't have a 
"struct net * -> id" table which looks like it will impact the
performance of peernet2id at scale.

Eric

^ permalink raw reply

* Re: [RFC PATCH net-next v2 0/5] netns: allow to identify peer netns
From: Andy Lutomirski @ 2014-10-02 19:31 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Network Development, Linux Containers,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Stephen Hemminger, Cong Wang, Linux API, Nicolas Dichtel,
	David S. Miller
In-Reply-To: <8761g2nurx.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>

On Thu, Oct 2, 2014 at 12:20 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Nicolas Dichtel <nicolas.dichtel@6wind.com> writes:
>
>> Le 29/09/2014 20:43, Eric W. Biederman a écrit :
>>> Nicolas Dichtel <nicolas.dichtel@6wind.com> writes:
>>>
>>>> Le 26/09/2014 20:57, Eric W. Biederman a écrit :
>>>>> Andy Lutomirski <luto@amacapital.net> writes:
>>>>>
>>>>>> On Fri, Sep 26, 2014 at 11:10 AM, Eric W. Biederman
>>>>>> <ebiederm@xmission.com> wrote:
>>>>>>> I see two ways to go with this.
>>>>>>>
>>>>>>> - A per network namespace table to that you can store ids for ``peer''
>>>>>>>     network namespaces.  The table would need to be populated manually by
>>>>>>>     the likes of ip netns add.
>>>>>>>
>>>>>>>     That flips the order of assignment and makes this idea solid.
>>>> I have a preference for this solution, because it allows to have a full
>>>> broadcast messages. When you have a lot of network interfaces (> 10k),
>>>> it saves a lot of time to avoid another request to get all informations.
>>>
>>> My practical question is how often does it happen that we care?
>> In fact, I don't think that scenarii with a lot of netns have a full mesh of
>> x-netns interfaces. It will be more one "link" netns with the physical
>> interface and all other with one interface with the link part in this "link"
>> netns. Hence, only one nsid is needing in each netns.
>
> I will buy that a full mesh is unlikely.
>
> For people doing simulations anything physical has a limited number of
> links.
>
> For people wanting all to all connectivity setting up an internal
> macvlan (or the equivalent) is likely much simpler and more efficient
> that a full mesh.
>
> So the question in my mind is how do we create these identifiers at need
> (when we create the cross network namespace links) instead of at network
> namespace creation time.  I don't see an answer to that in your patches,
> and perhaps it obvious.
>

I wonder whether part of the problem is that we're thinking about
scoping wrong.  What if we made the hierarchy more explicit?

For example, we could give each netns an admin-assigned identifier
(e.g. a 64-bit number, maybe required to be unique, maybe not)
relative to its containing userns.  Then we could come up with a way
to identify user namespaces (i.e. inode number relative to containing
user ns, if that's well-defined).

From user code's perspective, netnses that are in the requester's
userns or its descendents are identified by a path through a (possibly
zero-length) sequence of userns ids followed by a netns id.  Netnses
outside the requester's userns hierarchy cannot be named at all.

Would this make sense?  It should keep the asymptotic complexity of
everything under control and, for users of very large numbers of
network namespaces with complex routing, it doesn't require a
correspondingly large number of fds. It would have the added benefit
of allowing the same scheme to be used for all the other namespace
types, although it could be a bit odd for pid namespaces, which really
do have their own hierarchy.

--Andy
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply

* Re: [RFC PATCH net-next v2 0/5] netns: allow to identify peer netns
From: Eric W. Biederman @ 2014-10-02 19:20 UTC (permalink / raw)
  To: nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w
  Cc: Network Development, Linux Containers,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Andy Lutomirski, Stephen Hemminger, Cong Wang, Linux API,
	Andrew Morton, David S. Miller
In-Reply-To: <542D5726.8070308-pdR9zngts4EAvxtiuMwx3w@public.gmane.org>

Nicolas Dichtel <nicolas.dichtel@6wind.com> writes:

> Le 29/09/2014 20:43, Eric W. Biederman a écrit :
>> Nicolas Dichtel <nicolas.dichtel@6wind.com> writes:
>>
>>> Le 26/09/2014 20:57, Eric W. Biederman a écrit :
>>>> Andy Lutomirski <luto@amacapital.net> writes:
>>>>
>>>>> On Fri, Sep 26, 2014 at 11:10 AM, Eric W. Biederman
>>>>> <ebiederm@xmission.com> wrote:
>>>>>> I see two ways to go with this.
>>>>>>
>>>>>> - A per network namespace table to that you can store ids for ``peer''
>>>>>>     network namespaces.  The table would need to be populated manually by
>>>>>>     the likes of ip netns add.
>>>>>>
>>>>>>     That flips the order of assignment and makes this idea solid.
>>> I have a preference for this solution, because it allows to have a full
>>> broadcast messages. When you have a lot of network interfaces (> 10k),
>>> it saves a lot of time to avoid another request to get all informations.
>>
>> My practical question is how often does it happen that we care?
> In fact, I don't think that scenarii with a lot of netns have a full mesh of
> x-netns interfaces. It will be more one "link" netns with the physical
> interface and all other with one interface with the link part in this "link"
> netns. Hence, only one nsid is needing in each netns.

I will buy that a full mesh is unlikely.  

For people doing simulations anything physical has a limited number of
links.

For people wanting all to all connectivity setting up an internal
macvlan (or the equivalent) is likely much simpler and more efficient
that a full mesh.

So the question in my mind is how do we create these identifiers at need
(when we create the cross network namespace links) instead of at network
namespace creation time.  I don't see an answer to that in your patches,
and perhaps it obvious.

Eric
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply

* [PATCH net-next] Removed unused inet6 address state
From: Sébastien Barré @ 2014-10-02 19:15 UTC (permalink / raw)
  To: David Miller
  Cc: Sébastien Barré, netdev, Christoph Paasch, Herbert Xu

the inet6 state INET6_IFADDR_STATE_UP only appeared in its definition.

Cc: Christoph Paasch <christoph.paasch@uclouvain.be>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Sébastien Barré <sebastien.barre@uclouvain.be>

---
 include/net/if_inet6.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h
index d07b1a6..55a8d40 100644
--- a/include/net/if_inet6.h
+++ b/include/net/if_inet6.h
@@ -35,7 +35,6 @@ enum {
 	INET6_IFADDR_STATE_DAD,
 	INET6_IFADDR_STATE_POSTDAD,
 	INET6_IFADDR_STATE_ERRDAD,
-	INET6_IFADDR_STATE_UP,
 	INET6_IFADDR_STATE_DEAD,
 };
 
-- 
tg: (739e4a7..) net-next/remove_unused_inet6_state (depends on: net-next/master)

^ permalink raw reply related

* bug: race in team_{notify_peers,mcast_rejoin} scheduling
From: Joe Lawrence @ 2014-10-02 18:50 UTC (permalink / raw)
  To: netdev, Jiri Pirko

Hello Jiri,

Occasionally on boot I noticed that team_notify_peers_work would get
*very* busy.

With the following debugging added to team_notify_peers:

        netdev_info(team->dev, "%s(%p)\n", __func__, team);
        dump_stack();

I saw the following:

% dmesg | grep -e 'team[0-9]: team_notify_peers' -e 'port_enable' -e 'port_disable'
[   68.340861] team0: team_notify_peers(ffff88104ffa4de0)
[   68.743264]  [<ffffffffa034fd38>] team_port_enable.part.40+0x78/0x90 [team]
[   69.622395] team0: team_notify_peers(ffff88104ffa4de0)
[   69.966758]  [<ffffffffa034ef63>] team_port_disable+0x123/0x160 [team]
[   71.099263] team0: team_notify_peers(ffff88104ffa4de0)
[   71.466243]  [<ffffffffa034fd38>] team_port_enable.part.40+0x78/0x90 [team]
[   72.383788] team0: team_notify_peers(ffff88104ffa4de0)
[   72.744778]  [<ffffffffa034ef63>] team_port_disable+0x123/0x160 [team]
[   73.476190] team0: team_notify_peers(ffff88104ffa4de0)
[   73.830592]  [<ffffffffa034fd38>] team_port_enable.part.40+0x78/0x90 [team]
[   74.796738] team1: team_notify_peers(ffff88104f5df080)
[   75.165577]  [<ffffffffa034fd38>] team_port_enable.part.40+0x78/0x90 [team]
[   75.694968] team1: team_notify_peers(ffff88104f5df080)
[   75.694984]  [<ffffffffa034ef63>] team_port_disable+0x123/0x160 [team]
[   77.316488] team1: team_notify_peers(ffff88104f5df080)
[   77.663122]  [<ffffffffa034fd38>] team_port_enable.part.40+0x78/0x90 [team]
[   78.470488] team1: team_notify_peers(ffff88104f5df080)
[   78.814722]  [<ffffffffa034ef63>] team_port_disable+0x123/0x160 [team]
[   82.690765] team2: team_notify_peers(ffff88083d24df40)
[   83.083540]  [<ffffffffa034fd38>] team_port_enable.part.40+0x78/0x90 [team]
[   83.942458] team2: team_notify_peers(ffff88083d24df40)
[   84.286446]  [<ffffffffa034ef63>] team_port_disable+0x123/0x160 [team]
[   86.089955] team3: team_notify_peers(ffff88083fd14de0)
[   86.453495]  [<ffffffffa034fd38>] team_port_enable.part.40+0x78/0x90 [team]
[   87.267773] team3: team_notify_peers(ffff88083fd14de0)
[   87.610203]  [<ffffffffa034ef63>] team_port_disable+0x123/0x160 [team]

which shows team_port_enable/disable getting invoked in short
succession.  When looking at one of the team's
notify_peers.count_pending value, I saw that it was negative and slowly
counting down from 0xffff...ffff!

This lead me believe that there is a race condition present in
the .count_pending pattern that both team_notify_peers and
team_mcast_rejoin employ.

Can you comment on the following patch/workaround?

Thanks,

-- Joe

-->8-- -->8-- -->8--

>From b11d7dcd051a2f141c1eec0a43c4a4ddf0361d10 Mon Sep 17 00:00:00 2001
From: Joe Lawrence <joe.lawrence@stratus.com>
Date: Thu, 2 Oct 2014 14:24:26 -0400
Subject: [PATCH] team: avoid race condition in scheduling delayed work

When team_notify_peers and team_mcast_rejoin are called, they both reset
their respective .count_pending atomic variable. Then when the actual
worker function is executed, the variable is atomically decremented.
This pattern introduces a potential race condition where the
.count_pending rolls over and the worker function keeps rescheduling
until .count_pending decrements to zero again:

THREAD 1                           THREAD 2
========                           ========
team_notify_peers(teamX)
  atomic_set count_pending = 1
  schedule_delayed_work
                                   team_notify_peers(teamX)
                                   atomic_set count_pending = 1
team_notify_peers_work
  atomic_dec_and_test
    count_pending = 0
  (return)
                                   schedule_delayed_work
                                   team_notify_peers_work
                                   atomic_dec_and_test
                                     count_pending = -1
                                   schedule_delayed_work
                                   (repeat until count_pending = 0)

Instead of assigning a new value to .count_pending, use atomic_add to
tack-on the additional desired worker function invocations.

Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
---
 drivers/net/team/team.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index d46df38..2b87e3f 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -647,7 +647,7 @@ static void team_notify_peers(struct team *team)
 {
 	if (!team->notify_peers.count || !netif_running(team->dev))
 		return;
-	atomic_set(&team->notify_peers.count_pending, team->notify_peers.count);
+	atomic_add(team->notify_peers.count, &team->notify_peers.count_pending);
 	schedule_delayed_work(&team->notify_peers.dw, 0);
 }
 
@@ -687,7 +687,7 @@ static void team_mcast_rejoin(struct team *team)
 {
 	if (!team->mcast_rejoin.count || !netif_running(team->dev))
 		return;
-	atomic_set(&team->mcast_rejoin.count_pending, team->mcast_rejoin.count);
+	atomic_add(team->mcast_rejoin.count, &team->mcast_rejoin.count_pending);
 	schedule_delayed_work(&team->mcast_rejoin.dw, 0);
 }
 
-- 
1.7.10.4

^ permalink raw reply related

* Re: linux-next: manual merge of the net-next tree with the net tree
From: David Miller @ 2014-10-02 18:27 UTC (permalink / raw)
  To: sfr; +Cc: netdev, linux-next, linux-kernel, hayeswang
In-Reply-To: <20141002141641.0e43a18e@canb.auug.org.au>

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Thu, 2 Oct 2014 14:16:41 +1000

> Today's linux-next merge of the net-next tree got a conflict in
> drivers/net/usb/r8152.c between commit 204c87041289 ("r8152: remove
> clearing bp") from the net tree and commit 8ddfa07778af ("r8152: use
> usleep_range") from the net-next tree.
> 
> I fixed it up (the former removed some of the code updated by the
> latter) and can carry the fix as necessary (no action is required).

I'm merging net into net-next today and thus resolving this.

Thanks Stephen!

^ permalink raw reply

* Re: [Patch net-next] net_sched: avoid calling tcf_unbind_filter() in call_rcu callback
From: John Fastabend @ 2014-10-02 18:04 UTC (permalink / raw)
  To: David Miller; +Cc: xiyou.wangcong, netdev, john.r.fastabend
In-Reply-To: <20141001.220123.1054083748022563765.davem@davemloft.net>

On 10/01/2014 07:01 PM, David Miller wrote:
> From: Cong Wang <xiyou.wangcong@gmail.com>
> Date: Tue, 30 Sep 2014 16:07:24 -0700
>
>> This fixes the following crash:
>   ...
>> tp could be freed in call_rcu callback too, the order is not guaranteed.
>>
>> Cc: John Fastabend <john.r.fastabend@intel.com>
>> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
>
> Applied, and I added John's description of why this is legal to the
> commit message.
> --

Thanks, I'll have another series shortly to fix the
other classifiers with the same issue and pull 'tp'
out of the ematch stuff.

Passing around free'd pointers that never get used
through the ematch code wont cause a crash but there is
no reason to propagate it like this.

.John

-- 
John Fastabend         Intel Corporation

^ permalink raw reply

* Re: [RFC PATCH linux 2/2] fs/proc: use a hash table for the directory entries
From: Eric W. Biederman @ 2014-10-02 18:01 UTC (permalink / raw)
  To: Nicolas Dichtel
  Cc: netdev, linux-kernel, davem, akpm, adobriyan, rui.xiang, viro,
	oleg, gorcunov, kirill.shutemov, grant.likely, tytso,
	Thierry Herbelot
In-Reply-To: <1412263501-6572-3-git-send-email-nicolas.dichtel@6wind.com>

Nicolas Dichtel <nicolas.dichtel@6wind.com> writes:

> From: Thierry Herbelot <thierry.herbelot@6wind.com>
>
> The current implementation for the directories in /proc is using a single
> linked list. This is slow when handling directories with large numbers of
> entries (eg netdevice-related entries when lots of tunnels are opened).
>
> This patch enables multiple linked lists. A hash based on the entry name is
> used to select the linked list for one given entry.
>
> The speed creation of netdevices is faster as shorter linked lists must be
> scanned when adding a new netdevice.

Is the directory of primary concern /proc/net/dev/snmp6 ?

Unless I have configured my networking stack weird by mistake that
is the only directory under /proc/net that grows when we add an
interface.

I just want to make certain I am seeing the same things that you are
seeing.

I feel silly for overlooking this directory when the rest of the
scalability work was done.

> Here are some numbers:
>
> dummy30000.batch contains 30 000 times 'link add type dummy'.
>
> Before the patch:
> time ip -b dummy30000.batch
> real    2m32.221s
> user    0m0.380s
> sys     2m30.610s
>
> After the patch:
> time ip -b dummy30000.batch
> real    1m57.190s
> user    0m0.350s
> sys     1m56.120s
>
> The single 'subdir' list head is replaced by a subdir hash table. The subdir
> hash buckets are only allocated for directories. The number of hash buckets
> is a compile-time parameter.

That looks like a nice speed up.  A couple of things.

With sysfs and sysctl when faced this class of challenge we used an
rbtree instead of a hash table.  That should use less memory and scale
better.

I am concerned about a fixed sized hash table moving the location where
we fall off a cliff but not removing the cliff itself.

I suppose it would be possible to use the new fancy resizable hash
tables but previous work on sysctl and sysfs suggests that we don't look
up these entries sufficiently to require a hash table.  We just need a
data structure that doesn't fall over at scale, and the rbtrees seem to
do that very nicely.

> For all functions which handle directory entries, an additional check on the
> directory nature of the dir entry ensures that pde_hash_buckets was allocated.
> This check was not needed as subdir was present for all dir entries, whether
> actual directories or simple files.

That bit of logic seems reasonable.

Eric

^ permalink raw reply

* <Reply ASAP>
From: Mr. James @ 2014-10-02  8:54 UTC (permalink / raw)
  To: Recipients

Sequel to your non-response to my previous email, I am re-sending this to you again thus; A deceased client of mine who died of a heart-related ailment about 3 years ago left behind some funds which I want you to  assist in retriving and distributing. Reply so I can give you details on what is needed to be done.

Regards,

James.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox