Netdev List

* Re: [PATCH 0/4] Implement persistent grant in xen-netfront/netback
From: Ian Campbell @ 2012-11-16  9:57 UTC
  To: Annie Li
  Cc: xen-devel@lists.xensource.com, netdev@vger.kernel.org,
	konrad.wilk@oracle.com
In-Reply-To: <1352962987-541-1-git-send-email-annie.li@oracle.com>

On Thu, 2012-11-15 at 07:03 +0000, Annie Li wrote:
> This patch implements persistent grants for xen-netfront/netback.

Hang on a sec. It has just occurred to me that netfront/netback in
current mainline kernels don't use grant maps at all; they use grant
copy on both the tx and rx paths.

The supposed benefit of persistent grants is to avoid the TLB shootdowns
on grant unmap, but in the current code there should be exactly zero of
those.

If I understand correctly this patch goes from using grant copy
operations to persistently mapping frames and then using memcpy on those
buffers to copy in/out to local buffers. I'm finding it hard to think of
a reason why this should perform any better. Do you have a theory which
explains it? (My best theory is that it has a beneficial impact on the
cache locality of the data, but netperf doesn't typically access the
data, so I'm not sure why that would matter.)

AIUI this is also doing persistent grants for both the Tx and Rx
directions?

For guest Rx does this mean it now copies twice, in dom0 from the DMA
buffer to the guest provided buffer and then again in the guest from the
granted buffer to a normal one?

For guest Tx how do you handle the lifecycle of the grant mapped pages
which are being sent up into the dom0 network stack? Or are you also now
copying twice in this case? (i.e. guest copies into a granted buffer and
dom0 copies out into a local buffer?)

Did you measure the Tx and Rx cases independently? Do you know
that they both benefit from this change (rather than, for example, an
improvement in one direction masking a regression in the other)? Were
the numbers you previously posted for one particular direction, or did
you measure both?

Ian.

* Re: [PATCH 3.7.0-rc4] of/net/mdio-gpio: Fix pdev->id issue when using devicetrees.
From: Srinivas KANDAGATLA @ 2012-11-16  9:54 UTC
  To: David Miller, Grant Likely; +Cc: netdev, devicetree-discuss
In-Reply-To: <20121114.185940.648295821386414260.davem@davemloft.net>

On 14/11/12 23:59, David Miller wrote:
> From: Srinivas KANDAGATLA <srinivas.kandagatla@st.com>
> Date: Tue, 13 Nov 2012 14:26:13 +0000
>
>> From: Srinivas Kandagatla <srinivas.kandagatla@st.com>
>>
>> When the mdio-gpio driver is probed via device trees, the platform
>> device id is set as -1, However the id is re-used in the code while
>> creating an mdio bus.
>> So, setting up the id via aliases from device tree is a sensible
>> solution to fix this issue.
>>
>> Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@st.com>
> This seems rather pointless unless you also update every single device
> tree out there.
>
> Also you need to describe what are the ramifications of this problem
> otherwise it is impossible to figure out how serious this change is.
>
> Does it prevent probing?  Does it cause a crash?
I apologise, I should have explained the full use-case.

The use-case is a MAC driver that wants to connect via phy_connect() to
an mdio-gpio phy; it uses the bus name to do so.

The mdio-gpio phy bus name is formatted as "gpio-<bus-number>:<phy-addr>".
In the existing code, the bus number for mdio-gpio, if probed from device
trees, is set to -1, which results in the bus name being set to
"gpio-ffffffff:<phy-addr>"; that is the problem here.
The ffffffff is the result of pdev->id being set to -1. It should be set
to a logical number, and this is only possible via aliases.

Having ffffffff as the bus-id also means that we can't have two mdio-gpio
buses via device trees, as they would end up with the same bus-id.
This patch attempts to fix this issue, so getting the id from an alias
would be the right choice.

I also agree with Grant's comments about setting up pdev->id.

I will send a v2 patch that takes Grant's comments into account.

>
> Basically, what I'm saying is that this is a very poor submission and
> you need to substantially improve it and communicate better.
>
> If the problem is basically benign, then you should target this change
> to net-next instead of the net tree, along with the necessary dt file
> updates.
I have looked in net-next and there are no dt files which use this driver.

Thanks,
srini
> Thanks.
>
>

* Re: [Xen-devel] [PATCH 1/4] xen/netback: implements persistent grant with one page pool.
From: ANNIE LI @ 2012-11-16  9:55 UTC
  To: Ian Campbell
  Cc: xen-devel@lists.xensource.com, netdev@vger.kernel.org,
	konrad.wilk@oracle.com
In-Reply-To: <1353058074.3499.166.camel@zakaz.uk.xensource.com>



On 2012-11-16 17:27, Ian Campbell wrote:
> On Fri, 2012-11-16 at 02:18 +0000, ANNIE LI wrote:
>> In this patch,
>> The maximum of memory overhead is about
>>
>> (XEN_NETIF_TX_RING_SIZE+XEN_NETIF_RX_RING_SIZE)*PAGE_SIZE  (plus size of grant_ref_t and handle)
>> which is about 512 PAGE_SIZE. Normally, without heavy network offload, this maximum can not be reached.
>>
>> In next patch of splitting tx/rx pool, the maximum is about
> "about" or just "is"?

For the grant pages alone, it is exactly this value. I said "about"
because I also took into account the other elements, grant_ref_t and map
(to be changed to handle in future).

>
>>   (256+512)PAGE_SIZE.
> IOW 3MB.
>
>>>> +
>>>> +       return NULL;
>>>> +}
>>>> +
>>>> @@ -1338,7 +1497,11 @@ static unsigned xen_netbk_tx_build_gops(struct xen_netbk *netbk)
>>>>                   gop->source.domid = vif->domid;
>>>>                   gop->source.offset = txreq.offset;
>>>>
>>>> -               gop->dest.u.gmfn = virt_to_mfn(page_address(page));
>>>> +               if (!vif->persistent_grant)
>>>> +                       gop->dest.u.gmfn = virt_to_mfn(page_address(page));
>>>> +               else
>>>> +                       gop->dest.u.gmfn = (unsigned long)page_address(page);
>>> page_address doesn't return any sort of frame number, does it? This is
>>> rather confusing...
>> Yes. I only use dest.u.gmfn element to save the page_address here for
>> future memcpy, and it does not mean to use frame number actually. To
>> avoid confusion, here I can use
>>
>> gop->dest.u.gmfn = virt_to_mfn(page_address(page));
>>
>> and then call mfn_to_virt when doing memcpy.
> It seems a bit odd to be using the gop structure in this way when you
> aren't actually doing a grant op on it.
>
> While investigating I noticed:
> +static int
> +grant_memory_copy_op(unsigned int cmd, void *vuop, unsigned int count,
> +                    struct xen_netbk *netbk, bool tx_pool)
> ...
> +       struct gnttab_copy *uop = vuop;
>
> Why void *vuop? Why not struct gnttab_copy * in the parameter?

Sorry, my mistake.

>
> I also noticed your new grant_memory_copy_op() seems to have unbatched
> the grant ops in the non-persistent case, which is going to suck for
> performance in non-persistent mode. You need to pull the conditional and
> the HYPERVISOR_grant_table_op outside the loop and pass it full array
> instead of doing them one at a time.

This is again tied to the netback per-VIF implementation.
Currently, these cannot be pulled outside the loop, since a netback queue
may contain both persistent and non-persistent requests. I did consider
implementing per-VIF first and then persistent grants, but since per-VIF
is part of Wei's patch series combined with other patches, I finally
decided to implement per-VIF later.

But this does limit the implementation of persistent grants.

Thanks
Annie
>
> Ian
>

* Re: [Xen-devel] [PATCH 1/4] xen/netback: implements persistent grant with one page pool.
From: Ian Campbell @ 2012-11-16  9:32 UTC
  To: ANNIE LI
  Cc: Roger Pau Monne, xen-devel@lists.xensource.com,
	netdev@vger.kernel.org, konrad.wilk@oracle.com
In-Reply-To: <50A5A9CF.8030008@oracle.com>

On Fri, 2012-11-16 at 02:49 +0000, ANNIE LI wrote:
> >
> > Take a look at the following functions from blkback; foreach_grant,
> > add_persistent_gnt and get_persistent_gnt. They are generic functions to
> > deal with persistent grants.
> 
> Ok, thanks.
> Or moving those functions into a separate common file?

Please put them somewhere common.

> > This is highly inefficient, one of the points of using gnttab_set_map_op
> > is that you can queue a bunch of grants, and then map them at the same
> > time using gnttab_map_refs, but here you are using it to map a single
> > grant at a time. You should instead see how much grants you need to map
> > to complete the request and map them all at the same time.
> 
> Yes, it is inefficient here. But this is limited by current netback
> implementation. Current netback is not per-VIF based(not like blkback
> does). After combining persistent grant and non persistent grant
> together, every vif request in the queue may/may not support persistent
> grant. I have to judge whether every vif in the queue supports
> persistent grant or not. If it support, memcpy is used, if not,
> grantcopy is used.

You could (and should) still batch all the grant copies into one
hypercall, e.g. walk the list either doing memcpy or queuing up copyops
as appropriate, then at the end if the queue is non-zero length issue
the hypercall.

I'd expect this lack of batching here and in the other case I just
spotted to have a detrimental effect on guests running with this patch
but not using persistent grants. Did you benchmark that case?

> After making netback per-VIF works, this issue can be fixed.

You've mentioned improvements which are conditional on this work a few
times, I think. Perhaps it makes sense to make that change first?

Ian.

* Re: [Xen-devel] [PATCH 1/4] xen/netback: implements persistent grant with one page pool.
From: Ian Campbell @ 2012-11-16  9:27 UTC
  To: ANNIE LI
  Cc: xen-devel@lists.xensource.com, netdev@vger.kernel.org,
	konrad.wilk@oracle.com
In-Reply-To: <50A5A285.1030805@oracle.com>

On Fri, 2012-11-16 at 02:18 +0000, ANNIE LI wrote:
> In this patch,
> The maximum of memory overhead is about
> 
> (XEN_NETIF_TX_RING_SIZE+XEN_NETIF_RX_RING_SIZE)*PAGE_SIZE  (plus size of grant_ref_t and handle)
> which is about 512 PAGE_SIZE. Normally, without heavy network offload, this maximum can not be reached.
> 
> In next patch of splitting tx/rx pool, the maximum is about

"about" or just "is"?

>  (256+512)PAGE_SIZE.

IOW 3MB.

> >
> >> +
> >> +       return NULL;
> >> +}
> >> +
> >> @@ -1338,7 +1497,11 @@ static unsigned xen_netbk_tx_build_gops(struct xen_netbk *netbk)
> >>                  gop->source.domid = vif->domid;
> >>                  gop->source.offset = txreq.offset;
> >>
> >> -               gop->dest.u.gmfn = virt_to_mfn(page_address(page));
> >> +               if (!vif->persistent_grant)
> >> +                       gop->dest.u.gmfn = virt_to_mfn(page_address(page));
> >> +               else
> >> +                       gop->dest.u.gmfn = (unsigned long)page_address(page);
> > page_address doesn't return any sort of frame number, does it? This is
> > rather confusing...
> 
> Yes. I only use dest.u.gmfn element to save the page_address here for 
> future memcpy, and it does not mean to use frame number actually. To 
> avoid confusion, here I can use
> 
> gop->dest.u.gmfn = virt_to_mfn(page_address(page));
> 
> and then call mfn_to_virt when doing memcpy.

It seems a bit odd to be using the gop structure in this way when you
aren't actually doing a grant op on it. 

While investigating I noticed:
+static int
+grant_memory_copy_op(unsigned int cmd, void *vuop, unsigned int count,
+                    struct xen_netbk *netbk, bool tx_pool)
...
+       struct gnttab_copy *uop = vuop;

Why void *vuop? Why not struct gnttab_copy * in the parameter?

I also noticed your new grant_memory_copy_op() seems to have unbatched
the grant ops in the non-persistent case, which is going to suck for
performance in non-persistent mode. You need to pull the conditional and
the HYPERVISOR_grant_table_op outside the loop and pass it full array
instead of doing them one at a time.

Ian

* Re: [PATCH 08/14] xen: netback: Remove redundant check on unsigned variable
From: Ian Campbell @ 2012-11-16  9:16 UTC
  To: Tushar Behera
  Cc: linux-kernel@vger.kernel.org, patches@linaro.org,
	xen-devel@lists.xensource.com, netdev@vger.kernel.org
In-Reply-To: <1353048646-10935-9-git-send-email-tushar.behera@linaro.org>

On Fri, 2012-11-16 at 06:50 +0000, Tushar Behera wrote:
> No need to check whether unsigned variable is less than 0.
> 
> CC: Ian Campbell <ian.campbell@citrix.com>
> CC: xen-devel@lists.xensource.com
> CC: netdev@vger.kernel.org
> Signed-off-by: Tushar Behera <tushar.behera@linaro.org>

Acked-by: Ian Campbell <ian.campbell@citrix.com>

Thanks.

> ---
>  drivers/net/xen-netback/netback.c |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
> index aab8677..515e10c 100644
> --- a/drivers/net/xen-netback/netback.c
> +++ b/drivers/net/xen-netback/netback.c
> @@ -190,14 +190,14 @@ static int get_page_ext(struct page *pg,
>  
>  	group = ext.e.group - 1;
>  
> -	if (group < 0 || group >= xen_netbk_group_nr)
> +	if (group >= xen_netbk_group_nr)
>  		return 0;
>  
>  	netbk = &xen_netbk[group];
>  
>  	idx = ext.e.idx;
>  
> -	if ((idx < 0) || (idx >= MAX_PENDING_REQS))
> +	if (idx >= MAX_PENDING_REQS)
>  		return 0;
>  
>  	if (netbk->mmap_pages[idx] != pg)

* [PATCH 4/4] batman-adv: process broadcast packets in BLA earlier
From: Antonio Quartulli @ 2012-11-16  8:49 UTC
  To: davem
  Cc: netdev, Simon Wunderlich, Marek Lindner, Sven Eckelmann,
	Antonio Quartulli, Simon Wunderlich
In-Reply-To: <1353055758-2901-1-git-send-email-ordex@autistici.org>

The logic in the BLA mechanism may decide to drop broadcast packets
because the node may still be in the setup phase. For this reason,
further broadcast processing like the early client detection mechanism
must be done only after the BLA check.

This patch moves the invocation of BLA before any other broadcast
processing.

This was introduced by 30cfd02b60e1cb16f5effb0a01f826c5bb7e4c59
("batman-adv: detect not yet announced clients")

Reported-by: Glen Page <glen.page@thet.net>
Signed-off-by: Simon Wunderlich <siwu@hrz.tu-chemnitz.de>
Signed-off-by: Antonio Quartulli <ordex@autistici.org>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
---
 net/batman-adv/soft-interface.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/batman-adv/soft-interface.c b/net/batman-adv/soft-interface.c
index b9a28d2..ce0684a 100644
--- a/net/batman-adv/soft-interface.c
+++ b/net/batman-adv/soft-interface.c
@@ -325,6 +325,12 @@ void batadv_interface_rx(struct net_device *soft_iface,
 
 	soft_iface->last_rx = jiffies;
 
+	/* Let the bridge loop avoidance check the packet. If will
+	 * not handle it, we can safely push it up.
+	 */
+	if (batadv_bla_rx(bat_priv, skb, vid, is_bcast))
+		goto out;
+
 	if (orig_node)
 		batadv_tt_add_temporary_global_entry(bat_priv, orig_node,
 						     ethhdr->h_source);
@@ -332,12 +338,6 @@ void batadv_interface_rx(struct net_device *soft_iface,
 	if (batadv_is_ap_isolated(bat_priv, ethhdr->h_source, ethhdr->h_dest))
 		goto dropped;
 
-	/* Let the bridge loop avoidance check the packet. If will
-	 * not handle it, we can safely push it up.
-	 */
-	if (batadv_bla_rx(bat_priv, skb, vid, is_bcast))
-		goto out;
-
 	netif_rx(skb);
 	goto out;
 
-- 
1.8.0

* [PATCH 3/4] batman-adv: don't add TEMP clients belonging to other backbone nodes
From: Antonio Quartulli @ 2012-11-16  8:49 UTC
  To: davem
  Cc: netdev, Simon Wunderlich, Marek Lindner, Sven Eckelmann,
	Antonio Quartulli
In-Reply-To: <1353055758-2901-1-git-send-email-ordex@autistici.org>

The "early client detection" mechanism must not add clients belonging
to other backbone nodes. Such clients must be reached by directly
using the LAN instead of the mesh.

This was introduced by 30cfd02b60e1cb16f5effb0a01f826c5bb7e4c59
("batman-adv: detect not yet announced clients")

Reported-by: Glen Page <glen.page@thet.net>
Signed-off-by: Antonio Quartulli <ordex@autistici.org>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
---
 net/batman-adv/translation-table.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/net/batman-adv/translation-table.c b/net/batman-adv/translation-table.c
index fec1a00..baae715 100644
--- a/net/batman-adv/translation-table.c
+++ b/net/batman-adv/translation-table.c
@@ -2456,6 +2456,13 @@ bool batadv_tt_add_temporary_global_entry(struct batadv_priv *bat_priv,
 {
 	bool ret = false;
 
+	/* if the originator is a backbone node (meaning it belongs to the same
+	 * LAN of this node) the temporary client must not be added because to
+	 * reach such destination the node must use the LAN instead of the mesh
+	 */
+	if (batadv_bla_is_backbone_gw_orig(bat_priv, orig_node->orig))
+		goto out;
+
 	if (!batadv_tt_global_add(bat_priv, orig_node, addr,
 				  BATADV_TT_CLIENT_TEMP,
 				  atomic_read(&orig_node->last_ttvn)))
-- 
1.8.0

* [PATCH 2/4] batman-adv: correctly pass the client flag on tt_response
From: Antonio Quartulli @ 2012-11-16  8:49 UTC
  To: davem
  Cc: netdev, Simon Wunderlich, Marek Lindner, Sven Eckelmann,
	Antonio Quartulli
In-Reply-To: <1353055758-2901-1-git-send-email-ordex@autistici.org>

When a TT response with the full table is sent, the client flags
should be sent as well. This patch fixes the flags assignment when
populating the tt_response to send back.

This was introduced by 30cfd02b60e1cb16f5effb0a01f826c5bb7e4c59
("batman-adv: detect not yet announced clients")

Signed-off-by: Antonio Quartulli <ordex@autistici.org>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
---
 net/batman-adv/translation-table.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/batman-adv/translation-table.c b/net/batman-adv/translation-table.c
index 64c0012..fec1a00 100644
--- a/net/batman-adv/translation-table.c
+++ b/net/batman-adv/translation-table.c
@@ -1502,7 +1502,7 @@ batadv_tt_response_fill_table(uint16_t tt_len, uint8_t ttvn,
 
 			memcpy(tt_change->addr, tt_common_entry->addr,
 			       ETH_ALEN);
-			tt_change->flags = BATADV_NO_FLAGS;
+			tt_change->flags = tt_common_entry->flags;
 
 			tt_count++;
 			tt_change++;
-- 
1.8.0

* [PATCH 1/4] batman-adv: fix tt_global_entries flags update
From: Antonio Quartulli @ 2012-11-16  8:49 UTC
  To: davem
  Cc: netdev, Simon Wunderlich, Marek Lindner, Sven Eckelmann,
	Antonio Quartulli
In-Reply-To: <1353055758-2901-1-git-send-email-ordex@autistici.org>

Flags carried by a change_entry have to be always copied into the
client entry as they may contain important attributes (e.g.
TT_CLIENT_WIFI).

For instance, a client added by means of the "early detection
mechanism" has no flags set at the beginning, so they must be updated once
the proper ADD event is received.

This was introduced by 30cfd02b60e1cb16f5effb0a01f826c5bb7e4c59
("batman-adv: detect not yet announced clients")

Signed-off-by: Antonio Quartulli <ordex@autistici.org>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
---
 net/batman-adv/translation-table.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/batman-adv/translation-table.c b/net/batman-adv/translation-table.c
index 112edd3..64c0012 100644
--- a/net/batman-adv/translation-table.c
+++ b/net/batman-adv/translation-table.c
@@ -769,6 +769,12 @@ int batadv_tt_global_add(struct batadv_priv *bat_priv,
 		 */
 		tt_global_entry->common.flags &= ~BATADV_TT_CLIENT_TEMP;
 
+		/* the change can carry possible "attribute" flags like the
+		 * TT_CLIENT_WIFI, therefore they have to be copied in the
+		 * client entry
+		 */
+		tt_global_entry->common.flags |= flags;
+
 		/* If there is the BATADV_TT_CLIENT_ROAM flag set, there is only
 		 * one originator left in the list and we previously received a
 		 * delete + roaming change for this originator.
-- 
1.8.0

* pull request: batman-adv 2012-11-16
From: Antonio Quartulli @ 2012-11-16  8:49 UTC
  To: davem
  Cc: netdev, Simon Wunderlich, Marek Lindner, Sven Eckelmann,
	Antonio Quartulli

Hello David,

here is a small set of fixes intended for net/linux-3.7.
These patches fix some interoperability problems due to the features we
added in 3.7. Mainly we have two big issues: one prevents clients connected
to the mesh network from contacting any other host, and is caused by improper
translation table handling; the second compromises the AP isolation feature,
making it completely useless whether it was on or off.

Please, let me know if there is any problem.

Thank you very much!
		Antonio



The following changes since commit 80d11788fb8f4d9fcfae5ad508c7f1b65e8b28a3:

  Revert "drivers/net/phy/mdio-bitbang.c: Call mdiobus_unregister before mdiobus_free" (2012-11-14 22:32:15 -0500)

are available in the git repository at:

  git://git.open-mesh.org/linux-merge.git tags/batman-adv-fix-for-davem

for you to fetch changes up to 74490f969155caf1ec945ad2d35d3a8eec6be71d:

  batman-adv: process broadcast packets in BLA earlier (2012-11-16 09:36:54 +0100)

----------------------------------------------------------------
Included fixes are:
- update the client entry status flags when using the "early client
  detection". This makes the Distributed AP isolation work correctly;
- transfer the client entry status flags when recovering the translation
  table from another node. This makes the Distributed AP isolation work
  correctly;
- prevent the "early client detection mechanism" from adding clients belonging
  to other backbone nodes in the same LAN. Adding them breaks connectivity when
  this mechanism is used together with the Bridge Loop Avoidance;
- process broadcast packets with the Bridge Loop Avoidance before any other
  component. BLA can possibly drop the packets based on the source address. This
  makes the "early client detection mechanism" work correctly when used with
  BLA.

----------------------------------------------------------------
Antonio Quartulli (4):
      batman-adv: fix tt_global_entries flags update
      batman-adv: correctly pass the client flag on tt_response
      batman-adv: don't add TEMP clients belonging to other backbone nodes
      batman-adv: process broadcast packets in BLA earlier

 net/batman-adv/soft-interface.c    | 12 ++++++------
 net/batman-adv/translation-table.c | 15 ++++++++++++++-
 2 files changed, 20 insertions(+), 7 deletions(-)

* Re: [PATCH v4 2/9] net: rds: use this_cpu_* per-cpu helper
From: Shan Wei @ 2012-11-16  8:38 UTC
  To: David Miller
  Cc: Shan Wei, venkat.x.venkatsubra, rds-devel, NetDev,
	Kernel-Maillist, cl, Tejun Heo
In-Reply-To: <50A1A7C1.50308@gmail.com>

Shan Wei said, at 2012/11/13 9:52:
> From: Shan Wei <davidshan@tencent.com>
> 
> 
> Signed-off-by: Shan Wei <davidshan@tencent.com>
> Reviewed-by: Christoph Lameter <cl@linux.com>

David Miller, would you like to pick this up into your net-next tree?


> ---
> v4:
> 1. add missing __percpu annotations.
> 2. [read|write]ing fields of struct rds_ib_cache_head
> using __this_cpu_* operation, drop per_cpu_ptr.
> ---
>  net/rds/ib.h      |    2 +-
>  net/rds/ib_recv.c |   24 +++++++++++++-----------
>  2 files changed, 14 insertions(+), 12 deletions(-)
> 
> diff --git a/net/rds/ib.h b/net/rds/ib.h
> index 8d2b3d5..7280ab8 100644
> --- a/net/rds/ib.h
> +++ b/net/rds/ib.h
> @@ -50,7 +50,7 @@ struct rds_ib_cache_head {
>  };
>  
>  struct rds_ib_refill_cache {
> -	struct rds_ib_cache_head *percpu;
> +	struct rds_ib_cache_head __percpu *percpu;
>  	struct list_head	 *xfer;
>  	struct list_head	 *ready;
>  };
> diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
> index 8d19491..8c5bc85 100644
> --- a/net/rds/ib_recv.c
> +++ b/net/rds/ib_recv.c
> @@ -418,20 +418,21 @@ static void rds_ib_recv_cache_put(struct list_head *new_item,
>  				 struct rds_ib_refill_cache *cache)
>  {
>  	unsigned long flags;
> -	struct rds_ib_cache_head *chp;
>  	struct list_head *old;
> +	struct list_head __percpu *chpfirst;
>  
>  	local_irq_save(flags);
>  
> -	chp = per_cpu_ptr(cache->percpu, smp_processor_id());
> -	if (!chp->first)
> +	chpfirst = __this_cpu_read(cache->percpu->first);
> +	if (!chpfirst)
>  		INIT_LIST_HEAD(new_item);
>  	else /* put on front */
> -		list_add_tail(new_item, chp->first);
> -	chp->first = new_item;
> -	chp->count++;
> +		list_add_tail(new_item, chpfirst);
>  
> -	if (chp->count < RDS_IB_RECYCLE_BATCH_COUNT)
> +	__this_cpu_write(chpfirst, new_item);
> +	__this_cpu_inc(cache->percpu->count);
> +
> +	if (__this_cpu_read(cache->percpu->count) < RDS_IB_RECYCLE_BATCH_COUNT)
>  		goto end;
>  
>  	/*
> @@ -443,12 +444,13 @@ static void rds_ib_recv_cache_put(struct list_head *new_item,
>  	do {
>  		old = xchg(&cache->xfer, NULL);
>  		if (old)
> -			list_splice_entire_tail(old, chp->first);
> -		old = cmpxchg(&cache->xfer, NULL, chp->first);
> +			list_splice_entire_tail(old, chpfirst);
> +		old = cmpxchg(&cache->xfer, NULL, chpfirst);
>  	} while (old);
>  
> -	chp->first = NULL;
> -	chp->count = 0;
> +
> +	__this_cpu_write(chpfirst, NULL);
> +	__this_cpu_write(cache->percpu->count, 0);
>  end:
>  	local_irq_restore(flags);
>  }
> 

* Re: [PATCH v4 1/9] net: core: use this_cpu_ptr per-cpu helper
From: Shan Wei @ 2012-11-16  8:38 UTC
  To: David Miller
  Cc: Shan Wei, timo.teras, steffen.klassert, NetDev, Kernel-Maillist,
	cl, Tejun Heo
In-Reply-To: <50A1A7BA.5000507@gmail.com>

Shan Wei said, at 2012/11/13 9:51:
> From: Shan Wei <davidshan@tencent.com>
> 
> flush_tasklet is a struct, not a pointer in percpu var.
> so use this_cpu_ptr to get the member pointer.
> 
> Signed-off-by: Shan Wei <davidshan@tencent.com>
> Reviewed-by: Christoph Lameter <cl@linux.com>

David Miller, would you like to pick this up into your net-next tree?

> ---
> no changes vs v3.
> ---
>  net/core/flow.c |    4 +---
>  1 files changed, 1 insertions(+), 3 deletions(-)
> 
> diff --git a/net/core/flow.c b/net/core/flow.c
> index e318c7e..b0901ee 100644
> --- a/net/core/flow.c
> +++ b/net/core/flow.c
> @@ -327,11 +327,9 @@ static void flow_cache_flush_tasklet(unsigned long data)
>  static void flow_cache_flush_per_cpu(void *data)
>  {
>  	struct flow_flush_info *info = data;
> -	int cpu;
>  	struct tasklet_struct *tasklet;
>  
> -	cpu = smp_processor_id();
> -	tasklet = &per_cpu_ptr(info->cache->percpu, cpu)->flush_tasklet;
> +	tasklet = this_cpu_ptr(&info->cache->percpu->flush_tasklet);
>  	tasklet->data = (unsigned long)info;
>  	tasklet_schedule(tasklet);
>  }
> 

* Re: [PATCH v4 4/9] net: openvswitch: use this_cpu_ptr per-cpu helper
From: Shan Wei @ 2012-11-16  8:35 UTC
  To: jesse-l0M0P4e3n4LQT0dZR+AlfA
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, Tejun Heo,
	cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, NetDev, Kernel-Maillist,
	Shan Wei, David Miller
In-Reply-To: <50A1A7D9.2040506-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

Shan Wei said, at 2012/11/13 9:52:
> From: Shan Wei <davidshan-1Nz4purKYjRBDgjK7y7TUQ@public.gmane.org>
> 
> just use more faster this_cpu_ptr instead of per_cpu_ptr(p, smp_processor_id());
> 
> 
> Signed-off-by: Shan Wei <davidshan-1Nz4purKYjRBDgjK7y7TUQ@public.gmane.org>
> Reviewed-by: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>

Jesse Gross, would you like to pick this up into your tree?

> ---
> no changes vs v3,v2.
> ---
>  net/openvswitch/datapath.c |    4 ++--
>  net/openvswitch/vport.c    |    5 ++---
>  2 files changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
> index 4c4b62c..77d16a5 100644
> --- a/net/openvswitch/datapath.c
> +++ b/net/openvswitch/datapath.c
> @@ -208,7 +208,7 @@ void ovs_dp_process_received_packet(struct vport *p, struct sk_buff *skb)
>  	int error;
>  	int key_len;
>  
> -	stats = per_cpu_ptr(dp->stats_percpu, smp_processor_id());
> +	stats = this_cpu_ptr(dp->stats_percpu);
>  
>  	/* Extract flow from 'skb' into 'key'. */
>  	error = ovs_flow_extract(skb, p->port_no, &key, &key_len);
> @@ -282,7 +282,7 @@ int ovs_dp_upcall(struct datapath *dp, struct sk_buff *skb,
>  	return 0;
>  
>  err:
> -	stats = per_cpu_ptr(dp->stats_percpu, smp_processor_id());
> +	stats = this_cpu_ptr(dp->stats_percpu);
>  
>  	u64_stats_update_begin(&stats->sync);
>  	stats->n_lost++;
> diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
> index 03779e8..70af0be 100644
> --- a/net/openvswitch/vport.c
> +++ b/net/openvswitch/vport.c
> @@ -333,8 +333,7 @@ void ovs_vport_receive(struct vport *vport, struct sk_buff *skb)
>  {
>  	struct vport_percpu_stats *stats;
>  
> -	stats = per_cpu_ptr(vport->percpu_stats, smp_processor_id());
> -
> +	stats = this_cpu_ptr(vport->percpu_stats);
>  	u64_stats_update_begin(&stats->sync);
>  	stats->rx_packets++;
>  	stats->rx_bytes += skb->len;
> @@ -359,7 +358,7 @@ int ovs_vport_send(struct vport *vport, struct sk_buff *skb)
>  	if (likely(sent)) {
>  		struct vport_percpu_stats *stats;
>  
> -		stats = per_cpu_ptr(vport->percpu_stats, smp_processor_id());
> +		stats = this_cpu_ptr(vport->percpu_stats);
>  
>  		u64_stats_update_begin(&stats->sync);
>  		stats->tx_packets++;
> 

* Re: [PATCH 0/9 v4] use efficient this_cpu_* helper
From: Shan Wei @ 2012-11-16  8:30 UTC
  To: Tejun Heo, David Miller, paulmck, rostedt
  Cc: Christoph Lameter, NetDev, Kernel-Maillist
In-Reply-To: <20121115145325.GC7306@mtj.dyndns.org>

Hi Tejun Heo:

Tejun Heo said, at 2012/11/15 22:53:
> On Thu, Nov 15, 2012 at 02:19:38PM +0000, Christoph Lameter wrote:
>> Tejon: Could you pick up this patchset?
> 
> Sure, but, Shan, when posting patchset, please make the patches
> replies to the head message; otherwise, it's pretty difficult to track
> what's going on with the patchset as a whole.  I see that some patches
> are being picked up by respective subsystems.  If you have patches
> left, please let me know.

OK, next time I will do as you suggest.

This patchset touches several subsystems, i.e. network, rcu and trace.
The best way to avoid code conflicts is for each subsystem maintainer to pick
the patches up into their own tree. I will send a reminder on each patch that
has not yet been applied and add you to the recipient list.

Best Regards
Shan Wei

> 
> Thanks.
> 

* Re: [Xen-devel] [PATCH 3/4] Xen/netfront: Implement persistent grant in netfront.
From: ANNIE LI @ 2012-11-16  7:58 UTC
  To: Roger Pau Monné
  Cc: netdev@vger.kernel.org, xen-devel@lists.xensource.com,
	Ian Campbell, konrad.wilk@oracle.com
In-Reply-To: <50A5CD7F.2010609@oracle.com>



On 2012-11-16 13:22, ANNIE LI wrote:
>
>
> On 2012-11-15 18:52, Roger Pau Monné wrote:
>>> +       err = xenbus_printf(xbt, dev->nodename, 
>>> "feature-persistent-grants",
>>> +                           "%u", info->persistent_gnt);
>> As in netback, I think "feature-persistent" should be used.
>
> Same in blkback, I assume it is  "feature-persistent-grants", right?
> I referred your RFC patch, did you change it later? Or I missed 
> something?
>
>
My mistake.
In your v2 patch, it is "feature-persistent". I will change the code to
match blkback/blkfront.

Thanks
Annie

* Re: [Xen-devel] [PATCH 1/4] xen/netback: implements persistent grant with one page pool.
From: ANNIE LI @ 2012-11-16  7:57 UTC
  To: Roger Pau Monné
  Cc: xen-devel@lists.xensource.com, netdev@vger.kernel.org,
	konrad.wilk@oracle.com, Ian Campbell
In-Reply-To: <50A5A9CF.8030008@oracle.com>



On 2012-11-16 10:49, ANNIE LI wrote:
>
>
> On 2012-11-15 17:57, Roger Pau Monné wrote:
>>>
>>> @@ -453,7 +460,12 @@ static int connect_rings(struct backend_info *be)
>>>                  val = 0;
>>>          vif->csum = !val;
>>>
>>> -       /* Map the shared frame, irq etc. */
>>> +       if (xenbus_scanf(XBT_NIL, dev->otherend, 
>>> "feature-persistent-grants",
>>> +                        "%u",&val)<  0)
>> In block devices "feature-persistent" is used, so I think that for
>> clearness it should be announced the same way in net.
> Is it  "feature-persistent" ? I checked your RFC patch, the key is 
> "feature-persistent-grants".
>
>
My mistake.
In your v2 patch, it is "feature-persistent". I will change the code to
match blkback/blkfront.

Thanks
Annie

^ permalink raw reply

* Re: [tcpdump-workers] vlan tagged packets and libpcap breakage
From: Eric W. Biederman @ 2012-11-16  6:51 UTC (permalink / raw)
  To: Ani Sinha; +Cc: Michael Richardson, netdev, tcpdump-workers, Francesco Ruggeri
In-Reply-To: <CAOxq_8PsXb-oUy85Eji7GSAbUZD_Bds8skmGNP4BWSewQWx8PA@mail.gmail.com>

Ani Sinha <ani@aristanetworks.com> writes:

> cc'ing netdev.
>
>
> On Wed, Oct 31, 2012 at 2:01 PM, Michael Richardson <mcr@sandelman.ca> wrote:
>>
>> Thanks for this email.
>>
>>>>>>> "Ani" == Ani Sinha <ani@aristanetworks.com> writes:
>>     Ani> remove "inline" from vlan_core.c functions
>>     Ani> Signed-off-by: David S. Miller <davem@davemloft.net>
>>
>>     Ani> As a result of this patch, with or without HW acceleration support in
>>     Ani> the kernel driver, the vlan tag information is pulled out from the
>>     Ani> packet and is inserted in the skb metadata. Thereafter, the vlan tag
>>     Ani> information for an 802.1q packet can be obtained by applications
>>     Ani> running in the user space by using the auxdata and cmsg
>>     Ani> structure.
>>
>> Do you think that the existence of this behaviour could be exposed in
>> sysctl, /proc/net or /sys equivalent (we still don't have /sys/net...)?
>> As a read only file that had a 0/1 in it?
>
> yes, we definitely need a run time check. Whether this could be in the
> form of a socket option or a /proc entry I don't know.

I don't see any need to add any kernel code to allow checking if vlan
tags are stripped.  Vlan headers are stripped on all kernel interfaces
today.  Vlan headers have been stripped on all but a handful of software
interfaces for 6+ years.  For all kernels if the vlan header is stripped
it is reported in the auxdata, upon packet reception.  Careful code
should also look for TP_STATUS_VLAN_VALID which allows for
distinguishing a stripped vlan header of 0 from no vlan header.
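
A minimal sketch of that check, assuming the caller has already pulled a
struct tpacket_auxdata out of the PACKET_AUXDATA control message. The
helper name vlan_id_from_auxdata() is made up for illustration; only the
struct and the TP_STATUS_VLAN_VALID flag come from linux/if_packet.h.

```c
#include <linux/if_packet.h>	/* struct tpacket_auxdata, TP_STATUS_VLAN_VALID */
#include <stdint.h>

/*
 * Hypothetical helper: returns 1 and stores the 12-bit VLAN ID in *vid when
 * the auxdata describes a stripped VLAN tag, and 0 when there was no tag.
 */
static int vlan_id_from_auxdata(const struct tpacket_auxdata *aux, uint16_t *vid)
{
	/*
	 * A TCI of 0 is ambiguous on its own: it is either "no tag" or a
	 * stripped tag whose value is 0. TP_STATUS_VLAN_VALID tells them apart.
	 */
	if (aux->tp_vlan_tci == 0 && !(aux->tp_status & TP_STATUS_VLAN_VALID))
		return 0;

	*vid = aux->tp_vlan_tci & 0x0fff;	/* low 12 bits of the TCI are the VLAN ID */
	return 1;
}
```

A nonzero TCI is treated as a tag even without the flag, so the same
helper also gives a usable answer when the status bit is not set.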

The safe assumption then is that testing for vlan headers and vlan
values in bpf filters is not possible without the new bpf extensions.

It is possible to test for the presence of support of the new vlan bpf
extensions by attempting to load a filter that uses them.  As only valid
filters can be loaded, old kernels that do not support filtering of vlan
tags will fail to load a test filter that uses them.

For old kernels that do not support the new extensions it is possible to
generate code that looks at the ethernet header and sees if the
ethertype is 0x8100 and then does things with it, but that will only
work on a small handful of software only interfaces.

Eric

^ permalink raw reply

* [PATCH 14/14] wlcore: Remove redundant check on unsigned variable
From: Tushar Behera @ 2012-11-16  6:50 UTC (permalink / raw)
  To: linux-kernel; +Cc: patches, Luciano Coelho, linux-wireless, netdev
In-Reply-To: <1353048646-10935-1-git-send-email-tushar.behera@linaro.org>

No need to check whether unsigned variable is less than 0.

CC: Luciano Coelho <coelho@ti.com>
CC: linux-wireless@vger.kernel.org
CC: netdev@vger.kernel.org
Signed-off-by: Tushar Behera <tushar.behera@linaro.org>
---
 drivers/net/wireless/ti/wlcore/debugfs.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/wireless/ti/wlcore/debugfs.c b/drivers/net/wireless/ti/wlcore/debugfs.c
index c86bb00..93f801d 100644
--- a/drivers/net/wireless/ti/wlcore/debugfs.c
+++ b/drivers/net/wireless/ti/wlcore/debugfs.c
@@ -993,7 +993,7 @@ static ssize_t sleep_auth_write(struct file *file,
 		return -EINVAL;
 	}
 
-	if (value < 0 || value > WL1271_PSM_MAX) {
+	if (value > WL1271_PSM_MAX) {
 		wl1271_warning("sleep_auth must be between 0 and %d",
 			       WL1271_PSM_MAX);
 		return -ERANGE;
-- 
1.7.4.1

^ permalink raw reply related

* [PATCH 10/14] atm: Removed redundant check on unsigned variable
From: Tushar Behera @ 2012-11-16  6:50 UTC (permalink / raw)
  To: linux-kernel; +Cc: patches, Chas Williams, linux-atm-general, netdev
In-Reply-To: <1353048646-10935-1-git-send-email-tushar.behera@linaro.org>

No need to check whether unsigned variable is less than 0.

CC: Chas Williams <chas@cmf.nrl.navy.mil>
CC: linux-atm-general@lists.sourceforge.net
CC: netdev@vger.kernel.org
Signed-off-by: Tushar Behera <tushar.behera@linaro.org>
---
 drivers/atm/fore200e.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/atm/fore200e.c b/drivers/atm/fore200e.c
index 361f5ae..fdd3fe7 100644
--- a/drivers/atm/fore200e.c
+++ b/drivers/atm/fore200e.c
@@ -972,7 +972,7 @@ int bsq_audit(int where, struct host_bsq* bsq, int scheme, int magn)
 		   where, scheme, magn, buffer->index, buffer->scheme);
 	}
 
-	if ((buffer->index < 0) || (buffer->index >= fore200e_rx_buf_nbr[ scheme ][ magn ])) {
+	if (buffer->index >= fore200e_rx_buf_nbr[ scheme ][ magn ]) {
 	    printk(FORE200E "bsq_audit(%d): queue %d.%d, out of range buffer index = %ld !\n",
 		   where, scheme, magn, buffer->index);
 	}
-- 
1.7.4.1

^ permalink raw reply related

* [PATCH 08/14] xen: netback: Remove redundant check on unsigned variable
From: Tushar Behera @ 2012-11-16  6:50 UTC (permalink / raw)
  To: linux-kernel; +Cc: patches, Ian Campbell, xen-devel, netdev
In-Reply-To: <1353048646-10935-1-git-send-email-tushar.behera@linaro.org>

No need to check whether unsigned variable is less than 0.

CC: Ian Campbell <ian.campbell@citrix.com>
CC: xen-devel@lists.xensource.com
CC: netdev@vger.kernel.org
Signed-off-by: Tushar Behera <tushar.behera@linaro.org>
---
 drivers/net/xen-netback/netback.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index aab8677..515e10c 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -190,14 +190,14 @@ static int get_page_ext(struct page *pg,
 
 	group = ext.e.group - 1;
 
-	if (group < 0 || group >= xen_netbk_group_nr)
+	if (group >= xen_netbk_group_nr)
 		return 0;
 
 	netbk = &xen_netbk[group];
 
 	idx = ext.e.idx;
 
-	if ((idx < 0) || (idx >= MAX_PENDING_REQS))
+	if (idx >= MAX_PENDING_REQS)
 		return 0;
 
 	if (netbk->mmap_pages[idx] != pg)
-- 
1.7.4.1

^ permalink raw reply related

* [PATCH 00/14] Modify signed comparisons of unsigned variables
From: Tushar Behera @ 2012-11-16  6:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: patches, Mauro Carvalho Chehab, Linus Walleij, Ian Campbell,
	Konrad Rzeszutek Wilk, Jeremy Fitzhardinge, Chas Williams,
	Jack Steiner, Arnd Bergmann, Luciano Coelho, Jiri Kosina,
	ivtv-devel, linux-media, xen-devel, netdev, virtualization,
	linux-atm-general, linux-usb, linux-input, linux-wireless

The occurrences were identified through the coccinelle script at the
following location.

http://www.emn.fr/z-info/coccinelle/rules/find_unsigned.cocci

Signed checks on unsigned variables are removed if the variable is also
checked against an upper error limit. For error checks, the IS_ERR_VALUE()
macro is used.
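
As an illustration of why the upper-limit check alone is enough, here is
a small userspace sketch modelled loosely on the netback case. GROUP_NR
and group_is_valid() are hypothetical names, not code from any of the
patches.

```c
#define GROUP_NR 4u	/* hypothetical stand-in for the driver's upper limit */

/* 'raw' is stored 1-based, so 0 means "no group". */
static int group_is_valid(unsigned int raw)
{
	unsigned int group = raw - 1;	/* raw == 0 wraps around to UINT_MAX */

	/*
	 * "group < 0" can never be true for an unsigned type. The wrapped
	 * value is far above GROUP_NR, so the upper bound rejects it.
	 */
	return group < GROUP_NR;
}
```

With these assumptions, group_is_valid(0) and group_is_valid(5) are
rejected while 1 through 4 are accepted.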

Tushar Behera (14):
  [media] ivtv: Remove redundant check on unsigned variable
  [media] meye: Remove redundant check on unsigned variable
  [media] saa7134: Remove redundant check on unsigned variable
  [media] tlg2300: Remove redundant check on unsigned variable
  [media] atmel-isi: Update error check for unsigned variables
  pinctrl: samsung: Update error check for unsigned variables
  pinctrl: SPEAr: Update error check for unsigned variables
  xen: netback: Remove redundant check on unsigned variable
  xen: events: Remove redundant check on unsigned variable
  atm: Removed redundant check on unsigned variable
  HID: hiddev: Remove redundant check on unsigned variable
  gru: Remove redundant check on unsigned variable
  misc: tsl2550: Remove redundant check on unsigned variable
  wlcore: Remove redundant check on unsigned variable

 drivers/atm/fore200e.c                        |    2 +-
 drivers/hid/usbhid/hiddev.c                   |    2 +-
 drivers/media/pci/ivtv/ivtv-ioctl.c           |    2 +-
 drivers/media/pci/meye/meye.c                 |    2 +-
 drivers/media/pci/saa7134/saa7134-video.c     |    2 +-
 drivers/media/platform/soc_camera/atmel-isi.c |    2 +-
 drivers/media/usb/tlg2300/pd-video.c          |    2 +-
 drivers/misc/sgi-gru/grukdump.c               |    2 +-
 drivers/misc/tsl2550.c                        |    4 ++--
 drivers/net/wireless/ti/wlcore/debugfs.c      |    2 +-
 drivers/net/xen-netback/netback.c             |    4 ++--
 drivers/pinctrl/pinctrl-samsung.c             |    2 +-
 drivers/pinctrl/spear/pinctrl-plgpio.c        |    2 +-
 drivers/xen/events.c                          |    2 +-
 14 files changed, 16 insertions(+), 16 deletions(-)

-- 
1.7.4.1

CC: Mauro Carvalho Chehab <mchehab@infradead.org>
CC: Linus Walleij <linus.walleij@linaro.org>
CC: Ian Campbell <ian.campbell@citrix.com>
CC: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
CC: Chas Williams <chas@cmf.nrl.navy.mil>
CC: Jack Steiner <steiner@sgi.com>
CC: Arnd Bergmann <arnd@arndb.de>
CC: Luciano Coelho <coelho@ti.com>
CC: Jiri Kosina <jkosina@suse.cz>
CC: ivtv-devel@ivtvdriver.org
CC: linux-media@vger.kernel.org
CC: xen-devel@lists.xensource.com
CC: netdev@vger.kernel.org
CC: virtualization@lists.linux-foundation.org
CC: linux-atm-general@lists.sourceforge.net
CC: linux-usb@vger.kernel.org
CC: linux-input@vger.kernel.org
CC: linux-wireless@vger.kernel.org

^ permalink raw reply

* Re: [PATCH net-next] ipv6: export IP6_RT_PRIO_* to userland
From: David Miller @ 2012-11-16  6:48 UTC (permalink / raw)
  To: nicolas.dichtel; +Cc: netdev
In-Reply-To: <5098127A.2040405@6wind.com>

From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Date: Mon, 05 Nov 2012 20:24:42 +0100

> In IPv4, there is no such default metric. If you add a route with
> metric X, it remains X in the kernel, even if it's 0.

Ok, fair enough, applied to net-next, thanks.

^ permalink raw reply

* Re: [PATCH] tilegx: request_irq with a non-null device name
From: David Miller @ 2012-11-16  6:40 UTC (permalink / raw)
  To: simon.marchi; +Cc: netdev, linux-kernel, cmetcalf
In-Reply-To: <1353039199-12520-1-git-send-email-simon.marchi@polymtl.ca>

From: Simon Marchi <simon.marchi@polymtl.ca>
Date: Thu, 15 Nov 2012 23:13:19 -0500

> This patch simply makes the tilegx net driver call request_irq with a
> non-null name. It makes the output in /proc/interrupts more obvious, but
> also helps tools that don't expect to find null there.
> 
> Signed-off-by: Simon Marchi <simon.marchi@polymtl.ca>
> Acked-by: Chris Metcalf <cmetcalf@tilera.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH] tcp: handle tcp_net_metrics_init() order-5 memory allocation failures
From: David Miller @ 2012-11-16  6:39 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, jln
In-Reply-To: <1353022864.10798.6.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 15 Nov 2012 15:41:04 -0800

> From: Eric Dumazet <edumazet@google.com>
> 
> order-5 allocations can fail with current kernels, we should
> try to reduce allocation sizes to allow network namespace
> creation.
> 
> Reported-by: Julien Tinnes <jln@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Indeed, this has to be done better.

But this kind of retry solution results in non-deterministic behavior.
Yes the tcp metrics cache is best effort, but its size can influence
behavior in a substantial way depending upon the workload.

I would suggest that we instead use different limits, ones which the
page allocator will satisfy for us always with GFP_KERNEL.

1) include linux/mmzone.h

2) Make the two limits based upon PAGE_ALLOC_COSTLY_ORDER.

That is, make the larger table size PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER
and the smaller one PAGE_SIZE << (PAGE_ALLOC_COSTLY_ORDER - 1).
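
To put numbers on that, here is a userspace sketch of the arithmetic. It
assumes 4 KiB pages, PAGE_ALLOC_COSTLY_ORDER == 3, and a hash bucket that
is a single pointer; the SKETCH_* names and slots_for() are made up for
illustration.

```c
#include <stddef.h>

/* Assumed values; the kernel takes the real ones from its own headers. */
#define SKETCH_PAGE_SIZE	4096UL
#define SKETCH_COSTLY_ORDER	3		/* assumed PAGE_ALLOC_COSTLY_ORDER */
#define SKETCH_BUCKET_SIZE	sizeof(void *)	/* assumes a bucket is one pointer */

/* Number of hash slots for a machine with 'totalram_pages' pages of RAM. */
static unsigned long slots_for(unsigned long totalram_pages)
{
	int order = SKETCH_COSTLY_ORDER;

	/* Smaller machines get a table one order smaller. */
	if (totalram_pages < 128 * 1024)
		order--;

	return (SKETCH_PAGE_SIZE << order) / SKETCH_BUCKET_SIZE;
}
```

Under those assumptions a 64-bit build gets 4096 slots (32 KiB) on large
machines and 2048 slots (16 KiB) on small ones, where the old limits of
16 * 1024 and 8 * 1024 slots needed 128 KiB (order 5) and 64 KiB.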

How about something like this?

diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 53bc584..d4b2d42 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -1,7 +1,6 @@
 #include <linux/rcupdate.h>
 #include <linux/spinlock.h>
 #include <linux/jiffies.h>
-#include <linux/bootmem.h>
 #include <linux/module.h>
 #include <linux/cache.h>
 #include <linux/slab.h>
@@ -9,6 +8,7 @@
 #include <linux/tcp.h>
 #include <linux/hash.h>
 #include <linux/tcp_metrics.h>
+#include <linux/mmzone.h>
 
 #include <net/inet_connection_sock.h>
 #include <net/net_namespace.h>
@@ -1025,10 +1025,12 @@ static int __net_init tcp_net_metrics_init(struct net *net)
 
 	slots = tcpmhash_entries;
 	if (!slots) {
-		if (totalram_pages >= 128 * 1024)
-			slots = 16 * 1024;
-		else
-			slots = 8 * 1024;
+		int order = PAGE_ALLOC_COSTLY_ORDER;
+
+		if (totalram_pages < 128 * 1024)
+			order--;
+		slots = (PAGE_SIZE << order) /
+			sizeof(struct tcpm_hash_bucket);
 	}
 
 	net->ipv4.tcp_metrics_hash_log = order_base_2(slots);

^ permalink raw reply related
