Netdev List
* [PATCH net 5/6] bnx2x: Fix 578xx link LED
From: Yaniv Rosner @ 2011-09-06  6:47 UTC (permalink / raw)
  To: davem; +Cc: eilong, netdev

Fix 1G link LED for the BCM578xx-SFI/KR.

Signed-off-by: Yaniv Rosner <yanivr@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
---
 drivers/net/bnx2x/bnx2x_link.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/net/bnx2x/bnx2x_link.c b/drivers/net/bnx2x/bnx2x_link.c
index 3428075..ba15bdc 100644
--- a/drivers/net/bnx2x/bnx2x_link.c
+++ b/drivers/net/bnx2x/bnx2x_link.c
@@ -5924,7 +5924,7 @@ int bnx2x_set_led(struct link_params *params,
 					(tmp | EMAC_LED_OVERRIDE));
 				/*
 				 * return here without enabling traffic
-				 * LED blink andsetting rate in ON mode.
+				 * LED blink and setting rate in ON mode.
 				 * In oper mode, enabling LED blink
 				 * and setting rate is needed.
 				 */
@@ -5936,7 +5936,11 @@ int bnx2x_set_led(struct link_params *params,
 			 * This is a work-around for HW issue found when link
 			 * is up in CL73
 			 */
-			REG_WR(bp, NIG_REG_LED_10G_P0 + port*4, 1);
+			if ((!CHIP_IS_E3(bp)) ||
+			    (CHIP_IS_E3(bp) &&
+			     mode == LED_MODE_ON))
+				REG_WR(bp, NIG_REG_LED_10G_P0 + port*4, 1);
+
 			if (CHIP_IS_E1x(bp) ||
 			    CHIP_IS_E2(bp) ||
 			    (mode == LED_MODE_ON))
-- 
1.7.1

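The new guard in the second hunk can be read as a small truth table. The sketch below is only an illustration: the `led_mode` enum and the `is_e3` flag are hypothetical stand-ins for the driver's `LED_MODE_*` values and `CHIP_IS_E3(bp)`. It shows that the condition as written in the patch is logically the same as "not E3, or mode is ON".

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the driver's LED modes. */
enum led_mode { LED_MODE_OFF, LED_MODE_ON, LED_MODE_OPER };

/* The condition exactly as it is spelled in the patch. */
static bool write_10g_led_patch(bool is_e3, enum led_mode mode)
{
    return !is_e3 || (is_e3 && mode == LED_MODE_ON);
}

/* Logically equivalent shorter form: the inner is_e3 test is redundant. */
static bool write_10g_led_short(bool is_e3, enum led_mode mode)
{
    return !is_e3 || mode == LED_MODE_ON;
}
```

On an E3 chip the 10G LED register is therefore written only in ON mode, while older chips keep the previous unconditional write.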

* [PATCH net 4/6] bnx2x: Fix XMAC loopback test
From: Yaniv Rosner @ 2011-09-06  6:47 UTC (permalink / raw)
  To: davem; +Cc: eilong, netdev

Change the XMAC loopback type from CORE LOCAL to LINE LOCAL for the BCM578xx, due to an intermittent loopback problem with the CORE LOCAL configuration.

Signed-off-by: Yaniv Rosner <yanivr@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
---
 drivers/net/bnx2x/bnx2x_link.c |    2 +-
 drivers/net/bnx2x/bnx2x_reg.h  |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/bnx2x/bnx2x_link.c b/drivers/net/bnx2x/bnx2x_link.c
index db5913d..3428075 100644
--- a/drivers/net/bnx2x/bnx2x_link.c
+++ b/drivers/net/bnx2x/bnx2x_link.c
@@ -1720,7 +1720,7 @@ static int bnx2x_xmac_enable(struct link_params *params,
 
 	/* Check loopback mode */
 	if (lb)
-		val |= XMAC_CTRL_REG_CORE_LOCAL_LPBK;
+		val |= XMAC_CTRL_REG_LINE_LOCAL_LPBK;
 	REG_WR(bp, xmac_base + XMAC_REG_CTRL, val);
 	bnx2x_set_xumac_nig(params,
 			    ((vars->flow_ctrl & BNX2X_FLOW_CTRL_TX) != 0), 1);
diff --git a/drivers/net/bnx2x/bnx2x_reg.h b/drivers/net/bnx2x/bnx2x_reg.h
index 2bfedde..687ee12 100644
--- a/drivers/net/bnx2x/bnx2x_reg.h
+++ b/drivers/net/bnx2x/bnx2x_reg.h
@@ -5320,7 +5320,7 @@
 #define XCM_REG_XX_OVFL_EVNT_ID 				 0x20058
 #define XMAC_CLEAR_RX_LSS_STATUS_REG_CLEAR_LOCAL_FAULT_STATUS	 (0x1<<0)
 #define XMAC_CLEAR_RX_LSS_STATUS_REG_CLEAR_REMOTE_FAULT_STATUS	 (0x1<<1)
-#define XMAC_CTRL_REG_CORE_LOCAL_LPBK				 (0x1<<3)
+#define XMAC_CTRL_REG_LINE_LOCAL_LPBK				 (0x1<<2)
 #define XMAC_CTRL_REG_RX_EN					 (0x1<<1)
 #define XMAC_CTRL_REG_SOFT_RESET				 (0x1<<6)
 #define XMAC_CTRL_REG_TX_EN					 (0x1<<0)
-- 
1.7.1


* [PATCH net 6/6] bnx2x: Fix ethtool advertisement
From: Yaniv Rosner @ 2011-09-06  6:47 UTC (permalink / raw)
  To: davem; +Cc: eilong, netdev

Enable changing advertisement settings via ethtool and fix flow-control advertisement when autoneg flow-control is disabled.

Signed-off-by: Yaniv Rosner <yanivr@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
---
 drivers/net/bnx2x/bnx2x_ethtool.c |   45 +++++++++++++++++++++++++++++++++---
 drivers/net/bnx2x/bnx2x_main.c    |    6 +++++
 2 files changed, 47 insertions(+), 4 deletions(-)

diff --git a/drivers/net/bnx2x/bnx2x_ethtool.c b/drivers/net/bnx2x/bnx2x_ethtool.c
index 2218630..6aa94d3 100644
--- a/drivers/net/bnx2x/bnx2x_ethtool.c
+++ b/drivers/net/bnx2x/bnx2x_ethtool.c
@@ -363,13 +363,50 @@ static int bnx2x_set_settings(struct net_device *dev, struct ethtool_cmd *cmd)
 		}
 
 		/* advertise the requested speed and duplex if supported */
-		cmd->advertising &= bp->port.supported[cfg_idx];
+		if (cmd->advertising & ~(bp->port.supported[cfg_idx])) {
+			DP(NETIF_MSG_LINK, "Advertisement parameters "
+					   "are not supported\n");
+			return -EINVAL;
+		}
 
 		bp->link_params.req_line_speed[cfg_idx] = SPEED_AUTO_NEG;
-		bp->link_params.req_duplex[cfg_idx] = DUPLEX_FULL;
-		bp->port.advertising[cfg_idx] |= (ADVERTISED_Autoneg |
+		bp->link_params.req_duplex[cfg_idx] = cmd->duplex;
+		bp->port.advertising[cfg_idx] = (ADVERTISED_Autoneg |
 					 cmd->advertising);
+		if (cmd->advertising) {
+
+			bp->link_params.speed_cap_mask[cfg_idx] = 0;
+			if (cmd->advertising & ADVERTISED_10baseT_Half) {
+				bp->link_params.speed_cap_mask[cfg_idx] |=
+				PORT_HW_CFG_SPEED_CAPABILITY_D0_10M_HALF;
+			}
+			if (cmd->advertising & ADVERTISED_10baseT_Full)
+				bp->link_params.speed_cap_mask[cfg_idx] |=
+				PORT_HW_CFG_SPEED_CAPABILITY_D0_10M_FULL;
 
+			if (cmd->advertising & ADVERTISED_100baseT_Full)
+				bp->link_params.speed_cap_mask[cfg_idx] |=
+				PORT_HW_CFG_SPEED_CAPABILITY_D0_100M_FULL;
+
+			if (cmd->advertising & ADVERTISED_100baseT_Half) {
+				bp->link_params.speed_cap_mask[cfg_idx] |=
+				     PORT_HW_CFG_SPEED_CAPABILITY_D0_100M_HALF;
+			}
+			if (cmd->advertising & ADVERTISED_1000baseT_Half) {
+				bp->link_params.speed_cap_mask[cfg_idx] |=
+					PORT_HW_CFG_SPEED_CAPABILITY_D0_1G;
+			}
+			if (cmd->advertising & (ADVERTISED_1000baseT_Full |
+						ADVERTISED_1000baseKX_Full))
+				bp->link_params.speed_cap_mask[cfg_idx] |=
+					PORT_HW_CFG_SPEED_CAPABILITY_D0_1G;
+
+			if (cmd->advertising & (ADVERTISED_10000baseT_Full |
+						ADVERTISED_10000baseKX4_Full |
+						ADVERTISED_10000baseKR_Full))
+				bp->link_params.speed_cap_mask[cfg_idx] |=
+					PORT_HW_CFG_SPEED_CAPABILITY_D0_10G;
+		}
 	} else { /* forced speed */
 		/* advertise the requested speed and duplex if supported */
 		switch (speed) {
diff --git a/drivers/net/bnx2x/bnx2x_main.c b/drivers/net/bnx2x/bnx2x_main.c
index f74582a..42c7be1 100644
--- a/drivers/net/bnx2x/bnx2x_main.c
+++ b/drivers/net/bnx2x/bnx2x_main.c
@@ -2125,6 +2125,12 @@ static int bnx2x_set_spio(struct bnx2x *bp, int spio_num, u32 mode)
 void bnx2x_calc_fc_adv(struct bnx2x *bp)
 {
 	u8 cfg_idx = bnx2x_get_link_cfg_idx(bp);
+	if (bp->link_params.req_flow_ctrl[cfg_idx] != BNX2X_FLOW_CTRL_AUTO) {
+		bp->port.advertising[cfg_idx] &= ~(ADVERTISED_Asym_Pause |
+						   ADVERTISED_Pause);
+		return;
+	}
+
 	switch (bp->link_vars.ieee_fc &
 		MDIO_COMBO_IEEE0_AUTO_NEG_ADV_PAUSE_MASK) {
 	case MDIO_COMBO_IEEE0_AUTO_NEG_ADV_PAUSE_NONE:
-- 
1.7.1

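The autoneg branch of the patch above does two things: it rejects any advertised mode outside the supported set, and it rebuilds the speed capability mask from the advertised bits. A minimal table-driven sketch of that logic follows. All `ADV_*` and `CAP_*` values and the function name are hypothetical; the real `ADVERTISED_*` and `PORT_HW_CFG_SPEED_CAPABILITY_*` constants live in the kernel and driver headers.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit values only, not the kernel's. */
#define ADV_10_HALF   (1u << 0)
#define ADV_10_FULL   (1u << 1)
#define ADV_100_HALF  (1u << 2)
#define ADV_100_FULL  (1u << 3)
#define ADV_1G_HALF   (1u << 4)
#define ADV_1G_FULL   (1u << 5)
#define ADV_10G_FULL  (1u << 6)

#define CAP_10M_HALF  (1u << 0)
#define CAP_10M_FULL  (1u << 1)
#define CAP_100M_HALF (1u << 2)
#define CAP_100M_FULL (1u << 3)
#define CAP_1G        (1u << 4)
#define CAP_10G       (1u << 5)

static const struct { uint32_t adv; uint32_t cap; } adv_map[] = {
    { ADV_10_HALF,               CAP_10M_HALF  },
    { ADV_10_FULL,               CAP_10M_FULL  },
    { ADV_100_HALF,              CAP_100M_HALF },
    { ADV_100_FULL,              CAP_100M_FULL },
    { ADV_1G_HALF | ADV_1G_FULL, CAP_1G        },
    { ADV_10G_FULL,              CAP_10G       },
};

/* Returns -EINVAL if an unsupported mode is requested. As in the patch,
 * *cap is only rewritten when at least one mode is advertised. */
static int adv_to_speed_cap(uint32_t advertising, uint32_t supported,
                            uint32_t *cap)
{
    if (advertising & ~supported)
        return -EINVAL;
    if (!advertising)
        return 0;

    *cap = 0;
    for (size_t i = 0; i < sizeof(adv_map) / sizeof(adv_map[0]); i++)
        if (advertising & adv_map[i].adv)
            *cap |= adv_map[i].cap;
    return 0;
}
```

The behavioral change relative to the old code is the early `-EINVAL`: previously unsupported bits were silently masked off instead of being reported to the caller.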

* Re: [PATCH 1/2] bridge: leave carrier on for empty bridge
From: Nicolas de Pesloüan @ 2011-09-06  6:52 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David S. Miller, netdev
In-Reply-To: <20110905105735.1b912715@nehalam.ftrdhcpuser.net>

On 05/09/2011 19:57, Stephen Hemminger wrote:
> The root cause of the problem is applications that don't deal with unresolved
> IPv6 addresses. I already had to solve this in our distribution for NTP in a
> not bridge related problem. It is better to fix the applications to understand
> IPv6 address semantics than to try and force bridge to behave in a way that
> is friendly to these applications.

Thanks for clarifying.

> The earlier mail said it is a problem with dnsmasq and radvd. Let's work on understanding
> if they need to be updated before jumping in with hacks to the bridge code.
>

I really support the idea of keeping the current behavior (assert carrier on br0 when at least one port
has carrier) and fixing the applications to wait for the IPv6 address to be checked (DAD) instead
of dying on bind() failure.

Thanks.

	Nicolas.

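The application-side fix suggested in the reply above (wait for DAD instead of dying on bind() failure) could look roughly like the retry loop below. This is a sketch under assumptions, not code from dnsmasq or radvd: it assumes that binding to a still-tentative IPv6 address fails with `EADDRNOTAVAIL`, and `bind_fn`, `bind_with_retry`, `fake_addr` and `fake_bind` are hypothetical names. The fake bind function only exists so the loop can be exercised without a network.

```c
#include <errno.h>

/* Hypothetical callback with the bind(2) convention:
 * 0 on success, -1 with errno set on failure. */
typedef int (*bind_fn)(void *ctx);

/* Retry while the address is not yet usable. A real daemon would sleep
 * or wait for a netlink address notification between attempts. */
static int bind_with_retry(bind_fn try_bind, void *ctx, int max_attempts)
{
    for (int i = 0; i < max_attempts; i++) {
        if (try_bind(ctx) == 0)
            return 0;
        if (errno != EADDRNOTAVAIL)
            return -1;          /* unrelated error: give up at once */
    }
    errno = EADDRNOTAVAIL;
    return -1;
}

/* Test double: the address stays tentative for a number of attempts. */
struct fake_addr { int tentative_tries; };

static int fake_bind(void *ctx)
{
    struct fake_addr *a = ctx;

    if (a->tentative_tries > 0) {
        a->tentative_tries--;
        errno = EADDRNOTAVAIL;
        return -1;
    }
    return 0;
}
```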

* Re: [patch net-next-2.6 v3] net: consolidate and fix ethtool_ops->get_settings calling
From: Bhanu Prakash Gollapudi @ 2011-09-06  6:52 UTC (permalink / raw)
  To: Zou, Yi
  Cc: amit.salecha-h88ZbnxC6KDQT0dZR+AlfA@public.gmane.org,
	bridge-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
	linux-mips-6z/3iImG2C8G8FEW9MqTrA@public.gmane.org,
	JBottomley-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	decot-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org,
	shemminger-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org,
	andy-QlMahl40kYEqcZcGjlUOXw@public.gmane.org,
	therbert-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org,
	eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org,
	fubar-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org,
	xiaosuo-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org,
	paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org,
	Duyck, Alexander H,
	mirq-linux-CoA6ZxLDdyEEUmgCuDUIdw@public.gmane.org, Ben Hutchings,
	greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org
In-Reply-To: <5A9BD224CEA58D4CB62235967D650C160A1B7D1C-osO9UTpF0UQd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>

On 9/5/2011 8:25 PM, Zou, Yi wrote:
>>
>> On Sat, 2011-09-03 at 15:34 +0200, Jiri Pirko wrote:
>>> This patch does several things:
>>> - introduces __ethtool_get_settings which is called from ethtool code
>> and
>>>    from drivers as well. Put ASSERT_RTNL there.
>>> - dev_ethtool_get_settings() is replaced by __ethtool_get_settings()
>>> - changes calling in drivers so rtnl locking is respected. In
>>>    iboe_get_rate was previously ->get_settings() called unlocked. This
>>>    fixes it. Also prb_calc_retire_blk_tmo() in af_packet.c had the same
>>>    problem. Also fixed by calling __dev_get_by_index() instead of
>>>    dev_get_by_index() and holding rtnl_lock for both calls.
>>> - introduces rtnl_lock in bnx2fc_vport_create() and fcoe_vport_create()
>>>    so bnx2fc_if_create() and fcoe_if_create() are called locked as they
>>>    are from other places.
>>> - use __ethtool_get_settings() in bonding code
>>>
>>> Signed-off-by: Jiri Pirko<jpirko-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> Reviewed-by: Ben Hutchings<bhutchings-s/n/eUQHGBpZroRs9YW3xA@public.gmane.org>  [except FCoE bits]
>>
>> Ben.
> FCoE bits look ok to me. Thanks,
>
> Reviewed-by: Yi Zou<yi.zou-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

bnx2fc changes look OK to me.
Reviewed-by: Bhanu Prakash Gollapudi <bprakash-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
>
>>
>> --
>> Ben Hutchings, Staff Engineer, Solarflare
>> Not speaking for my employer; that's the marketing department's job.
>> They asked us to note that Solarflare product names are trademarked.
>>


* [PATCH] ipv6: don't use inetpeer to store metrics for routes.
From: Yan, Zheng @ 2011-09-06  7:34 UTC (permalink / raw)
  To: netdev@vger.kernel.org; +Cc: davem@davemloft.net, eric.dumazet@gmail.com

The current IPv6 implementation uses inetpeer to store metrics for
routes. The problem with inetpeer is that it doesn't take the subnet
prefix length into consideration. If two routes have the same
address but different prefix lengths, they share the same inetpeer,
so changing the metrics of one route also affects the other. The
fix is to allocate separate metrics storage for each route.

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
---
 net/ipv6/route.c |   33 ++++++++++++++++++++++-----------
 1 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 9e69eb0..1250f90 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -104,6 +104,9 @@ static u32 *ipv6_cow_metrics(struct dst_entry *dst, unsigned long old)
 	struct inet_peer *peer;
 	u32 *p = NULL;
 
+	if (!(rt->dst.flags & DST_HOST))
+		return NULL;
+
 	if (!rt->rt6i_peer)
 		rt6_bind_peer(rt, 1);
 
@@ -252,6 +255,9 @@ static void ip6_dst_destroy(struct dst_entry *dst)
 	struct inet6_dev *idev = rt->rt6i_idev;
 	struct inet_peer *peer = rt->rt6i_peer;
 
+	if (!(rt->dst.flags & DST_HOST))
+		dst_destroy_metrics_generic(dst);
+
 	if (idev != NULL) {
 		rt->rt6i_idev = NULL;
 		in6_dev_put(idev);
@@ -723,9 +729,7 @@ static struct rt6_info *rt6_alloc_cow(const struct rt6_info *ort,
 			ipv6_addr_copy(&rt->rt6i_gateway, daddr);
 		}
 
-		rt->rt6i_dst.plen = 128;
 		rt->rt6i_flags |= RTF_CACHE;
-		rt->dst.flags |= DST_HOST;
 
 #ifdef CONFIG_IPV6_SUBTREES
 		if (rt->rt6i_src.plen && saddr) {
@@ -775,9 +779,7 @@ static struct rt6_info *rt6_alloc_clone(struct rt6_info *ort,
 	struct rt6_info *rt = ip6_rt_copy(ort, daddr);
 
 	if (rt) {
-		rt->rt6i_dst.plen = 128;
 		rt->rt6i_flags |= RTF_CACHE;
-		rt->dst.flags |= DST_HOST;
 		dst_set_neighbour(&rt->dst, neigh_clone(dst_get_neighbour_raw(&ort->dst)));
 	}
 	return rt;
@@ -1078,12 +1080,15 @@ struct dst_entry *icmp6_dst_alloc(struct net_device *dev,
 			neigh = NULL;
 	}
 
-	rt->rt6i_idev     = idev;
+	rt->dst.flags |= DST_HOST;
+	rt->dst.output  = ip6_output;
 	dst_set_neighbour(&rt->dst, neigh);
 	atomic_set(&rt->dst.__refcnt, 1);
-	ipv6_addr_copy(&rt->rt6i_dst.addr, addr);
 	dst_metric_set(&rt->dst, RTAX_HOPLIMIT, 255);
-	rt->dst.output  = ip6_output;
+
+	ipv6_addr_copy(&rt->rt6i_dst.addr, addr);
+	rt->rt6i_dst.plen = 128;
+	rt->rt6i_idev     = idev;
 
 	spin_lock_bh(&icmp6_dst_lock);
 	rt->dst.next = icmp6_dst_gc_list;
@@ -1261,6 +1266,14 @@ int ip6_route_add(struct fib6_config *cfg)
 	if (rt->rt6i_dst.plen == 128)
 	       rt->dst.flags |= DST_HOST;
 
+	if (!(rt->dst.flags & DST_HOST) && cfg->fc_mx) {
+		u32 *metrics = kzalloc(sizeof(u32) * RTAX_MAX, GFP_KERNEL);
+		if (!metrics) {
+			err = -ENOMEM;
+			goto out;
+		}
+		dst_init_metrics(&rt->dst, metrics, 0);
+	}
 #ifdef CONFIG_IPV6_SUBTREES
 	ipv6_addr_prefix(&rt->rt6i_src.addr, &cfg->fc_src, cfg->fc_src_len);
 	rt->rt6i_src.plen = cfg->fc_src_len;
@@ -1607,9 +1620,6 @@ void rt6_redirect(const struct in6_addr *dest, const struct in6_addr *src,
 	if (on_link)
 		nrt->rt6i_flags &= ~RTF_GATEWAY;
 
-	nrt->rt6i_dst.plen = 128;
-	nrt->dst.flags |= DST_HOST;
-
 	ipv6_addr_copy(&nrt->rt6i_gateway, (struct in6_addr*)neigh->primary_key);
 	dst_set_neighbour(&nrt->dst, neigh_clone(neigh));
 
@@ -1754,9 +1764,10 @@ static struct rt6_info *ip6_rt_copy(const struct rt6_info *ort,
 	if (rt) {
 		rt->dst.input = ort->dst.input;
 		rt->dst.output = ort->dst.output;
+		rt->dst.flags |= DST_HOST;
 
 		ipv6_addr_copy(&rt->rt6i_dst.addr, dest);
-		rt->rt6i_dst.plen = ort->rt6i_dst.plen;
+		rt->rt6i_dst.plen = 128;
 		dst_copy_metrics(&rt->dst, &ort->dst);
 		rt->dst.error = ort->dst.error;
 		rt->rt6i_idev = ort->rt6i_idev;
-- 
1.7.4.4

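The aliasing problem described in the commit message above can be reproduced with a toy model. The sketch below is not the kernel's inetpeer code; `struct route`, `struct peer`, `peer_metrics` and `route_metrics` are hypothetical. It contrasts a store keyed by address only with storage owned by each route.

```c
#include <stdint.h>
#include <string.h>

#define NMETRICS  4
#define MAX_PEERS 8

struct route {
    uint8_t  addr[16];
    int      plen;                   /* prefix length */
    uint32_t own_metrics[NMETRICS];  /* per-route storage (the fix) */
};

/* Shared store keyed by address only: the prefix length is ignored. */
struct peer {
    uint8_t  addr[16];
    uint32_t metrics[NMETRICS];
    int      used;
};

static struct peer peers[MAX_PEERS];

static uint32_t *peer_metrics(const struct route *rt)
{
    int free_slot = -1;

    for (int i = 0; i < MAX_PEERS; i++) {
        if (peers[i].used && memcmp(peers[i].addr, rt->addr, 16) == 0)
            return peers[i].metrics;
        if (!peers[i].used && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return NULL;
    peers[free_slot].used = 1;
    memcpy(peers[free_slot].addr, rt->addr, 16);
    return peers[free_slot].metrics;
}

static uint32_t *route_metrics(struct route *rt)
{
    return rt->own_metrics;
}
```

With two routes that share an address but differ in `plen`, a write through `peer_metrics()` is visible from both routes, while a write through `route_metrics()` is not. In the patch itself, host routes (`DST_HOST`, prefix length 128) keep using inetpeer and only the other routes get their own allocation.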

* [PATCH -next] drivers/net: Makefile, fix netconsole link order
From: Lin Ming @ 2011-09-06  8:35 UTC (permalink / raw)
  To: David S. Miller; +Cc: lkml, netdev, Jeff Kirsher

Commit 88491d8 (drivers/net: Kconfig & Makefile cleanup) causes a
regression: netconsole does not work if netconsole and the network
device driver are both built into the kernel, because netconsole is
linked before the network device driver.

Fix it by moving netconsole.o after the network device drivers.

Signed-off-by: Lin Ming <ming.m.lin@intel.com>
---
 drivers/net/Makefile |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index fa877cd..ec15311 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -14,7 +14,6 @@ obj-$(CONFIG_MACVTAP) += macvtap.o
 obj-$(CONFIG_MII) += mii.o
 obj-$(CONFIG_MDIO) += mdio.o
 obj-$(CONFIG_NET) += Space.o loopback.o
-obj-$(CONFIG_NETCONSOLE) += netconsole.o
 obj-$(CONFIG_PHYLIB) += phy/
 obj-$(CONFIG_RIONET) += rionet.o
 obj-$(CONFIG_TUN) += tun.o
@@ -66,3 +65,9 @@ obj-$(CONFIG_USB_USBNET)        += usb/
 obj-$(CONFIG_USB_ZD1201)        += usb/
 obj-$(CONFIG_USB_IPHETH)        += usb/
 obj-$(CONFIG_USB_CDC_PHONET)   += usb/
+
+#
+# If netconsole and network device driver are build-in,
+# netconsole must be linked after network device driver
+#
+obj-$(CONFIG_NETCONSOLE) += netconsole.o
-- 
1.7.2.5

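Why link order matters here can be shown with a toy model. The sketch assumes that built-in init functions of the same level run in the order their objects were linked, and that netconsole needs its target device to exist when its own init runs. All names below are hypothetical and none of this is kernel code.

```c
#include <stdbool.h>

static bool nic_registered;
static bool netconsole_ready;

/* Stand-in for a built-in network driver registering its device. */
static int nic_driver_init(void)
{
    nic_registered = true;
    return 0;
}

/* Stand-in for netconsole: it can only attach to an existing device. */
static int netconsole_init(void)
{
    netconsole_ready = nic_registered;
    return netconsole_ready ? 0 : -1;
}

typedef int (*initcall_t)(void);

/* Run the init functions in "link order". */
static void run_initcalls(const initcall_t *calls, int n)
{
    nic_registered = false;
    netconsole_ready = false;
    for (int i = 0; i < n; i++)
        calls[i]();
}
```

Running `netconsole_init` before `nic_driver_init` leaves netconsole unattached; the reverse order, which is what the Makefile change produces, works.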

* [PATCH net-next v5 5/5] r8169: support new chips of RTL8111F
From: Hayes Wang @ 2011-09-06  8:55 UTC (permalink / raw)
  To: romieu; +Cc: netdev, linux-kernel, Hayes Wang
In-Reply-To: <1315299318-1547-1-git-send-email-hayeswang@realtek.com>

Support new chips of RTL8111F.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
---
 drivers/net/ethernet/realtek/r8169.c |  180 +++++++++++++++++++++++++++++++++-
 1 files changed, 178 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 8cd305f..8e6a200 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -42,6 +42,8 @@
 #define FIRMWARE_8168E_1	"rtl_nic/rtl8168e-1.fw"
 #define FIRMWARE_8168E_2	"rtl_nic/rtl8168e-2.fw"
 #define FIRMWARE_8168E_3	"rtl_nic/rtl8168e-3.fw"
+#define FIRMWARE_8168F_1	"rtl_nic/rtl8168f-1.fw"
+#define FIRMWARE_8168F_2	"rtl_nic/rtl8168f-2.fw"
 #define FIRMWARE_8105E_1	"rtl_nic/rtl8105e-1.fw"
 
 #ifdef RTL8169_DEBUG
@@ -133,6 +135,8 @@ enum mac_version {
 	RTL_GIGA_MAC_VER_32,
 	RTL_GIGA_MAC_VER_33,
 	RTL_GIGA_MAC_VER_34,
+	RTL_GIGA_MAC_VER_35,
+	RTL_GIGA_MAC_VER_36,
 	RTL_GIGA_MAC_NONE   = 0xff,
 };
 
@@ -218,7 +222,11 @@ static const struct {
 	[RTL_GIGA_MAC_VER_33] =
 		_R("RTL8168e/8111e",	RTL_TD_1, FIRMWARE_8168E_2),
 	[RTL_GIGA_MAC_VER_34] =
-		_R("RTL8168evl/8111evl",RTL_TD_1, FIRMWARE_8168E_3)
+		_R("RTL8168evl/8111evl",RTL_TD_1, FIRMWARE_8168E_3),
+	[RTL_GIGA_MAC_VER_35] =
+		_R("RTL8168f/8111f",	RTL_TD_1, FIRMWARE_8168F_1),
+	[RTL_GIGA_MAC_VER_36] =
+		_R("RTL8168f/8111f",	RTL_TD_1, FIRMWARE_8168F_2)
 };
 #undef _R
 
@@ -713,6 +721,8 @@ MODULE_FIRMWARE(FIRMWARE_8168E_1);
 MODULE_FIRMWARE(FIRMWARE_8168E_2);
 MODULE_FIRMWARE(FIRMWARE_8168E_3);
 MODULE_FIRMWARE(FIRMWARE_8105E_1);
+MODULE_FIRMWARE(FIRMWARE_8168F_1);
+MODULE_FIRMWARE(FIRMWARE_8168F_2);
 
 static int rtl8169_open(struct net_device *dev);
 static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
@@ -1201,6 +1211,19 @@ static void rtl_link_chg_patch(struct rtl8169_private *tp)
 			     ERIAR_EXGMAC);
 		rtl_w1w0_eri(ioaddr, 0xdc, ERIAR_MASK_0001, 0x01, 0x00,
 			     ERIAR_EXGMAC);
+	} else if (tp->mac_version == RTL_GIGA_MAC_VER_35 ||
+		   tp->mac_version == RTL_GIGA_MAC_VER_36) {
+		if (RTL_R8(PHYstatus) & _1000bpsF) {
+			rtl_eri_write(ioaddr, 0x1bc, ERIAR_MASK_1111,
+				      0x00000011, ERIAR_EXGMAC);
+			rtl_eri_write(ioaddr, 0x1dc, ERIAR_MASK_1111,
+				      0x00000005, ERIAR_EXGMAC);
+		} else {
+			rtl_eri_write(ioaddr, 0x1bc, ERIAR_MASK_1111,
+				      0x0000001f, ERIAR_EXGMAC);
+			rtl_eri_write(ioaddr, 0x1dc, ERIAR_MASK_1111,
+				      0x0000003f, ERIAR_EXGMAC);
+		}
 	}
 }
 
@@ -1740,6 +1763,10 @@ static void rtl8169_get_mac_version(struct rtl8169_private *tp,
 		u32 val;
 		int mac_version;
 	} mac_info[] = {
+		/* 8168F family. */
+		{ 0x7cf00000, 0x48100000,	RTL_GIGA_MAC_VER_36 },
+		{ 0x7cf00000, 0x48000000,	RTL_GIGA_MAC_VER_35 },
+
 		/* 8168E family. */
 		{ 0x7c800000, 0x2c800000,	RTL_GIGA_MAC_VER_34 },
 		{ 0x7cf00000, 0x2c200000,	RTL_GIGA_MAC_VER_33 },
@@ -2874,6 +2901,97 @@ static void rtl8168e_2_hw_phy_config(struct rtl8169_private *tp)
 	rtl_writephy(tp, 0x1f, 0x0000);
 }
 
+static void rtl8168f_1_hw_phy_config(struct rtl8169_private *tp)
+{
+	static const struct phy_reg phy_reg_init[] = {
+		/* Channel estimation fine tune */
+		{ 0x1f, 0x0003 },
+		{ 0x09, 0xa20f },
+		{ 0x1f, 0x0000 },
+
+		/* Modify green table for giga & fnet */
+		{ 0x1f, 0x0005 },
+		{ 0x05, 0x8b55 },
+		{ 0x06, 0x0000 },
+		{ 0x05, 0x8b5e },
+		{ 0x06, 0x0000 },
+		{ 0x05, 0x8b67 },
+		{ 0x06, 0x0000 },
+		{ 0x05, 0x8b70 },
+		{ 0x06, 0x0000 },
+		{ 0x1f, 0x0000 },
+		{ 0x1f, 0x0007 },
+		{ 0x1e, 0x0078 },
+		{ 0x17, 0x0000 },
+		{ 0x19, 0x00fb },
+		{ 0x1f, 0x0000 },
+
+		/* Modify green table for 10M */
+		{ 0x1f, 0x0005 },
+		{ 0x05, 0x8b79 },
+		{ 0x06, 0xaa00 },
+		{ 0x1f, 0x0000 },
+
+		/* Disable hiimpedance detection (RTCT) */
+		{ 0x1f, 0x0003 },
+		{ 0x01, 0x328a },
+		{ 0x1f, 0x0000 }
+	};
+
+	rtl_apply_firmware(tp);
+
+	rtl_writephy_batch(tp, phy_reg_init, ARRAY_SIZE(phy_reg_init));
+
+	/* For 4-corner performance improve */
+	rtl_writephy(tp, 0x1f, 0x0005);
+	rtl_writephy(tp, 0x05, 0x8b80);
+	rtl_w1w0_phy(tp, 0x06, 0x0006, 0x0000);
+	rtl_writephy(tp, 0x1f, 0x0000);
+
+	/* PHY auto speed down */
+	rtl_writephy(tp, 0x1f, 0x0007);
+	rtl_writephy(tp, 0x1e, 0x002d);
+	rtl_w1w0_phy(tp, 0x18, 0x0010, 0x0000);
+	rtl_writephy(tp, 0x1f, 0x0000);
+	rtl_w1w0_phy(tp, 0x14, 0x8000, 0x0000);
+
+	/* improve 10M EEE waveform */
+	rtl_writephy(tp, 0x1f, 0x0005);
+	rtl_writephy(tp, 0x05, 0x8b86);
+	rtl_w1w0_phy(tp, 0x06, 0x0001, 0x0000);
+	rtl_writephy(tp, 0x1f, 0x0000);
+
+	/* Improve 2-pair detection performance */
+	rtl_writephy(tp, 0x1f, 0x0005);
+	rtl_writephy(tp, 0x05, 0x8b85);
+	rtl_w1w0_phy(tp, 0x06, 0x4000, 0x0000);
+	rtl_writephy(tp, 0x1f, 0x0000);
+}
+
+static void rtl8168f_2_hw_phy_config(struct rtl8169_private *tp)
+{
+	rtl_apply_firmware(tp);
+
+	/* For 4-corner performance improve */
+	rtl_writephy(tp, 0x1f, 0x0005);
+	rtl_writephy(tp, 0x05, 0x8b80);
+	rtl_w1w0_phy(tp, 0x06, 0x0006, 0x0000);
+	rtl_writephy(tp, 0x1f, 0x0000);
+
+	/* PHY auto speed down */
+	rtl_writephy(tp, 0x1f, 0x0007);
+	rtl_writephy(tp, 0x1e, 0x002d);
+	rtl_w1w0_phy(tp, 0x18, 0x0010, 0x0000);
+	rtl_writephy(tp, 0x1f, 0x0000);
+	rtl_w1w0_phy(tp, 0x14, 0x8000, 0x0000);
+
+	/* improve 10M EEE waveform */
+	rtl_writephy(tp, 0x1f, 0x0005);
+	rtl_writephy(tp, 0x05, 0x8b86);
+	rtl_w1w0_phy(tp, 0x06, 0x0001, 0x0000);
+	rtl_writephy(tp, 0x1f, 0x0000);
+}
+
 static void rtl8102e_hw_phy_config(struct rtl8169_private *tp)
 {
 	static const struct phy_reg phy_reg_init[] = {
@@ -2998,6 +3116,12 @@ static void rtl_hw_phy_config(struct net_device *dev)
 	case RTL_GIGA_MAC_VER_34:
 		rtl8168e_2_hw_phy_config(tp);
 		break;
+	case RTL_GIGA_MAC_VER_35:
+		rtl8168f_1_hw_phy_config(tp);
+		break;
+	case RTL_GIGA_MAC_VER_36:
+		rtl8168f_2_hw_phy_config(tp);
+		break;
 
 	default:
 		break;
@@ -3525,6 +3649,8 @@ static void __devinit rtl_init_pll_power_ops(struct rtl8169_private *tp)
 	case RTL_GIGA_MAC_VER_32:
 	case RTL_GIGA_MAC_VER_33:
 	case RTL_GIGA_MAC_VER_34:
+	case RTL_GIGA_MAC_VER_35:
+	case RTL_GIGA_MAC_VER_36:
 		ops->down	= r8168_pll_power_down;
 		ops->up		= r8168_pll_power_up;
 		break;
@@ -3997,7 +4123,9 @@ static void rtl8169_hw_reset(struct rtl8169_private *tp)
 	    tp->mac_version == RTL_GIGA_MAC_VER_31) {
 		while (RTL_R8(TxPoll) & NPQ)
 			udelay(20);
-	} else if (tp->mac_version == RTL_GIGA_MAC_VER_34) {
+	} else if (tp->mac_version == RTL_GIGA_MAC_VER_34 ||
+		   tp->mac_version == RTL_GIGA_MAC_VER_35 ||
+		   tp->mac_version == RTL_GIGA_MAC_VER_36) {
 		RTL_W8(ChipCmd, RTL_R8(ChipCmd) | StopReq);
 		while (!(RTL_R32(TxConfig) & TXCFG_EMPTY))
 			udelay(100);
@@ -4483,6 +4611,49 @@ static void rtl_hw_start_8168e_2(void __iomem *ioaddr, struct pci_dev *pdev)
 	RTL_W8(Config5, RTL_R8(Config5) & ~Spi_en);
 }
 
+static void rtl_hw_start_8168f_1(void __iomem *ioaddr, struct pci_dev *pdev)
+{
+	static const struct ephy_info e_info_8168f_1[] = {
+		{ 0x06, 0x00c0,	0x0020 },
+		{ 0x08, 0x0001,	0x0002 },
+		{ 0x09, 0x0000,	0x0080 },
+		{ 0x19, 0x0000,	0x0224 }
+	};
+
+	rtl_csi_access_enable_1(ioaddr);
+
+	rtl_ephy_init(ioaddr, e_info_8168f_1, ARRAY_SIZE(e_info_8168f_1));
+
+	rtl_tx_performance_tweak(pdev, 0x5 << MAX_READ_REQUEST_SHIFT);
+
+	rtl_eri_write(ioaddr, 0xc0, ERIAR_MASK_0011, 0x0000, ERIAR_EXGMAC);
+	rtl_eri_write(ioaddr, 0xb8, ERIAR_MASK_0011, 0x0000, ERIAR_EXGMAC);
+	rtl_eri_write(ioaddr, 0xc8, ERIAR_MASK_1111, 0x00100002, ERIAR_EXGMAC);
+	rtl_eri_write(ioaddr, 0xe8, ERIAR_MASK_1111, 0x00100006, ERIAR_EXGMAC);
+	rtl_w1w0_eri(ioaddr, 0xdc, ERIAR_MASK_0001, 0x00, 0x01, ERIAR_EXGMAC);
+	rtl_w1w0_eri(ioaddr, 0xdc, ERIAR_MASK_0001, 0x01, 0x00, ERIAR_EXGMAC);
+	rtl_w1w0_eri(ioaddr, 0x1b0, ERIAR_MASK_0001, 0x10, 0x00, ERIAR_EXGMAC);
+	rtl_w1w0_eri(ioaddr, 0x1d0, ERIAR_MASK_0001, 0x10, 0x00, ERIAR_EXGMAC);
+	rtl_eri_write(ioaddr, 0xcc, ERIAR_MASK_1111, 0x00000050, ERIAR_EXGMAC);
+	rtl_eri_write(ioaddr, 0xd0, ERIAR_MASK_1111, 0x00000060, ERIAR_EXGMAC);
+	rtl_w1w0_eri(ioaddr, 0x0d4, ERIAR_MASK_0011, 0x0c00, 0xff00,
+		     ERIAR_EXGMAC);
+
+	RTL_W8(MaxTxPacketSize, EarlySize);
+
+	rtl_disable_clock_request(pdev);
+
+	RTL_W32(TxConfig, RTL_R32(TxConfig) | TXCFG_AUTO_FIFO);
+	RTL_W8(MCU, RTL_R8(MCU) & ~NOW_IS_OOB);
+
+	/* Adjust EEE LED frequency */
+	RTL_W8(EEE_LED, RTL_R8(EEE_LED) & ~0x07);
+
+	RTL_W8(DLLPR, RTL_R8(DLLPR) | PFM_EN);
+	RTL_W32(MISC, RTL_R32(MISC) | PWM_EN);
+	RTL_W8(Config5, RTL_R8(Config5) & ~Spi_en);
+}
+
 static void rtl_hw_start_8168(struct net_device *dev)
 {
 	struct rtl8169_private *tp = netdev_priv(dev);
@@ -4577,6 +4748,11 @@ static void rtl_hw_start_8168(struct net_device *dev)
 		rtl_hw_start_8168e_2(ioaddr, pdev);
 		break;
 
+	case RTL_GIGA_MAC_VER_35:
+	case RTL_GIGA_MAC_VER_36:
+		rtl_hw_start_8168f_1(ioaddr, pdev);
+		break;
+
 	default:
 		printk(KERN_ERR PFX "%s: unknown chipset (mac_version = %d).\n",
 			dev->name, tp->mac_version);
-- 
1.7.6

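The `mac_info` entries added by the patch above are (mask, value) pairs that are scanned in order, so more specific entries must come before broader ones. The sketch below reuses the four mask/value pairs visible in the hunk; the `VER_*` enum, the function name and the idea that the input is a single 32-bit register value are simplifying assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

enum { VER_33 = 33, VER_34, VER_35, VER_36, VER_NONE = 0xff };

struct mac_info {
    uint32_t mask;
    uint32_t val;
    int      ver;
};

/* Order matters: the first matching entry wins. */
static const struct mac_info mac_table[] = {
    /* 8168F family. */
    { 0x7cf00000, 0x48100000, VER_36 },
    { 0x7cf00000, 0x48000000, VER_35 },
    /* 8168E family. */
    { 0x7c800000, 0x2c800000, VER_34 },
    { 0x7cf00000, 0x2c200000, VER_33 },
};

static int lookup_mac_version(uint32_t reg)
{
    for (size_t i = 0; i < sizeof(mac_table) / sizeof(mac_table[0]); i++)
        if ((reg & mac_table[i].mask) == mac_table[i].val)
            return mac_table[i].ver;
    return VER_NONE;
}
```

Bits outside the mask are ignored, so `0x48100700` still resolves to the second 8168F variant.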

* [PATCH net-next v5 4/5] r8169: add MODULE_FIRMWARE for the firmware of 8111evl
From: Hayes Wang @ 2011-09-06  8:55 UTC (permalink / raw)
  To: romieu; +Cc: netdev, linux-kernel, Hayes Wang
In-Reply-To: <1315299318-1547-1-git-send-email-hayeswang@realtek.com>

Add MODULE_FIRMWARE for the firmware of RTL8111E-VL.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
---
 drivers/net/ethernet/realtek/r8169.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 175c769..8cd305f 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -711,6 +711,7 @@ MODULE_FIRMWARE(FIRMWARE_8168D_1);
 MODULE_FIRMWARE(FIRMWARE_8168D_2);
 MODULE_FIRMWARE(FIRMWARE_8168E_1);
 MODULE_FIRMWARE(FIRMWARE_8168E_2);
+MODULE_FIRMWARE(FIRMWARE_8168E_3);
 MODULE_FIRMWARE(FIRMWARE_8105E_1);
 
 static int rtl8169_open(struct net_device *dev);
-- 
1.7.6


* [PATCH net-next v5 1/5] r8169: fix WOL setting for 8105 and 8111EVL
From: Hayes Wang @ 2011-09-06  8:55 UTC (permalink / raw)
  To: romieu; +Cc: netdev, linux-kernel, Hayes Wang

rtl8105, rtl8111E, and rtl8111evl need RxConfig bits 1 to 3 enabled
to support Wake-on-LAN.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
---
 drivers/net/ethernet/realtek/r8169.c |   10 +++++++++-
 1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 1cf8c3c..aaae43e 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -3319,9 +3319,16 @@ static void r810x_phy_power_up(struct rtl8169_private *tp)
 
 static void r810x_pll_power_down(struct rtl8169_private *tp)
 {
+	void __iomem *ioaddr = tp->mmio_addr;
+
 	if (__rtl8169_get_wol(tp) & WAKE_ANY) {
 		rtl_writephy(tp, 0x1f, 0x0000);
 		rtl_writephy(tp, MII_BMCR, 0x0000);
+
+		if (tp->mac_version == RTL_GIGA_MAC_VER_29 ||
+		    tp->mac_version == RTL_GIGA_MAC_VER_30)
+			RTL_W32(RxConfig, RTL_R32(RxConfig) | AcceptBroadcast |
+				AcceptMulticast | AcceptMyPhys);
 		return;
 	}
 
@@ -3417,7 +3424,8 @@ static void r8168_pll_power_down(struct rtl8169_private *tp)
 		rtl_writephy(tp, MII_BMCR, 0x0000);
 
 		if (tp->mac_version == RTL_GIGA_MAC_VER_32 ||
-		    tp->mac_version == RTL_GIGA_MAC_VER_33)
+		    tp->mac_version == RTL_GIGA_MAC_VER_33 ||
+		    tp->mac_version == RTL_GIGA_MAC_VER_34)
 			RTL_W32(RxConfig, RTL_R32(RxConfig) | AcceptBroadcast |
 				AcceptMulticast | AcceptMyPhys);
 		return;
-- 
1.7.6

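The register update in the patch above is a plain read-modify-write that sets three accept bits. In the sketch below the bit positions are an assumption chosen to match "RxConfig bit 1 ~ 3" in the commit message; the macro and function names are hypothetical stand-ins for the driver's `AcceptMyPhys`, `AcceptMulticast` and `AcceptBroadcast`.

```c
#include <stdint.h>

/* Assumed positions, matching "RxConfig bit 1 ~ 3". */
#define ACCEPT_MY_PHYS   (1u << 1)
#define ACCEPT_MULTICAST (1u << 2)
#define ACCEPT_BROADCAST (1u << 3)

/* Value to write back to RxConfig before powering down with WOL armed.
 * All other bits of the register are preserved. */
static uint32_t rxconfig_for_wol(uint32_t rxconfig)
{
    return rxconfig | ACCEPT_BROADCAST | ACCEPT_MULTICAST | ACCEPT_MY_PHYS;
}
```

Without these bits the receiver would presumably drop the wake-up frame before the WOL logic could see it, which is consistent with the fix being applied on the power-down path.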

* [PATCH net-next v5 3/5] r8169: fix the reset setting for 8111evl
From: Hayes Wang @ 2011-09-06  8:55 UTC (permalink / raw)
  To: romieu; +Cc: netdev, linux-kernel, Hayes Wang
In-Reply-To: <1315299318-1547-1-git-send-email-hayeswang@realtek.com>

rtl8111evl should stop any pending TLP requests before resetting, by
setting bit 7 of register 0x37.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
---
 drivers/net/ethernet/realtek/r8169.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index db5ab2c..175c769 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -3997,6 +3997,7 @@ static void rtl8169_hw_reset(struct rtl8169_private *tp)
 		while (RTL_R8(TxPoll) & NPQ)
 			udelay(20);
 	} else if (tp->mac_version == RTL_GIGA_MAC_VER_34) {
+		RTL_W8(ChipCmd, RTL_R8(ChipCmd) | StopReq);
 		while (!(RTL_R32(TxConfig) & TXCFG_EMPTY))
 			udelay(100);
 	} else {
-- 
1.7.6

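The ordering the patch above establishes (request a stop first, then poll until the transmitter reports empty) can be modelled with a fake chip. Everything here is a sketch: `struct fake_chip`, the drain model and the position of the empty bit are assumptions, and the bounded poll count is an addition for the example, since the loop in the patch has no timeout.

```c
#include <stdbool.h>
#include <stdint.h>

#define STOP_REQ        (1u << 7)   /* bit 7 of register 0x37, per the commit */
#define TXCFG_EMPTY_BIT (1u << 11)  /* assumed position of the empty flag */

struct fake_chip {
    uint8_t  chip_cmd;
    uint32_t tx_config;
    int      pending;   /* queued requests still in flight */
};

/* Model: requests only drain once StopReq is set, one per poll. */
static uint32_t read_tx_config(struct fake_chip *c)
{
    if ((c->chip_cmd & STOP_REQ) && c->pending > 0)
        c->pending--;
    if (c->pending == 0)
        c->tx_config |= TXCFG_EMPTY_BIT;
    return c->tx_config;
}

/* Set StopReq, then poll for empty. Returns false if the poll budget
 * runs out before the chip reports empty. */
static bool stop_and_drain(struct fake_chip *c, int max_polls)
{
    c->chip_cmd |= STOP_REQ;
    for (int i = 0; i < max_polls; i++)
        if (read_tx_config(c) & TXCFG_EMPTY_BIT)
            return true;
    return false;
}
```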

* [PATCH net-next v5 2/5] r8169: define the early size for 8111evl
From: Hayes Wang @ 2011-09-06  8:55 UTC (permalink / raw)
  To: romieu; +Cc: netdev, linux-kernel, Hayes Wang
In-Reply-To: <1315299318-1547-1-git-send-email-hayeswang@realtek.com>

For RTL8111EVL, the MaxTxPacketSize register doesn't actually
limit the tx size. It influences the early tx feature.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
---
 drivers/net/ethernet/realtek/r8169.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index aaae43e..db5ab2c 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -311,6 +311,7 @@ enum rtl_registers {
 	MaxTxPacketSize	= 0xec,	/* 8101/8168. Unit of 128 bytes. */
 
 #define TxPacketMax	(8064 >> 7)
+#define EarlySize	0x27
 
 	FuncEvent	= 0xf0,
 	FuncEventMask	= 0xf4,
@@ -4465,7 +4466,7 @@ static void rtl_hw_start_8168e_2(void __iomem *ioaddr, struct pci_dev *pdev)
 	rtl_w1w0_eri(ioaddr, 0x0d4, ERIAR_MASK_0011, 0x0c00, 0xff00,
 		     ERIAR_EXGMAC);
 
-	RTL_W8(MaxTxPacketSize, 0x27);
+	RTL_W8(MaxTxPacketSize, EarlySize);
 
 	rtl_disable_clock_request(pdev);
 
-- 
1.7.6


* Re: FW: [PATCH] af_packet: flush complete kernel cache in packet_sendmsg
From: Phil Sutter @ 2011-09-06  9:44 UTC (permalink / raw)
  To: chetan loke; +Cc: netdev, linux, davem, linux-arm-kernel
In-Reply-To: <CAAsGZS6Qiyc6nhgyVLrphSLW6vf16=hbGVWJ+CFw6rfZQsdiFQ@mail.gmail.com>

On Fri, Sep 02, 2011 at 12:49:47PM -0400, chetan loke wrote:
> On Fri, Sep 2, 2011 at 11:31 AM, Phil Sutter <phil.sutter@viprinet.com> wrote:
> 
> > So far we haven't noticed problems in that direction. I just tried some
> > explicit test: having tcpdump print local timestamps (not the pcap-ones)
> > on every received packet, activating icmp_echo_ignore_all and pinging
> > the host on a dedicated line. I expected to sometimes see a second
> > difference between the two timestamps, as like with sending from time to
> > time a packet should get "lost" in the cache, and then occur to
> > userspace after the next one arrived. Maybe my test is broken, or RX is
> > indeed unaffected.
> >
> 
> You will need high traffic rate. If interested, you could try
> pktgen(with varying packet-load). Keep the packet-payload under 1500
> bytes (don't send jumbo frames) unless you have the following fix:
> commit cc9f01b246ca8e4fa245991840b8076394f86707

Hmm. I don't really get your point here: with higher traffic rates, the
bug should be even harder to identify. Assuming the same behaviour as
for TX, of course. There are no packets lost, just not immediately
transmitted (or never, if it's the last packet to be sent). This is how
it goes:
1) userspace places packet into TX_RING, calls sendto()
2) kernel does not see the packet, reads TP_STATUS_AVAILABLE for the
   given field from the cache
3) userspace places second packet into TX_RING (after the first one,
   since it knows it's there)
4) something happens that makes caches flush
5) userspace calls sendto()
6) kernel sees two packets to be transmitted, sends them out

So analogous for RX, this should mean:
1) tcpdump runs pcap_loop() (which, according to strace, calls poll()
   with a timeout of 1s)
2) kernel receives packet, puts it into RX_RING, sets POLLIN
3) tcpdump's poll() returns, but an unmodified RX_RING is seen
4) something happens that makes caches flush
5) (2) happens again
6) tcpdump's poll() returns, two packets are seen in RX_RING

My tests on TX-side show that this "something that makes caches flush"
actually happens quite frequently. But nevertheless, when receiving a
packet once a second, I'm expecting to occasionally see no packet in two
seconds, and then two in the following. The higher the packet rate, the
harder it should be to notice this phenomenon.
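
To make the TX sequence above concrete, here is a minimal userspace toy
model of it. It is only a sketch: struct toy_ring, the SLOT_* values and
all function names are made up for illustration and are not the AF_PACKET
definitions. It assumes that a userspace store stays in a cache line the
kernel alias cannot see until some unrelated flush writes it back.

```c
#include <stdbool.h>
#include <stddef.h>

#define RING_SLOTS 4

/* Hypothetical stand-ins for the TP_STATUS_* values of the real ring. */
enum slot_status { SLOT_AVAILABLE = 0, SLOT_SEND_REQUEST = 1 };

/*
 * Toy model: each slot status exists once in "memory" (what the kernel
 * alias reads) and once in the "user cache" (what userspace stored but
 * what has not been written back yet).
 */
struct toy_ring {
	enum slot_status memory[RING_SLOTS];
	enum slot_status user_cache[RING_SLOTS];
	bool dirty[RING_SLOTS];
};

/* Userspace queues a packet: the store only reaches its own cache line. */
static void user_queue_packet(struct toy_ring *r, size_t slot)
{
	r->user_cache[slot] = SLOT_SEND_REQUEST;
	r->dirty[slot] = true;
}

/* "Something happens that makes caches flush": write back dirty lines. */
static void flush_caches(struct toy_ring *r)
{
	for (size_t i = 0; i < RING_SLOTS; i++) {
		if (r->dirty[i]) {
			r->memory[i] = r->user_cache[i];
			r->dirty[i] = false;
		}
	}
}

/* Kernel side of sendto(): transmit every slot it can see as pending. */
static int kernel_sendto(struct toy_ring *r)
{
	int sent = 0;

	for (size_t i = 0; i < RING_SLOTS; i++) {
		if (r->memory[i] == SLOT_SEND_REQUEST) {
			r->memory[i] = SLOT_AVAILABLE;
			sent++;
		}
	}
	return sent;
}
```

In this model the first kernel_sendto() call after user_queue_packet()
sends nothing, and the call after flush_caches() sends everything queued
so far, which is the "nothing, then two at once" pattern described above.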

> Your Tx path is working because flush_cache_all gets triggered before
> flush_dcache_page. On the Rx path, since you don't have that
> workaround, you will eventually (it's just a matter of time) see this
> problem.

So you say if I called flush_cache_all() _after_ flush_dcache_page() it
wouldn't work?

> Or, delete your patch and try this workaround (in
> __packet_get/set_status) and you may be able to cover both Tx and Rx
> paths.

Oh great, thanks a lot for improving my ugly hacks! :)

Greetings, Phil

^ permalink raw reply

* Re: [PATCH] af_packet: flush complete kernel cache in packet_sendmsg
From: Russell King - ARM Linux @ 2011-09-06  9:57 UTC (permalink / raw)
  To: Phil Sutter; +Cc: Ben Hutchings, netdev, David S. Miller, linux-arm-kernel
In-Reply-To: <20110905195714.GC29025@philter>

On Mon, Sep 05, 2011 at 09:57:14PM +0200, Phil Sutter wrote:
> Hi,
> 
> On Fri, Sep 02, 2011 at 06:28:50PM +0100, Russell King - ARM Linux wrote:
> > On Fri, Sep 02, 2011 at 02:46:17PM +0100, Ben Hutchings wrote:
> > > On Fri, 2011-09-02 at 13:08 +0200, Phil Sutter wrote:
> > > > This flushes the cache before and after accessing the mmapped packet
> > > > buffer. It seems like the call to flush_dcache_page from inside
> > > > __packet_get_status is not enough on Kirkwood (or ARM in general).
> > > > ---
> > > > I know this is far from an optimal solution, but it's in fact the only working
> > > > one I found.
> > > [...]
> > > 
> > > This is ridiculous.  If flush_dcache_page() isn't doing everything it
> > > should, you need to fix that.
> > 
> > > It does do everything it should - which is to perform maintenance on
> > page cache pages.  It flushes the kernel mapping of the page.  It
> > also flushes the userspace mappings of the page which it finds by
> > walking the mmap list via the associated struct page.  It does not
> > touch vmalloc mappings because it has no way to know whether they
> > exist or not.
> > 
> > It doesn't do so much for anonymous pages - to do so would only
> > duplicate what flush_anon_page() does at the very same callsites.
> > Plus the mmap list isn't available for such pages so there's no
> > way to find out what userspace addresses to flush.
> 
> Indeed very interesting information, thanks a lot!
> 
> The code in question uses __get_free_pages(), and if that fails uses
> > vmalloc() (see alloc_one_pg_vec_page() for reference). Both code paths
> > result in the same faulty behaviour.

So, what you're wanting is cache coherency between vmalloc() and
userspace.  There is no API in the kernel to do that, and you'll see
the same failures of this interface not only on ARM but also other
architectures with virtual caches.

It sounds like we need an API to flush the cache using both the
userspace address, plus the kernel side address be that in the direct
map or the vmalloc map areas.

Or maybe the right solution is to simply disable AF_PACKET MMAP support
for virtual cached architectures - it may be that adding cache flushing
calls makes the thing too expensive and the benefits of mmap over normal
read/write are lost.

^ permalink raw reply

* Re: [PATCH] per-cgroup tcp buffer limitation
From: KAMEZAWA Hiroyuki @ 2011-09-06 10:00 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, linux-mm, containers, netdev, xemul,
	David S. Miller, Eric W. Biederman
In-Reply-To: <1315276556-10970-1-git-send-email-glommer@parallels.com>

On Mon,  5 Sep 2011 23:35:56 -0300
Glauber Costa <glommer@parallels.com> wrote:

> This patch introduces per-cgroup tcp buffers limitation. This allows
> sysadmins to specify a maximum amount of kernel memory that
> tcp connections can use at any point in time. TCP is the main interest
> in this work, but extending it to other protocols would be easy.
> 
> It piggybacks in the memory control mechanism already present in
> /proc/sys/net/ipv4/tcp_mem. There is a soft limit, and a hard limit,
> that will suppress allocation when reached. For each cgroup, however,
> the file kmem.tcp_maxmem will be used to cap those values.
> 
> The usage I have in mind here is containers. Each container will
> define its own values for soft and hard limits, but none of them will
> be possibly bigger than the value the box' sysadmin specified from
> the outside.
> 
> To test for any performance impacts of this patch, I used netperf's
> TCP_RR benchmark on localhost, so we can have both recv and snd in action.
> 
> Command line used was ./src/netperf -t TCP_RR -H localhost, and the
> results:
> 
> Without the patch
> =================
> 
> Socket Size   Request  Resp.   Elapsed  Trans.
> Send   Recv   Size     Size    Time     Rate
> bytes  Bytes  bytes    bytes   secs.    per sec
> 
> 16384  87380  1        1       10.00    26996.35
> 16384  87380
> 
> With the patch
> ===============
> 
> Local /Remote
> Socket Size   Request  Resp.   Elapsed  Trans.
> Send   Recv   Size     Size    Time     Rate
> bytes  Bytes  bytes    bytes   secs.    per sec
> 
> 16384  87380  1        1       10.00    27291.86
> 16384  87380
> 
> The difference is within a one-percent range.
> 
> Nesting cgroups doesn't seem to be the dominating factor as well,
> with nestings up to 10 levels not showing a significant performance
> difference.
> 
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: David S. Miller <davem@davemloft.net>
> CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>
> ---
>  crypto/af_alg.c               |    7 ++-
>  include/linux/cgroup_subsys.h |    4 +
>  include/net/netns/ipv4.h      |    1 +
>  include/net/sock.h            |   66 +++++++++++++++-
>  include/net/tcp.h             |   12 ++-
>  include/net/udp.h             |    3 +-
>  include/trace/events/sock.h   |   10 +-
>  init/Kconfig                  |   11 +++
>  mm/Makefile                   |    1 +
>  net/core/sock.c               |  136 +++++++++++++++++++++++++++-------
>  net/decnet/af_decnet.c        |   21 +++++-
>  net/ipv4/proc.c               |    8 +-
>  net/ipv4/sysctl_net_ipv4.c    |   59 +++++++++++++--
>  net/ipv4/tcp.c                |  164 +++++++++++++++++++++++++++++++++++-----
>  net/ipv4/tcp_input.c          |   17 ++--
>  net/ipv4/tcp_ipv4.c           |   27 +++++--
>  net/ipv4/tcp_output.c         |    2 +-
>  net/ipv4/tcp_timer.c          |    2 +-
>  net/ipv4/udp.c                |   20 ++++-
>  net/ipv6/tcp_ipv6.c           |   16 +++-
>  net/ipv6/udp.c                |    4 +-
>  net/sctp/socket.c             |   35 +++++++--
>  22 files changed, 514 insertions(+), 112 deletions(-)

Hmm... could you please divide this patch into a few smaller patches?

If I were you, I'd divide it into

 - Kconfig/Makefile/kmem_cgroup.c skeleton.
 - changes to struct sock and macro definitions
 - hooks to tcp.
 - hooks to udp
 - hooks to sctp

And why isn't mm/kmem_cgroup.c included?

Some comments below.


> 
> diff --git a/crypto/af_alg.c b/crypto/af_alg.c
> index ac33d5f..df168d8 100644
> --- a/crypto/af_alg.c
> +++ b/crypto/af_alg.c
> @@ -29,10 +29,15 @@ struct alg_type_list {
>  
>  static atomic_long_t alg_memory_allocated;
>  
> +static atomic_long_t *memory_allocated_alg(struct kmem_cgroup *sg)
> +{
> +	return &alg_memory_allocated;
> +}
> +
>  static struct proto alg_proto = {
>  	.name			= "ALG",
>  	.owner			= THIS_MODULE,
> -	.memory_allocated	= &alg_memory_allocated,
> +	.memory_allocated	= memory_allocated_alg,
>  	.obj_size		= sizeof(struct alg_sock),
>  };
>  
> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> index ac663c1..363b8e8 100644
> --- a/include/linux/cgroup_subsys.h
> +++ b/include/linux/cgroup_subsys.h
> @@ -35,6 +35,10 @@ SUBSYS(cpuacct)
>  SUBSYS(mem_cgroup)
>  #endif
>  
> +#ifdef CONFIG_CGROUP_KMEM
> +SUBSYS(kmem)
> +#endif
> +
>  /* */
>  
>  #ifdef CONFIG_CGROUP_DEVICE
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index d786b4f..bbd023a 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -55,6 +55,7 @@ struct netns_ipv4 {
>  	int current_rt_cache_rebuild_count;
>  
>  	unsigned int sysctl_ping_group_range[2];
> +	long sysctl_tcp_mem[3];
>  
>  	atomic_t rt_genid;
>  	atomic_t dev_addr_genid;
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 8e4062f..e085148 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -62,7 +62,9 @@
>  #include <linux/atomic.h>
>  #include <net/dst.h>
>  #include <net/checksum.h>
> +#include <linux/kmem_cgroup.h>
>  
> +int sockets_populate(struct cgroup_subsys *ss, struct cgroup *cgrp);
>  /*
>   * This structure really needs to be cleaned up.
>   * Most of it is for TCP, and not used by any of
> @@ -339,6 +341,7 @@ struct sock {
>  #endif
>  	__u32			sk_mark;
>  	u32			sk_classid;
> +	struct kmem_cgroup	*sk_cgrp;
>  	void			(*sk_state_change)(struct sock *sk);
>  	void			(*sk_data_ready)(struct sock *sk, int bytes);
>  	void			(*sk_write_space)(struct sock *sk);
> @@ -786,16 +789,21 @@ struct proto {
>  
>  	/* Memory pressure */
>  	void			(*enter_memory_pressure)(struct sock *sk);
> -	atomic_long_t		*memory_allocated;	/* Current allocated memory. */
> -	struct percpu_counter	*sockets_allocated;	/* Current number of sockets. */
> +	/* Current allocated memory. */
> +	atomic_long_t		*(*memory_allocated)(struct kmem_cgroup *sg);
> +	/* Current number of sockets. */
> +	struct percpu_counter	*(*sockets_allocated)(struct kmem_cgroup *sg);
> +
> +	int			(*init_cgroup)(struct cgroup *cgrp,
> +					       struct cgroup_subsys *ss);
>  	/*
>  	 * Pressure flag: try to collapse.
>  	 * Technical note: it is used by multiple contexts non atomically.
>  	 * All the __sk_mem_schedule() is of this nature: accounting
>  	 * is strict, actions are advisory and have some latency.
>  	 */
> -	int			*memory_pressure;
> -	long			*sysctl_mem;
> +	int			*(*memory_pressure)(struct kmem_cgroup *sg);
> +	long			*(*prot_mem)(struct kmem_cgroup *sg);

Hmm. The socket interface callbacks don't have documentation?
Adding an explanation under Documentation/ would be better, wouldn't it?


>  	int			*sysctl_wmem;
>  	int			*sysctl_rmem;
>  	int			max_header;
> @@ -826,6 +834,56 @@ struct proto {
>  #endif
>  };
>  
> +#define sk_memory_pressure(sk)						\
> +({									\
> +	int *__ret = NULL;						\
> +	if ((sk)->sk_prot->memory_pressure)				\
> +		__ret = (sk)->sk_prot->memory_pressure(sk->sk_cgrp);	\
> +	__ret;								\
> +})
> +
> +#define sk_sockets_allocated(sk)				\
> +({								\
> +	struct percpu_counter *__p;				\
> +	__p = (sk)->sk_prot->sockets_allocated(sk->sk_cgrp);	\
> +	__p;							\
> +})
> +
> +#define sk_memory_allocated(sk)					\
> +({								\
> +	atomic_long_t *__mem;					\
> +	__mem = (sk)->sk_prot->memory_allocated(sk->sk_cgrp);	\
> +	__mem;							\
> +})
> +
> +#define sk_prot_mem(sk)						\
> +({								\
> +	long *__mem = (sk)->sk_prot->prot_mem(sk->sk_cgrp);	\
> +	__mem;							\
> +})
> +
> +#define sg_memory_pressure(prot, sg)				\
> +({								\
> +	int *__ret = NULL;					\
> +	if (prot->memory_pressure)				\
> +		__ret = (prot)->memory_pressure(sg);		\
> +	__ret;							\
> +})
> +
> +#define sg_memory_allocated(prot, sg)				\
> +({								\
> +	atomic_long_t *__mem;					\
> +	__mem = (prot)->memory_allocated(sg);			\
> +	__mem;							\
> +})
> +
> +#define sg_sockets_allocated(prot, sg)				\
> +({								\
> +	struct percpu_counter *__p;				\
> +	__p = (prot)->sockets_allocated(sg);			\
> +	__p;							\
> +})
> +

Are all of these functions worth inlining?
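
For comparison, a sketch of how two of the accessors could look as static
inline functions instead of statement-expression macros. The structs below
are cut-down userspace stand-ins with only the fields needed here, and
memory_pressure_demo() is a made-up callback; none of this is the kernel's
actual layout. A function form also sidesteps the unparenthesized
sk->sk_cgrp and prot->memory_pressure in the macros above.

```c
#include <stddef.h>

/* Cut-down stand-ins for the kernel types; only the fields used here. */
struct kmem_cgroup {
	struct kmem_cgroup *parent;
	int tcp_memory_pressure;
};

struct proto {
	/* May be NULL for protocols without a pressure flag. */
	int *(*memory_pressure)(struct kmem_cgroup *sg);
};

struct sock {
	struct proto *sk_prot;
	struct kmem_cgroup *sk_cgrp;
};

/* Function form of the sk_memory_pressure() macro from the patch. */
static inline int *sk_memory_pressure(const struct sock *sk)
{
	if (!sk->sk_prot->memory_pressure)
		return NULL;
	return sk->sk_prot->memory_pressure(sk->sk_cgrp);
}

/* Function form of sg_memory_pressure(); arguments are evaluated once. */
static inline int *sg_memory_pressure(const struct proto *prot,
				      struct kmem_cgroup *sg)
{
	return prot->memory_pressure ? prot->memory_pressure(sg) : NULL;
}

/* Made-up callback in the style of memory_pressure_tcp(). */
static int *memory_pressure_demo(struct kmem_cgroup *sg)
{
	return &sg->tcp_memory_pressure;
}
```

The compiler is free to inline these, and the arguments get type checking
that the macro versions do not provide.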



>  extern int proto_register(struct proto *prot, int alloc_slab);
>  extern void proto_unregister(struct proto *prot);
>  
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 149a415..8e1ec4a 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -230,7 +230,6 @@ extern int sysctl_tcp_fack;
>  extern int sysctl_tcp_reordering;
>  extern int sysctl_tcp_ecn;
>  extern int sysctl_tcp_dsack;
> -extern long sysctl_tcp_mem[3];
>  extern int sysctl_tcp_wmem[3];
>  extern int sysctl_tcp_rmem[3];
>  extern int sysctl_tcp_app_win;
> @@ -253,9 +252,12 @@ extern int sysctl_tcp_cookie_size;
>  extern int sysctl_tcp_thin_linear_timeouts;
>  extern int sysctl_tcp_thin_dupack;
>  
> -extern atomic_long_t tcp_memory_allocated;
> -extern struct percpu_counter tcp_sockets_allocated;
> -extern int tcp_memory_pressure;
> +struct kmem_cgroup;
> +extern long *tcp_sysctl_mem(struct kmem_cgroup *sg);
> +struct percpu_counter *sockets_allocated_tcp(struct kmem_cgroup *sg);
> +int *memory_pressure_tcp(struct kmem_cgroup *sg);
> +int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss);
> +atomic_long_t *memory_allocated_tcp(struct kmem_cgroup *sg);
>  
>  /*
>   * The next routines deal with comparing 32 bit unsigned ints
> @@ -286,7 +288,7 @@ static inline bool tcp_too_many_orphans(struct sock *sk, int shift)
>  	}
>  
>  	if (sk->sk_wmem_queued > SOCK_MIN_SNDBUF &&
> -	    atomic_long_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])
> +	    atomic_long_read(sk_memory_allocated(sk)) > sk_prot_mem(sk)[2])

Why doesn't sk_memory_allocated() return the value?
Is it required to return a pointer?
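
A sketch of the value-returning variant, as a userspace model. C11
atomic_long stands in for the kernel's atomic_long_t, the structs are
cut-down mocks, and sk_memory_allocated_read() and tcp_over_hard_limit()
are hypothetical names, not existing kernel helpers.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Cut-down stand-ins; only the fields used by this example. */
struct kmem_cgroup {
	atomic_long tcp_memory_allocated;
	long tcp_prot_mem[3];
};

struct sock {
	struct kmem_cgroup *sk_cgrp;
};

/* Hypothetical helper: returns the counter's value, not a pointer to it. */
static inline long sk_memory_allocated_read(struct sock *sk)
{
	return atomic_load(&sk->sk_cgrp->tcp_memory_allocated);
}

/* The hard-limit test from tcp_too_many_orphans(), using the helper. */
static inline bool tcp_over_hard_limit(struct sock *sk)
{
	return sk_memory_allocated_read(sk) > sk->sk_cgrp->tcp_prot_mem[2];
}
```

Readers then never touch the atomic directly. Writers such as
__sk_mem_schedule() would still need a separate add/sub helper, which is
not shown here.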


>  		return true;
>  	return false;
>  }
> diff --git a/include/net/udp.h b/include/net/udp.h
> index 67ea6fc..0e27388 100644
> --- a/include/net/udp.h
> +++ b/include/net/udp.h
> @@ -105,7 +105,8 @@ static inline struct udp_hslot *udp_hashslot2(struct udp_table *table,
>  
>  extern struct proto udp_prot;
>  
> -extern atomic_long_t udp_memory_allocated;
> +atomic_long_t *memory_allocated_udp(struct kmem_cgroup *sg);
> +long *udp_sysctl_mem(struct kmem_cgroup *sg);
>  
>  /* sysctl variables for udp */
>  extern long sysctl_udp_mem[3];
> diff --git a/include/trace/events/sock.h b/include/trace/events/sock.h
> index 779abb9..12a6083 100644
> --- a/include/trace/events/sock.h
> +++ b/include/trace/events/sock.h
> @@ -37,7 +37,7 @@ TRACE_EVENT(sock_exceed_buf_limit,
>  
>  	TP_STRUCT__entry(
>  		__array(char, name, 32)
> -		__field(long *, sysctl_mem)
> +		__field(long *, prot_mem)
>  		__field(long, allocated)
>  		__field(int, sysctl_rmem)
>  		__field(int, rmem_alloc)
> @@ -45,7 +45,7 @@ TRACE_EVENT(sock_exceed_buf_limit,
>  
>  	TP_fast_assign(
>  		strncpy(__entry->name, prot->name, 32);
> -		__entry->sysctl_mem = prot->sysctl_mem;
> +		__entry->prot_mem = sk->sk_prot->prot_mem(sk->sk_cgrp);
>  		__entry->allocated = allocated;
>  		__entry->sysctl_rmem = prot->sysctl_rmem[0];
>  		__entry->rmem_alloc = atomic_read(&sk->sk_rmem_alloc);
> @@ -54,9 +54,9 @@ TRACE_EVENT(sock_exceed_buf_limit,
>  	TP_printk("proto:%s sysctl_mem=%ld,%ld,%ld allocated=%ld "
>  		"sysctl_rmem=%d rmem_alloc=%d",
>  		__entry->name,
> -		__entry->sysctl_mem[0],
> -		__entry->sysctl_mem[1],
> -		__entry->sysctl_mem[2],
> +		__entry->prot_mem[0],
> +		__entry->prot_mem[1],
> +		__entry->prot_mem[2],
>  		__entry->allocated,
>  		__entry->sysctl_rmem,
>  		__entry->rmem_alloc)
> diff --git a/init/Kconfig b/init/Kconfig
> index d627783..5955ac2 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -690,6 +690,17 @@ config CGROUP_MEM_RES_CTLR_SWAP_ENABLED
>  	  select this option (if, for some reason, they need to disable it
>  	  then swapaccount=0 does the trick).
>  
> +config CGROUP_KMEM
> +	bool "Kernel Memory Resource Controller for Control Groups"
> +	depends on CGROUPS
> +	help
> +	  The Kernel Memory cgroup can limit the amount of memory used by
> +	  certain kernel objects in the system. Those are fundamentally
> +	  different from the entities handled by the Memory Controller,
> +	  which are page-based, and can be swapped. Users of the kmem
> +	  cgroup can use it to guarantee that no group of processes will
> +	  ever exhaust kernel resources alone.
> +

This help seems nice but please add Documentation/cgroup/kmem.




>  config CGROUP_PERF
>  	bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
>  	depends on PERF_EVENTS && CGROUPS
> diff --git a/mm/Makefile b/mm/Makefile
> index 836e416..1b1aa24 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -45,6 +45,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
>  obj-$(CONFIG_QUICKLIST) += quicklist.o
>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
>  obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
> +obj-$(CONFIG_CGROUP_KMEM) += kmem_cgroup.o
>  obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
>  obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
>  obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 3449df8..2d968ea 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -134,6 +134,24 @@
>  #include <net/tcp.h>
>  #endif
>  
> +static DEFINE_RWLOCK(proto_list_lock);
> +static LIST_HEAD(proto_list);
> +
> +int sockets_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
> +{
> +	struct proto *proto;
> +	int ret = 0;
> +
> +	read_lock(&proto_list_lock);
> +	list_for_each_entry(proto, &proto_list, node) {
> +		if (proto->init_cgroup)
> +			ret |= proto->init_cgroup(cgrp, ss);
> +	}
> +	read_unlock(&proto_list_lock);
> +
> +	return ret;
> +}
> +
>  /*
>   * Each address family might have different locking rules, so we have
>   * one slock key per address family:
> @@ -1114,6 +1132,31 @@ void sock_update_classid(struct sock *sk)
>  EXPORT_SYMBOL(sock_update_classid);
>  #endif
>  
> +void sock_update_kmem_cgrp(struct sock *sk)
> +{
> +#ifdef CONFIG_CGROUP_KMEM
> +	sk->sk_cgrp = kcg_from_task(current);
> +
> +	/*
> +	 * We don't need to protect against anything task-related, because
> +	 * we are basically stuck with the sock pointer that won't change,
> +	 * even if the task that originated the socket changes cgroups.
> +	 *
> +	 * What we do have to guarantee, is that the chain leading us to
> +	 * the top level won't change under our noses. Incrementing the
> +	 * reference count via cgroup_exclude_rmdir guarantees that.
> +	 */
> +	cgroup_exclude_rmdir(&sk->sk_cgrp->css);
> +#endif
> +}

I'm not sure this kind of bare cgroup code in core/sock.c will be
welcomed by the network guys.




> +
> +void sock_release_kmem_cgrp(struct sock *sk)
> +{
> +#ifdef CONFIG_CGROUP_KMEM
> +	cgroup_release_and_wakeup_rmdir(&sk->sk_cgrp->css);
> +#endif
> +}
> +
>  /**
>   *	sk_alloc - All socket objects are allocated here
>   *	@net: the applicable net namespace
> @@ -1139,6 +1182,7 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
>  		atomic_set(&sk->sk_wmem_alloc, 1);
>  
>  		sock_update_classid(sk);
> +		sock_update_kmem_cgrp(sk);
>  	}
>  
>  	return sk;
> @@ -1170,6 +1214,7 @@ static void __sk_free(struct sock *sk)
>  		put_cred(sk->sk_peer_cred);
>  	put_pid(sk->sk_peer_pid);
>  	put_net(sock_net(sk));
> +	sock_release_kmem_cgrp(sk);
>  	sk_prot_free(sk->sk_prot_creator, sk);
>  }
>  
> @@ -1287,8 +1332,8 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
>  		sk_set_socket(newsk, NULL);
>  		newsk->sk_wq = NULL;
>  
> -		if (newsk->sk_prot->sockets_allocated)
> -			percpu_counter_inc(newsk->sk_prot->sockets_allocated);
> +		if (sk_sockets_allocated(sk))
> +			percpu_counter_inc(sk_sockets_allocated(sk));
How about 
	sk_sockets_allocated_inc(sk);
?
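
Such a helper might look roughly like the sketch below. It is a userspace
model: struct percpu_counter_demo is a plain counter standing in for
struct percpu_counter, the other structs are cut-down mocks, and
sockets_allocated_demo() is a made-up callback. Note that the
sk_sockets_allocated() macro in the patch calls the callback
unconditionally, so this sketch assumes the NULL check has to move to the
callback pointer itself.

```c
#include <stddef.h>

/* Plain counter standing in for struct percpu_counter. */
struct percpu_counter_demo {
	long count;
};

struct kmem_cgroup {
	struct percpu_counter_demo tcp_sockets_allocated;
};

struct proto {
	/* NULL for protocols that do not count their sockets. */
	struct percpu_counter_demo *(*sockets_allocated)(struct kmem_cgroup *sg);
};

struct sock {
	struct proto *sk_prot;
	struct kmem_cgroup *sk_cgrp;
};

/* Increment the per-cgroup socket count if the protocol keeps one. */
static inline void sk_sockets_allocated_inc(struct sock *sk)
{
	struct proto *prot = sk->sk_prot;

	if (prot->sockets_allocated)
		prot->sockets_allocated(sk->sk_cgrp)->count++;
}

/* Made-up callback in the style of sockets_allocated_tcp(). */
static struct percpu_counter_demo *sockets_allocated_demo(struct kmem_cgroup *sg)
{
	return &sg->tcp_sockets_allocated;
}
```

The callback is invoked once per increment instead of twice, and callers
no longer need to know the counter type.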


>  
>  		if (sock_flag(newsk, SOCK_TIMESTAMP) ||
>  		    sock_flag(newsk, SOCK_TIMESTAMPING_RX_SOFTWARE))
> @@ -1676,29 +1721,51 @@ EXPORT_SYMBOL(sk_wait_data);
>   */
>  int __sk_mem_schedule(struct sock *sk, int size, int kind)
>  {
> -	struct proto *prot = sk->sk_prot;
>  	int amt = sk_mem_pages(size);
> +	struct proto *prot = sk->sk_prot;
>  	long allocated;
> +	int *memory_pressure;
> +	long *prot_mem;
> +	int parent_failure = 0;
> +	struct kmem_cgroup *sg;
>  
>  	sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
> -	allocated = atomic_long_add_return(amt, prot->memory_allocated);
> +
> +	memory_pressure = sk_memory_pressure(sk);
> +	prot_mem = sk_prot_mem(sk);
> +
> +	allocated = atomic_long_add_return(amt, sk_memory_allocated(sk));
> +
> +#ifdef CONFIG_CGROUP_KMEM
> +	for (sg = sk->sk_cgrp->parent; sg != NULL; sg = sg->parent) {
> +		long alloc;
> +		/*
> +		 * Large nestings are not the common case, and stopping in the
> +		 * middle would be complicated enough, that we bill it all the
> +		 * way through the root, and if needed, unbill everything later
> +		 */
> +		alloc = atomic_long_add_return(amt,
> +					       sg_memory_allocated(prot, sg));
> +		parent_failure |= (alloc > sk_prot_mem(sk)[2]);
> +	}
> +#endif
> +
> +	/* Over hard limit (we, or our parents) */
> +	if (parent_failure || (allocated > prot_mem[2]))
> +		goto suppress_allocation;
>  
>  	/* Under limit. */
> -	if (allocated <= prot->sysctl_mem[0]) {
> -		if (prot->memory_pressure && *prot->memory_pressure)
> -			*prot->memory_pressure = 0;
> +	if (allocated <= prot_mem[0]) {
> +		if (memory_pressure && *memory_pressure)
> +			*memory_pressure = 0;
>  		return 1;
>  	}
>  
>  	/* Under pressure. */
> -	if (allocated > prot->sysctl_mem[1])
> +	if (allocated > prot_mem[1])
>  		if (prot->enter_memory_pressure)
>  			prot->enter_memory_pressure(sk);
>  
> -	/* Over hard limit. */
> -	if (allocated > prot->sysctl_mem[2])
> -		goto suppress_allocation;
> -
>  	/* guarantee minimum buffer size under pressure */
>  	if (kind == SK_MEM_RECV) {
>  		if (atomic_read(&sk->sk_rmem_alloc) < prot->sysctl_rmem[0])
> @@ -1712,13 +1779,13 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
>  				return 1;
>  	}
>  
> -	if (prot->memory_pressure) {
> +	if (memory_pressure) {
>  		int alloc;
>  
> -		if (!*prot->memory_pressure)
> +		if (!*memory_pressure)
>  			return 1;
> -		alloc = percpu_counter_read_positive(prot->sockets_allocated);
> -		if (prot->sysctl_mem[2] > alloc *
> +		alloc = percpu_counter_read_positive(sk_sockets_allocated(sk));
> +		if (prot_mem[2] > alloc *
>  		    sk_mem_pages(sk->sk_wmem_queued +
>  				 atomic_read(&sk->sk_rmem_alloc) +
>  				 sk->sk_forward_alloc))
> @@ -1741,7 +1808,13 @@ suppress_allocation:
>  
>  	/* Alas. Undo changes. */
>  	sk->sk_forward_alloc -= amt * SK_MEM_QUANTUM;
> -	atomic_long_sub(amt, prot->memory_allocated);
> +
> +	atomic_long_sub(amt, sk_memory_allocated(sk));
> +
> +#ifdef CONFIG_CGROUP_KMEM
> +	for (sg = sk->sk_cgrp->parent; sg != NULL; sg = sg->parent)
> +		atomic_long_sub(amt, sg_memory_allocated(prot, sg));
> +#endif
>  	return 0;
>  }
>  EXPORT_SYMBOL(__sk_mem_schedule);
> @@ -1753,14 +1826,24 @@ EXPORT_SYMBOL(__sk_mem_schedule);
>  void __sk_mem_reclaim(struct sock *sk)
>  {
>  	struct proto *prot = sk->sk_prot;
> +	struct kmem_cgroup *sg = sk->sk_cgrp;
> +	int *memory_pressure = sk_memory_pressure(sk);
>  
>  	atomic_long_sub(sk->sk_forward_alloc >> SK_MEM_QUANTUM_SHIFT,
> -		   prot->memory_allocated);
> +		   sk_memory_allocated(sk));
> +
> +#ifdef CONFIG_CGROUP_KMEM
> +	for (sg = sk->sk_cgrp->parent; sg != NULL; sg = sg->parent) {
> +		atomic_long_sub(sk->sk_forward_alloc >> SK_MEM_QUANTUM_SHIFT,
> +						sg_memory_allocated(prot, sg));
> +	}
> +#endif
> +
>  	sk->sk_forward_alloc &= SK_MEM_QUANTUM - 1;
>  
> -	if (prot->memory_pressure && *prot->memory_pressure &&
> -	    (atomic_long_read(prot->memory_allocated) < prot->sysctl_mem[0]))
> -		*prot->memory_pressure = 0;
> +	if (memory_pressure && *memory_pressure &&
> +	    (atomic_long_read(sk_memory_allocated(sk)) < sk_prot_mem(sk)[0]))
> +		*memory_pressure = 0;
>  }
>  EXPORT_SYMBOL(__sk_mem_reclaim);
>  

IMHO, I'd like to hide the atomic_long_xxxx ops under kmem cgroup ops.

And use callbacks like
	kmem_cgroup_read(SOCKET_MEM_ALLOCATED, sk)

If other components use kmem_cgroup, a generic interface will be
helpful because an optimization or fix in the generic interface will
improve all users of kmem_cgroup.
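
A rough userspace sketch of such a generic interface, under stated
assumptions: the enum, the stat[] array and the charge/uncharge helpers
are hypothetical and not part of the posted patch, and C11 atomic_long
stands in for atomic_long_t. The parent walk mirrors the billing loops in
__sk_mem_schedule() and __sk_mem_reclaim() above.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical index of the counters a kmem cgroup tracks. */
enum kmem_cgroup_stat {
	SOCKET_MEM_ALLOCATED,
	KMEM_CGROUP_NR_STATS,
};

struct kmem_cgroup {
	struct kmem_cgroup *parent;
	atomic_long stat[KMEM_CGROUP_NR_STATS];
};

struct sock {
	struct kmem_cgroup *sk_cgrp;
};

/* Read one counter of the socket's own group. */
static inline long kmem_cgroup_read(enum kmem_cgroup_stat idx, struct sock *sk)
{
	return atomic_load(&sk->sk_cgrp->stat[idx]);
}

/*
 * Charge 'amt' to the socket's group and to every ancestor.
 * Returns the new value of the socket's own group.
 */
static inline long kmem_cgroup_charge(enum kmem_cgroup_stat idx,
				      struct sock *sk, long amt)
{
	struct kmem_cgroup *sg = sk->sk_cgrp;
	long own = atomic_fetch_add(&sg->stat[idx], amt) + amt;

	for (sg = sg->parent; sg != NULL; sg = sg->parent)
		atomic_fetch_add(&sg->stat[idx], amt);
	return own;
}

/* Undo a charge on the socket's group and on every ancestor. */
static inline void kmem_cgroup_uncharge(enum kmem_cgroup_stat idx,
					struct sock *sk, long amt)
{
	struct kmem_cgroup *sg;

	for (sg = sk->sk_cgrp; sg != NULL; sg = sg->parent)
		atomic_fetch_sub(&sg->stat[idx], amt);
}
```

With this shape the #ifdef CONFIG_CGROUP_KMEM loops would live in one
place instead of being repeated in each caller in net/core/sock.c.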



> @@ -2252,9 +2335,6 @@ void sk_common_release(struct sock *sk)
>  }
>  EXPORT_SYMBOL(sk_common_release);
>  
> -static DEFINE_RWLOCK(proto_list_lock);
> -static LIST_HEAD(proto_list);
> -
>  #ifdef CONFIG_PROC_FS
>  #define PROTO_INUSE_NR	64	/* should be enough for the first time */
>  struct prot_inuse {
> @@ -2479,13 +2559,17 @@ static char proto_method_implemented(const void *method)
>  
>  static void proto_seq_printf(struct seq_file *seq, struct proto *proto)
>  {
> +	struct kmem_cgroup *sg = kcg_from_task(current);
> +
>  	seq_printf(seq, "%-9s %4u %6d  %6ld   %-3s %6u   %-3s  %-10s "
>  			"%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
>  		   proto->name,
>  		   proto->obj_size,
>  		   sock_prot_inuse_get(seq_file_net(seq), proto),
> -		   proto->memory_allocated != NULL ? atomic_long_read(proto->memory_allocated) : -1L,
> -		   proto->memory_pressure != NULL ? *proto->memory_pressure ? "yes" : "no" : "NI",
> +		   proto->memory_allocated != NULL ?
> +			atomic_long_read(sg_memory_allocated(proto, sg)) : -1L,
> +		   proto->memory_pressure != NULL ?
> +			*sg_memory_pressure(proto, sg) ? "yes" : "no" : "NI",
>  		   proto->max_header,
>  		   proto->slab == NULL ? "no" : "yes",
>  		   module_name(proto->owner),
> diff --git a/net/decnet/af_decnet.c b/net/decnet/af_decnet.c
> index 19acd00..463b299 100644
> --- a/net/decnet/af_decnet.c
> +++ b/net/decnet/af_decnet.c
> @@ -458,13 +458,28 @@ static void dn_enter_memory_pressure(struct sock *sk)
>  	}
>  }
>  
> +static atomic_long_t *memory_allocated_dn(struct kmem_cgroup *sg)
> +{
> +	return &decnet_memory_allocated;
> +}
> +
> +static int *memory_pressure_dn(struct kmem_cgroup *sg)
> +{
> +	return &dn_memory_pressure;
> +}
> +
> +static long *dn_sysctl_mem(struct kmem_cgroup *sg)
> +{
> +	return sysctl_decnet_mem;
> +}
> +
>  static struct proto dn_proto = {
>  	.name			= "NSP",
>  	.owner			= THIS_MODULE,
>  	.enter_memory_pressure	= dn_enter_memory_pressure,
> -	.memory_pressure	= &dn_memory_pressure,
> -	.memory_allocated	= &decnet_memory_allocated,
> -	.sysctl_mem		= sysctl_decnet_mem,
> +	.memory_pressure	= memory_pressure_dn,
> +	.memory_allocated	= memory_allocated_dn,
> +	.prot_mem		= dn_sysctl_mem,
>  	.sysctl_wmem		= sysctl_decnet_wmem,
>  	.sysctl_rmem		= sysctl_decnet_rmem,
>  	.max_header		= DN_MAX_NSP_DATA_HEADER + 64,
> diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
> index b14ec7d..9c80acf 100644
> --- a/net/ipv4/proc.c
> +++ b/net/ipv4/proc.c
> @@ -52,20 +52,22 @@ static int sockstat_seq_show(struct seq_file *seq, void *v)
>  {
>  	struct net *net = seq->private;
>  	int orphans, sockets;
> +	struct kmem_cgroup *sg = kcg_from_task(current);
> +	struct percpu_counter *allocated = sg_sockets_allocated(&tcp_prot, sg);
>  
>  	local_bh_disable();
>  	orphans = percpu_counter_sum_positive(&tcp_orphan_count);
> -	sockets = percpu_counter_sum_positive(&tcp_sockets_allocated);
> +	sockets = percpu_counter_sum_positive(allocated);
>  	local_bh_enable();
>  
>  	socket_seq_show(seq);
>  	seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %ld\n",
>  		   sock_prot_inuse_get(net, &tcp_prot), orphans,
>  		   tcp_death_row.tw_count, sockets,
> -		   atomic_long_read(&tcp_memory_allocated));
> +		   atomic_long_read(sg_memory_allocated((&tcp_prot), sg)));
>  	seq_printf(seq, "UDP: inuse %d mem %ld\n",
>  		   sock_prot_inuse_get(net, &udp_prot),
> -		   atomic_long_read(&udp_memory_allocated));
> +		   atomic_long_read(sg_memory_allocated((&udp_prot), sg)));
>  	seq_printf(seq, "UDPLITE: inuse %d\n",
>  		   sock_prot_inuse_get(net, &udplite_prot));
>  	seq_printf(seq, "RAW: inuse %d\n",
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 69fd720..5e89480 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -14,6 +14,8 @@
>  #include <linux/init.h>
>  #include <linux/slab.h>
>  #include <linux/nsproxy.h>
> +#include <linux/kmem_cgroup.h>
> +#include <linux/swap.h>
>  #include <net/snmp.h>
>  #include <net/icmp.h>
>  #include <net/ip.h>
> @@ -174,6 +176,43 @@ static int proc_allowed_congestion_control(ctl_table *ctl,
>  	return ret;
>  }
>  
> +static int ipv4_tcp_mem(ctl_table *ctl, int write,
> +			   void __user *buffer, size_t *lenp,
> +			   loff_t *ppos)
> +{
> +	int ret;
> +	unsigned long vec[3];
> +	struct kmem_cgroup *kmem = kcg_from_task(current);
> +	struct net *net = current->nsproxy->net_ns;
> +	int i;
> +
> +	ctl_table tmp = {
> +		.data = &vec,
> +		.maxlen = sizeof(vec),
> +		.mode = ctl->mode,
> +	};
> +
> +	if (!write) {
> +		ctl->data = &net->ipv4.sysctl_tcp_mem;
> +		return proc_doulongvec_minmax(ctl, write, buffer, lenp, ppos);
> +	}
> +
> +	ret = proc_doulongvec_minmax(&tmp, write, buffer, lenp, ppos);
> +	if (ret)
> +		return ret;
> +
> +	for (i = 0; i < 3; i++)
> +		if (vec[i] > kmem->tcp_max_memory)
> +			return -EINVAL;
> +
> +	for (i = 0; i < 3; i++) {
> +		net->ipv4.sysctl_tcp_mem[i] = vec[i];
> +		kmem->tcp_prot_mem[i] = net->ipv4.sysctl_tcp_mem[i];
> +	}
> +
> +	return 0;
> +}
> +
>  static struct ctl_table ipv4_table[] = {
>  	{
>  		.procname	= "tcp_timestamps",
> @@ -433,13 +472,6 @@ static struct ctl_table ipv4_table[] = {
>  		.proc_handler	= proc_dointvec
>  	},
>  	{
> -		.procname	= "tcp_mem",
> -		.data		= &sysctl_tcp_mem,
> -		.maxlen		= sizeof(sysctl_tcp_mem),
> -		.mode		= 0644,
> -		.proc_handler	= proc_doulongvec_minmax
> -	},
> -	{
>  		.procname	= "tcp_wmem",
>  		.data		= &sysctl_tcp_wmem,
>  		.maxlen		= sizeof(sysctl_tcp_wmem),
> @@ -721,6 +753,12 @@ static struct ctl_table ipv4_net_table[] = {
>  		.mode		= 0644,
>  		.proc_handler	= ipv4_ping_group_range,
>  	},
> +	{
> +		.procname	= "tcp_mem",
> +		.maxlen		= sizeof(init_net.ipv4.sysctl_tcp_mem),
> +		.mode		= 0644,
> +		.proc_handler	= ipv4_tcp_mem,
> +	},
>  	{ }
>  };
>  
> @@ -734,6 +772,7 @@ EXPORT_SYMBOL_GPL(net_ipv4_ctl_path);
>  static __net_init int ipv4_sysctl_init_net(struct net *net)
>  {
>  	struct ctl_table *table;
> +	unsigned long limit;
>  
>  	table = ipv4_net_table;
>  	if (!net_eq(net, &init_net)) {
> @@ -769,6 +808,12 @@ static __net_init int ipv4_sysctl_init_net(struct net *net)
>  
>  	net->ipv4.sysctl_rt_cache_rebuild_count = 4;
>  
> +	limit = nr_free_buffer_pages() / 8;
> +	limit = max(limit, 128UL);
> +	net->ipv4.sysctl_tcp_mem[0] = limit / 4 * 3;
> +	net->ipv4.sysctl_tcp_mem[1] = limit;
> +	net->ipv4.sysctl_tcp_mem[2] = net->ipv4.sysctl_tcp_mem[0] * 2;
> +

What does this calculation mean? Is it documented somewhere?
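
As a worked example, the arithmetic in the hunk above can be lifted into a
standalone function. This is only a restatement of the quoted code:
struct tcp_mem_limits and tcp_mem_defaults() are made-up names, and the
page count is passed in as a parameter instead of being taken from
nr_free_buffer_pages().

```c
/* The three tcp_mem thresholds, in pages: low, pressure, hard limit. */
struct tcp_mem_limits {
	long mem[3];
};

/* Same arithmetic as the quoted hunk, with the page count as input. */
static struct tcp_mem_limits tcp_mem_defaults(unsigned long free_buffer_pages)
{
	struct tcp_mem_limits out;
	unsigned long limit = free_buffer_pages / 8;

	if (limit < 128UL)
		limit = 128UL;

	out.mem[0] = limit / 4 * 3;	/* 3/4 of the limit */
	out.mem[1] = limit;		/* 1/8 of the free buffer pages */
	out.mem[2] = out.mem[0] * 2;	/* 1.5 times the limit */
	return out;
}
```

For 1000000 free buffer pages this gives 93750, 125000 and 187500 pages.
Anything below 1024 pages is clamped to a limit of 128, which gives 96,
128 and 192.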



>  	net->ipv4.ipv4_hdr = register_net_sysctl_table(net,
>  			net_ipv4_ctl_path, table);
>  	if (net->ipv4.ipv4_hdr == NULL)
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 46febca..e1918fa 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -266,6 +266,7 @@
>  #include <linux/crypto.h>
>  #include <linux/time.h>
>  #include <linux/slab.h>
> +#include <linux/nsproxy.h>
>  
>  #include <net/icmp.h>
>  #include <net/tcp.h>
> @@ -282,23 +283,12 @@ int sysctl_tcp_fin_timeout __read_mostly = TCP_FIN_TIMEOUT;
>  struct percpu_counter tcp_orphan_count;
>  EXPORT_SYMBOL_GPL(tcp_orphan_count);
>  
> -long sysctl_tcp_mem[3] __read_mostly;
>  int sysctl_tcp_wmem[3] __read_mostly;
>  int sysctl_tcp_rmem[3] __read_mostly;
>  
> -EXPORT_SYMBOL(sysctl_tcp_mem);
>  EXPORT_SYMBOL(sysctl_tcp_rmem);
>  EXPORT_SYMBOL(sysctl_tcp_wmem);
>  
> -atomic_long_t tcp_memory_allocated;	/* Current allocated memory. */
> -EXPORT_SYMBOL(tcp_memory_allocated);
> -
> -/*
> - * Current number of TCP sockets.
> - */
> -struct percpu_counter tcp_sockets_allocated;
> -EXPORT_SYMBOL(tcp_sockets_allocated);
> -
>  /*
>   * TCP splice context
>   */
> @@ -308,23 +298,157 @@ struct tcp_splice_state {
>  	unsigned int flags;
>  };
>  
> +#ifdef CONFIG_CGROUP_KMEM
>  /*
>   * Pressure flag: try to collapse.
>   * Technical note: it is used by multiple contexts non atomically.
>   * All the __sk_mem_schedule() is of this nature: accounting
>   * is strict, actions are advisory and have some latency.
>   */
> -int tcp_memory_pressure __read_mostly;
> -EXPORT_SYMBOL(tcp_memory_pressure);
> -
>  void tcp_enter_memory_pressure(struct sock *sk)
>  {
> +	struct kmem_cgroup *sg = sk->sk_cgrp;
> +	if (!sg->tcp_memory_pressure) {
> +		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPMEMORYPRESSURES);
> +		sg->tcp_memory_pressure = 1;
> +	}
> +}
> +
> +long *tcp_sysctl_mem(struct kmem_cgroup *sg)
> +{
> +	return sg->tcp_prot_mem;
> +}
> +
> +atomic_long_t *memory_allocated_tcp(struct kmem_cgroup *sg)
> +{
> +	return &(sg->tcp_memory_allocated);
> +}
> +
> +static int tcp_write_maxmem(struct cgroup *cgrp, struct cftype *cft, u64 val)
> +{
> +	struct kmem_cgroup *sg = kcg_from_cgroup(cgrp);
> +	struct net *net = current->nsproxy->net_ns;
> +	int i;
> +
> +	if (!cgroup_lock_live_group(cgrp))
> +		return -ENODEV;
> +
> +	/*
> +	 * We can't allow more memory than our parents. Since this
> +	 * will be tested for all calls, by induction, there is no need
> +	 * to test any parent other than our own
> +	 * */
> +	if (sg->parent && (val > sg->parent->tcp_max_memory))
> +		val = sg->parent->tcp_max_memory;
> +
> +	sg->tcp_max_memory = val;
> +
> +	for (i = 0; i < 3; i++)
> +		sg->tcp_prot_mem[i]  = min_t(long, val,
> +					     net->ipv4.sysctl_tcp_mem[i]);
> +
> +	cgroup_unlock();
> +
> +	return 0;
> +}
> +

Do we really need cgroup_lock/unlock here?
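
Locking aside, the clamping done by tcp_write_maxmem() above can be sketched in userspace as follows. All toy_* names are hypothetical stand-ins for the kmem_cgroup fields in the patch, and the per-namespace sysctl values are passed in as a plain array.

```c
/* Hypothetical stand-in for the kmem_cgroup fields used above. */
struct toy_group {
	struct toy_group *parent;
	unsigned long long max_memory;
	long prot_mem[3];
};

static void toy_write_maxmem(struct toy_group *g, unsigned long long val,
			     const long sysctl_mem[3])
{
	int i;

	/* A child may never be allowed more than its direct parent. */
	if (g->parent && val > g->parent->max_memory)
		val = g->parent->max_memory;

	g->max_memory = val;

	/* Each effective limit is the smaller of the cap and the sysctl. */
	for (i = 0; i < 3; i++)
		g->prot_mem[i] = (long)val < sysctl_mem[i] ?
				 (long)val : sysctl_mem[i];
}
```

Because every write is clamped against the direct parent, a chain of groups stays ordered without walking all the way to the root, which is the induction argument in the comment.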



> +static u64 tcp_read_maxmem(struct cgroup *cgrp, struct cftype *cft)
> +{
> +	struct kmem_cgroup *sg = kcg_from_cgroup(cgrp);
> +	u64 ret;
> +
> +	if (!cgroup_lock_live_group(cgrp))
> +		return -ENODEV;
> +	ret = sg->tcp_max_memory;
> +
> +	cgroup_unlock();
> +	return ret;
> +}
> +


Hmm, can't you implement this function as

	kmem_cgroup_read(SOCK_TCP_MAXMEM, sg);

? What do you think?

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply

* Re: [PATCH] af_packet: flush complete kernel cache in packet_sendmsg
From: Phil Sutter @ 2011-09-06 11:05 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Ben Hutchings, netdev, David S. Miller, linux-arm-kernel
In-Reply-To: <20110906095722.GK6619@n2100.arm.linux.org.uk>

On Tue, Sep 06, 2011 at 10:57:22AM +0100, Russell King - ARM Linux wrote:
> > The code in question uses __get_free_pages(), and if that fails uses
> > vmalloc() (see alloc_one_pg_vec_page() for reference). Both code paths
> > result in the same faulty behaviour.
> 
> So, what you're wanting is cache coherency between vmalloc() and
> userspace.  There is no API in the kernel to do that, and you'll see
> the same failures of this interface not only on ARM but also other
> architectures with virtual caches.
> 
> It sounds like we need an API to flush the cache using both the
> userspace address, plus the kernel side address be that in the direct
> map or the vmalloc map areas.
> 
> Or maybe the right solution is to simply disable AF_PACKET MMAP support
> for virtual cached architectures - it may be that adding cache flushing
> calls makes the thing too expensive and the benefits of mmap over normal
> read/write are lost.

OK, that's horrible. Of course we depend on just this combination to
work flawlessly, i.e. PACKET_MMAP && VIVT. :(

Another userspace-interface I'm working on uses a different solution:
memory is allocated in userspace and accessed from kernelspace using
get_user_pages(). I did not explicitly search for the earlier described
fault pattern, but we didn't notice any problem with this approach on
the very same hardware either. I already see myself writing TPACKET_V3.
;)

What do you think?

Greetings, Phil

-- 
Viprinet GmbH
Mainzer Str. 43
55411 Bingen am Rhein
Germany

Zentrale:     +49-6721-49030-0
Durchwahl:    +49-6721-49030-134
Fax:          +49-6721-49030-209

phil.sutter@viprinet.com
http://www.viprinet.com

Sitz der Gesellschaft: Bingen am Rhein
Handelsregister: Amtsgericht Mainz HRB40380
Geschäftsführer: Simon Kissel


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply

* Re: [patch net-next-2.6 v3] net: consolidate and fix ethtool_ops->get_settings calling
From: Ralf Baechle @ 2011-09-06 12:20 UTC (permalink / raw)
  To: Jiri Pirko, tg
  Cc: Ben Hutchings, netdev, fubar, andy, kaber, bprakash, JBottomley,
	robert.w.love, davem, shemminger, decot, mirq-linux,
	alexander.h.duyck, amit.salecha, eric.dumazet, therbert, paulmck,
	laijs, xiaosuo, greearb, loke.chetan, linux-mips, linux-scsi,
	devel, bridge
In-Reply-To: <20110903133428.GA2821@minipsycho>

On Sat, Sep 03, 2011 at 03:34:30PM +0200, Jiri Pirko wrote:

>  arch/mips/txx9/generic/setup_tx4939.c |    2 +-

Acked-by: Ralf Baechle <ralf@linux-mips.org>

Feel free to merge this through the net tree.

  Ralf

^ permalink raw reply

* Re: [PATCH 1/2] iwlegacy: change IWL_WARN to IWL_DEBUG_HT in iwl4965_tx_agg_start
From: Stanislaw Gruszka @ 2011-09-06 15:01 UTC (permalink / raw)
  To: Greg Dietsche; +Cc: linville, linux-wireless, netdev, linux-kernel
In-Reply-To: <20110829140032.GA1573@redhat.com>

Hello

On Mon, Aug 29, 2011 at 04:00:33PM +0200, Stanislaw Gruszka wrote:
> On Mon, Aug 29, 2011 at 08:33:39AM -0500, Greg Dietsche wrote:
> > On 08/29/2011 07:20 AM, Stanislaw Gruszka wrote:
> > >On Sun, Aug 28, 2011 at 08:26:16AM -0500, Greg Dietsche wrote:
> > >>This message should be a debug message and not a warning. So
> > >>change it from IWL_WARN to IWL_DEBUG_HT.
> > >I'm currently doing massive iwlegacy driver cleanup. Would be easier
> > >for me to apply these patches on top of my changes instead of rebase
> > >my patches. I will queue these two patches and post them together with
> > >my pending patches.
> > >
> > > That sounds good to me. I have the 4965 card in my laptop and want
> > to learn how it works. If you want someone to test your changes, I'm
> > willing.
> > 
> > I have six other patches that are trivial in nature for the iwlegacy
> > driver. One of those also applies to the iwlagn driver, so seven
> > patches in total. They remove some null checks that aren't necessary
> > and also cleanup a few unused variables. There are two patches in
> > the set that I'm not 100% sure about. They remove null checks and I
> > haven't been able to prove to myself that they are correct. However,
> > if they aren't correct, then there are some null checks in other
> > places that need to be added....
> > 
> > Anyway, I can hold off on these until you've done your cleanup and
> > see what still applies,
> 
> That would be great.
> 
> > or if you have a tree someplace, I'd be
> > happy to rebase them for you.
> 
> I do not have publicly available tree. I'll probably try to get
> one on git.kernel.org. I will let you know.

I requested a kernel.org account, but the admins there currently have
much bigger troubles than adding a new account :-/

I put patches here:
http://people.redhat.com/sgruszka/iwlegacy_cleanup.tar.bz2

They are on top of the wireless-testing tree. You can apply them
with something like this:

cd wireless-testing
for i in `ls -X ~/iwlegacy_cleanup/*.patch` ; do git am $i ; done

The series includes your 2 patches. You can test this cleanup and
apply your new changes on top. I won't do any further cleanup
for some time; perhaps I'll continue once I get a public git tree.

Thanks
Stanislaw

^ permalink raw reply

* re add support for bcm5750
From: Florian Mickler @ 2011-09-06 15:03 UTC (permalink / raw)
  To: mcarlson; +Cc: netdev, linux-kernel, benli, mchan, davem, Francesco Piccinno
In-Reply-To: <1280784368-4226-8-git-send-email-mcarlson@broadcom.com>

Hi,

in https://bugzilla.kernel.org/show_bug.cgi?id=42132 Francesco wrote: 

> I have a notebook (HP TC4400) which has a BCM5750 ethernet card inside. The
> output of lspci is:
> 
> 08:00.0 Ethernet controller [0200]: Broadcom Corporation NetXtreme BCM5750M
> Gigabit Ethernet [14e4:167c]
> 
> Commit 67b284d476bcb3d100e946da23d6cf9acfd0465c removed the support for this
> device. I wish to have the support for this network card back again. Thanks!

Regards,
Flo

^ permalink raw reply

* RE: [PATCH] net: Prefer non link-local source addresses
From: Harris, Jeff @ 2011-09-06 15:12 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <alpine.LFD.2.00.1109020041140.1799@ja.ssi.bg>

In this case, the address scope values are being set properly.  The link-local address has link scope and the routable address has global scope.  When the inet_select_addr function is called, the dst address is 0 and the scope is 253 (link scope).  So, in this case, both addresses could match.  It just happens that the link-local address is first in the list for the device.

This condition looks to be arising from the use of interface routes on our device (e.g. ip route add default dev eth0).  The routes are being installed with link scope.  Forcing a scope of global causes a scope of 0 in inet_select_addr, which then always selects the routable address.  I have not found any definitive documentation on whether local or global should be used for the route, but the default behavior of the 'ip' command is to use link scope on these routes and global on routes with a gateway address.
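
To make the scope comparison concrete, here is a toy userspace model of the filter in inet_select_addr for the dst == 0 case. It is a sketch only: the toy_* names are hypothetical, addresses are plain host-order integers, and only the "ifa_scope > scope" test from the kernel loop is modelled. The scope numbers match RT_SCOPE_UNIVERSE (0) and RT_SCOPE_LINK (253).

```c
#include <stddef.h>
#include <stdint.h>

enum { TOY_SCOPE_UNIVERSE = 0, TOY_SCOPE_LINK = 253 };

struct toy_ifa {
	uint32_t local;	/* address, host byte order for readability */
	int scope;
};

/* Return the first address whose scope is not narrower than requested. */
static uint32_t toy_select_addr(const struct toy_ifa *ifa, size_t n, int scope)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (ifa[i].scope > scope)
			continue;	/* too narrow for this lookup */
		return ifa[i].local;	/* dst == 0: first match wins */
	}
	return 0;
}
```

With the list { 169.254.5.5 (link), 192.0.2.1 (global) }, a lookup with scope 253 returns the link-local address because it comes first, while a lookup with scope 0 skips it and returns 192.0.2.1.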

Also, I have only been able to test against 2.6.33 which we use on our embedded device.  It is not easy to update to a more recent version.  The patch, though, applied cleanly to the latest stable version.

Jeff

-----Original Message-----
From: Julian Anastasov [mailto:ja@ssi.bg] 
Sent: Thursday, September 01, 2011 6:15 PM
To: Harris, Jeff
Cc: David S. Miller; Alexey Kuznetsov; James Morris; Hideaki YOSHIFUJI; Patrick McHardy; netdev@vger.kernel.org; linux-kernel@vger.kernel.org
Subject: Re: [PATCH] net: Prefer non link-local source addresses


	Hello,

On Thu, 1 Sep 2011, Jeff Harris wrote:

> Section 2.6.1 of RFC 3927 specifies that if link-local and routable addresses
> are available on an interface, a routable address is preferred.  Update the
> IPv4 source address selection algorithm to use a 169.254.x.x address only if
> another matching address is not found.
> 
> Tested combinations of configured IP addresses with and without link-local to
> verify a link-local address was chosen only if no routable address was
> present.

	As David Lamparter already said, isn't the scope value
suitable for this purpose? Eg.
ip addr add 169.254.5.5/16 brd + dev eth0 scope link

	iproute2 already has function default_scope() in
ip/ipaddress.c that assigns scope if it is not specified
while adding an address. Maybe we can add RT_SCOPE_LINK for
169.254 there?

	Another such place is inet_set_ifa() in
net/ipv4/devinet.c where we can assign scope, so that
ifconfig works too.

	I see also that net/ipv6/addrconf.c (sit_add_v4_addrs)
avoids link-local addresses. What I mean is that the scope
can be checked at many places and it is a mechanism that
already works.

	As a result, we will not complicate inet_select_addr.

> Signed-off-by: Jeff Harris <jeff_harris@kentrox.com>
> ---
>  net/ipv4/devinet.c |   18 ++++++++++++++++--
>  1 files changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
> index bc19bd0..70ddf37 100644
> --- a/net/ipv4/devinet.c
> +++ b/net/ipv4/devinet.c
> @@ -965,6 +965,8 @@ out:
>  __be32 inet_select_addr(const struct net_device *dev, __be32 dst, int scope)
>  {
>  	__be32 addr = 0;
> +	__be32 lladdr = 0;
> +	__be32 firstaddr = 0;
>  	struct in_device *in_dev;
>  	struct net *net = dev_net(dev);
>  
> @@ -977,15 +979,27 @@ __be32 inet_select_addr(const struct net_device *dev, __be32 dst, int scope)
>  		if (ifa->ifa_scope > scope)
>  			continue;
>  		if (!dst || inet_ifa_match(dst, ifa)) {
> +			if (ipv4_is_linklocal_169(ifa->ifa_address)) {
> +				lladdr = ifa->ifa_local;
> +				continue;
> +			}
>  			addr = ifa->ifa_local;
>  			break;
>  		}
> -		if (!addr)
> -			addr = ifa->ifa_local;
> +		if (!firstaddr)
> +			firstaddr = ifa->ifa_local;
>  	} endfor_ifa(in_dev);
>  
>  	if (addr)
>  		goto out_unlock;
> +	if (lladdr) {
> +		addr = lladdr;
> +		goto out_unlock;
> +	}
> +	if (firstaddr) {
> +		addr = firstaddr;
> +		goto out_unlock;
> +	}
>  no_in_dev:
>  
>  	/* Not loopback addresses on loopback should be preferred
> -- 
> 1.7.0.5

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply

* Re: [PATCH net 6/6] bnx2x: Fix ethtool advertisement
From: Ben Hutchings @ 2011-09-06 15:19 UTC (permalink / raw)
  To: Yaniv Rosner; +Cc: davem, eilong, netdev
In-Reply-To: <1315291645.28564.20.camel@lb-tlvb-dmitry>

On Tue, 2011-09-06 at 09:47 +0300, Yaniv Rosner wrote:
> Enable changing advertisement settings via ethtool and fix
> flow-control advertisement when autoneg flow-control is disabled.
[...]
> diff --git a/drivers/net/bnx2x/bnx2x_main.c b/drivers/net/bnx2x/bnx2x_main.c
> index f74582a..42c7be1 100644
> --- a/drivers/net/bnx2x/bnx2x_main.c
> +++ b/drivers/net/bnx2x/bnx2x_main.c
> @@ -2125,6 +2125,12 @@ static int bnx2x_set_spio(struct bnx2x *bp, int spio_num, u32 mode)
>  void bnx2x_calc_fc_adv(struct bnx2x *bp)
>  {
>  	u8 cfg_idx = bnx2x_get_link_cfg_idx(bp);
> +	if (bp->link_params.req_flow_ctrl[cfg_idx] != BNX2X_FLOW_CTRL_AUTO) {
> +		bp->port.advertising[cfg_idx] &= ~(ADVERTISED_Asym_Pause |
> +						   ADVERTISED_Pause);
> +		return;
> +	}
[...]

I think you should still advertise the flow control behaviour you want,
even if you will override the result of autonegotiation.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH] per-cgroup tcp buffer limitation
From: Glauber Costa @ 2011-09-06 15:28 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, linux-mm, containers, netdev, xemul,
	David S. Miller, Eric W. Biederman
In-Reply-To: <20110906190038.7a0a8807.kamezawa.hiroyu@jp.fujitsu.com>

On 09/06/2011 07:00 AM, KAMEZAWA Hiroyuki wrote:
> On Mon,  5 Sep 2011 23:35:56 -0300
> Glauber Costa<glommer@parallels.com>  wrote:
>
>> This patch introduces per-cgroup tcp buffers limitation. This allows
>> sysadmins to specify a maximum amount of kernel memory that
>> tcp connections can use at any point in time. TCP is the main interest
>> in this work, but extending it to other protocols would be easy.
>>
>> It piggybacks in the memory control mechanism already present in
>> /proc/sys/net/ipv4/tcp_mem. There is a soft limit, and a hard limit,
>> that will suppress allocation when reached. For each cgroup, however,
>> the file kmem.tcp_maxmem will be used to cap those values.
>>
>> The usage I have in mind here is containers. Each container will
>> define its own values for soft and hard limits, but none of them will
>> be possibly bigger than the value the box' sysadmin specified from
>> the outside.
>>
>> To test for any performance impacts of this patch, I used netperf's
>> TCP_RR benchmark on localhost, so we can have both recv and snd in action.
>>
>> Command line used was ./src/netperf -t TCP_RR -H localhost, and the
>> results:
>>
>> Without the patch
>> =================
>>
>> Socket Size   Request  Resp.   Elapsed  Trans.
>> Send   Recv   Size     Size    Time     Rate
>> bytes  Bytes  bytes    bytes   secs.    per sec
>>
>> 16384  87380  1        1       10.00    26996.35
>> 16384  87380
>>
>> With the patch
>> ===============
>>
>> Local /Remote
>> Socket Size   Request  Resp.   Elapsed  Trans.
>> Send   Recv   Size     Size    Time     Rate
>> bytes  Bytes  bytes    bytes   secs.    per sec
>>
>> 16384  87380  1        1       10.00    27291.86
>> 16384  87380
>>
>> The difference is within a one-percent range.
>>
>> Nesting cgroups doesn't seem to be the dominating factor as well,
>> with nestings up to 10 levels not showing a significant performance
>> difference.
>>
>> Signed-off-by: Glauber Costa<glommer@parallels.com>
>> CC: David S. Miller<davem@davemloft.net>
>> CC: Hiroyouki Kamezawa<kamezawa.hiroyu@jp.fujitsu.com>
>> CC: Eric W. Biederman<ebiederm@xmission.com>
>> ---
>>   crypto/af_alg.c               |    7 ++-
>>   include/linux/cgroup_subsys.h |    4 +
>>   include/net/netns/ipv4.h      |    1 +
>>   include/net/sock.h            |   66 +++++++++++++++-
>>   include/net/tcp.h             |   12 ++-
>>   include/net/udp.h             |    3 +-
>>   include/trace/events/sock.h   |   10 +-
>>   init/Kconfig                  |   11 +++
>>   mm/Makefile                   |    1 +
>>   net/core/sock.c               |  136 +++++++++++++++++++++++++++-------
>>   net/decnet/af_decnet.c        |   21 +++++-
>>   net/ipv4/proc.c               |    8 +-
>>   net/ipv4/sysctl_net_ipv4.c    |   59 +++++++++++++--
>>   net/ipv4/tcp.c                |  164 +++++++++++++++++++++++++++++++++++-----
>>   net/ipv4/tcp_input.c          |   17 ++--
>>   net/ipv4/tcp_ipv4.c           |   27 +++++--
>>   net/ipv4/tcp_output.c         |    2 +-
>>   net/ipv4/tcp_timer.c          |    2 +-
>>   net/ipv4/udp.c                |   20 ++++-
>>   net/ipv6/tcp_ipv6.c           |   16 +++-
>>   net/ipv6/udp.c                |    4 +-
>>   net/sctp/socket.c             |   35 +++++++--
>>   22 files changed, 514 insertions(+), 112 deletions(-)
>
> Hmm...could you please divide this into a few patches?
>
> If I were you, I'd divide the patches into
>
>   - Kconfig/Makefile/kmem_cgroup.c skeleton.
>   - changes to struct sock and macro definition
>   - hooks to tcp.
>   - hooks to udp
>   - hooks to sctp

Sure, I can do it.

> And why is mm/kmem_cgroup.c not included?
Because I am an idiot and forgot to git add.

>
> some comments below.
>
>
>>
>> diff --git a/crypto/af_alg.c b/crypto/af_alg.c
>> index ac33d5f..df168d8 100644
>> --- a/crypto/af_alg.c
>> +++ b/crypto/af_alg.c
>> @@ -29,10 +29,15 @@ struct alg_type_list {
>>
>>   static atomic_long_t alg_memory_allocated;
>>
>> +static atomic_long_t *memory_allocated_alg(struct kmem_cgroup *sg)
>> +{
>> +	return&alg_memory_allocated;
>> +}
>> +
>>   static struct proto alg_proto = {
>>   	.name			= "ALG",
>>   	.owner			= THIS_MODULE,
>> -	.memory_allocated	=&alg_memory_allocated,
>> +	.memory_allocated	= memory_allocated_alg,
>>   	.obj_size		= sizeof(struct alg_sock),
>>   };
>>
>> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
>> index ac663c1..363b8e8 100644
>> --- a/include/linux/cgroup_subsys.h
>> +++ b/include/linux/cgroup_subsys.h
>> @@ -35,6 +35,10 @@ SUBSYS(cpuacct)
>>   SUBSYS(mem_cgroup)
>>   #endif
>>
>> +#ifdef CONFIG_CGROUP_KMEM
>> +SUBSYS(kmem)
>> +#endif
>> +
>>   /* */
>>
>>   #ifdef CONFIG_CGROUP_DEVICE
>> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
>> index d786b4f..bbd023a 100644
>> --- a/include/net/netns/ipv4.h
>> +++ b/include/net/netns/ipv4.h
>> @@ -55,6 +55,7 @@ struct netns_ipv4 {
>>   	int current_rt_cache_rebuild_count;
>>
>>   	unsigned int sysctl_ping_group_range[2];
>> +	long sysctl_tcp_mem[3];
>>
>>   	atomic_t rt_genid;
>>   	atomic_t dev_addr_genid;
>> diff --git a/include/net/sock.h b/include/net/sock.h
>> index 8e4062f..e085148 100644
>> --- a/include/net/sock.h
>> +++ b/include/net/sock.h
>> @@ -62,7 +62,9 @@
>>   #include<linux/atomic.h>
>>   #include<net/dst.h>
>>   #include<net/checksum.h>
>> +#include<linux/kmem_cgroup.h>
>>
>> +int sockets_populate(struct cgroup_subsys *ss, struct cgroup *cgrp);
>>   /*
>>    * This structure really needs to be cleaned up.
>>    * Most of it is for TCP, and not used by any of
>> @@ -339,6 +341,7 @@ struct sock {
>>   #endif
>>   	__u32			sk_mark;
>>   	u32			sk_classid;
>> +	struct kmem_cgroup	*sk_cgrp;
>>   	void			(*sk_state_change)(struct sock *sk);
>>   	void			(*sk_data_ready)(struct sock *sk, int bytes);
>>   	void			(*sk_write_space)(struct sock *sk);
>> @@ -786,16 +789,21 @@ struct proto {
>>
>>   	/* Memory pressure */
>>   	void			(*enter_memory_pressure)(struct sock *sk);
>> -	atomic_long_t		*memory_allocated;	/* Current allocated memory. */
>> -	struct percpu_counter	*sockets_allocated;	/* Current number of sockets. */
>> +	/* Current allocated memory. */
>> +	atomic_long_t		*(*memory_allocated)(struct kmem_cgroup *sg);
>> +	/* Current number of sockets. */
>> +	struct percpu_counter	*(*sockets_allocated)(struct kmem_cgroup *sg);
>> +
>> +	int			(*init_cgroup)(struct cgroup *cgrp,
>> +					       struct cgroup_subsys *ss);
>>   	/*
>>   	 * Pressure flag: try to collapse.
>>   	 * Technical note: it is used by multiple contexts non atomically.
>>   	 * All the __sk_mem_schedule() is of this nature: accounting
>>   	 * is strict, actions are advisory and have some latency.
>>   	 */
>> -	int			*memory_pressure;
>> -	long			*sysctl_mem;
>> +	int			*(*memory_pressure)(struct kmem_cgroup *sg);
>> +	long			*(*prot_mem)(struct kmem_cgroup *sg);
>
> Hmm. Don't the socket interface callbacks have documentation?
> Adding an explanation in Documentation would be better, wouldn't it?

Okay, sure thing.

>
>>   	int			*sysctl_wmem;
>>   	int			*sysctl_rmem;
>>   	int			max_header;
>> @@ -826,6 +834,56 @@ struct proto {
>>   #endif
>>   };
>>
>> +#define sk_memory_pressure(sk)						\
>> +({									\
>> +	int *__ret = NULL;						\
>> +	if ((sk)->sk_prot->memory_pressure)				\
>> +		__ret = (sk)->sk_prot->memory_pressure(sk->sk_cgrp);	\
>> +	__ret;								\
>> +})
>> +
>> +#define sk_sockets_allocated(sk)				\
>> +({								\
>> +	struct percpu_counter *__p;				\
>> +	__p = (sk)->sk_prot->sockets_allocated(sk->sk_cgrp);	\
>> +	__p;							\
>> +})
>> +
>> +#define sk_memory_allocated(sk)					\
>> +({								\
>> +	atomic_long_t *__mem;					\
>> +	__mem = (sk)->sk_prot->memory_allocated(sk->sk_cgrp);	\
>> +	__mem;							\
>> +})
>> +
>> +#define sk_prot_mem(sk)						\
>> +({								\
>> +	long *__mem = (sk)->sk_prot->prot_mem(sk->sk_cgrp);	\
>> +	__mem;							\
>> +})
>> +
>> +#define sg_memory_pressure(prot, sg)				\
>> +({								\
>> +	int *__ret = NULL;					\
>> +	if (prot->memory_pressure)				\
>> +		__ret = (prot)->memory_pressure(sg);		\
>> +	__ret;							\
>> +})
>> +
>> +#define sg_memory_allocated(prot, sg)				\
>> +({								\
>> +	atomic_long_t *__mem;					\
>> +	__mem = (prot)->memory_allocated(sg);			\
>> +	__mem;							\
>> +})
>> +
>> +#define sg_sockets_allocated(prot, sg)				\
>> +({								\
>> +	struct percpu_counter *__p;				\
>> +	__p = (prot)->sockets_allocated(sg);			\
>> +	__p;							\
>> +})
>> +
>
> Are all of these functions worth inlining?
Following the principle of minimum disruption, I wanted all of them to
be valid lvalues in any expression; that's why it is this way.
I see that below you suggest using things like
sg_sockets_allocated_inc(), dec() and read(). I prefer them to be
lvalues, but I am fine with either.
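
The two styles under discussion can be compared side by side in a small userspace sketch. Everything here is hypothetical: the toy_* names are made up for illustration, and a plain long stands in for atomic_long_t, so none of the atomicity of the real counters is modelled.

```c
/* Hypothetical stand-in for a per-cgroup counter. */
struct toy_group {
	long memory_allocated;
};

/* Style used in the patch: return a pointer the caller operates on. */
static long *toy_memory_allocated(struct toy_group *g)
{
	return &g->memory_allocated;
}

/* Style suggested in review: hide the storage behind small helpers. */
static long toy_memory_allocated_add(struct toy_group *g, long amt)
{
	g->memory_allocated += amt;
	return g->memory_allocated;
}

static long toy_memory_allocated_read(const struct toy_group *g)
{
	return g->memory_allocated;
}
```

The pointer style keeps existing call sites almost unchanged, since they can keep passing the result to the same primitives; the helper style leaves room to change how the counter is stored later without touching callers.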


>
>
>>   extern int proto_register(struct proto *prot, int alloc_slab);
>>   extern void proto_unregister(struct proto *prot);
>>
>> diff --git a/include/net/tcp.h b/include/net/tcp.h
>> index 149a415..8e1ec4a 100644
>> --- a/include/net/tcp.h
>> +++ b/include/net/tcp.h
>> @@ -230,7 +230,6 @@ extern int sysctl_tcp_fack;
>>   extern int sysctl_tcp_reordering;
>>   extern int sysctl_tcp_ecn;
>>   extern int sysctl_tcp_dsack;
>> -extern long sysctl_tcp_mem[3];
>>   extern int sysctl_tcp_wmem[3];
>>   extern int sysctl_tcp_rmem[3];
>>   extern int sysctl_tcp_app_win;
>> @@ -253,9 +252,12 @@ extern int sysctl_tcp_cookie_size;
>>   extern int sysctl_tcp_thin_linear_timeouts;
>>   extern int sysctl_tcp_thin_dupack;
>>
>> -extern atomic_long_t tcp_memory_allocated;
>> -extern struct percpu_counter tcp_sockets_allocated;
>> -extern int tcp_memory_pressure;
>> +struct kmem_cgroup;
>> +extern long *tcp_sysctl_mem(struct kmem_cgroup *sg);
>> +struct percpu_counter *sockets_allocated_tcp(struct kmem_cgroup *sg);
>> +int *memory_pressure_tcp(struct kmem_cgroup *sg);
>> +int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss);
>> +atomic_long_t *memory_allocated_tcp(struct kmem_cgroup *sg);
>>
>>   /*
>>    * The next routines deal with comparing 32 bit unsigned ints
>> @@ -286,7 +288,7 @@ static inline bool tcp_too_many_orphans(struct sock *sk, int shift)
>>   	}
>>
>>   	if (sk->sk_wmem_queued>  SOCK_MIN_SNDBUF&&
>> -	    atomic_long_read(&tcp_memory_allocated)>  sysctl_tcp_mem[2])
>> +	    atomic_long_read(sk_memory_allocated(sk))>  sk_prot_mem(sk)[2])
>
> Why doesn't sk_memory_allocated() return the value?
> Is it required to return a pointer?

Same thing. I don't feel strongly about this, and can change it.

>
>>   		return true;
>>   	return false;
>>   }
>> diff --git a/include/net/udp.h b/include/net/udp.h
>> index 67ea6fc..0e27388 100644
>> --- a/include/net/udp.h
>> +++ b/include/net/udp.h
>> @@ -105,7 +105,8 @@ static inline struct udp_hslot *udp_hashslot2(struct udp_table *table,
>>
>>   extern struct proto udp_prot;
>>
>> -extern atomic_long_t udp_memory_allocated;
>> +atomic_long_t *memory_allocated_udp(struct kmem_cgroup *sg);
>> +long *udp_sysctl_mem(struct kmem_cgroup *sg);
>>
>>   /* sysctl variables for udp */
>>   extern long sysctl_udp_mem[3];
>> diff --git a/include/trace/events/sock.h b/include/trace/events/sock.h
>> index 779abb9..12a6083 100644
>> --- a/include/trace/events/sock.h
>> +++ b/include/trace/events/sock.h
>> @@ -37,7 +37,7 @@ TRACE_EVENT(sock_exceed_buf_limit,
>>
>>   	TP_STRUCT__entry(
>>   		__array(char, name, 32)
>> -		__field(long *, sysctl_mem)
>> +		__field(long *, prot_mem)
>>   		__field(long, allocated)
>>   		__field(int, sysctl_rmem)
>>   		__field(int, rmem_alloc)
>> @@ -45,7 +45,7 @@ TRACE_EVENT(sock_exceed_buf_limit,
>>
>>   	TP_fast_assign(
>>   		strncpy(__entry->name, prot->name, 32);
>> -		__entry->sysctl_mem = prot->sysctl_mem;
>> +		__entry->prot_mem = sk->sk_prot->prot_mem(sk->sk_cgrp);
>>   		__entry->allocated = allocated;
>>   		__entry->sysctl_rmem = prot->sysctl_rmem[0];
>>   		__entry->rmem_alloc = atomic_read(&sk->sk_rmem_alloc);
>> @@ -54,9 +54,9 @@ TRACE_EVENT(sock_exceed_buf_limit,
>>   	TP_printk("proto:%s sysctl_mem=%ld,%ld,%ld allocated=%ld "
>>   		"sysctl_rmem=%d rmem_alloc=%d",
>>   		__entry->name,
>> -		__entry->sysctl_mem[0],
>> -		__entry->sysctl_mem[1],
>> -		__entry->sysctl_mem[2],
>> +		__entry->prot_mem[0],
>> +		__entry->prot_mem[1],
>> +		__entry->prot_mem[2],
>>   		__entry->allocated,
>>   		__entry->sysctl_rmem,
>>   		__entry->rmem_alloc)
>> diff --git a/init/Kconfig b/init/Kconfig
>> index d627783..5955ac2 100644
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -690,6 +690,17 @@ config CGROUP_MEM_RES_CTLR_SWAP_ENABLED
>>   	  select this option (if, for some reason, they need to disable it
>>   	  then swapaccount=0 does the trick).
>>
>> +config CGROUP_KMEM
>> +	bool "Kernel Memory Resource Controller for Control Groups"
>> +	depends on CGROUPS
>> +	help
>> +	  The Kernel Memory cgroup can limit the amount of memory used by
>> +	  certain kernel objects in the system. Those are fundamentally
>> +	  different from the entities handled by the Memory Controller,
>> +	  which are page-based, and can be swapped. Users of the kmem
>> +	  cgroup can use it to guarantee that no group of processes will
>> +	  ever exhaust kernel resources alone.
>> +
>
> This help seems nice but please add Documentation/cgroup/kmem.
Yes, sure.

>
>
>
>
>>   config CGROUP_PERF
>>   	bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
>>   	depends on PERF_EVENTS&&  CGROUPS
>> diff --git a/mm/Makefile b/mm/Makefile
>> index 836e416..1b1aa24 100644
>> --- a/mm/Makefile
>> +++ b/mm/Makefile
>> @@ -45,6 +45,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
>>   obj-$(CONFIG_QUICKLIST) += quicklist.o
>>   obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
>>   obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
>> +obj-$(CONFIG_CGROUP_KMEM) += kmem_cgroup.o
>>   obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
>>   obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
>>   obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
>> diff --git a/net/core/sock.c b/net/core/sock.c
>> index 3449df8..2d968ea 100644
>> --- a/net/core/sock.c
>> +++ b/net/core/sock.c
>> @@ -134,6 +134,24 @@
>>   #include<net/tcp.h>
>>   #endif
>>
>> +static DEFINE_RWLOCK(proto_list_lock);
>> +static LIST_HEAD(proto_list);
>> +
>> +int sockets_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
>> +{
>> +	struct proto *proto;
>> +	int ret = 0;
>> +
>> +	read_lock(&proto_list_lock);
>> +	list_for_each_entry(proto,&proto_list, node) {
>> +		if (proto->init_cgroup)
>> +			ret |= proto->init_cgroup(cgrp, ss);
>> +	}
>> +	read_unlock(&proto_list_lock);
>> +
>> +	return ret;
>> +}
>> +
>>   /*
>>    * Each address family might have different locking rules, so we have
>>    * one slock key per address family:
>> @@ -1114,6 +1132,31 @@ void sock_update_classid(struct sock *sk)
>>   EXPORT_SYMBOL(sock_update_classid);
>>   #endif
>>
>> +void sock_update_kmem_cgrp(struct sock *sk)
>> +{
>> +#ifdef CONFIG_CGROUP_KMEM
>> +	sk->sk_cgrp = kcg_from_task(current);
>> +
>> +	/*
>> +	 * We don't need to protect against anything task-related, because
>> +	 * we are basically stuck with the sock pointer that won't change,
>> +	 * even if the task that originated the socket changes cgroups.
>> +	 *
>> +	 * What we do have to guarantee, is that the chain leading us to
>> +	 * the top level won't change under our noses. Incrementing the
>> +	 * reference count via cgroup_exclude_rmdir guarantees that.
>> +	 */
>> +	cgroup_exclude_rmdir(&sk->sk_cgrp->css);
>> +#endif
>> +}
>
> I'm not sure this kind of bare cgroup code in core/sock.c will be
> welcomed by network guys.

What is your suggestion then? Just a wrapper, or do you have something
else in mind?

>
>
>
>
>> +
>> +void sock_release_kmem_cgrp(struct sock *sk)
>> +{
>> +#ifdef CONFIG_CGROUP_KMEM
>> +	cgroup_release_and_wakeup_rmdir(&sk->sk_cgrp->css);
>> +#endif
>> +}
>> +
>>   /**
>>    *	sk_alloc - All socket objects are allocated here
>>    *	@net: the applicable net namespace
>> @@ -1139,6 +1182,7 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
>>   		atomic_set(&sk->sk_wmem_alloc, 1);
>>
>>   		sock_update_classid(sk);
>> +		sock_update_kmem_cgrp(sk);
>>   	}
>>
>>   	return sk;
>> @@ -1170,6 +1214,7 @@ static void __sk_free(struct sock *sk)
>>   		put_cred(sk->sk_peer_cred);
>>   	put_pid(sk->sk_peer_pid);
>>   	put_net(sock_net(sk));
>> +	sock_release_kmem_cgrp(sk);
>>   	sk_prot_free(sk->sk_prot_creator, sk);
>>   }
>>
>> @@ -1287,8 +1332,8 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
>>   		sk_set_socket(newsk, NULL);
>>   		newsk->sk_wq = NULL;
>>
>> -		if (newsk->sk_prot->sockets_allocated)
>> -			percpu_counter_inc(newsk->sk_prot->sockets_allocated);
>> +		if (sk_sockets_allocated(sk))
>> +			percpu_counter_inc(sk_sockets_allocated(sk));
> How about
> 	sk_sockets_allocated_inc(sk);
> ?
>
>
>>
>>   		if (sock_flag(newsk, SOCK_TIMESTAMP) ||
>>   		    sock_flag(newsk, SOCK_TIMESTAMPING_RX_SOFTWARE))
>> @@ -1676,29 +1721,51 @@ EXPORT_SYMBOL(sk_wait_data);
>>    */
>>   int __sk_mem_schedule(struct sock *sk, int size, int kind)
>>   {
>> -	struct proto *prot = sk->sk_prot;
>>   	int amt = sk_mem_pages(size);
>> +	struct proto *prot = sk->sk_prot;
>>   	long allocated;
>> +	int *memory_pressure;
>> +	long *prot_mem;
>> +	int parent_failure = 0;
>> +	struct kmem_cgroup *sg;
>>
>>   	sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
>> -	allocated = atomic_long_add_return(amt, prot->memory_allocated);
>> +
>> +	memory_pressure = sk_memory_pressure(sk);
>> +	prot_mem = sk_prot_mem(sk);
>> +
>> +	allocated = atomic_long_add_return(amt, sk_memory_allocated(sk));
>> +
>> +#ifdef CONFIG_CGROUP_KMEM
>> +	for (sg = sk->sk_cgrp->parent; sg != NULL; sg = sg->parent) {
>> +		long alloc;
>> +		/*
>> +		 * Large nestings are not the common case, and stopping in the
>> +		 * middle would be complicated enough, that we bill it all the
>> +		 * way through the root, and if needed, unbill everything later
>> +		 */
>> +		alloc = atomic_long_add_return(amt,
>> +					       sg_memory_allocated(prot, sg));
>> +		parent_failure |= (alloc > sk_prot_mem(sk)[2]);
>> +	}
>> +#endif
>> +
>> +	/* Over hard limit (we, or our parents) */
>> +	if (parent_failure || (allocated > prot_mem[2]))
>> +		goto suppress_allocation;
>>
>>   	/* Under limit. */
>> -	if (allocated <= prot->sysctl_mem[0]) {
>> -		if (prot->memory_pressure && *prot->memory_pressure)
>> -			*prot->memory_pressure = 0;
>> +	if (allocated <= prot_mem[0]) {
>> +		if (memory_pressure && *memory_pressure)
>> +			*memory_pressure = 0;
>>   		return 1;
>>   	}
>>
>>   	/* Under pressure. */
>> -	if (allocated > prot->sysctl_mem[1])
>> +	if (allocated > prot_mem[1])
>>   		if (prot->enter_memory_pressure)
>>   			prot->enter_memory_pressure(sk);
>>
>> -	/* Over hard limit. */
>> -	if (allocated>  prot->sysctl_mem[2])
>> -		goto suppress_allocation;
>> -
>>   	/* guarantee minimum buffer size under pressure */
>>   	if (kind == SK_MEM_RECV) {
>>   		if (atomic_read(&sk->sk_rmem_alloc)<  prot->sysctl_rmem[0])
>> @@ -1712,13 +1779,13 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
>>   				return 1;
>>   	}
>>
>> -	if (prot->memory_pressure) {
>> +	if (memory_pressure) {
>>   		int alloc;
>>
>> -		if (!*prot->memory_pressure)
>> +		if (!*memory_pressure)
>>   			return 1;
>> -		alloc = percpu_counter_read_positive(prot->sockets_allocated);
>> -		if (prot->sysctl_mem[2]>  alloc *
>> +		alloc = percpu_counter_read_positive(sk_sockets_allocated(sk));
>> +		if (prot_mem[2]>  alloc *
>>   		    sk_mem_pages(sk->sk_wmem_queued +
>>   				 atomic_read(&sk->sk_rmem_alloc) +
>>   				 sk->sk_forward_alloc))
>> @@ -1741,7 +1808,13 @@ suppress_allocation:
>>
>>   	/* Alas. Undo changes. */
>>   	sk->sk_forward_alloc -= amt * SK_MEM_QUANTUM;
>> -	atomic_long_sub(amt, prot->memory_allocated);
>> +
>> +	atomic_long_sub(amt, sk_memory_allocated(sk));
>> +
>> +#ifdef CONFIG_CGROUP_KMEM
>> +	for (sg = sk->sk_cgrp->parent; sg != NULL; sg = sg->parent)
>> +		atomic_long_sub(amt, sg_memory_allocated(prot, sg));
>> +#endif
>>   	return 0;
>>   }
>>   EXPORT_SYMBOL(__sk_mem_schedule);
>> @@ -1753,14 +1826,24 @@ EXPORT_SYMBOL(__sk_mem_schedule);
>>   void __sk_mem_reclaim(struct sock *sk)
>>   {
>>   	struct proto *prot = sk->sk_prot;
>> +	struct kmem_cgroup *sg = sk->sk_cgrp;
>> +	int *memory_pressure = sk_memory_pressure(sk);
>>
>>   	atomic_long_sub(sk->sk_forward_alloc>>  SK_MEM_QUANTUM_SHIFT,
>> -		   prot->memory_allocated);
>> +		   sk_memory_allocated(sk));
>> +
>> +#ifdef CONFIG_CGROUP_KMEM
>> +	for (sg = sk->sk_cgrp->parent; sg != NULL; sg = sg->parent) {
>> +		atomic_long_sub(sk->sk_forward_alloc>>  SK_MEM_QUANTUM_SHIFT,
>> +						sg_memory_allocated(prot, sg));
>> +	}
>> +#endif
>> +
>>   	sk->sk_forward_alloc &= SK_MEM_QUANTUM - 1;
>>
>> -	if (prot->memory_pressure && *prot->memory_pressure &&
>> -	    (atomic_long_read(prot->memory_allocated) < prot->sysctl_mem[0]))
>> -		*prot->memory_pressure = 0;
>> +	if (memory_pressure && *memory_pressure &&
>> +	    (atomic_long_read(sk_memory_allocated(sk)) < sk_prot_mem(sk)[0]))
>> +		*memory_pressure = 0;
>>   }
>>   EXPORT_SYMBOL(__sk_mem_reclaim);
>>
>
> IMHO, I like to hide atomic_long_xxxx ops under kmem cgroup ops.
>
> And use callbacks like
> 	kmem_cgroup_read(SOCKET_MEM_ALLOCATED, sk)
>
> If other component uses kmem_cgroup, a generic interface will be
> helpful because optimization/fix in generic interface will improve
> all users of kmem_cgroup.

I honestly don't like it too much.
It introduces at least one conditional to test for the kind of variable,
which can harm some fast paths, and the return value will not always be
of the same type. Also, kmem_cgroup_read would not do much besides
accessing a piece of data, so it is not a likely hot target for future
optimizations.
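
To make the trade-off concrete, here is a minimal sketch of the two styles.
Every name in it (kmem_item_sketch, kmem_group_sketch, kmem_read_generic,
group_mem_allocated) is hypothetical and not part of the posted patch, and
plain integers stand in for the atomic_long_t counters the patch uses:

```c
#include <stdint.h>

/* Hypothetical selector; not part of the posted patch. */
enum kmem_item_sketch {
	SKETCH_MEM_ALLOCATED,
	SKETCH_MEM_PRESSURE,
	SKETCH_TCP_MAXMEM,
};

/* Hypothetical, simplified stand-in for struct kmem_cgroup. */
struct kmem_group_sketch {
	long mem_allocated;
	int memory_pressure;
	uint64_t tcp_max_memory;
};

/* Generic reader: one branch per item, every value widened to one type. */
static int64_t kmem_read_generic(const struct kmem_group_sketch *g,
				 enum kmem_item_sketch which)
{
	switch (which) {
	case SKETCH_MEM_ALLOCATED:
		return g->mem_allocated;
	case SKETCH_MEM_PRESSURE:
		return g->memory_pressure;
	case SKETCH_TCP_MAXMEM:
		return (int64_t)g->tcp_max_memory;
	}
	return -1; /* unknown selector */
}

/* Direct accessor: no branch, and the native type is preserved. */
static inline long *group_mem_allocated(struct kmem_group_sketch *g)
{
	return &g->mem_allocated;
}
```

The generic reader pays for the switch on every call and forces a common
return type, while the direct accessor compiles down to a pointer offset.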

>
>
>> @@ -2252,9 +2335,6 @@ void sk_common_release(struct sock *sk)
>>   }
>>   EXPORT_SYMBOL(sk_common_release);
>>
>> -static DEFINE_RWLOCK(proto_list_lock);
>> -static LIST_HEAD(proto_list);
>> -
>>   #ifdef CONFIG_PROC_FS
>>   #define PROTO_INUSE_NR	64	/* should be enough for the first time */
>>   struct prot_inuse {
>> @@ -2479,13 +2559,17 @@ static char proto_method_implemented(const void *method)
>>
>>   static void proto_seq_printf(struct seq_file *seq, struct proto *proto)
>>   {
>> +	struct kmem_cgroup *sg = kcg_from_task(current);
>> +
>>   	seq_printf(seq, "%-9s %4u %6d  %6ld   %-3s %6u   %-3s  %-10s "
>>   			"%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
>>   		   proto->name,
>>   		   proto->obj_size,
>>   		   sock_prot_inuse_get(seq_file_net(seq), proto),
>> -		   proto->memory_allocated != NULL ? atomic_long_read(proto->memory_allocated) : -1L,
>> -		   proto->memory_pressure != NULL ? *proto->memory_pressure ? "yes" : "no" : "NI",
>> +		   proto->memory_allocated != NULL ?
>> +			atomic_long_read(sg_memory_allocated(proto, sg)) : -1L,
>> +		   proto->memory_pressure != NULL ?
>> +			*sg_memory_pressure(proto, sg) ? "yes" : "no" : "NI",
>>   		   proto->max_header,
>>   		   proto->slab == NULL ? "no" : "yes",
>>   		   module_name(proto->owner),
>> diff --git a/net/decnet/af_decnet.c b/net/decnet/af_decnet.c
>> index 19acd00..463b299 100644
>> --- a/net/decnet/af_decnet.c
>> +++ b/net/decnet/af_decnet.c
>> @@ -458,13 +458,28 @@ static void dn_enter_memory_pressure(struct sock *sk)
>>   	}
>>   }
>>
>> +static atomic_long_t *memory_allocated_dn(struct kmem_cgroup *sg)
>> +{
>> +	return&decnet_memory_allocated;
>> +}
>> +
>> +static int *memory_pressure_dn(struct kmem_cgroup *sg)
>> +{
>> +	return&dn_memory_pressure;
>> +}
>> +
>> +static long *dn_sysctl_mem(struct kmem_cgroup *sg)
>> +{
>> +	return sysctl_decnet_mem;
>> +}
>> +
>>   static struct proto dn_proto = {
>>   	.name			= "NSP",
>>   	.owner			= THIS_MODULE,
>>   	.enter_memory_pressure	= dn_enter_memory_pressure,
>> -	.memory_pressure	=&dn_memory_pressure,
>> -	.memory_allocated	=&decnet_memory_allocated,
>> -	.sysctl_mem		= sysctl_decnet_mem,
>> +	.memory_pressure	= memory_pressure_dn,
>> +	.memory_allocated	= memory_allocated_dn,
>> +	.prot_mem		= dn_sysctl_mem,
>>   	.sysctl_wmem		= sysctl_decnet_wmem,
>>   	.sysctl_rmem		= sysctl_decnet_rmem,
>>   	.max_header		= DN_MAX_NSP_DATA_HEADER + 64,
>> diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
>> index b14ec7d..9c80acf 100644
>> --- a/net/ipv4/proc.c
>> +++ b/net/ipv4/proc.c
>> @@ -52,20 +52,22 @@ static int sockstat_seq_show(struct seq_file *seq, void *v)
>>   {
>>   	struct net *net = seq->private;
>>   	int orphans, sockets;
>> +	struct kmem_cgroup *sg = kcg_from_task(current);
>> +	struct percpu_counter *allocated = sg_sockets_allocated(&tcp_prot, sg);
>>
>>   	local_bh_disable();
>>   	orphans = percpu_counter_sum_positive(&tcp_orphan_count);
>> -	sockets = percpu_counter_sum_positive(&tcp_sockets_allocated);
>> +	sockets = percpu_counter_sum_positive(allocated);
>>   	local_bh_enable();
>>
>>   	socket_seq_show(seq);
>>   	seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %ld\n",
>>   		   sock_prot_inuse_get(net,&tcp_prot), orphans,
>>   		   tcp_death_row.tw_count, sockets,
>> -		   atomic_long_read(&tcp_memory_allocated));
>> +		   atomic_long_read(sg_memory_allocated((&tcp_prot), sg)));
>>   	seq_printf(seq, "UDP: inuse %d mem %ld\n",
>>   		   sock_prot_inuse_get(net,&udp_prot),
>> -		   atomic_long_read(&udp_memory_allocated));
>> +		   atomic_long_read(sg_memory_allocated((&udp_prot), sg)));
>>   	seq_printf(seq, "UDPLITE: inuse %d\n",
>>   		   sock_prot_inuse_get(net,&udplite_prot));
>>   	seq_printf(seq, "RAW: inuse %d\n",
>> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
>> index 69fd720..5e89480 100644
>> --- a/net/ipv4/sysctl_net_ipv4.c
>> +++ b/net/ipv4/sysctl_net_ipv4.c
>> @@ -14,6 +14,8 @@
>>   #include<linux/init.h>
>>   #include<linux/slab.h>
>>   #include<linux/nsproxy.h>
>> +#include<linux/kmem_cgroup.h>
>> +#include<linux/swap.h>
>>   #include<net/snmp.h>
>>   #include<net/icmp.h>
>>   #include<net/ip.h>
>> @@ -174,6 +176,43 @@ static int proc_allowed_congestion_control(ctl_table *ctl,
>>   	return ret;
>>   }
>>
>> +static int ipv4_tcp_mem(ctl_table *ctl, int write,
>> +			   void __user *buffer, size_t *lenp,
>> +			   loff_t *ppos)
>> +{
>> +	int ret;
>> +	unsigned long vec[3];
>> +	struct kmem_cgroup *kmem = kcg_from_task(current);
>> +	struct net *net = current->nsproxy->net_ns;
>> +	int i;
>> +
>> +	ctl_table tmp = {
>> +		.data =&vec,
>> +		.maxlen = sizeof(vec),
>> +		.mode = ctl->mode,
>> +	};
>> +
>> +	if (!write) {
>> +		ctl->data =&net->ipv4.sysctl_tcp_mem;
>> +		return proc_doulongvec_minmax(ctl, write, buffer, lenp, ppos);
>> +	}
>> +
>> +	ret = proc_doulongvec_minmax(&tmp, write, buffer, lenp, ppos);
>> +	if (ret)
>> +		return ret;
>> +
>> +	for (i = 0; i<  3; i++)
>> +		if (vec[i]>  kmem->tcp_max_memory)
>> +			return -EINVAL;
>> +
>> +	for (i = 0; i<  3; i++) {
>> +		net->ipv4.sysctl_tcp_mem[i] = vec[i];
>> +		kmem->tcp_prot_mem[i] = net->ipv4.sysctl_tcp_mem[i];
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>>   static struct ctl_table ipv4_table[] = {
>>   	{
>>   		.procname	= "tcp_timestamps",
>> @@ -433,13 +472,6 @@ static struct ctl_table ipv4_table[] = {
>>   		.proc_handler	= proc_dointvec
>>   	},
>>   	{
>> -		.procname	= "tcp_mem",
>> -		.data		=&sysctl_tcp_mem,
>> -		.maxlen		= sizeof(sysctl_tcp_mem),
>> -		.mode		= 0644,
>> -		.proc_handler	= proc_doulongvec_minmax
>> -	},
>> -	{
>>   		.procname	= "tcp_wmem",
>>   		.data		=&sysctl_tcp_wmem,
>>   		.maxlen		= sizeof(sysctl_tcp_wmem),
>> @@ -721,6 +753,12 @@ static struct ctl_table ipv4_net_table[] = {
>>   		.mode		= 0644,
>>   		.proc_handler	= ipv4_ping_group_range,
>>   	},
>> +	{
>> +		.procname	= "tcp_mem",
>> +		.maxlen		= sizeof(init_net.ipv4.sysctl_tcp_mem),
>> +		.mode		= 0644,
>> +		.proc_handler	= ipv4_tcp_mem,
>> +	},
>>   	{ }
>>   };
>>
>> @@ -734,6 +772,7 @@ EXPORT_SYMBOL_GPL(net_ipv4_ctl_path);
>>   static __net_init int ipv4_sysctl_init_net(struct net *net)
>>   {
>>   	struct ctl_table *table;
>> +	unsigned long limit;
>>
>>   	table = ipv4_net_table;
>>   	if (!net_eq(net,&init_net)) {
>> @@ -769,6 +808,12 @@ static __net_init int ipv4_sysctl_init_net(struct net *net)
>>
>>   	net->ipv4.sysctl_rt_cache_rebuild_count = 4;
>>
>> +	limit = nr_free_buffer_pages() / 8;
>> +	limit = max(limit, 128UL);
>> +	net->ipv4.sysctl_tcp_mem[0] = limit / 4 * 3;
>> +	net->ipv4.sysctl_tcp_mem[1] = limit;
>> +	net->ipv4.sysctl_tcp_mem[2] = net->ipv4.sysctl_tcp_mem[0] * 2;
>> +
>
> What this calculation means ? Documented somewhere ?
I just copied the numbers from the socket initialization, because I
wanted to keep the same behaviour. No documentation there, no
documentation here. I don't know if there is any particular reason to
choose them over any others. If someone more knowledgeable would
comment, I can use the chance to introduce a comment.
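
For what it's worth, the arithmetic itself can be sketched in isolation as
below. The struct, its field names and tcp_mem_defaults are hypothetical;
free_buffer_pages is a caller-supplied stand-in for the value returned by
nr_free_buffer_pages(), and the field comments reflect how
__sk_mem_schedule() uses the three entries in this patch:

```c
/* Hypothetical container for the three tcp_mem thresholds, in pages. */
struct tcp_mem_limits {
	unsigned long low;      /* [0]: at or below this, pressure is cleared */
	unsigned long pressure; /* [1]: above this, memory pressure is entered */
	unsigned long high;     /* [2]: above this, allocation is suppressed */
};

/*
 * Same arithmetic as the quoted hunk: one eighth of the free buffer
 * pages, floored at 128, then split into 3/4, 1 and 3/2 of that value.
 */
static struct tcp_mem_limits tcp_mem_defaults(unsigned long free_buffer_pages)
{
	struct tcp_mem_limits m;
	unsigned long limit = free_buffer_pages / 8;

	if (limit < 128UL)
		limit = 128UL;

	m.low = limit / 4 * 3;
	m.pressure = limit;
	m.high = m.low * 2;
	return m;
}
```

With 1048576 free buffer pages (4 GiB of 4 KiB pages) this gives 98304,
131072 and 196608 pages; with only 512 pages the floor applies and the
result is 96, 128 and 192.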

>
>
>
>>   	net->ipv4.ipv4_hdr = register_net_sysctl_table(net,
>>   			net_ipv4_ctl_path, table);
>>   	if (net->ipv4.ipv4_hdr == NULL)
>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>> index 46febca..e1918fa 100644
>> --- a/net/ipv4/tcp.c
>> +++ b/net/ipv4/tcp.c
>> @@ -266,6 +266,7 @@
>>   #include<linux/crypto.h>
>>   #include<linux/time.h>
>>   #include<linux/slab.h>
>> +#include<linux/nsproxy.h>
>>
>>   #include<net/icmp.h>
>>   #include<net/tcp.h>
>> @@ -282,23 +283,12 @@ int sysctl_tcp_fin_timeout __read_mostly = TCP_FIN_TIMEOUT;
>>   struct percpu_counter tcp_orphan_count;
>>   EXPORT_SYMBOL_GPL(tcp_orphan_count);
>>
>> -long sysctl_tcp_mem[3] __read_mostly;
>>   int sysctl_tcp_wmem[3] __read_mostly;
>>   int sysctl_tcp_rmem[3] __read_mostly;
>>
>> -EXPORT_SYMBOL(sysctl_tcp_mem);
>>   EXPORT_SYMBOL(sysctl_tcp_rmem);
>>   EXPORT_SYMBOL(sysctl_tcp_wmem);
>>
>> -atomic_long_t tcp_memory_allocated;	/* Current allocated memory. */
>> -EXPORT_SYMBOL(tcp_memory_allocated);
>> -
>> -/*
>> - * Current number of TCP sockets.
>> - */
>> -struct percpu_counter tcp_sockets_allocated;
>> -EXPORT_SYMBOL(tcp_sockets_allocated);
>> -
>>   /*
>>    * TCP splice context
>>    */
>> @@ -308,23 +298,157 @@ struct tcp_splice_state {
>>   	unsigned int flags;
>>   };
>>
>> +#ifdef CONFIG_CGROUP_KMEM
>>   /*
>>    * Pressure flag: try to collapse.
>>    * Technical note: it is used by multiple contexts non atomically.
>>    * All the __sk_mem_schedule() is of this nature: accounting
>>    * is strict, actions are advisory and have some latency.
>>    */
>> -int tcp_memory_pressure __read_mostly;
>> -EXPORT_SYMBOL(tcp_memory_pressure);
>> -
>>   void tcp_enter_memory_pressure(struct sock *sk)
>>   {
>> +	struct kmem_cgroup *sg = sk->sk_cgrp;
>> +	if (!sg->tcp_memory_pressure) {
>> +		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPMEMORYPRESSURES);
>> +		sg->tcp_memory_pressure = 1;
>> +	}
>> +}
>> +
>> +long *tcp_sysctl_mem(struct kmem_cgroup *sg)
>> +{
>> +	return sg->tcp_prot_mem;
>> +}
>> +
>> +atomic_long_t *memory_allocated_tcp(struct kmem_cgroup *sg)
>> +{
>> +	return&(sg->tcp_memory_allocated);
>> +}
>> +
>> +static int tcp_write_maxmem(struct cgroup *cgrp, struct cftype *cft, u64 val)
>> +{
>> +	struct kmem_cgroup *sg = kcg_from_cgroup(cgrp);
>> +	struct net *net = current->nsproxy->net_ns;
>> +	int i;
>> +
>> +	if (!cgroup_lock_live_group(cgrp))
>> +		return -ENODEV;
>> +
>> +	/*
>> +	 * We can't allow more memory than our parents. Since this
>> +	 * will be tested for all calls, by induction, there is no need
>> +	 * to test any parent other than our own
>> +	 * */
>> +	if (sg->parent&&  (val>  sg->parent->tcp_max_memory))
>> +		val = sg->parent->tcp_max_memory;
>> +
>> +	sg->tcp_max_memory = val;
>> +
>> +	for (i = 0; i<  3; i++)
>> +		sg->tcp_prot_mem[i]  = min_t(long, val,
>> +					     net->ipv4.sysctl_tcp_mem[i]);
>> +
>> +	cgroup_unlock();
>> +
>> +	return 0;
>> +}
>> +
>
> Do we really need cgroup_lock/unlock ?
The task writing to a cgroup file does not necessarily belong to that
cgroup. This means there is no guarantee that the cgroup will exist
through the rest of this operation. Maybe this could be done with RCU,
since all we need is a reference to the cgroup (and the parent). But
other cgroup file writers seem to use this lock...
>
>
>
>> +static u64 tcp_read_maxmem(struct cgroup *cgrp, struct cftype *cft)
>> +{
>> +	struct kmem_cgroup *sg = kcg_from_cgroup(cgrp);
>> +	u64 ret;
>> +
>> +	if (!cgroup_lock_live_group(cgrp))
>> +		return -ENODEV;
>> +	ret = sg->tcp_max_memory;
>> +
>> +	cgroup_unlock();
>> +	return ret;
>> +}
>> +
>
>
> Hmm, can't you implement this function as
>
> 	kmem_cgroup_read(SOCK_TCP_MAXMEM, sg);

I can, but as I stated above, I don't think it is a nice change.

>
> Thanks,
> -Kame

Thank you very much for your attention.


^ permalink raw reply

* Re: [PATCH] per-cgroup tcp buffer limitation
From: Greg Thelen @ 2011-09-06 16:08 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, linux-mm, containers, netdev, xemul,
	David S. Miller, Hiroyouki Kamezawa, Eric W. Biederman
In-Reply-To: <1315276556-10970-1-git-send-email-glommer@parallels.com>

On Mon, Sep 5, 2011 at 7:35 PM, Glauber Costa <glommer@parallels.com> wrote:
> This patch introduces per-cgroup tcp buffers limitation. This allows
> sysadmins to specify a maximum amount of kernel memory that
> tcp connections can use at any point in time. TCP is the main interest
> in this work, but extending it to other protocols would be easy.

With this approach we would be giving admins the ability to
independently limit user memory with memcg and kernel memory with this
new kmem cgroup.

At least in some situations admins prefer to give a particular
container X bytes without thinking about the kernel vs user split.
Sometimes the admin would prefer the kernel to keep the total
user+kernel memory below a certain threshold.  To achieve this with
this approach would we need a user space agent to monitor both kernel
and user usage for a container and grow/shrink memcg/kmem limits?

Do you foresee the kmem cgroup growing to include reclaimable slab,
where freeing one type of memory allows for reclaim of the other?

> It piggybacks in the memory control mechanism already present in
> /proc/sys/net/ipv4/tcp_mem. There is a soft limit, and a hard limit,
> that will suppress allocation when reached. For each cgroup, however,
> the file kmem.tcp_maxmem will be used to cap those values.
>
> The usage I have in mind here is containers. Each container will
> define its own values for soft and hard limits, but none of them will
> be possibly bigger than the value the box' sysadmin specified from
> the outside.
>
> To test for any performance impacts of this patch, I used netperf's
> TCP_RR benchmark on localhost, so we can have both recv and snd in action.
>
> Command line used was ./src/netperf -t TCP_RR -H localhost, and the
> results:
>
> Without the patch
> =================
>
> Socket Size   Request  Resp.   Elapsed  Trans.
> Send   Recv   Size     Size    Time     Rate
> bytes  Bytes  bytes    bytes   secs.    per sec
>
> 16384  87380  1        1       10.00    26996.35
> 16384  87380
>
> With the patch
> ===============
>
> Local /Remote
> Socket Size   Request  Resp.   Elapsed  Trans.
> Send   Recv   Size     Size    Time     Rate
> bytes  Bytes  bytes    bytes   secs.    per sec
>
> 16384  87380  1        1       10.00    27291.86
> 16384  87380
>
> The difference is within a one-percent range.
>
> Nesting cgroups doesn't seem to be the dominating factor as well,
> with nestings up to 10 levels not showing a significant performance
> difference.
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: David S. Miller <davem@davemloft.net>
> CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>
> ---
>  crypto/af_alg.c               |    7 ++-
>  include/linux/cgroup_subsys.h |    4 +
>  include/net/netns/ipv4.h      |    1 +
>  include/net/sock.h            |   66 +++++++++++++++-
>  include/net/tcp.h             |   12 ++-
>  include/net/udp.h             |    3 +-
>  include/trace/events/sock.h   |   10 +-
>  init/Kconfig                  |   11 +++
>  mm/Makefile                   |    1 +
>  net/core/sock.c               |  136 +++++++++++++++++++++++++++-------
>  net/decnet/af_decnet.c        |   21 +++++-
>  net/ipv4/proc.c               |    8 +-
>  net/ipv4/sysctl_net_ipv4.c    |   59 +++++++++++++--
>  net/ipv4/tcp.c                |  164 +++++++++++++++++++++++++++++++++++-----
>  net/ipv4/tcp_input.c          |   17 ++--
>  net/ipv4/tcp_ipv4.c           |   27 +++++--
>  net/ipv4/tcp_output.c         |    2 +-
>  net/ipv4/tcp_timer.c          |    2 +-
>  net/ipv4/udp.c                |   20 ++++-
>  net/ipv6/tcp_ipv6.c           |   16 +++-
>  net/ipv6/udp.c                |    4 +-
>  net/sctp/socket.c             |   35 +++++++--
>  22 files changed, 514 insertions(+), 112 deletions(-)
>
> diff --git a/crypto/af_alg.c b/crypto/af_alg.c
> index ac33d5f..df168d8 100644
> --- a/crypto/af_alg.c
> +++ b/crypto/af_alg.c
> @@ -29,10 +29,15 @@ struct alg_type_list {
>
>  static atomic_long_t alg_memory_allocated;
>
> +static atomic_long_t *memory_allocated_alg(struct kmem_cgroup *sg)
> +{
> +       return &alg_memory_allocated;
> +}
> +
>  static struct proto alg_proto = {
>        .name                   = "ALG",
>        .owner                  = THIS_MODULE,
> -       .memory_allocated       = &alg_memory_allocated,
> +       .memory_allocated       = memory_allocated_alg,
>        .obj_size               = sizeof(struct alg_sock),
>  };
>
> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> index ac663c1..363b8e8 100644
> --- a/include/linux/cgroup_subsys.h
> +++ b/include/linux/cgroup_subsys.h
> @@ -35,6 +35,10 @@ SUBSYS(cpuacct)
>  SUBSYS(mem_cgroup)
>  #endif
>
> +#ifdef CONFIG_CGROUP_KMEM
> +SUBSYS(kmem)
> +#endif
> +
>  /* */
>
>  #ifdef CONFIG_CGROUP_DEVICE
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index d786b4f..bbd023a 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -55,6 +55,7 @@ struct netns_ipv4 {
>        int current_rt_cache_rebuild_count;
>
>        unsigned int sysctl_ping_group_range[2];
> +       long sysctl_tcp_mem[3];
>
>        atomic_t rt_genid;
>        atomic_t dev_addr_genid;
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 8e4062f..e085148 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -62,7 +62,9 @@
>  #include <linux/atomic.h>
>  #include <net/dst.h>
>  #include <net/checksum.h>
> +#include <linux/kmem_cgroup.h>
>
> +int sockets_populate(struct cgroup_subsys *ss, struct cgroup *cgrp);
>  /*
>  * This structure really needs to be cleaned up.
>  * Most of it is for TCP, and not used by any of
> @@ -339,6 +341,7 @@ struct sock {
>  #endif
>        __u32                   sk_mark;
>        u32                     sk_classid;
> +       struct kmem_cgroup      *sk_cgrp;
>        void                    (*sk_state_change)(struct sock *sk);
>        void                    (*sk_data_ready)(struct sock *sk, int bytes);
>        void                    (*sk_write_space)(struct sock *sk);
> @@ -786,16 +789,21 @@ struct proto {
>
>        /* Memory pressure */
>        void                    (*enter_memory_pressure)(struct sock *sk);
> -       atomic_long_t           *memory_allocated;      /* Current allocated memory. */
> -       struct percpu_counter   *sockets_allocated;     /* Current number of sockets. */
> +       /* Current allocated memory. */
> +       atomic_long_t           *(*memory_allocated)(struct kmem_cgroup *sg);
> +       /* Current number of sockets. */
> +       struct percpu_counter   *(*sockets_allocated)(struct kmem_cgroup *sg);
> +
> +       int                     (*init_cgroup)(struct cgroup *cgrp,
> +                                              struct cgroup_subsys *ss);
>        /*
>         * Pressure flag: try to collapse.
>         * Technical note: it is used by multiple contexts non atomically.
>         * All the __sk_mem_schedule() is of this nature: accounting
>         * is strict, actions are advisory and have some latency.
>         */
> -       int                     *memory_pressure;
> -       long                    *sysctl_mem;
> +       int                     *(*memory_pressure)(struct kmem_cgroup *sg);
> +       long                    *(*prot_mem)(struct kmem_cgroup *sg);
>        int                     *sysctl_wmem;
>        int                     *sysctl_rmem;
>        int                     max_header;
> @@ -826,6 +834,56 @@ struct proto {
>  #endif
>  };
>
> +#define sk_memory_pressure(sk)                                         \
> +({                                                                     \
> +       int *__ret = NULL;                                              \
> +       if ((sk)->sk_prot->memory_pressure)                             \
> +               __ret = (sk)->sk_prot->memory_pressure(sk->sk_cgrp);    \
> +       __ret;                                                          \
> +})
> +
> +#define sk_sockets_allocated(sk)                               \
> +({                                                             \
> +       struct percpu_counter *__p;                             \
> +       __p = (sk)->sk_prot->sockets_allocated(sk->sk_cgrp);    \
> +       __p;                                                    \
> +})
> +
> +#define sk_memory_allocated(sk)                                        \
> +({                                                             \
> +       atomic_long_t *__mem;                                   \
> +       __mem = (sk)->sk_prot->memory_allocated(sk->sk_cgrp);   \
> +       __mem;                                                  \
> +})
> +
> +#define sk_prot_mem(sk)                                                \
> +({                                                             \
> +       long *__mem = (sk)->sk_prot->prot_mem(sk->sk_cgrp);     \
> +       __mem;                                                  \
> +})
> +
> +#define sg_memory_pressure(prot, sg)                           \
> +({                                                             \
> +       int *__ret = NULL;                                      \
> +       if (prot->memory_pressure)                              \
> +               __ret = (prot)->memory_pressure(sg);            \
> +       __ret;                                                  \
> +})
> +
> +#define sg_memory_allocated(prot, sg)                          \
> +({                                                             \
> +       atomic_long_t *__mem;                                   \
> +       __mem = (prot)->memory_allocated(sg);                   \
> +       __mem;                                                  \
> +})
> +
> +#define sg_sockets_allocated(prot, sg)                         \
> +({                                                             \
> +       struct percpu_counter *__p;                             \
> +       __p = (prot)->sockets_allocated(sg);                    \
> +       __p;                                                    \
> +})
> +
>  extern int proto_register(struct proto *prot, int alloc_slab);
>  extern void proto_unregister(struct proto *prot);
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 149a415..8e1ec4a 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -230,7 +230,6 @@ extern int sysctl_tcp_fack;
>  extern int sysctl_tcp_reordering;
>  extern int sysctl_tcp_ecn;
>  extern int sysctl_tcp_dsack;
> -extern long sysctl_tcp_mem[3];
>  extern int sysctl_tcp_wmem[3];
>  extern int sysctl_tcp_rmem[3];
>  extern int sysctl_tcp_app_win;
> @@ -253,9 +252,12 @@ extern int sysctl_tcp_cookie_size;
>  extern int sysctl_tcp_thin_linear_timeouts;
>  extern int sysctl_tcp_thin_dupack;
>
> -extern atomic_long_t tcp_memory_allocated;
> -extern struct percpu_counter tcp_sockets_allocated;
> -extern int tcp_memory_pressure;
> +struct kmem_cgroup;
> +extern long *tcp_sysctl_mem(struct kmem_cgroup *sg);
> +struct percpu_counter *sockets_allocated_tcp(struct kmem_cgroup *sg);
> +int *memory_pressure_tcp(struct kmem_cgroup *sg);
> +int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss);
> +atomic_long_t *memory_allocated_tcp(struct kmem_cgroup *sg);
>
>  /*
>  * The next routines deal with comparing 32 bit unsigned ints
> @@ -286,7 +288,7 @@ static inline bool tcp_too_many_orphans(struct sock *sk, int shift)
>        }
>
>        if (sk->sk_wmem_queued > SOCK_MIN_SNDBUF &&
> -           atomic_long_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])
> +           atomic_long_read(sk_memory_allocated(sk)) > sk_prot_mem(sk)[2])
>                return true;
>        return false;
>  }
> diff --git a/include/net/udp.h b/include/net/udp.h
> index 67ea6fc..0e27388 100644
> --- a/include/net/udp.h
> +++ b/include/net/udp.h
> @@ -105,7 +105,8 @@ static inline struct udp_hslot *udp_hashslot2(struct udp_table *table,
>
>  extern struct proto udp_prot;
>
> -extern atomic_long_t udp_memory_allocated;
> +atomic_long_t *memory_allocated_udp(struct kmem_cgroup *sg);
> +long *udp_sysctl_mem(struct kmem_cgroup *sg);
>
>  /* sysctl variables for udp */
>  extern long sysctl_udp_mem[3];
> diff --git a/include/trace/events/sock.h b/include/trace/events/sock.h
> index 779abb9..12a6083 100644
> --- a/include/trace/events/sock.h
> +++ b/include/trace/events/sock.h
> @@ -37,7 +37,7 @@ TRACE_EVENT(sock_exceed_buf_limit,
>
>        TP_STRUCT__entry(
>                __array(char, name, 32)
> -               __field(long *, sysctl_mem)
> +               __field(long *, prot_mem)
>                __field(long, allocated)
>                __field(int, sysctl_rmem)
>                __field(int, rmem_alloc)
> @@ -45,7 +45,7 @@ TRACE_EVENT(sock_exceed_buf_limit,
>
>        TP_fast_assign(
>                strncpy(__entry->name, prot->name, 32);
> -               __entry->sysctl_mem = prot->sysctl_mem;
> +               __entry->prot_mem = sk->sk_prot->prot_mem(sk->sk_cgrp);
>                __entry->allocated = allocated;
>                __entry->sysctl_rmem = prot->sysctl_rmem[0];
>                __entry->rmem_alloc = atomic_read(&sk->sk_rmem_alloc);
> @@ -54,9 +54,9 @@ TRACE_EVENT(sock_exceed_buf_limit,
>        TP_printk("proto:%s sysctl_mem=%ld,%ld,%ld allocated=%ld "
>                "sysctl_rmem=%d rmem_alloc=%d",
>                __entry->name,
> -               __entry->sysctl_mem[0],
> -               __entry->sysctl_mem[1],
> -               __entry->sysctl_mem[2],
> +               __entry->prot_mem[0],
> +               __entry->prot_mem[1],
> +               __entry->prot_mem[2],
>                __entry->allocated,
>                __entry->sysctl_rmem,
>                __entry->rmem_alloc)
> diff --git a/init/Kconfig b/init/Kconfig
> index d627783..5955ac2 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -690,6 +690,17 @@ config CGROUP_MEM_RES_CTLR_SWAP_ENABLED
>          select this option (if, for some reason, they need to disable it
>          then swapaccount=0 does the trick).
>
> +config CGROUP_KMEM
> +       bool "Kernel Memory Resource Controller for Control Groups"
> +       depends on CGROUPS
> +       help
> +         The Kernel Memory cgroup can limit the amount of memory used by
> +         certain kernel objects in the system. Those are fundamentally
> +         different from the entities handled by the Memory Controller,
> +         which are page-based, and can be swapped. Users of the kmem
> +         cgroup can use it to guarantee that no group of processes will
> +         ever exhaust kernel resources alone.
> +
>  config CGROUP_PERF
>        bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
>        depends on PERF_EVENTS && CGROUPS
> diff --git a/mm/Makefile b/mm/Makefile
> index 836e416..1b1aa24 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -45,6 +45,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
>  obj-$(CONFIG_QUICKLIST) += quicklist.o
>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
>  obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
> +obj-$(CONFIG_CGROUP_KMEM) += kmem_cgroup.o
>  obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
>  obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
>  obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 3449df8..2d968ea 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -134,6 +134,24 @@
>  #include <net/tcp.h>
>  #endif
>
> +static DEFINE_RWLOCK(proto_list_lock);
> +static LIST_HEAD(proto_list);
> +
> +int sockets_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
> +{
> +       struct proto *proto;
> +       int ret = 0;
> +
> +       read_lock(&proto_list_lock);
> +       list_for_each_entry(proto, &proto_list, node) {
> +               if (proto->init_cgroup)
> +                       ret |= proto->init_cgroup(cgrp, ss);
> +       }
> +       read_unlock(&proto_list_lock);
> +
> +       return ret;
> +}
> +
>  /*
>  * Each address family might have different locking rules, so we have
>  * one slock key per address family:
> @@ -1114,6 +1132,31 @@ void sock_update_classid(struct sock *sk)
>  EXPORT_SYMBOL(sock_update_classid);
>  #endif
>
> +void sock_update_kmem_cgrp(struct sock *sk)
> +{
> +#ifdef CONFIG_CGROUP_KMEM
> +       sk->sk_cgrp = kcg_from_task(current);
> +
> +       /*
> +        * We don't need to protect against anything task-related, because
> +        * we are basically stuck with the sock pointer that won't change,
> +        * even if the task that originated the socket changes cgroups.
> +        *
> +        * What we do have to guarantee is that the chain leading us to
> +        * the top level won't change under our noses. Incrementing the
> +        * reference count via cgroup_exclude_rmdir guarantees that.
> +        */
> +       cgroup_exclude_rmdir(&sk->sk_cgrp->css);
> +#endif
> +}
> +
> +void sock_release_kmem_cgrp(struct sock *sk)
> +{
> +#ifdef CONFIG_CGROUP_KMEM
> +       cgroup_release_and_wakeup_rmdir(&sk->sk_cgrp->css);
> +#endif
> +}
> +
>  /**
>  *     sk_alloc - All socket objects are allocated here
>  *     @net: the applicable net namespace
> @@ -1139,6 +1182,7 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
>                atomic_set(&sk->sk_wmem_alloc, 1);
>
>                sock_update_classid(sk);
> +               sock_update_kmem_cgrp(sk);
>        }
>
>        return sk;
> @@ -1170,6 +1214,7 @@ static void __sk_free(struct sock *sk)
>                put_cred(sk->sk_peer_cred);
>        put_pid(sk->sk_peer_pid);
>        put_net(sock_net(sk));
> +       sock_release_kmem_cgrp(sk);
>        sk_prot_free(sk->sk_prot_creator, sk);
>  }
>
> @@ -1287,8 +1332,8 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
>                sk_set_socket(newsk, NULL);
>                newsk->sk_wq = NULL;
>
> -               if (newsk->sk_prot->sockets_allocated)
> -                       percpu_counter_inc(newsk->sk_prot->sockets_allocated);
> +               if (sk_sockets_allocated(sk))
> +                       percpu_counter_inc(sk_sockets_allocated(sk));
>
>                if (sock_flag(newsk, SOCK_TIMESTAMP) ||
>                    sock_flag(newsk, SOCK_TIMESTAMPING_RX_SOFTWARE))
> @@ -1676,29 +1721,51 @@ EXPORT_SYMBOL(sk_wait_data);
>  */
>  int __sk_mem_schedule(struct sock *sk, int size, int kind)
>  {
> -       struct proto *prot = sk->sk_prot;
>        int amt = sk_mem_pages(size);
> +       struct proto *prot = sk->sk_prot;
>        long allocated;
> +       int *memory_pressure;
> +       long *prot_mem;
> +       int parent_failure = 0;
> +       struct kmem_cgroup *sg;
>
>        sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
> -       allocated = atomic_long_add_return(amt, prot->memory_allocated);
> +
> +       memory_pressure = sk_memory_pressure(sk);
> +       prot_mem = sk_prot_mem(sk);
> +
> +       allocated = atomic_long_add_return(amt, sk_memory_allocated(sk));
> +
> +#ifdef CONFIG_CGROUP_KMEM
> +       for (sg = sk->sk_cgrp->parent; sg != NULL; sg = sg->parent) {
> +               long alloc;
> +               /*
> +                * Deep nesting is not the common case, and stopping in
> +                * the middle would complicate the unwind, so we bill all
> +                * the way up to the root and, if needed, unbill it later.
> +                */
> +               alloc = atomic_long_add_return(amt,
> +                                              sg_memory_allocated(prot, sg));
> +               parent_failure |= (alloc > sk_prot_mem(sk)[2]);
> +       }
> +#endif
> +
> +       /* Over hard limit (we, or our parents) */
> +       if (parent_failure || (allocated > prot_mem[2]))
> +               goto suppress_allocation;
>
>        /* Under limit. */
> -       if (allocated <= prot->sysctl_mem[0]) {
> -               if (prot->memory_pressure && *prot->memory_pressure)
> -                       *prot->memory_pressure = 0;
> +       if (allocated <= prot_mem[0]) {
> +               if (memory_pressure && *memory_pressure)
> +                       *memory_pressure = 0;
>                return 1;
>        }
>
>        /* Under pressure. */
> -       if (allocated > prot->sysctl_mem[1])
> +       if (allocated > prot_mem[1])
>                if (prot->enter_memory_pressure)
>                        prot->enter_memory_pressure(sk);
>
> -       /* Over hard limit. */
> -       if (allocated > prot->sysctl_mem[2])
> -               goto suppress_allocation;
> -
>        /* guarantee minimum buffer size under pressure */
>        if (kind == SK_MEM_RECV) {
>                if (atomic_read(&sk->sk_rmem_alloc) < prot->sysctl_rmem[0])
> @@ -1712,13 +1779,13 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
>                                return 1;
>        }
>
> -       if (prot->memory_pressure) {
> +       if (memory_pressure) {
>                int alloc;
>
> -               if (!*prot->memory_pressure)
> +               if (!*memory_pressure)
>                        return 1;
> -               alloc = percpu_counter_read_positive(prot->sockets_allocated);
> -               if (prot->sysctl_mem[2] > alloc *
> +               alloc = percpu_counter_read_positive(sk_sockets_allocated(sk));
> +               if (prot_mem[2] > alloc *
>                    sk_mem_pages(sk->sk_wmem_queued +
>                                 atomic_read(&sk->sk_rmem_alloc) +
>                                 sk->sk_forward_alloc))
> @@ -1741,7 +1808,13 @@ suppress_allocation:
>
>        /* Alas. Undo changes. */
>        sk->sk_forward_alloc -= amt * SK_MEM_QUANTUM;
> -       atomic_long_sub(amt, prot->memory_allocated);
> +
> +       atomic_long_sub(amt, sk_memory_allocated(sk));
> +
> +#ifdef CONFIG_CGROUP_KMEM
> +       for (sg = sk->sk_cgrp->parent; sg != NULL; sg = sg->parent)
> +               atomic_long_sub(amt, sg_memory_allocated(prot, sg));
> +#endif
>        return 0;
>  }
>  EXPORT_SYMBOL(__sk_mem_schedule);
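
The charge-then-unbill walk in __sk_mem_schedule() above can be modelled in userspace. This is only a rough sketch under stated assumptions: struct group and charge_hierarchy() are hypothetical names, plain longs replace atomic_long_t, and each level is checked against its own hard limit, which is not what the quoted hunk does (it compares every ancestor against sk_prot_mem(sk)[2]).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct kmem_cgroup: one counter, one hard limit. */
struct group {
	struct group *parent;   /* NULL at the root */
	long allocated;         /* pages currently billed to this group */
	long hard_limit;        /* plays the role of prot_mem[2] */
};

/*
 * Bill amt pages to grp and to every ancestor. If any level ends up over
 * its hard limit, undo the whole charge and return 0; otherwise return 1.
 * Not atomic: a sketch of the control flow only.
 */
int charge_hierarchy(struct group *grp, long amt)
{
	struct group *g;
	int over = 0;

	assert(amt >= 0);

	/* Bill all the way up, remembering whether any level overflowed. */
	for (g = grp; g != NULL; g = g->parent) {
		g->allocated += amt;
		over |= (g->allocated > g->hard_limit);
	}

	if (!over)
		return 1;

	/* Unbill everything, mirroring the suppress_allocation path. */
	for (g = grp; g != NULL; g = g->parent)
		g->allocated -= amt;
	return 0;
}
```

Billing the full chain first keeps the failure path a plain second walk, which is the trade-off the comment in the hunk describes.
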
> @@ -1753,14 +1826,24 @@ EXPORT_SYMBOL(__sk_mem_schedule);
>  void __sk_mem_reclaim(struct sock *sk)
>  {
>        struct proto *prot = sk->sk_prot;
> +       struct kmem_cgroup *sg = sk->sk_cgrp;
> +       int *memory_pressure = sk_memory_pressure(sk);
>
>        atomic_long_sub(sk->sk_forward_alloc >> SK_MEM_QUANTUM_SHIFT,
> -                  prot->memory_allocated);
> +                  sk_memory_allocated(sk));
> +
> +#ifdef CONFIG_CGROUP_KMEM
> +       for (sg = sk->sk_cgrp->parent; sg != NULL; sg = sg->parent) {
> +               atomic_long_sub(sk->sk_forward_alloc >> SK_MEM_QUANTUM_SHIFT,
> +                                               sg_memory_allocated(prot, sg));
> +       }
> +#endif
> +
>        sk->sk_forward_alloc &= SK_MEM_QUANTUM - 1;
>
> -       if (prot->memory_pressure && *prot->memory_pressure &&
> -           (atomic_long_read(prot->memory_allocated) < prot->sysctl_mem[0]))
> -               *prot->memory_pressure = 0;
> +       if (memory_pressure && *memory_pressure &&
> +           (atomic_long_read(sk_memory_allocated(sk)) < sk_prot_mem(sk)[0]))
> +               *memory_pressure = 0;
>  }
>  EXPORT_SYMBOL(__sk_mem_reclaim);
>
> @@ -2252,9 +2335,6 @@ void sk_common_release(struct sock *sk)
>  }
>  EXPORT_SYMBOL(sk_common_release);
>
> -static DEFINE_RWLOCK(proto_list_lock);
> -static LIST_HEAD(proto_list);
> -
>  #ifdef CONFIG_PROC_FS
>  #define PROTO_INUSE_NR 64      /* should be enough for the first time */
>  struct prot_inuse {
> @@ -2479,13 +2559,17 @@ static char proto_method_implemented(const void *method)
>
>  static void proto_seq_printf(struct seq_file *seq, struct proto *proto)
>  {
> +       struct kmem_cgroup *sg = kcg_from_task(current);
> +
>        seq_printf(seq, "%-9s %4u %6d  %6ld   %-3s %6u   %-3s  %-10s "
>                        "%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
>                   proto->name,
>                   proto->obj_size,
>                   sock_prot_inuse_get(seq_file_net(seq), proto),
> -                  proto->memory_allocated != NULL ? atomic_long_read(proto->memory_allocated) : -1L,
> -                  proto->memory_pressure != NULL ? *proto->memory_pressure ? "yes" : "no" : "NI",
> +                  proto->memory_allocated != NULL ?
> +                       atomic_long_read(sg_memory_allocated(proto, sg)) : -1L,
> +                  proto->memory_pressure != NULL ?
> +                       *sg_memory_pressure(proto, sg) ? "yes" : "no" : "NI",
>                   proto->max_header,
>                   proto->slab == NULL ? "no" : "yes",
>                   module_name(proto->owner),
> diff --git a/net/decnet/af_decnet.c b/net/decnet/af_decnet.c
> index 19acd00..463b299 100644
> --- a/net/decnet/af_decnet.c
> +++ b/net/decnet/af_decnet.c
> @@ -458,13 +458,28 @@ static void dn_enter_memory_pressure(struct sock *sk)
>        }
>  }
>
> +static atomic_long_t *memory_allocated_dn(struct kmem_cgroup *sg)
> +{
> +       return &decnet_memory_allocated;
> +}
> +
> +static int *memory_pressure_dn(struct kmem_cgroup *sg)
> +{
> +       return &dn_memory_pressure;
> +}
> +
> +static long *dn_sysctl_mem(struct kmem_cgroup *sg)
> +{
> +       return sysctl_decnet_mem;
> +}
> +
>  static struct proto dn_proto = {
>        .name                   = "NSP",
>        .owner                  = THIS_MODULE,
>        .enter_memory_pressure  = dn_enter_memory_pressure,
> -       .memory_pressure        = &dn_memory_pressure,
> -       .memory_allocated       = &decnet_memory_allocated,
> -       .sysctl_mem             = sysctl_decnet_mem,
> +       .memory_pressure        = memory_pressure_dn,
> +       .memory_allocated       = memory_allocated_dn,
> +       .prot_mem               = dn_sysctl_mem,
>        .sysctl_wmem            = sysctl_decnet_wmem,
>        .sysctl_rmem            = sysctl_decnet_rmem,
>        .max_header             = DN_MAX_NSP_DATA_HEADER + 64,
> diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
> index b14ec7d..9c80acf 100644
> --- a/net/ipv4/proc.c
> +++ b/net/ipv4/proc.c
> @@ -52,20 +52,22 @@ static int sockstat_seq_show(struct seq_file *seq, void *v)
>  {
>        struct net *net = seq->private;
>        int orphans, sockets;
> +       struct kmem_cgroup *sg = kcg_from_task(current);
> +       struct percpu_counter *allocated = sg_sockets_allocated(&tcp_prot, sg);
>
>        local_bh_disable();
>        orphans = percpu_counter_sum_positive(&tcp_orphan_count);
> -       sockets = percpu_counter_sum_positive(&tcp_sockets_allocated);
> +       sockets = percpu_counter_sum_positive(allocated);
>        local_bh_enable();
>
>        socket_seq_show(seq);
>        seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %ld\n",
>                   sock_prot_inuse_get(net, &tcp_prot), orphans,
>                   tcp_death_row.tw_count, sockets,
> -                  atomic_long_read(&tcp_memory_allocated));
> +                  atomic_long_read(sg_memory_allocated((&tcp_prot), sg)));
>        seq_printf(seq, "UDP: inuse %d mem %ld\n",
>                   sock_prot_inuse_get(net, &udp_prot),
> -                  atomic_long_read(&udp_memory_allocated));
> +                  atomic_long_read(sg_memory_allocated((&udp_prot), sg)));
>        seq_printf(seq, "UDPLITE: inuse %d\n",
>                   sock_prot_inuse_get(net, &udplite_prot));
>        seq_printf(seq, "RAW: inuse %d\n",
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 69fd720..5e89480 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -14,6 +14,8 @@
>  #include <linux/init.h>
>  #include <linux/slab.h>
>  #include <linux/nsproxy.h>
> +#include <linux/kmem_cgroup.h>
> +#include <linux/swap.h>
>  #include <net/snmp.h>
>  #include <net/icmp.h>
>  #include <net/ip.h>
> @@ -174,6 +176,43 @@ static int proc_allowed_congestion_control(ctl_table *ctl,
>        return ret;
>  }
>
> +static int ipv4_tcp_mem(ctl_table *ctl, int write,
> +                          void __user *buffer, size_t *lenp,
> +                          loff_t *ppos)
> +{
> +       int ret;
> +       unsigned long vec[3];
> +       struct kmem_cgroup *kmem = kcg_from_task(current);
> +       struct net *net = current->nsproxy->net_ns;
> +       int i;
> +
> +       ctl_table tmp = {
> +               .data = &vec,
> +               .maxlen = sizeof(vec),
> +               .mode = ctl->mode,
> +       };
> +
> +       if (!write) {
> +               ctl->data = &net->ipv4.sysctl_tcp_mem;
> +               return proc_doulongvec_minmax(ctl, write, buffer, lenp, ppos);
> +       }
> +
> +       ret = proc_doulongvec_minmax(&tmp, write, buffer, lenp, ppos);
> +       if (ret)
> +               return ret;
> +
> +       for (i = 0; i < 3; i++)
> +               if (vec[i] > kmem->tcp_max_memory)
> +                       return -EINVAL;
> +
> +       for (i = 0; i < 3; i++) {
> +               net->ipv4.sysctl_tcp_mem[i] = vec[i];
> +               kmem->tcp_prot_mem[i] = net->ipv4.sysctl_tcp_mem[i];
> +       }
> +
> +       return 0;
> +}
> +
>  static struct ctl_table ipv4_table[] = {
>        {
>                .procname       = "tcp_timestamps",
> @@ -433,13 +472,6 @@ static struct ctl_table ipv4_table[] = {
>                .proc_handler   = proc_dointvec
>        },
>        {
> -               .procname       = "tcp_mem",
> -               .data           = &sysctl_tcp_mem,
> -               .maxlen         = sizeof(sysctl_tcp_mem),
> -               .mode           = 0644,
> -               .proc_handler   = proc_doulongvec_minmax
> -       },
> -       {
>                .procname       = "tcp_wmem",
>                .data           = &sysctl_tcp_wmem,
>                .maxlen         = sizeof(sysctl_tcp_wmem),
> @@ -721,6 +753,12 @@ static struct ctl_table ipv4_net_table[] = {
>                .mode           = 0644,
>                .proc_handler   = ipv4_ping_group_range,
>        },
> +       {
> +               .procname       = "tcp_mem",
> +               .maxlen         = sizeof(init_net.ipv4.sysctl_tcp_mem),
> +               .mode           = 0644,
> +               .proc_handler   = ipv4_tcp_mem,
> +       },
>        { }
>  };
>
> @@ -734,6 +772,7 @@ EXPORT_SYMBOL_GPL(net_ipv4_ctl_path);
>  static __net_init int ipv4_sysctl_init_net(struct net *net)
>  {
>        struct ctl_table *table;
> +       unsigned long limit;
>
>        table = ipv4_net_table;
>        if (!net_eq(net, &init_net)) {
> @@ -769,6 +808,12 @@ static __net_init int ipv4_sysctl_init_net(struct net *net)
>
>        net->ipv4.sysctl_rt_cache_rebuild_count = 4;
>
> +       limit = nr_free_buffer_pages() / 8;
> +       limit = max(limit, 128UL);
> +       net->ipv4.sysctl_tcp_mem[0] = limit / 4 * 3;
> +       net->ipv4.sysctl_tcp_mem[1] = limit;
> +       net->ipv4.sysctl_tcp_mem[2] = net->ipv4.sysctl_tcp_mem[0] * 2;
> +
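
For reference, the threshold arithmetic in this hunk works out as in the sketch below. It assumes free_pages stands in for nr_free_buffer_pages(); default_tcp_mem() is a hypothetical name and not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fill mem[0..2] (low, pressure, hard limit; in pages) using the same
 * arithmetic as the hunk above. free_pages is a stand-in for the value
 * returned by nr_free_buffer_pages().
 */
void default_tcp_mem(unsigned long free_pages, long mem[3])
{
	unsigned long limit = free_pages / 8;

	assert(mem != NULL);

	/* Never go below 128 pages for the pressure threshold. */
	if (limit < 128UL)
		limit = 128UL;

	mem[0] = (long)(limit / 4 * 3);  /* low: 3/4 of the pressure mark */
	mem[1] = (long)limit;            /* pressure */
	mem[2] = mem[0] * 2;             /* hard limit: 1.5x pressure */
}
```

With 8000 free buffer pages this gives 750, 1000 and 1500; with very little free memory the floor of 128 gives 96, 128 and 192.
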
>        net->ipv4.ipv4_hdr = register_net_sysctl_table(net,
>                        net_ipv4_ctl_path, table);
>        if (net->ipv4.ipv4_hdr == NULL)
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 46febca..e1918fa 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -266,6 +266,7 @@
>  #include <linux/crypto.h>
>  #include <linux/time.h>
>  #include <linux/slab.h>
> +#include <linux/nsproxy.h>
>
>  #include <net/icmp.h>
>  #include <net/tcp.h>
> @@ -282,23 +283,12 @@ int sysctl_tcp_fin_timeout __read_mostly = TCP_FIN_TIMEOUT;
>  struct percpu_counter tcp_orphan_count;
>  EXPORT_SYMBOL_GPL(tcp_orphan_count);
>
> -long sysctl_tcp_mem[3] __read_mostly;
>  int sysctl_tcp_wmem[3] __read_mostly;
>  int sysctl_tcp_rmem[3] __read_mostly;
>
> -EXPORT_SYMBOL(sysctl_tcp_mem);
>  EXPORT_SYMBOL(sysctl_tcp_rmem);
>  EXPORT_SYMBOL(sysctl_tcp_wmem);
>
> -atomic_long_t tcp_memory_allocated;    /* Current allocated memory. */
> -EXPORT_SYMBOL(tcp_memory_allocated);
> -
> -/*
> - * Current number of TCP sockets.
> - */
> -struct percpu_counter tcp_sockets_allocated;
> -EXPORT_SYMBOL(tcp_sockets_allocated);
> -
>  /*
>  * TCP splice context
>  */
> @@ -308,23 +298,157 @@ struct tcp_splice_state {
>        unsigned int flags;
>  };
>
> +#ifdef CONFIG_CGROUP_KMEM
>  /*
>  * Pressure flag: try to collapse.
>  * Technical note: it is used by multiple contexts non atomically.
>  * All the __sk_mem_schedule() is of this nature: accounting
>  * is strict, actions are advisory and have some latency.
>  */
> -int tcp_memory_pressure __read_mostly;
> -EXPORT_SYMBOL(tcp_memory_pressure);
> -
>  void tcp_enter_memory_pressure(struct sock *sk)
>  {
> +       struct kmem_cgroup *sg = sk->sk_cgrp;
> +       if (!sg->tcp_memory_pressure) {
> +               NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPMEMORYPRESSURES);
> +               sg->tcp_memory_pressure = 1;
> +       }
> +}
> +
> +long *tcp_sysctl_mem(struct kmem_cgroup *sg)
> +{
> +       return sg->tcp_prot_mem;
> +}
> +
> +atomic_long_t *memory_allocated_tcp(struct kmem_cgroup *sg)
> +{
> +       return &(sg->tcp_memory_allocated);
> +}
> +
> +static int tcp_write_maxmem(struct cgroup *cgrp, struct cftype *cft, u64 val)
> +{
> +       struct kmem_cgroup *sg = kcg_from_cgroup(cgrp);
> +       struct net *net = current->nsproxy->net_ns;
> +       int i;
> +
> +       if (!cgroup_lock_live_group(cgrp))
> +               return -ENODEV;
> +
> +       /*
> +        * We can't allow more memory than our parents. Since this
> +        * will be tested for all calls, by induction, there is no need
> +        * to test any parent other than our own
> +        */
> +       if (sg->parent && (val > sg->parent->tcp_max_memory))
> +               val = sg->parent->tcp_max_memory;
> +
> +       sg->tcp_max_memory = val;
> +
> +       for (i = 0; i < 3; i++)
> +               sg->tcp_prot_mem[i]  = min_t(long, val,
> +                                            net->ipv4.sysctl_tcp_mem[i]);
> +
> +       cgroup_unlock();
> +
> +       return 0;
> +}
> +
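
The clamping rule in tcp_write_maxmem() can be illustrated outside the kernel. The sketch below is an assumption-laden model: struct tcp_group and set_max_memory() are hypothetical names, and there is no locking, so it shows only the arithmetic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the TCP fields of struct kmem_cgroup. */
struct tcp_group {
	struct tcp_group *parent;
	unsigned long max_memory;  /* plays the role of tcp_max_memory */
	long prot_mem[3];          /* plays the role of tcp_prot_mem */
};

/*
 * Apply a new maximum to grp: never exceed the parent's maximum, then cap
 * each of the three thresholds at the resulting value.
 */
void set_max_memory(struct tcp_group *grp, unsigned long val,
		    const long sysctl_mem[3])
{
	int i;

	assert(grp != NULL);

	/* Checking only the direct parent is enough if every write does so. */
	if (grp->parent && val > grp->parent->max_memory)
		val = grp->parent->max_memory;

	grp->max_memory = val;

	for (i = 0; i < 3; i++)
		grp->prot_mem[i] = (long)val < sysctl_mem[i] ?
				   (long)val : sysctl_mem[i];
}
```

Because each write is clamped against the direct parent, no group can end up above any ancestor, which is the induction argument made in the comment.
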
> +static u64 tcp_read_maxmem(struct cgroup *cgrp, struct cftype *cft)
> +{
> +       struct kmem_cgroup *sg = kcg_from_cgroup(cgrp);
> +       u64 ret;
> +
> +       if (!cgroup_lock_live_group(cgrp))
> +               return -ENODEV;
> +       ret = sg->tcp_max_memory;
> +
> +       cgroup_unlock();
> +       return ret;
> +}
> +
> +static struct cftype tcp_files[] = {
> +       {
> +               .name = "tcp_maxmem",
> +               .write_u64 = tcp_write_maxmem,
> +               .read_u64 = tcp_read_maxmem,
> +       },
> +};
> +
> +int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss)
> +{
> +       struct kmem_cgroup *sg = kcg_from_cgroup(cgrp);
> +       unsigned long limit;
> +       struct net *net = current->nsproxy->net_ns;
> +
> +       sg->tcp_memory_pressure = 0;
> +       atomic_long_set(&sg->tcp_memory_allocated, 0);
> +       percpu_counter_init(&sg->tcp_sockets_allocated, 0);
> +
> +       limit = nr_free_buffer_pages() / 8;
> +       limit = max(limit, 128UL);
> +
> +       if (sg->parent)
> +               sg->tcp_max_memory = sg->parent->tcp_max_memory;
> +       else
> +               sg->tcp_max_memory = limit * 2;
> +
> +       sg->tcp_prot_mem[0] = net->ipv4.sysctl_tcp_mem[0];
> +       sg->tcp_prot_mem[1] = net->ipv4.sysctl_tcp_mem[1];
> +       sg->tcp_prot_mem[2] = net->ipv4.sysctl_tcp_mem[2];
> +
> +       return cgroup_add_files(cgrp, ss, tcp_files, ARRAY_SIZE(tcp_files));
> +}
> +EXPORT_SYMBOL(tcp_init_cgroup);
> +
> +int *memory_pressure_tcp(struct kmem_cgroup *sg)
> +{
> +       return &sg->tcp_memory_pressure;
> +}
> +
> +struct percpu_counter *sockets_allocated_tcp(struct kmem_cgroup *sg)
> +{
> +       return &sg->tcp_sockets_allocated;
> +}
> +#else
> +
> +/* Current number of TCP sockets. */
> +struct percpu_counter tcp_sockets_allocated;
> +atomic_long_t tcp_memory_allocated;    /* Current allocated memory. */
> +int tcp_memory_pressure;
> +
> +int *memory_pressure_tcp(struct kmem_cgroup *sg)
> +{
> +       return &tcp_memory_pressure;
> +}
> +
> +struct percpu_counter *sockets_allocated_tcp(struct kmem_cgroup *sg)
> +{
> +       return &tcp_sockets_allocated;
> +}
> +
> +void tcp_enter_memory_pressure(struct sock *sk)
> +{
>        if (!tcp_memory_pressure) {
>                NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPMEMORYPRESSURES);
>                tcp_memory_pressure = 1;
>        }
>  }
> +
> +long *tcp_sysctl_mem(struct kmem_cgroup *sg)
> +{
> +       return init_net.ipv4.sysctl_tcp_mem;
> +}
> +
> +atomic_long_t *memory_allocated_tcp(struct kmem_cgroup *sg)
> +{
> +       return &tcp_memory_allocated;
> +}
> +#endif /* CONFIG_CGROUP_KMEM */
> +
> +EXPORT_SYMBOL(memory_pressure_tcp);
> +EXPORT_SYMBOL(sockets_allocated_tcp);
>  EXPORT_SYMBOL(tcp_enter_memory_pressure);
> +EXPORT_SYMBOL(tcp_sysctl_mem);
> +EXPORT_SYMBOL(memory_allocated_tcp);
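
The accessor indirection used here (struct proto fields become functions that take the cgroup and return a pointer) can be shown in a few lines. This is a simplified sketch, not the patch's code: struct acct_group, struct proto_ops and bill() are hypothetical names, and plain longs replace the atomic counters.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct kmem_cgroup. */
struct acct_group {
	long tcp_allocated;        /* per-group counter */
};

/* Hypothetical, simplified stand-in for struct proto. */
struct proto_ops {
	/* Returns the counter to bill for this protocol in this group. */
	long *(*memory_allocated)(struct acct_group *grp);
};

long global_allocated;  /* for protocols without per-group accounting */

/* Per-group accessor: each group has its own counter. */
long *memory_allocated_per_group(struct acct_group *grp)
{
	assert(grp != NULL);
	return &grp->tcp_allocated;
}

/* Global accessor: the group argument is ignored. */
long *memory_allocated_global(struct acct_group *grp)
{
	(void)grp;
	return &global_allocated;
}

/* Callers bill through the accessor and never name a counter directly. */
void bill(const struct proto_ops *ops, struct acct_group *grp, long amt)
{
	*ops->memory_allocated(grp) += amt;
}
```

The same call site then serves both the per-cgroup case and the global fallback, which is how the !CONFIG_CGROUP_KMEM branch above keeps its single set of counters.
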
>
>  /* Convert seconds to retransmits based on initial and max timeout */
>  static u8 secs_to_retrans(int seconds, int timeout, int rto_max)
> @@ -3226,7 +3350,9 @@ void __init tcp_init(void)
>
>        BUILD_BUG_ON(sizeof(struct tcp_skb_cb) > sizeof(skb->cb));
>
> +#ifndef CONFIG_CGROUP_KMEM
>        percpu_counter_init(&tcp_sockets_allocated, 0);
> +#endif
>        percpu_counter_init(&tcp_orphan_count, 0);
>        tcp_hashinfo.bind_bucket_cachep =
>                kmem_cache_create("tcp_bind_bucket",
> @@ -3277,14 +3403,10 @@ void __init tcp_init(void)
>        sysctl_tcp_max_orphans = cnt / 2;
>        sysctl_max_syn_backlog = max(128, cnt / 256);
>
> -       limit = nr_free_buffer_pages() / 8;
> -       limit = max(limit, 128UL);
> -       sysctl_tcp_mem[0] = limit / 4 * 3;
> -       sysctl_tcp_mem[1] = limit;
> -       sysctl_tcp_mem[2] = sysctl_tcp_mem[0] * 2;
> -
>        /* Set per-socket limits to no more than 1/128 the pressure threshold */
> -       limit = ((unsigned long)sysctl_tcp_mem[1]) << (PAGE_SHIFT - 7);
> +       limit = (unsigned long)init_net.ipv4.sysctl_tcp_mem[1];
> +       limit <<= (PAGE_SHIFT - 7);
> +
>        max_share = min(4UL*1024*1024, limit);
>
>        sysctl_tcp_wmem[0] = SK_MEM_QUANTUM;
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index ea0d218..c44e830 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -316,7 +316,7 @@ static void tcp_grow_window(struct sock *sk, struct sk_buff *skb)
>        /* Check #1 */
>        if (tp->rcv_ssthresh < tp->window_clamp &&
>            (int)tp->rcv_ssthresh < tcp_space(sk) &&
> -           !tcp_memory_pressure) {
> +           !sk_memory_pressure(sk)) {
>                int incr;
>
>                /* Check #2. Increase window, if skb with such overhead
> @@ -393,15 +393,16 @@ static void tcp_clamp_window(struct sock *sk)
>  {
>        struct tcp_sock *tp = tcp_sk(sk);
>        struct inet_connection_sock *icsk = inet_csk(sk);
> +       struct proto *prot = sk->sk_prot;
>
>        icsk->icsk_ack.quick = 0;
>
> -       if (sk->sk_rcvbuf < sysctl_tcp_rmem[2] &&
> +       if (sk->sk_rcvbuf < prot->sysctl_rmem[2] &&
>            !(sk->sk_userlocks & SOCK_RCVBUF_LOCK) &&
> -           !tcp_memory_pressure &&
> -           atomic_long_read(&tcp_memory_allocated) < sysctl_tcp_mem[0]) {
> +           !sk_memory_pressure(sk) &&
> +           atomic_long_read(sk_memory_allocated(sk)) < sk_prot_mem(sk)[0]) {
>                sk->sk_rcvbuf = min(atomic_read(&sk->sk_rmem_alloc),
> -                                   sysctl_tcp_rmem[2]);
> +                                   prot->sysctl_rmem[2]);
>        }
>        if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
>                tp->rcv_ssthresh = min(tp->window_clamp, 2U * tp->advmss);
> @@ -4806,7 +4807,7 @@ static int tcp_prune_queue(struct sock *sk)
>
>        if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
>                tcp_clamp_window(sk);
> -       else if (tcp_memory_pressure)
> +       else if (sk_memory_pressure(sk))
>                tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss);
>
>        tcp_collapse_ofo_queue(sk);
> @@ -4872,11 +4873,11 @@ static int tcp_should_expand_sndbuf(struct sock *sk)
>                return 0;
>
>        /* If we are under global TCP memory pressure, do not expand.  */
> -       if (tcp_memory_pressure)
> +       if (sk_memory_pressure(sk))
>                return 0;
>
>        /* If we are under soft global TCP memory pressure, do not expand.  */
> -       if (atomic_long_read(&tcp_memory_allocated) >= sysctl_tcp_mem[0])
> +       if (atomic_long_read(sk_memory_allocated(sk)) >= sk_prot_mem(sk)[0])
>                return 0;
>
>        /* If we filled the congestion window, do not expand.  */
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 1c12b8e..af6c095 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1848,6 +1848,7 @@ static int tcp_v4_init_sock(struct sock *sk)
>  {
>        struct inet_connection_sock *icsk = inet_csk(sk);
>        struct tcp_sock *tp = tcp_sk(sk);
> +       struct kmem_cgroup *sg;
>
>        skb_queue_head_init(&tp->out_of_order_queue);
>        tcp_init_xmit_timers(sk);
> @@ -1901,7 +1902,13 @@ static int tcp_v4_init_sock(struct sock *sk)
>        sk->sk_rcvbuf = sysctl_tcp_rmem[1];
>
>        local_bh_disable();
> -       percpu_counter_inc(&tcp_sockets_allocated);
> +       percpu_counter_inc(sk_sockets_allocated(sk));
> +
> +#ifdef CONFIG_CGROUP_KMEM
> +       for (sg = sk->sk_cgrp->parent; sg; sg = sg->parent)
> +               percpu_counter_inc(sg_sockets_allocated(sk->sk_prot, sg));
> +#endif
> +
>        local_bh_enable();
>
>        return 0;
> @@ -1910,6 +1917,7 @@ static int tcp_v4_init_sock(struct sock *sk)
>  void tcp_v4_destroy_sock(struct sock *sk)
>  {
>        struct tcp_sock *tp = tcp_sk(sk);
> +       struct kmem_cgroup *sg;
>
>        tcp_clear_xmit_timers(sk);
>
> @@ -1957,7 +1965,11 @@ void tcp_v4_destroy_sock(struct sock *sk)
>                tp->cookie_values = NULL;
>        }
>
> -       percpu_counter_dec(&tcp_sockets_allocated);
> +       percpu_counter_dec(sk_sockets_allocated(sk));
> +#ifdef CONFIG_CGROUP_KMEM
> +       for (sg = sk->sk_cgrp->parent; sg; sg = sg->parent)
> +               percpu_counter_dec(sg_sockets_allocated(sk->sk_prot, sg));
> +#endif
>  }
>  EXPORT_SYMBOL(tcp_v4_destroy_sock);
>
> @@ -2598,11 +2610,14 @@ struct proto tcp_prot = {
>        .unhash                 = inet_unhash,
>        .get_port               = inet_csk_get_port,
>        .enter_memory_pressure  = tcp_enter_memory_pressure,
> -       .sockets_allocated      = &tcp_sockets_allocated,
> +       .memory_pressure        = memory_pressure_tcp,
> +       .sockets_allocated      = sockets_allocated_tcp,
>        .orphan_count           = &tcp_orphan_count,
> -       .memory_allocated       = &tcp_memory_allocated,
> -       .memory_pressure        = &tcp_memory_pressure,
> -       .sysctl_mem             = sysctl_tcp_mem,
> +       .memory_allocated       = memory_allocated_tcp,
> +#ifdef CONFIG_CGROUP_KMEM
> +       .init_cgroup            = tcp_init_cgroup,
> +#endif
> +       .prot_mem               = tcp_sysctl_mem,
>        .sysctl_wmem            = sysctl_tcp_wmem,
>        .sysctl_rmem            = sysctl_tcp_rmem,
>        .max_header             = MAX_TCP_HEADER,
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 882e0b0..06aeb31 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1912,7 +1912,7 @@ u32 __tcp_select_window(struct sock *sk)
>        if (free_space < (full_space >> 1)) {
>                icsk->icsk_ack.quick = 0;
>
> -               if (tcp_memory_pressure)
> +               if (sk_memory_pressure(sk))
>                        tp->rcv_ssthresh = min(tp->rcv_ssthresh,
>                                               4U * tp->advmss);
>
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index ecd44b0..2c67617 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -261,7 +261,7 @@ static void tcp_delack_timer(unsigned long data)
>        }
>
>  out:
> -       if (tcp_memory_pressure)
> +       if (sk_memory_pressure(sk))
>                sk_mem_reclaim(sk);
>  out_unlock:
>        bh_unlock_sock(sk);
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 1b5a193..6c08c65 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -120,9 +120,6 @@ EXPORT_SYMBOL(sysctl_udp_rmem_min);
>  int sysctl_udp_wmem_min __read_mostly;
>  EXPORT_SYMBOL(sysctl_udp_wmem_min);
>
> -atomic_long_t udp_memory_allocated;
> -EXPORT_SYMBOL(udp_memory_allocated);
> -
>  #define MAX_UDP_PORTS 65536
>  #define PORTS_PER_CHAIN (MAX_UDP_PORTS / UDP_HTABLE_SIZE_MIN)
>
> @@ -1918,6 +1915,19 @@ unsigned int udp_poll(struct file *file, struct socket *sock, poll_table *wait)
>  }
>  EXPORT_SYMBOL(udp_poll);
>
> +static atomic_long_t udp_memory_allocated;
> +atomic_long_t *memory_allocated_udp(struct kmem_cgroup *sg)
> +{
> +       return &udp_memory_allocated;
> +}
> +EXPORT_SYMBOL(memory_allocated_udp);
> +
> +long *udp_sysctl_mem(struct kmem_cgroup *sg)
> +{
> +       return sysctl_udp_mem;
> +}
> +EXPORT_SYMBOL(udp_sysctl_mem);
> +
>  struct proto udp_prot = {
>        .name              = "UDP",
>        .owner             = THIS_MODULE,
> @@ -1936,8 +1946,8 @@ struct proto udp_prot = {
>        .unhash            = udp_lib_unhash,
>        .rehash            = udp_v4_rehash,
>        .get_port          = udp_v4_get_port,
> -       .memory_allocated  = &udp_memory_allocated,
> -       .sysctl_mem        = sysctl_udp_mem,
> +       .memory_allocated  = memory_allocated_udp,
> +       .prot_mem          = udp_sysctl_mem,
>        .sysctl_wmem       = &sysctl_udp_wmem_min,
>        .sysctl_rmem       = &sysctl_udp_rmem_min,
>        .obj_size          = sizeof(struct udp_sock),
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index d1fb63f..0762e68 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -1959,6 +1959,7 @@ static int tcp_v6_init_sock(struct sock *sk)
>  {
>        struct inet_connection_sock *icsk = inet_csk(sk);
>        struct tcp_sock *tp = tcp_sk(sk);
> +       struct kmem_cgroup *sg;
>
>        skb_queue_head_init(&tp->out_of_order_queue);
>        tcp_init_xmit_timers(sk);
> @@ -2012,7 +2013,12 @@ static int tcp_v6_init_sock(struct sock *sk)
>        sk->sk_rcvbuf = sysctl_tcp_rmem[1];
>
>        local_bh_disable();
> -       percpu_counter_inc(&tcp_sockets_allocated);
> +       percpu_counter_inc(sk_sockets_allocated(sk));
> +#ifdef CONFIG_CGROUP_KMEM
> +       for (sg = sk->sk_cgrp->parent; sg; sg = sg->parent)
> +               percpu_counter_inc(sg_sockets_allocated(sk->sk_prot, sg));
> +#endif
> +
>        local_bh_enable();
>
>        return 0;
> @@ -2221,11 +2227,11 @@ struct proto tcpv6_prot = {
>        .unhash                 = inet_unhash,
>        .get_port               = inet_csk_get_port,
>        .enter_memory_pressure  = tcp_enter_memory_pressure,
> -       .sockets_allocated      = &tcp_sockets_allocated,
> -       .memory_allocated       = &tcp_memory_allocated,
> -       .memory_pressure        = &tcp_memory_pressure,
> +       .sockets_allocated      = sockets_allocated_tcp,
> +       .memory_allocated       = memory_allocated_tcp,
> +       .memory_pressure        = memory_pressure_tcp,
>        .orphan_count           = &tcp_orphan_count,
> -       .sysctl_mem             = sysctl_tcp_mem,
> +       .prot_mem               = tcp_sysctl_mem,
>        .sysctl_wmem            = sysctl_tcp_wmem,
>        .sysctl_rmem            = sysctl_tcp_rmem,
>        .max_header             = MAX_TCP_HEADER,
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index 29213b5..ef4b5b3 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -1465,8 +1465,8 @@ struct proto udpv6_prot = {
>        .unhash            = udp_lib_unhash,
>        .rehash            = udp_v6_rehash,
>        .get_port          = udp_v6_get_port,
> -       .memory_allocated  = &udp_memory_allocated,
> -       .sysctl_mem        = sysctl_udp_mem,
> +       .memory_allocated  = memory_allocated_udp,
> +       .prot_mem          = udp_sysctl_mem,
>        .sysctl_wmem       = &sysctl_udp_wmem_min,
>        .sysctl_rmem       = &sysctl_udp_rmem_min,
>        .obj_size          = sizeof(struct udp6_sock),
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index 836aa63..1b0300d 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -119,11 +119,30 @@ static int sctp_memory_pressure;
>  static atomic_long_t sctp_memory_allocated;
>  struct percpu_counter sctp_sockets_allocated;
>
> +static long *sctp_sysctl_mem(struct kmem_cgroup *sg)
> +{
> +       return sysctl_sctp_mem;
> +}
> +
>  static void sctp_enter_memory_pressure(struct sock *sk)
>  {
>        sctp_memory_pressure = 1;
>  }
>
> +static int *memory_pressure_sctp(struct kmem_cgroup *sg)
> +{
> +       return &sctp_memory_pressure;
> +}
> +
> +static atomic_long_t *memory_allocated_sctp(struct kmem_cgroup *sg)
> +{
> +       return &sctp_memory_allocated;
> +}
> +
> +static struct percpu_counter *sockets_allocated_sctp(struct kmem_cgroup *sg)
> +{
> +       return &sctp_sockets_allocated;
> +}
>
>  /* Get the sndbuf space available at the time on the association.  */
>  static inline int sctp_wspace(struct sctp_association *asoc)
> @@ -6831,13 +6850,13 @@ struct proto sctp_prot = {
>        .unhash      =  sctp_unhash,
>        .get_port    =  sctp_get_port,
>        .obj_size    =  sizeof(struct sctp_sock),
> -       .sysctl_mem  =  sysctl_sctp_mem,
> +       .prot_mem    =  sctp_sysctl_mem,
>        .sysctl_rmem =  sysctl_sctp_rmem,
>        .sysctl_wmem =  sysctl_sctp_wmem,
> -       .memory_pressure = &sctp_memory_pressure,
> +       .memory_pressure = memory_pressure_sctp,
>        .enter_memory_pressure = sctp_enter_memory_pressure,
> -       .memory_allocated = &sctp_memory_allocated,
> -       .sockets_allocated = &sctp_sockets_allocated,
> +       .memory_allocated = memory_allocated_sctp,
> +       .sockets_allocated = sockets_allocated_sctp,
>  };
>
>  #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
> @@ -6863,12 +6882,12 @@ struct proto sctpv6_prot = {
>        .unhash         = sctp_unhash,
>        .get_port       = sctp_get_port,
>        .obj_size       = sizeof(struct sctp6_sock),
> -       .sysctl_mem     = sysctl_sctp_mem,
> +       .prot_mem       = sctp_sysctl_mem,
>        .sysctl_rmem    = sysctl_sctp_rmem,
>        .sysctl_wmem    = sysctl_sctp_wmem,
> -       .memory_pressure = &sctp_memory_pressure,
> +       .memory_pressure = memory_pressure_sctp,
>        .enter_memory_pressure = sctp_enter_memory_pressure,
> -       .memory_allocated = &sctp_memory_allocated,
> -       .sockets_allocated = &sctp_sockets_allocated,
> +       .memory_allocated = memory_allocated_sctp,
> +       .sockets_allocated = sockets_allocated_sctp,
>  };
>  #endif /* defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) */
> --
> 1.7.6
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
>

^ permalink raw reply

* Re: [PATCH] per-cgroup tcp buffer limitation
From: Glauber Costa @ 2011-09-06 16:16 UTC (permalink / raw)
  To: Greg Thelen
  Cc: linux-kernel, linux-mm, containers, netdev, xemul,
	David S. Miller, Hiroyouki Kamezawa, Eric W. Biederman
In-Reply-To: <CAHH2K0aJxjinSu0Ek6jzsZ5dBmm5mEU-typuwYWYWEudF2F3Qg@mail.gmail.com>

On 09/06/2011 01:08 PM, Greg Thelen wrote:
> On Mon, Sep 5, 2011 at 7:35 PM, Glauber Costa <glommer@parallels.com> wrote:
>> This patch introduces per-cgroup tcp buffers limitation. This allows
>> sysadmins to specify a maximum amount of kernel memory that
>> tcp connections can use at any point in time. TCP is the main interest
>> in this work, but extending it to other protocols would be easy.

Hello Greg,

> With this approach we would be giving admins the ability to
> independently limit user memory with memcg and kernel memory with this
> new kmem cgroup.
>
> At least in some situations admins prefer to give a particular
> container X bytes without thinking about the kernel vs user split.
> Sometimes the admin would prefer the kernel to keep the total
> user+kernel memory below a certain threshold.  To achieve this with
> this approach would we need a user space agent to monitor both kernel
> and user usage for a container and grow/shrink memcg/kmem limits?
Yes, I believe so. And this applies beyond containers: the
information we expose in proc, sys, cgroups, etc. is always much more
fine-grained than a considerable part of the users want. Tools come to
fill this gap.
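
To make the "user space agent" idea concrete, here is a minimal sketch of the
policy such a tool might apply. Everything in it is hypothetical: the struct,
the function name, and the "usage plus half of the spare" rule are
illustrative assumptions, not part of the patch or of any existing tool. It
only shows how one total budget could be split into a user (memcg) limit and
a kernel (kmem) limit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical pair of limits an agent would write back to the cgroups. */
struct limits {
	uint64_t user_limit;	/* intended for the memcg limit */
	uint64_t kernel_limit;	/* intended for the kmem limit */
};

/*
 * Illustrative policy only: each side keeps what it currently uses and
 * the unused part of the budget is shared evenly. The two limits always
 * add up to exactly "total".
 */
static struct limits split_budget(uint64_t total, uint64_t user_usage,
				  uint64_t kernel_usage)
{
	struct limits l;
	uint64_t used = user_usage + kernel_usage;
	uint64_t spare;

	/* The sketch assumes the container is currently within budget. */
	assert(used <= total);

	spare = total - used;
	l.kernel_limit = kernel_usage + spare / 2;
	l.user_limit = total - l.kernel_limit;
	return l;
}
```

An agent would call this periodically with fresh usage readings and write the
two results to the respective cgroup files.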

>
> Do you foresee the kmem cgroup growing to include reclaimable slab,
> where freeing one type of memory allows for reclaim of the other?
Yes, absolutely.

>
>> It piggybacks on the memory control mechanism already present in
>> /proc/sys/net/ipv4/tcp_mem. There is a soft limit, and a hard limit,
>> that will suppress allocation when reached. For each cgroup, however,
>> the file kmem.tcp_maxmem will be used to cap those values.
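
A simplified model of the three tcp_mem thresholds described above, written
as plain user-space C. The enum and function names are assumptions made for
illustration. The real __sk_mem_schedule() performs additional per-socket
checks between the pressure threshold and the hard limit, which this sketch
leaves out.

```c
#include <assert.h>

/* Hypothetical verdicts; the kernel does not use this enum. */
enum mem_verdict { MEM_OK, MEM_PRESSURE, MEM_SUPPRESS };

/*
 * mem[0] = low threshold, mem[1] = pressure (soft) threshold,
 * mem[2] = hard limit, all in pages, as in tcp_mem.
 */
static enum mem_verdict classify(long allocated, const long mem[3])
{
	/* The sketch assumes the thresholds are ordered. */
	assert(mem[0] <= mem[1] && mem[1] <= mem[2]);

	if (allocated > mem[2])
		return MEM_SUPPRESS;	/* over the hard limit: refuse */
	if (allocated > mem[1])
		return MEM_PRESSURE;	/* over the soft limit: enter pressure */
	return MEM_OK;
}
```
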
>>
>> The usage I have in mind here is containers. Each container will
>> define its own values for soft and hard limits, but none of them
>> can exceed the value the box's sysadmin specified from the
>> outside.
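
The capping rule can be sketched in user-space C as follows. It mirrors the
write path of the ipv4_tcp_mem() handler quoted further down (reject the
whole write if any value exceeds the cap, otherwise copy all three), but the
function name and signature are assumptions for illustration only.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Apply requested limits only if none of them exceeds the cap set from
 * outside the container. On rejection "dst" is left untouched.
 */
static bool set_limits(long dst[3], const long req[3], long cap)
{
	int i;

	/* The sketch assumes a non-negative cap. */
	assert(cap >= 0);

	for (i = 0; i < 3; i++)
		if (req[i] > cap)
			return false;

	for (i = 0; i < 3; i++)
		dst[i] = req[i];

	return true;
}
```
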
>>
>> To test for any performance impacts of this patch, I used netperf's
>> TCP_RR benchmark on localhost, so we can have both recv and snd in action.
>>
>> Command line used was ./src/netperf -t TCP_RR -H localhost, and the
>> results:
>>
>> Without the patch
>> =================
>>
>> Socket Size   Request  Resp.   Elapsed  Trans.
>> Send   Recv   Size     Size    Time     Rate
>> bytes  Bytes  bytes    bytes   secs.    per sec
>>
>> 16384  87380  1        1       10.00    26996.35
>> 16384  87380
>>
>> With the patch
>> ===============
>>
>> Local /Remote
>> Socket Size   Request  Resp.   Elapsed  Trans.
>> Send   Recv   Size     Size    Time     Rate
>> bytes  Bytes  bytes    bytes   secs.    per sec
>>
>> 16384  87380  1        1       10.00    27291.86
>> 16384  87380
>>
>> The difference is within a one-percent range.
>>
>> Nesting cgroups doesn't seem to be a dominating factor either,
>> with nestings up to 10 levels not showing a significant performance
>> difference.
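
For readers following the nesting discussion, the "bill every level up to the
root, unbill everything on failure" approach can be modelled like this. It is
a user-space sketch under stated assumptions: struct group and charge() are
hypothetical names, the counters are plain longs rather than atomics, and
each level is checked against its own hard limit, which may differ from what
the patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a cgroup with a parent pointer. */
struct group {
	struct group *parent;
	long allocated;
	long hard_limit;
};

/*
 * Charge "amt" to g and to every ancestor. If any level ends up over
 * its hard limit, undo the charge on all levels and report failure.
 */
static bool charge(struct group *g, long amt)
{
	struct group *p;
	bool over = false;

	assert(amt >= 0);

	for (p = g; p != NULL; p = p->parent) {
		p->allocated += amt;
		over |= p->allocated > p->hard_limit;
	}

	if (!over)
		return true;

	for (p = g; p != NULL; p = p->parent)
		p->allocated -= amt;

	return false;
}
```

The cost of one charge grows linearly with the nesting depth, which is why
the depth is worth measuring.
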
>>
>> Signed-off-by: Glauber Costa <glommer@parallels.com>
>> CC: David S. Miller <davem@davemloft.net>
>> CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
>> CC: Eric W. Biederman <ebiederm@xmission.com>
>> ---
>>   crypto/af_alg.c               |    7 ++-
>>   include/linux/cgroup_subsys.h |    4 +
>>   include/net/netns/ipv4.h      |    1 +
>>   include/net/sock.h            |   66 +++++++++++++++-
>>   include/net/tcp.h             |   12 ++-
>>   include/net/udp.h             |    3 +-
>>   include/trace/events/sock.h   |   10 +-
>>   init/Kconfig                  |   11 +++
>>   mm/Makefile                   |    1 +
>>   net/core/sock.c               |  136 +++++++++++++++++++++++++++-------
>>   net/decnet/af_decnet.c        |   21 +++++-
>>   net/ipv4/proc.c               |    8 +-
>>   net/ipv4/sysctl_net_ipv4.c    |   59 +++++++++++++--
>>   net/ipv4/tcp.c                |  164 +++++++++++++++++++++++++++++++++++-----
>>   net/ipv4/tcp_input.c          |   17 ++--
>>   net/ipv4/tcp_ipv4.c           |   27 +++++--
>>   net/ipv4/tcp_output.c         |    2 +-
>>   net/ipv4/tcp_timer.c          |    2 +-
>>   net/ipv4/udp.c                |   20 ++++-
>>   net/ipv6/tcp_ipv6.c           |   16 +++-
>>   net/ipv6/udp.c                |    4 +-
>>   net/sctp/socket.c             |   35 +++++++--
>>   22 files changed, 514 insertions(+), 112 deletions(-)
>>
>> diff --git a/crypto/af_alg.c b/crypto/af_alg.c
>> index ac33d5f..df168d8 100644
>> --- a/crypto/af_alg.c
>> +++ b/crypto/af_alg.c
>> @@ -29,10 +29,15 @@ struct alg_type_list {
>>
>>   static atomic_long_t alg_memory_allocated;
>>
>> +static atomic_long_t *memory_allocated_alg(struct kmem_cgroup *sg)
>> +{
>> +       return &alg_memory_allocated;
>> +}
>> +
>>   static struct proto alg_proto = {
>>         .name                   = "ALG",
>>         .owner                  = THIS_MODULE,
>> -       .memory_allocated       = &alg_memory_allocated,
>> +       .memory_allocated       = memory_allocated_alg,
>>         .obj_size               = sizeof(struct alg_sock),
>>   };
>>
>> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
>> index ac663c1..363b8e8 100644
>> --- a/include/linux/cgroup_subsys.h
>> +++ b/include/linux/cgroup_subsys.h
>> @@ -35,6 +35,10 @@ SUBSYS(cpuacct)
>>   SUBSYS(mem_cgroup)
>>   #endif
>>
>> +#ifdef CONFIG_CGROUP_KMEM
>> +SUBSYS(kmem)
>> +#endif
>> +
>>   /* */
>>
>>   #ifdef CONFIG_CGROUP_DEVICE
>> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
>> index d786b4f..bbd023a 100644
>> --- a/include/net/netns/ipv4.h
>> +++ b/include/net/netns/ipv4.h
>> @@ -55,6 +55,7 @@ struct netns_ipv4 {
>>         int current_rt_cache_rebuild_count;
>>
>>         unsigned int sysctl_ping_group_range[2];
>> +       long sysctl_tcp_mem[3];
>>
>>         atomic_t rt_genid;
>>         atomic_t dev_addr_genid;
>> diff --git a/include/net/sock.h b/include/net/sock.h
>> index 8e4062f..e085148 100644
>> --- a/include/net/sock.h
>> +++ b/include/net/sock.h
>> @@ -62,7 +62,9 @@
>>   #include <linux/atomic.h>
>>   #include <net/dst.h>
>>   #include <net/checksum.h>
>> +#include <linux/kmem_cgroup.h>
>>
>> +int sockets_populate(struct cgroup_subsys *ss, struct cgroup *cgrp);
>>   /*
>>   * This structure really needs to be cleaned up.
>>   * Most of it is for TCP, and not used by any of
>> @@ -339,6 +341,7 @@ struct sock {
>>   #endif
>>         __u32                   sk_mark;
>>         u32                     sk_classid;
>> +       struct kmem_cgroup      *sk_cgrp;
>>         void                    (*sk_state_change)(struct sock *sk);
>>         void                    (*sk_data_ready)(struct sock *sk, int bytes);
>>         void                    (*sk_write_space)(struct sock *sk);
>> @@ -786,16 +789,21 @@ struct proto {
>>
>>         /* Memory pressure */
>>         void                    (*enter_memory_pressure)(struct sock *sk);
>> -       atomic_long_t           *memory_allocated;      /* Current allocated memory. */
>> -       struct percpu_counter   *sockets_allocated;     /* Current number of sockets. */
>> +       /* Current allocated memory. */
>> +       atomic_long_t           *(*memory_allocated)(struct kmem_cgroup *sg);
>> +       /* Current number of sockets. */
>> +       struct percpu_counter   *(*sockets_allocated)(struct kmem_cgroup *sg);
>> +
>> +       int                     (*init_cgroup)(struct cgroup *cgrp,
>> +                                              struct cgroup_subsys *ss);
>>         /*
>>          * Pressure flag: try to collapse.
>>          * Technical note: it is used by multiple contexts non atomically.
>>          * All the __sk_mem_schedule() is of this nature: accounting
>>          * is strict, actions are advisory and have some latency.
>>          */
>> -       int                     *memory_pressure;
>> -       long                    *sysctl_mem;
>> +       int                     *(*memory_pressure)(struct kmem_cgroup *sg);
>> +       long                    *(*prot_mem)(struct kmem_cgroup *sg);
>>         int                     *sysctl_wmem;
>>         int                     *sysctl_rmem;
>>         int                     max_header;
>> @@ -826,6 +834,56 @@ struct proto {
>>   #endif
>>   };
>>
>> +#define sk_memory_pressure(sk)                                         \
>> +({                                                                     \
>> +       int *__ret = NULL;                                              \
>> +       if ((sk)->sk_prot->memory_pressure)                             \
>> +               __ret = (sk)->sk_prot->memory_pressure(sk->sk_cgrp);    \
>> +       __ret;                                                          \
>> +})
>> +
>> +#define sk_sockets_allocated(sk)                               \
>> +({                                                             \
>> +       struct percpu_counter *__p;                             \
>> +       __p = (sk)->sk_prot->sockets_allocated(sk->sk_cgrp);    \
>> +       __p;                                                    \
>> +})
>> +
>> +#define sk_memory_allocated(sk)                                        \
>> +({                                                             \
>> +       atomic_long_t *__mem;                                   \
>> +       __mem = (sk)->sk_prot->memory_allocated(sk->sk_cgrp);   \
>> +       __mem;                                                  \
>> +})
>> +
>> +#define sk_prot_mem(sk)                                                \
>> +({                                                             \
>> +       long *__mem = (sk)->sk_prot->prot_mem(sk->sk_cgrp);     \
>> +       __mem;                                                  \
>> +})
>> +
>> +#define sg_memory_pressure(prot, sg)                           \
>> +({                                                             \
>> +       int *__ret = NULL;                                      \
>> +       if (prot->memory_pressure)                              \
>> +               __ret = (prot)->memory_pressure(sg);            \
>> +       __ret;                                                  \
>> +})
>> +
>> +#define sg_memory_allocated(prot, sg)                          \
>> +({                                                             \
>> +       atomic_long_t *__mem;                                   \
>> +       __mem = (prot)->memory_allocated(sg);                   \
>> +       __mem;                                                  \
>> +})
>> +
>> +#define sg_sockets_allocated(prot, sg)                         \
>> +({                                                             \
>> +       struct percpu_counter *__p;                             \
>> +       __p = (prot)->sockets_allocated(sg);                    \
>> +       __p;                                                    \
>> +})
>> +
>>   extern int proto_register(struct proto *prot, int alloc_slab);
>>   extern void proto_unregister(struct proto *prot);
>>
>> diff --git a/include/net/tcp.h b/include/net/tcp.h
>> index 149a415..8e1ec4a 100644
>> --- a/include/net/tcp.h
>> +++ b/include/net/tcp.h
>> @@ -230,7 +230,6 @@ extern int sysctl_tcp_fack;
>>   extern int sysctl_tcp_reordering;
>>   extern int sysctl_tcp_ecn;
>>   extern int sysctl_tcp_dsack;
>> -extern long sysctl_tcp_mem[3];
>>   extern int sysctl_tcp_wmem[3];
>>   extern int sysctl_tcp_rmem[3];
>>   extern int sysctl_tcp_app_win;
>> @@ -253,9 +252,12 @@ extern int sysctl_tcp_cookie_size;
>>   extern int sysctl_tcp_thin_linear_timeouts;
>>   extern int sysctl_tcp_thin_dupack;
>>
>> -extern atomic_long_t tcp_memory_allocated;
>> -extern struct percpu_counter tcp_sockets_allocated;
>> -extern int tcp_memory_pressure;
>> +struct kmem_cgroup;
>> +extern long *tcp_sysctl_mem(struct kmem_cgroup *sg);
>> +struct percpu_counter *sockets_allocated_tcp(struct kmem_cgroup *sg);
>> +int *memory_pressure_tcp(struct kmem_cgroup *sg);
>> +int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss);
>> +atomic_long_t *memory_allocated_tcp(struct kmem_cgroup *sg);
>>
>>   /*
>>   * The next routines deal with comparing 32 bit unsigned ints
>> @@ -286,7 +288,7 @@ static inline bool tcp_too_many_orphans(struct sock *sk, int shift)
>>         }
>>
>>         if (sk->sk_wmem_queued > SOCK_MIN_SNDBUF &&
>> -           atomic_long_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])
>> +           atomic_long_read(sk_memory_allocated(sk)) > sk_prot_mem(sk)[2])
>>                 return true;
>>         return false;
>>   }
>> diff --git a/include/net/udp.h b/include/net/udp.h
>> index 67ea6fc..0e27388 100644
>> --- a/include/net/udp.h
>> +++ b/include/net/udp.h
>> @@ -105,7 +105,8 @@ static inline struct udp_hslot *udp_hashslot2(struct udp_table *table,
>>
>>   extern struct proto udp_prot;
>>
>> -extern atomic_long_t udp_memory_allocated;
>> +atomic_long_t *memory_allocated_udp(struct kmem_cgroup *sg);
>> +long *udp_sysctl_mem(struct kmem_cgroup *sg);
>>
>>   /* sysctl variables for udp */
>>   extern long sysctl_udp_mem[3];
>> diff --git a/include/trace/events/sock.h b/include/trace/events/sock.h
>> index 779abb9..12a6083 100644
>> --- a/include/trace/events/sock.h
>> +++ b/include/trace/events/sock.h
>> @@ -37,7 +37,7 @@ TRACE_EVENT(sock_exceed_buf_limit,
>>
>>         TP_STRUCT__entry(
>>                 __array(char, name, 32)
>> -               __field(long *, sysctl_mem)
>> +               __field(long *, prot_mem)
>>                 __field(long, allocated)
>>                 __field(int, sysctl_rmem)
>>                 __field(int, rmem_alloc)
>> @@ -45,7 +45,7 @@ TRACE_EVENT(sock_exceed_buf_limit,
>>
>>         TP_fast_assign(
>>                 strncpy(__entry->name, prot->name, 32);
>> -               __entry->sysctl_mem = prot->sysctl_mem;
>> +               __entry->prot_mem = sk->sk_prot->prot_mem(sk->sk_cgrp);
>>                 __entry->allocated = allocated;
>>                 __entry->sysctl_rmem = prot->sysctl_rmem[0];
>>                 __entry->rmem_alloc = atomic_read(&sk->sk_rmem_alloc);
>> @@ -54,9 +54,9 @@ TRACE_EVENT(sock_exceed_buf_limit,
>>         TP_printk("proto:%s sysctl_mem=%ld,%ld,%ld allocated=%ld "
>>                 "sysctl_rmem=%d rmem_alloc=%d",
>>                 __entry->name,
>> -               __entry->sysctl_mem[0],
>> -               __entry->sysctl_mem[1],
>> -               __entry->sysctl_mem[2],
>> +               __entry->prot_mem[0],
>> +               __entry->prot_mem[1],
>> +               __entry->prot_mem[2],
>>                 __entry->allocated,
>>                 __entry->sysctl_rmem,
>>                 __entry->rmem_alloc)
>> diff --git a/init/Kconfig b/init/Kconfig
>> index d627783..5955ac2 100644
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -690,6 +690,17 @@ config CGROUP_MEM_RES_CTLR_SWAP_ENABLED
>>           select this option (if, for some reason, they need to disable it
>>           then swapaccount=0 does the trick).
>>
>> +config CGROUP_KMEM
>> +       bool "Kernel Memory Resource Controller for Control Groups"
>> +       depends on CGROUPS
>> +       help
>> +         The Kernel Memory cgroup can limit the amount of memory used by
>> +         certain kernel objects in the system. Those are fundamentally
>> +         different from the entities handled by the Memory Controller,
>> +         which are page-based, and can be swapped. Users of the kmem
>> +         cgroup can use it to guarantee that no group of processes will
>> +         ever exhaust kernel resources alone.
>> +
>>   config CGROUP_PERF
>>         bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
>>         depends on PERF_EVENTS && CGROUPS
>> diff --git a/mm/Makefile b/mm/Makefile
>> index 836e416..1b1aa24 100644
>> --- a/mm/Makefile
>> +++ b/mm/Makefile
>> @@ -45,6 +45,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
>>   obj-$(CONFIG_QUICKLIST) += quicklist.o
>>   obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
>>   obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
>> +obj-$(CONFIG_CGROUP_KMEM) += kmem_cgroup.o
>>   obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
>>   obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
>>   obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
>> diff --git a/net/core/sock.c b/net/core/sock.c
>> index 3449df8..2d968ea 100644
>> --- a/net/core/sock.c
>> +++ b/net/core/sock.c
>> @@ -134,6 +134,24 @@
>>   #include <net/tcp.h>
>>   #endif
>>
>> +static DEFINE_RWLOCK(proto_list_lock);
>> +static LIST_HEAD(proto_list);
>> +
>> +int sockets_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
>> +{
>> +       struct proto *proto;
>> +       int ret = 0;
>> +
>> +       read_lock(&proto_list_lock);
>> +       list_for_each_entry(proto, &proto_list, node) {
>> +               if (proto->init_cgroup)
>> +                       ret |= proto->init_cgroup(cgrp, ss);
>> +       }
>> +       read_unlock(&proto_list_lock);
>> +
>> +       return ret;
>> +}
>> +
>>   /*
>>   * Each address family might have different locking rules, so we have
>>   * one slock key per address family:
>> @@ -1114,6 +1132,31 @@ void sock_update_classid(struct sock *sk)
>>   EXPORT_SYMBOL(sock_update_classid);
>>   #endif
>>
>> +void sock_update_kmem_cgrp(struct sock *sk)
>> +{
>> +#ifdef CONFIG_CGROUP_KMEM
>> +       sk->sk_cgrp = kcg_from_task(current);
>> +
>> +       /*
>> +        * We don't need to protect against anything task-related, because
>> +        * we are basically stuck with the sock pointer that won't change,
>> +        * even if the task that originated the socket changes cgroups.
>> +        *
>> +        * What we do have to guarantee, is that the chain leading us to
>> +        * the top level won't change under our noses. Incrementing the
>> +        * reference count via cgroup_exclude_rmdir guarantees that.
>> +        */
>> +       cgroup_exclude_rmdir(&sk->sk_cgrp->css);
>> +#endif
>> +}
>> +
>> +void sock_release_kmem_cgrp(struct sock *sk)
>> +{
>> +#ifdef CONFIG_CGROUP_KMEM
>> +       cgroup_release_and_wakeup_rmdir(&sk->sk_cgrp->css);
>> +#endif
>> +}
>> +
>>   /**
>>   *     sk_alloc - All socket objects are allocated here
>>   *     @net: the applicable net namespace
>> @@ -1139,6 +1182,7 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
>>                 atomic_set(&sk->sk_wmem_alloc, 1);
>>
>>                 sock_update_classid(sk);
>> +               sock_update_kmem_cgrp(sk);
>>         }
>>
>>         return sk;
>> @@ -1170,6 +1214,7 @@ static void __sk_free(struct sock *sk)
>>                 put_cred(sk->sk_peer_cred);
>>         put_pid(sk->sk_peer_pid);
>>         put_net(sock_net(sk));
>> +       sock_release_kmem_cgrp(sk);
>>         sk_prot_free(sk->sk_prot_creator, sk);
>>   }
>>
>> @@ -1287,8 +1332,8 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
>>                 sk_set_socket(newsk, NULL);
>>                 newsk->sk_wq = NULL;
>>
>> -               if (newsk->sk_prot->sockets_allocated)
>> -                       percpu_counter_inc(newsk->sk_prot->sockets_allocated);
>> +               if (sk_sockets_allocated(sk))
>> +                       percpu_counter_inc(sk_sockets_allocated(sk));
>>
>>                 if (sock_flag(newsk, SOCK_TIMESTAMP) ||
>>                     sock_flag(newsk, SOCK_TIMESTAMPING_RX_SOFTWARE))
>> @@ -1676,29 +1721,51 @@ EXPORT_SYMBOL(sk_wait_data);
>>   */
>>   int __sk_mem_schedule(struct sock *sk, int size, int kind)
>>   {
>> -       struct proto *prot = sk->sk_prot;
>>         int amt = sk_mem_pages(size);
>> +       struct proto *prot = sk->sk_prot;
>>         long allocated;
>> +       int *memory_pressure;
>> +       long *prot_mem;
>> +       int parent_failure = 0;
>> +       struct kmem_cgroup *sg;
>>
>>         sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
>> -       allocated = atomic_long_add_return(amt, prot->memory_allocated);
>> +
>> +       memory_pressure = sk_memory_pressure(sk);
>> +       prot_mem = sk_prot_mem(sk);
>> +
>> +       allocated = atomic_long_add_return(amt, sk_memory_allocated(sk));
>> +
>> +#ifdef CONFIG_CGROUP_KMEM
>> +       for (sg = sk->sk_cgrp->parent; sg != NULL; sg = sg->parent) {
>> +               long alloc;
>> +               /*
>> +                * Large nestings are not the common case, and stopping in the
>> +                * middle would be complicated enough, that we bill it all the
>> +                * way through the root, and if needed, unbill everything later
>> +                */
>> +               alloc = atomic_long_add_return(amt,
>> +                                              sg_memory_allocated(prot, sg));
>> +               parent_failure |= (alloc > sk_prot_mem(sk)[2]);
>> +       }
>> +#endif
>> +
>> +       /* Over hard limit (we, or our parents) */
>> +       if (parent_failure || (allocated > prot_mem[2]))
>> +               goto suppress_allocation;
>>
>>         /* Under limit. */
>> -       if (allocated <= prot->sysctl_mem[0]) {
>> -               if (prot->memory_pressure && *prot->memory_pressure)
>> -                       *prot->memory_pressure = 0;
>> +       if (allocated <= prot_mem[0]) {
>> +               if (memory_pressure && *memory_pressure)
>> +                       *memory_pressure = 0;
>>                 return 1;
>>         }
>>
>>         /* Under pressure. */
>> -       if (allocated > prot->sysctl_mem[1])
>> +       if (allocated > prot_mem[1])
>>                 if (prot->enter_memory_pressure)
>>                         prot->enter_memory_pressure(sk);
>>
>> -       /* Over hard limit. */
>> -       if (allocated>  prot->sysctl_mem[2])
>> -               goto suppress_allocation;
>> -
>>         /* guarantee minimum buffer size under pressure */
>>         if (kind == SK_MEM_RECV) {
>>                 if (atomic_read(&sk->sk_rmem_alloc) < prot->sysctl_rmem[0])
>> @@ -1712,13 +1779,13 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
>>                                 return 1;
>>         }
>>
>> -       if (prot->memory_pressure) {
>> +       if (memory_pressure) {
>>                 int alloc;
>>
>> -               if (!*prot->memory_pressure)
>> +               if (!*memory_pressure)
>>                         return 1;
>> -               alloc = percpu_counter_read_positive(prot->sockets_allocated);
>> -               if (prot->sysctl_mem[2] > alloc *
>> +               alloc = percpu_counter_read_positive(sk_sockets_allocated(sk));
>> +               if (prot_mem[2] > alloc *
>>                     sk_mem_pages(sk->sk_wmem_queued +
>>                                  atomic_read(&sk->sk_rmem_alloc) +
>>                                  sk->sk_forward_alloc))
>> @@ -1741,7 +1808,13 @@ suppress_allocation:
>>
>>         /* Alas. Undo changes. */
>>         sk->sk_forward_alloc -= amt * SK_MEM_QUANTUM;
>> -       atomic_long_sub(amt, prot->memory_allocated);
>> +
>> +       atomic_long_sub(amt, sk_memory_allocated(sk));
>> +
>> +#ifdef CONFIG_CGROUP_KMEM
>> +       for (sg = sk->sk_cgrp->parent; sg != NULL; sg = sg->parent)
>> +               atomic_long_sub(amt, sg_memory_allocated(prot, sg));
>> +#endif
>>         return 0;
>>   }
>>   EXPORT_SYMBOL(__sk_mem_schedule);
>> @@ -1753,14 +1826,24 @@ EXPORT_SYMBOL(__sk_mem_schedule);
>>   void __sk_mem_reclaim(struct sock *sk)
>>   {
>>         struct proto *prot = sk->sk_prot;
>> +       struct kmem_cgroup *sg = sk->sk_cgrp;
>> +       int *memory_pressure = sk_memory_pressure(sk);
>>
>>         atomic_long_sub(sk->sk_forward_alloc >> SK_MEM_QUANTUM_SHIFT,
>> -                  prot->memory_allocated);
>> +                  sk_memory_allocated(sk));
>> +
>> +#ifdef CONFIG_CGROUP_KMEM
>> +       for (sg = sk->sk_cgrp->parent; sg != NULL; sg = sg->parent) {
>> +               atomic_long_sub(sk->sk_forward_alloc >> SK_MEM_QUANTUM_SHIFT,
>> +                                               sg_memory_allocated(prot, sg));
>> +       }
>> +#endif
>> +
>>         sk->sk_forward_alloc &= SK_MEM_QUANTUM - 1;
>>
>> -       if (prot->memory_pressure && *prot->memory_pressure &&
>> -           (atomic_long_read(prot->memory_allocated) < prot->sysctl_mem[0]))
>> -               *prot->memory_pressure = 0;
>> +       if (memory_pressure && *memory_pressure &&
>> +           (atomic_long_read(sk_memory_allocated(sk)) < sk_prot_mem(sk)[0]))
>> +               *memory_pressure = 0;
>>   }
>>   EXPORT_SYMBOL(__sk_mem_reclaim);
>>
>> @@ -2252,9 +2335,6 @@ void sk_common_release(struct sock *sk)
>>   }
>>   EXPORT_SYMBOL(sk_common_release);
>>
>> -static DEFINE_RWLOCK(proto_list_lock);
>> -static LIST_HEAD(proto_list);
>> -
>>   #ifdef CONFIG_PROC_FS
>>   #define PROTO_INUSE_NR 64      /* should be enough for the first time */
>>   struct prot_inuse {
>> @@ -2479,13 +2559,17 @@ static char proto_method_implemented(const void *method)
>>
>>   static void proto_seq_printf(struct seq_file *seq, struct proto *proto)
>>   {
>> +       struct kmem_cgroup *sg = kcg_from_task(current);
>> +
>>         seq_printf(seq, "%-9s %4u %6d  %6ld   %-3s %6u   %-3s  %-10s "
>>                         "%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
>>                    proto->name,
>>                    proto->obj_size,
>>                    sock_prot_inuse_get(seq_file_net(seq), proto),
>> -                  proto->memory_allocated != NULL ? atomic_long_read(proto->memory_allocated) : -1L,
>> -                  proto->memory_pressure != NULL ? *proto->memory_pressure ? "yes" : "no" : "NI",
>> +                  proto->memory_allocated != NULL ?
>> +                       atomic_long_read(sg_memory_allocated(proto, sg)) : -1L,
>> +                  proto->memory_pressure != NULL ?
>> +                       *sg_memory_pressure(proto, sg) ? "yes" : "no" : "NI",
>>                    proto->max_header,
>>                    proto->slab == NULL ? "no" : "yes",
>>                    module_name(proto->owner),
>> diff --git a/net/decnet/af_decnet.c b/net/decnet/af_decnet.c
>> index 19acd00..463b299 100644
>> --- a/net/decnet/af_decnet.c
>> +++ b/net/decnet/af_decnet.c
>> @@ -458,13 +458,28 @@ static void dn_enter_memory_pressure(struct sock *sk)
>>         }
>>   }
>>
>> +static atomic_long_t *memory_allocated_dn(struct kmem_cgroup *sg)
>> +{
>> +       return &decnet_memory_allocated;
>> +}
>> +
>> +static int *memory_pressure_dn(struct kmem_cgroup *sg)
>> +{
>> +       return &dn_memory_pressure;
>> +}
>> +
>> +static long *dn_sysctl_mem(struct kmem_cgroup *sg)
>> +{
>> +       return sysctl_decnet_mem;
>> +}
>> +
>>   static struct proto dn_proto = {
>>         .name                   = "NSP",
>>         .owner                  = THIS_MODULE,
>>         .enter_memory_pressure  = dn_enter_memory_pressure,
>> -       .memory_pressure        = &dn_memory_pressure,
>> -       .memory_allocated       = &decnet_memory_allocated,
>> -       .sysctl_mem             = sysctl_decnet_mem,
>> +       .memory_pressure        = memory_pressure_dn,
>> +       .memory_allocated       = memory_allocated_dn,
>> +       .prot_mem               = dn_sysctl_mem,
>>         .sysctl_wmem            = sysctl_decnet_wmem,
>>         .sysctl_rmem            = sysctl_decnet_rmem,
>>         .max_header             = DN_MAX_NSP_DATA_HEADER + 64,
>> diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
>> index b14ec7d..9c80acf 100644
>> --- a/net/ipv4/proc.c
>> +++ b/net/ipv4/proc.c
>> @@ -52,20 +52,22 @@ static int sockstat_seq_show(struct seq_file *seq, void *v)
>>   {
>>         struct net *net = seq->private;
>>         int orphans, sockets;
>> +       struct kmem_cgroup *sg = kcg_from_task(current);
>> +       struct percpu_counter *allocated = sg_sockets_allocated(&tcp_prot, sg);
>>
>>         local_bh_disable();
>>         orphans = percpu_counter_sum_positive(&tcp_orphan_count);
>> -       sockets = percpu_counter_sum_positive(&tcp_sockets_allocated);
>> +       sockets = percpu_counter_sum_positive(allocated);
>>         local_bh_enable();
>>
>>         socket_seq_show(seq);
>>         seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %ld\n",
>>                    sock_prot_inuse_get(net, &tcp_prot), orphans,
>>                    tcp_death_row.tw_count, sockets,
>> -                  atomic_long_read(&tcp_memory_allocated));
>> +                  atomic_long_read(sg_memory_allocated((&tcp_prot), sg)));
>>         seq_printf(seq, "UDP: inuse %d mem %ld\n",
>>                    sock_prot_inuse_get(net, &udp_prot),
>> -                  atomic_long_read(&udp_memory_allocated));
>> +                  atomic_long_read(sg_memory_allocated((&udp_prot), sg)));
>>         seq_printf(seq, "UDPLITE: inuse %d\n",
>>                    sock_prot_inuse_get(net, &udplite_prot));
>>         seq_printf(seq, "RAW: inuse %d\n",
>> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
>> index 69fd720..5e89480 100644
>> --- a/net/ipv4/sysctl_net_ipv4.c
>> +++ b/net/ipv4/sysctl_net_ipv4.c
>> @@ -14,6 +14,8 @@
>>   #include<linux/init.h>
>>   #include<linux/slab.h>
>>   #include<linux/nsproxy.h>
>> +#include<linux/kmem_cgroup.h>
>> +#include<linux/swap.h>
>>   #include<net/snmp.h>
>>   #include<net/icmp.h>
>>   #include<net/ip.h>
>> @@ -174,6 +176,43 @@ static int proc_allowed_congestion_control(ctl_table *ctl,
>>         return ret;
>>   }
>>
>> +static int ipv4_tcp_mem(ctl_table *ctl, int write,
>> +                          void __user *buffer, size_t *lenp,
>> +                          loff_t *ppos)
>> +{
>> +       int ret;
>> +       unsigned long vec[3];
>> +       struct kmem_cgroup *kmem = kcg_from_task(current);
>> +       struct net *net = current->nsproxy->net_ns;
>> +       int i;
>> +
>> +       ctl_table tmp = {
>> +               .data = &vec,
>> +               .maxlen = sizeof(vec),
>> +               .mode = ctl->mode,
>> +       };
>> +
>> +       if (!write) {
>> +               ctl->data = &net->ipv4.sysctl_tcp_mem;
>> +               return proc_doulongvec_minmax(ctl, write, buffer, lenp, ppos);
>> +       }
>> +
>> +       ret = proc_doulongvec_minmax(&tmp, write, buffer, lenp, ppos);
>> +       if (ret)
>> +               return ret;
>> +
>> +       for (i = 0; i < 3; i++)
>> +               if (vec[i] > kmem->tcp_max_memory)
>> +                       return -EINVAL;
>> +
>> +       for (i = 0; i < 3; i++) {
>> +               net->ipv4.sysctl_tcp_mem[i] = vec[i];
>> +               kmem->tcp_prot_mem[i] = net->ipv4.sysctl_tcp_mem[i];
>> +       }
>> +
>> +       return 0;
>> +}
>> +
>>   static struct ctl_table ipv4_table[] = {
>>         {
>>                 .procname       = "tcp_timestamps",
>> @@ -433,13 +472,6 @@ static struct ctl_table ipv4_table[] = {
>>                 .proc_handler   = proc_dointvec
>>         },
>>         {
>> -               .procname       = "tcp_mem",
>> -               .data           =&sysctl_tcp_mem,
>> -               .maxlen         = sizeof(sysctl_tcp_mem),
>> -               .mode           = 0644,
>> -               .proc_handler   = proc_doulongvec_minmax
>> -       },
>> -       {
>>                 .procname       = "tcp_wmem",
>>                 .data           =&sysctl_tcp_wmem,
>>                 .maxlen         = sizeof(sysctl_tcp_wmem),
>> @@ -721,6 +753,12 @@ static struct ctl_table ipv4_net_table[] = {
>>                 .mode           = 0644,
>>                 .proc_handler   = ipv4_ping_group_range,
>>         },
>> +       {
>> +               .procname       = "tcp_mem",
>> +               .maxlen         = sizeof(init_net.ipv4.sysctl_tcp_mem),
>> +               .mode           = 0644,
>> +               .proc_handler   = ipv4_tcp_mem,
>> +       },
>>         { }
>>   };
>>
>> @@ -734,6 +772,7 @@ EXPORT_SYMBOL_GPL(net_ipv4_ctl_path);
>>   static __net_init int ipv4_sysctl_init_net(struct net *net)
>>   {
>>         struct ctl_table *table;
>> +       unsigned long limit;
>>
>>         table = ipv4_net_table;
>>         if (!net_eq(net,&init_net)) {
>> @@ -769,6 +808,12 @@ static __net_init int ipv4_sysctl_init_net(struct net *net)
>>
>>         net->ipv4.sysctl_rt_cache_rebuild_count = 4;
>>
>> +       limit = nr_free_buffer_pages() / 8;
>> +       limit = max(limit, 128UL);
>> +       net->ipv4.sysctl_tcp_mem[0] = limit / 4 * 3;
>> +       net->ipv4.sysctl_tcp_mem[1] = limit;
>> +       net->ipv4.sysctl_tcp_mem[2] = net->ipv4.sysctl_tcp_mem[0] * 2;
>> +
>>         net->ipv4.ipv4_hdr = register_net_sysctl_table(net,
>>                         net_ipv4_ctl_path, table);
>>         if (net->ipv4.ipv4_hdr == NULL)
>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>> index 46febca..e1918fa 100644
>> --- a/net/ipv4/tcp.c
>> +++ b/net/ipv4/tcp.c
>> @@ -266,6 +266,7 @@
>>   #include<linux/crypto.h>
>>   #include<linux/time.h>
>>   #include<linux/slab.h>
>> +#include<linux/nsproxy.h>
>>
>>   #include<net/icmp.h>
>>   #include<net/tcp.h>
>> @@ -282,23 +283,12 @@ int sysctl_tcp_fin_timeout __read_mostly = TCP_FIN_TIMEOUT;
>>   struct percpu_counter tcp_orphan_count;
>>   EXPORT_SYMBOL_GPL(tcp_orphan_count);
>>
>> -long sysctl_tcp_mem[3] __read_mostly;
>>   int sysctl_tcp_wmem[3] __read_mostly;
>>   int sysctl_tcp_rmem[3] __read_mostly;
>>
>> -EXPORT_SYMBOL(sysctl_tcp_mem);
>>   EXPORT_SYMBOL(sysctl_tcp_rmem);
>>   EXPORT_SYMBOL(sysctl_tcp_wmem);
>>
>> -atomic_long_t tcp_memory_allocated;    /* Current allocated memory. */
>> -EXPORT_SYMBOL(tcp_memory_allocated);
>> -
>> -/*
>> - * Current number of TCP sockets.
>> - */
>> -struct percpu_counter tcp_sockets_allocated;
>> -EXPORT_SYMBOL(tcp_sockets_allocated);
>> -
>>   /*
>>   * TCP splice context
>>   */
>> @@ -308,23 +298,157 @@ struct tcp_splice_state {
>>         unsigned int flags;
>>   };
>>
>> +#ifdef CONFIG_CGROUP_KMEM
>>   /*
>>   * Pressure flag: try to collapse.
>>   * Technical note: it is used by multiple contexts non atomically.
>>   * All the __sk_mem_schedule() is of this nature: accounting
>>   * is strict, actions are advisory and have some latency.
>>   */
>> -int tcp_memory_pressure __read_mostly;
>> -EXPORT_SYMBOL(tcp_memory_pressure);
>> -
>>   void tcp_enter_memory_pressure(struct sock *sk)
>>   {
>> +       struct kmem_cgroup *sg = sk->sk_cgrp;
>> +       if (!sg->tcp_memory_pressure) {
>> +               NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPMEMORYPRESSURES);
>> +               sg->tcp_memory_pressure = 1;
>> +       }
>> +}
>> +
>> +long *tcp_sysctl_mem(struct kmem_cgroup *sg)
>> +{
>> +       return sg->tcp_prot_mem;
>> +}
>> +
>> +atomic_long_t *memory_allocated_tcp(struct kmem_cgroup *sg)
>> +{
>> +       return&(sg->tcp_memory_allocated);
>> +}
>> +
>> +static int tcp_write_maxmem(struct cgroup *cgrp, struct cftype *cft, u64 val)
>> +{
>> +       struct kmem_cgroup *sg = kcg_from_cgroup(cgrp);
>> +       struct net *net = current->nsproxy->net_ns;
>> +       int i;
>> +
>> +       if (!cgroup_lock_live_group(cgrp))
>> +               return -ENODEV;
>> +
>> +       /*
>> +        * We can't allow more memory than our parents. Since this
>> +        * will be tested for all calls, by induction, there is no need
>> +        * to test any parent other than our own
>> +        * */
>> +       if (sg->parent && (val > sg->parent->tcp_max_memory))
>> +               val = sg->parent->tcp_max_memory;
>> +
>> +       sg->tcp_max_memory = val;
>> +
>> +       for (i = 0; i < 3; i++)
>> +               sg->tcp_prot_mem[i] = min_t(long, val,
>> +                                            net->ipv4.sysctl_tcp_mem[i]);
>> +
>> +       cgroup_unlock();
>> +
>> +       return 0;
>> +}
>> +
>> +static u64 tcp_read_maxmem(struct cgroup *cgrp, struct cftype *cft)
>> +{
>> +       struct kmem_cgroup *sg = kcg_from_cgroup(cgrp);
>> +       u64 ret;
>> +
>> +       if (!cgroup_lock_live_group(cgrp))
>> +               return -ENODEV;
>> +       ret = sg->tcp_max_memory;
>> +
>> +       cgroup_unlock();
>> +       return ret;
>> +}
>> +
>> +static struct cftype tcp_files[] = {
>> +       {
>> +               .name = "tcp_maxmem",
>> +               .write_u64 = tcp_write_maxmem,
>> +               .read_u64 = tcp_read_maxmem,
>> +       },
>> +};
>> +
>> +int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss)
>> +{
>> +       struct kmem_cgroup *sg = kcg_from_cgroup(cgrp);
>> +       unsigned long limit;
>> +       struct net *net = current->nsproxy->net_ns;
>> +
>> +       sg->tcp_memory_pressure = 0;
>> +       atomic_long_set(&sg->tcp_memory_allocated, 0);
>> +       percpu_counter_init(&sg->tcp_sockets_allocated, 0);
>> +
>> +       limit = nr_free_buffer_pages() / 8;
>> +       limit = max(limit, 128UL);
>> +
>> +       if (sg->parent)
>> +               sg->tcp_max_memory = sg->parent->tcp_max_memory;
>> +       else
>> +               sg->tcp_max_memory = limit * 2;
>> +
>> +       sg->tcp_prot_mem[0] = net->ipv4.sysctl_tcp_mem[0];
>> +       sg->tcp_prot_mem[1] = net->ipv4.sysctl_tcp_mem[1];
>> +       sg->tcp_prot_mem[2] = net->ipv4.sysctl_tcp_mem[2];
>> +
>> +       return cgroup_add_files(cgrp, ss, tcp_files, ARRAY_SIZE(tcp_files));
>> +}
>> +EXPORT_SYMBOL(tcp_init_cgroup);
>> +
>> +int *memory_pressure_tcp(struct kmem_cgroup *sg)
>> +{
>> +       return&sg->tcp_memory_pressure;
>> +}
>> +
>> +struct percpu_counter *sockets_allocated_tcp(struct kmem_cgroup *sg)
>> +{
>> +       return&sg->tcp_sockets_allocated;
>> +}
>> +#else
>> +
>> +/* Current number of TCP sockets. */
>> +struct percpu_counter tcp_sockets_allocated;
>> +atomic_long_t tcp_memory_allocated;    /* Current allocated memory. */
>> +int tcp_memory_pressure;
>> +
>> +int *memory_pressure_tcp(struct kmem_cgroup *sg)
>> +{
>> +       return&tcp_memory_pressure;
>> +}
>> +
>> +struct percpu_counter *sockets_allocated_tcp(struct kmem_cgroup *sg)
>> +{
>> +       return&tcp_sockets_allocated;
>> +}
>> +
>> +void tcp_enter_memory_pressure(struct sock *sk)
>> +{
>>         if (!tcp_memory_pressure) {
>>                 NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPMEMORYPRESSURES);
>>                 tcp_memory_pressure = 1;
>>         }
>>   }
>> +
>> +long *tcp_sysctl_mem(struct kmem_cgroup *sg)
>> +{
>> +       return init_net.ipv4.sysctl_tcp_mem;
>> +}
>> +
>> +atomic_long_t *memory_allocated_tcp(struct kmem_cgroup *sg)
>> +{
>> +       return&tcp_memory_allocated;
>> +}
>> +#endif /* CONFIG_CGROUP_KMEM */
>> +
>> +EXPORT_SYMBOL(memory_pressure_tcp);
>> +EXPORT_SYMBOL(sockets_allocated_tcp);
>>   EXPORT_SYMBOL(tcp_enter_memory_pressure);
>> +EXPORT_SYMBOL(tcp_sysctl_mem);
>> +EXPORT_SYMBOL(memory_allocated_tcp);
>>
>>   /* Convert seconds to retransmits based on initial and max timeout */
>>   static u8 secs_to_retrans(int seconds, int timeout, int rto_max)
>> @@ -3226,7 +3350,9 @@ void __init tcp_init(void)
>>
>>         BUILD_BUG_ON(sizeof(struct tcp_skb_cb)>  sizeof(skb->cb));
>>
>> +#ifndef CONFIG_CGROUP_KMEM
>>         percpu_counter_init(&tcp_sockets_allocated, 0);
>> +#endif
>>         percpu_counter_init(&tcp_orphan_count, 0);
>>         tcp_hashinfo.bind_bucket_cachep =
>>                 kmem_cache_create("tcp_bind_bucket",
>> @@ -3277,14 +3403,10 @@ void __init tcp_init(void)
>>         sysctl_tcp_max_orphans = cnt / 2;
>>         sysctl_max_syn_backlog = max(128, cnt / 256);
>>
>> -       limit = nr_free_buffer_pages() / 8;
>> -       limit = max(limit, 128UL);
>> -       sysctl_tcp_mem[0] = limit / 4 * 3;
>> -       sysctl_tcp_mem[1] = limit;
>> -       sysctl_tcp_mem[2] = sysctl_tcp_mem[0] * 2;
>> -
>>         /* Set per-socket limits to no more than 1/128 the pressure threshold */
>> -       limit = ((unsigned long)sysctl_tcp_mem[1])<<  (PAGE_SHIFT - 7);
>> +       limit = (unsigned long)init_net.ipv4.sysctl_tcp_mem[1];
>> +       limit<<= (PAGE_SHIFT - 7);
>> +
>>         max_share = min(4UL*1024*1024, limit);
>>
>>         sysctl_tcp_wmem[0] = SK_MEM_QUANTUM;
>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>> index ea0d218..c44e830 100644
>> --- a/net/ipv4/tcp_input.c
>> +++ b/net/ipv4/tcp_input.c
>> @@ -316,7 +316,7 @@ static void tcp_grow_window(struct sock *sk, struct sk_buff *skb)
>>         /* Check #1 */
>>         if (tp->rcv_ssthresh<  tp->window_clamp&&
>>             (int)tp->rcv_ssthresh<  tcp_space(sk)&&
>> -           !tcp_memory_pressure) {
>> +           !sk_memory_pressure(sk)) {
>>                 int incr;
>>
>>                 /* Check #2. Increase window, if skb with such overhead
>> @@ -393,15 +393,16 @@ static void tcp_clamp_window(struct sock *sk)
>>   {
>>         struct tcp_sock *tp = tcp_sk(sk);
>>         struct inet_connection_sock *icsk = inet_csk(sk);
>> +       struct proto *prot = sk->sk_prot;
>>
>>         icsk->icsk_ack.quick = 0;
>>
>> -       if (sk->sk_rcvbuf<  sysctl_tcp_rmem[2]&&
>> +       if (sk->sk_rcvbuf<  prot->sysctl_rmem[2]&&
>>             !(sk->sk_userlocks&  SOCK_RCVBUF_LOCK)&&
>> -           !tcp_memory_pressure&&
>> -           atomic_long_read(&tcp_memory_allocated)<  sysctl_tcp_mem[0]) {
>> +           !sk_memory_pressure(sk)&&
>> +           atomic_long_read(sk_memory_allocated(sk))<  sk_prot_mem(sk)[0]) {
>>                 sk->sk_rcvbuf = min(atomic_read(&sk->sk_rmem_alloc),
>> -                                   sysctl_tcp_rmem[2]);
>> +                                   prot->sysctl_rmem[2]);
>>         }
>>         if (atomic_read(&sk->sk_rmem_alloc)>  sk->sk_rcvbuf)
>>                 tp->rcv_ssthresh = min(tp->window_clamp, 2U * tp->advmss);
>> @@ -4806,7 +4807,7 @@ static int tcp_prune_queue(struct sock *sk)
>>
>>         if (atomic_read(&sk->sk_rmem_alloc)>= sk->sk_rcvbuf)
>>                 tcp_clamp_window(sk);
>> -       else if (tcp_memory_pressure)
>> +       else if (sk_memory_pressure(sk))
>>                 tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss);
>>
>>         tcp_collapse_ofo_queue(sk);
>> @@ -4872,11 +4873,11 @@ static int tcp_should_expand_sndbuf(struct sock *sk)
>>                 return 0;
>>
>>         /* If we are under global TCP memory pressure, do not expand.  */
>> -       if (tcp_memory_pressure)
>> +       if (sk_memory_pressure(sk))
>>                 return 0;
>>
>>         /* If we are under soft global TCP memory pressure, do not expand.  */
>> -       if (atomic_long_read(&tcp_memory_allocated)>= sysctl_tcp_mem[0])
>> +       if (atomic_long_read(sk_memory_allocated(sk))>= sk_prot_mem(sk)[0])
>>                 return 0;
>>
>>         /* If we filled the congestion window, do not expand.  */
>> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
>> index 1c12b8e..af6c095 100644
>> --- a/net/ipv4/tcp_ipv4.c
>> +++ b/net/ipv4/tcp_ipv4.c
>> @@ -1848,6 +1848,7 @@ static int tcp_v4_init_sock(struct sock *sk)
>>   {
>>         struct inet_connection_sock *icsk = inet_csk(sk);
>>         struct tcp_sock *tp = tcp_sk(sk);
>> +       struct kmem_cgroup *sg;
>>
>>         skb_queue_head_init(&tp->out_of_order_queue);
>>         tcp_init_xmit_timers(sk);
>> @@ -1901,7 +1902,13 @@ static int tcp_v4_init_sock(struct sock *sk)
>>         sk->sk_rcvbuf = sysctl_tcp_rmem[1];
>>
>>         local_bh_disable();
>> -       percpu_counter_inc(&tcp_sockets_allocated);
>> +       percpu_counter_inc(sk_sockets_allocated(sk));
>> +
>> +#ifdef CONFIG_CGROUP_KMEM
>> +       for (sg = sk->sk_cgrp->parent; sg; sg = sg->parent)
>> +               percpu_counter_inc(sg_sockets_allocated(sk->sk_prot, sg));
>> +#endif
>> +
>>         local_bh_enable();
>>
>>         return 0;
>> @@ -1910,6 +1917,7 @@ static int tcp_v4_init_sock(struct sock *sk)
>>   void tcp_v4_destroy_sock(struct sock *sk)
>>   {
>>         struct tcp_sock *tp = tcp_sk(sk);
>> +       struct kmem_cgroup *sg;
>>
>>         tcp_clear_xmit_timers(sk);
>>
>> @@ -1957,7 +1965,11 @@ void tcp_v4_destroy_sock(struct sock *sk)
>>                 tp->cookie_values = NULL;
>>         }
>>
>> -       percpu_counter_dec(&tcp_sockets_allocated);
>> +       percpu_counter_dec(sk_sockets_allocated(sk));
>> +#ifdef CONFIG_CGROUP_KMEM
>> +       for (sg = sk->sk_cgrp->parent; sg; sg = sg->parent)
>> +               percpu_counter_dec(sg_sockets_allocated(sk->sk_prot, sg));
>> +#endif
>>   }
>>   EXPORT_SYMBOL(tcp_v4_destroy_sock);
>>
>> @@ -2598,11 +2610,14 @@ struct proto tcp_prot = {
>>         .unhash                 = inet_unhash,
>>         .get_port               = inet_csk_get_port,
>>         .enter_memory_pressure  = tcp_enter_memory_pressure,
>> -       .sockets_allocated      =&tcp_sockets_allocated,
>> +       .memory_pressure        = memory_pressure_tcp,
>> +       .sockets_allocated      = sockets_allocated_tcp,
>>         .orphan_count           =&tcp_orphan_count,
>> -       .memory_allocated       =&tcp_memory_allocated,
>> -       .memory_pressure        =&tcp_memory_pressure,
>> -       .sysctl_mem             = sysctl_tcp_mem,
>> +       .memory_allocated       = memory_allocated_tcp,
>> +#ifdef CONFIG_CGROUP_KMEM
>> +       .init_cgroup            = tcp_init_cgroup,
>> +#endif
>> +       .prot_mem               = tcp_sysctl_mem,
>>         .sysctl_wmem            = sysctl_tcp_wmem,
>>         .sysctl_rmem            = sysctl_tcp_rmem,
>>         .max_header             = MAX_TCP_HEADER,
>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>> index 882e0b0..06aeb31 100644
>> --- a/net/ipv4/tcp_output.c
>> +++ b/net/ipv4/tcp_output.c
>> @@ -1912,7 +1912,7 @@ u32 __tcp_select_window(struct sock *sk)
>>         if (free_space<  (full_space>>  1)) {
>>                 icsk->icsk_ack.quick = 0;
>>
>> -               if (tcp_memory_pressure)
>> +               if (sk_memory_pressure(sk))
>>                         tp->rcv_ssthresh = min(tp->rcv_ssthresh,
>>                                                4U * tp->advmss);
>>
>> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
>> index ecd44b0..2c67617 100644
>> --- a/net/ipv4/tcp_timer.c
>> +++ b/net/ipv4/tcp_timer.c
>> @@ -261,7 +261,7 @@ static void tcp_delack_timer(unsigned long data)
>>         }
>>
>>   out:
>> -       if (tcp_memory_pressure)
>> +       if (sk_memory_pressure(sk))
>>                 sk_mem_reclaim(sk);
>>   out_unlock:
>>         bh_unlock_sock(sk);
>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>> index 1b5a193..6c08c65 100644
>> --- a/net/ipv4/udp.c
>> +++ b/net/ipv4/udp.c
>> @@ -120,9 +120,6 @@ EXPORT_SYMBOL(sysctl_udp_rmem_min);
>>   int sysctl_udp_wmem_min __read_mostly;
>>   EXPORT_SYMBOL(sysctl_udp_wmem_min);
>>
>> -atomic_long_t udp_memory_allocated;
>> -EXPORT_SYMBOL(udp_memory_allocated);
>> -
>>   #define MAX_UDP_PORTS 65536
>>   #define PORTS_PER_CHAIN (MAX_UDP_PORTS / UDP_HTABLE_SIZE_MIN)
>>
>> @@ -1918,6 +1915,19 @@ unsigned int udp_poll(struct file *file, struct socket *sock, poll_table *wait)
>>   }
>>   EXPORT_SYMBOL(udp_poll);
>>
>> +static atomic_long_t udp_memory_allocated;
>> +atomic_long_t *memory_allocated_udp(struct kmem_cgroup *sg)
>> +{
>> +       return&udp_memory_allocated;
>> +}
>> +EXPORT_SYMBOL(memory_allocated_udp);
>> +
>> +long *udp_sysctl_mem(struct kmem_cgroup *sg)
>> +{
>> +       return sysctl_udp_mem;
>> +}
>> +EXPORT_SYMBOL(udp_sysctl_mem);
>> +
>>   struct proto udp_prot = {
>>         .name              = "UDP",
>>         .owner             = THIS_MODULE,
>> @@ -1936,8 +1946,8 @@ struct proto udp_prot = {
>>         .unhash            = udp_lib_unhash,
>>         .rehash            = udp_v4_rehash,
>>         .get_port          = udp_v4_get_port,
>> -       .memory_allocated  =&udp_memory_allocated,
>> -       .sysctl_mem        = sysctl_udp_mem,
>> +       .memory_allocated  = memory_allocated_udp,
>> +       .prot_mem          = udp_sysctl_mem,
>>         .sysctl_wmem       =&sysctl_udp_wmem_min,
>>         .sysctl_rmem       =&sysctl_udp_rmem_min,
>>         .obj_size          = sizeof(struct udp_sock),
>> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
>> index d1fb63f..0762e68 100644
>> --- a/net/ipv6/tcp_ipv6.c
>> +++ b/net/ipv6/tcp_ipv6.c
>> @@ -1959,6 +1959,7 @@ static int tcp_v6_init_sock(struct sock *sk)
>>   {
>>         struct inet_connection_sock *icsk = inet_csk(sk);
>>         struct tcp_sock *tp = tcp_sk(sk);
>> +       struct kmem_cgroup *sg;
>>
>>         skb_queue_head_init(&tp->out_of_order_queue);
>>         tcp_init_xmit_timers(sk);
>> @@ -2012,7 +2013,12 @@ static int tcp_v6_init_sock(struct sock *sk)
>>         sk->sk_rcvbuf = sysctl_tcp_rmem[1];
>>
>>         local_bh_disable();
>> -       percpu_counter_inc(&tcp_sockets_allocated);
>> +       percpu_counter_inc(sk_sockets_allocated(sk));
>> +#ifdef CONFIG_CGROUP_KMEM
>> +       for (sg = sk->sk_cgrp->parent; sg; sg = sg->parent)
>> +               percpu_counter_inc(sg_sockets_allocated(sk->sk_prot, sg));
>> +#endif
>> +
>>         local_bh_enable();
>>
>>         return 0;
>> @@ -2221,11 +2227,11 @@ struct proto tcpv6_prot = {
>>         .unhash                 = inet_unhash,
>>         .get_port               = inet_csk_get_port,
>>         .enter_memory_pressure  = tcp_enter_memory_pressure,
>> -       .sockets_allocated      =&tcp_sockets_allocated,
>> -       .memory_allocated       =&tcp_memory_allocated,
>> -       .memory_pressure        =&tcp_memory_pressure,
>> +       .sockets_allocated      = sockets_allocated_tcp,
>> +       .memory_allocated       = memory_allocated_tcp,
>> +       .memory_pressure        = memory_pressure_tcp,
>>         .orphan_count           =&tcp_orphan_count,
>> -       .sysctl_mem             = sysctl_tcp_mem,
>> +       .prot_mem               = tcp_sysctl_mem,
>>         .sysctl_wmem            = sysctl_tcp_wmem,
>>         .sysctl_rmem            = sysctl_tcp_rmem,
>>         .max_header             = MAX_TCP_HEADER,
>> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
>> index 29213b5..ef4b5b3 100644
>> --- a/net/ipv6/udp.c
>> +++ b/net/ipv6/udp.c
>> @@ -1465,8 +1465,8 @@ struct proto udpv6_prot = {
>>         .unhash            = udp_lib_unhash,
>>         .rehash            = udp_v6_rehash,
>>         .get_port          = udp_v6_get_port,
>> -       .memory_allocated  =&udp_memory_allocated,
>> -       .sysctl_mem        = sysctl_udp_mem,
>> +       .memory_allocated  = memory_allocated_udp,
>> +       .prot_mem          = udp_sysctl_mem,
>>         .sysctl_wmem       =&sysctl_udp_wmem_min,
>>         .sysctl_rmem       =&sysctl_udp_rmem_min,
>>         .obj_size          = sizeof(struct udp6_sock),
>> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
>> index 836aa63..1b0300d 100644
>> --- a/net/sctp/socket.c
>> +++ b/net/sctp/socket.c
>> @@ -119,11 +119,30 @@ static int sctp_memory_pressure;
>>   static atomic_long_t sctp_memory_allocated;
>>   struct percpu_counter sctp_sockets_allocated;
>>
>> +static long *sctp_sysctl_mem(struct kmem_cgroup *sg)
>> +{
>> +       return sysctl_sctp_mem;
>> +}
>> +
>>   static void sctp_enter_memory_pressure(struct sock *sk)
>>   {
>>         sctp_memory_pressure = 1;
>>   }
>>
>> +static int *memory_pressure_sctp(struct kmem_cgroup *sg)
>> +{
>> +       return&sctp_memory_pressure;
>> +}
>> +
>> +static atomic_long_t *memory_allocated_sctp(struct kmem_cgroup *sg)
>> +{
>> +       return&sctp_memory_allocated;
>> +}
>> +
>> +static struct percpu_counter *sockets_allocated_sctp(struct kmem_cgroup *sg)
>> +{
>> +       return&sctp_sockets_allocated;
>> +}
>>
>>   /* Get the sndbuf space available at the time on the association.  */
>>   static inline int sctp_wspace(struct sctp_association *asoc)
>> @@ -6831,13 +6850,13 @@ struct proto sctp_prot = {
>>         .unhash      =  sctp_unhash,
>>         .get_port    =  sctp_get_port,
>>         .obj_size    =  sizeof(struct sctp_sock),
>> -       .sysctl_mem  =  sysctl_sctp_mem,
>> +       .prot_mem    =  sctp_sysctl_mem,
>>         .sysctl_rmem =  sysctl_sctp_rmem,
>>         .sysctl_wmem =  sysctl_sctp_wmem,
>> -       .memory_pressure =&sctp_memory_pressure,
>> +       .memory_pressure = memory_pressure_sctp,
>>         .enter_memory_pressure = sctp_enter_memory_pressure,
>> -       .memory_allocated =&sctp_memory_allocated,
>> -       .sockets_allocated =&sctp_sockets_allocated,
>> +       .memory_allocated = memory_allocated_sctp,
>> +       .sockets_allocated = sockets_allocated_sctp,
>>   };
>>
>>   #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
>> @@ -6863,12 +6882,12 @@ struct proto sctpv6_prot = {
>>         .unhash         = sctp_unhash,
>>         .get_port       = sctp_get_port,
>>         .obj_size       = sizeof(struct sctp6_sock),
>> -       .sysctl_mem     = sysctl_sctp_mem,
>> +       .prot_mem       = sctp_sysctl_mem,
>>         .sysctl_rmem    = sysctl_sctp_rmem,
>>         .sysctl_wmem    = sysctl_sctp_wmem,
>> -       .memory_pressure =&sctp_memory_pressure,
>> +       .memory_pressure = memory_pressure_sctp,
>>         .enter_memory_pressure = sctp_enter_memory_pressure,
>> -       .memory_allocated =&sctp_memory_allocated,
>> -       .sockets_allocated =&sctp_sockets_allocated,
>> +       .memory_allocated = memory_allocated_sctp,
>> +       .sockets_allocated = sockets_allocated_sctp,
>>   };
>>   #endif /* defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) */
>> --
>> 1.7.6
>>

^ permalink raw reply

* Re: [PATCH -next v2] unix stream: Fix use-after-free crashes
From: Tim Chen @ 2011-09-06 16:25 UTC (permalink / raw)
  To: Yan, Zheng
  Cc: netdev@vger.kernel.org, davem@davemloft.net, sfr@canb.auug.org.au,
	jirislaby@gmail.com, sedat.dilek@gmail.com, alex.shi
In-Reply-To: <4E631032.6050606@intel.com>

On Sun, 2011-09-04 at 13:44 +0800, Yan, Zheng wrote:
> Commit 0856a30409 (Scm: Remove unnecessary pid & credential references
> in Unix socket's send and receive path) introduced a use-after-free bug.
> It passes the scm reference to the first skb. Skb(s) afterwards may
> reference freed data structure because the first skb can be destructed
> by the receiver at anytime. The fix is by passing the scm reference to
> the very last skb.
> 
> Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
> Reported-by: Jiri Slaby <jirislaby@gmail.com>
> ---

Thanks for finding this bug in my original patch.  I missed the case
where the receiving side could release all of its references to the
credential before the sending side uses the credential again for
subsequent skbs in the stream, which caused the problem we saw.  Taking
an extra reference to the pid/credentials at the beginning of the stream
and not taking a reference for the last skb is the right approach.
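
As a rough illustration of the reference handoff described above, here is a
minimal userspace sketch. It is not the af_unix code: `cred_ref`,
`send_stream` and the `handoff` policy are hypothetical names invented for
this example, and a `freed` flag stands in for the real free so the bad case
can be detected safely. It assumes the worst case, where the receiver
consumes each skb as soon as it is queued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the pid/credential object shared by a stream. */
struct cred_ref {
	int refcount;	/* live references; the sender starts with one */
	bool freed;	/* set when the last reference is dropped */
};

/* Which skb inherits the sender's own reference instead of taking a new one. */
enum handoff { HANDOFF_FIRST, HANDOFF_LAST };

/*
 * Queue n skbs for one stream. Each skb carries exactly one reference,
 * which the receiver drops immediately when it consumes the skb.
 * Returns false if an skb would have been attached to an already freed
 * credential, i.e. the use-after-free case.
 */
static bool send_stream(struct cred_ref *c, size_t n, enum handoff policy)
{
	for (size_t i = 0; i < n; i++) {
		bool inherits = (policy == HANDOFF_FIRST) ? (i == 0)
							  : (i + 1 == n);

		if (c->freed)
			return false;	/* real code would touch freed memory */

		if (!inherits)
			c->refcount++;	/* extra reference for this skb */

		/* Receiver consumes the skb and drops the reference it carried. */
		if (--c->refcount == 0)
			c->freed = true;
	}
	return true;
}
```

With three skbs, `HANDOFF_FIRST` lets the receiver drop the only reference
after the first skb, so the second skb hits the freed credential.
`HANDOFF_LAST` keeps the sender's reference alive until the final skb has
been queued.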

Thanks also to Sedat, Valdis and Jiri for their extensive testing, both
in discovering the bug and in testing the subsequent fixes.

Acked-by: Tim Chen <tim.c.chen@linux.intel.com>

^ permalink raw reply

