Netdev List
* RE: ixgbe: compilation failed if CONFIG_PCI_IOV isn't set
From: Rose, Gregory V @ 2011-11-07 16:47 UTC (permalink / raw)
  To: Alexander Kolesen, Or Gerlitz
  Cc: Kirsher, Jeffrey T, netdev@vger.kernel.org, Li, Sibai
In-Reply-To: <20111107124641.GB28044@vaio>

> -----Original Message-----
> From: Alexander Kolesen [mailto:kolesen.a@gmail.com]
> Sent: Monday, November 07, 2011 4:47 AM
> To: Or Gerlitz
> Cc: Kirsher, Jeffrey T; netdev@vger.kernel.org; Rose, Gregory V; Li, Sibai
> Subject: Re: ixgbe: compilation failed if CONFIG_PCI_IOV isn't set
> 
> > On Sun, Nov 6, 2011 at 5:18 AM, Jeff Kirsher
> > <jeffrey.t.kirsher@intel.com> wrote:
> > > Was this with the latest kernel from David Miller's net.git tree?  I
> > > just ask because I just pushed a patch from Greg Rose (a couple of
> > > days ago) to resolve compilation errors when CONFIG_PCI_IOV is not
> > > enabled.
> >
> > In my case it was with Linus' tree from GitHub
> >
> > Or.
> It was with Linus' tree, but David Miller's net.git also has this issue.

Yeah, I'll have it fixed soon.  My apologies.

- Greg

^ permalink raw reply

* RE: linux-next: build failure after merge of the origin tree
From: Rose, Gregory V @ 2011-11-07 16:46 UTC (permalink / raw)
  To: Kirsher, Jeffrey T, David Miller
  Cc: sfr@canb.auug.org.au, torvalds@linux-foundation.org,
	linux-next@vger.kernel.org, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org
In-Reply-To: <EAE7BD3E-8965-47A4-B3CA-01686E2602C2@intel.com>



> -----Original Message-----
> From: Kirsher, Jeffrey T
> Sent: Sunday, November 06, 2011 9:30 PM
> To: David Miller
> Cc: sfr@canb.auug.org.au; torvalds@linux-foundation.org;
> linux-next@vger.kernel.org; linux-kernel@vger.kernel.org; Rose, Gregory V;
> netdev@vger.kernel.org
> Subject: Re: linux-next: build failure after merge of the origin tree
> 
> 
> 
> Cheers,
> Jeff
> 
> On Nov 6, 2011, at 19:38, "David Miller" <davem@davemloft.net> wrote:
> 
> > From: Stephen Rothwell <sfr@canb.auug.org.au>
> > Date: Mon, 7 Nov 2011 13:47:06 +1100
> >
> >>> If you just revert the commit in origin from -next, then you will get
> >>> conflicts when you pull the net.git tree in.
> >>
> >> I got no conflicts when I merged in the net tree and can see no fix for
> >> this problem in the net tree.  My current head of the net tree is
> >> 1a6422f "etherh: Add MAINTAINERS entry for etherh".
> >
> > Ok, Jeff please take a look at this and send me a fix soon.
> >
> > Thanks.
> 
> Ok Dave, at this point, I am putting together a patch to revert this fix
> since it appears that more trouble comes with this fix.  I will take a
> look at it quickly before sending out a patch to fix the issue.

My bad...  I fixed a compiler warning that occurred with CONFIG_PCI_IOV turned on and didn't realize that my patch would cause an error when turning it back off.

I'll have it fixed ASAP.

- Greg

^ permalink raw reply

* Re: commit 0bdb0bd0 breaks shutdown/reboot
From: Stephen Hemminger @ 2011-11-07 16:45 UTC (permalink / raw)
  To: Dominik Brodowski; +Cc: davem, netdev
In-Reply-To: <20111107163227.GA17798@isilmar-3.linta.de>

On Mon, 7 Nov 2011 17:32:27 +0100
Dominik Brodowski <linux@dominikbrodowski.net> wrote:

> On Mon, Nov 07, 2011 at 08:13:16AM -0800, Stephen Hemminger wrote:
> > Does this help?
> 
> Unfortunately, no. Reboots still fail.
> 
> > 
> > Subject: sky2: block irq's on down
> > 
> > Need to block IRQ's from phy changes to prevent stray IRQ's when
> > device is down.
> > 
> > Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
> 
> Thanks anyway,
> 	Dominik

Are you using Wake On Lan?

^ permalink raw reply

* data corruption in skge hardware
From: Mikulas Patocka @ 2011-11-07 16:42 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev

Hi

I found data corruption with an skge network card.

The card is this: "03:06.0 Ethernet controller: 3Com Corporation 3c940 
10/100/1000Base-T [Marvell] (rev 10)"

The machine is two quad core Opterons with HT2000 north bridge and HT1000 
south bridge.

When "scatter-gather" and "generic-segmentation-offload" are enabled, the 
card sends out corrupted packets.

It normally manifests as an ssh connection drop once every few days, but I
found a workload that triggers this bug quickly.

I ran tcpdump on both the sending and receiving machines and caught the
packet corruption:

correct packet (on the sending machine):
19:03:21.131836 IP hydra.ssh > phoebe.58913: Flags [P.], seq 53712:53808, 
ack 1, win 193, options [nop,nop,TS val 8677173 ecr 1211608], length 96
        0x0000:  4510 0094 c7bf 4000 4006 f12d c0a8 8007
        0x0010:  c0a8 800e 0016 e621 2d64 84e6 1fc2 3f5b
        0x0020:  8018 00c1 81ed 0000 0101 080a 0084 6735
        0x0030:  0012 7cd8 4301 4af9 87c9 d2b4 8ba6 aedb
        0x0040:  0572 1738 93db 789c 634b 4386 d013 db27
        0x0050:  258b 6fa6 743c d429 a5e1 162f 2721 19bf
        0x0060:  6669 a5c3 6bea 89ec a635 b8b4 8727 38c1
        0x0070:  139f 5989 781b 49dd 79f5 4dfe 78ac ecb0
        0x0080:  546c 33e0 0953 04bc 0647 a9d4 2fc4 cba0
        0x0090:  44b2 3b01

incorrect packet (on the receiving machine):
19:03:21.133174 IP hydra.ssh > phoebe.58913: Flags [P.], seq 53712:53808, 
ack 1, win 193, options [nop,nop,TS val 8677173 ecr 1211608], length 96
        0x0000:  4510 0094 c7bf 4000 4006 f12d c0a8 8007
        0x0010:  c0a8 800e 0016 e621 2d64 84e6 1fc2 3f5b
        0x0020:  8018 00c1 6aa4 0000 0101 080a 0084 6735
        0x0030:  0012 7cd8 0000 0000 0000 0000 0010 0000
        0x0040:  0000 0000 0000 0000 0000 0000 0000 0000
        0x0050:  0000 0000 0000 0000 0000 00c0 dc92 4702
        0x0060:  88ff ff00 0000 0000 0000 0000 0000 0000
        0x0070:  0000 0000 0000 0000 0000 0000 0000 0000
        0x0080:  0000 0000 0000 0000 0000 0000 0000 0000
        0x0090:  0000 00e0

Obviously, scatter-gather doesn't work: the header is correct, but the
packet body was likely read from random memory.
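
For reference, the two dumps above can be compared mechanically. The
following is a minimal sketch in C, not part of the original report: the
arrays hold the first 64 bytes of each capture as transcribed from the hex
dumps, and the names sent_head, received_head and next_diff() are
hypothetical. In this transcription the first differing byte is at offset
0x24 (the TCP checksum field), and the payload starts to differ at offset
0x34, right after the 20-byte IP header and the 32-byte TCP header.

```c
#include <stddef.h>

/* First 64 bytes of the packet as captured on the sending machine. */
const unsigned char sent_head[64] = {
	0x45, 0x10, 0x00, 0x94, 0xc7, 0xbf, 0x40, 0x00,
	0x40, 0x06, 0xf1, 0x2d, 0xc0, 0xa8, 0x80, 0x07,
	0xc0, 0xa8, 0x80, 0x0e, 0x00, 0x16, 0xe6, 0x21,
	0x2d, 0x64, 0x84, 0xe6, 0x1f, 0xc2, 0x3f, 0x5b,
	0x80, 0x18, 0x00, 0xc1, 0x81, 0xed, 0x00, 0x00,
	0x01, 0x01, 0x08, 0x0a, 0x00, 0x84, 0x67, 0x35,
	0x00, 0x12, 0x7c, 0xd8, 0x43, 0x01, 0x4a, 0xf9,
	0x87, 0xc9, 0xd2, 0xb4, 0x8b, 0xa6, 0xae, 0xdb,
};

/* First 64 bytes of the same packet as captured on the receiving machine. */
const unsigned char received_head[64] = {
	0x45, 0x10, 0x00, 0x94, 0xc7, 0xbf, 0x40, 0x00,
	0x40, 0x06, 0xf1, 0x2d, 0xc0, 0xa8, 0x80, 0x07,
	0xc0, 0xa8, 0x80, 0x0e, 0x00, 0x16, 0xe6, 0x21,
	0x2d, 0x64, 0x84, 0xe6, 0x1f, 0xc2, 0x3f, 0x5b,
	0x80, 0x18, 0x00, 0xc1, 0x6a, 0xa4, 0x00, 0x00,
	0x01, 0x01, 0x08, 0x0a, 0x00, 0x84, 0x67, 0x35,
	0x00, 0x12, 0x7c, 0xd8, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x10, 0x00, 0x00,
};

/*
 * Return the index of the first byte at or after 'start' where a and b
 * differ, or 'len' if the remaining bytes are identical.
 */
size_t next_diff(const unsigned char *a, const unsigned char *b,
		 size_t len, size_t start)
{
	size_t i;

	for (i = start; i < len; i++)
		if (a[i] != b[i])
			return i;
	return len;
}
```

Calling next_diff(sent_head, received_head, 64, 0) and then restarting the
search just past each hit walks through the differing regions one by one.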

I tried to use "clflush" instruction on the transmit descriptor and the 
packet body to test if it is a cache-coherency issue, but the corruption 
was still there.

I tried to limit memory to 2G to test if it was a problem with high 
memory, but the corruption was still there.

I tried older kernels (as far back as 2.6.34); the corruption was still
there, but it took much more time to trigger it with old kernels.


Do you have other reports of data corruption with skge hardware? Shouldn't 
the driver set "scatter-gather" off by default because it is unreliable?

Mikulas

^ permalink raw reply

* Bug? GRE tunnel periodically won't transmit some packets
From: Chris Siebenmann @ 2011-11-07 16:21 UTC (permalink / raw)
  To: netdev; +Cc: cks

 I have a weird problem where a GRE tunnel periodically won't transmit
some (TCP) packets, while at the same time it will transmit others just
fine. This is happening in the current kernel.org git head kernel as
well as earlier ones.

 The networking environment is a GRE tunnel over IPSec in tunnel mode
('esp/tunnel/...') over a DSL PPPoE link. What I observe is that
periodically outbound SSH connections stall early in the protocol
negotiation, and other TCP connections can similarly stall. Sometimes
they recover and sometimes they time out. The problem is pretty
reproducible and regular, although not constant (sometimes the affected
packets get through right away).

 I have tcpdump'd both the GRE tunnel device and the underlying DSL
PPPoE device and during a stall, the GRE tcpdump will show packets being
sent that do not appear on the DSL PPPoE link. All of the packets that
I've seen stalling have had 500 data octets.

Typical packets are:
	IP 128.100.3.52.52063 > 128.100.3.51.ssh: Flags [.], seq 22:522, ack 22, win 91, options [nop,nop,TS val 143020 ecr 966040433], length 500

(here 128.100.3.52 is the GRE tunnel IP address of the machine
experiencing problems)

or ttcp:
	IP 128.100.3.52.46585 > 128.100.3.51.5001: Flags [.], seq 1:501, ack 1, win 91, options [nop,nop,TS val 729200 ecr 979199256], length 500

Ttcp had a whole run of 'length 500' packets fail to go through. SSH
will actually successfully transmit later (different-length) packets,
eg:
	128.100.3.52.52063 > 128.100.3.51.ssh: Flags [P.], seq 522:926, ack 22, win 91, options [nop,nop,TS val 143037 ecr 966040450], length 404

 The DSL PPPoE link has an MTU of 1492 and the GRE tunnel has an MTU of
1200 (on both ends). As far as I can tell they do pass packets of this
size. *However*, on kernels that display this problem tracepath and 'ip
route show table cache' both report that the GRE tunnel has a path MTU
of 854 going from 128.100.3.52 to 128.100.3.51; however, 128.100.3.51
sees a pmtu of 1200 for the path to 128.100.3.52.

 The machine experiencing these problems is a 64-bit x86_64 Fedora 15
machine with various kernels. The problem does not happen with the
current Fedora 14 kernel (nominally 2.6.35.14); it does happen with the
Fedora 15 kernel ('2.6.40.6' aka some version of 3.0.0), the Fedora 16
kernel (some version of 3.1.0) and on the current kernel.org git head
as of last night. I am not running NetworkManager; all networking is
statically configured and not changing during operation, and my IPSec
setup is statically keyed[*].

 I would be happy to run any debugging tests or give any further
information that people want. Should I try a different kernel git
repo than Linus's kernel.org one?

 Thanks in advance.

(While I'm reading the mailing list I'm not directly subscribed to it,
so copying me on replies will make sure that I see them immediately.)

	- cks
[*: I'm aware that this is not ideal from a security perspective since
    it relies on me manually rekeying everything every so often.]

^ permalink raw reply

* Re: commit 0bdb0bd0 breaks shutdown/reboot
From: Dominik Brodowski @ 2011-11-07 16:32 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: davem, netdev
In-Reply-To: <20111107081316.2e1cc2cb@nehalam.linuxnetplumber.net>

On Mon, Nov 07, 2011 at 08:13:16AM -0800, Stephen Hemminger wrote:
> Does this help?

Unfortunately, no. Reboots still fail.

> 
> Subject: sky2: block irq's on down
> 
> Need to block IRQ's from phy changes to prevent stray IRQ's when
> device is down.
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

Thanks anyway,
	Dominik

^ permalink raw reply

* Re: commit 0bdb0bd0 breaks shutdown/reboot
From: Stephen Hemminger @ 2011-11-07 16:13 UTC (permalink / raw)
  To: Dominik Brodowski; +Cc: davem, netdev
In-Reply-To: <20111107153119.GA11724@comet.dominikbrodowski.net>

Does this help?

Subject: sky2: block irq's on down

Need to block IRQ's from phy changes to prevent stray IRQ's when
device is down.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>


--- a/drivers/net/ethernet/marvell/sky2.c	2011-11-04 15:01:51.310888300 -0700
+++ b/drivers/net/ethernet/marvell/sky2.c	2011-11-07 08:10:05.065118917 -0800
@@ -2106,15 +2106,20 @@ static int sky2_down(struct net_device *
 
 	netif_info(sky2, ifdown, dev, "disabling interface\n");
 
-	/* Disable port IRQ */
-	sky2_write32(hw, B0_IMSK,
-		     sky2_read32(hw, B0_IMSK) & ~portirq_msk[sky2->port]);
-	sky2_read32(hw, B0_IMSK);
-
 	if (hw->ports == 1) {
+		sky2_write32(hw, B0_IMSK, 0);
+		sky2_read32(hw, B0_IMSK);
+
 		napi_disable(&hw->napi);
 		free_irq(hw->pdev->irq, hw);
 	} else {
+		/* Disable port IRQ */
+		u32 imsk = sky2_read32(hw, B0_IMSK);
+
+		imsk &= ~portirq_msk[sky2->port];
+		sky2_write32(hw, B0_IMSK, imsk);
+		sky2_read32(hw, B0_IMSK);
+
 		synchronize_irq(hw->pdev->irq);
 		napi_synchronize(&hw->napi);
 	}
@@ -5017,19 +5022,19 @@ static void __devexit sky2_remove(struct
 	for (i = hw->ports-1; i >= 0; --i)
 		unregister_netdev(hw->dev[i]);
 
-	sky2_write32(hw, B0_IMSK, 0);
-	sky2_read32(hw, B0_IMSK);
-
 	sky2_power_aux(hw);
 
-	sky2_write8(hw, B0_CTST, CS_RST_SET);
-	sky2_read8(hw, B0_CTST);
-
 	if (hw->ports > 1) {
+		sky2_write32(hw, B0_IMSK, 0);
+		sky2_read32(hw, B0_IMSK);
+
 		napi_disable(&hw->napi);
 		free_irq(pdev->irq, hw);
 	}
 
+	sky2_write8(hw, B0_CTST, CS_RST_SET);
+	sky2_read8(hw, B0_CTST);
+
 	if (hw->flags & SKY2_HW_USE_MSI)
 		pci_disable_msi(pdev);
 	pci_free_consistent(pdev, hw->st_size * sizeof(struct sky2_status_le),
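
The interrupt-mask handling that the sky2_down() hunk above introduces can
be summarised as a pure function. This is only a sketch of the logic as it
reads from the diff, not driver code: the per-port bit values in
port_irq_mask are hypothetical placeholders (the real portirq_msk values
are defined by the driver), and imsk_after_down() is an illustrative name.

```c
#include <stdint.h>

/* Hypothetical per-port interrupt bits, for illustration only. */
static const uint32_t port_irq_mask[2] = { 0x0000000fu, 0x00000f00u };

/*
 * Value written to the interrupt mask register when one port goes down.
 * On a single-port card everything is masked; on a dual-port card only
 * the bits of the port going down are cleared, so the sibling port keeps
 * receiving its interrupts.
 */
uint32_t imsk_after_down(uint32_t imsk, int ports, int port)
{
	if (ports == 1)
		return 0;
	return imsk & ~port_irq_mask[port];
}
```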

^ permalink raw reply

* Re: Contributing for the first time
From: Stephen Hemminger @ 2011-11-07 16:11 UTC (permalink / raw)
  To: Daniel Baluta; +Cc: Alexandru Juncu, kernelnewbies, Greg Freemyer, netdev
In-Reply-To: <CAEnQRZA=dwt6TjcBacYaj0ORdi5QjBEXUaPA4GT2tm2oE_LFew@mail.gmail.com>

On Mon, 7 Nov 2011 15:17:38 +0200
Daniel Baluta <dbaluta@ixiacom.com> wrote:

> On Mon, Nov 7, 2011 at 3:05 PM, Greg Freemyer <greg.freemyer@gmail.com> wrote:
> > Alexandru Juncu <alex.juncu@rosedu.org> wrote:
> >
> >>Hello!
> >>
> >>I have been a linux user for many years, and mostly on the networking
> >>side. And I would like to start contributing somehow to the linux
> >>community (in some other way than just promoting it). I guess that  I
> >>should start small with something like man pages.
> >>
> >>And there is something that really has been bugging me for some time.
> >>I'm an iproute2 user and I teach linux courses and show people how to
> >>use it. On most questions from my students about new commands, I
> >>redirect them to the man pages and to the Examples section of that
> >>page. iproute2 doesn't have such examples and I always wish it did.
> >>
> >>Do you think that if I submit a patch to the man pages, adding some
> >>examples of how to use the ip command, will it get accepted? Because
> >>this sounds like a simple thing and it's hard to believe that someone
> >>else didn't try to do this before. What do you think?
> >>Alexandru Juncu
> 
> > I think it will be accepted, but few people like to work on the man pages.
> >
> > Since this is a userspace package you will need to figure out who the maintainer is and if there is a mailinglist they use to discuss/support the package.
> >
> > Then submit your patch there.
> 
> According to [1] maintainer for iproute2 is Stephen Hemminger (CC'ed) and I
> think patches should be sent to netdev mailing list ([2]).
> 
> thanks,
> Daniel.
> 
> [1] http://www.linuxfoundation.org/collaborate/workgroups/networking/iproute2
> [2] http://vger.kernel.org/vger-lists.html#netdev

It might also be worth it to break some of the large manual pages into
clearer sections.

^ permalink raw reply

* Re: [PATCH] drivers/net/usb/asix:  resync from vendor's copy
From: Michal Marek @ 2011-11-07 16:09 UTC (permalink / raw)
  To: Mark Lord; +Cc: David Miller, netdev, linux-kernel
In-Reply-To: <4EB19BBE.5050602@teksavvy.com>

On 2.11.2011 20:36, Mark Lord wrote:
> +static char driver_version[] =
> +	"ASIX USB Ethernet Adapter: v" DRIVER_VERSION \
> +	" " __TIME__ " " __DATE__ "\n";

Please drop __TIME__ and __DATE__; they make each build produce
different object files.
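
As an illustration of the request, a version string without the build
timestamp could look like the sketch below. It is not taken from the
patch: the DRIVER_VERSION value and the asix_driver_banner() helper are
hypothetical, chosen only to make the fragment self-contained.

```c
/* Hypothetical version number; the real one comes from the driver. */
#define DRIVER_VERSION "4.1.0"

/*
 * No __TIME__ or __DATE__ here, so building the same source twice
 * yields the same string and therefore the same object file contents.
 */
static const char driver_version[] =
	"ASIX USB Ethernet Adapter: v" DRIVER_VERSION "\n";

/* Accessor so the banner can be printed once, e.g. at probe time. */
const char *asix_driver_banner(void)
{
	return driver_version;
}
```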

Michal

^ permalink raw reply

* Re: [PATCH 6/7] fsl_pmc: Add API to enable device as wakeup event source
From: Scott Wood @ 2011-11-07 15:49 UTC (permalink / raw)
  To: Zhao Chenhui; +Cc: linuxppc-dev, netdev, leoli
In-Reply-To: <20111107112236.GB16470@localhost.localdomain>

On 11/07/2011 05:22 AM, Zhao Chenhui wrote:
> On Fri, Nov 04, 2011 at 04:14:25PM -0500, Scott Wood wrote:
>> On 11/04/2011 07:39 AM, Zhao Chenhui wrote:
>>> +	if (enable && !device_may_wakeup(&pdev->dev))
>>> +		return -EINVAL;
>>> +
>>> +	clk_np = of_parse_phandle(pdev->dev.of_node, "clk-handle", 0);
>>> +	if (!clk_np)
>>> +		return -EINVAL;
>>> +
>>> +	pmcdr_mask = (u32 *)of_get_property(clk_np, "fsl,pmcdr-mask", NULL);
>>> +	if (!pmcdr_mask) {
>>> +		ret = -EINVAL;
>>> +		goto out;
>>> +	}
>>> +
>>> +	/* clear to enable clock in low power mode */
>>> +	if (enable)
>>> +		clrbits32(&pmc_regs->pmcdr, *pmcdr_mask);
>>> +	else
>>> +		setbits32(&pmc_regs->pmcdr, *pmcdr_mask);
>>
>> We should probably initialize PMCDR to all bits set (or at least all
>> ones we know are valid) -- the default should be "not a wakeup source".
> 
> I think it should be initialized in u-boot.

I don't see it.  If you mean you think this should be added to U-Boot, I
disagree.  U-Boot does not use this, and we should not add gratuitous
U-Boot dependencies to Linux -- especially in cases where there are
existing U-Boots in use for relevant boards, that do not have this.

>>> +/**
>>> + * pmc_enable_lossless - enable lossless ethernet in low power mode
>>> + * @enable: True to enable event generation; false to disable
>>> + */
>>> +void pmc_enable_lossless(int enable)
>>> +{
>>> +	if (enable && has_lossless)
>>> +		setbits32(&pmc_regs->pmcsr, PMCSR_LOSSLESS);
>>> +	else
>>> +		clrbits32(&pmc_regs->pmcsr, PMCSR_LOSSLESS);
>>> +}
>>> +EXPORT_SYMBOL_GPL(pmc_enable_lossless);
>>> +#endif
>>
>> Won't we overwrite this later?
>>
>> -Scott
> 
> Do you have any idea?

Set a flag that the code that enters (deep) sleep can use.

Also, rename function to mpc85xx_pmc_set_lossless_ethernet().
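
A rough sketch of the flag approach, under stated assumptions: the setter
only records the request, and the code that enters sleep folds it into
whatever value it is about to write, so the setting cannot be overwritten
in between. The bit value PMCSR_LOSSLESS_BIT and both function names are
hypothetical, and the register is modelled as a plain integer rather than
real MMIO.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit position, for illustration only. */
#define PMCSR_LOSSLESS_BIT 0x00400000u

/* Remembered request; consumed by the sleep-entry path. */
static bool lossless_wanted;

/* Record whether lossless ethernet should be active during sleep. */
void set_lossless_ethernet(bool enable, bool has_lossless)
{
	lossless_wanted = enable && has_lossless;
}

/*
 * Called by the sleep-entry code with the PMCSR value it intends to
 * write; returns that value with the lossless bit set or cleared
 * according to the recorded request.
 */
uint32_t pmcsr_with_lossless(uint32_t base)
{
	if (lossless_wanted)
		return base | PMCSR_LOSSLESS_BIT;
	return base & ~PMCSR_LOSSLESS_BIT;
}
```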

-Scott

^ permalink raw reply

* Re: [PATCH net-next 0/2] 802.1ad S-VLAN support
From: David Lamparter @ 2011-11-07 15:48 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: David Lamparter, netdev
In-Reply-To: <1320678704.3020.33.camel@bwh-desktop>

On Mon, Nov 07, 2011 at 03:11:44PM +0000, Ben Hutchings wrote:
> On Sat, 2011-11-05 at 17:54 +0100, David Lamparter wrote:
> > this kernel patch, together with the iproute2 userspace support,
> > allows creating 802.1ad S-VLAN devices.
[...]
> We definitely need to think about how MTU/MRU are configured when
> multiple VLAN tags are used, though I don't think it's essential to do
> before this goes in.  To be slightly more blunt than your documentation,
> our current handling of MTU/MRU and VLANs is a botch.

I fully agree, both on the botch and on fixing it separately.

> Do you have any plan to improve that?

Yes, what I'd like to do is introduce a new field into struct net_device
that tracks the hardware Max Frame Size; it'd be a read-only field
that's initialized once by the driver. (The field would only be used by
ethernet-like devices.) To get things started easier, the field can have
a default value like 0xffff, so if the driver doesn't set it we end up
with the same old nothing-checked behaviour.

MTU change requests from userspace are then validated against the MFS
field for ethernet devices.

Each VLAN device created will inherit its parent's value minus 4 (minus
16 for 802.1ah Mac-in-Mac, I'm working on that currently).
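
To make the arithmetic concrete, here is a minimal sketch of the proposed
checks. It assumes that the MFS counts the 14-byte untagged Ethernet
header but not the FCS, which matches the "1514 bytes" figure used for a
1500-byte MTU; the function and macro names are hypothetical and nothing
here is existing kernel API.

```c
#include <stdbool.h>

#define ETH_HEADER_BYTES 14	/* assumed: untagged Ethernet header */
#define VLAN_TAG_BYTES    4	/* one 802.1Q or 802.1ad tag */

/* MFS a VLAN device would inherit: the parent's value minus one tag. */
unsigned int vlan_child_mfs(unsigned int parent_mfs)
{
	if (parent_mfs < VLAN_TAG_BYTES)
		return 0;
	return parent_mfs - VLAN_TAG_BYTES;
}

/* An MTU request is valid if header plus payload fit into the MFS. */
bool mtu_fits(unsigned int mfs, unsigned int mtu)
{
	return mtu + ETH_HEADER_BYTES <= mfs;
}
```

With these assumptions, a device with an MFS of 1518 yields a VLAN child
with an MFS of 1514, which still accepts an MTU of 1500, while a VLAN on
top of a 1514-byte device would have to settle for an MTU of 1496.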

A nice side-effect would be that we can export this value in sysfs so
the admin can easily see the hardware limitations. No more trial & error
to find that r8169 (or was it forcedeth?) has the totally weird value of
7200... ("almost-jumbo-frames-but-not-quite")


Anyway, I'm still in the "design" phase with regards to two points:

 - bridge - is the MFS field allowed to change when we add/remove
   devices? Is there a notification e.g. for VLANs on top of the bridge?

 - "speshul" hardware. I think I saw chips that support "1514 bytes" and
   "1514 bytes + 1 vlan tag" but not "1518 bytes". If this is indeed a
   case we want to support (no idea if it is), we could add a separate
   "extra_vlans" field that is 1 for those devices. (It would only be
   used for protocol-0x8100 802.1Q vlans).

> Or to allow use of offload features for multiple-tagged packets?

Hm. Well... I have yet to do quite a bit of reading to understand all of
the offload mechanisms. What the 802.1Q code currently does is

	dev->hw_features = NETIF_F_ALL_CSUM | NETIF_F_SG |
			   NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
			   NETIF_F_HIGHDMA | NETIF_F_SCTP_CSUM |
			   NETIF_F_ALL_FCOE;

which is pretty much the "basic" set. I don't see why any of that should
differ for 802.1ad (or even 802.1ah), but my understanding is barely
enough to tell that these flags should work for 802.1ad.


Comments very welcome,


-David

^ permalink raw reply

* [PATCH] route: fix ICMP secure_redirects
From: Flavio Leitner @ 2011-11-07 15:41 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Flavio Leitner

It should accept ICMP redirects from any host and not
just from gateways when secure_redirects is disabled.

Signed-off-by: Flavio Leitner <fbl@redhat.com>
---
 net/ipv4/route.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 155138d..dd6937ec 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1347,7 +1347,8 @@ void ip_rt_redirect(__be32 old_gw, __be32 daddr, __be32 new_gw,
 				continue;
 
 			if (rt->dst.error || rt->dst.dev != dev ||
-			    rt->rt_gateway != old_gw) {
+			    (IN_DEV_SEC_REDIRECTS(in_dev) &&
+			    rt->rt_gateway != old_gw)) {
 				ip_rt_put(rt);
 				continue;
 			}
-- 
1.7.6
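
The effect of the changed condition can be shown in isolation. The sketch
below mirrors only the boolean expression from the hunk above; struct
route_view and redirect_skips_route() are hypothetical stand-ins for the
kernel's route entry and loop body, and the device is reduced to an
integer id.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the fields the check looks at. */
struct route_view {
	int error;		/* models rt->dst.error */
	int dev_id;		/* models rt->dst.dev */
	uint32_t gateway;	/* models rt->rt_gateway */
};

/*
 * True if the cached route is skipped for this redirect. With the
 * patch, the gateway comparison only applies when secure_redirects
 * is enabled.
 */
bool redirect_skips_route(const struct route_view *rt, int dev_id,
			  uint32_t old_gw, bool secure_redirects)
{
	return rt->error || rt->dev_id != dev_id ||
	       (secure_redirects && rt->gateway != old_gw);
}
```

With secure_redirects off, a redirect from a host other than the current
gateway is no longer rejected by this check; error and device mismatches
still are.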

^ permalink raw reply related

* commit 0bdb0bd0 breaks shutdown/reboot
From: Dominik Brodowski @ 2011-11-07 15:31 UTC (permalink / raw)
  To: shemminger, davem; +Cc: netdev

Hey,

since commit 0bdb0bd0 -- sky2: manage irq better on single port card -- my
laptop fails to reboot or shutdown. Reverting this commit fixes the issue.

03:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8039 PCI-E Fast Ethernet Controller (rev 15)
        Subsystem: Samsung Electronics Co Ltd Device c510
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 43
        Region 0: Memory at f0200000 (64-bit, non-prefetchable) [size=16K]
        Region 2: I/O ports at 2000 [size=256]
        Capabilities: [48] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] Vital Product Data
                Product Name: Marvell Yukon 88E8039 Fast Ethernet Controller
                Read-only fields:
                        [PN] Part number: Yukon 88E8039
                        [EC] Engineering changes: Rev. 1.5
			<snip>
                        [CP] Extended capability: 01 10 cc 03
                        [RV] Reserved: checksum good, 12 byte(s) reserved
                Read/write fields:
                        [RW] Read-write area: 121 byte(s) free
                End
        Capabilities: [5c] MSI: Enable+ Count=1/2 Maskable- 64bit+
                Address: 00000000fee0300c  Data: 4171
        Capabilities: [e0] Express (v1) Legacy Endpoint, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Latency L0 <256ns, L1 unlimited
                        ClockPM- Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 128 bytes Disabled- Retrain- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                AERCap: First Error Pointer: 1f, GenCap- CGenEn- ChkCap- ChkEn-
        Kernel driver in use: sky2

Best,
	Dominik

^ permalink raw reply

* [PATCH v5 10/10] Disable task moving when using kernel memory accounting
From: Glauber Costa @ 2011-11-07 15:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, kamezawa.hiroyu, ebiederm, davem, gthelen, netdev,
	linux-mm, kirill, avagin, devel, eric.dumazet, Glauber Costa
In-Reply-To: <1320679595-21074-1-git-send-email-glommer@parallels.com>

Since this code is still experimental, we are leaving the exact
details of how to move tasks between cgroups when kernel memory
accounting is used as future work.

For now, we simply disallow movement if there is any pending
accounted memory.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Hiroyuki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   23 ++++++++++++++++++++++-
 1 files changed, 22 insertions(+), 1 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b532f91..248f92d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5690,10 +5690,19 @@ static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 {
 	int ret = 0;
 	struct mem_cgroup *mem = mem_cgroup_from_cont(cgroup);
+	struct mem_cgroup *from = mem_cgroup_from_task(p);
+
+#if defined(CONFIG_CGROUP_MEM_RES_CTLR_KMEM) && defined(CONFIG_INET)
+	if (from != mem && !mem_cgroup_is_root(from) &&
+	    res_counter_read_u64(&from->tcp.tcp_memory_allocated, RES_USAGE)) {
+		printk(KERN_WARNING "Can't move tasks between cgroups: "
+			"Kernel memory held. task: %s\n", p->comm);
+		return 1;
+	}
+#endif
 
 	if (mem->move_charge_at_immigrate) {
 		struct mm_struct *mm;
-		struct mem_cgroup *from = mem_cgroup_from_task(p);
 
 		VM_BUG_ON(from == mem);
 
@@ -5861,6 +5870,18 @@ static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 				struct cgroup *cgroup,
 				struct task_struct *p)
 {
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgroup);
+	struct mem_cgroup *from = mem_cgroup_from_task(p);
+
+#if defined(CONFIG_CGROUP_MEM_RES_CTLR_KMEM) && defined(CONFIG_INET)
+	if (from != mem && !mem_cgroup_is_root(from) &&
+	    res_counter_read_u64(&from->tcp.tcp_memory_allocated, RES_USAGE)) {
+		printk(KERN_WARNING "Can't move tasks between cgroups: "
+			"Kernel memory held. task: %s\n", p->comm);
+		return 1;
+	}
+#endif
+
 	return 0;
 }
 static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
-- 
1.7.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH v5 09/10] Display current tcp memory allocation in kmem cgroup
From: Glauber Costa @ 2011-11-07 15:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, kamezawa.hiroyu, ebiederm, davem, gthelen, netdev,
	linux-mm, kirill, avagin, devel, eric.dumazet, Glauber Costa
In-Reply-To: <1320679595-21074-1-git-send-email-glommer@parallels.com>

This patch introduces kmem.tcp.max_usage_in_bytes file, living in the
kmem_cgroup filesystem. The root cgroup will display a value equal
to RESOURCE_MAX. This is to avoid introducing any locking schemes in
the network paths when cgroups are not being actively used.

All others will see the maximum memory ever used by this cgroup.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: David S. Miller <davem@davemloft.net>
CC: Hiroyuki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
---
 mm/memcontrol.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9394224..b532f91 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -540,6 +540,12 @@ static struct cftype tcp_files[] = {
 		.trigger = mem_cgroup_reset,
 		.read_u64 = mem_cgroup_read,
 	},
+	{
+		.name = "kmem.tcp.max_usage_in_bytes",
+		.private = MEMFILE_PRIVATE(_KMEM_TCP, RES_MAX_USAGE),
+		.trigger = mem_cgroup_reset,
+		.read_u64 = mem_cgroup_read,
+	},
 };
 
 static void tcp_create_cgroup(struct mem_cgroup *cg, struct cgroup_subsys *ss)
@@ -4254,6 +4260,10 @@ static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
 	case RES_MAX_USAGE:
 		if (type == _MEM)
 			res_counter_reset_max(&mem->res);
+#if defined(CONFIG_CGROUP_MEM_RES_CTLR_KMEM) && defined(CONFIG_INET)
+		else if (type == _KMEM_TCP)
+			res_counter_reset_max(&mem->tcp.tcp_memory_allocated);
+#endif
 		else
 			res_counter_reset_max(&mem->memsw);
 		break;
-- 
1.7.6.4


^ permalink raw reply related

* [PATCH v5 08/10] Display current tcp memory allocation in kmem cgroup
From: Glauber Costa @ 2011-11-07 15:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, kamezawa.hiroyu, ebiederm, davem, gthelen, netdev,
	linux-mm, kirill, avagin, devel, eric.dumazet, Glauber Costa
In-Reply-To: <1320679595-21074-1-git-send-email-glommer@parallels.com>

This patch introduces kmem.tcp.failcnt file, living in the
kmem_cgroup filesystem. Following the pattern in the other
memcg resources, this file keeps a counter of how many times
allocation failed due to limits being hit in this cgroup.
The root cgroup will always show a failcnt of 0.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: David S. Miller <davem@davemloft.net>
CC: Hiroyuki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
---
 mm/memcontrol.c |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 51b5a55..9394224 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -517,6 +517,7 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
 			    const char *buffer);
 
 static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft);
+static int mem_cgroup_reset(struct cgroup *cont, unsigned int event);
 /*
  * We need those things internally in pages, so don't reuse
  * mem_cgroup_{read,write}
@@ -533,6 +534,12 @@ static struct cftype tcp_files[] = {
 		.read_u64 = mem_cgroup_read,
 		.private = MEMFILE_PRIVATE(_KMEM_TCP, RES_USAGE),
 	},
+	{
+		.name = "kmem.tcp.failcnt",
+		.private = MEMFILE_PRIVATE(_KMEM_TCP, RES_FAILCNT),
+		.trigger = mem_cgroup_reset,
+		.read_u64 = mem_cgroup_read,
+	},
 };
 
 static void tcp_create_cgroup(struct mem_cgroup *cg, struct cgroup_subsys *ss)
@@ -4134,6 +4141,8 @@ static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
 		if (mem_cgroup_is_root(mem)) {
 			if (name == RES_USAGE)
 				val = atomic_long_read(&tcp_memory_allocated) << PAGE_SHIFT;
+			else if (name == RES_FAILCNT)
+				val = 0;
 			else
 				val = RESOURCE_MAX;
 		} else
@@ -4251,6 +4260,10 @@ static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
 	case RES_FAILCNT:
 		if (type == _MEM)
 			res_counter_reset_failcnt(&mem->res);
+#if defined(CONFIG_CGROUP_MEM_RES_CTLR_KMEM) && defined(CONFIG_INET)
+		else if (type == _KMEM_TCP)
+			res_counter_reset_failcnt(&mem->tcp.tcp_memory_allocated);
+#endif
 		else
 			res_counter_reset_failcnt(&mem->memsw);
 		break;
-- 
1.7.6.4
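
The root-cgroup special case in mem_cgroup_read() after this patch can be
restated as a small standalone function. This is a sketch for
illustration only: the enum ordering, the 4 KiB page size and the
RESOURCE_MAX_ASSUMED stand-in are assumptions, and root_tcp_read() is a
hypothetical name.

```c
#include <stdint.h>

#define PAGE_SHIFT_ASSUMED   12				/* assumed 4 KiB pages */
#define RESOURCE_MAX_ASSUMED ((uint64_t)INT64_MAX)	/* stand-in value */

/* Hypothetical ordering; only the names follow the patch. */
enum res_member { RES_USAGE, RES_MAX_USAGE, RES_LIMIT, RES_FAILCNT };

/*
 * Value reported for the root cgroup, which has no res_counter of its
 * own: usage comes from the global page count, failcnt is always 0,
 * and everything else reads as the maximum.
 */
uint64_t root_tcp_read(enum res_member name, uint64_t pages_allocated)
{
	if (name == RES_USAGE)
		return pages_allocated << PAGE_SHIFT_ASSUMED;
	if (name == RES_FAILCNT)
		return 0;
	return RESOURCE_MAX_ASSUMED;
}
```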


^ permalink raw reply related

* [PATCH v5 07/10] Display current tcp memory allocation in kmem cgroup
From: Glauber Costa @ 2011-11-07 15:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, kamezawa.hiroyu, ebiederm, davem, gthelen, netdev,
	linux-mm, kirill, avagin, devel, eric.dumazet, Glauber Costa
In-Reply-To: <1320679595-21074-1-git-send-email-glommer@parallels.com>

This patch introduces the kmem.tcp.usage_in_bytes file, living in the
kmem_cgroup filesystem. It is a simple read-only file that displays the
amount of tcp buffer memory currently consumed by the cgroup.
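
A minimal userspace sketch of reading such a file, for illustration only.
The helper name read_cgroup_u64() is hypothetical and not part of this
patch, and the caller has to supply the path, which depends on where the
memory cgroup is mounted.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of the patch): read a single unsigned
 * 64-bit value from a cgroup control file such as
 * memory.kmem.tcp.usage_in_bytes.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
static int read_cgroup_u64(const char *path, uint64_t *val)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;

	ok = (fscanf(f, "%" SCNu64, val) == 1);
	fclose(f);

	return ok ? 0 : -1;
}
```

A caller would pass the full path of the control file inside the cgroup
directory it is interested in and check the return value before using
the result.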

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: David S. Miller <davem@davemloft.net>
CC: Hiroyuki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
---
 Documentation/cgroups/memory.txt |    1 +
 mm/memcontrol.c                  |   14 +++++++++++---
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index c1db134..00f1a88 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -79,6 +79,7 @@ Brief summary of control files.
  memory.independent_kmem_limit	 # select whether or not kernel memory limits are
 				   independent of user limits
  memory.kmem.tcp.limit_in_bytes  # set/show hard limit for tcp buf memory
+ memory.kmem.tcp.usage_in_bytes  # show current tcp buf memory allocation
 
 1. History
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ee122a6..51b5a55 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -528,6 +528,11 @@ static struct cftype tcp_files[] = {
 		.read_u64 = mem_cgroup_read,
 		.private = MEMFILE_PRIVATE(_KMEM_TCP, RES_LIMIT),
 	},
+	{
+		.name = "kmem.tcp.usage_in_bytes",
+		.read_u64 = mem_cgroup_read,
+		.private = MEMFILE_PRIVATE(_KMEM_TCP, RES_USAGE),
+	},
 };
 
 static void tcp_create_cgroup(struct mem_cgroup *cg, struct cgroup_subsys *ss)
@@ -4126,9 +4131,12 @@ static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
 #if defined(CONFIG_CGROUP_MEM_RES_CTLR_KMEM) && defined(CONFIG_INET)
 	case _KMEM_TCP:
 		/* Be explicit: tcp root does not have a res_counter */
-		if (mem_cgroup_is_root(mem))
-			val = RESOURCE_MAX;
-		else
+		if (mem_cgroup_is_root(mem)) {
+			if (name == RES_USAGE)
+				val = atomic_long_read(&tcp_memory_allocated) << PAGE_SHIFT;
+			else
+				val = RESOURCE_MAX;
+		} else
 			val = res_counter_read_u64(&mem->tcp.tcp_memory_allocated, name);
 		break;
 #endif
-- 
1.7.6.4


^ permalink raw reply related

* [PATCH v5 06/10] tcp buffer limitation: per-cgroup limit
From: Glauber Costa @ 2011-11-07 15:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, kamezawa.hiroyu, ebiederm, davem, gthelen, netdev,
	linux-mm, kirill, avagin, devel, eric.dumazet, Glauber Costa
In-Reply-To: <1320679595-21074-1-git-send-email-glommer@parallels.com>

This patch uses the "tcp.limit_in_bytes" field of the kmem_cgroup to
effectively control the amount of kernel memory pinned by a cgroup.

This value is ignored in the root cgroup. In all other cgroups, it
caps the values specified by the admin in the net namespace's
view of sysctl_tcp_mem.

If namespaces are being used, the admin is allowed to set a
value bigger than the cgroup's maximum, the same way it is allowed
to set pretty much unlimited values on a real box.
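
The capping described above can be sketched in userspace C. This is only
an illustration, assuming 4 KiB pages; clamp_tcp_prot_mem() and
DEMO_PAGE_SHIFT are hypothetical names, while the patch itself does the
equivalent work in tcp_update_limit().

```c
#include <stdint.h>

/* Assumption for this sketch: 4 KiB pages. The kernel uses PAGE_SHIFT. */
#define DEMO_PAGE_SHIFT 12

/*
 * Hypothetical userspace model of tcp_update_limit(): convert the byte
 * limit written to kmem.tcp.limit_in_bytes into pages, then cap each of
 * the three tcp_mem thresholds at that page count.
 */
static void clamp_tcp_prot_mem(long prot_mem[3], const long sysctl_tcp_mem[3],
			       uint64_t limit_bytes)
{
	long limit_pages = (long)(limit_bytes >> DEMO_PAGE_SHIFT);
	int i;

	for (i = 0; i < 3; i++)
		prot_mem[i] = limit_pages < sysctl_tcp_mem[i] ?
			      limit_pages : sysctl_tcp_mem[i];
}
```

Thresholds already below the cgroup limit are left alone; only the ones
above it are pulled down to the limit.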

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: David S. Miller <davem@davemloft.net>
CC: Hiroyuki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
---
 Documentation/cgroups/memory.txt |    1 +
 include/linux/memcontrol.h       |    9 ++++
 mm/memcontrol.c                  |   76 +++++++++++++++++++++++++++++++++++++-
 net/ipv4/sysctl_net_ipv4.c       |   14 +++++++
 4 files changed, 99 insertions(+), 1 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index bf00cd2..c1db134 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -78,6 +78,7 @@ Brief summary of control files.
 
  memory.independent_kmem_limit	 # select whether or not kernel memory limits are
 				   independent of user limits
+ memory.kmem.tcp.limit_in_bytes  # set/show hard limit for tcp buf memory
 
 1. History
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 994a06a..d025979 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -402,10 +402,19 @@ void memcg_memory_allocated_sub(struct mem_cgroup *memcg, struct cg_proto *prot,
 				unsigned long amt);
 u64 memcg_memory_allocated_read(struct mem_cgroup *memcg,
 				struct cg_proto *prot);
+unsigned long long tcp_max_memory(const struct mem_cgroup *memcg);
+void tcp_prot_mem(struct mem_cgroup *memcg, long val, int idx);
 #else
 static inline void sock_update_memcg(struct sock *sk)
 {
 }
+static inline unsigned long long tcp_max_memory(const struct mem_cgroup *memcg)
+{
+	return 0;
+}
+static inline void tcp_prot_mem(struct mem_cgroup *memcg, long val, int idx)
+{
+}
 #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
 #endif /* CONFIG_INET */
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 63360f8..ee122a6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -365,6 +365,7 @@ enum mem_type {
 	_MEMSWAP,
 	_OOM_TYPE,
 	_KMEM,
+	_KMEM_TCP,
 };
 
 #define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
@@ -500,6 +501,35 @@ struct percpu_counter *sockets_allocated_tcp(struct mem_cgroup *memcg)
 }
 EXPORT_SYMBOL(sockets_allocated_tcp);
 
+static void tcp_update_limit(struct mem_cgroup *memcg, u64 val)
+{
+	struct net *net = current->nsproxy->net_ns;
+	int i;
+
+	val >>= PAGE_SHIFT;
+
+	for (i = 0; i < 3; i++)
+		memcg->tcp.tcp_prot_mem[i]  = min_t(long, val,
+					     net->ipv4.sysctl_tcp_mem[i]);
+}
+
+static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
+			    const char *buffer);
+
+static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft);
+/*
+ * We need those things internally in pages, so don't reuse
+ * mem_cgroup_{read,write}
+ */
+static struct cftype tcp_files[] = {
+	{
+		.name = "kmem.tcp.limit_in_bytes",
+		.write_string = mem_cgroup_write,
+		.read_u64 = mem_cgroup_read,
+		.private = MEMFILE_PRIVATE(_KMEM_TCP, RES_LIMIT),
+	},
+};
+
 static void tcp_create_cgroup(struct mem_cgroup *cg, struct cgroup_subsys *ss)
 {
 	/*
@@ -527,6 +557,7 @@ static void tcp_create_cgroup(struct mem_cgroup *cg, struct cgroup_subsys *ss)
 int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
 	struct net *net = current->nsproxy->net_ns;
 	/*
 	 * We need to initialize it at populate, not create time.
@@ -537,7 +568,20 @@ int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss)
 	memcg->tcp.tcp_prot_mem[1] = net->ipv4.sysctl_tcp_mem[1];
 	memcg->tcp.tcp_prot_mem[2] = net->ipv4.sysctl_tcp_mem[2];
 
-	return 0;
+	/* Let root cgroup unlimited. All others, respect parent's if needed */
+	if (parent && !parent->use_hierarchy) {
+		unsigned long limit;
+		int ret;
+		limit = nr_free_buffer_pages() / 8;
+		limit = max(limit, 128UL);
+		ret = res_counter_set_limit(&memcg->tcp.tcp_memory_allocated,
+					    limit * 2);
+		if (ret)
+			return ret;
+	}
+
+	return cgroup_add_files(cgrp, ss, tcp_files,
+				ARRAY_SIZE(tcp_files));
 }
 EXPORT_SYMBOL(tcp_init_cgroup);
 
@@ -548,6 +592,18 @@ void tcp_destroy_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss)
 	percpu_counter_destroy(&memcg->tcp.tcp_sockets_allocated);
 }
 EXPORT_SYMBOL(tcp_destroy_cgroup);
+
+unsigned long long tcp_max_memory(const struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *cmemcg = (struct mem_cgroup *)memcg;
+	return res_counter_read_u64(&cmemcg->tcp.tcp_memory_allocated,
+				    RES_LIMIT);
+}
+
+void tcp_prot_mem(struct mem_cgroup *memcg, long val, int idx)
+{
+	memcg->tcp.tcp_prot_mem[idx] = val;
+}
 #endif /* CONFIG_INET */
 #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
 
@@ -4067,6 +4123,15 @@ static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
 		val = res_counter_read_u64(&mem->kmem, name);
 		break;
 
+#if defined(CONFIG_CGROUP_MEM_RES_CTLR_KMEM) && defined(CONFIG_INET)
+	case _KMEM_TCP:
+		/* Be explicit: tcp root does not have a res_counter */
+		if (mem_cgroup_is_root(mem))
+			val = RESOURCE_MAX;
+		else
+			val = res_counter_read_u64(&mem->tcp.tcp_memory_allocated, name);
+		break;
+#endif
 	default:
 		BUG();
 		break;
@@ -4099,6 +4164,15 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
 			break;
 		if (type == _MEM)
 			ret = mem_cgroup_resize_limit(memcg, val);
+
+#if defined(CONFIG_CGROUP_MEM_RES_CTLR_KMEM) && defined(CONFIG_INET)
+		else if (type == _KMEM_TCP) {
+			ret = res_counter_set_limit(&memcg->tcp.tcp_memory_allocated,
+						    val);
+			if (!ret)
+				tcp_update_limit(memcg, val);
+		}
+#endif
 		else
 			ret = mem_cgroup_resize_memsw_limit(memcg, val);
 		break;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index bbd67ab..915e192 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -14,6 +14,7 @@
 #include <linux/init.h>
 #include <linux/slab.h>
 #include <linux/nsproxy.h>
+#include <linux/memcontrol.h>
 #include <linux/swap.h>
 #include <net/snmp.h>
 #include <net/icmp.h>
@@ -182,6 +183,9 @@ static int ipv4_tcp_mem(ctl_table *ctl, int write,
 	int ret;
 	unsigned long vec[3];
 	struct net *net = current->nsproxy->net_ns;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+	struct mem_cgroup *cg;
+#endif
 
 	ctl_table tmp = {
 		.data = &vec,
@@ -198,6 +202,16 @@ static int ipv4_tcp_mem(ctl_table *ctl, int write,
 	if (ret)
 		return ret;
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+	rcu_read_lock();
+	cg = mem_cgroup_from_task(current);
+
+	tcp_prot_mem(cg, vec[0], 0);
+	tcp_prot_mem(cg, vec[1], 1);
+	tcp_prot_mem(cg, vec[2], 2);
+	rcu_read_unlock();
+#endif
+
 	net->ipv4.sysctl_tcp_mem[0] = vec[0];
 	net->ipv4.sysctl_tcp_mem[1] = vec[1];
 	net->ipv4.sysctl_tcp_mem[2] = vec[2];
-- 
1.7.6.4


^ permalink raw reply related

* [PATCH v5 05/10] per-netns ipv4 sysctl_tcp_mem
From: Glauber Costa @ 2011-11-07 15:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, kamezawa.hiroyu, ebiederm, davem, gthelen, netdev,
	linux-mm, kirill, avagin, devel, eric.dumazet, Glauber Costa
In-Reply-To: <1320679595-21074-1-git-send-email-glommer@parallels.com>

This patch allows each namespace to independently set up
its levels for tcp memory pressure thresholds. This patch
alone does not buy much: we need to make these values
per group of processes somehow. This is achieved in the
patches that follow in this patchset.
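
As a rough illustration of the per-namespace defaults this patch moves
into ipv4_sysctl_init_net(), the arithmetic can be modelled in userspace
C. The function name default_tcp_mem() is hypothetical, and
free_buffer_pages stands in for what the kernel obtains from
nr_free_buffer_pages().

```c
/*
 * Hypothetical userspace model of the default tcp_mem thresholds,
 * all expressed in pages.
 */
static void default_tcp_mem(long tcp_mem[3], unsigned long free_buffer_pages)
{
	unsigned long limit = free_buffer_pages / 8;

	/* Never go below 128 pages, as in the patch. */
	if (limit < 128UL)
		limit = 128UL;

	tcp_mem[0] = limit / 4 * 3;	/* low threshold */
	tcp_mem[1] = limit;		/* pressure threshold */
	tcp_mem[2] = tcp_mem[0] * 2;	/* hard limit */
}
```

With 8000 free buffer pages this yields 750, 1000 and 1500 pages; on a
very small machine the 128-page floor applies instead.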

Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: David S. Miller <davem@davemloft.net>
CC: Eric W. Biederman <ebiederm@xmission.com>
---
 include/net/netns/ipv4.h   |    1 +
 include/net/tcp.h          |    1 -
 mm/memcontrol.c            |    8 ++++--
 net/ipv4/af_inet.c         |    2 +
 net/ipv4/sysctl_net_ipv4.c |   51 +++++++++++++++++++++++++++++++++++++------
 net/ipv4/tcp.c             |   11 +-------
 net/ipv4/tcp_ipv4.c        |    1 -
 net/ipv6/af_inet6.c        |    2 +
 net/ipv6/tcp_ipv6.c        |    1 -
 9 files changed, 56 insertions(+), 22 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index d786b4f..bbd023a 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -55,6 +55,7 @@ struct netns_ipv4 {
 	int current_rt_cache_rebuild_count;
 
 	unsigned int sysctl_ping_group_range[2];
+	long sysctl_tcp_mem[3];
 
 	atomic_t rt_genid;
 	atomic_t dev_addr_genid;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7301ca8..c34b823 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -230,7 +230,6 @@ extern int sysctl_tcp_fack;
 extern int sysctl_tcp_reordering;
 extern int sysctl_tcp_ecn;
 extern int sysctl_tcp_dsack;
-extern long sysctl_tcp_mem[3];
 extern int sysctl_tcp_wmem[3];
 extern int sysctl_tcp_rmem[3];
 extern int sysctl_tcp_app_win;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f14d7d2..63360f8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -395,6 +395,7 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
 #ifdef CONFIG_INET
 #include <net/sock.h>
 #include <net/ip.h>
+#include <linux/nsproxy.h>
 
 void sock_update_memcg(struct sock *sk)
 {
@@ -526,14 +527,15 @@ static void tcp_create_cgroup(struct mem_cgroup *cg, struct cgroup_subsys *ss)
 int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	struct net *net = current->nsproxy->net_ns;
 	/*
 	 * We need to initialize it at populate, not create time.
 	 * This is because net sysctl tables are not up until much
 	 * later
 	 */
-	memcg->tcp.tcp_prot_mem[0] = sysctl_tcp_mem[0];
-	memcg->tcp.tcp_prot_mem[1] = sysctl_tcp_mem[1];
-	memcg->tcp.tcp_prot_mem[2] = sysctl_tcp_mem[2];
+	memcg->tcp.tcp_prot_mem[0] = net->ipv4.sysctl_tcp_mem[0];
+	memcg->tcp.tcp_prot_mem[1] = net->ipv4.sysctl_tcp_mem[1];
+	memcg->tcp.tcp_prot_mem[2] = net->ipv4.sysctl_tcp_mem[2];
 
 	return 0;
 }
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index da19147..73be7da 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1674,6 +1674,8 @@ static int __init inet_init(void)
 	ip_static_sysctl_init();
 #endif
 
+	tcp_prot.sysctl_mem = init_net.ipv4.sysctl_tcp_mem;
+
 	/*
 	 *	Add all the base protocols.
 	 */
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 69fd720..bbd67ab 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -14,6 +14,7 @@
 #include <linux/init.h>
 #include <linux/slab.h>
 #include <linux/nsproxy.h>
+#include <linux/swap.h>
 #include <net/snmp.h>
 #include <net/icmp.h>
 #include <net/ip.h>
@@ -174,6 +175,36 @@ static int proc_allowed_congestion_control(ctl_table *ctl,
 	return ret;
 }
 
+static int ipv4_tcp_mem(ctl_table *ctl, int write,
+			   void __user *buffer, size_t *lenp,
+			   loff_t *ppos)
+{
+	int ret;
+	unsigned long vec[3];
+	struct net *net = current->nsproxy->net_ns;
+
+	ctl_table tmp = {
+		.data = &vec,
+		.maxlen = sizeof(vec),
+		.mode = ctl->mode,
+	};
+
+	if (!write) {
+		ctl->data = &net->ipv4.sysctl_tcp_mem;
+		return proc_doulongvec_minmax(ctl, write, buffer, lenp, ppos);
+	}
+
+	ret = proc_doulongvec_minmax(&tmp, write, buffer, lenp, ppos);
+	if (ret)
+		return ret;
+
+	net->ipv4.sysctl_tcp_mem[0] = vec[0];
+	net->ipv4.sysctl_tcp_mem[1] = vec[1];
+	net->ipv4.sysctl_tcp_mem[2] = vec[2];
+
+	return 0;
+}
+
 static struct ctl_table ipv4_table[] = {
 	{
 		.procname	= "tcp_timestamps",
@@ -433,13 +464,6 @@ static struct ctl_table ipv4_table[] = {
 		.proc_handler	= proc_dointvec
 	},
 	{
-		.procname	= "tcp_mem",
-		.data		= &sysctl_tcp_mem,
-		.maxlen		= sizeof(sysctl_tcp_mem),
-		.mode		= 0644,
-		.proc_handler	= proc_doulongvec_minmax
-	},
-	{
 		.procname	= "tcp_wmem",
 		.data		= &sysctl_tcp_wmem,
 		.maxlen		= sizeof(sysctl_tcp_wmem),
@@ -721,6 +745,12 @@ static struct ctl_table ipv4_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= ipv4_ping_group_range,
 	},
+	{
+		.procname	= "tcp_mem",
+		.maxlen		= sizeof(init_net.ipv4.sysctl_tcp_mem),
+		.mode		= 0644,
+		.proc_handler	= ipv4_tcp_mem,
+	},
 	{ }
 };
 
@@ -734,6 +764,7 @@ EXPORT_SYMBOL_GPL(net_ipv4_ctl_path);
 static __net_init int ipv4_sysctl_init_net(struct net *net)
 {
 	struct ctl_table *table;
+	unsigned long limit;
 
 	table = ipv4_net_table;
 	if (!net_eq(net, &init_net)) {
@@ -769,6 +800,12 @@ static __net_init int ipv4_sysctl_init_net(struct net *net)
 
 	net->ipv4.sysctl_rt_cache_rebuild_count = 4;
 
+	limit = nr_free_buffer_pages() / 8;
+	limit = max(limit, 128UL);
+	net->ipv4.sysctl_tcp_mem[0] = limit / 4 * 3;
+	net->ipv4.sysctl_tcp_mem[1] = limit;
+	net->ipv4.sysctl_tcp_mem[2] = net->ipv4.sysctl_tcp_mem[0] * 2;
+
 	net->ipv4.ipv4_hdr = register_net_sysctl_table(net,
 			net_ipv4_ctl_path, table);
 	if (net->ipv4.ipv4_hdr == NULL)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 34f5db1..5f618d1 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -282,11 +282,9 @@ int sysctl_tcp_fin_timeout __read_mostly = TCP_FIN_TIMEOUT;
 struct percpu_counter tcp_orphan_count;
 EXPORT_SYMBOL_GPL(tcp_orphan_count);
 
-long sysctl_tcp_mem[3] __read_mostly;
 int sysctl_tcp_wmem[3] __read_mostly;
 int sysctl_tcp_rmem[3] __read_mostly;
 
-EXPORT_SYMBOL(sysctl_tcp_mem);
 EXPORT_SYMBOL(sysctl_tcp_rmem);
 EXPORT_SYMBOL(sysctl_tcp_wmem);
 
@@ -3272,14 +3270,9 @@ void __init tcp_init(void)
 	sysctl_tcp_max_orphans = cnt / 2;
 	sysctl_max_syn_backlog = max(128, cnt / 256);
 
-	limit = nr_free_buffer_pages() / 8;
-	limit = max(limit, 128UL);
-	sysctl_tcp_mem[0] = limit / 4 * 3;
-	sysctl_tcp_mem[1] = limit;
-	sysctl_tcp_mem[2] = sysctl_tcp_mem[0] * 2;
-
 	/* Set per-socket limits to no more than 1/128 the pressure threshold */
-	limit = ((unsigned long)sysctl_tcp_mem[1]) << (PAGE_SHIFT - 7);
+	limit = ((unsigned long)init_net.ipv4.sysctl_tcp_mem[1])
+		<< (PAGE_SHIFT - 7);
 	max_share = min(4UL*1024*1024, limit);
 
 	sysctl_tcp_wmem[0] = SK_MEM_QUANTUM;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 54f6b96..dd1bab7 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2616,7 +2616,6 @@ struct proto tcp_prot = {
 	.orphan_count		= &tcp_orphan_count,
 	.memory_allocated	= &tcp_memory_allocated,
 	.memory_pressure	= &tcp_memory_pressure,
-	.sysctl_mem		= sysctl_tcp_mem,
 	.sysctl_wmem		= sysctl_tcp_wmem,
 	.sysctl_rmem		= sysctl_tcp_rmem,
 	.max_header		= MAX_TCP_HEADER,
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 51672f8..69a6da3 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -1118,6 +1118,8 @@ static int __init inet6_init(void)
 	if (err)
 		goto static_sysctl_fail;
 #endif
+	tcpv6_prot.sysctl_mem = init_net.ipv4.sysctl_tcp_mem;
+
 	/*
 	 *	ipngwg API draft makes clear that the correct semantics
 	 *	for TCP and UDP is to consider one TCP and UDP instance
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 3c13142..52f8b64 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -2208,7 +2208,6 @@ struct proto tcpv6_prot = {
 	.memory_allocated	= &tcp_memory_allocated,
 	.memory_pressure	= &tcp_memory_pressure,
 	.orphan_count		= &tcp_orphan_count,
-	.sysctl_mem		= sysctl_tcp_mem,
 	.sysctl_wmem		= sysctl_tcp_wmem,
 	.sysctl_rmem		= sysctl_tcp_rmem,
 	.max_header		= MAX_TCP_HEADER,
-- 
1.7.6.4


^ permalink raw reply related

* [PATCH v5 04/10] per-cgroup tcp buffers control
From: Glauber Costa @ 2011-11-07 15:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, kamezawa.hiroyu, ebiederm, davem, gthelen, netdev,
	linux-mm, kirill, avagin, devel, eric.dumazet, Glauber Costa,
	KAMEZAWA Hiroyuki
In-Reply-To: <1320679595-21074-1-git-send-email-glommer@parallels.com>

With all the infrastructure in place, this patch implements
per-cgroup control for tcp memory pressure handling.

A resource counter is used to control allocated memory, except
for the root cgroup, which will keep using global counters.

This patch is the one that actually enables/disables the
jump labels controlling cgroup. Up to this point, they were always
disabled.
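
To illustrate what a hierarchical resource counter does, here is a
simplified userspace model. It is a sketch only: struct demo_counter and
demo_charge() are hypothetical names, locking is omitted, and the real
res_counter code differs in detail.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct res_counter. */
struct demo_counter {
	unsigned long usage;
	unsigned long limit;
	unsigned long failcnt;
	struct demo_counter *parent;
};

/*
 * Charge val to the counter and to every ancestor. If any level would
 * exceed its limit, bump that level's failcnt, undo the charges made so
 * far and return -1. Returns 0 on success.
 */
static int demo_charge(struct demo_counter *c, unsigned long val)
{
	struct demo_counter *it, *undo;

	for (it = c; it != NULL; it = it->parent) {
		if (it->usage + val > it->limit) {
			it->failcnt++;
			goto rollback;
		}
		it->usage += val;
	}
	return 0;

rollback:
	/* Only the levels below the failing one were charged. */
	for (undo = c; undo != it; undo = undo->parent)
		undo->usage -= val;
	return -1;
}
```

In this model a child cgroup can be refused a charge because of its
parent's limit even when its own limit has room.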

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: David S. Miller <davem@davemloft.net>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/tcp.h       |   18 +++++++
 include/net/transp_v6.h |    1 +
 mm/memcontrol.c         |  125 ++++++++++++++++++++++++++++++++++++++++++++++-
 net/core/sock.c         |   46 +++++++++++++++--
 net/ipv4/af_inet.c      |    3 +
 net/ipv4/tcp_ipv4.c     |   12 +++++
 net/ipv6/af_inet6.c     |    3 +
 net/ipv6/tcp_ipv6.c     |   10 ++++
 8 files changed, 211 insertions(+), 7 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index ccaa3b6..7301ca8 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -253,6 +253,22 @@ extern int sysctl_tcp_cookie_size;
 extern int sysctl_tcp_thin_linear_timeouts;
 extern int sysctl_tcp_thin_dupack;
 
+struct tcp_memcontrol {
+	/* per-cgroup tcp memory pressure knobs */
+	struct res_counter tcp_memory_allocated;
+	struct percpu_counter tcp_sockets_allocated;
+	/* those two are read-mostly, leave them at the end */
+	long tcp_prot_mem[3];
+	int tcp_memory_pressure;
+};
+
+long *sysctl_mem_tcp(struct mem_cgroup *memcg);
+struct percpu_counter *sockets_allocated_tcp(struct mem_cgroup *memcg);
+int *memory_pressure_tcp(struct mem_cgroup *memcg);
+struct res_counter *memory_allocated_tcp(struct mem_cgroup *memcg);
+int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss);
+void tcp_destroy_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss);
+
 extern atomic_long_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
 extern int tcp_memory_pressure;
@@ -305,6 +321,7 @@ static inline int tcp_synq_no_recent_overflow(const struct sock *sk)
 }
 
 extern struct proto tcp_prot;
+extern struct cg_proto tcp_cg_prot;
 
 #define TCP_INC_STATS(net, field)	SNMP_INC_STATS((net)->mib.tcp_statistics, field)
 #define TCP_INC_STATS_BH(net, field)	SNMP_INC_STATS_BH((net)->mib.tcp_statistics, field)
@@ -1022,6 +1039,7 @@ static inline void tcp_openreq_init(struct request_sock *req,
 	ireq->loc_port = tcp_hdr(skb)->dest;
 }
 
+extern void tcp_enter_memory_pressure_cg(struct sock *sk);
 extern void tcp_enter_memory_pressure(struct sock *sk);
 
 static inline int keepalive_intvl_when(const struct tcp_sock *tp)
diff --git a/include/net/transp_v6.h b/include/net/transp_v6.h
index 498433d..1e18849 100644
--- a/include/net/transp_v6.h
+++ b/include/net/transp_v6.h
@@ -11,6 +11,7 @@ extern struct proto rawv6_prot;
 extern struct proto udpv6_prot;
 extern struct proto udplitev6_prot;
 extern struct proto tcpv6_prot;
+extern struct cg_proto tcpv6_cg_prot;
 
 struct flowi6;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7d684d0..f14d7d2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -49,6 +49,9 @@
 #include <linux/cpu.h>
 #include <linux/oom.h>
 #include "internal.h"
+#ifdef CONFIG_INET
+#include <net/tcp.h>
+#endif
 
 #include <asm/uaccess.h>
 
@@ -294,6 +297,10 @@ struct mem_cgroup {
 	 */
 	struct mem_cgroup_stat_cpu nocpu_base;
 	spinlock_t pcp_counter_lock;
+
+#ifdef CONFIG_INET
+	struct tcp_memcontrol tcp;
+#endif
 };
 
 /* Stuffs for move charges at task migration. */
@@ -377,7 +384,7 @@ enum mem_type {
 #define MEM_CGROUP_RECLAIM_SOFT		(1 << MEM_CGROUP_RECLAIM_SOFT_BIT)
 
 static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
-
+static struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont);
 static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
 {
 	return (mem == root_mem_cgroup);
@@ -387,6 +394,7 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 #ifdef CONFIG_INET
 #include <net/sock.h>
+#include <net/ip.h>
 
 void sock_update_memcg(struct sock *sk)
 {
@@ -451,6 +459,93 @@ u64 memcg_memory_allocated_read(struct mem_cgroup *memcg, struct cg_proto *prot)
 				    RES_USAGE) >> PAGE_SHIFT ;
 }
 EXPORT_SYMBOL(memcg_memory_allocated_read);
+/*
+ * Pressure flag: try to collapse.
+ * Technical note: it is used by multiple contexts non atomically.
+ * All the __sk_mem_schedule() is of this nature: accounting
+ * is strict, actions are advisory and have some latency.
+ */
+void tcp_enter_memory_pressure_cg(struct sock *sk)
+{
+	struct mem_cgroup *memcg = sk->sk_cgrp;
+	if (!memcg->tcp.tcp_memory_pressure) {
+		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPMEMORYPRESSURES);
+		memcg->tcp.tcp_memory_pressure = 1;
+	}
+}
+EXPORT_SYMBOL(tcp_enter_memory_pressure_cg);
+
+long *sysctl_mem_tcp(struct mem_cgroup *memcg)
+{
+	return memcg->tcp.tcp_prot_mem;
+}
+EXPORT_SYMBOL(sysctl_mem_tcp);
+
+struct res_counter *memory_allocated_tcp(struct mem_cgroup *memcg)
+{
+	return &memcg->tcp.tcp_memory_allocated;
+}
+EXPORT_SYMBOL(memory_allocated_tcp);
+
+int *memory_pressure_tcp(struct mem_cgroup *memcg)
+{
+	return &memcg->tcp.tcp_memory_pressure;
+}
+EXPORT_SYMBOL(memory_pressure_tcp);
+
+struct percpu_counter *sockets_allocated_tcp(struct mem_cgroup *memcg)
+{
+	return &memcg->tcp.tcp_sockets_allocated;
+}
+EXPORT_SYMBOL(sockets_allocated_tcp);
+
+static void tcp_create_cgroup(struct mem_cgroup *cg, struct cgroup_subsys *ss)
+{
+	/*
+	 * The root cgroup does not use res_counters, but rather,
+	 * rely on the data already collected by the network
+	 * subsystem
+	 */
+	if (!mem_cgroup_is_root(cg)) {
+		struct mem_cgroup *parent = parent_mem_cgroup(cg);
+		struct res_counter *res_parent = NULL;
+		cg->tcp.tcp_memory_pressure = 0;
+		percpu_counter_init(&cg->tcp.tcp_sockets_allocated, 0);
+
+		/*
+		 * Because root is not using res_counter, we only need a parent
+		 * if we're second in hierarchy.
+		 */
+		if (!mem_cgroup_is_root(parent) && parent && parent->use_hierarchy)
+			res_parent = &parent->tcp.tcp_memory_allocated;
+
+		res_counter_init(&cg->tcp.tcp_memory_allocated, res_parent);
+	}
+}
+
+int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	/*
+	 * We need to initialize it at populate, not create time.
+	 * This is because net sysctl tables are not up until much
+	 * later
+	 */
+	memcg->tcp.tcp_prot_mem[0] = sysctl_tcp_mem[0];
+	memcg->tcp.tcp_prot_mem[1] = sysctl_tcp_mem[1];
+	memcg->tcp.tcp_prot_mem[2] = sysctl_tcp_mem[2];
+
+	return 0;
+}
+EXPORT_SYMBOL(tcp_init_cgroup);
+
+void tcp_destroy_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+
+	percpu_counter_destroy(&memcg->tcp.tcp_sockets_allocated);
+}
+EXPORT_SYMBOL(tcp_destroy_cgroup);
 #endif /* CONFIG_INET */
 #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
 
@@ -4867,17 +4962,39 @@ static struct cftype kmem_cgroup_files[] = {
 static int register_kmem_files(struct cgroup *cont, struct cgroup_subsys *ss)
 {
 	int ret = 0;
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
 
 	ret = cgroup_add_files(cont, ss, kmem_cgroup_files,
 			       ARRAY_SIZE(kmem_cgroup_files));
+
+	if (!ret)
+		ret = sockets_populate(cont, ss);
+
+	if (!mem_cgroup_is_root(memcg))
+		jump_label_inc(&cgroup_crap_enabled);
+
 	return ret;
 };
 
+static void kmem_cgroup_destroy(struct cgroup_subsys *ss,
+				struct cgroup *cont)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+	sockets_destroy(cont, ss);
+
+	if (!mem_cgroup_is_root(memcg))
+		jump_label_dec(&cgroup_crap_enabled);
+}
 #else
 static int register_kmem_files(struct cgroup *cont, struct cgroup_subsys *ss)
 {
 	return 0;
 }
+
+static void kmem_cgroup_destroy(struct cgroup_subsys *ss,
+				struct cgroup *cont)
+{
+}
 #endif
 
 static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
@@ -5095,6 +5212,10 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	mem->last_scanned_node = MAX_NUMNODES;
 	INIT_LIST_HEAD(&mem->oom_notify);
 
+#if defined(CONFIG_CGROUP_MEM_RES_CTLR_KMEM) && defined(CONFIG_INET)
+	tcp_create_cgroup(mem, ss);
+#endif
+
 	if (parent)
 		mem->swappiness = mem_cgroup_swappiness(parent);
 	atomic_set(&mem->refcnt, 1);
@@ -5120,6 +5241,8 @@ static void mem_cgroup_destroy(struct cgroup_subsys *ss,
 {
 	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
 
+	kmem_cgroup_destroy(ss, cont);
+
 	mem_cgroup_put(mem);
 }
 
diff --git a/net/core/sock.c b/net/core/sock.c
index b64d36a..c784173 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -135,6 +135,46 @@
 #include <net/tcp.h>
 #endif
 
+static DEFINE_RWLOCK(proto_list_lock);
+static LIST_HEAD(proto_list);
+
+static DEFINE_RWLOCK(cg_proto_list_lock);
+static LIST_HEAD(cg_proto_list);
+
+int sockets_populate(struct cgroup *cgrp, struct cgroup_subsys *ss)
+{
+	struct cg_proto *proto;
+	int ret = 0;
+
+	read_lock(&cg_proto_list_lock);
+	list_for_each_entry(proto, &cg_proto_list, node) {
+		if (proto->init_cgroup)
+			ret = proto->init_cgroup(cgrp, ss);
+			if (ret)
+				goto out;
+	}
+
+	read_unlock(&cg_proto_list_lock);
+	return ret;
+out:
+	list_for_each_entry_continue_reverse(proto, &cg_proto_list, node)
+		if (proto->destroy_cgroup)
+			proto->destroy_cgroup(cgrp, ss);
+	read_unlock(&cg_proto_list_lock);
+	return ret;
+}
+
+void sockets_destroy(struct cgroup *cgrp, struct cgroup_subsys *ss)
+{
+	struct cg_proto *proto;
+
+	read_lock(&cg_proto_list_lock);
+	list_for_each_entry_reverse(proto, &cg_proto_list, node)
+		if (proto->destroy_cgroup)
+			proto->destroy_cgroup(cgrp, ss);
+	read_unlock(&cg_proto_list_lock);
+}
+
 /*
  * Each address family might have different locking rules, so we have
  * one slock key per address family:
@@ -2259,12 +2299,6 @@ void sk_common_release(struct sock *sk)
 }
 EXPORT_SYMBOL(sk_common_release);
 
-static DEFINE_RWLOCK(proto_list_lock);
-static LIST_HEAD(proto_list);
-
-static DEFINE_RWLOCK(cg_proto_list_lock);
-static LIST_HEAD(cg_proto_list);
-
 #ifdef CONFIG_PROC_FS
 #define PROTO_INUSE_NR	64	/* should be enough for the first time */
 struct prot_inuse {
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 1b5096a..da19147 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1661,6 +1661,9 @@ static int __init inet_init(void)
 	if (rc)
 		goto out_unregister_raw_proto;
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+	cg_proto_register(&tcp_cg_prot, &tcp_prot);
+#endif
 	/*
 	 *	Tell SOCKET that we are alive...
 	 */
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f124a4b..54f6b96 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1917,6 +1917,7 @@ static int tcp_v4_init_sock(struct sock *sk)
 	sk_sockets_allocated_inc(sk);
 	local_bh_enable();
 
+	sock_update_memcg(sk);
 	return 0;
 }
 
@@ -2632,6 +2633,17 @@ struct proto tcp_prot = {
 };
 EXPORT_SYMBOL(tcp_prot);
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+struct cg_proto tcp_cg_prot = {
+	.memory_allocated	= memory_allocated_tcp,
+	.memory_pressure	= memory_pressure_tcp,
+	.sockets_allocated	= sockets_allocated_tcp,
+	.prot_mem		= sysctl_mem_tcp,
+	.init_cgroup		= tcp_init_cgroup,
+	.destroy_cgroup		= tcp_destroy_cgroup,
+};
+EXPORT_SYMBOL(tcp_cg_prot);
+#endif
 
 static int __net_init tcp_sk_init(struct net *net)
 {
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index d27c797..51672f8 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -1103,6 +1103,9 @@ static int __init inet6_init(void)
 	if (err)
 		goto out_unregister_raw_proto;
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+	cg_proto_register(&tcpv6_cg_prot, &tcpv6_prot);
+#endif
 	/* Register the family here so that the init calls below will
 	 * be able to create sockets. (?? is this dangerous ??)
 	 */
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 3a08fcd..3c13142 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -2224,6 +2224,16 @@ struct proto tcpv6_prot = {
 #endif
 };
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+struct cg_proto tcpv6_cg_prot = {
+	.memory_allocated	= memory_allocated_tcp,
+	.memory_pressure	= memory_pressure_tcp,
+	.sockets_allocated	= sockets_allocated_tcp,
+	.prot_mem		= sysctl_mem_tcp,
+};
+EXPORT_SYMBOL(tcpv6_cg_prot);
+#endif
+
 static const struct inet6_protocol tcpv6_protocol = {
 	.handler	=	tcp_v6_rcv,
 	.err_handler	=	tcp_v6_err,
-- 
1.7.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH v5 03/10] socket: initial cgroup code.
From: Glauber Costa @ 2011-11-07 15:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, kamezawa.hiroyu, ebiederm, davem, gthelen, netdev,
	linux-mm, kirill, avagin, devel, eric.dumazet, Glauber Costa
In-Reply-To: <1320679595-21074-1-git-send-email-glommer@parallels.com>

The goal of this work is to move the TCP memory pressure
controls to a cgroup, instead of relying only on global
conditions.

To avoid excessive overhead in the network fast paths,
the code that accounts allocated memory to a cgroup is
hidden inside a static_branch(). This branch is patched out
until the first non-root cgroup is created. So when nobody
is using cgroups, even if the controller is mounted, no
significant performance penalty should be seen.
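
As a rough userspace illustration of the gating idea described above (not
the kernel code itself), the sketch below uses a plain boolean where the
patch uses static_branch() code patching. All names here (group,
sock_model, accounting_enabled, first_group_created, charge) are
hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel objects; not the real definitions. */
struct group { long allocated; };
struct sock_model { struct group *grp; };  /* NULL: socket has no group */

/* Models the branch that stays patched out until a group exists. */
static bool accounting_enabled;
static long global_allocated;

/* Flip the switch when the first non-root group appears. */
void first_group_created(void)
{
	accounting_enabled = true;
}

/* Charge pages globally, and to the socket's group once accounting is on. */
long charge(struct sock_model *sk, long pages)
{
	assert(sk != NULL && pages >= 0);
	if (accounting_enabled && sk->grp != NULL)
		sk->grp->allocated += pages;
	global_allocated += pages;
	return global_allocated;
}
```

Before first_group_created() runs, charge() only touches the global
counter, which is the cheap path the paragraph above is concerned with. A
real static branch removes even the flag test from the instruction stream;
the boolean here only models the control flow.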

This patch handles the generic part of the code, and has nothing
tcp-specific.
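
One way to picture the split between generic and protocol-specific code is
a table of callbacks, loosely modelled on the struct cg_proto added in this
patch. The sketch below is a simplified userspace model under that
assumption; group_model, cg_ops_model, tcp_model_memory_allocated and
generic_charge are hypothetical names, and a plain long stands in for the
res_counter the patch uses.

```c
#include <assert.h>
#include <stddef.h>

struct group_model { long tcp_pages; };  /* hypothetical per-group state */

/* Per-protocol callbacks: the only thing generic code gets to see. */
struct cg_ops_model {
	long *(*memory_allocated)(struct group_model *grp);
};

/* Protocol side: tells generic code where its counter lives. */
long *tcp_model_memory_allocated(struct group_model *grp)
{
	return &grp->tcp_pages;
}

/* Generic side: knows nothing about TCP, only the callback table. */
long generic_charge(const struct cg_ops_model *ops, struct group_model *grp,
		    long pages)
{
	long *counter;

	assert(ops != NULL && ops->memory_allocated != NULL);
	counter = ops->memory_allocated(grp);
	*counter += pages;
	return *counter;
}
```

The point of the indirection is that another protocol could supply its own
table without the generic charging code changing.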

Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: David S. Miller <davem@davemloft.net>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/memcontrol.h |   27 ++++++++
 include/net/sock.h         |  151 +++++++++++++++++++++++++++++++++++++++++++-
 mm/memcontrol.c            |   85 +++++++++++++++++++++++--
 net/core/sock.c            |   36 ++++++++---
 4 files changed, 281 insertions(+), 18 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e9ff93a..994a06a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -381,5 +381,32 @@ mem_cgroup_print_bad_page(struct page *page)
 }
 #endif
 
+#ifdef CONFIG_INET
+enum {
+	UNDER_LIMIT,
+	SOFT_LIMIT,
+	OVER_LIMIT,
+};
+
+struct sock;
+struct cg_proto;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+void sock_update_memcg(struct sock *sk);
+void memcg_sockets_allocated_dec(struct mem_cgroup *memcg,
+				 struct cg_proto *prot);
+void memcg_sockets_allocated_inc(struct mem_cgroup *memcg,
+				 struct cg_proto *prot);
+void memcg_memory_allocated_add(struct mem_cgroup *memcg, struct cg_proto *prot,
+				unsigned long amt, int *parent_status);
+void memcg_memory_allocated_sub(struct mem_cgroup *memcg, struct cg_proto *prot,
+				unsigned long amt);
+u64 memcg_memory_allocated_read(struct mem_cgroup *memcg,
+				struct cg_proto *prot);
+#else
+static inline void sock_update_memcg(struct sock *sk)
+{
+}
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
+#endif /* CONFIG_INET */
 #endif /* _LINUX_MEMCONTROL_H */
 
diff --git a/include/net/sock.h b/include/net/sock.h
index 8959dcc..02d7cce 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -55,6 +55,7 @@
 #include <linux/slab.h>
 #include <linux/uaccess.h>
 #include <linux/cgroup.h>
+#include <linux/res_counter.h>
 
 #include <linux/filter.h>
 #include <linux/rculist_nulls.h>
@@ -64,6 +65,8 @@
 #include <net/dst.h>
 #include <net/checksum.h>
 
+int sockets_populate(struct cgroup *cgrp, struct cgroup_subsys *ss);
+void sockets_destroy(struct cgroup *cgrp, struct cgroup_subsys *ss);
 /*
  * This structure really needs to be cleaned up.
  * Most of it is for TCP, and not used by any of
@@ -231,6 +234,7 @@ struct mem_cgroup;
   *	@sk_security: used by security modules
   *	@sk_mark: generic packet mark
   *	@sk_classid: this socket's cgroup classid
+  *	@sk_cgrp: this socket's kernel memory (kmem) cgroup
   *	@sk_write_pending: a write to stream socket waits to start
   *	@sk_state_change: callback to indicate change in the state of the sock
   *	@sk_data_ready: callback to indicate there is data to be processed
@@ -342,6 +346,7 @@ struct sock {
 #endif
 	__u32			sk_mark;
 	u32			sk_classid;
+	struct mem_cgroup	*sk_cgrp;
 	void			(*sk_state_change)(struct sock *sk);
 	void			(*sk_data_ready)(struct sock *sk, int bytes);
 	void			(*sk_write_space)(struct sock *sk);
@@ -733,6 +738,9 @@ struct timewait_sock_ops;
 struct inet_hashinfo;
 struct raw_hashinfo;
 
+
+struct cg_proto;
+
 /* Networking protocol blocks we attach to sockets.
  * socket layer -> transport layer interface
  * transport -> network interface is defined by struct inet_proto
@@ -835,9 +843,32 @@ struct proto {
 #ifdef SOCK_REFCNT_DEBUG
 	atomic_t		socks;
 #endif
+	struct cg_proto 	*cg_proto; /* This just makes proto replacement easier */
+};
+
+struct cg_proto {
+	/*
+	 * cgroup specific init/deinit functions. Called once for all
+	 * protocols that implement it, from cgroups populate function.
+	 * This function has to setup any files the protocol want to
+	 * appear in the kmem cgroup filesystem.
+	 */
+	int			(*init_cgroup)(struct cgroup *cgrp,
+					       struct cgroup_subsys *ss);
+	void			(*destroy_cgroup)(struct cgroup *cgrp,
+						  struct cgroup_subsys *ss);
+	struct res_counter	*(*memory_allocated)(struct mem_cgroup *memcg);
+	/* Pointer to the current number of sockets in this cgroup. */
+	struct percpu_counter 	*(*sockets_allocated)(struct mem_cgroup *memcg);
+
+	int			*(*memory_pressure)(struct mem_cgroup *memcg);
+	long			*(*prot_mem)(struct mem_cgroup *memcg);
+
+	struct list_head	node; /* with a bit more effort, we could reuse proto's */
 };
 
 extern int proto_register(struct proto *prot, int alloc_slab);
+extern void cg_proto_register(struct cg_proto *prot, struct proto *proto);
 extern void proto_unregister(struct proto *prot);
 
 #ifdef SOCK_REFCNT_DEBUG
@@ -865,15 +896,40 @@ static inline void sk_refcnt_debug_release(const struct sock *sk)
 #define sk_refcnt_debug_release(sk) do { } while (0)
 #endif /* SOCK_REFCNT_DEBUG */
 
+extern struct jump_label_key cgroup_crap_enabled;
 #include <linux/memcontrol.h>
 static inline int *sk_memory_pressure(const struct sock *sk)
 {
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+	if (static_branch(&cgroup_crap_enabled)) {
+		int *ret = NULL;
+		struct cg_proto *cg_prot = sk->sk_prot->cg_proto;
+
+		if (!sk->sk_cgrp)
+			goto nocgroup;
+		if (cg_prot->memory_pressure)
+			ret = cg_prot->memory_pressure(sk->sk_cgrp);
+		return ret;
+	} else
+nocgroup:
+#endif
 	return sk->sk_prot->memory_pressure;
 }
 
 static inline long sk_prot_mem(const struct sock *sk, int index)
 {
 	long *prot = sk->sk_prot->sysctl_mem;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+	if (static_branch(&cgroup_crap_enabled)) {
+		struct mem_cgroup *cg = sk->sk_cgrp;
+		struct cg_proto *cg_prot = sk->sk_prot->cg_proto;
+		if (!cg) /* this handles the case with existing sockets */
+			goto nocgroup;
+
+		prot = cg_prot->prot_mem(sk->sk_cgrp);
+	}
+nocgroup:
+#endif
 	return prot[index];
 }
 
@@ -881,32 +937,93 @@ static inline long
 sk_memory_allocated(const struct sock *sk)
 {
 	struct proto *prot = sk->sk_prot;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+	if (static_branch(&cgroup_crap_enabled)) {
+		struct mem_cgroup *cg = sk->sk_cgrp;
+		struct cg_proto *cg_prot = sk->sk_prot->cg_proto;
+		if (!cg) /* this handles the case with existing sockets */
+			goto nocgroup;
+
+		return memcg_memory_allocated_read(cg, cg_prot);
+	}
+nocgroup:
+#endif
 	return atomic_long_read(prot->memory_allocated);
 }
 
 static inline long
-sk_memory_allocated_add(struct sock *sk, int amt)
+sk_memory_allocated_add(struct sock *sk, int amt, int *parent_status)
 {
 	struct proto *prot = sk->sk_prot;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+	if (static_branch(&cgroup_crap_enabled)) {
+		struct mem_cgroup *cg = sk->sk_cgrp;
+		struct cg_proto *cg_prot = prot->cg_proto;
+
+		if (!cg)
+			goto nocgroup;
+
+		memcg_memory_allocated_add(cg, cg_prot, amt, parent_status);
+	}
+nocgroup:
+#endif
 	return atomic_long_add_return(amt, prot->memory_allocated);
 }
 
 static inline void
-sk_memory_allocated_sub(struct sock *sk, int amt)
+sk_memory_allocated_sub(struct sock *sk, int amt, int parent_status)
 {
 	struct proto *prot = sk->sk_prot;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+	if (static_branch(&cgroup_crap_enabled)) {
+		struct mem_cgroup *cg = sk->sk_cgrp;
+		struct cg_proto *cg_prot = prot->cg_proto;
+
+		if (!cg)
+			goto nocgroup;
+
+		/* Otherwise it was uncharged already */
+		if (parent_status != OVER_LIMIT)
+			memcg_memory_allocated_sub(cg, cg_prot, amt);
+	}
+nocgroup:
+#endif
 	atomic_long_sub(amt, prot->memory_allocated);
 }
 
 static inline void sk_sockets_allocated_dec(struct sock *sk)
 {
 	struct proto *prot = sk->sk_prot;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+	if (static_branch(&cgroup_crap_enabled)) {
+		struct mem_cgroup *cg = sk->sk_cgrp;
+		struct cg_proto *cg_prot = prot->cg_proto;
+
+		if (!cg)
+			goto nocgroup;
+
+		memcg_sockets_allocated_dec(cg, cg_prot);
+	}
+nocgroup:
+#endif
 	percpu_counter_dec(prot->sockets_allocated);
 }
 
 static inline void sk_sockets_allocated_inc(struct sock *sk)
 {
 	struct proto *prot = sk->sk_prot;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+	if (static_branch(&cgroup_crap_enabled)) {
+		struct mem_cgroup *cg = sk->sk_cgrp;
+		struct cg_proto *cg_prot = prot->cg_proto;
+
+		if (!cg)
+			goto nocgroup;
+
+		memcg_sockets_allocated_inc(cg, cg_prot);
+	}
+nocgroup:
+#endif
 	percpu_counter_inc(prot->sockets_allocated);
 }
 
@@ -914,19 +1031,47 @@ static inline int
 sk_sockets_allocated_read_positive(struct sock *sk)
 {
 	struct proto *prot = sk->sk_prot;
-
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+	if (static_branch(&cgroup_crap_enabled)) {
+		struct mem_cgroup *cg = sk->sk_cgrp;
+		struct cg_proto *cg_prot = prot->cg_proto;
+
+		if (!cg)
+			goto nocgroup;
+		return percpu_counter_sum_positive(cg_prot->sockets_allocated(cg));
+	}
+nocgroup:
+#endif
 	return percpu_counter_sum_positive(prot->sockets_allocated);
 }
 
 static inline int
 kcg_sockets_allocated_sum_positive(struct proto *prot, struct mem_cgroup *cg)
 {
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+	if (static_branch(&cgroup_crap_enabled)) {
+		struct cg_proto *cg_prot = prot->cg_proto;
+		if (!cg)
+			goto nocgroup;
+		return percpu_counter_sum_positive(cg_prot->sockets_allocated(cg));
+	}
+nocgroup:
+#endif
 	return percpu_counter_sum_positive(prot->sockets_allocated);
 }
 
 static inline long
 kcg_memory_allocated(struct proto *prot, struct mem_cgroup *cg)
 {
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+	if (static_branch(&cgroup_crap_enabled)) {
+		struct cg_proto *cg_prot = prot->cg_proto;
+		if (!cg)
+			goto nocgroup;
+		return memcg_memory_allocated_read(cg, cg_prot);
+	}
+nocgroup:
+#endif
 	return atomic_long_read(prot->memory_allocated);
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3389d33..7d684d0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -376,6 +376,85 @@ enum mem_type {
 #define MEM_CGROUP_RECLAIM_SOFT_BIT	0x2
 #define MEM_CGROUP_RECLAIM_SOFT		(1 << MEM_CGROUP_RECLAIM_SOFT_BIT)
 
+static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
+
+static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
+{
+	return (mem == root_mem_cgroup);
+}
+
+/* Writing them here to avoid exposing memcg's inner layout */
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+#ifdef CONFIG_INET
+#include <net/sock.h>
+
+void sock_update_memcg(struct sock *sk)
+{
+	/* right now a socket spends its whole life in the same cgroup */
+	if (sk->sk_cgrp) {
+		WARN_ON(1);
+		return;
+	}
+	if (static_branch(&cgroup_crap_enabled)) {
+		struct mem_cgroup *memcg;
+
+		BUG_ON(!sk->sk_prot->cg_proto);
+
+		rcu_read_lock();
+		memcg = mem_cgroup_from_task(current);
+		if (!mem_cgroup_is_root(memcg))
+			sk->sk_cgrp = memcg;
+		rcu_read_unlock();
+	}
+}
+
+void memcg_sockets_allocated_dec(struct mem_cgroup *memcg,
+				 struct cg_proto *prot)
+{
+	for (; memcg; memcg = parent_mem_cgroup(memcg))
+		percpu_counter_dec(prot->sockets_allocated(memcg));
+}
+EXPORT_SYMBOL(memcg_sockets_allocated_dec);
+
+void memcg_sockets_allocated_inc(struct mem_cgroup *memcg,
+				 struct cg_proto *prot)
+{
+	for (; memcg; memcg = parent_mem_cgroup(memcg))
+		percpu_counter_inc(prot->sockets_allocated(memcg));
+}
+EXPORT_SYMBOL(memcg_sockets_allocated_inc);
+
+void memcg_memory_allocated_add(struct mem_cgroup *memcg, struct cg_proto *prot,
+				unsigned long amt, int *parent_status)
+{
+	struct res_counter *fail;
+	int ret;
+
+	ret = res_counter_charge(prot->memory_allocated(memcg),
+				 amt << PAGE_SHIFT, &fail);
+
+	if (ret < 0)
+		*parent_status = OVER_LIMIT;
+}
+EXPORT_SYMBOL(memcg_memory_allocated_add);
+
+void memcg_memory_allocated_sub(struct mem_cgroup *memcg, struct cg_proto *prot,
+				unsigned long amt)
+{
+	res_counter_uncharge(prot->memory_allocated(memcg), amt << PAGE_SHIFT);
+}
+EXPORT_SYMBOL(memcg_memory_allocated_sub);
+
+u64 memcg_memory_allocated_read(struct mem_cgroup *memcg, struct cg_proto *prot)
+{
+	return res_counter_read_u64(prot->memory_allocated(memcg),
+				    RES_USAGE) >> PAGE_SHIFT ;
+}
+EXPORT_SYMBOL(memcg_memory_allocated_read);
+#endif /* CONFIG_INET */
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
+
+
 static void mem_cgroup_get(struct mem_cgroup *mem);
 static void mem_cgroup_put(struct mem_cgroup *mem);
 static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
@@ -872,12 +951,6 @@ static struct mem_cgroup *mem_cgroup_get_next(struct mem_cgroup *iter,
 #define for_each_mem_cgroup_all(iter) \
 	for_each_mem_cgroup_tree_cond(iter, NULL, true)
 
-
-static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
-{
-	return (mem == root_mem_cgroup);
-}
-
 void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
 {
 	struct mem_cgroup *mem;
diff --git a/net/core/sock.c b/net/core/sock.c
index 26bdb1c..b64d36a 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -111,6 +111,7 @@
 #include <linux/init.h>
 #include <linux/highmem.h>
 #include <linux/user_namespace.h>
+#include <linux/jump_label.h>
 
 #include <asm/uaccess.h>
 #include <asm/system.h>
@@ -141,6 +142,9 @@
 static struct lock_class_key af_family_keys[AF_MAX];
 static struct lock_class_key af_family_slock_keys[AF_MAX];
 
+struct jump_label_key cgroup_crap_enabled;
+EXPORT_SYMBOL(cgroup_crap_enabled);
+
 /*
  * Make lock validator output more readable. (we pre-construct these
  * strings build-time, so that runtime initialization of socket
@@ -1678,26 +1682,27 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
 	int amt = sk_mem_pages(size);
 	long allocated;
 	int *memory_pressure;
+	int parent_status = UNDER_LIMIT;
 
 	sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
 
 	memory_pressure = sk_memory_pressure(sk);
-	allocated = sk_memory_allocated_add(sk, amt);
+	allocated = sk_memory_allocated_add(sk, amt, &parent_status);
+
+	/* Over hard limit (we or our parents) */
+	if ((parent_status == OVER_LIMIT) || (allocated > sk_prot_mem(sk, 2)))
+		goto suppress_allocation;
 
 	/* Under limit. */
 	if (allocated <= sk_prot_mem(sk, 0))
 		if (memory_pressure && *memory_pressure)
 			*memory_pressure = 0;
 
-	/* Under pressure. */
-	if (allocated > sk_prot_mem(sk, 1))
+	/* Under pressure. (we or our parents) */
+	if ((parent_status == SOFT_LIMIT) || allocated > sk_prot_mem(sk, 1))
 		if (prot->enter_memory_pressure)
 			prot->enter_memory_pressure(sk);
 
-	/* Over hard limit. */
-	if (allocated > sk_prot_mem(sk, 2))
-		goto suppress_allocation;
-
 	/* guarantee minimum buffer size under pressure */
 	if (kind == SK_MEM_RECV) {
 		if (atomic_read(&sk->sk_rmem_alloc) < prot->sysctl_rmem[0])
@@ -1742,7 +1747,7 @@ suppress_allocation:
 	/* Alas. Undo changes. */
 	sk->sk_forward_alloc -= amt * SK_MEM_QUANTUM;
 
-	sk_memory_allocated_sub(sk, amt);
+	sk_memory_allocated_sub(sk, amt, parent_status);
 
 	return 0;
 }
@@ -1757,7 +1762,7 @@ void __sk_mem_reclaim(struct sock *sk)
 	int *memory_pressure = sk_memory_pressure(sk);
 
 	sk_memory_allocated_sub(sk,
-				sk->sk_forward_alloc >> SK_MEM_QUANTUM_SHIFT);
+				sk->sk_forward_alloc >> SK_MEM_QUANTUM_SHIFT, 0);
 	sk->sk_forward_alloc &= SK_MEM_QUANTUM - 1;
 
 	if (memory_pressure && *memory_pressure &&
@@ -2257,6 +2262,9 @@ EXPORT_SYMBOL(sk_common_release);
 static DEFINE_RWLOCK(proto_list_lock);
 static LIST_HEAD(proto_list);
 
+static DEFINE_RWLOCK(cg_proto_list_lock);
+static LIST_HEAD(cg_proto_list);
+
 #ifdef CONFIG_PROC_FS
 #define PROTO_INUSE_NR	64	/* should be enough for the first time */
 struct prot_inuse {
@@ -2358,6 +2366,16 @@ static inline void release_proto_idx(struct proto *prot)
 }
 #endif
 
+void cg_proto_register(struct cg_proto *prot, struct proto *parent)
+{
+	write_lock(&cg_proto_list_lock);
+	list_add(&prot->node, &cg_proto_list);
+	write_unlock(&cg_proto_list_lock);
+
+	parent->cg_proto = prot;
+}
+EXPORT_SYMBOL(cg_proto_register);
+
 int proto_register(struct proto *prot, int alloc_slab)
 {
 	if (alloc_slab) {
-- 
1.7.6.4


^ permalink raw reply related

* [PATCH v5 02/10] foundations of per-cgroup memory pressure controlling.
From: Glauber Costa @ 2011-11-07 15:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, kamezawa.hiroyu, ebiederm, davem, gthelen, netdev,
	linux-mm, kirill, avagin, devel, eric.dumazet, Glauber Costa
In-Reply-To: <1320679595-21074-1-git-send-email-glommer@parallels.com>

This patch replaces all direct uses of the struct proto fields
memory_pressure, memory_allocated, sockets_allocated, and sysctl_mem
with accessor functions. These accessors receive either a socket
argument or a mem_cgroup argument, depending on the context they live in.

Since this is only a thin wrapping, no performance impact at all is
expected in the case where cgroups are disabled.
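
The accessor pattern can be sketched in userspace as follows. This is a
cut-down model, not the kernel definitions: proto_model, sock_model and the
model_* functions are hypothetical, and model_under_pressure() is an
illustrative helper rather than something this patch adds.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down models of struct proto and struct sock. */
struct proto_model {
	long *sysctl_mem;        /* three thresholds: min, pressure, max */
	long *memory_allocated;  /* pages currently charged */
	int  *memory_pressure;   /* may be NULL when the protocol has no flag */
};

struct sock_model {
	struct proto_model *prot;
};

/* Accessor taking a socket: call sites stop touching the fields directly. */
long model_prot_mem(const struct sock_model *sk, int index)
{
	assert(index >= 0 && index < 3);
	return sk->prot->sysctl_mem[index];
}

long model_memory_allocated(const struct sock_model *sk)
{
	return *sk->prot->memory_allocated;
}

/* Returns 1 when the protocol has a pressure flag and it is set. */
int model_under_pressure(const struct sock_model *sk)
{
	const int *flag = sk->prot->memory_pressure;

	return flag != NULL && *flag != 0;
}
```

Because every caller goes through the accessor, a later patch can change
where the value comes from (for example a per-cgroup counter) in one place.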

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: David S. Miller <davem@davemloft.net>
CC: Hiroyuki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/memcontrol.h |    4 ++
 include/net/sock.h         |   86 ++++++++++++++++++++++++++++++++++++++-----
 include/net/tcp.h          |    3 +-
 net/core/sock.c            |   55 +++++++++++++++++-----------
 net/ipv4/proc.c            |    7 ++--
 net/ipv4/tcp_input.c       |   12 +++---
 net/ipv4/tcp_ipv4.c        |    4 +-
 net/ipv4/tcp_output.c      |    2 +-
 net/ipv4/tcp_timer.c       |    2 +-
 net/ipv6/tcp_ipv6.c        |    2 +-
 10 files changed, 130 insertions(+), 47 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ac797fa..e9ff93a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -362,6 +362,10 @@ static inline
 void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+static inline struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
+{
+	return NULL;
+}
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #if !defined(CONFIG_CGROUP_MEM_RES_CTLR) || !defined(CONFIG_DEBUG_VM)
diff --git a/include/net/sock.h b/include/net/sock.h
index c6658be..8959dcc 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -54,6 +54,7 @@
 #include <linux/security.h>
 #include <linux/slab.h>
 #include <linux/uaccess.h>
+#include <linux/cgroup.h>
 
 #include <linux/filter.h>
 #include <linux/rculist_nulls.h>
@@ -168,6 +169,8 @@ struct sock_common {
 	/* public: */
 };
 
+struct mem_cgroup;
+
 /**
   *	struct sock - network layer representation of sockets
   *	@__sk_common: shared layout with inet_timewait_sock
@@ -793,22 +796,21 @@ struct proto {
 	unsigned int		inuse_idx;
 #endif
 
-	/* Memory pressure */
-	void			(*enter_memory_pressure)(struct sock *sk);
-	atomic_long_t		*memory_allocated;	/* Current allocated memory. */
-	struct percpu_counter	*sockets_allocated;	/* Current number of sockets. */
+	void                    (*enter_memory_pressure)(struct sock *sk);
+	atomic_long_t           *memory_allocated;      /* Current allocated memory. */
+	struct percpu_counter   *sockets_allocated;     /* Current number of sockets. */
 	/*
 	 * Pressure flag: try to collapse.
 	 * Technical note: it is used by multiple contexts non atomically.
 	 * All the __sk_mem_schedule() is of this nature: accounting
 	 * is strict, actions are advisory and have some latency.
 	 */
-	int			*memory_pressure;
-	long			*sysctl_mem;
-	int			*sysctl_wmem;
-	int			*sysctl_rmem;
-	int			max_header;
-	bool			no_autobind;
+	int                     *memory_pressure;
+	long                    *sysctl_mem;
+	int                     *sysctl_wmem;
+	int                     *sysctl_rmem;
+	int                     max_header;
+	bool                    no_autobind;
 
 	struct kmem_cache	*slab;
 	unsigned int		obj_size;
@@ -863,6 +865,70 @@ static inline void sk_refcnt_debug_release(const struct sock *sk)
 #define sk_refcnt_debug_release(sk) do { } while (0)
 #endif /* SOCK_REFCNT_DEBUG */
 
+#include <linux/memcontrol.h>
+static inline int *sk_memory_pressure(const struct sock *sk)
+{
+	return sk->sk_prot->memory_pressure;
+}
+
+static inline long sk_prot_mem(const struct sock *sk, int index)
+{
+	long *prot = sk->sk_prot->sysctl_mem;
+	return prot[index];
+}
+
+static inline long
+sk_memory_allocated(const struct sock *sk)
+{
+	struct proto *prot = sk->sk_prot;
+	return atomic_long_read(prot->memory_allocated);
+}
+
+static inline long
+sk_memory_allocated_add(struct sock *sk, int amt)
+{
+	struct proto *prot = sk->sk_prot;
+	return atomic_long_add_return(amt, prot->memory_allocated);
+}
+
+static inline void
+sk_memory_allocated_sub(struct sock *sk, int amt)
+{
+	struct proto *prot = sk->sk_prot;
+	atomic_long_sub(amt, prot->memory_allocated);
+}
+
+static inline void sk_sockets_allocated_dec(struct sock *sk)
+{
+	struct proto *prot = sk->sk_prot;
+	percpu_counter_dec(prot->sockets_allocated);
+}
+
+static inline void sk_sockets_allocated_inc(struct sock *sk)
+{
+	struct proto *prot = sk->sk_prot;
+	percpu_counter_inc(prot->sockets_allocated);
+}
+
+static inline int
+sk_sockets_allocated_read_positive(struct sock *sk)
+{
+	struct proto *prot = sk->sk_prot;
+
+	return percpu_counter_sum_positive(prot->sockets_allocated);
+}
+
+static inline int
+kcg_sockets_allocated_sum_positive(struct proto *prot, struct mem_cgroup *cg)
+{
+	return percpu_counter_sum_positive(prot->sockets_allocated);
+}
+
+static inline long
+kcg_memory_allocated(struct proto *prot, struct mem_cgroup *cg)
+{
+	return atomic_long_read(prot->memory_allocated);
+}
 
 #ifdef CONFIG_PROC_FS
 /* Called with local bh disabled */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index e147f42..ccaa3b6 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -44,6 +44,7 @@
 #include <net/dst.h>
 
 #include <linux/seq_file.h>
+#include <linux/memcontrol.h>
 
 extern struct inet_hashinfo tcp_hashinfo;
 
@@ -285,7 +286,7 @@ static inline bool tcp_too_many_orphans(struct sock *sk, int shift)
 	}
 
 	if (sk->sk_wmem_queued > SOCK_MIN_SNDBUF &&
-	    atomic_long_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])
+	    sk_memory_allocated(sk) > sk_prot_mem(sk, 2))
 		return true;
 	return false;
 }
diff --git a/net/core/sock.c b/net/core/sock.c
index 4ed7b1d..26bdb1c 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1288,7 +1288,7 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
 		newsk->sk_wq = NULL;
 
 		if (newsk->sk_prot->sockets_allocated)
-			percpu_counter_inc(newsk->sk_prot->sockets_allocated);
+			sk_sockets_allocated_inc(newsk);
 
 		if (sock_flag(newsk, SOCK_TIMESTAMP) ||
 		    sock_flag(newsk, SOCK_TIMESTAMPING_RX_SOFTWARE))
@@ -1677,30 +1677,32 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
 	struct proto *prot = sk->sk_prot;
 	int amt = sk_mem_pages(size);
 	long allocated;
+	int *memory_pressure;
 
 	sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
-	allocated = atomic_long_add_return(amt, prot->memory_allocated);
+
+	memory_pressure = sk_memory_pressure(sk);
+	allocated = sk_memory_allocated_add(sk, amt);
 
 	/* Under limit. */
-	if (allocated <= prot->sysctl_mem[0]) {
-		if (prot->memory_pressure && *prot->memory_pressure)
-			*prot->memory_pressure = 0;
-		return 1;
-	}
+	if (allocated <= sk_prot_mem(sk, 0))
+		if (memory_pressure && *memory_pressure)
+			*memory_pressure = 0;
 
 	/* Under pressure. */
-	if (allocated > prot->sysctl_mem[1])
+	if (allocated > sk_prot_mem(sk, 1))
 		if (prot->enter_memory_pressure)
 			prot->enter_memory_pressure(sk);
 
 	/* Over hard limit. */
-	if (allocated > prot->sysctl_mem[2])
+	if (allocated > sk_prot_mem(sk, 2))
 		goto suppress_allocation;
 
 	/* guarantee minimum buffer size under pressure */
 	if (kind == SK_MEM_RECV) {
 		if (atomic_read(&sk->sk_rmem_alloc) < prot->sysctl_rmem[0])
 			return 1;
+
 	} else { /* SK_MEM_SEND */
 		if (sk->sk_type == SOCK_STREAM) {
 			if (sk->sk_wmem_queued < prot->sysctl_wmem[0])
@@ -1710,13 +1712,13 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
 				return 1;
 	}
 
-	if (prot->memory_pressure) {
+	if (memory_pressure) {
 		int alloc;
 
-		if (!*prot->memory_pressure)
+		if (!*memory_pressure)
 			return 1;
-		alloc = percpu_counter_read_positive(prot->sockets_allocated);
-		if (prot->sysctl_mem[2] > alloc *
+		alloc = sk_sockets_allocated_read_positive(sk);
+		if (sk_prot_mem(sk, 2) > alloc *
 		    sk_mem_pages(sk->sk_wmem_queued +
 				 atomic_read(&sk->sk_rmem_alloc) +
 				 sk->sk_forward_alloc))
@@ -1739,7 +1741,9 @@ suppress_allocation:
 
 	/* Alas. Undo changes. */
 	sk->sk_forward_alloc -= amt * SK_MEM_QUANTUM;
-	atomic_long_sub(amt, prot->memory_allocated);
+
+	sk_memory_allocated_sub(sk, amt);
+
 	return 0;
 }
 EXPORT_SYMBOL(__sk_mem_schedule);
@@ -1750,15 +1754,15 @@ EXPORT_SYMBOL(__sk_mem_schedule);
  */
 void __sk_mem_reclaim(struct sock *sk)
 {
-	struct proto *prot = sk->sk_prot;
+	int *memory_pressure = sk_memory_pressure(sk);
 
-	atomic_long_sub(sk->sk_forward_alloc >> SK_MEM_QUANTUM_SHIFT,
-		   prot->memory_allocated);
+	sk_memory_allocated_sub(sk,
+				sk->sk_forward_alloc >> SK_MEM_QUANTUM_SHIFT);
 	sk->sk_forward_alloc &= SK_MEM_QUANTUM - 1;
 
-	if (prot->memory_pressure && *prot->memory_pressure &&
-	    (atomic_long_read(prot->memory_allocated) < prot->sysctl_mem[0]))
-		*prot->memory_pressure = 0;
+	if (memory_pressure && *memory_pressure &&
+	    (sk_memory_allocated(sk) < sk_prot_mem(sk, 0)))
+		*memory_pressure = 0;
 }
 EXPORT_SYMBOL(__sk_mem_reclaim);
 
@@ -2477,13 +2481,20 @@ static char proto_method_implemented(const void *method)
 
 static void proto_seq_printf(struct seq_file *seq, struct proto *proto)
 {
+	struct mem_cgroup *cg = mem_cgroup_from_task(current);
+	int *memory_pressure = NULL;
+
+	if (proto->memory_pressure)
+		memory_pressure = proto->memory_pressure;
+
 	seq_printf(seq, "%-9s %4u %6d  %6ld   %-3s %6u   %-3s  %-10s "
 			"%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
 		   proto->name,
 		   proto->obj_size,
 		   sock_prot_inuse_get(seq_file_net(seq), proto),
-		   proto->memory_allocated != NULL ? atomic_long_read(proto->memory_allocated) : -1L,
-		   proto->memory_pressure != NULL ? *proto->memory_pressure ? "yes" : "no" : "NI",
+		   proto->memory_allocated != NULL ?
+			kcg_memory_allocated(proto, cg) : -1L,
+		   memory_pressure != NULL ? *memory_pressure ? "yes" : "no" : "NI",
 		   proto->max_header,
 		   proto->slab == NULL ? "no" : "yes",
 		   module_name(proto->owner),
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 4bfad5d..535456d 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -52,20 +52,21 @@ static int sockstat_seq_show(struct seq_file *seq, void *v)
 {
 	struct net *net = seq->private;
 	int orphans, sockets;
+	struct mem_cgroup *cg = mem_cgroup_from_task(current);
 
 	local_bh_disable();
 	orphans = percpu_counter_sum_positive(&tcp_orphan_count);
-	sockets = percpu_counter_sum_positive(&tcp_sockets_allocated);
+	sockets = kcg_sockets_allocated_sum_positive(&tcp_prot, cg);
 	local_bh_enable();
 
 	socket_seq_show(seq);
 	seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %ld\n",
 		   sock_prot_inuse_get(net, &tcp_prot), orphans,
 		   tcp_death_row.tw_count, sockets,
-		   atomic_long_read(&tcp_memory_allocated));
+		   kcg_memory_allocated(&tcp_prot, cg));
 	seq_printf(seq, "UDP: inuse %d mem %ld\n",
 		   sock_prot_inuse_get(net, &udp_prot),
-		   atomic_long_read(&udp_memory_allocated));
+		   kcg_memory_allocated(&udp_prot, cg));
 	seq_printf(seq, "UDPLITE: inuse %d\n",
 		   sock_prot_inuse_get(net, &udplite_prot));
 	seq_printf(seq, "RAW: inuse %d\n",
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 52b5c2d..3df862d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -322,7 +322,7 @@ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
 	/* Check #1 */
 	if (tp->rcv_ssthresh < tp->window_clamp &&
 	    (int)tp->rcv_ssthresh < tcp_space(sk) &&
-	    !tcp_memory_pressure) {
+	    !sk_memory_pressure(sk)) {
 		int incr;
 
 		/* Check #2. Increase window, if skb with such overhead
@@ -411,8 +411,8 @@ static void tcp_clamp_window(struct sock *sk)
 
 	if (sk->sk_rcvbuf < sysctl_tcp_rmem[2] &&
 	    !(sk->sk_userlocks & SOCK_RCVBUF_LOCK) &&
-	    !tcp_memory_pressure &&
-	    atomic_long_read(&tcp_memory_allocated) < sysctl_tcp_mem[0]) {
+	    !sk_memory_pressure(sk) &&
+	    sk_memory_allocated(sk) < sk_prot_mem(sk, 0)) {
 		sk->sk_rcvbuf = min(atomic_read(&sk->sk_rmem_alloc),
 				    sysctl_tcp_rmem[2]);
 	}
@@ -4864,7 +4864,7 @@ static int tcp_prune_queue(struct sock *sk)
 
 	if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
 		tcp_clamp_window(sk);
-	else if (tcp_memory_pressure)
+	else if (sk_memory_pressure(sk))
 		tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss);
 
 	tcp_collapse_ofo_queue(sk);
@@ -4930,11 +4930,11 @@ static int tcp_should_expand_sndbuf(const struct sock *sk)
 		return 0;
 
 	/* If we are under global TCP memory pressure, do not expand.  */
-	if (tcp_memory_pressure)
+	if (sk_memory_pressure(sk))
 		return 0;
 
 	/* If we are under soft global TCP memory pressure, do not expand.  */
-	if (atomic_long_read(&tcp_memory_allocated) >= sysctl_tcp_mem[0])
+	if (sk_memory_allocated(sk) >= sk_prot_mem(sk, 0))
 		return 0;
 
 	/* If we filled the congestion window, do not expand.  */
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 0ea10ee..f124a4b 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1914,7 +1914,7 @@ static int tcp_v4_init_sock(struct sock *sk)
 	sk->sk_rcvbuf = sysctl_tcp_rmem[1];
 
 	local_bh_disable();
-	percpu_counter_inc(&tcp_sockets_allocated);
+	sk_sockets_allocated_inc(sk);
 	local_bh_enable();
 
 	return 0;
@@ -1970,7 +1970,7 @@ void tcp_v4_destroy_sock(struct sock *sk)
 		tp->cookie_values = NULL;
 	}
 
-	percpu_counter_dec(&tcp_sockets_allocated);
+	sk_sockets_allocated_dec(sk);
 }
 EXPORT_SYMBOL(tcp_v4_destroy_sock);
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 980b98f..04e229b 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1919,7 +1919,7 @@ u32 __tcp_select_window(struct sock *sk)
 	if (free_space < (full_space >> 1)) {
 		icsk->icsk_ack.quick = 0;
 
-		if (tcp_memory_pressure)
+		if (sk_memory_pressure(sk))
 			tp->rcv_ssthresh = min(tp->rcv_ssthresh,
 					       4U * tp->advmss);
 
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 2e0f0af..c9f830c 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -261,7 +261,7 @@ static void tcp_delack_timer(unsigned long data)
 	}
 
 out:
-	if (tcp_memory_pressure)
+	if (sk_memory_pressure(sk))
 		sk_mem_reclaim(sk);
 out_unlock:
 	bh_unlock_sock(sk);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 10b2b31..3a08fcd 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1995,7 +1995,7 @@ static int tcp_v6_init_sock(struct sock *sk)
 	sk->sk_rcvbuf = sysctl_tcp_rmem[1];
 
 	local_bh_disable();
-	percpu_counter_inc(&tcp_sockets_allocated);
+	sk_sockets_allocated_inc(sk);
 	local_bh_enable();
 
 	return 0;
-- 
1.7.6.4
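
The hunks above replace reads of the global tcp_memory_pressure and
tcp_memory_allocated with per-socket accessors. A minimal userspace sketch of
the two checks changed in tcp_should_expand_sndbuf() follows. The struct and
the model_* helpers are hypothetical stand-ins, not the kernel definitions of
sk_memory_pressure(), sk_memory_allocated() or sk_prot_mem(), and the other
checks the real function performs are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the state the per-socket accessors consult. */
struct sock_model {
	bool pressure;		/* is the owning group under memory pressure? */
	long allocated;		/* pages currently accounted to the group */
	long prot_mem[3];	/* low / pressure / high thresholds */
};

static bool model_memory_pressure(const struct sock_model *sk)
{
	return sk->pressure;
}

static long model_memory_allocated(const struct sock_model *sk)
{
	return sk->allocated;
}

static long model_prot_mem(const struct sock_model *sk, int index)
{
	assert(index >= 0 && index < 3);
	return sk->prot_mem[index];
}

/* Only the two memory checks touched by the hunk are modelled here. */
static bool model_may_expand_sndbuf(const struct sock_model *sk)
{
	/* Under memory pressure: do not expand. */
	if (model_memory_pressure(sk))
		return false;

	/* At or above the soft threshold: do not expand. */
	if (model_memory_allocated(sk) >= model_prot_mem(sk, 0))
		return false;

	return true;
}
```

Because each check takes the socket as its argument, the answer can differ
between sockets that belong to different groups, which a single global
variable cannot express.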

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH v5 01/10] Basic kernel memory functionality for the Memory Controller
From: Glauber Costa @ 2011-11-07 15:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, kamezawa.hiroyu, ebiederm, davem, gthelen, netdev,
	linux-mm, kirill, avagin, devel, eric.dumazet, Glauber Costa
In-Reply-To: <1320679595-21074-1-git-send-email-glommer@parallels.com>

This patch lays down the foundation for the kernel memory component
of the Memory Controller.

As of today, I am only laying down the following files:

 * memory.independent_kmem_limit
 * memory.kmem.limit_in_bytes (currently ignored)
 * memory.kmem.usage_in_bytes (always zero)
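
As a rough illustration of how memory.independent_kmem_limit is meant to
interact with the reported usage, here is a userspace sketch modelled on the
mem_cgroup_usage() and kmem_limit_independent_write() changes in this patch.
The struct and the model_* names are hypothetical, locking is omitted, and
only the non-root, non-swap case is shown.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct mem_cgroup. */
struct memcg_model {
	struct memcg_model *parent;
	int use_hierarchy;
	int kmem_independent;		/* memory.independent_kmem_limit */
	unsigned long long res_usage;	/* user memory usage */
	unsigned long long kmem_usage;	/* kernel memory usage */
};

/* Usage as reported by memory.usage_in_bytes for a non-root group. */
static unsigned long long model_usage(const struct memcg_model *m)
{
	unsigned long long val = 0;

	assert(m != NULL);
	/* Kernel memory is folded in unless it is accounted independently. */
	if (!m->kmem_independent)
		val = m->kmem_usage;
	return val + m->res_usage;
}

/* Write handler: a child in a hierarchy must agree with its parent. */
static int model_set_independent(struct memcg_model *m, unsigned long long val)
{
	int flag = !!val;	/* any non-zero value means "independent" */

	if (m->parent && m->parent->use_hierarchy &&
	    flag != m->parent->kmem_independent)
		return -EINVAL;
	m->kmem_independent = flag;
	return 0;
}
```

In this patch the kmem counter is never charged, so the folded-in value is
always zero; the sketch uses non-zero numbers only to make the difference
visible.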

Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
CC: Paul Menage <paul@paulmenage.org>
CC: Greg Thelen <gthelen@google.com>
---
 Documentation/cgroups/memory.txt |   36 ++++++++++++-
 init/Kconfig                     |   14 +++++
 mm/memcontrol.c                  |  104 ++++++++++++++++++++++++++++++++++++--
 3 files changed, 147 insertions(+), 7 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 06eb6d9..bf00cd2 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -44,8 +44,9 @@ Features:
  - oom-killer disable knob and oom-notifier
  - Root cgroup has no limit controls.
 
- Kernel memory and Hugepages are not under control yet. We just manage
- pages on LRU. To add more controls, we have to take care of performance.
+ Hugepages are not under control yet. We just manage pages on LRU. To add more
+ controls, we have to take care of performance. Kernel memory support is a work
+ in progress, and the current version provides only basic functionality.
 
 Brief summary of control files.
 
@@ -56,8 +57,11 @@ Brief summary of control files.
 				 (See 5.5 for details)
  memory.memsw.usage_in_bytes	 # show current res_counter usage for memory+Swap
 				 (See 5.5 for details)
+ memory.kmem.usage_in_bytes	 # show current res_counter usage for kmem only.
+				 (See 2.7 for details)
  memory.limit_in_bytes		 # set/show limit of memory usage
  memory.memsw.limit_in_bytes	 # set/show limit of memory+Swap usage
+ memory.kmem.limit_in_bytes	 # if allowed, set/show limit of kernel memory
  memory.failcnt			 # show the number of memory usage hits limits
  memory.memsw.failcnt		 # show the number of memory+Swap hits limits
  memory.max_usage_in_bytes	 # show max memory usage recorded
@@ -72,6 +76,9 @@ Brief summary of control files.
  memory.oom_control		 # set/show oom controls.
  memory.numa_stat		 # show the number of memory usage per numa node
 
+ memory.independent_kmem_limit	 # select whether or not kernel memory limits are
+				   independent of user limits
+
 1. History
 
 The memory controller has a long history. A request for comments for the memory
@@ -255,6 +262,31 @@ When oom event notifier is registered, event will be delivered.
   per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
   zone->lru_lock, it has no lock of its own.
 
+2.7 Kernel Memory Extension (CONFIG_CGROUP_MEM_RES_CTLR_KMEM)
+
+ With the Kernel memory extension, the Memory Controller is able to limit
+the amount of kernel memory used by the system. Kernel memory is fundamentally
+different from user memory, since it can't be swapped out, which makes it
+possible to DoS the system by consuming too much of this precious resource.
+Kernel memory limits are not imposed for the root cgroup.
+
+Memory limits as specified by the standard Memory Controller may or may not
+take kernel memory into consideration. This is achieved through the file
+memory.independent_kmem_limit. A value other than 0 will allow for kernel
+memory to be controlled separately.
+
+When kernel memory limits are not independent, the limit values set in
+memory.kmem files are ignored.
+
+Currently no soft limit is implemented for kernel memory. It is future work
+to trigger slab reclaim when those limits are reached.
+
+CAUTION: As of this writing, the kmem extension may prevent tasks from moving
+among cgroups. If a task has kmem accounting in a cgroup, the task cannot be
+moved until the kmem resource is released. Also, until the resource is fully
+released, the cgroup cannot be destroyed. So, please consider your use cases
+and set the kmem extension config option carefully.
+
 3. User Interface
 
 0. Configuration
diff --git a/init/Kconfig b/init/Kconfig
index 31ba0fd..e4b6246 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -689,6 +689,20 @@ config CGROUP_MEM_RES_CTLR_SWAP_ENABLED
 	  For those who want to have the feature enabled by default should
 	  select this option (if, for some reason, they need to disable it
 	  then swapaccount=0 does the trick).
+config CGROUP_MEM_RES_CTLR_KMEM
+	bool "Memory Resource Controller Kernel Memory accounting (EXPERIMENTAL)"
+	depends on CGROUP_MEM_RES_CTLR && EXPERIMENTAL
+	default n
+	help
+	  The Kernel Memory extension for Memory Resource Controller can limit
+	  the amount of memory used by kernel objects in the system. Those are
+	  fundamentally different from the entities handled by the standard
+	  Memory Controller, which are page-based, and can be swapped. Users of
+	  the kmem extension can use it to guarantee that no group of processes
+	  will ever exhaust kernel resources alone.
+
+	  WARNING: The current experimental implementation does not allow a
+	  task to move among different cgroups with a kmem resource being held.
 
 config CGROUP_PERF
 	bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2d57555..3389d33 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -226,6 +226,10 @@ struct mem_cgroup {
 	 */
 	struct res_counter memsw;
 	/*
+	 * the counter to account for kmem usage.
+	 */
+	struct res_counter kmem;
+	/*
 	 * Per cgroup active and inactive list, similar to the
 	 * per zone LRU lists.
 	 */
@@ -276,6 +280,11 @@ struct mem_cgroup {
 	 */
 	unsigned long 	move_charge_at_immigrate;
 	/*
+	 * Should kernel memory limits be established independently
+	 * from user memory?
+	 */
+	int		kmem_independent_accounting;
+	/*
 	 * percpu counter.
 	 */
 	struct mem_cgroup_stat_cpu *stat;
@@ -343,9 +352,14 @@ enum charge_type {
 };
 
 /* for encoding cft->private value on file */
-#define _MEM			(0)
-#define _MEMSWAP		(1)
-#define _OOM_TYPE		(2)
+
+enum mem_type {
+	_MEM = 0,
+	_MEMSWAP,
+	_OOM_TYPE,
+	_KMEM,
+};
+
 #define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
 #define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
 #define MEMFILE_ATTR(val)	((val) & 0xffff)
@@ -3838,10 +3852,15 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *mem, bool swap)
 	u64 val;
 
 	if (!mem_cgroup_is_root(mem)) {
+		val = 0;
+		if (!mem->kmem_independent_accounting)
+			val = res_counter_read_u64(&mem->kmem, RES_USAGE);
 		if (!swap)
-			return res_counter_read_u64(&mem->res, RES_USAGE);
+			val += res_counter_read_u64(&mem->res, RES_USAGE);
 		else
-			return res_counter_read_u64(&mem->memsw, RES_USAGE);
+			val += res_counter_read_u64(&mem->memsw, RES_USAGE);
+
+		return val;
 	}
 
 	val = mem_cgroup_recursive_stat(mem, MEM_CGROUP_STAT_CACHE);
@@ -3874,6 +3893,10 @@ static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
 		else
 			val = res_counter_read_u64(&mem->memsw, name);
 		break;
+	case _KMEM:
+		val = res_counter_read_u64(&mem->kmem, name);
+		break;
+
 	default:
 		BUG();
 		break;
@@ -4604,6 +4627,35 @@ static int mem_control_numa_stat_open(struct inode *unused, struct file *file)
 }
 #endif /* CONFIG_NUMA */
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+static u64 kmem_limit_independent_read(struct cgroup *cgroup, struct cftype *cft)
+{
+	return mem_cgroup_from_cont(cgroup)->kmem_independent_accounting;
+}
+
+static int kmem_limit_independent_write(struct cgroup *cgroup, struct cftype *cft,
+					u64 val)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgroup);
+	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
+
+	val = !!val;
+
+	if (parent && parent->use_hierarchy &&
+	   (val != parent->kmem_independent_accounting))
+		return -EINVAL;
+	/*
+	 * TODO: We need to handle the case in which we are doing
+	 * independent kmem accounting as authorized by our parent,
+	 * but then our parent changes its parameter.
+	 */
+	cgroup_lock();
+	memcg->kmem_independent_accounting = val;
+	cgroup_unlock();
+	return 0;
+}
+#endif
+
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -4719,6 +4771,42 @@ static int register_memsw_files(struct cgroup *cont, struct cgroup_subsys *ss)
 }
 #endif
 
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+static struct cftype kmem_cgroup_files[] = {
+	{
+		.name = "independent_kmem_limit",
+		.read_u64 = kmem_limit_independent_read,
+		.write_u64 = kmem_limit_independent_write,
+	},
+	{
+		.name = "kmem.usage_in_bytes",
+		.private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
+		.read_u64 = mem_cgroup_read,
+	},
+	{
+		.name = "kmem.limit_in_bytes",
+		.private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
+		.read_u64 = mem_cgroup_read,
+	},
+};
+
+static int register_kmem_files(struct cgroup *cont, struct cgroup_subsys *ss)
+{
+	int ret = 0;
+
+	ret = cgroup_add_files(cont, ss, kmem_cgroup_files,
+			       ARRAY_SIZE(kmem_cgroup_files));
+	return ret;
+}
+
+#else
+static int register_kmem_files(struct cgroup *cont, struct cgroup_subsys *ss)
+{
+	return 0;
+}
+#endif
+
 static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
 {
 	struct mem_cgroup_per_node *pn;
@@ -4917,6 +5005,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	if (parent && parent->use_hierarchy) {
 		res_counter_init(&mem->res, &parent->res);
 		res_counter_init(&mem->memsw, &parent->memsw);
+		res_counter_init(&mem->kmem, &parent->kmem);
 		/*
 		 * We increment refcnt of the parent to ensure that we can
 		 * safely access it on res_counter_charge/uncharge.
@@ -4927,6 +5016,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	} else {
 		res_counter_init(&mem->res, NULL);
 		res_counter_init(&mem->memsw, NULL);
+		res_counter_init(&mem->kmem, NULL);
 	}
 	mem->last_scanned_child = 0;
 	mem->last_scanned_node = MAX_NUMNODES;
@@ -4970,6 +5060,10 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss,
 
 	if (!ret)
 		ret = register_memsw_files(cont, ss);
+
+	if (!ret)
+		ret = register_kmem_files(cont, ss);
+
 	return ret;
 }
 
-- 
1.7.6.4
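
For readers following the new _KMEM case in mem_cgroup_read(), the sketch
below shows how a (type, attribute) pair is packed into cft->private and
unpacked again. The three MEMFILE_* macros are copied from the patch context.
The enumerator names are renamed here (TYPE_* instead of _MEM, _KMEM and so
on) to stay clear of reserved identifiers in userspace C, and the RES_*
values are assumptions standing in for the res_counter header.

```c
#include <assert.h>

/* Assumed stand-ins for the res_counter member ids. */
enum { RES_USAGE = 0, RES_MAX_USAGE, RES_LIMIT };

/* Same ordering as the enum mem_type added by the patch, renamed. */
enum mem_type {
	TYPE_MEM = 0,
	TYPE_MEMSWAP,
	TYPE_OOM,
	TYPE_KMEM,
};

/* Copied from mm/memcontrol.c as shown in the patch context. */
#define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
#define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
#define MEMFILE_ATTR(val)	((val) & 0xffff)

/* Pack a (type, attribute) pair: type in the upper 16 bits, attr below. */
static int pack_private(enum mem_type type, int attr)
{
	assert(attr >= 0 && attr <= 0xffff);
	return MEMFILE_PRIVATE(type, attr);
}
```

With these stand-in values, the kmem.limit_in_bytes entry would carry
pack_private(TYPE_KMEM, RES_LIMIT); the read handler recovers the type with
MEMFILE_TYPE() to pick the counter and the attribute with MEMFILE_ATTR() to
pick the field.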

^ permalink raw reply related

* [PATCH v5 00/10] per-cgroup tcp memory pressure
From: Glauber Costa @ 2011-11-07 15:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, kamezawa.hiroyu, ebiederm, davem, gthelen, netdev,
	linux-mm, kirill, avagin, devel, eric.dumazet

Hi all,

This is my new attempt at implementing per-cgroup tcp memory pressure.
I am particularly interested in what the network folks have to say about
it: my main goal is to achieve the least impact possible in the network code.

Here's a brief description of my approach:

When only the root cgroup is present, the code should behave the same way as
before - with the exception of the inclusion of an extra field in struct sock,
and one in struct proto. All tests are patched out with static branches, and we
still access addresses directly - the same as we did before.

When a cgroup other than root is created, we patch in the branches, and account
resources for that cgroup. The variables in the root cgroup are still updated.
If we were to try to be 100% coherent with the memcg code, that should depend
on use_hierarchy. However, I feel that this is a good compromise in terms of
leaving the network code untouched, and still having a global vision of its
resources. I also do not compute max_usage for the root cgroup, for a similar
reason.
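
The accounting scheme described above can be modelled in userspace roughly as
follows. This is a sketch under stated assumptions, not the code from the
series: a plain bool stands in for the static branch, and every name in it
is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the static branch: off while only the root cgroup exists. */
static bool cgroup_accounting_enabled;

/* Global (root) counter, always updated, as before the series. */
static long root_allocated;

/* Hypothetical per-cgroup counter. */
struct cg_model {
	long allocated;
};

/* Called when the first non-root cgroup is created ("patch in"). */
static void model_cgroup_created(void)
{
	cgroup_accounting_enabled = true;
}

/* Charge pages: the root counter always moves, the cgroup only if enabled. */
static void model_charge(struct cg_model *cg, long pages)
{
	assert(pages >= 0);
	root_allocated += pages;
	if (cgroup_accounting_enabled && cg != NULL)
		cg->allocated += pages;
}
```

In the kernel the test on the flag would be a patched-out jump rather than a
load and compare, which is what keeps the root-only case as cheap as the
existing code.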

Please let me know what you think of it.

Glauber Costa (10):
  Basic kernel memory functionality for the Memory Controller
  foundations of per-cgroup memory pressure controlling.
  socket: initial cgroup code.
  per-cgroup tcp buffers control
  per-netns ipv4 sysctl_tcp_mem
  tcp buffer limitation: per-cgroup limit
  Display current tcp memory allocation in kmem cgroup
  Display current tcp memory allocation in kmem cgroup
  Display current tcp memory allocation in kmem cgroup
  Disable task moving when using kernel memory accounting

 Documentation/cgroups/memory.txt |   38 +++-
 include/linux/memcontrol.h       |   40 ++++
 include/net/netns/ipv4.h         |    1 +
 include/net/sock.h               |  231 +++++++++++++++++++-
 include/net/tcp.h                |   22 ++-
 include/net/transp_v6.h          |    1 +
 init/Kconfig                     |   14 ++
 mm/memcontrol.c                  |  442 ++++++++++++++++++++++++++++++++++++-
 net/core/sock.c                  |  121 ++++++++---
 net/ipv4/af_inet.c               |    5 +
 net/ipv4/proc.c                  |    7 +-
 net/ipv4/sysctl_net_ipv4.c       |   65 +++++-
 net/ipv4/tcp.c                   |   11 +-
 net/ipv4/tcp_input.c             |   12 +-
 net/ipv4/tcp_ipv4.c              |   17 ++-
 net/ipv4/tcp_output.c            |    2 +-
 net/ipv4/tcp_timer.c             |    2 +-
 net/ipv6/af_inet6.c              |    5 +
 net/ipv6/tcp_ipv6.c              |   13 +-
 19 files changed, 962 insertions(+), 87 deletions(-)

-- 
1.7.6.4

^ permalink raw reply

* Re: [PATCH net-next 0/2] 802.1ad S-VLAN support
From: Ben Hutchings @ 2011-11-07 15:11 UTC (permalink / raw)
  To: David Lamparter; +Cc: netdev
In-Reply-To: <1320512055-1231037-1-git-send-email-equinox@diac24.net>

On Sat, 2011-11-05 at 17:54 +0100, David Lamparter wrote:
> Hi DaveM, hi everyone,
> 
> 
> this kernel patch, together with the iproute2 userspace support,
> allows creating 802.1ad S-VLAN devices.
> 
> This feature might have weird interactions with hardware VLAN
> acceleration. I've done my best to make sure it doesn't break
> 802.1Q, but my access to hardware is rather limited. I did grep
> & scan all drivers for maybe-affected vlan behaviour and found
> nothing. I've tested on e1000, forcedeth, virtio and a Kirkwood
> ARM.

I didn't try it at all, but it looks reasonable to me.

We definitely need to think about how MTU/MRU are configured when
multiple VLAN tags are used, though I don't think it's essential to do
before this goes in.  To be slightly more blunt than your documentation,
our current handling of MTU/MRU and VLANs is a botch.

Do you have any plan to improve that?  Or to allow use of offload
features for multiple-tagged packets?

Ben.

> It'd be nice to get this into the next merge window to get some
> people with funny hardware a nice smoke trail...

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

