Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH 08/11] netlink: implement memory mapped sendmsg()
From: Michał Mirosław @ 2011-09-04 16:18 UTC (permalink / raw)
  To: kaber; +Cc: davem, netfilter-devel, netdev
In-Reply-To: <1315070771-18576-9-git-send-email-kaber@trash.net>

On Sat, Sep 03, 2011 at 07:26:08PM +0200, kaber@trash.net wrote:
> From: Patrick McHardy <kaber@trash.net>
> 
> Add support for memory mapped sendmsg() to netlink. Userspace queued to
> be processed frames into the TX ring and invokes sendmsg with
> msg.iov.iov_base = NULL to trigger processing of all pending messages.
> 
> Since the kernel usually performs full message validation before beginning
> processing, userspace must be prevented from modifying the message
> contents while the kernel is processing them. In order to do so, the
> frames contents are copied to an allocated skb in case the the ring is
> mapped more than once or the file descriptor is shared (f.i. through
> AF_UNIX file descriptor passing).
> 
> Otherwise an skb without a data area is allocated, the data pointer set
> to point to the data area of the ring frame and the skb is processed.
> Once the skb is freed, the destructor releases the frame back to userspace
> by setting the status to NL_MMAP_STATUS_UNUSED.

Is this protected from threads? Like: one thread waits on sendmsg() and
another (same process) changes the buffer.

Best Regards,
Michał Mirosław
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 1/2] bridge: leave carrier on for empty bridge
From: Stephen Hemminger @ 2011-09-04 16:36 UTC (permalink / raw)
  To: Nicolas de Pesloüan; +Cc: David S. Miller, netdev
In-Reply-To: <4E632A2E.5040805@gmail.com>

On Sun, 04 Sep 2011 09:35:10 +0200
Nicolas de Pesloüan <nicolas.2p.debian@gmail.com> wrote:

> Le 04/09/2011 06:14, Stephen Hemminger a écrit :
> 
> >> Instead of asserting carrier when the bridge have no port, can't we assert carrier when the three
> >> following condition are true at the same time :
> >>
> >> - The bridge have no port.
> >> - At least one IP address is setup on the bridge.
> >> - The two above conditions are true for more than a configurable amount of seconds, with a default
> >> of 10, for example.
> >>
> >> This would only delay carrier on for a few seconds for the regression and keep the current behavior
> >> (carrier off until at least 1 port is on) for DHCP.
> >
> > This fails on two counts:
> > 1. Bridge's often run without IP addresses!
> > 2. DHCP won't try and send out request until carrier is true.
> 
> Sorry, I missed to say that we should of course also assert carrier on if one port has carrier on.
> 
> And rethinking about it, the delay is probably useless :
> 
> bridge_carrier_on = at_least_one_port_has_carrier_on | (bridge_has_no_port & bridge_has_at_least_one_ip)
> 
> That way :
> - for those using bridge without any port, manually setting the IP will assert carrier on. (By the 
> way, why don't they use a dummy device instead?)
> 
> - for those using bridge with ports:
> -- Using any kind of autoconfig will work as expected. Carrier will only be asserted at the time 
> first port get carrier.
> -- Using static IP confifiguration, carrier will possibly be erroneously reported as on during the 
> small time gap between IP address configuration and first port is added to the bridge. This time gap 
> may be removed by simply configuring the IP after the first port is added. This is probably already 
> true for most distribs. And anyway, this time gap is probably not a problem.
> -- Carrier will also be erroneously reported as on after removing the last port, if the bridge still 
> has an IP. (But we can arrange for this not to happen).
> 
> And in order to ensure user really understand why carrier is on of off, we can simply issue an INFO 
> message for the non-natural case (bridge_has_no_port & bridga_has_at_least_one_ip).
> 
> I consider all this reasonable.
> 
> 	Nicolas.

Any bridge behaviour based on IP address configuration is a
layering violation and won't work.  The problem is related to dynamic issues
with IPv6 and DHCP and needs to be addressed at that level.

^ permalink raw reply

* Re: [PATCH 1/2] bridge: leave carrier on for empty bridge
From: Krzysztof Olędzki @ 2011-09-04 17:12 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Nicolas de Pesloüan, David S. Miller, netdev
In-Reply-To: <20110904093634.685d7c56@nehalam.ftrdhcpuser.net>

On 2011-09-04 18:36, Stephen Hemminger wrote:
> On Sun, 04 Sep 2011 09:35:10 +0200
> Nicolas de Pesloüan<nicolas.2p.debian@gmail.com>  wrote:
>
>> Le 04/09/2011 06:14, Stephen Hemminger a écrit :
>>
>>>> Instead of asserting carrier when the bridge have no port, can't we assert carrier when the three
>>>> following condition are true at the same time :
>>>>
>>>> - The bridge have no port.
>>>> - At least one IP address is setup on the bridge.
>>>> - The two above conditions are true for more than a configurable amount of seconds, with a default
>>>> of 10, for example.
>>>>
>>>> This would only delay carrier on for a few seconds for the regression and keep the current behavior
>>>> (carrier off until at least 1 port is on) for DHCP.
>>>
>>> This fails on two counts:
>>> 1. Bridge's often run without IP addresses!
>>> 2. DHCP won't try and send out request until carrier is true.
>>
>> Sorry, I missed to say that we should of course also assert carrier on if one port has carrier on.
>>
>> And rethinking about it, the delay is probably useless :
>>
>> bridge_carrier_on = at_least_one_port_has_carrier_on | (bridge_has_no_port&  bridge_has_at_least_one_ip)
>>
>> That way :
>> - for those using bridge without any port, manually setting the IP will assert carrier on. (By the
>> way, why don't they use a dummy device instead?)
>>
>> - for those using bridge with ports:
>> -- Using any kind of autoconfig will work as expected. Carrier will only be asserted at the time
>> first port get carrier.
>> -- Using static IP confifiguration, carrier will possibly be erroneously reported as on during the
>> small time gap between IP address configuration and first port is added to the bridge. This time gap
>> may be removed by simply configuring the IP after the first port is added. This is probably already
>> true for most distribs. And anyway, this time gap is probably not a problem.
>> -- Carrier will also be erroneously reported as on after removing the last port, if the bridge still
>> has an IP. (But we can arrange for this not to happen).
>>
>> And in order to ensure user really understand why carrier is on of off, we can simply issue an INFO
>> message for the non-natural case (bridge_has_no_port&  bridga_has_at_least_one_ip).
>>
>> I consider all this reasonable.
>>
>> 	Nicolas.
>
> Any bridge behaviour based on IP address configuration is a
> layering violation and won't work.  The problem is related to dynamic issues
> with IPv6 and DHCP and needs to be addressed at that level.

Maybe we can simply add a switch controlling if a bridge with no 
attached ports has carrier off (default) or on.

For example:
  echo {0|1} > /sys/devices/virtual/net/brX/bridge/orphaned_carrier

  brctl orphaned_carrier brX {on|off}

Best regards,

				Krzysztof Olędzki

^ permalink raw reply

* Re: [PATCH 2/2] Add a netlink attribute INET_DIAG_SECCTX
From: Rongqing Li @ 2011-09-05  0:32 UTC (permalink / raw)
  To: Paul Moore; +Cc: netdev, selinux, linux-security-module
In-Reply-To: <1971656.b979ahlREa@sifl>

On 09/01/2011 08:28 PM, Paul Moore wrote:
> On Thursday, September 01, 2011 05:33:07 PM Rongqing Li wrote:
>> On 09/01/2011 05:18 AM, Paul Moore wrote:
>>> On Wednesday, August 31, 2011 04:36:17 PM rongqing.li@windriver.com wrote:
>>>> From: Roy.Li<rongqing.li@windriver.com>
>>>>
>>>> Add a new netlink attribute INET_DIAG_SECCTX to dump the security
>>>> context of TCP sockets.
>>>
>>> You'll have to forgive me, I'm not familiar with the netlink code used
>>> by
>>> netstat and friends, but is there anyway to report back the security
>>> context of UDP sockets?  Or does the code below handle that already?
>>>
>>> In general, AF_INET and AF_INET6 sockets, regardless of any upper level
>>> protocols, have security contexts associated with them and it would be
>>> nice to see them in netstat.
>>
>> Yes, this is real concern, If the dumping tcp security context can be
>> accepted by netdev, I am planning to implement it for ipv4 udp socket, unix
>> socket. then ipv6..
>
> Great, I'm glad to hear you're planning on implementing this for more than
> just TCP.
>
> I understand your desire to have the basic idea accepted with only TCP
> implemented - and that is fine with me - but I would like to see support for
> all of the protocols merged at the same time.  In other words, seeking the
> basic ACKs for TCP from the davem, et al is okay but I'd like to defer merging
> TCP support until you have everything implemented and ready to be merged.
>

Ok, I will try

-- 
Best Reagrds,
Roy | RongQing Li

^ permalink raw reply

* Re: [PATCH 1/2] bridge: leave carrier on for empty bridge
From: Ang Way Chuang @ 2011-09-05  4:51 UTC (permalink / raw)
  To: Krzysztof Olędzki
  Cc: Stephen Hemminger, Nicolas de Pesloüan, David S. Miller,
	netdev, Achmad Basuki
In-Reply-To: <4E63B175.6020106@ans.pl>

On 05/09/11 02:12, Krzysztof Olędzki wrote:
> On 2011-09-04 18:36, Stephen Hemminger wrote:
>> On Sun, 04 Sep 2011 09:35:10 +0200
>> Nicolas de Pesloüan<nicolas.2p.debian@gmail.com>  wrote:
>>
>>> Le 04/09/2011 06:14, Stephen Hemminger a écrit :
>>>
>>>>> Instead of asserting carrier when the bridge have no port, can't we assert carrier when the three
>>>>> following condition are true at the same time :
>>>>>
>>>>> - The bridge have no port.
>>>>> - At least one IP address is setup on the bridge.
>>>>> - The two above conditions are true for more than a configurable amount of seconds, with a default
>>>>> of 10, for example.
>>>>>
>>>>> This would only delay carrier on for a few seconds for the regression and keep the current behavior
>>>>> (carrier off until at least 1 port is on) for DHCP.
>>>>
>>>> This fails on two counts:
>>>> 1. Bridge's often run without IP addresses!
>>>> 2. DHCP won't try and send out request until carrier is true.
>>>
>>> Sorry, I missed to say that we should of course also assert carrier on if one port has carrier on.
>>>
>>> And rethinking about it, the delay is probably useless :
>>>
>>> bridge_carrier_on = at_least_one_port_has_carrier_on | (bridge_has_no_port&  bridge_has_at_least_one_ip)
>>>
>>> That way :
>>> - for those using bridge without any port, manually setting the IP will assert carrier on. (By the
>>> way, why don't they use a dummy device instead?)
>>>
>>> - for those using bridge with ports:
>>> -- Using any kind of autoconfig will work as expected. Carrier will only be asserted at the time
>>> first port get carrier.
>>> -- Using static IP confifiguration, carrier will possibly be erroneously reported as on during the
>>> small time gap between IP address configuration and first port is added to the bridge. This time gap
>>> may be removed by simply configuring the IP after the first port is added. This is probably already
>>> true for most distribs. And anyway, this time gap is probably not a problem.
>>> -- Carrier will also be erroneously reported as on after removing the last port, if the bridge still
>>> has an IP. (But we can arrange for this not to happen).
>>>
>>> And in order to ensure user really understand why carrier is on of off, we can simply issue an INFO
>>> message for the non-natural case (bridge_has_no_port&  bridga_has_at_least_one_ip).
>>>
>>> I consider all this reasonable.
>>>
>>>     Nicolas.
>>
>> Any bridge behaviour based on IP address configuration is a
>> layering violation and won't work.  The problem is related to dynamic issues
>> with IPv6 and DHCP and needs to be addressed at that level.
>
> Maybe we can simply add a switch controlling if a bridge with no attached ports has carrier off (default) or on.
>
> For example:
>  echo {0|1} > /sys/devices/virtual/net/brX/bridge/orphaned_carrier
>
>  brctl orphaned_carrier brX {on|off}
>
> Best regards,
>
>                 Krzysztof Olędzki
My colleague investigated the problem. In the case of libvirt, there are 2 issues at hand:
- dnsmasq cannot bind to port 53 for IPv6 address, therefore fails.
- radvd fails to start because the interface is treated as being down.

This blog entry from Daniel Berrange explains the configuration used by libvirt:
http://berrange.com/posts/2011/06/16/providing-ipv6-connectivity-to-virtual-guests-with-libvirt-and-kvm/
It's quite possible that other users may have the same problem in the future if they plan to use radvd or
dnsmasq on a bridge interface. I'm not sure if there are other applications that depend on the carrier on
a bridge interface to function properly. Come to think of it, unless the user is aware of the change
in carrier on the bridge interface, it is very hard for them to diagnose the problem and apply the right
solution (toggling the carrier). Ideally, there is no reason why dnsmasq and radvd should be started if
there is no port attached to a bridge interface. But, we cannot assume that's how it is done.

I think that it is prudent that the whole commit should be reverted because the problem may not be limited
to libvirt use case.
>
> -- 
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply

* [PATCH] ipv4: Fix fib_info->fib_metrics leak
From: Yan, Zheng @ 2011-09-05  6:24 UTC (permalink / raw)
  To: netdev@vger.kernel.org; +Cc: davem@davemloft.net, laijs

Commit 4670994d(net,rcu: convert call_rcu(fc_rport_free_rcu) to
kfree_rcu()) introduced a memory leak. This patch reverts it.

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
---
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 33e2c35..80106d8 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -142,6 +142,14 @@ const struct fib_prop fib_props[RTN_MAX + 1] = {
 };
 
 /* Release a nexthop info record */
+static void free_fib_info_rcu(struct rcu_head *head)
+{
+	struct fib_info *fi = container_of(head, struct fib_info, rcu);
+
+	if (fi->fib_metrics != (u32 *) dst_default_metrics)
+		kfree(fi->fib_metrics);
+	kfree(fi);
+}
 
 void free_fib_info(struct fib_info *fi)
 {
@@ -156,7 +164,7 @@ void free_fib_info(struct fib_info *fi)
 	} endfor_nexthops(fi);
 	fib_info_cnt--;
 	release_net(fi->fib_net);
-	kfree_rcu(fi, rcu);
+	call_rcu(&fi->rcu, free_fib_info_rcu);
 }
 
 void fib_release_info(struct fib_info *fi)

^ permalink raw reply related

* RE: [PATCH net-next v4 4/4] r8169: support new chips of RTL8111F
From: hayeswang @ 2011-09-05 12:13 UTC (permalink / raw)
  To: 'Francois Romieu'; +Cc: netdev, linux-kernel
In-Reply-To: <20110902162831.GA25544@electric-eye.fr.zoreil.com>

 

> -----Original Message-----
> From: Francois Romieu [mailto:romieu@fr.zoreil.com] 
> Sent: Saturday, September 03, 2011 12:29 AM
> To: Hayeswang
> Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH net-next v4 4/4] r8169: support new chips 
> of RTL8111F
> 
> Hayes Wang <hayeswang@realtek.com> :
> > Support new chips of RTL8111F.
> 
> Amongst other things :o)
> 
> > diff --git a/drivers/net/ethernet/realtek/r8169.c 
> > b/drivers/net/ethernet/realtek/r8169.c
> > index 175c769..8e6a200 100644
> > --- a/drivers/net/ethernet/realtek/r8169.c
> > +++ b/drivers/net/ethernet/realtek/r8169.c
> [...]
> > @@ -711,7 +719,10 @@ MODULE_FIRMWARE(FIRMWARE_8168D_1);
> >  MODULE_FIRMWARE(FIRMWARE_8168D_2);
> >  MODULE_FIRMWARE(FIRMWARE_8168E_1);
> >  MODULE_FIRMWARE(FIRMWARE_8168E_2);
> > +MODULE_FIRMWARE(FIRMWARE_8168E_3);
> 
> This one is relevant for Linus's tree.
> 
> Don't worry about submitting again, I'll send it separately.
> 

OK. I would adjust it again.

> No opinion regarding the jumbo fixes patches I sent on 2011/07/17 ?
> 

It seems fine except that you don't need to disable rx checksum when using jumbo
frame. The chips don't support tx checksum when using jumbo frame because of the
feature of early tx. The early tx means that the hw would start to send the
packet without loading the all data into fifo, so the software checksum is
necessary. However, there is no influence for rx, so you could still enable rx
checksum.
 
Best Regards,
Hayes

^ permalink raw reply

* RE: [linux-firmware v4 2/2] rtl_nic: add new firmware for RTL8111F
From: hayeswang @ 2011-09-05 12:33 UTC (permalink / raw)
  To: 'Ben Hutchings'; +Cc: dwmw2, romieu, netdev
In-Reply-To: <1314969392.3092.157.camel@deadeye>

 

> -----Original Message-----
> From: Ben Hutchings [mailto:ben@decadent.org.uk] 
> Sent: Friday, September 02, 2011 9:17 PM
> To: Hayeswang
> Cc: dwmw2@infradead.org; romieu@fr.zoreil.com; netdev@vger.kernel.org
> Subject: Re: [linux-firmware v4 2/2] rtl_nic: add new 
> firmware for RTL8111F
> 
> On Thu, 2011-09-01 at 15:07 +0800, Hayes Wang wrote:
> > Add new firmware:
> > 1. rtl_nic/rtl8168f-1.fw
> >    version: 0.0.2
> > 2. rtl_nic/rtl8168f-2.fw
> >    version: 0.0.2
> [...]
> 
> Applied.  (And please do send firmware files to me as well as 
> David; we can both commit to linux-firmware.git.)
> 

Fine. Thanks.
 
Best Regards,
Hayes

^ permalink raw reply

* 3.1-rc2: (on a second machine): WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x23f/0x250()
From: Justin Piszcz @ 2011-09-05 12:36 UTC (permalink / raw)
  To: linux-kernel; +Cc: netdev, Alan Piszcz

Hello,

Kernel crash: 3.1-rc2:

Sep  5 08:28:36 l1 kernel: [1194854.704017] ------------[ cut here ]------------
Sep  5 08:28:36 l1 kernel: [1194854.704030] WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x23f/0x250()
Sep  5 08:28:36 l1 kernel: [1194854.704033] Hardware name: P35-DS4
Sep  5 08:28:36 l1 kernel: [1194854.704036] NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Sep  5 08:28:36 l1 kernel: [1194854.704039] Modules linked in: tcp_diag ub ppdev lp inet_diag w83795 ipmi_msghandler dm_mod nouveau ttm snd_hda_codec_realtek snd_hda_intel snd_hda_codec drm_kms_helper snd_pcm_oss snd_mixer_oss drm snd_pcm i2c_algo_bit mxm_wmi wmi intel_agp intel_gtt agpgart floppy snd_timer firewire_ohci firewire_core video snd soundcore snd_page_alloc parport_pc parport
Sep  5 08:28:36 l1 kernel: [1194854.704077] Pid: 19, comm: ksoftirqd/3 Not tainted 3.1.0-rc2 #1
Sep  5 08:28:36 l1 kernel: [1194854.704080] Call Trace:
Sep  5 08:28:36 l1 kernel: [1194854.704088]  [<ffffffff810372da>] warn_slowpath_common+0x7a/0xb0
Sep  5 08:28:36 l1 kernel: [1194854.704093]  [<ffffffff810373b1>] warn_slowpath_fmt+0x41/0x50
Sep  5 08:28:36 l1 kernel: [1194854.704099]  [<ffffffff815ee154>] ? schedule+0x2e4/0x950
Sep  5 08:28:36 l1 kernel: [1194854.704103]  [<ffffffff814fbc7f>] dev_watchdog+0x23f/0x250
Sep  5 08:28:36 l1 kernel: [1194854.704109]  [<ffffffff81043162>] run_timer_softirq+0xf2/0x220
Sep  5 08:28:36 l1 kernel: [1194854.704113]  [<ffffffff814fba40>] ? qdisc_reset+0x50/0x50
Sep  5 08:28:36 l1 kernel: Sep  5 08:31:47 l1 syslogd 1.5.0#6.1: restart (remote reception).

Thoughts?

Justin.

^ permalink raw reply

* Re: [PATCH 1/2] sfc: Remove requirement to set tx-usecs-irq for shared channels when modifying coalescing parameters
From: Ben Hutchings @ 2011-09-05 14:12 UTC (permalink / raw)
  To: Ripduman Sohan; +Cc: linux-net-drivers, shodgson, netdev
In-Reply-To: <1314784356-30619-1-git-send-email-ripduman.sohan@cl.cam.ac.uk>

On Wed, 2011-08-31 at 10:52 +0100, Ripduman Sohan wrote:
> Shared TX/RX channels possess a single channel timer controlled by the
> rx-usecs-irq parameter.  Changing coalescing parameters required
> explicitly setting the tx-usecs-irq parameter to 0.  Ethtool (to HEAD
> of tree) does not do this and instead retrieves and re-submits the
> current tx-usecs-irq value resulting in an unsupported operation
> error.  I found this behaviour counter-intuitive and was only able to
> work out correct moderation parameters by studying the driver code.
> 
> This patch relaxes the requirement to set tx-usecs-irq to 0 by only
> erring if the presented tx-usecs-irq value differs from the current
> value.  I acknowledge, however, that there may be existing scripts
> relying on the old behaviour and so this condition is only triggered
> if a value for tx-usecs-irq is actually presented.

After you first reminded me about this in email, I had a good look at
our ethtool coalescing control functions and tried to fix them
thoroughly.

Eli Cohen also recently queried about the semantics of the fields in
struct ethtool_coalesce
<http://thread.gmane.org/gmane.linux.network/202368>, so I updated the
kernel-doc comments for it (19e2f6f..a27fc96).

So I have some changes of my own in preparation, which I'll send as
follow-ups.

> ---
>  drivers/net/sfc/efx.c     |    6 +++---
>  drivers/net/sfc/efx.h     |    1 +
>  drivers/net/sfc/ethtool.c |    4 +++-
>  3 files changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/sfc/efx.c b/drivers/net/sfc/efx.c
> index faca764..9a313cd 100644
> --- a/drivers/net/sfc/efx.c
> +++ b/drivers/net/sfc/efx.c
> @@ -1556,7 +1556,7 @@ static void efx_remove_all(struct efx_nic *efx)
>   *
>   **************************************************************************/
>  
> -static unsigned irq_mod_ticks(int usecs, int resolution)
> +unsigned efx_irq_mod_ticks(int usecs, int resolution)
>  {
>  	if (usecs <= 0)
>  		return 0; /* cannot receive interrupts ahead of time :-) */
> @@ -1570,8 +1570,8 @@ void efx_init_irq_moderation(struct efx_nic *efx, int tx_usecs, int rx_usecs,
>  			     bool rx_adaptive)
>  {
>  	struct efx_channel *channel;
> -	unsigned tx_ticks = irq_mod_ticks(tx_usecs, EFX_IRQ_MOD_RESOLUTION);
> -	unsigned rx_ticks = irq_mod_ticks(rx_usecs, EFX_IRQ_MOD_RESOLUTION);
> +	unsigned tx_ticks = efx_irq_mod_ticks(tx_usecs, EFX_IRQ_MOD_RESOLUTION);
> +	unsigned rx_ticks = efx_irq_mod_ticks(rx_usecs, EFX_IRQ_MOD_RESOLUTION);
>  
>  	EFX_ASSERT_RESET_SERIALISED(efx);
>  

I would rather add a efx_get_irq_moderation() function for use in
ethtool.c.

> diff --git a/drivers/net/sfc/efx.h b/drivers/net/sfc/efx.h
> index b0d1209..ddfcc7e 100644
> --- a/drivers/net/sfc/efx.h
> +++ b/drivers/net/sfc/efx.h
> @@ -113,6 +113,7 @@ extern int efx_reset_up(struct efx_nic *efx, enum reset_type method, bool ok);
>  extern void efx_schedule_reset(struct efx_nic *efx, enum reset_type type);
>  extern void efx_init_irq_moderation(struct efx_nic *efx, int tx_usecs,
>  				    int rx_usecs, bool rx_adaptive);
> +extern unsigned efx_irq_mod_ticks(int usecs, int resolution);
>  
>  /* Dummy PHY ops for PHY drivers */
>  extern int efx_port_dummy_op_int(struct efx_nic *efx);
> diff --git a/drivers/net/sfc/ethtool.c b/drivers/net/sfc/ethtool.c
> index bc4643a..0a52447 100644
> --- a/drivers/net/sfc/ethtool.c
> +++ b/drivers/net/sfc/ethtool.c
> @@ -644,7 +644,9 @@ static int efx_ethtool_set_coalesce(struct net_device *net_dev,
>  	efx_for_each_channel(channel, efx) {
>  		if (efx_channel_has_rx_queue(channel) &&
>  		    efx_channel_has_tx_queues(channel) &&
> -		    tx_usecs) {
> +		    tx_usecs &&
> +		    efx_irq_mod_ticks(tx_usecs, EFX_IRQ_MOD_RESOLUTION) !=
> +		    channel->irq_moderation) {
>  			netif_err(efx, drv, efx->net_dev, "Channel is shared. "
>  				  "Only RX coalescing may be set\n");
>  			return -EOPNOTSUPP;

channel->irq_moderation will be the value selected by the adaptive
moderation algorithm.  efx->rx_irq_moderation is the appropriate value
to use here.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH] usbnet: ignore get interface retval of -EINPROGRESS
From: Jim Wylder @ 2011-09-05 16:29 UTC (permalink / raw)
  To: Oliver Neukum; +Cc: netdev
In-Reply-To: <201109041216.12127.oliver@neukum.org>

When calling pm_runtime_get, usb_autopm_get_interface_async
treats a return value of -EINPROGRESS as a success and
increments the usage count.  Since the interface is resuming,
it is safe for usbnet_start_xmit to submit the urb.  If instead,
usbnet_start_xmit treats this as an error the packet will be
dropped.  Additionally, a corresponding tx_complete will not
run to offset the earlier increment of the usage count from the
call to usb_autopm_get_interface_async.

Signed-off-by: James Wylder <james.wylder@motorola.com>
---
 drivers/net/usb/usbnet.c |   16 +++++++++++++++-
 1 files changed, 15 insertions(+), 1 deletions(-)

diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index ce395fe..7dcef05 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -81,6 +81,19 @@

 /*-------------------------------------------------------------------------*/

+/* When power management is configured, usb_autopm_get_interface_async()
+ * will return -EINPROGRESS to signify that the interface was already
+ * resuming at the time that the function was called.  In our case
+ * (all cases?) this not an error condition.
+ * */
+#ifdef CONFIG_PM
+#define USBNET_PM_INTF_ALREADY_RESUMING(x) (x == -EINPROGRESS)
+#else
+#define USBNET_PM_INTF_ALREADY_RESUMING(x) 0
+#endif
+
+/*-------------------------------------------------------------------------*/
+
 // randomly generated ethernet address
 static u8	node_id [ETH_ALEN];

@@ -1042,6 +1055,7 @@ EXPORT_SYMBOL_GPL(usbnet_tx_timeout);

 /*-------------------------------------------------------------------------*/

+
 netdev_tx_t usbnet_start_xmit (struct sk_buff *skb,
 				     struct net_device *net)
 {
@@ -1105,7 +1119,7 @@ netdev_tx_t usbnet_start_xmit (struct sk_buff *skb,

 	spin_lock_irqsave(&dev->txq.lock, flags);
 	retval = usb_autopm_get_interface_async(dev->intf);
-	if (retval < 0) {
+	if (retval < 0 && !USBNET_PM_INTF_ALREADY_RESUMING(retval)) {
 		spin_unlock_irqrestore(&dev->txq.lock, flags);
 		goto drop;
 	}
-- 
1.7.6

On Sun, Sep 4, 2011 at 5:16 AM, Oliver Neukum <oliver@neukum.org> wrote:
> Am Samstag, 3. September 2011, 19:32:44 schrieb Jim Wylder:
>
> Hi,
>
>> Thanks for the quick feedback.  True usbnet_start_xmit() could be
>> running at anytime, but the usb_autopm_get_interface_async() will only
>> return -EINPROGRESS when rpm_resume detects that
>> dev->power.runtime_status == RPM_RESUMING.  My understanding is that
>> for an asynchronous request, the promise that the device is resuming
>> would be equivalent to cases where usb_autopm_get_interface_async()
>> returns success.
>
> If CONFIG_PM is set.
>
>> In all other cases, when we are not attempting to resume an already
>> resuming interface, this change should have no impact.
>>
>> Are you recommending that I add an additional check for DEV_ASLEEP,
>
> The check is already there but depending on CONFIG_PM.
>
>> possibly to decide whether to drop the or continue on?  Or am I
>> missing your point?  I had not done anything similar because my
>> understanding was that knowing that the device is in fact resuming
>> would be sufficient.
>
> At that point it is sufficient. Upon further thought it looks like your check
> is correct.
> We just need to make very sure that for the decision to queue or
> to submit EVENT_DEV_ASLEEP is relevant.
> Would you consider defining a macro with a nice name for the
> ==0 || == -EINPROGRESS check, so that any reader knows what
> this is about? This is because I doubt only usbnet is affected.
>
>        Regards
>                Oliver
>

^ permalink raw reply related

* Re: [PATCH] usbnet: ignore get interface retval of -EINPROGRESS
From: Jim Wylder @ 2011-09-05 16:36 UTC (permalink / raw)
  To: Oliver Neukum; +Cc: netdev
In-Reply-To: <CAPopfEXz+HPdwWh_GQ2WzMnHh5DuXTse78NAU=-iN15wv-0C5g@mail.gmail.com>

Oliver,

I have a couple of questions about this:

- Does the macro as implemented properly communicate the intent as you
requested?

- You made a comment about this being useful outside of usbnet.  The
only appropriate place that I could identify for this would be all the
way up in usb.h.  Do you think something at that level is more
appropriate?

Jim

^ permalink raw reply

* Re: [PATCH 1/2] bridge: leave carrier on for empty bridge
From: Nicolas de Pesloüan @ 2011-09-05 17:18 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David S. Miller, netdev
In-Reply-To: <20110904093634.685d7c56@nehalam.ftrdhcpuser.net>

Le 04/09/2011 18:36, Stephen Hemminger a écrit :
> On Sun, 04 Sep 2011 09:35:10 +0200
> Nicolas de Pesloüan<nicolas.2p.debian@gmail.com>  wrote:
>
>> Le 04/09/2011 06:14, Stephen Hemminger a écrit :
>>
>>>> Instead of asserting carrier when the bridge have no port, can't we assert carrier when the three
>>>> following condition are true at the same time :
>>>>
>>>> - The bridge have no port.
>>>> - At least one IP address is setup on the bridge.
>>>> - The two above conditions are true for more than a configurable amount of seconds, with a default
>>>> of 10, for example.
>>>>
>>>> This would only delay carrier on for a few seconds for the regression and keep the current behavior
>>>> (carrier off until at least 1 port is on) for DHCP.
>>>
>>> This fails on two counts:
>>> 1. Bridge's often run without IP addresses!
>>> 2. DHCP won't try and send out request until carrier is true.
>>
>> Sorry, I missed to say that we should of course also assert carrier on if one port has carrier on.
>>
>> And rethinking about it, the delay is probably useless :
>>
>> bridge_carrier_on = at_least_one_port_has_carrier_on | (bridge_has_no_port&  bridge_has_at_least_one_ip)
>>
>> That way :
>> - for those using bridge without any port, manually setting the IP will assert carrier on. (By the
>> way, why don't they use a dummy device instead?)
>>
>> - for those using bridge with ports:
>> -- Using any kind of autoconfig will work as expected. Carrier will only be asserted at the time
>> first port get carrier.
>> -- Using static IP confifiguration, carrier will possibly be erroneously reported as on during the
>> small time gap between IP address configuration and first port is added to the bridge. This time gap
>> may be removed by simply configuring the IP after the first port is added. This is probably already
>> true for most distribs. And anyway, this time gap is probably not a problem.
>> -- Carrier will also be erroneously reported as on after removing the last port, if the bridge still
>> has an IP. (But we can arrange for this not to happen).
>>
>> And in order to ensure user really understand why carrier is on of off, we can simply issue an INFO
>> message for the non-natural case (bridge_has_no_port&  bridga_has_at_least_one_ip).
>>
>> I consider all this reasonable.
>>
>> 	Nicolas.
>
> Any bridge behaviour based on IP address configuration is a
> layering violation and won't work.  The problem is related to dynamic issues
> with IPv6 and DHCP and needs to be addressed at that level.

Well, this is not a bridge behavior, this is a behavior of the local (non physical) interface of the 
bridge. If this interface were physical, it would have been external to the bridge (on one of the 
hosts connected to the bridge).  As such, this behavior is not related to the layering of the 
bridge, so I don't consider it a layering violation.

My proposed change doesn't impact the way the bridge works (forwards packets) in any way.

And it really solves both issues.

	Nicolas.

^ permalink raw reply

* [PATCH net-next 1/5] sfc: Correct error code for unsupported interrupt coalescing parameters
From: Ben Hutchings @ 2011-09-05 17:41 UTC (permalink / raw)
  To: David Miller, Ripduman Sohan; +Cc: netdev, linux-net-drivers
In-Reply-To: <1315231921.3028.11.camel@bwh-desktop>

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/ethernet/sfc/ethtool.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/sfc/ethtool.c b/drivers/net/ethernet/sfc/ethtool.c
index bc4643a..6de2715 100644
--- a/drivers/net/ethernet/sfc/ethtool.c
+++ b/drivers/net/ethernet/sfc/ethtool.c
@@ -628,12 +628,12 @@ static int efx_ethtool_set_coalesce(struct net_device *net_dev,
 	unsigned tx_usecs, rx_usecs, adaptive;
 
 	if (coalesce->use_adaptive_tx_coalesce)
-		return -EOPNOTSUPP;
+		return -EINVAL;
 
 	if (coalesce->rx_coalesce_usecs || coalesce->tx_coalesce_usecs) {
 		netif_err(efx, drv, efx->net_dev, "invalid coalescing setting. "
 			  "Only rx/tx_coalesce_usecs_irq are supported\n");
-		return -EOPNOTSUPP;
+		return -EINVAL;
 	}
 
 	rx_usecs = coalesce->rx_coalesce_usecs_irq;
@@ -647,7 +647,7 @@ static int efx_ethtool_set_coalesce(struct net_device *net_dev,
 		    tx_usecs) {
 			netif_err(efx, drv, efx->net_dev, "Channel is shared. "
 				  "Only RX coalescing may be set\n");
-			return -EOPNOTSUPP;
+			return -EINVAL;
 		}
 	}
 
-- 
1.7.4.4



-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply related

* [PATCH net-next 2/5] sfc: Use consistent types for interrupt coalescing parameters
From: Ben Hutchings @ 2011-09-05 17:41 UTC (permalink / raw)
  To: David Miller, Ripduman Sohan; +Cc: netdev, linux-net-drivers
In-Reply-To: <1315231921.3028.11.camel@bwh-desktop>

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/ethernet/sfc/efx.c     |   10 +++++-----
 drivers/net/ethernet/sfc/efx.h     |    4 ++--
 drivers/net/ethernet/sfc/ethtool.c |    3 ++-
 3 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
index b6b0e71..097ed8b 100644
--- a/drivers/net/ethernet/sfc/efx.c
+++ b/drivers/net/ethernet/sfc/efx.c
@@ -1556,18 +1556,18 @@ static void efx_remove_all(struct efx_nic *efx)
  *
  **************************************************************************/
 
-static unsigned irq_mod_ticks(int usecs, int resolution)
+static unsigned int irq_mod_ticks(unsigned int usecs, unsigned int resolution)
 {
-	if (usecs <= 0)
-		return 0; /* cannot receive interrupts ahead of time :-) */
+	if (usecs == 0)
+		return 0;
 	if (usecs < resolution)
 		return 1; /* never round down to 0 */
 	return usecs / resolution;
 }
 
 /* Set interrupt moderation parameters */
-void efx_init_irq_moderation(struct efx_nic *efx, int tx_usecs, int rx_usecs,
-			     bool rx_adaptive)
+void efx_init_irq_moderation(struct efx_nic *efx, unsigned int tx_usecs,
+			     unsigned int rx_usecs, bool rx_adaptive)
 {
 	struct efx_channel *channel;
 	unsigned tx_ticks = irq_mod_ticks(tx_usecs, EFX_IRQ_MOD_RESOLUTION);
diff --git a/drivers/net/ethernet/sfc/efx.h b/drivers/net/ethernet/sfc/efx.h
index b0d1209..8f5acae 100644
--- a/drivers/net/ethernet/sfc/efx.h
+++ b/drivers/net/ethernet/sfc/efx.h
@@ -111,8 +111,8 @@ extern int efx_reset_up(struct efx_nic *efx, enum reset_type method, bool ok);
 
 /* Global */
 extern void efx_schedule_reset(struct efx_nic *efx, enum reset_type type);
-extern void efx_init_irq_moderation(struct efx_nic *efx, int tx_usecs,
-				    int rx_usecs, bool rx_adaptive);
+extern void efx_init_irq_moderation(struct efx_nic *efx, unsigned int tx_usecs,
+				    unsigned int rx_usecs, bool rx_adaptive);
 
 /* Dummy PHY ops for PHY drivers */
 extern int efx_port_dummy_op_int(struct efx_nic *efx);
diff --git a/drivers/net/ethernet/sfc/ethtool.c b/drivers/net/ethernet/sfc/ethtool.c
index 6de2715..dedaa2c 100644
--- a/drivers/net/ethernet/sfc/ethtool.c
+++ b/drivers/net/ethernet/sfc/ethtool.c
@@ -625,7 +625,8 @@ static int efx_ethtool_set_coalesce(struct net_device *net_dev,
 {
 	struct efx_nic *efx = netdev_priv(net_dev);
 	struct efx_channel *channel;
-	unsigned tx_usecs, rx_usecs, adaptive;
+	unsigned int tx_usecs, rx_usecs;
+	bool adaptive;
 
 	if (coalesce->use_adaptive_tx_coalesce)
 		return -EINVAL;
-- 
1.7.4.4



-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply related

* [PATCH net-next 3/5] sfc: Correct reporting and validation of TX interrupt coalescing
From: Ben Hutchings @ 2011-09-05 17:42 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers, Ripduman Sohan
In-Reply-To: <1315231921.3028.11.camel@bwh-desktop>

The reported TX IRQ moderation is generated in a completely crazy way.
Make it simple and correct.

When channels are shared between RX and TX, TX IRQ moderation must be
the same as RX IRQ moderation, but must be specified as 0!  Allow it
to be either specified as the same, or left at its previous value
in which case it will be quietly overridden.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/ethernet/sfc/efx.c     |   18 ++++++++++
 drivers/net/ethernet/sfc/efx.h     |    2 +
 drivers/net/ethernet/sfc/ethtool.c |   67 +++++++++++++++++------------------
 3 files changed, 53 insertions(+), 34 deletions(-)

diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
index 097ed8b..e0157c0 100644
--- a/drivers/net/ethernet/sfc/efx.c
+++ b/drivers/net/ethernet/sfc/efx.c
@@ -1585,6 +1585,24 @@ void efx_init_irq_moderation(struct efx_nic *efx, unsigned int tx_usecs,
 	}
 }
 
+void efx_get_irq_moderation(struct efx_nic *efx, unsigned int *tx_usecs,
+			    unsigned int *rx_usecs, bool *rx_adaptive)
+{
+	*rx_adaptive = efx->irq_rx_adaptive;
+	*rx_usecs = efx->irq_rx_moderation * EFX_IRQ_MOD_RESOLUTION;
+
+	/* If channels are shared between RX and TX, so is IRQ
+	 * moderation.  Otherwise, IRQ moderation is the same for all
+	 * TX channels and is not adaptive.
+	 */
+	if (efx->tx_channel_offset == 0)
+		*tx_usecs = *rx_usecs;
+	else
+		*tx_usecs =
+			efx->channel[efx->tx_channel_offset]->irq_moderation *
+			EFX_IRQ_MOD_RESOLUTION;
+}
+
 /**************************************************************************
  *
  * Hardware monitor
diff --git a/drivers/net/ethernet/sfc/efx.h b/drivers/net/ethernet/sfc/efx.h
index 8f5acae..8ca6863 100644
--- a/drivers/net/ethernet/sfc/efx.h
+++ b/drivers/net/ethernet/sfc/efx.h
@@ -113,6 +113,8 @@ extern int efx_reset_up(struct efx_nic *efx, enum reset_type method, bool ok);
 extern void efx_schedule_reset(struct efx_nic *efx, enum reset_type type);
 extern void efx_init_irq_moderation(struct efx_nic *efx, unsigned int tx_usecs,
 				    unsigned int rx_usecs, bool rx_adaptive);
+extern void efx_get_irq_moderation(struct efx_nic *efx, unsigned int *tx_usecs,
+				   unsigned int *rx_usecs, bool *rx_adaptive);
 
 /* Dummy PHY ops for PHY drivers */
 extern int efx_port_dummy_op_int(struct efx_nic *efx);
diff --git a/drivers/net/ethernet/sfc/ethtool.c b/drivers/net/ethernet/sfc/ethtool.c
index dedaa2c..1cb6ed0 100644
--- a/drivers/net/ethernet/sfc/ethtool.c
+++ b/drivers/net/ethernet/sfc/ethtool.c
@@ -586,40 +586,37 @@ static int efx_ethtool_nway_reset(struct net_device *net_dev)
 	return mdio45_nway_restart(&efx->mdio);
 }
 
+/*
+ * Each channel has a single IRQ and moderation timer, started by any
+ * completion (or other event).  Unless the module parameter
+ * separate_tx_channels is set, IRQs and moderation are therefore
+ * shared between RX and TX completions.  In this case, when RX IRQ
+ * moderation is explicitly changed then TX IRQ moderation is
+ * automatically changed too, but otherwise we fail if the two values
+ * are requested to be different.
+ *
+ * We implement adaptive IRQ moderation, but use a different algorithm
+ * from that assumed in the definition of struct ethtool_coalesce.
+ * Therefore we do not use any of the adaptive moderation parameters
+ * in it.
+ */
+
 static int efx_ethtool_get_coalesce(struct net_device *net_dev,
 				    struct ethtool_coalesce *coalesce)
 {
 	struct efx_nic *efx = netdev_priv(net_dev);
-	struct efx_channel *channel;
-
-	memset(coalesce, 0, sizeof(*coalesce));
-
-	/* Find lowest IRQ moderation across all used TX queues */
-	coalesce->tx_coalesce_usecs_irq = ~((u32) 0);
-	efx_for_each_channel(channel, efx) {
-		if (!efx_channel_has_tx_queues(channel))
-			continue;
-		if (channel->irq_moderation < coalesce->tx_coalesce_usecs_irq) {
-			if (channel->channel < efx->n_rx_channels)
-				coalesce->tx_coalesce_usecs_irq =
-					channel->irq_moderation;
-			else
-				coalesce->tx_coalesce_usecs_irq = 0;
-		}
-	}
+	unsigned int tx_usecs, rx_usecs;
+	bool rx_adaptive;
 
-	coalesce->use_adaptive_rx_coalesce = efx->irq_rx_adaptive;
-	coalesce->rx_coalesce_usecs_irq = efx->irq_rx_moderation;
+	efx_get_irq_moderation(efx, &tx_usecs, &rx_usecs, &rx_adaptive);
 
-	coalesce->tx_coalesce_usecs_irq *= EFX_IRQ_MOD_RESOLUTION;
-	coalesce->rx_coalesce_usecs_irq *= EFX_IRQ_MOD_RESOLUTION;
+	coalesce->tx_coalesce_usecs_irq = tx_usecs;
+	coalesce->rx_coalesce_usecs_irq = rx_usecs;
+	coalesce->use_adaptive_rx_coalesce = rx_adaptive;
 
 	return 0;
 }
 
-/* Set coalescing parameters
- * The difficulties occur for shared channels
- */
 static int efx_ethtool_set_coalesce(struct net_device *net_dev,
 				    struct ethtool_coalesce *coalesce)
 {
@@ -637,20 +634,22 @@ static int efx_ethtool_set_coalesce(struct net_device *net_dev,
 		return -EINVAL;
 	}
 
+	efx_get_irq_moderation(efx, &tx_usecs, &rx_usecs, &adaptive);
+
 	rx_usecs = coalesce->rx_coalesce_usecs_irq;
-	tx_usecs = coalesce->tx_coalesce_usecs_irq;
 	adaptive = coalesce->use_adaptive_rx_coalesce;
 
-	/* If the channel is shared only allow RX parameters to be set */
-	efx_for_each_channel(channel, efx) {
-		if (efx_channel_has_rx_queue(channel) &&
-		    efx_channel_has_tx_queues(channel) &&
-		    tx_usecs) {
-			netif_err(efx, drv, efx->net_dev, "Channel is shared. "
-				  "Only RX coalescing may be set\n");
-			return -EINVAL;
-		}
+	/* If channels are shared, TX IRQ moderation can be quietly
+	 * overridden unless it is changed from its old value.
+	 */
+	if (efx->tx_channel_offset == 0 &&
+	    coalesce->tx_coalesce_usecs_irq != tx_usecs &&
+	    coalesce->tx_coalesce_usecs_irq != rx_usecs) {
+		netif_err(efx, drv, efx->net_dev, "Channels are shared. "
+			  "RX and TX IRQ moderation must be equal\n");
+		return -EINVAL;
 	}
+	tx_usecs = coalesce->tx_coalesce_usecs_irq;
 
 	efx_init_irq_moderation(efx, tx_usecs, rx_usecs, adaptive);
 	efx_for_each_channel(channel, efx)
-- 
1.7.4.4



-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply related

* [PATCH net-next 4/5] sfc: Validate IRQ moderation parameters in efx_init_irq_moderation()
From: Ben Hutchings @ 2011-09-05 17:43 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers, Ripduman Sohan
In-Reply-To: <1315231921.3028.11.camel@bwh-desktop>

Add a range check, and move the check that RX and TX are consistent
from efx_ethtool_set_coalesce().

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/ethernet/sfc/efx.c     |   20 +++++++++++++++++---
 drivers/net/ethernet/sfc/efx.h     |    5 +++--
 drivers/net/ethernet/sfc/ethtool.c |   17 ++++++++---------
 drivers/net/ethernet/sfc/falcon.c  |    2 ++
 drivers/net/ethernet/sfc/nic.h     |    3 ++-
 drivers/net/ethernet/sfc/siena.c   |    2 ++
 6 files changed, 34 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
index e0157c0..76dcadf 100644
--- a/drivers/net/ethernet/sfc/efx.c
+++ b/drivers/net/ethernet/sfc/efx.c
@@ -1357,7 +1357,8 @@ static int efx_probe_nic(struct efx_nic *efx)
 	netif_set_real_num_rx_queues(efx->net_dev, efx->n_rx_channels);
 
 	/* Initialise the interrupt moderation settings */
-	efx_init_irq_moderation(efx, tx_irq_mod_usec, rx_irq_mod_usec, true);
+	efx_init_irq_moderation(efx, tx_irq_mod_usec, rx_irq_mod_usec, true,
+				true);
 
 	return 0;
 
@@ -1566,8 +1567,9 @@ static unsigned int irq_mod_ticks(unsigned int usecs, unsigned int resolution)
 }
 
 /* Set interrupt moderation parameters */
-void efx_init_irq_moderation(struct efx_nic *efx, unsigned int tx_usecs,
-			     unsigned int rx_usecs, bool rx_adaptive)
+int efx_init_irq_moderation(struct efx_nic *efx, unsigned int tx_usecs,
+			    unsigned int rx_usecs, bool rx_adaptive,
+			    bool rx_may_override_tx)
 {
 	struct efx_channel *channel;
 	unsigned tx_ticks = irq_mod_ticks(tx_usecs, EFX_IRQ_MOD_RESOLUTION);
@@ -1575,6 +1577,16 @@ void efx_init_irq_moderation(struct efx_nic *efx, unsigned int tx_usecs,
 
 	EFX_ASSERT_RESET_SERIALISED(efx);
 
+	if (tx_ticks > EFX_IRQ_MOD_MAX || rx_ticks > EFX_IRQ_MOD_MAX)
+		return -EINVAL;
+
+	if (tx_ticks != rx_ticks && efx->tx_channel_offset == 0 &&
+	    !rx_may_override_tx) {
+		netif_err(efx, drv, efx->net_dev, "Channels are shared. "
+			  "RX and TX IRQ moderation must be equal\n");
+		return -EINVAL;
+	}
+
 	efx->irq_rx_adaptive = rx_adaptive;
 	efx->irq_rx_moderation = rx_ticks;
 	efx_for_each_channel(channel, efx) {
@@ -1583,6 +1595,8 @@ void efx_init_irq_moderation(struct efx_nic *efx, unsigned int tx_usecs,
 		else if (efx_channel_has_tx_queues(channel))
 			channel->irq_moderation = tx_ticks;
 	}
+
+	return 0;
 }
 
 void efx_get_irq_moderation(struct efx_nic *efx, unsigned int *tx_usecs,
diff --git a/drivers/net/ethernet/sfc/efx.h b/drivers/net/ethernet/sfc/efx.h
index 8ca6863..442f4d0 100644
--- a/drivers/net/ethernet/sfc/efx.h
+++ b/drivers/net/ethernet/sfc/efx.h
@@ -111,8 +111,9 @@ extern int efx_reset_up(struct efx_nic *efx, enum reset_type method, bool ok);
 
 /* Global */
 extern void efx_schedule_reset(struct efx_nic *efx, enum reset_type type);
-extern void efx_init_irq_moderation(struct efx_nic *efx, unsigned int tx_usecs,
-				    unsigned int rx_usecs, bool rx_adaptive);
+extern int efx_init_irq_moderation(struct efx_nic *efx, unsigned int tx_usecs,
+				   unsigned int rx_usecs, bool rx_adaptive,
+				   bool rx_may_override_tx);
 extern void efx_get_irq_moderation(struct efx_nic *efx, unsigned int *tx_usecs,
 				   unsigned int *rx_usecs, bool *rx_adaptive);
 
diff --git a/drivers/net/ethernet/sfc/ethtool.c b/drivers/net/ethernet/sfc/ethtool.c
index 1cb6ed0..98b363b 100644
--- a/drivers/net/ethernet/sfc/ethtool.c
+++ b/drivers/net/ethernet/sfc/ethtool.c
@@ -623,7 +623,8 @@ static int efx_ethtool_set_coalesce(struct net_device *net_dev,
 	struct efx_nic *efx = netdev_priv(net_dev);
 	struct efx_channel *channel;
 	unsigned int tx_usecs, rx_usecs;
-	bool adaptive;
+	bool adaptive, rx_may_override_tx;
+	int rc;
 
 	if (coalesce->use_adaptive_tx_coalesce)
 		return -EINVAL;
@@ -642,16 +643,14 @@ static int efx_ethtool_set_coalesce(struct net_device *net_dev,
 	/* If channels are shared, TX IRQ moderation can be quietly
 	 * overridden unless it is changed from its old value.
 	 */
-	if (efx->tx_channel_offset == 0 &&
-	    coalesce->tx_coalesce_usecs_irq != tx_usecs &&
-	    coalesce->tx_coalesce_usecs_irq != rx_usecs) {
-		netif_err(efx, drv, efx->net_dev, "Channels are shared. "
-			  "RX and TX IRQ moderation must be equal\n");
-		return -EINVAL;
-	}
+	rx_may_override_tx = coalesce->tx_coalesce_usecs_irq == tx_usecs;
 	tx_usecs = coalesce->tx_coalesce_usecs_irq;
 
-	efx_init_irq_moderation(efx, tx_usecs, rx_usecs, adaptive);
+	rc = efx_init_irq_moderation(efx, tx_usecs, rx_usecs, adaptive,
+				     rx_may_override_tx);
+	if (rc != 0)
+		return rc;
+
 	efx_for_each_channel(channel, efx)
 		efx->type->push_irq_moderation(channel);
 
diff --git a/drivers/net/ethernet/sfc/falcon.c b/drivers/net/ethernet/sfc/falcon.c
index 94bf4aa..4dd1748 100644
--- a/drivers/net/ethernet/sfc/falcon.c
+++ b/drivers/net/ethernet/sfc/falcon.c
@@ -104,6 +104,8 @@ static void falcon_push_irq_moderation(struct efx_channel *channel)
 	efx_dword_t timer_cmd;
 	struct efx_nic *efx = channel->efx;
 
+	BUILD_BUG_ON(EFX_IRQ_MOD_MAX > (1 << FRF_AB_TC_TIMER_VAL_WIDTH));
+
 	/* Set timer register */
 	if (channel->irq_moderation) {
 		EFX_POPULATE_DWORD_2(timer_cmd,
diff --git a/drivers/net/ethernet/sfc/nic.h b/drivers/net/ethernet/sfc/nic.h
index 4bd1f28..b5b2886 100644
--- a/drivers/net/ethernet/sfc/nic.h
+++ b/drivers/net/ethernet/sfc/nic.h
@@ -204,7 +204,8 @@ extern irqreturn_t efx_nic_fatal_interrupt(struct efx_nic *efx);
 extern irqreturn_t falcon_legacy_interrupt_a1(int irq, void *dev_id);
 extern void falcon_irq_ack_a1(struct efx_nic *efx);
 
-#define EFX_IRQ_MOD_RESOLUTION 5
+#define EFX_IRQ_MOD_RESOLUTION	5
+#define EFX_IRQ_MOD_MAX		0x1000
 
 /* Global Resources */
 extern int efx_nic_flush_queues(struct efx_nic *efx);
diff --git a/drivers/net/ethernet/sfc/siena.c b/drivers/net/ethernet/sfc/siena.c
index 5735e84..4fdd148 100644
--- a/drivers/net/ethernet/sfc/siena.c
+++ b/drivers/net/ethernet/sfc/siena.c
@@ -36,6 +36,8 @@ static void siena_push_irq_moderation(struct efx_channel *channel)
 {
 	efx_dword_t timer_cmd;
 
+	BUILD_BUG_ON(EFX_IRQ_MOD_MAX > (1 << FRF_CZ_TC_TIMER_VAL_WIDTH));
+
 	if (channel->irq_moderation)
 		EFX_POPULATE_DWORD_2(timer_cmd,
 				     FRF_CZ_TC_TIMER_MODE,
-- 
1.7.4.4



-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply related

* [PATCH net-next 5/5] sfc: Use correct fields of struct ethtool_coalesce
From: Ben Hutchings @ 2011-09-05 17:43 UTC (permalink / raw)
  To: David Miller, Ripduman Sohan; +Cc: netdev, linux-net-drivers
In-Reply-To: <1315231921.3028.11.camel@bwh-desktop>

An earlier developer misunderstood the meaning of the 'irq' fields and
the driver did not support the standard fields.  To avoid invalidating
existing user documentation, we report and accept changes through
either the standard or 'irq' fields.  If both are changed at the same
time, we prefer the standard field.

Also explain why we don't currently use the 'max_frames' fields.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/ethernet/sfc/ethtool.c |   36 +++++++++++++++++++++++++++---------
 1 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/sfc/ethtool.c b/drivers/net/ethernet/sfc/ethtool.c
index 98b363b..93f1fb9 100644
--- a/drivers/net/ethernet/sfc/ethtool.c
+++ b/drivers/net/ethernet/sfc/ethtool.c
@@ -595,6 +595,20 @@ static int efx_ethtool_nway_reset(struct net_device *net_dev)
  * automatically changed too, but otherwise we fail if the two values
  * are requested to be different.
  *
+ * The hardware does not support a limit on the number of completions
+ * before an IRQ, so we do not use the max_frames fields.  We should
+ * report and require that max_frames == (usecs != 0), but this would
+ * invalidate existing user documentation.
+ *
+ * The hardware does not have distinct settings for interrupt
+ * moderation while the previous IRQ is being handled, so we should
+ * not use the 'irq' fields.  However, an earlier developer
+ * misunderstood the meaning of the 'irq' fields and the driver did
+ * not support the standard fields.  To avoid invalidating existing
+ * user documentation, we report and accept changes through either the
+ * standard or 'irq' fields.  If both are changed at the same time, we
+ * prefer the standard field.
+ *
  * We implement adaptive IRQ moderation, but use a different algorithm
  * from that assumed in the definition of struct ethtool_coalesce.
  * Therefore we do not use any of the adaptive moderation parameters
@@ -610,7 +624,9 @@ static int efx_ethtool_get_coalesce(struct net_device *net_dev,
 
 	efx_get_irq_moderation(efx, &tx_usecs, &rx_usecs, &rx_adaptive);
 
+	coalesce->tx_coalesce_usecs = tx_usecs;
 	coalesce->tx_coalesce_usecs_irq = tx_usecs;
+	coalesce->rx_coalesce_usecs = rx_usecs;
 	coalesce->rx_coalesce_usecs_irq = rx_usecs;
 	coalesce->use_adaptive_rx_coalesce = rx_adaptive;
 
@@ -629,22 +645,24 @@ static int efx_ethtool_set_coalesce(struct net_device *net_dev,
 	if (coalesce->use_adaptive_tx_coalesce)
 		return -EINVAL;
 
-	if (coalesce->rx_coalesce_usecs || coalesce->tx_coalesce_usecs) {
-		netif_err(efx, drv, efx->net_dev, "invalid coalescing setting. "
-			  "Only rx/tx_coalesce_usecs_irq are supported\n");
-		return -EINVAL;
-	}
-
 	efx_get_irq_moderation(efx, &tx_usecs, &rx_usecs, &adaptive);
 
-	rx_usecs = coalesce->rx_coalesce_usecs_irq;
+	if (coalesce->rx_coalesce_usecs != rx_usecs)
+		rx_usecs = coalesce->rx_coalesce_usecs;
+	else
+		rx_usecs = coalesce->rx_coalesce_usecs_irq;
+
 	adaptive = coalesce->use_adaptive_rx_coalesce;
 
 	/* If channels are shared, TX IRQ moderation can be quietly
 	 * overridden unless it is changed from its old value.
 	 */
-	rx_may_override_tx = coalesce->tx_coalesce_usecs_irq == tx_usecs;
-	tx_usecs = coalesce->tx_coalesce_usecs_irq;
+	rx_may_override_tx = (coalesce->tx_coalesce_usecs == tx_usecs &&
+			      coalesce->tx_coalesce_usecs_irq == tx_usecs);
+	if (coalesce->tx_coalesce_usecs != tx_usecs)
+		tx_usecs = coalesce->tx_coalesce_usecs;
+	else
+		tx_usecs = coalesce->tx_coalesce_usecs_irq;
 
 	rc = efx_init_irq_moderation(efx, tx_usecs, rx_usecs, adaptive,
 				     rx_may_override_tx);
-- 
1.7.4.4


-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply related

* Re: [PATCH] netfilter: install nf_nat.h and related headers to INSTALL_HDR_PATH
From: Pablo Neira Ayuso @ 2011-09-05 17:48 UTC (permalink / raw)
  To: Anthony G. Basile
  Cc: davem, kaber, blueness, gurligebis, base-system, kernel,
	toolchain, mchehab, hverkuil, laurent.pinchart, arnd, eparis,
	linux-kernel, netdev, netfilter-devel, netfilter, coreteam
In-Reply-To: <1315075784-10163-1-git-send-email-basile@opensource.dyc.edu>

On Sat, Sep 03, 2011 at 02:49:44PM -0400, Anthony G. Basile wrote:
> Currently nf_nat.h, nf_conntrack_tuple.h and related headers under
> include/net/netfilter are not installed as part of the public kernel
> headers.   However, there are userland applications, other than iptables
> which ships with its own headers, which need these to make use of NAT in
> the kernel's netfilter API.  For example, miniupnpd, requires them and is
> forced to search /usr/src/linux when building.

Could anyone clarify why miniupnpd (or any other application) require
this?

Those headers contain structure layouts that may change along time
without further notice, thus breaking backward compatibility.

and BTW, no need to cross-post this message to such a huge list of CC.
I guess you could simply use netfilter-devel for this.

^ permalink raw reply

* Re: [PATCH 1/2] bridge: leave carrier on for empty bridge
From: Stephen Hemminger @ 2011-09-05 17:57 UTC (permalink / raw)
  To: Nicolas de Pesloüan; +Cc: David S. Miller, netdev
In-Reply-To: <4E65046E.1020005@gmail.com>

The root cause of the problem is applications that don't deal with unresolved
IPv6 addresses. I already had to solve this in our distribution for NTP in a
not bridge related problem. It is better to fix the applications to understand
IPv6 address semantics than to try and force bridge to behave in a way that
is friendly to these applications.

The earlier mail said it is a problem with dnsmasq and radvd. Let's work on understanding
if they need to be updated before jumping in with hacks to the bridge code.

^ permalink raw reply

* DEAR FRIEND
From: MR PAUL LEE CHONG @ 2011-09-05 18:10 UTC (permalink / raw)







I am Mr.Paul lee Chong the manager Accounting / Audit Department of(Alliance 
bank,Kuala Lumpur, Malaysia). I am contacting in regards to a business 
proposal that will be of an immense benefit to both of us and the less 
privileged ones. Being the manager accounting 
/audit department of our Head/Regional office discovered a sum of $8.5M 
USD in an account 
belonging to one of our foreign customers Late Mr.Moises Saba Masri. He 
was a Jewish business mogul from Mexico who died on a helicopter crash 
early last year. Mr.Saba was 47-years-old when both his wife, his only son 
Avraham (Albert) and his daughter-in-law died in a helicopter crash.You 
can get more information about Late Mr.Moises Saba Masri's unfortunate 
crash on the website-link below here. 


http://www.ynetnews.com/articles/0,7340,L-3832556,00.html? 
The choice of contacting you was personal and based on trust and also due 
to the sensitive 
nature of the transaction and the confidentiality herein. 
Now our bank has been waiting for any of the relatives to come-up for the 
claim of the 
inheritance fund but unfortunately nobody has asked for the funds from 
anywhere until this 
moment.I personally have been unsuccessful in locating neither the 
relatives nor any next of 
kin to Mr. Saba. On this regards, I seek your consent to present you as 
the next of kin /will 
beneficiary to the deceased so that the procedure of this account valued 
at $8.5M USD can be 
paid to you. This will be disbursed or shared in these percentages 60% to 
me and 40% to you. I have secured all necessary legal documents that can 
be used to back up this claim we are about to make. All I need is to 
upload your names to the documents and legalize at the malaysia High Court 
in order to prove you the legitimate Next of kin/beneficiary. All I 
require now is your honest 
co-operation; confidentiality and trust to enable us see this transaction 
through. I guarantee you 100 % success and that this business transaction 
will be executed under the ambit of law and will also protect you from any 
breach of the contract. 
Please provide me the following as we have 7 working days to run 
it through: 
1. Your full name: 
2. Telephone number: 
3. Contact address: 
4. Age: 
5. Gender: 
6. Occupation: 
Having gone through a methodical search, I decided to contact you hoping 
that you will find this proposal interesting. Please on your confirmation 
of this message and indicating your interest I will furnish you with more 
information's.Contact me via this Email:(paulleechong@gmx.com or 
paulleechong@w.cn) 
Your concern to this e-mail and business proposal will be highly 
appreciated. 




Yours Sincerely, 
Mr. Paul Lee Chong. 

^ permalink raw reply

* Re: [PATCH 1/2] bridge: leave carrier on for empty bridge
From: Ang Way Chuang @ 2011-09-05 19:02 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Nicolas de Pesloüan, David S. Miller, netdev
In-Reply-To: <20110905105735.1b912715@nehalam.ftrdhcpuser.net>

On 06/09/11 02:57, Stephen Hemminger wrote:
> The root cause of the problem is applications that don't deal with unresolved
> IPv6 addresses. I already had to solve this in our distribution for NTP in a
> not bridge related problem. It is better to fix the applications to understand
> IPv6 address semantics than to try and force bridge to behave in a way that
> is friendly to these applications.
Care to share the patch that you did for NTP? Perhaps, I may apply the same trick and tried it out on dnsmasq and radvd if I have time.
Thanks.
>
> The earlier mail said it is a problem with dnsmasq and radvd. Let's work on understanding
> if they need to be updated before jumping in with hacks to the bridge code.
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply

* Re: [PATCH] af_packet: flush complete kernel cache in packet_sendmsg
From: Phil Sutter @ 2011-09-05 19:57 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Ben Hutchings, linux-arm-kernel, netdev, David S. Miller
In-Reply-To: <20110902172850.GB6619@n2100.arm.linux.org.uk>

Hi,

On Fri, Sep 02, 2011 at 06:28:50PM +0100, Russell King - ARM Linux wrote:
> On Fri, Sep 02, 2011 at 02:46:17PM +0100, Ben Hutchings wrote:
> > On Fri, 2011-09-02 at 13:08 +0200, Phil Sutter wrote:
> > > This flushes the cache before and after accessing the mmapped packet
> > > buffer. It seems like the call to flush_dcache_page from inside
> > > __packet_get_status is not enough on Kirkwood (or ARM in general).
> > > ---
> > > I know this is far from an optimal solution, but it's in fact the only working
> > > one I found.
> > [...]
> > 
> > This is ridiculous.  If flush_dcache_page() isn't doing everything it
> > should, you need to fix that.
> 
> It does do everything it should - which is to perform maintanence on
> page cache pages.  It flushes the kernel mapping of the page.  It
> also flushes the userspace mappings of the page which it finds by
> walking the mmap list via the associated struct page.  It does not
> touch vmalloc mappings because it has no way to know whether they
> exist or not.
> 
> It doesn't do so much for anonymous pages - to do so would only
> duplicate what flush_anon_page() does at the very same callsites.
> Plus the mmap list isn't available for such pages so there's no
> way to find out what userspace addresses to flush.

Indeed very interesting information, thanks a lot!

The code in question uses __get_free_pages(), and if that fails uses
vmalloc() (see alloc_one_pg_vec_page() for reference). Both code paths
show result in the same faulty behaviour.

> If the AF_PACKET buffers are created from anonymous pages and it's
> using flush_dcache_page(), it's using the wrong interface.

So, in order to fix this, which alternative would you suggest? Quite a
lot of work has been done regarding memory allocation, so I guess
changing that side is a no-go.

Greetings, Phil

-- 
Viprinet GmbH
Mainzer Str. 43
55411 Bingen am Rhein
Germany

Zentrale:     +49-6721-49030-0
Durchwahl:    +49-6721-49030-134
Fax:          +49-6721-49030-209

phil.sutter@viprinet.com
http://www.viprinet.com

Sitz der Gesellschaft: Bingen am Rhein
Handelsregister: Amtsgericht Mainz HRB40380
Geschäftsführer: Simon Kissel

^ permalink raw reply

* Re: [PATCH 1/2] bridge: leave carrier on for empty bridge
From: Stephen Hemminger @ 2011-09-05 22:45 UTC (permalink / raw)
  To: Ang Way Chuang; +Cc: Nicolas de Pesloüan, David S. Miller, netdev
In-Reply-To: <4E651CB4.3070707@sfc.wide.ad.jp>

On Tue, 06 Sep 2011 04:02:12 +0900
Ang Way Chuang <wcang@sfc.wide.ad.jp> wrote:

> On 06/09/11 02:57, Stephen Hemminger wrote:
> > The root cause of the problem is applications that don't deal with unresolved
> > IPv6 addresses. I already had to solve this in our distribution for NTP in a
> > not bridge related problem. It is better to fix the applications to understand
> > IPv6 address semantics than to try and force bridge to behave in a way that
> > is friendly to these applications.
> Care to share the patch that you did for NTP? Perhaps, I may apply the same trick and tried it out on dnsmasq and radvd if I have time.

It has been submitted upstream, but probably an eternity until
it ever gets merged through the NTP project...

The problem happens on our system because NTP runs before interfaces
brought up.



commit 0997ebe40b2835a58ba2ea1e504e38d6d29c95ed
Author: Stephen Hemminger <stephen.hemminger@vyatta.com>
Date:   Tue Oct 26 17:55:04 2010 -0700

    Ignore IPV6 Dynamic addresses
    
    During boot link-local addresses are generated dynamically.
    These addresses are in a tentative state until after resolution occurs.
    While in the tentative state, the address can not be bound to.
    NTP daemon will see the address become available later when it rescans.
    
    (revised patch for 4.2.4p6)

diff --git a/lib/isc/unix/interfaceiter.c b/lib/isc/unix/interfaceiter.c
index 87af69e..64a9f4f 100644
--- a/lib/isc/unix/interfaceiter.c
+++ b/lib/isc/unix/interfaceiter.c
@@ -151,6 +151,7 @@ get_addr(unsigned int family, isc_netaddr_t *dst, struct sockaddr *src,
 static isc_result_t linux_if_inet6_next(isc_interfaceiter_t *);
 static isc_result_t linux_if_inet6_current(isc_interfaceiter_t *);
 static void linux_if_inet6_first(isc_interfaceiter_t *iter);
+#include <linux/if_addr.h>
 #endif
 
 #if HAVE_GETIFADDRS
@@ -216,6 +217,11 @@ linux_if_inet6_current(isc_interfaceiter_t *iter) {
 			      "/proc/net/if_inet6:strlen(%s) != 32", address);
 		return (ISC_R_FAILURE);
 	}
+#ifdef __linux
+	/* Ignore DAD addresses -- can't bind to them till resolved */
+	if (flags & IFA_F_TENTATIVE)
+		return (ISC_R_IGNORE);
+#endif
 	for (i = 0; i < 16; i++) {
 		unsigned char byte;
 		static const char hex[] = "0123456789abcdef";

^ permalink raw reply related

* [PATCH] per-cgroup tcp buffer limitation
From: Glauber Costa @ 2011-09-06  2:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, containers, netdev, xemul, Glauber Costa,
	David S. Miller, Hiroyouki Kamezawa, Eric W. Biederman

This patch introduces per-cgroup tcp buffers limitation. This allows
sysadmins to specify a maximum amount of kernel memory that
tcp connections can use at any point in time. TCP is the main interest
in this work, but extending it to other protocols would be easy.

It piggybacks in the memory control mechanism already present in
/proc/sys/net/ipv4/tcp_mem. There is a soft limit, and a hard limit,
that will suppress allocation when reached. For each cgroup, however,
the file kmem.tcp_maxmem will be used to cap those values.

The usage I have in mind here is containers. Each container will
define its own values for soft and hard limits, but none of them will
be possibly bigger than the value the box' sysadmin specified from
the outside.

To test for any performance impacts of this patch, I used netperf's
TCP_RR benchmark on localhost, so we can have both recv and snd in action.

Command line used was ./src/netperf -t TCP_RR -H localhost, and the
results:

Without the patch
=================

Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       10.00    26996.35
16384  87380

With the patch
===============

Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       10.00    27291.86
16384  87380

The difference is within a one-percent range.

Nesting cgroups doesn't seem to be the dominating factor as well,
with nestings up to 10 levels not showing a significant performance
difference.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: David S. Miller <davem@davemloft.net>
CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
---
 crypto/af_alg.c               |    7 ++-
 include/linux/cgroup_subsys.h |    4 +
 include/net/netns/ipv4.h      |    1 +
 include/net/sock.h            |   66 +++++++++++++++-
 include/net/tcp.h             |   12 ++-
 include/net/udp.h             |    3 +-
 include/trace/events/sock.h   |   10 +-
 init/Kconfig                  |   11 +++
 mm/Makefile                   |    1 +
 net/core/sock.c               |  136 +++++++++++++++++++++++++++-------
 net/decnet/af_decnet.c        |   21 +++++-
 net/ipv4/proc.c               |    8 +-
 net/ipv4/sysctl_net_ipv4.c    |   59 +++++++++++++--
 net/ipv4/tcp.c                |  164 +++++++++++++++++++++++++++++++++++-----
 net/ipv4/tcp_input.c          |   17 ++--
 net/ipv4/tcp_ipv4.c           |   27 +++++--
 net/ipv4/tcp_output.c         |    2 +-
 net/ipv4/tcp_timer.c          |    2 +-
 net/ipv4/udp.c                |   20 ++++-
 net/ipv6/tcp_ipv6.c           |   16 +++-
 net/ipv6/udp.c                |    4 +-
 net/sctp/socket.c             |   35 +++++++--
 22 files changed, 514 insertions(+), 112 deletions(-)

diff --git a/crypto/af_alg.c b/crypto/af_alg.c
index ac33d5f..df168d8 100644
--- a/crypto/af_alg.c
+++ b/crypto/af_alg.c
@@ -29,10 +29,15 @@ struct alg_type_list {
 
 static atomic_long_t alg_memory_allocated;
 
+static atomic_long_t *memory_allocated_alg(struct kmem_cgroup *sg)
+{
+	return &alg_memory_allocated;
+}
+
 static struct proto alg_proto = {
 	.name			= "ALG",
 	.owner			= THIS_MODULE,
-	.memory_allocated	= &alg_memory_allocated,
+	.memory_allocated	= memory_allocated_alg,
 	.obj_size		= sizeof(struct alg_sock),
 };
 
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index ac663c1..363b8e8 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -35,6 +35,10 @@ SUBSYS(cpuacct)
 SUBSYS(mem_cgroup)
 #endif
 
+#ifdef CONFIG_CGROUP_KMEM
+SUBSYS(kmem)
+#endif
+
 /* */
 
 #ifdef CONFIG_CGROUP_DEVICE
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index d786b4f..bbd023a 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -55,6 +55,7 @@ struct netns_ipv4 {
 	int current_rt_cache_rebuild_count;
 
 	unsigned int sysctl_ping_group_range[2];
+	long sysctl_tcp_mem[3];
 
 	atomic_t rt_genid;
 	atomic_t dev_addr_genid;
diff --git a/include/net/sock.h b/include/net/sock.h
index 8e4062f..e085148 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -62,7 +62,9 @@
 #include <linux/atomic.h>
 #include <net/dst.h>
 #include <net/checksum.h>
+#include <linux/kmem_cgroup.h>
 
+int sockets_populate(struct cgroup_subsys *ss, struct cgroup *cgrp);
 /*
  * This structure really needs to be cleaned up.
  * Most of it is for TCP, and not used by any of
@@ -339,6 +341,7 @@ struct sock {
 #endif
 	__u32			sk_mark;
 	u32			sk_classid;
+	struct kmem_cgroup	*sk_cgrp;
 	void			(*sk_state_change)(struct sock *sk);
 	void			(*sk_data_ready)(struct sock *sk, int bytes);
 	void			(*sk_write_space)(struct sock *sk);
@@ -786,16 +789,21 @@ struct proto {
 
 	/* Memory pressure */
 	void			(*enter_memory_pressure)(struct sock *sk);
-	atomic_long_t		*memory_allocated;	/* Current allocated memory. */
-	struct percpu_counter	*sockets_allocated;	/* Current number of sockets. */
+	/* Current allocated memory. */
+	atomic_long_t		*(*memory_allocated)(struct kmem_cgroup *sg);
+	/* Current number of sockets. */
+	struct percpu_counter	*(*sockets_allocated)(struct kmem_cgroup *sg);
+
+	int			(*init_cgroup)(struct cgroup *cgrp,
+					       struct cgroup_subsys *ss);
 	/*
 	 * Pressure flag: try to collapse.
 	 * Technical note: it is used by multiple contexts non atomically.
 	 * All the __sk_mem_schedule() is of this nature: accounting
 	 * is strict, actions are advisory and have some latency.
 	 */
-	int			*memory_pressure;
-	long			*sysctl_mem;
+	int			*(*memory_pressure)(struct kmem_cgroup *sg);
+	long			*(*prot_mem)(struct kmem_cgroup *sg);
 	int			*sysctl_wmem;
 	int			*sysctl_rmem;
 	int			max_header;
@@ -826,6 +834,56 @@ struct proto {
 #endif
 };
 
+#define sk_memory_pressure(sk)						\
+({									\
+	int *__ret = NULL;						\
+	if ((sk)->sk_prot->memory_pressure)				\
+		__ret = (sk)->sk_prot->memory_pressure(sk->sk_cgrp);	\
+	__ret;								\
+})
+
+#define sk_sockets_allocated(sk)				\
+({								\
+	struct percpu_counter *__p;				\
+	__p = (sk)->sk_prot->sockets_allocated(sk->sk_cgrp);	\
+	__p;							\
+})
+
+#define sk_memory_allocated(sk)					\
+({								\
+	atomic_long_t *__mem;					\
+	__mem = (sk)->sk_prot->memory_allocated(sk->sk_cgrp);	\
+	__mem;							\
+})
+
+#define sk_prot_mem(sk)						\
+({								\
+	long *__mem = (sk)->sk_prot->prot_mem(sk->sk_cgrp);	\
+	__mem;							\
+})
+
+#define sg_memory_pressure(prot, sg)				\
+({								\
+	int *__ret = NULL;					\
+	if (prot->memory_pressure)				\
+		__ret = (prot)->memory_pressure(sg);		\
+	__ret;							\
+})
+
+#define sg_memory_allocated(prot, sg)				\
+({								\
+	atomic_long_t *__mem;					\
+	__mem = (prot)->memory_allocated(sg);			\
+	__mem;							\
+})
+
+#define sg_sockets_allocated(prot, sg)				\
+({								\
+	struct percpu_counter *__p;				\
+	__p = (prot)->sockets_allocated(sg);			\
+	__p;							\
+})
+
 extern int proto_register(struct proto *prot, int alloc_slab);
 extern void proto_unregister(struct proto *prot);
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 149a415..8e1ec4a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -230,7 +230,6 @@ extern int sysctl_tcp_fack;
 extern int sysctl_tcp_reordering;
 extern int sysctl_tcp_ecn;
 extern int sysctl_tcp_dsack;
-extern long sysctl_tcp_mem[3];
 extern int sysctl_tcp_wmem[3];
 extern int sysctl_tcp_rmem[3];
 extern int sysctl_tcp_app_win;
@@ -253,9 +252,12 @@ extern int sysctl_tcp_cookie_size;
 extern int sysctl_tcp_thin_linear_timeouts;
 extern int sysctl_tcp_thin_dupack;
 
-extern atomic_long_t tcp_memory_allocated;
-extern struct percpu_counter tcp_sockets_allocated;
-extern int tcp_memory_pressure;
+struct kmem_cgroup;
+extern long *tcp_sysctl_mem(struct kmem_cgroup *sg);
+struct percpu_counter *sockets_allocated_tcp(struct kmem_cgroup *sg);
+int *memory_pressure_tcp(struct kmem_cgroup *sg);
+int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss);
+atomic_long_t *memory_allocated_tcp(struct kmem_cgroup *sg);
 
 /*
  * The next routines deal with comparing 32 bit unsigned ints
@@ -286,7 +288,7 @@ static inline bool tcp_too_many_orphans(struct sock *sk, int shift)
 	}
 
 	if (sk->sk_wmem_queued > SOCK_MIN_SNDBUF &&
-	    atomic_long_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])
+	    atomic_long_read(sk_memory_allocated(sk)) > sk_prot_mem(sk)[2])
 		return true;
 	return false;
 }
diff --git a/include/net/udp.h b/include/net/udp.h
index 67ea6fc..0e27388 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -105,7 +105,8 @@ static inline struct udp_hslot *udp_hashslot2(struct udp_table *table,
 
 extern struct proto udp_prot;
 
-extern atomic_long_t udp_memory_allocated;
+atomic_long_t *memory_allocated_udp(struct kmem_cgroup *sg);
+long *udp_sysctl_mem(struct kmem_cgroup *sg);
 
 /* sysctl variables for udp */
 extern long sysctl_udp_mem[3];
diff --git a/include/trace/events/sock.h b/include/trace/events/sock.h
index 779abb9..12a6083 100644
--- a/include/trace/events/sock.h
+++ b/include/trace/events/sock.h
@@ -37,7 +37,7 @@ TRACE_EVENT(sock_exceed_buf_limit,
 
 	TP_STRUCT__entry(
 		__array(char, name, 32)
-		__field(long *, sysctl_mem)
+		__field(long *, prot_mem)
 		__field(long, allocated)
 		__field(int, sysctl_rmem)
 		__field(int, rmem_alloc)
@@ -45,7 +45,7 @@ TRACE_EVENT(sock_exceed_buf_limit,
 
 	TP_fast_assign(
 		strncpy(__entry->name, prot->name, 32);
-		__entry->sysctl_mem = prot->sysctl_mem;
+		__entry->prot_mem = sk->sk_prot->prot_mem(sk->sk_cgrp);
 		__entry->allocated = allocated;
 		__entry->sysctl_rmem = prot->sysctl_rmem[0];
 		__entry->rmem_alloc = atomic_read(&sk->sk_rmem_alloc);
@@ -54,9 +54,9 @@ TRACE_EVENT(sock_exceed_buf_limit,
 	TP_printk("proto:%s sysctl_mem=%ld,%ld,%ld allocated=%ld "
 		"sysctl_rmem=%d rmem_alloc=%d",
 		__entry->name,
-		__entry->sysctl_mem[0],
-		__entry->sysctl_mem[1],
-		__entry->sysctl_mem[2],
+		__entry->prot_mem[0],
+		__entry->prot_mem[1],
+		__entry->prot_mem[2],
 		__entry->allocated,
 		__entry->sysctl_rmem,
 		__entry->rmem_alloc)
diff --git a/init/Kconfig b/init/Kconfig
index d627783..5955ac2 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -690,6 +690,17 @@ config CGROUP_MEM_RES_CTLR_SWAP_ENABLED
 	  select this option (if, for some reason, they need to disable it
 	  then swapaccount=0 does the trick).
 
+config CGROUP_KMEM
+	bool "Kernel Memory Resource Controller for Control Groups"
+	depends on CGROUPS
+	help
+	  The Kernel Memory cgroup can limit the amount of memory used by
+	  certain kernel objects in the system. Those are fundamentally
+	  different from the entities handled by the Memory Controller,
+	  which are page-based, and can be swapped. Users of the kmem
+	  cgroup can use it to guarantee that no group of processes will
+	  ever exhaust kernel resources alone.
+
 config CGROUP_PERF
 	bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
 	depends on PERF_EVENTS && CGROUPS
diff --git a/mm/Makefile b/mm/Makefile
index 836e416..1b1aa24 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -45,6 +45,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_KMEM) += kmem_cgroup.o
 obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
 obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
diff --git a/net/core/sock.c b/net/core/sock.c
index 3449df8..2d968ea 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -134,6 +134,24 @@
 #include <net/tcp.h>
 #endif
 
+static DEFINE_RWLOCK(proto_list_lock);
+static LIST_HEAD(proto_list);
+
+int sockets_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct proto *proto;
+	int ret = 0;
+
+	read_lock(&proto_list_lock);
+	list_for_each_entry(proto, &proto_list, node) {
+		if (proto->init_cgroup)
+			ret |= proto->init_cgroup(cgrp, ss);
+	}
+	read_unlock(&proto_list_lock);
+
+	return ret;
+}
+
 /*
  * Each address family might have different locking rules, so we have
  * one slock key per address family:
@@ -1114,6 +1132,31 @@ void sock_update_classid(struct sock *sk)
 EXPORT_SYMBOL(sock_update_classid);
 #endif
 
+void sock_update_kmem_cgrp(struct sock *sk)
+{
+#ifdef CONFIG_CGROUP_KMEM
+	sk->sk_cgrp = kcg_from_task(current);
+
+	/*
+	 * We don't need to protect against anything task-related, because
+	 * we are basically stuck with the sock pointer that won't change,
+	 * even if the task that originated the socket changes cgroups.
+	 *
+	 * What we do have to guarantee, is that the chain leading us to
+	 * the top level won't change under our noses. Incrementing the
+	 * reference count via cgroup_exclude_rmdir guarantees that.
+	 */
+	cgroup_exclude_rmdir(&sk->sk_cgrp->css);
+#endif
+}
+
+void sock_release_kmem_cgrp(struct sock *sk)
+{
+#ifdef CONFIG_CGROUP_KMEM
+	cgroup_release_and_wakeup_rmdir(&sk->sk_cgrp->css);
+#endif
+}
+
 /**
  *	sk_alloc - All socket objects are allocated here
  *	@net: the applicable net namespace
@@ -1139,6 +1182,7 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
 		atomic_set(&sk->sk_wmem_alloc, 1);
 
 		sock_update_classid(sk);
+		sock_update_kmem_cgrp(sk);
 	}
 
 	return sk;
@@ -1170,6 +1214,7 @@ static void __sk_free(struct sock *sk)
 		put_cred(sk->sk_peer_cred);
 	put_pid(sk->sk_peer_pid);
 	put_net(sock_net(sk));
+	sock_release_kmem_cgrp(sk);
 	sk_prot_free(sk->sk_prot_creator, sk);
 }
 
@@ -1287,8 +1332,8 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
 		sk_set_socket(newsk, NULL);
 		newsk->sk_wq = NULL;
 
-		if (newsk->sk_prot->sockets_allocated)
-			percpu_counter_inc(newsk->sk_prot->sockets_allocated);
+		if (sk_sockets_allocated(sk))
+			percpu_counter_inc(sk_sockets_allocated(sk));
 
 		if (sock_flag(newsk, SOCK_TIMESTAMP) ||
 		    sock_flag(newsk, SOCK_TIMESTAMPING_RX_SOFTWARE))
@@ -1676,29 +1721,51 @@ EXPORT_SYMBOL(sk_wait_data);
  */
 int __sk_mem_schedule(struct sock *sk, int size, int kind)
 {
-	struct proto *prot = sk->sk_prot;
 	int amt = sk_mem_pages(size);
+	struct proto *prot = sk->sk_prot;
 	long allocated;
+	int *memory_pressure;
+	long *prot_mem;
+	int parent_failure = 0;
+	struct kmem_cgroup *sg;
 
 	sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
-	allocated = atomic_long_add_return(amt, prot->memory_allocated);
+
+	memory_pressure = sk_memory_pressure(sk);
+	prot_mem = sk_prot_mem(sk);
+
+	allocated = atomic_long_add_return(amt, sk_memory_allocated(sk));
+
+#ifdef CONFIG_CGROUP_KMEM
+	for (sg = sk->sk_cgrp->parent; sg != NULL; sg = sg->parent) {
+		long alloc;
+		/*
+		 * Large nestings are not the common case, and stopping in the
+		 * middle would be complicated enough, that we bill it all the
+		 * way through the root, and if needed, unbill everything later
+		 */
+		alloc = atomic_long_add_return(amt,
+					       sg_memory_allocated(prot, sg));
+		parent_failure |= (alloc > sk_prot_mem(sk)[2]);
+	}
+#endif
+
+	/* Over hard limit (we, or our parents) */
+	if (parent_failure || (allocated > prot_mem[2]))
+		goto suppress_allocation;
 
 	/* Under limit. */
-	if (allocated <= prot->sysctl_mem[0]) {
-		if (prot->memory_pressure && *prot->memory_pressure)
-			*prot->memory_pressure = 0;
+	if (allocated <= prot_mem[0]) {
+		if (memory_pressure && *memory_pressure)
+			*memory_pressure = 0;
 		return 1;
 	}
 
 	/* Under pressure. */
-	if (allocated > prot->sysctl_mem[1])
+	if (allocated > prot_mem[1])
 		if (prot->enter_memory_pressure)
 			prot->enter_memory_pressure(sk);
 
-	/* Over hard limit. */
-	if (allocated > prot->sysctl_mem[2])
-		goto suppress_allocation;
-
 	/* guarantee minimum buffer size under pressure */
 	if (kind == SK_MEM_RECV) {
 		if (atomic_read(&sk->sk_rmem_alloc) < prot->sysctl_rmem[0])
@@ -1712,13 +1779,13 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
 				return 1;
 	}
 
-	if (prot->memory_pressure) {
+	if (memory_pressure) {
 		int alloc;
 
-		if (!*prot->memory_pressure)
+		if (!*memory_pressure)
 			return 1;
-		alloc = percpu_counter_read_positive(prot->sockets_allocated);
-		if (prot->sysctl_mem[2] > alloc *
+		alloc = percpu_counter_read_positive(sk_sockets_allocated(sk));
+		if (prot_mem[2] > alloc *
 		    sk_mem_pages(sk->sk_wmem_queued +
 				 atomic_read(&sk->sk_rmem_alloc) +
 				 sk->sk_forward_alloc))
@@ -1741,7 +1808,13 @@ suppress_allocation:
 
 	/* Alas. Undo changes. */
 	sk->sk_forward_alloc -= amt * SK_MEM_QUANTUM;
-	atomic_long_sub(amt, prot->memory_allocated);
+
+	atomic_long_sub(amt, sk_memory_allocated(sk));
+
+#ifdef CONFIG_CGROUP_KMEM
+	for (sg = sk->sk_cgrp->parent; sg != NULL; sg = sg->parent)
+		atomic_long_sub(amt, sg_memory_allocated(prot, sg));
+#endif
 	return 0;
 }
 EXPORT_SYMBOL(__sk_mem_schedule);
@@ -1753,14 +1826,24 @@ EXPORT_SYMBOL(__sk_mem_schedule);
 void __sk_mem_reclaim(struct sock *sk)
 {
 	struct proto *prot = sk->sk_prot;
+	struct kmem_cgroup *sg = sk->sk_cgrp;
+	int *memory_pressure = sk_memory_pressure(sk);
 
 	atomic_long_sub(sk->sk_forward_alloc >> SK_MEM_QUANTUM_SHIFT,
-		   prot->memory_allocated);
+		   sk_memory_allocated(sk));
+
+#ifdef CONFIG_CGROUP_KMEM
+	for (sg = sk->sk_cgrp->parent; sg != NULL; sg = sg->parent) {
+		atomic_long_sub(sk->sk_forward_alloc >> SK_MEM_QUANTUM_SHIFT,
+						sg_memory_allocated(prot, sg));
+	}
+#endif
+
 	sk->sk_forward_alloc &= SK_MEM_QUANTUM - 1;
 
-	if (prot->memory_pressure && *prot->memory_pressure &&
-	    (atomic_long_read(prot->memory_allocated) < prot->sysctl_mem[0]))
-		*prot->memory_pressure = 0;
+	if (memory_pressure && *memory_pressure &&
+	    (atomic_long_read(sk_memory_allocated(sk)) < sk_prot_mem(sk)[0]))
+		*memory_pressure = 0;
 }
 EXPORT_SYMBOL(__sk_mem_reclaim);
 
@@ -2252,9 +2335,6 @@ void sk_common_release(struct sock *sk)
 }
 EXPORT_SYMBOL(sk_common_release);
 
-static DEFINE_RWLOCK(proto_list_lock);
-static LIST_HEAD(proto_list);
-
 #ifdef CONFIG_PROC_FS
 #define PROTO_INUSE_NR	64	/* should be enough for the first time */
 struct prot_inuse {
@@ -2479,13 +2559,17 @@ static char proto_method_implemented(const void *method)
 
 static void proto_seq_printf(struct seq_file *seq, struct proto *proto)
 {
+	struct kmem_cgroup *sg = kcg_from_task(current);
+
 	seq_printf(seq, "%-9s %4u %6d  %6ld   %-3s %6u   %-3s  %-10s "
 			"%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
 		   proto->name,
 		   proto->obj_size,
 		   sock_prot_inuse_get(seq_file_net(seq), proto),
-		   proto->memory_allocated != NULL ? atomic_long_read(proto->memory_allocated) : -1L,
-		   proto->memory_pressure != NULL ? *proto->memory_pressure ? "yes" : "no" : "NI",
+		   proto->memory_allocated != NULL ?
+			atomic_long_read(sg_memory_allocated(proto, sg)) : -1L,
+		   proto->memory_pressure != NULL ?
+			*sg_memory_pressure(proto, sg) ? "yes" : "no" : "NI",
 		   proto->max_header,
 		   proto->slab == NULL ? "no" : "yes",
 		   module_name(proto->owner),
diff --git a/net/decnet/af_decnet.c b/net/decnet/af_decnet.c
index 19acd00..463b299 100644
--- a/net/decnet/af_decnet.c
+++ b/net/decnet/af_decnet.c
@@ -458,13 +458,28 @@ static void dn_enter_memory_pressure(struct sock *sk)
 	}
 }
 
+static atomic_long_t *memory_allocated_dn(struct kmem_cgroup *sg)
+{
+	return &decnet_memory_allocated;
+}
+
+static int *memory_pressure_dn(struct kmem_cgroup *sg)
+{
+	return &dn_memory_pressure;
+}
+
+static long *dn_sysctl_mem(struct kmem_cgroup *sg)
+{
+	return sysctl_decnet_mem;
+}
+
 static struct proto dn_proto = {
 	.name			= "NSP",
 	.owner			= THIS_MODULE,
 	.enter_memory_pressure	= dn_enter_memory_pressure,
-	.memory_pressure	= &dn_memory_pressure,
-	.memory_allocated	= &decnet_memory_allocated,
-	.sysctl_mem		= sysctl_decnet_mem,
+	.memory_pressure	= memory_pressure_dn,
+	.memory_allocated	= memory_allocated_dn,
+	.prot_mem		= dn_sysctl_mem,
 	.sysctl_wmem		= sysctl_decnet_wmem,
 	.sysctl_rmem		= sysctl_decnet_rmem,
 	.max_header		= DN_MAX_NSP_DATA_HEADER + 64,
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index b14ec7d..9c80acf 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -52,20 +52,22 @@ static int sockstat_seq_show(struct seq_file *seq, void *v)
 {
 	struct net *net = seq->private;
 	int orphans, sockets;
+	struct kmem_cgroup *sg = kcg_from_task(current);
+	struct percpu_counter *allocated = sg_sockets_allocated(&tcp_prot, sg);
 
 	local_bh_disable();
 	orphans = percpu_counter_sum_positive(&tcp_orphan_count);
-	sockets = percpu_counter_sum_positive(&tcp_sockets_allocated);
+	sockets = percpu_counter_sum_positive(allocated);
 	local_bh_enable();
 
 	socket_seq_show(seq);
 	seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %ld\n",
 		   sock_prot_inuse_get(net, &tcp_prot), orphans,
 		   tcp_death_row.tw_count, sockets,
-		   atomic_long_read(&tcp_memory_allocated));
+		   atomic_long_read(sg_memory_allocated((&tcp_prot), sg)));
 	seq_printf(seq, "UDP: inuse %d mem %ld\n",
 		   sock_prot_inuse_get(net, &udp_prot),
-		   atomic_long_read(&udp_memory_allocated));
+		   atomic_long_read(sg_memory_allocated((&udp_prot), sg)));
 	seq_printf(seq, "UDPLITE: inuse %d\n",
 		   sock_prot_inuse_get(net, &udplite_prot));
 	seq_printf(seq, "RAW: inuse %d\n",
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 69fd720..5e89480 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -14,6 +14,8 @@
 #include <linux/init.h>
 #include <linux/slab.h>
 #include <linux/nsproxy.h>
+#include <linux/kmem_cgroup.h>
+#include <linux/swap.h>
 #include <net/snmp.h>
 #include <net/icmp.h>
 #include <net/ip.h>
@@ -174,6 +176,43 @@ static int proc_allowed_congestion_control(ctl_table *ctl,
 	return ret;
 }
 
+static int ipv4_tcp_mem(ctl_table *ctl, int write,
+			   void __user *buffer, size_t *lenp,
+			   loff_t *ppos)
+{
+	int ret;
+	unsigned long vec[3];
+	struct kmem_cgroup *kmem = kcg_from_task(current);
+	struct net *net = current->nsproxy->net_ns;
+	int i;
+
+	ctl_table tmp = {
+		.data = &vec,
+		.maxlen = sizeof(vec),
+		.mode = ctl->mode,
+	};
+
+	if (!write) {
+		ctl->data = &net->ipv4.sysctl_tcp_mem;
+		return proc_doulongvec_minmax(ctl, write, buffer, lenp, ppos);
+	}
+
+	ret = proc_doulongvec_minmax(&tmp, write, buffer, lenp, ppos);
+	if (ret)
+		return ret;
+
+	for (i = 0; i < 3; i++)
+		if (vec[i] > kmem->tcp_max_memory)
+			return -EINVAL;
+
+	for (i = 0; i < 3; i++) {
+		net->ipv4.sysctl_tcp_mem[i] = vec[i];
+		kmem->tcp_prot_mem[i] = net->ipv4.sysctl_tcp_mem[i];
+	}
+
+	return 0;
+}
+
 static struct ctl_table ipv4_table[] = {
 	{
 		.procname	= "tcp_timestamps",
@@ -433,13 +472,6 @@ static struct ctl_table ipv4_table[] = {
 		.proc_handler	= proc_dointvec
 	},
 	{
-		.procname	= "tcp_mem",
-		.data		= &sysctl_tcp_mem,
-		.maxlen		= sizeof(sysctl_tcp_mem),
-		.mode		= 0644,
-		.proc_handler	= proc_doulongvec_minmax
-	},
-	{
 		.procname	= "tcp_wmem",
 		.data		= &sysctl_tcp_wmem,
 		.maxlen		= sizeof(sysctl_tcp_wmem),
@@ -721,6 +753,12 @@ static struct ctl_table ipv4_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= ipv4_ping_group_range,
 	},
+	{
+		.procname	= "tcp_mem",
+		.maxlen		= sizeof(init_net.ipv4.sysctl_tcp_mem),
+		.mode		= 0644,
+		.proc_handler	= ipv4_tcp_mem,
+	},
 	{ }
 };
 
@@ -734,6 +772,7 @@ EXPORT_SYMBOL_GPL(net_ipv4_ctl_path);
 static __net_init int ipv4_sysctl_init_net(struct net *net)
 {
 	struct ctl_table *table;
+	unsigned long limit;
 
 	table = ipv4_net_table;
 	if (!net_eq(net, &init_net)) {
@@ -769,6 +808,12 @@ static __net_init int ipv4_sysctl_init_net(struct net *net)
 
 	net->ipv4.sysctl_rt_cache_rebuild_count = 4;
 
+	limit = nr_free_buffer_pages() / 8;
+	limit = max(limit, 128UL);
+	net->ipv4.sysctl_tcp_mem[0] = limit / 4 * 3;
+	net->ipv4.sysctl_tcp_mem[1] = limit;
+	net->ipv4.sysctl_tcp_mem[2] = net->ipv4.sysctl_tcp_mem[0] * 2;
+
 	net->ipv4.ipv4_hdr = register_net_sysctl_table(net,
 			net_ipv4_ctl_path, table);
 	if (net->ipv4.ipv4_hdr == NULL)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 46febca..e1918fa 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -266,6 +266,7 @@
 #include <linux/crypto.h>
 #include <linux/time.h>
 #include <linux/slab.h>
+#include <linux/nsproxy.h>
 
 #include <net/icmp.h>
 #include <net/tcp.h>
@@ -282,23 +283,12 @@ int sysctl_tcp_fin_timeout __read_mostly = TCP_FIN_TIMEOUT;
 struct percpu_counter tcp_orphan_count;
 EXPORT_SYMBOL_GPL(tcp_orphan_count);
 
-long sysctl_tcp_mem[3] __read_mostly;
 int sysctl_tcp_wmem[3] __read_mostly;
 int sysctl_tcp_rmem[3] __read_mostly;
 
-EXPORT_SYMBOL(sysctl_tcp_mem);
 EXPORT_SYMBOL(sysctl_tcp_rmem);
 EXPORT_SYMBOL(sysctl_tcp_wmem);
 
-atomic_long_t tcp_memory_allocated;	/* Current allocated memory. */
-EXPORT_SYMBOL(tcp_memory_allocated);
-
-/*
- * Current number of TCP sockets.
- */
-struct percpu_counter tcp_sockets_allocated;
-EXPORT_SYMBOL(tcp_sockets_allocated);
-
 /*
  * TCP splice context
  */
@@ -308,23 +298,157 @@ struct tcp_splice_state {
 	unsigned int flags;
 };
 
+#ifdef CONFIG_CGROUP_KMEM
 /*
  * Pressure flag: try to collapse.
  * Technical note: it is used by multiple contexts non atomically.
  * All the __sk_mem_schedule() is of this nature: accounting
  * is strict, actions are advisory and have some latency.
  */
-int tcp_memory_pressure __read_mostly;
-EXPORT_SYMBOL(tcp_memory_pressure);
-
 void tcp_enter_memory_pressure(struct sock *sk)
 {
+	struct kmem_cgroup *sg = sk->sk_cgrp;
+	if (!sg->tcp_memory_pressure) {
+		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPMEMORYPRESSURES);
+		sg->tcp_memory_pressure = 1;
+	}
+}
+
+long *tcp_sysctl_mem(struct kmem_cgroup *sg)
+{
+	return sg->tcp_prot_mem;
+}
+
+atomic_long_t *memory_allocated_tcp(struct kmem_cgroup *sg)
+{
+	return &(sg->tcp_memory_allocated);
+}
+
+static int tcp_write_maxmem(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+	struct kmem_cgroup *sg = kcg_from_cgroup(cgrp);
+	struct net *net = current->nsproxy->net_ns;
+	int i;
+
+	if (!cgroup_lock_live_group(cgrp))
+		return -ENODEV;
+
+	/*
+	 * We can't allow more memory than our parents. Since this
+	 * will be tested for all calls, by induction, there is no need
+	 * to test any parent other than our own
+	 * */
+	if (sg->parent && (val > sg->parent->tcp_max_memory))
+		val = sg->parent->tcp_max_memory;
+
+	sg->tcp_max_memory = val;
+
+	for (i = 0; i < 3; i++)
+		sg->tcp_prot_mem[i]  = min_t(long, val,
+					     net->ipv4.sysctl_tcp_mem[i]);
+
+	cgroup_unlock();
+
+	return 0;
+}
+
+static u64 tcp_read_maxmem(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct kmem_cgroup *sg = kcg_from_cgroup(cgrp);
+	u64 ret;
+
+	if (!cgroup_lock_live_group(cgrp))
+		return -ENODEV;
+	ret = sg->tcp_max_memory;
+
+	cgroup_unlock();
+	return ret;
+}
+
+static struct cftype tcp_files[] = {
+	{
+		.name = "tcp_maxmem",
+		.write_u64 = tcp_write_maxmem,
+		.read_u64 = tcp_read_maxmem,
+	},
+};
+
+int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss)
+{
+	struct kmem_cgroup *sg = kcg_from_cgroup(cgrp);
+	unsigned long limit;
+	struct net *net = current->nsproxy->net_ns;
+
+	sg->tcp_memory_pressure = 0;
+	atomic_long_set(&sg->tcp_memory_allocated, 0);
+	percpu_counter_init(&sg->tcp_sockets_allocated, 0);
+
+	limit = nr_free_buffer_pages() / 8;
+	limit = max(limit, 128UL);
+
+	if (sg->parent)
+		sg->tcp_max_memory = sg->parent->tcp_max_memory;
+	else
+		sg->tcp_max_memory = limit * 2;
+
+	sg->tcp_prot_mem[0] = net->ipv4.sysctl_tcp_mem[0];
+	sg->tcp_prot_mem[1] = net->ipv4.sysctl_tcp_mem[1];
+	sg->tcp_prot_mem[2] = net->ipv4.sysctl_tcp_mem[2];
+
+	return cgroup_add_files(cgrp, ss, tcp_files, ARRAY_SIZE(tcp_files));
+}
+EXPORT_SYMBOL(tcp_init_cgroup);
+
+int *memory_pressure_tcp(struct kmem_cgroup *sg)
+{
+	return &sg->tcp_memory_pressure;
+}
+
+struct percpu_counter *sockets_allocated_tcp(struct kmem_cgroup *sg)
+{
+	return &sg->tcp_sockets_allocated;
+}
+#else
+
+/* Current number of TCP sockets. */
+struct percpu_counter tcp_sockets_allocated;
+atomic_long_t tcp_memory_allocated;	/* Current allocated memory. */
+int tcp_memory_pressure;
+
+int *memory_pressure_tcp(struct kmem_cgroup *sg)
+{
+	return &tcp_memory_pressure;
+}
+
+struct percpu_counter *sockets_allocated_tcp(struct kmem_cgroup *sg)
+{
+	return &tcp_sockets_allocated;
+}
+
+void tcp_enter_memory_pressure(struct sock *sock)
+{
 	if (!tcp_memory_pressure) {
 		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPMEMORYPRESSURES);
 		tcp_memory_pressure = 1;
 	}
 }
+
+long *tcp_sysctl_mem(struct kmem_cgroup *sg)
+{
+	return init_net.ipv4.sysctl_tcp_mem;
+}
+
+atomic_long_t *memory_allocated_tcp(struct kmem_cgroup *sg)
+{
+	return &tcp_memory_allocated;
+}
+#endif /* CONFIG_CGROUP_KMEM */
+
+EXPORT_SYMBOL(memory_pressure_tcp);
+EXPORT_SYMBOL(sockets_allocated_tcp);
 EXPORT_SYMBOL(tcp_enter_memory_pressure);
+EXPORT_SYMBOL(tcp_sysctl_mem);
+EXPORT_SYMBOL(memory_allocated_tcp);
 
 /* Convert seconds to retransmits based on initial and max timeout */
 static u8 secs_to_retrans(int seconds, int timeout, int rto_max)
@@ -3226,7 +3350,9 @@ void __init tcp_init(void)
 
 	BUILD_BUG_ON(sizeof(struct tcp_skb_cb) > sizeof(skb->cb));
 
+#ifndef CONFIG_CGROUP_KMEM
 	percpu_counter_init(&tcp_sockets_allocated, 0);
+#endif
 	percpu_counter_init(&tcp_orphan_count, 0);
 	tcp_hashinfo.bind_bucket_cachep =
 		kmem_cache_create("tcp_bind_bucket",
@@ -3277,14 +3403,10 @@ void __init tcp_init(void)
 	sysctl_tcp_max_orphans = cnt / 2;
 	sysctl_max_syn_backlog = max(128, cnt / 256);
 
-	limit = nr_free_buffer_pages() / 8;
-	limit = max(limit, 128UL);
-	sysctl_tcp_mem[0] = limit / 4 * 3;
-	sysctl_tcp_mem[1] = limit;
-	sysctl_tcp_mem[2] = sysctl_tcp_mem[0] * 2;
-
 	/* Set per-socket limits to no more than 1/128 the pressure threshold */
-	limit = ((unsigned long)sysctl_tcp_mem[1]) << (PAGE_SHIFT - 7);
+	limit = (unsigned long)init_net.ipv4.sysctl_tcp_mem[1];
+	limit <<= (PAGE_SHIFT - 7);
+
 	max_share = min(4UL*1024*1024, limit);
 
 	sysctl_tcp_wmem[0] = SK_MEM_QUANTUM;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ea0d218..c44e830 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -316,7 +316,7 @@ static void tcp_grow_window(struct sock *sk, struct sk_buff *skb)
 	/* Check #1 */
 	if (tp->rcv_ssthresh < tp->window_clamp &&
 	    (int)tp->rcv_ssthresh < tcp_space(sk) &&
-	    !tcp_memory_pressure) {
+	    !sk_memory_pressure(sk)) {
 		int incr;
 
 		/* Check #2. Increase window, if skb with such overhead
@@ -393,15 +393,16 @@ static void tcp_clamp_window(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct inet_connection_sock *icsk = inet_csk(sk);
+	struct proto *prot = sk->sk_prot;
 
 	icsk->icsk_ack.quick = 0;
 
-	if (sk->sk_rcvbuf < sysctl_tcp_rmem[2] &&
+	if (sk->sk_rcvbuf < prot->sysctl_rmem[2] &&
 	    !(sk->sk_userlocks & SOCK_RCVBUF_LOCK) &&
-	    !tcp_memory_pressure &&
-	    atomic_long_read(&tcp_memory_allocated) < sysctl_tcp_mem[0]) {
+	    !sk_memory_pressure(sk) &&
+	    atomic_long_read(sk_memory_allocated(sk)) < sk_prot_mem(sk)[0]) {
 		sk->sk_rcvbuf = min(atomic_read(&sk->sk_rmem_alloc),
-				    sysctl_tcp_rmem[2]);
+				    prot->sysctl_rmem[2]);
 	}
 	if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
 		tp->rcv_ssthresh = min(tp->window_clamp, 2U * tp->advmss);
@@ -4806,7 +4807,7 @@ static int tcp_prune_queue(struct sock *sk)
 
 	if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
 		tcp_clamp_window(sk);
-	else if (tcp_memory_pressure)
+	else if (sk_memory_pressure(sk))
 		tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss);
 
 	tcp_collapse_ofo_queue(sk);
@@ -4872,11 +4873,11 @@ static int tcp_should_expand_sndbuf(struct sock *sk)
 		return 0;
 
 	/* If we are under global TCP memory pressure, do not expand.  */
-	if (tcp_memory_pressure)
+	if (sk_memory_pressure(sk))
 		return 0;
 
 	/* If we are under soft global TCP memory pressure, do not expand.  */
-	if (atomic_long_read(&tcp_memory_allocated) >= sysctl_tcp_mem[0])
+	if (atomic_long_read(sk_memory_allocated(sk)) >= sk_prot_mem(sk)[0])
 		return 0;
 
 	/* If we filled the congestion window, do not expand.  */
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 1c12b8e..af6c095 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1848,6 +1848,7 @@ static int tcp_v4_init_sock(struct sock *sk)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
+	struct kmem_cgroup *sg;
 
 	skb_queue_head_init(&tp->out_of_order_queue);
 	tcp_init_xmit_timers(sk);
@@ -1901,7 +1902,13 @@ static int tcp_v4_init_sock(struct sock *sk)
 	sk->sk_rcvbuf = sysctl_tcp_rmem[1];
 
 	local_bh_disable();
-	percpu_counter_inc(&tcp_sockets_allocated);
+	percpu_counter_inc(sk_sockets_allocated(sk));
+
+#ifdef CONFIG_CGROUP_KMEM
+	for (sg = sk->sk_cgrp->parent; sg; sg = sg->parent)
+		percpu_counter_inc(sg_sockets_allocated(sk->sk_prot, sg));
+#endif
+
 	local_bh_enable();
 
 	return 0;
@@ -1910,6 +1917,7 @@ static int tcp_v4_init_sock(struct sock *sk)
 void tcp_v4_destroy_sock(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	struct kmem_cgroup *sg;
 
 	tcp_clear_xmit_timers(sk);
 
@@ -1957,7 +1965,11 @@ void tcp_v4_destroy_sock(struct sock *sk)
 		tp->cookie_values = NULL;
 	}
 
-	percpu_counter_dec(&tcp_sockets_allocated);
+	percpu_counter_dec(sk_sockets_allocated(sk));
+#ifdef CONFIG_CGROUP_KMEM
+	for (sg = sk->sk_cgrp->parent; sg; sg = sg->parent)
+		percpu_counter_dec(sg_sockets_allocated(sk->sk_prot, sg));
+#endif
 }
 EXPORT_SYMBOL(tcp_v4_destroy_sock);
 
@@ -2598,11 +2610,14 @@ struct proto tcp_prot = {
 	.unhash			= inet_unhash,
 	.get_port		= inet_csk_get_port,
 	.enter_memory_pressure	= tcp_enter_memory_pressure,
-	.sockets_allocated	= &tcp_sockets_allocated,
+	.memory_pressure	= memory_pressure_tcp,
+	.sockets_allocated	= sockets_allocated_tcp,
 	.orphan_count		= &tcp_orphan_count,
-	.memory_allocated	= &tcp_memory_allocated,
-	.memory_pressure	= &tcp_memory_pressure,
-	.sysctl_mem		= sysctl_tcp_mem,
+	.memory_allocated	= memory_allocated_tcp,
+#ifdef CONFIG_CGROUP_KMEM
+	.init_cgroup		= tcp_init_cgroup,
+#endif
+	.prot_mem		= tcp_sysctl_mem,
 	.sysctl_wmem		= sysctl_tcp_wmem,
 	.sysctl_rmem		= sysctl_tcp_rmem,
 	.max_header		= MAX_TCP_HEADER,
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 882e0b0..06aeb31 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1912,7 +1912,7 @@ u32 __tcp_select_window(struct sock *sk)
 	if (free_space < (full_space >> 1)) {
 		icsk->icsk_ack.quick = 0;
 
-		if (tcp_memory_pressure)
+		if (sk_memory_pressure(sk))
 			tp->rcv_ssthresh = min(tp->rcv_ssthresh,
 					       4U * tp->advmss);
 
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index ecd44b0..2c67617 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -261,7 +261,7 @@ static void tcp_delack_timer(unsigned long data)
 	}
 
 out:
-	if (tcp_memory_pressure)
+	if (sk_memory_pressure(sk))
 		sk_mem_reclaim(sk);
 out_unlock:
 	bh_unlock_sock(sk);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 1b5a193..6c08c65 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -120,9 +120,6 @@ EXPORT_SYMBOL(sysctl_udp_rmem_min);
 int sysctl_udp_wmem_min __read_mostly;
 EXPORT_SYMBOL(sysctl_udp_wmem_min);
 
-atomic_long_t udp_memory_allocated;
-EXPORT_SYMBOL(udp_memory_allocated);
-
 #define MAX_UDP_PORTS 65536
 #define PORTS_PER_CHAIN (MAX_UDP_PORTS / UDP_HTABLE_SIZE_MIN)
 
@@ -1918,6 +1915,19 @@ unsigned int udp_poll(struct file *file, struct socket *sock, poll_table *wait)
 }
 EXPORT_SYMBOL(udp_poll);
 
+static atomic_long_t udp_memory_allocated;
+atomic_long_t *memory_allocated_udp(struct kmem_cgroup *sg)
+{
+	return &udp_memory_allocated;
+}
+EXPORT_SYMBOL(memory_allocated_udp);
+
+long *udp_sysctl_mem(struct kmem_cgroup *sg)
+{
+	return sysctl_udp_mem;
+}
+EXPORT_SYMBOL(udp_sysctl_mem);
+
 struct proto udp_prot = {
 	.name		   = "UDP",
 	.owner		   = THIS_MODULE,
@@ -1936,8 +1946,8 @@ struct proto udp_prot = {
 	.unhash		   = udp_lib_unhash,
 	.rehash		   = udp_v4_rehash,
 	.get_port	   = udp_v4_get_port,
-	.memory_allocated  = &udp_memory_allocated,
-	.sysctl_mem	   = sysctl_udp_mem,
+	.memory_allocated  = &memory_allocated_udp,
+	.prot_mem	   = udp_sysctl_mem,
 	.sysctl_wmem	   = &sysctl_udp_wmem_min,
 	.sysctl_rmem	   = &sysctl_udp_rmem_min,
 	.obj_size	   = sizeof(struct udp_sock),
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index d1fb63f..0762e68 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1959,6 +1959,7 @@ static int tcp_v6_init_sock(struct sock *sk)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
+	struct kmem_cgroup *sg;
 
 	skb_queue_head_init(&tp->out_of_order_queue);
 	tcp_init_xmit_timers(sk);
@@ -2012,7 +2013,12 @@ static int tcp_v6_init_sock(struct sock *sk)
 	sk->sk_rcvbuf = sysctl_tcp_rmem[1];
 
 	local_bh_disable();
-	percpu_counter_inc(&tcp_sockets_allocated);
+	percpu_counter_inc(sk_sockets_allocated(sk));
+#ifdef CONFIG_CGROUP_KMEM
+	for (sg = sk->sk_cgrp->parent; sg; sg = sg->parent)
+		percpu_counter_dec(sg_sockets_allocated(sk->sk_prot, sg));
+#endif
+
 	local_bh_enable();
 
 	return 0;
@@ -2221,11 +2227,11 @@ struct proto tcpv6_prot = {
 	.unhash			= inet_unhash,
 	.get_port		= inet_csk_get_port,
 	.enter_memory_pressure	= tcp_enter_memory_pressure,
-	.sockets_allocated	= &tcp_sockets_allocated,
-	.memory_allocated	= &tcp_memory_allocated,
-	.memory_pressure	= &tcp_memory_pressure,
+	.sockets_allocated	= sockets_allocated_tcp,
+	.memory_allocated	= memory_allocated_tcp,
+	.memory_pressure	= memory_pressure_tcp,
 	.orphan_count		= &tcp_orphan_count,
-	.sysctl_mem		= sysctl_tcp_mem,
+	.prot_mem		= tcp_sysctl_mem,
 	.sysctl_wmem		= sysctl_tcp_wmem,
 	.sysctl_rmem		= sysctl_tcp_rmem,
 	.max_header		= MAX_TCP_HEADER,
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 29213b5..ef4b5b3 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1465,8 +1465,8 @@ struct proto udpv6_prot = {
 	.unhash		   = udp_lib_unhash,
 	.rehash		   = udp_v6_rehash,
 	.get_port	   = udp_v6_get_port,
-	.memory_allocated  = &udp_memory_allocated,
-	.sysctl_mem	   = sysctl_udp_mem,
+	.memory_allocated  = memory_allocated_udp,
+	.prot_mem	   = udp_sysctl_mem,
 	.sysctl_wmem	   = &sysctl_udp_wmem_min,
 	.sysctl_rmem	   = &sysctl_udp_rmem_min,
 	.obj_size	   = sizeof(struct udp6_sock),
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 836aa63..1b0300d 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -119,11 +119,30 @@ static int sctp_memory_pressure;
 static atomic_long_t sctp_memory_allocated;
 struct percpu_counter sctp_sockets_allocated;
 
+static long *sctp_sysctl_mem(struct kmem_cgroup *sg)
+{
+	return sysctl_sctp_mem;
+}
+
 static void sctp_enter_memory_pressure(struct sock *sk)
 {
 	sctp_memory_pressure = 1;
 }
 
+static int *memory_pressure_sctp(struct kmem_cgroup *sg)
+{
+	return &sctp_memory_pressure;
+}
+
+static atomic_long_t *memory_allocated_sctp(struct kmem_cgroup *sg)
+{
+	return &sctp_memory_allocated;
+}
+
+static struct percpu_counter *sockets_allocated_sctp(struct kmem_cgroup *sg)
+{
+	return &sctp_sockets_allocated;
+}
 
 /* Get the sndbuf space available at the time on the association.  */
 static inline int sctp_wspace(struct sctp_association *asoc)
@@ -6831,13 +6850,13 @@ struct proto sctp_prot = {
 	.unhash      =	sctp_unhash,
 	.get_port    =	sctp_get_port,
 	.obj_size    =  sizeof(struct sctp_sock),
-	.sysctl_mem  =  sysctl_sctp_mem,
+	.prot_mem    =  sctp_sysctl_mem,
 	.sysctl_rmem =  sysctl_sctp_rmem,
 	.sysctl_wmem =  sysctl_sctp_wmem,
-	.memory_pressure = &sctp_memory_pressure,
+	.memory_pressure = memory_pressure_sctp,
 	.enter_memory_pressure = sctp_enter_memory_pressure,
-	.memory_allocated = &sctp_memory_allocated,
-	.sockets_allocated = &sctp_sockets_allocated,
+	.memory_allocated = memory_allocated_sctp,
+	.sockets_allocated = sockets_allocated_sctp,
 };
 
 #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
@@ -6863,12 +6882,12 @@ struct proto sctpv6_prot = {
 	.unhash		= sctp_unhash,
 	.get_port	= sctp_get_port,
 	.obj_size	= sizeof(struct sctp6_sock),
-	.sysctl_mem	= sysctl_sctp_mem,
+	.prot_mem	= sctp_sysctl_mem,
 	.sysctl_rmem	= sysctl_sctp_rmem,
 	.sysctl_wmem	= sysctl_sctp_wmem,
-	.memory_pressure = &sctp_memory_pressure,
+	.memory_pressure = memory_pressure_sctp,
 	.enter_memory_pressure = sctp_enter_memory_pressure,
-	.memory_allocated = &sctp_memory_allocated,
-	.sockets_allocated = &sctp_sockets_allocated,
+	.memory_allocated = memory_allocated_sctp,
+	.sockets_allocated = sockets_allocated_sctp,
 };
 #endif /* defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) */
-- 
1.7.6

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox