Netdev List


* Re: [PATCH -next v2] unix stream: Fix use-after-free crashes
From: Eric Dumazet @ 2011-09-06 19:01 UTC
  To: Tim Chen
  Cc: Yan, Zheng, netdev@vger.kernel.org, davem@davemloft.net,
	sfr@canb.auug.org.au, jirislaby@gmail.com, sedat.dilek@gmail.com,
	alex.shi
In-Reply-To: <1315335019.2576.3048.camel@schen9-DESK>

On Tuesday, 06 September 2011 at 11:50 -0700, Tim Chen wrote:
> On Tue, 2011-09-06 at 19:40 +0200, Eric Dumazet wrote:
> > On Tuesday, 06 September 2011 at 09:25 -0700, Tim Chen wrote:
> > > On Sun, 2011-09-04 at 13:44 +0800, Yan, Zheng wrote:
> > > > Commit 0856a30409 (Scm: Remove unnecessary pid & credential references
> > > > in Unix socket's send and receive path) introduced a use-after-free bug.
> > > > It passes the scm reference to the first skb. Skb(s) afterwards may
> > > > reference freed data structure because the first skb can be destructed
> > > > by the receiver at anytime. The fix is by passing the scm reference to
> > > > the very last skb.
> > > > 
> > > > Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
> > > > Reported-by: Jiri Slaby <jirislaby@gmail.com>
> > > > ---
> > > 
> > > Thanks for finding this bug in my original patch.  I missed the case
> > > where the receiving side could have released all the references to the
> > > credential before the send side is using the credential again for
> > > subsequent skbs in the stream, thus causing the problem we saw.  Getting
> > > an extra reference for pid/credentials at the beginning of the stream
> > > and not getting reference for the last skb is the right approach.
> > > 
> > > Thanks also to Sedat, Valdis and Jiri for their extensive testing to
> > > discover the bug and testing the subsequent fixes. 
> > > 
> > > Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
> > 
> > What happens if message must be split in two skb,
> > first skb is built, queued (without scm reference)
> 
> An extra scm reference is already first obtained in scm_send at the
> beginning of unix_stream_sendmsg in Yan Zheng's patch.  So things should
> be okay as long as we only use this extra reference we got in scm_send
> for the last skb in unix_stream_sendmsg instead of the first skb.
> 
> > 
> > Second skb allocation fails.
> > 
> > Rule about refs/norefs games is : As soon as you put skb into a list, it
> > should have all appropriate references if this skb has pointer(s) to
> > objects(s)
> 
> All the skbs put on the list do have a proper reference on pid/scm.  In
> the example you give, the first skb got the reference at this line:
> 
> err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, fds_sent);

This is the current code. We know it's buggy.

I was discussing how things look after the proposed patch, not current net-next.

This reads :

err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, scm_ref);

So the first skb is sent without a ref taken, as mentioned in the changelog?

If the second skb cannot be built, we exit this system call with an already
queued skb. The receiver can then access freed memory.

> 
> the second skb use the reference already obtained at the beginning of
> unix_stream_sendmsg if the skb allocation is successful:
> 
> err = scm_send(sock, msg, siocb->scm);
> 
> Now if the second skb allocation failed, the extra scm reference will be
> released by scm_destroy in the error handling path.
> 
> > 
> > We should revert 0856a304091b33a and code the thing differently.
> > 
> > Instead of storing pointer to pid and cred in UNIXSKB(), why dont we
> > copy all needed information ? No ref counts at all.
> > 
> > skb->cb[] is large enough.
> > 
> 
> If we can simply copy some information over, that will be ideal and
> will resolve all the scalability problems.  
> 
> However, I don't see other obvious info that we can pass to avoid
> passing pid.  Our current credential is pid and uid based, and requires
> the knowledge of sender's pid to interpret uid to do credentials
> checking.  So without passing the sender pid, I don't see an easy way
> for the receive side to interpret sender uid it got, which is needed in
> user_ns_map_uid function when we call cred_to_ucred.  
> 
> I was trying to do minimal changes to gain some performance.  The
> approach you suggest is great but will probably require much more
> changes to the credentials infrastructure.  Or maybe there are some easy
> way to do it that I don't see.

My approach would basically revert the 7361c36c commit too :(

I am sorry, but the only way to avoid too many pid/cred references is to
lock the socket [aka unix_state_lock(other);] for the whole send()
duration.

This way, you can really increment the pid/cred reference on the last
pushed skb, because no reader can 'catch first skb'

As soon as unix_state_unlock(other) is called, anything can happen, so
the skb must be self-contained, as I stated in my earlier mail.


* Re: [PATCH -next v2] unix stream: Fix use-after-free crashes
From: Tim Chen @ 2011-09-06 19:33 UTC
  To: Eric Dumazet
  Cc: Yan, Zheng, netdev@vger.kernel.org, davem@davemloft.net,
	sfr@canb.auug.org.au, jirislaby@gmail.com, sedat.dilek@gmail.com,
	alex.shi
In-Reply-To: <1315335660.3400.7.camel@edumazet-laptop>

On Tue, 2011-09-06 at 21:01 +0200, Eric Dumazet wrote:

> > All the skbs put on the list does have proper reference on pid/scm.  In
> > the example you give, the first skb got the reference at this line:
> > 
> > err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, fds_sent);
> 
> This is the current code. We know its buggy.
> 
> I was discussing of things after proposed patch, not current net-next.

I think we are on the same page.


> My approach would basically revert the 7361c36c commit too :(

I think so.  I was not fond of commit 7361c36c as it caused a 90%
regression in threaded case of hackbench that we noticed back in 2.6.36
days.  If there's some way to undo its evil, I'm all for it.

> 
> I am sorry, but the only way to avoid too many pid/cred references is to
> lock the socket [aka unix_state_lock(other);] for the whole send()
> duration.

Yes, I think locking the sendmsg for the entire duration of
unix_stream_sendmsg makes a lot of sense.  It simplifies the logic a lot
more.  I'll try to cook something up in the next couple of days.

> 
> This way, you can really increment the pid/cred reference on the last
> pushed skb, because no reader can 'catch first skb'
> 
> As soon as unix_state_unlock(other) is called, everything can happen, so
> skb must be self contained, as I stated in my earlier mail.
> 
> 
> 

Thanks.

Tim


* Re: [PATCH -next v2] unix stream: Fix use-after-free crashes
From: Eric Dumazet @ 2011-09-06 19:43 UTC
  To: Tim Chen
  Cc: Yan, Zheng, netdev@vger.kernel.org, davem@davemloft.net,
	sfr@canb.auug.org.au, jirislaby@gmail.com, sedat.dilek@gmail.com,
	alex.shi
In-Reply-To: <1315337580.2576.3066.camel@schen9-DESK>

On Tuesday, 06 September 2011 at 12:33 -0700, Tim Chen wrote:

> Yes, I think locking the sendmsg for the entire duration of
> unix_stream_sendmsg makes a lot of sense.  It simplifies the logic a lot
> more.  I'll try to cook something up in the next couple of days.

That's not really possible: we can't hold a spinlock and call
sock_alloc_send_skb() and/or memcpy_fromiovec(), which might sleep.

You would need to prepare the full skb list, then:
- stick the ref on the last skb of the list.
- transfer the whole skb list to other->sk_receive_queue in one go,
  instead of one skb after another.

Unfortunately, this would break streaming (big send(), and another
thread doing the receive)
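
The "prepare the full list, then queue it in one go" idea can be sketched in user space as follows. This is only an illustration under stated assumptions, not kernel code: struct buf, struct queue, build_list() and splice_list() are hypothetical names, and a plain flag stands in for unix_state_lock(other). The point it shows is that allocation happens with no lock held, the single reference rides on the last buffer, and a reader can never observe a partially queued message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for sk_buff and a receive queue; not kernel types. */
struct buf {
	struct buf *next;
	int holds_ref;		/* 1 if this buffer carries the pid/cred reference */
};

struct queue {
	int locked;		/* single-threaded stand-in for unix_state_lock(other) */
	struct buf *head, *tail;
	size_t len;
};

static void free_list(struct buf *head)
{
	while (head) {
		struct buf *next = head->next;

		free(head);
		head = next;
	}
}

/* Build n buffers privately. Allocation may fail here; no lock is held yet. */
static struct buf *build_list(size_t n, struct buf **tailp)
{
	struct buf *head = NULL, *tail = NULL;

	for (size_t i = 0; i < n; i++) {
		struct buf *b = calloc(1, sizeof(*b));

		if (!b) {
			/* Nothing was queued yet, so unwinding is trivial. */
			free_list(head);
			*tailp = NULL;
			return NULL;
		}
		if (tail)
			tail->next = b;
		else
			head = b;
		tail = b;
	}
	if (tail)
		tail->holds_ref = 1;	/* the single reference rides on the last buffer */
	*tailp = tail;
	return head;
}

/* Publish the whole list in one step while "holding" the queue lock. */
static void splice_list(struct queue *q, struct buf *head, struct buf *tail,
			size_t n)
{
	if (!head)
		return;
	assert(!q->locked);
	q->locked = 1;
	if (q->tail)
		q->tail->next = head;
	else
		q->head = head;
	q->tail = tail;
	q->len += n;
	q->locked = 0;
}
```

The cost is the one named above: nothing becomes visible to the receiver until the whole message has been built, so a large send() no longer streams.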

Listen, I am wondering why hackbench even triggers SCM code. This is
really odd. We should not have a _single_ pid/cred ref/unref at all.


* RE: [PATCH] net: Prefer non link-local source addresses
From: Julian Anastasov @ 2011-09-06 19:48 UTC
  To: Harris, Jeff
  Cc: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <A0C485B929C0994A889B2580ECDF807648D4D928@EXCHANGE.corp.kentrox.com>


	Hello,

On Tue, 6 Sep 2011, Harris, Jeff wrote:

> In this case, the address scope values are being set properly.  The link-local address has link scope and the routable address has global scope.  When the inet_select_addr function is called, the dst address is 0 and the scope is 253 (link scope).  So, in this case, both address could match.  It just happens that the link local address is first in the list for the device.

	Yes, all primary addresses are sorted in decreasing
scope, link before global. For non-gatewayed routes (nh_gw=0)
inet_select_addr always selects the first address with
scope <= the route's scope.
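
	A simplified model of that selection rule, as a sketch only: struct addr and select_addr() are hypothetical names, and destination matching and secondary addresses are left out. The scope numbers are the RT_SCOPE_* values (universe 0, link 253, host 254), where a smaller number means a wider scope.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same numeric values as RT_SCOPE_UNIVERSE / RT_SCOPE_LINK / RT_SCOPE_HOST. */
enum { SCOPE_UNIVERSE = 0, SCOPE_LINK = 253, SCOPE_HOST = 254 };

struct addr {			/* hypothetical stand-in for struct in_ifaddr */
	uint32_t local;		/* address, written in host order for readability */
	uint8_t scope;
};

/*
 * Return the first address whose scope value is <= the route's scope,
 * or 0 if none qualifies. The list is assumed to be sorted the way the
 * mail describes: decreasing scope value, link before global.
 */
static uint32_t select_addr(const struct addr *list, size_t n,
			    uint8_t route_scope)
{
	for (size_t i = 0; i < n; i++) {
		if (list[i].scope > route_scope)
			continue;	/* address scope is narrower than the route's */
		return list[i].local;
	}
	return 0;
}
```

	With a link-local address (scope 253) listed before a global one (scope 0), a route of scope link returns the link-local address, while a route of scope global skips it and returns the global one.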

> This condition looks to be arising from the use of interface routes on our device (e.g. ip route add default dev eth0).  The routes are being installed with link scope.  Forcing a scope of global causes a scope of 0 in inet_select_addr which then selects the routable address always.  I have not found any definite documentation on whether local or global should be used for the route, but the default behavior of the 'ip' command is to use link scope on these routes and global on routes with a gateway address. 

	It is not a good idea to have scope link for the default route
if you plan to contact the world, because link routes can
select link addresses for the target network. That is what
you see as a problem. Maybe ip route should use scope
global as the default value for route 0.0.0.0/0.

	You have 2 choices:

1. Preferred: Use global scope for the default route:

ip route add default scope global dev eth0
or
ip route add default nexthop dev eth0 (not recommended)

	Global routes will not select link addresses.

2. Specify prefsrc for the route

ip route add default dev eth0 src <Global_IP>

	but you risk causing problems for MSG_DONTROUTE users if
scope remains 'link' for target addresses that are not really
onlink.

> Also, I have only been able to test against 2.6.33 which we use on our embedded device.  It is not easy to update to a more recent version.  The patch, though, applied cleanly to the latest stable version.
> 
> Jeff
> 
> -----Original Message-----
> From: Julian Anastasov [mailto:ja@ssi.bg] 
> Sent: Thursday, September 01, 2011 6:15 PM
> To: Harris, Jeff
> Cc: David S. Miller; Alexey Kuznetsov; James Morris; Hideaki YOSHIFUJI; Patrick McHardy; netdev@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] net: Prefer non link-local source addresses
> 
> 
> 	Hello,
> 
> On Thu, 1 Sep 2011, Jeff Harris wrote:
> 
> > Section 2.6.1 of RFC 3927 specifies that if link-local and routable addresses
> > are available on an interface, a routable address is preferred.  Update the
> > IPv4 source address selection algorithm to use a 169.254.x.x address only if
> > another matching address is not found.
> > 
> > Tested combinations of configured IP addresses with and without link-local to
> > verify a link-local address was chosen only if no routable address was
> > present.
> 
> 	As David Lamparter already said, isn't the scope value
> suitable for this purpose? Eg.
> ip addr add 169.254.5.5/16 brd + dev eth0 scope link
> 
> 	iproute2 already has function default_scope() in
> ip/ipaddress.c that assigns scope if it is not specified
> while adding address. May be we can add RT_SCOPE_LINK for
> 169.254 there?
> 
> 	Another such place is inet_set_ifa() in
> net/ipv4/devinet.c where we can assign scope, so that
> ifconfig works too.
> 
> 	I see also that net/ipv6/addrconf.c (sit_add_v4_addrs)
> avoids link-local addresses. What I mean is that the scope
> can be checked at many places and it is a mechanism that
> already works.
> 
> 	As result, we will not complicate inet_select_addr.

Regards

--
Julian Anastasov <ja@ssi.bg>


* Re: [PATCH -next v2] unix stream: Fix use-after-free crashes
From: Tim Chen @ 2011-09-06 19:59 UTC
  To: Eric Dumazet
  Cc: Yan, Zheng, netdev@vger.kernel.org, davem@davemloft.net,
	sfr@canb.auug.org.au, jirislaby@gmail.com, sedat.dilek@gmail.com,
	alex.shi
In-Reply-To: <1315338186.3400.20.camel@edumazet-laptop>

On Tue, 2011-09-06 at 21:43 +0200, Eric Dumazet wrote:
> On Tuesday, 06 September 2011 at 12:33 -0700, Tim Chen wrote:
> 
> > Yes, I think locking the sendmsg for the entire duration of
> > unix_stream_sendmsg makes a lot of sense.  It simplifies the logic a lot
> > more.  I'll try to cook something up in the next couple of days.
> 
> Thats not really possible, we cant hold a spinlock and call
> sock_alloc_send_skb() and/or memcpy_fromiovec(), wich might sleep.
> 
> You would need to prepare the full skb list, then :
> - stick the ref on the last skb of the list.
> 
> Transfert the whole skb list in other->sk_receive_queue in one go,
> instead of one after another.
> 
> Unfortunately, this would break streaming (big send(), and another
> thread doing the receive)
> 
> Listen, I am wondering why hackbench even triggers SCM code. This is
> really odd. We should not have a _single_ pid/cred ref/unref at all.
> 

Hackbench triggers the code because it has a bunch of threads sending
msgs on UNIX socket.
> 

Well, if the lock socket approach doesn't work, then my original patch
plus Yan Zheng's fix should still work.  I'll try to answer your
objections below:


> I was discussing of things after proposed patch, not current net-next.
> 
> This reads :
> 
> err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, scm_ref);
> 
> So first skb is sent without ref taken, as mentioned in Changelog ?
> 

No, the first skb is sent *with* a ref taken, as scm_ref is set to true for
the first skb.

> 
> If second skb cannot be built, we exit this system call with an already
> queued skb. Receiver can then access to freed memory.
> 

No, we do have the reference set: for the first skb, in unix_scm_to_skb; for the
second skb (which is the last skb), in scm_send.  Should the second skb alloc fail,
we'll release the ref in scm_destroy.  Otherwise, the receiver will release
the references while consuming the skbs.
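
The accounting described above can be modelled in user space as follows. This is a sketch of the intended bookkeeping only, not the patch itself: struct cred, send_stream() and consume() are hypothetical names, and a plain counter stands in for the pid/cred refcounts. One reference is taken up front, every buffer except the last takes an extra one, the last buffer inherits the up-front reference, and a failed allocation drops whatever is still unused.

```c
#include <assert.h>
#include <stdbool.h>

struct cred { int refs; };	/* hypothetical refcounted pid/cred object */

static void cred_get(struct cred *c) { c->refs++; }
static void cred_put(struct cred *c) { assert(c->refs > 0); c->refs--; }

/*
 * Model one stream send of nbufs buffers. Allocation of buffer number
 * fail_at fails (pass -1 for no failure). Returns how many buffers were
 * queued; each queued buffer owns exactly one reference on c.
 */
static int send_stream(struct cred *c, int nbufs, int fail_at)
{
	bool ref_avail = true;	/* up-front reference not yet handed to a buffer */
	int queued = 0;

	cred_get(c);		/* taken once at the start of the send */

	for (int i = 0; i < nbufs; i++) {
		if (i == fail_at)
			break;			/* allocation failed */
		if (i + 1 < nbufs)
			cred_get(c);		/* more buffers follow: extra reference */
		else
			ref_avail = false;	/* last buffer inherits the up-front one */
		queued++;
	}
	if (ref_avail)
		cred_put(c);	/* error path: drop the unused reference */
	return queued;
}

/* Receiver side: consuming a buffer drops the reference it owned. */
static void consume(struct cred *c, int queued)
{
	while (queued-- > 0)
		cred_put(c);
}
```

In this model the count always returns to its starting value once the receiver has consumed every queued buffer, whether or not an allocation failed midway.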

Tim


* Re: [PATCH -next v2] unix stream: Fix use-after-free crashes
From: Eric Dumazet @ 2011-09-06 20:19 UTC
  To: Tim Chen
  Cc: Yan, Zheng, netdev@vger.kernel.org, davem@davemloft.net,
	sfr@canb.auug.org.au, jirislaby@gmail.com, sedat.dilek@gmail.com,
	alex.shi
In-Reply-To: <1315339157.2576.3079.camel@schen9-DESK>

On Tuesday, 06 September 2011 at 12:59 -0700, Tim Chen wrote:
> On Tue, 2011-09-06 at 21:43 +0200, Eric Dumazet wrote:
> > On Tuesday, 06 September 2011 at 12:33 -0700, Tim Chen wrote:
> > 
> > > Yes, I think locking the sendmsg for the entire duration of
> > > unix_stream_sendmsg makes a lot of sense.  It simplifies the logic a lot
> > > more.  I'll try to cook something up in the next couple of days.
> > 
> > Thats not really possible, we cant hold a spinlock and call
> > sock_alloc_send_skb() and/or memcpy_fromiovec(), wich might sleep.
> > 
> > You would need to prepare the full skb list, then :
> > - stick the ref on the last skb of the list.
> > 
> > Transfert the whole skb list in other->sk_receive_queue in one go,
> > instead of one after another.
> > 
> > Unfortunately, this would break streaming (big send(), and another
> > thread doing the receive)
> > 
> > Listen, I am wondering why hackbench even triggers SCM code. This is
> > really odd. We should not have a _single_ pid/cred ref/unref at all.
> > 
> 
> Hackbench triggers the code because it has a bunch of threads sending
> msgs on UNIX socket.
> > 
> 
> Well, if the lock socket approach doesn't work, then my original patch
> plus Yan Zheng's fix should still work.  I'll try to answer your
> objections below:
> 
> 
> > I was discussing of things after proposed patch, not current net-next.
> > 
> > This reads :
> > 
> > err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, scm_ref);
> > 
> > So first skb is sent without ref taken, as mentioned in Changelog ?
> > 
> 
> No. the first skb is sent *with* ref taken, as scm_ref is set to true for
> first skb.
> 
> > 
> > If second skb cannot be built, we exit this system call with an already
> > queued skb. Receiver can then access to freed memory.
> > 
> 
> No, we do have reference set.  For first skb, in unix_scm_to_skb.  For the 
> second skb (which is the last skb), in scm_sent.  Should the second skb alloc failed,
> we'll release the ref in scm_destroy.  Otherwise, the receiver will release
> the references will consuming the skb.
> 

This is crap. This is not the intent of the code I read from the patch.

unless scm_ref really means scm_noref ?

I really hate this patch. I mean it. 

I read it 10 times, spent 2 hours, and still don't understand it.


@@ -1577,6 +1577,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
        int sent = 0;
        struct scm_cookie tmp_scm;
        bool fds_sent = false;
+       bool scm_ref = true;
        int max_level;
 
        if (NULL == siocb->scm)
@@ -1637,12 +1638,15 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
                 */
                size = min_t(int, size, skb_tailroom(skb));
 
+               /* pass the scm reference to the very last skb */

HERE: I understand : on the last skb, set scm_ref to false.
So comment is wrong.

+               if (sent + size >= len)
+                       scm_ref = false;
 
-               /* Only send the fds and no ref to pid in the first buffer */
-               err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, fds_sent);
+               /* Only send the fds in the first buffer */
+               err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, scm_ref);
                if (err < 0) {
                        kfree_skb(skb);
-                       goto out;
+                       goto out_err;
                }



As I said, we should revert the buggy patch, and rewrite a performance
fix from scratch, with not a single get_pid()/put_pid() in fast path.

read()/write() on AF_UNIX sockets should not use a single
get_pid()/put_pid().

This is a serious regression we should fix at 100%, not 50% or even 75%,
adding serious bugs.


* ip_rt_bug route.c:1677 in 3.0
From: Andi Kleen @ 2011-09-06 20:45 UTC
  To: netdev

FYI,

My 3.0 workstation just spewed the warnings below.

I wonder if an application could have caused this?

ip_rt_bug: 10.7.201.108 -> 255.255.255.255, ?
------------[ cut here ]------------
WARNING: at /home/ak/lsrc/git/linux-2.6/net/ipv4/route.c:1677 ip_rt_bug+0x5f/0x70()
Hardware name: ...
Modules linked in: vfat fat nls_utf8 udf ses enclosure nfs lockd fscache auth_rpcgss nfs_acl fuse sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ipv6 kvm_intel kvm uinput snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm iTCO_wdt microcode e1000 snd_timer snd soundcore broadcom iTCO_vendor_support tg3 i7core_edac edac_core snd_page_alloc i2c_i801 dcdbas serio_raw joydev pcspkr firewire_ohci firewire_core crc_itu_t usb_storage radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]
Pid: 28530, comm: xsane Not tainted 3.0.0+ #15
Call Trace:
 [<ffffffff810510ef>] warn_slowpath_common+0x7f/0xc0
 [<ffffffff8105114a>] warn_slowpath_null+0x1a/0x20
 [<ffffffff814946cf>] ip_rt_bug+0x5f/0x70
 [<ffffffff8149f149>] ip_local_out+0x29/0x30
 [<ffffffff814a047b>] ip_send_skb+0x1b/0x70
 [<ffffffff814c25fa>] udp_send_skb+0x11a/0x3b0
 [<ffffffff8149d3d0>] ? ip_setup_cork+0x170/0x170
 [<ffffffff814c475c>] udp_sendmsg+0x37c/0x9b0
 [<ffffffff814c31b9>] ? udp_lib_get_port+0x2a9/0x3e0
 [<ffffffff81528d25>] ? _raw_spin_unlock_bh+0x15/0x20
 [<ffffffff81448ca3>] ? release_sock+0xe3/0x110
 [<ffffffff814ce214>] inet_sendmsg+0x64/0xb0
 [<ffffffff81443e19>] sock_sendmsg+0xe9/0x120
 [<ffffffff8111bb74>] ? handle_pte_fault+0x84/0x980
 [<ffffffff8112be91>] ? free_pages_and_swap_cache+0xb1/0xe0
 [<ffffffff812750cd>] ? cpumask_any_but+0x2d/0x40
 [<ffffffff814464d1>] ? move_addr_to_kernel+0x71/0x80
 [<ffffffff81446f39>] sys_sendto+0x139/0x190
 [<ffffffff8144b20e>] ? sock_setsockopt+0x1ae/0x790
 [<ffffffff811511ed>] ? fd_install+0x3d/0x70
 [<ffffffff814471f3>] ? sys_setsockopt+0xb3/0xc0
 [<ffffffff8153092b>] system_call_fastpath+0x16/0x1b
---[ end trace ada1d01fefb0d73c ]---
ip_rt_bug: 10.7.201.108 -> 255.255.255.255, ?
------------[ cut here ]------------
WARNING: at /home/ak/lsrc/git/linux-2.6/net/ipv4/route.c:1677 ip_rt_bug+0x5f/0x70()
Hardware name: ...
Modules linked in: vfat fat nls_utf8 udf ses enclosure nfs lockd fscache auth_rpcgss nfs_acl fuse sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ipv6 kvm_intel kvm uinput snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm iTCO_wdt microcode e1000 snd_timer snd soundcore broadcom iTCO_vendor_support tg3 i7core_edac edac_core snd_page_alloc i2c_i801 dcdbas serio_raw joydev pcspkr firewire_ohci firewire_core crc_itu_t usb_storage radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]
Pid: 28530, comm: xsane Tainted: G        W   3.0.0+ #15
Call Trace:
 [<ffffffff810510ef>] warn_slowpath_common+0x7f/0xc0
 [<ffffffff8105114a>] warn_slowpath_null+0x1a/0x20
 [<ffffffff814946cf>] ip_rt_bug+0x5f/0x70
 [<ffffffff8149f149>] ip_local_out+0x29/0x30
 [<ffffffff814a047b>] ip_send_skb+0x1b/0x70
 [<ffffffff814c25fa>] udp_send_skb+0x11a/0x3b0
 [<ffffffff8149d3d0>] ? ip_setup_cork+0x170/0x170
 [<ffffffff814c475c>] udp_sendmsg+0x37c/0x9b0
 [<ffffffff814c31b9>] ? udp_lib_get_port+0x2a9/0x3e0
 [<ffffffff81528d25>] ? _raw_spin_unlock_bh+0x15/0x20
 [<ffffffff81448ca3>] ? release_sock+0xe3/0x110
 [<ffffffff814ce214>] inet_sendmsg+0x64/0xb0
 [<ffffffff81443e19>] sock_sendmsg+0xe9/0x120
...


-- 
ak@linux.intel.com -- Speaking for myself only


* Re: ip_rt_bug route.c:1677 in 3.0
From: David Miller @ 2011-09-06 20:51 UTC
  To: andi; +Cc: netdev
In-Reply-To: <20110906204545.GA29628@tassilo.jf.intel.com>

From: Andi Kleen <andi@firstfloor.org>
Date: Tue, 6 Sep 2011 13:45:45 -0700

> My 3.0 workstation just spew:

"3.0.what?"

> I wonder if a application could have caused this?

It's the routing cache hashing bug, fixed by:

commit d547f727df86059104af2234804fdd538e112015
Author: Julian Anastasov <ja@ssi.bg>
Date:   Sun Aug 7 22:20:20 2011 -0700

    ipv4: fix the reusing of routing cache entries
    
    	compare_keys and ip_route_input_common rely on
    rt_oif for distinguishing of input and output routes
    with same keys values. But sometimes the input route has
    also same hash chain (keyed by iif != 0) with the output
    routes (keyed by orig_oif=0). Problem visible if running
    with small number of rhash_entries.
    
    	Fix them to use rt_route_iif instead. By this way
    input route can not be returned to users that request
    output route.
    
    	The patch fixes the ip_rt_bug errors that were
    reported in ip_local_out context, mostly for 255.255.255.255
    destinations.
    
    Signed-off-by: Julian Anastasov <ja@ssi.bg>
    Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index e3dec1c..cb7efe0 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -731,6 +731,7 @@ static inline int compare_keys(struct rtable *rt1, struct rtable *rt2)
 		((__force u32)rt1->rt_key_src ^ (__force u32)rt2->rt_key_src) |
 		(rt1->rt_mark ^ rt2->rt_mark) |
 		(rt1->rt_key_tos ^ rt2->rt_key_tos) |
+		(rt1->rt_route_iif ^ rt2->rt_route_iif) |
 		(rt1->rt_oif ^ rt2->rt_oif) |
 		(rt1->rt_iif ^ rt2->rt_iif)) == 0;
 }
@@ -2321,8 +2322,8 @@ int ip_route_input_common(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 		if ((((__force u32)rth->rt_key_dst ^ (__force u32)daddr) |
 		     ((__force u32)rth->rt_key_src ^ (__force u32)saddr) |
 		     (rth->rt_iif ^ iif) |
-		     rth->rt_oif |
 		     (rth->rt_key_tos ^ tos)) == 0 &&
+		    rt_is_input_route(rth) &&
 		    rth->rt_mark == skb->mark &&
 		    net_eq(dev_net(rth->dst.dev), net) &&
 		    !rt_is_expired(rth)) {
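
The XOR/OR idiom used by compare_keys in the diff above can be tried out in isolation. This is a sketch under stated assumptions: struct route_key is a hypothetical stand-in whose field names are borrowed from the diff, and it does not model what values real input or output routes carry. It only shows that adding one more XOR term makes two keys that differ in that field alone compare as unequal.

```c
#include <assert.h>
#include <stdint.h>

struct route_key {		/* hypothetical; field names borrowed from the diff */
	uint32_t key_dst, key_src, mark;
	uint8_t key_tos;
	int route_iif, oif, iif;
};

/*
 * XOR of two equal fields is 0, so OR-ing all the XOR results gives 0
 * only when every field matched. Returns 1 for equal keys, 0 otherwise.
 */
static int compare_keys(const struct route_key *a, const struct route_key *b)
{
	return ((a->key_dst ^ b->key_dst) |
		(a->key_src ^ b->key_src) |
		(a->mark ^ b->mark) |
		(uint32_t)(a->key_tos ^ b->key_tos) |
		(uint32_t)(a->route_iif ^ b->route_iif) |
		(uint32_t)(a->oif ^ b->oif) |
		(uint32_t)(a->iif ^ b->iif)) == 0;
}
```

Two keys that agree everywhere except route_iif now compare as different, which is the property the commit message relies on to keep an input route from being returned for an output lookup.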


* Re: ip_rt_bug route.c:1677 in 3.0
From: Andi Kleen @ 2011-09-06 20:55 UTC
  To: David Miller; +Cc: andi, netdev
In-Reply-To: <20110906.165135.572835380555941402.davem@davemloft.net>

On Tue, Sep 06, 2011 at 04:51:35PM -0400, David Miller wrote:
> From: Andi Kleen <andi@firstfloor.org>
> Date: Tue, 6 Sep 2011 13:45:45 -0700
> 
> > My 3.0 workstation just spew:
> 
> "3.0.what?"

3.0.0 + some local changes unrelated to network.
> 
> > I wonder if a application could have caused this?
> 
> It's the routing cache hashing bug, fixed by:

Thanks.
-Andi


* Re: ip_rt_bug route.c:1677 in 3.0
From: Eric Dumazet @ 2011-09-06 21:01 UTC
  To: Andi Kleen; +Cc: netdev
In-Reply-To: <20110906204545.GA29628@tassilo.jf.intel.com>

On Tuesday, 06 September 2011 at 13:45 -0700, Andi Kleen wrote:
> FYI,
> 
> My 3.0 workstation just spew:
> 
> I wonder if a application could have caused this?
> 
> ip_rt_bug: 10.7.201.108 -> 255.255.255.255, ?
> ------------[ cut here ]------------
> WARNING: at /home/ak/lsrc/git/linux-2.6/net/ipv4/route.c:1677 ip_rt_bug+0x5f/0x70()
> Hardware name: ...
> Modules linked in: vfat fat nls_utf8 udf ses enclosure nfs lockd fscache auth_rpcgss nfs_acl fuse sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ipv6 kvm_intel kvm uinput snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm iTCO_wdt microcode e1000 snd_timer snd soundcore broadcom iTCO_vendor_support tg3 i7core_edac edac_core snd_page_alloc i2c_i801 dcdbas serio_raw joydev pcspkr firewire_ohci firewire_core crc_itu_t usb_storage radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]
> Pid: 28530, comm: xsane Not tainted 3.0.0+ #15
> Call Trace:
>  [<ffffffff810510ef>] warn_slowpath_common+0x7f/0xc0
>  [<ffffffff8105114a>] warn_slowpath_null+0x1a/0x20
>  [<ffffffff814946cf>] ip_rt_bug+0x5f/0x70
>  [<ffffffff8149f149>] ip_local_out+0x29/0x30
>  [<ffffffff814a047b>] ip_send_skb+0x1b/0x70
>  [<ffffffff814c25fa>] udp_send_skb+0x11a/0x3b0
>  [<ffffffff8149d3d0>] ? ip_setup_cork+0x170/0x170
>  [<ffffffff814c475c>] udp_sendmsg+0x37c/0x9b0
>  [<ffffffff814c31b9>] ? udp_lib_get_port+0x2a9/0x3e0
>  [<ffffffff81528d25>] ? _raw_spin_unlock_bh+0x15/0x20
>  [<ffffffff81448ca3>] ? release_sock+0xe3/0x110
>  [<ffffffff814ce214>] inet_sendmsg+0x64/0xb0
>  [<ffffffff81443e19>] sock_sendmsg+0xe9/0x120
>  [<ffffffff8111bb74>] ? handle_pte_fault+0x84/0x980
>  [<ffffffff8112be91>] ? free_pages_and_swap_cache+0xb1/0xe0
>  [<ffffffff812750cd>] ? cpumask_any_but+0x2d/0x40
>  [<ffffffff814464d1>] ? move_addr_to_kernel+0x71/0x80
>  [<ffffffff81446f39>] sys_sendto+0x139/0x190
>  [<ffffffff8144b20e>] ? sock_setsockopt+0x1ae/0x790
>  [<ffffffff811511ed>] ? fd_install+0x3d/0x70
>  [<ffffffff814471f3>] ? sys_setsockopt+0xb3/0xc0
>  [<ffffffff8153092b>] system_call_fastpath+0x16/0x1b
> ---[ end trace ada1d01fefb0d73c ]---
> ip_rt_bug: 10.7.201.108 -> 255.255.255.255, ?
> ------------[ cut here ]------------
> WARNING: at /home/ak/lsrc/git/linux-2.6/net/ipv4/route.c:1677 ip_rt_bug+0x5f/0x70()
> Hardware name: ...
> Modules linked in: vfat fat nls_utf8 udf ses enclosure nfs lockd fscache auth_rpcgss nfs_acl fuse sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ipv6 kvm_intel kvm uinput snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm iTCO_wdt microcode e1000 snd_timer snd soundcore broadcom iTCO_vendor_support tg3 i7core_edac edac_core snd_page_alloc i2c_i801 dcdbas serio_raw joydev pcspkr firewire_ohci firewire_core crc_itu_t usb_storage radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]
> Pid: 28530, comm: xsane Tainted: G        W   3.0.0+ #15
> Call Trace:
>  [<ffffffff810510ef>] warn_slowpath_common+0x7f/0xc0
>  [<ffffffff8105114a>] warn_slowpath_null+0x1a/0x20
>  [<ffffffff814946cf>] ip_rt_bug+0x5f/0x70
>  [<ffffffff8149f149>] ip_local_out+0x29/0x30
>  [<ffffffff814a047b>] ip_send_skb+0x1b/0x70
>  [<ffffffff814c25fa>] udp_send_skb+0x11a/0x3b0
>  [<ffffffff8149d3d0>] ? ip_setup_cork+0x170/0x170
>  [<ffffffff814c475c>] udp_sendmsg+0x37c/0x9b0
>  [<ffffffff814c31b9>] ? udp_lib_get_port+0x2a9/0x3e0
>  [<ffffffff81528d25>] ? _raw_spin_unlock_bh+0x15/0x20
>  [<ffffffff81448ca3>] ? release_sock+0xe3/0x110
>  [<ffffffff814ce214>] inet_sendmsg+0x64/0xb0
>  [<ffffffff81443e19>] sock_sendmsg+0xe9/0x120
> ...
> 
> 


Make sure you have this commit 

commit d547f727df86059104af2234804fdd538e112015
Author: Julian Anastasov <ja@ssi.bg>
Date:   Sun Aug 7 22:20:20 2011 -0700

    ipv4: fix the reusing of routing cache entries
    
        compare_keys and ip_route_input_common rely on
    rt_oif for distinguishing of input and output routes
    with same keys values. But sometimes the input route has
    also same hash chain (keyed by iif != 0) with the output
    routes (keyed by orig_oif=0). Problem visible if running
    with small number of rhash_entries.
    
        Fix them to use rt_route_iif instead. By this way
    input route can not be returned to users that request
    output route.
    
        The patch fixes the ip_rt_bug errors that were
    reported in ip_local_out context, mostly for 255.255.255.255
    destinations.
    


* Re: [PATCH -next v2] unix stream: Fix use-after-free crashes
From: Tim Chen @ 2011-09-06 22:08 UTC
  To: Eric Dumazet
  Cc: Yan, Zheng, netdev@vger.kernel.org, davem@davemloft.net,
	sfr@canb.auug.org.au, jirislaby@gmail.com, sedat.dilek@gmail.com,
	alex.shi
In-Reply-To: <1315340388.3400.28.camel@edumazet-laptop>

On Tue, 2011-09-06 at 22:19 +0200, Eric Dumazet wrote:

> 
> unless scm_ref really means scm_noref ?
> 
> I really hate this patch. I mean it. 
> 
> I read it 10 times, spent 2 hours and still dont understand it.
> 

Eric,

I've tried another patch to fix my original one.  I've used a boolean
ref_avail to indicate if there is an outstanding ref to scm not yet
encoded into the skb.  Hopefully the logic is clearer in this new patch.

> 
> As I said, we should revert the buggy patch, and rewrite a performance
> fix from scratch, with not a single get_pid()/put_pid() in fast path.
> 
> read()/write() on AF_UNIX sockets should not use a single
> get_pid()/put_pid().
> 
> This is a serious regression we should fix at 100%, not 50% or even 75%,
> adding serious bugs.

That would be ideal if there is another way to fix it 100%, other than reverting
commit 7361c36c.  If there were some way to know beforehand that
both sender and receiver share the same pid, which is quite common, a
lot of this pid code could probably be bypassed.

Tim


Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 136298c..78be921 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1582,11 +1582,13 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
 	struct scm_cookie tmp_scm;
 	bool fds_sent = false;
 	int max_level;
+	bool ref_avail; /* scm ref not yet used in skb */
 
 	if (NULL == siocb->scm)
 		siocb->scm = &tmp_scm;
 	wait_for_unix_gc();
 	err = scm_send(sock, msg, siocb->scm);
+	ref_avail = true;
 	if (err < 0)
 		return err;
 
@@ -1642,11 +1644,18 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
 		size = min_t(int, size, skb_tailroom(skb));
 
 
-		/* Only send the fds and no ref to pid in the first buffer */
-		err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, fds_sent);
+		/* encode scm in skb and use the scm ref */
+		ref_avail = false;
+		if (sent + size < len) { 
+			/* Only send the fds in the first buffer */
+			/* get additional ref if more skbs will be created */
+			err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, true);
+			ref_avail = true;
+		} else
+			err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, false);
 		if (err < 0) {
 			kfree_skb(skb);
-			goto out;
+			goto out_err;
 		}
 		max_level = err + 1;
 		fds_sent = true;
@@ -1654,7 +1663,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
 		err = memcpy_fromiovec(skb_put(skb, size), msg->msg_iov, size);
 		if (err) {
 			kfree_skb(skb);
-			goto out;
+			goto out_err;
 		}
 
 		unix_state_lock(other);
@@ -1671,10 +1680,10 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
 		sent += size;
 	}
 
-	if (skb)
-		scm_release(siocb->scm);
-	else
+	if (ref_avail)
 		scm_destroy(siocb->scm);
+	else
+		scm_release(siocb->scm);
 	siocb->scm = NULL;
 
 	return sent;
@@ -1687,9 +1696,10 @@ pipe_err:
 		send_sig(SIGPIPE, current, 0);
 	err = -EPIPE;
 out_err:
-	if (skb == NULL)
+	if (ref_avail)
 		scm_destroy(siocb->scm);
-out:
+	else
+		scm_release(siocb->scm);
 	siocb->scm = NULL;
 	return sent ? : err;
 }


* Re: [PATCH] per-cgroup tcp buffer limitation
From: Greg Thelen @ 2011-09-06 22:12 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, linux-mm, containers, netdev, xemul,
	David S. Miller, Hiroyouki Kamezawa, Eric W. Biederman
In-Reply-To: <4E664766.40200@parallels.com>

On Tue, Sep 6, 2011 at 9:16 AM, Glauber Costa <glommer@parallels.com> wrote:
> On 09/06/2011 01:08 PM, Greg Thelen wrote:
>>
>> On Mon, Sep 5, 2011 at 7:35 PM, Glauber Costa <glommer@parallels.com> wrote:
>>>
>>> This patch introduces per-cgroup tcp buffers limitation. This allows
>>> sysadmins to specify a maximum amount of kernel memory that
>>> tcp connections can use at any point in time. TCP is the main interest
>>> in this work, but extending it to other protocols would be easy.
>
> Hello Greg,
>
>> With this approach we would be giving admins the ability to
>> independently limit user memory with memcg and kernel memory with this
>> new kmem cgroup.
>>
>> At least in some situations admins prefer to give a particular
>> container X bytes without thinking about the kernel vs user split.
>> Sometimes the admin would prefer the kernel to keep the total
>> user+kernel memory below a certain threshold.  To achieve this with
>> this approach would we need a user space agent to monitor both kernel
>> and user usage for a container and grow/shrink memcg/kmem limits?
>
> Yes, I believe so. And this is not only valid for containers: the
> information we expose in proc, sys, cgroups, etc, is always much more fine
> grained than a considerable part of the users want. Tools come to fill this
> gap.

In your use cases do jobs separately specify independent kmem usage
limits and user memory usage limits?

I presume for people who want to simply dedicate X bytes of memory to
container C that a user-space agent would need to poll both
memcg/X/memory.usage_in_bytes and kmem/X/kmem.usage_in_bytes (or some
other file) to determine if memory limits should be adjusted (i.e. if
kernel memory is growing, then user memory would need to shrink).  So
far my use cases involve a single memory limit which includes both
kernel and user memory.  So I would need a user space agent to poll
{memcg,kmem}.usage_in_bytes to apply pressure to memcg if kmem grows
and vice versa.
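
A rough sketch of one polling step of such an agent, in userspace C.  The
helper names (read_usage, user_limit_for, rebalance_once) are made up for
illustration, and the kmem usage file is only the name proposed in this
thread, so its path is passed in by the caller rather than hard-coded.

```c
#include <assert.h>
#include <stdio.h>

/* Read a single unsigned value from a cgroup control file; -1 on error. */
static int read_usage(const char *path, unsigned long long *val)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = fscanf(f, "%llu", val) == 1;
	fclose(f);
	return ok ? 0 : -1;
}

/*
 * Given a combined budget and the current kernel memory usage, return
 * the user-memory limit that keeps user + kernel within the budget.
 */
static unsigned long long user_limit_for(unsigned long long total,
					 unsigned long long kmem_usage)
{
	return kmem_usage >= total ? 0 : total - kmem_usage;
}

/*
 * One polling step: read kmem usage and write the remaining budget as
 * the memcg limit.  Both paths are supplied by the caller (assumption:
 * something like kmem/X/kmem.usage_in_bytes and
 * memcg/X/memory.limit_in_bytes).
 */
static int rebalance_once(const char *kmem_usage_path,
			  const char *memcg_limit_path,
			  unsigned long long total)
{
	unsigned long long kmem;
	FILE *f;

	assert(kmem_usage_path && memcg_limit_path);
	if (read_usage(kmem_usage_path, &kmem) < 0)
		return -1;
	f = fopen(memcg_limit_path, "w");
	if (!f)
		return -1;
	fprintf(f, "%llu\n", user_limit_for(total, kmem));
	return fclose(f) == 0 ? 0 : -1;
}
```

A real agent would call something like rebalance_once() in a loop and would
probably want some hysteresis so the limit is not rewritten on every small
change.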

Do you foresee instantiation of multiple kmem cgroups, so that a
process could be added into kmem/K1 or kmem/K2?  If so do you plan on
supporting migration between cgroups and/or migration of kmem charge
from K1 to K2?

>> Do you foresee the kmem cgroup growing to include reclaimable slab,
>> where freeing one type of memory allows for reclaim of the other?
>
> Yes, absolutely.

Small comments below.

>>> It piggybacks on the memory control mechanism already present in
>>> /proc/sys/net/ipv4/tcp_mem. There is a soft limit, and a hard limit,
>>> that will suppress allocation when reached. For each cgroup, however,
>>> the file kmem.tcp_maxmem will be used to cap those values.
>>>
>>> The usage I have in mind here is containers. Each container will
>>> define its own values for soft and hard limits, but none of them can
>>> be bigger than the value the box's sysadmin specified from
>>> the outside.
>>>
>>> To test for any performance impacts of this patch, I used netperf's
>>> TCP_RR benchmark on localhost, so we can have both recv and snd in
>>> action.
>>>
>>> Command line used was ./src/netperf -t TCP_RR -H localhost, and the
>>> results:
>>>
>>> Without the patch
>>> =================
>>>
>>> Socket Size   Request  Resp.   Elapsed  Trans.
>>> Send   Recv   Size     Size    Time     Rate
>>> bytes  Bytes  bytes    bytes   secs.    per sec
>>>
>>> 16384  87380  1        1       10.00    26996.35
>>> 16384  87380
>>>
>>> With the patch
>>> ===============
>>>
>>> Local /Remote
>>> Socket Size   Request  Resp.   Elapsed  Trans.
>>> Send   Recv   Size     Size    Time     Rate
>>> bytes  Bytes  bytes    bytes   secs.    per sec
>>>
>>> 16384  87380  1        1       10.00    27291.86
>>> 16384  87380
>>>
>>> The difference is within a one-percent range.
>>>
>>> Nesting cgroups doesn't seem to be the dominating factor either,
>>> with nestings up to 10 levels not showing a significant performance
>>> difference.
>>>
>>> Signed-off-by: Glauber Costa <glommer@parallels.com>
>>> CC: David S. Miller <davem@davemloft.net>
>>> CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
>>> CC: Eric W. Biederman <ebiederm@xmission.com>
>>> ---
>>>  crypto/af_alg.c               |    7 ++-
>>>  include/linux/cgroup_subsys.h |    4 +
>>>  include/net/netns/ipv4.h      |    1 +
>>>  include/net/sock.h            |   66 +++++++++++++++-
>>>  include/net/tcp.h             |   12 ++-
>>>  include/net/udp.h             |    3 +-
>>>  include/trace/events/sock.h   |   10 +-
>>>  init/Kconfig                  |   11 +++
>>>  mm/Makefile                   |    1 +
>>>  net/core/sock.c               |  136 +++++++++++++++++++++++++++-------
>>>  net/decnet/af_decnet.c        |   21 +++++-
>>>  net/ipv4/proc.c               |    8 +-
>>>  net/ipv4/sysctl_net_ipv4.c    |   59 +++++++++++++--
>>>  net/ipv4/tcp.c                |  164 +++++++++++++++++++++++++++++++++++-----
>>>  net/ipv4/tcp_input.c          |   17 ++--
>>>  net/ipv4/tcp_ipv4.c           |   27 +++++--
>>>  net/ipv4/tcp_output.c         |    2 +-
>>>  net/ipv4/tcp_timer.c          |    2 +-
>>>  net/ipv4/udp.c                |   20 ++++-
>>>  net/ipv6/tcp_ipv6.c           |   16 +++-
>>>  net/ipv6/udp.c                |    4 +-
>>>  net/sctp/socket.c             |   35 +++++++--
>>>  22 files changed, 514 insertions(+), 112 deletions(-)
>>>
>>> diff --git a/crypto/af_alg.c b/crypto/af_alg.c
>>> index ac33d5f..df168d8 100644
>>> --- a/crypto/af_alg.c
>>> +++ b/crypto/af_alg.c
>>> @@ -29,10 +29,15 @@ struct alg_type_list {
>>>
>>>  static atomic_long_t alg_memory_allocated;
>>>
>>> +static atomic_long_t *memory_allocated_alg(struct kmem_cgroup *sg)
>>> +{
>>> +       return &alg_memory_allocated;
>>> +}
>>> +
>>>  static struct proto alg_proto = {
>>>        .name                   = "ALG",
>>>        .owner                  = THIS_MODULE,
>>> -       .memory_allocated       = &alg_memory_allocated,
>>> +       .memory_allocated       = memory_allocated_alg,
>>>        .obj_size               = sizeof(struct alg_sock),
>>>  };
>>>
>>> diff --git a/include/linux/cgroup_subsys.h
>>> b/include/linux/cgroup_subsys.h
>>> index ac663c1..363b8e8 100644
>>> --- a/include/linux/cgroup_subsys.h
>>> +++ b/include/linux/cgroup_subsys.h
>>> @@ -35,6 +35,10 @@ SUBSYS(cpuacct)
>>>  SUBSYS(mem_cgroup)
>>>  #endif
>>>
>>> +#ifdef CONFIG_CGROUP_KMEM
>>> +SUBSYS(kmem)
>>> +#endif
>>> +
>>>  /* */
>>>
>>>  #ifdef CONFIG_CGROUP_DEVICE
>>> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
>>> index d786b4f..bbd023a 100644
>>> --- a/include/net/netns/ipv4.h
>>> +++ b/include/net/netns/ipv4.h
>>> @@ -55,6 +55,7 @@ struct netns_ipv4 {
>>>        int current_rt_cache_rebuild_count;
>>>
>>>        unsigned int sysctl_ping_group_range[2];
>>> +       long sysctl_tcp_mem[3];
>>>
>>>        atomic_t rt_genid;
>>>        atomic_t dev_addr_genid;
>>> diff --git a/include/net/sock.h b/include/net/sock.h
>>> index 8e4062f..e085148 100644
>>> --- a/include/net/sock.h
>>> +++ b/include/net/sock.h
>>> @@ -62,7 +62,9 @@
>>>  #include <linux/atomic.h>
>>>  #include <net/dst.h>
>>>  #include <net/checksum.h>
>>> +#include <linux/kmem_cgroup.h>
>>>
>>> +int sockets_populate(struct cgroup_subsys *ss, struct cgroup *cgrp);
>>>  /*
>>>  * This structure really needs to be cleaned up.
>>>  * Most of it is for TCP, and not used by any of
>>> @@ -339,6 +341,7 @@ struct sock {
>>>  #endif
>>>        __u32                   sk_mark;
>>>        u32                     sk_classid;
>>> +       struct kmem_cgroup      *sk_cgrp;
>>>        void                    (*sk_state_change)(struct sock *sk);
>>>        void                    (*sk_data_ready)(struct sock *sk, int bytes);
>>>        void                    (*sk_write_space)(struct sock *sk);
>>> @@ -786,16 +789,21 @@ struct proto {
>>>
>>>        /* Memory pressure */
>>>        void                    (*enter_memory_pressure)(struct sock *sk);
>>> -       atomic_long_t           *memory_allocated;      /* Current allocated memory. */
>>> -       struct percpu_counter   *sockets_allocated;     /* Current number of sockets. */
>>> +       /* Current allocated memory. */
>>> +       atomic_long_t           *(*memory_allocated)(struct kmem_cgroup *sg);
>>> +       /* Current number of sockets. */
>>> +       struct percpu_counter   *(*sockets_allocated)(struct kmem_cgroup *sg);
>>> +
>>> +       int                     (*init_cgroup)(struct cgroup *cgrp,
>>> +                                              struct cgroup_subsys *ss);
>>>        /*
>>>         * Pressure flag: try to collapse.
>>>         * Technical note: it is used by multiple contexts non atomically.
>>>         * All the __sk_mem_schedule() is of this nature: accounting
>>>         * is strict, actions are advisory and have some latency.
>>>         */
>>> -       int                     *memory_pressure;
>>> -       long                    *sysctl_mem;
>>> +       int                     *(*memory_pressure)(struct kmem_cgroup *sg);
>>> +       long                    *(*prot_mem)(struct kmem_cgroup *sg);
>>>        int                     *sysctl_wmem;
>>>        int                     *sysctl_rmem;
>>>        int                     max_header;
>>> @@ -826,6 +834,56 @@ struct proto {
>>>  #endif
>>>  };
>>>
>>> +#define sk_memory_pressure(sk)                                         \
>>> +({                                                                     \
>>> +       int *__ret = NULL;                                              \
>>> +       if ((sk)->sk_prot->memory_pressure)                             \
>>> +               __ret = (sk)->sk_prot->memory_pressure(sk->sk_cgrp);    \
>>> +       __ret;                                                          \
>>> +})
>>> +
>>> +#define sk_sockets_allocated(sk)                               \
>>> +({                                                             \
>>> +       struct percpu_counter *__p;                             \
>>> +       __p = (sk)->sk_prot->sockets_allocated(sk->sk_cgrp);    \
>>> +       __p;                                                    \
>>> +})

Could this be simplified as (same applies to following few macros):

static inline struct percpu_counter *sk_sockets_allocated(struct sock *sk)
{
        return sk->sk_prot->sockets_allocated(sk->sk_cgrp);
}
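
For what it's worth, a compilable userspace sketch of the inline form.  The
struct definitions here are minimal mocks, and mock_sockets_allocated() and
the self-check are made up for illustration; only the accessor itself
mirrors the suggestion above.  Unlike the macro, the function type-checks
its argument and evaluates it exactly once.

```c
#include <assert.h>

/* Minimal mock types, only so the sketch compiles outside the kernel. */
struct percpu_counter { long count; };
struct kmem_cgroup { struct percpu_counter sockets; };
struct proto {
	struct percpu_counter *(*sockets_allocated)(struct kmem_cgroup *sg);
};
struct sock {
	struct proto *sk_prot;
	struct kmem_cgroup *sk_cgrp;
};

/* Hypothetical per-protocol callback: return the cgroup's own counter. */
static struct percpu_counter *mock_sockets_allocated(struct kmem_cgroup *sg)
{
	return &sg->sockets;
}

/* The accessor as a static inline instead of a statement-expression macro. */
static inline struct percpu_counter *sk_sockets_allocated(struct sock *sk)
{
	return sk->sk_prot->sockets_allocated(sk->sk_cgrp);
}

/* Tiny self-check: the accessor returns the counter of the socket's cgroup. */
static void sk_sockets_allocated_selfcheck(void)
{
	struct kmem_cgroup cg = { .sockets = { .count = 3 } };
	struct proto prot = { .sockets_allocated = mock_sockets_allocated };
	struct sock sk = { .sk_prot = &prot, .sk_cgrp = &cg };

	assert(sk_sockets_allocated(&sk) == &cg.sockets);
}
```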

>>> +#define sk_memory_allocated(sk)                                        \
>>> +({                                                             \
>>> +       atomic_long_t *__mem;                                   \
>>> +       __mem = (sk)->sk_prot->memory_allocated(sk->sk_cgrp);   \
>>> +       __mem;                                                  \
>>> +})
>>> +
>>> +#define sk_prot_mem(sk)                                                \
>>> +({                                                             \
>>> +       long *__mem = (sk)->sk_prot->prot_mem(sk->sk_cgrp);     \
>>> +       __mem;                                                  \
>>> +})
>>> +
>>> +#define sg_memory_pressure(prot, sg)                           \
>>> +({                                                             \
>>> +       int *__ret = NULL;                                      \
>>> +       if (prot->memory_pressure)                              \
>>> +               __ret = (prot)->memory_pressure(sg);            \
>>> +       __ret;                                                  \
>>> +})
>>> +
>>> +#define sg_memory_allocated(prot, sg)                          \
>>> +({                                                             \
>>> +       atomic_long_t *__mem;                                   \
>>> +       __mem = (prot)->memory_allocated(sg);                   \
>>> +       __mem;                                                  \
>>> +})
>>> +
>>> +#define sg_sockets_allocated(prot, sg)                         \
>>> +({                                                             \
>>> +       struct percpu_counter *__p;                             \
>>> +       __p = (prot)->sockets_allocated(sg);                    \
>>> +       __p;                                                    \
>>> +})
>>> +
>>>  extern int proto_register(struct proto *prot, int alloc_slab);
>>>  extern void proto_unregister(struct proto *prot);
>>>
>>> diff --git a/include/net/tcp.h b/include/net/tcp.h
>>> index 149a415..8e1ec4a 100644
>>> --- a/include/net/tcp.h
>>> +++ b/include/net/tcp.h
>>> @@ -230,7 +230,6 @@ extern int sysctl_tcp_fack;
>>>  extern int sysctl_tcp_reordering;
>>>  extern int sysctl_tcp_ecn;
>>>  extern int sysctl_tcp_dsack;
>>> -extern long sysctl_tcp_mem[3];
>>>  extern int sysctl_tcp_wmem[3];
>>>  extern int sysctl_tcp_rmem[3];
>>>  extern int sysctl_tcp_app_win;
>>> @@ -253,9 +252,12 @@ extern int sysctl_tcp_cookie_size;
>>>  extern int sysctl_tcp_thin_linear_timeouts;
>>>  extern int sysctl_tcp_thin_dupack;
>>>
>>> -extern atomic_long_t tcp_memory_allocated;
>>> -extern struct percpu_counter tcp_sockets_allocated;
>>> -extern int tcp_memory_pressure;
>>> +struct kmem_cgroup;
>>> +extern long *tcp_sysctl_mem(struct kmem_cgroup *sg);
>>> +struct percpu_counter *sockets_allocated_tcp(struct kmem_cgroup *sg);
>>> +int *memory_pressure_tcp(struct kmem_cgroup *sg);
>>> +int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss);
>>> +atomic_long_t *memory_allocated_tcp(struct kmem_cgroup *sg);
>>>
>>>  /*
>>>  * The next routines deal with comparing 32 bit unsigned ints
>>> @@ -286,7 +288,7 @@ static inline bool tcp_too_many_orphans(struct sock *sk, int shift)
>>>        }
>>>
>>>        if (sk->sk_wmem_queued > SOCK_MIN_SNDBUF &&
>>> -           atomic_long_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])
>>> +           atomic_long_read(sk_memory_allocated(sk)) > sk_prot_mem(sk)[2])
>>>                return true;
>>>        return false;
>>>  }
>>> diff --git a/include/net/udp.h b/include/net/udp.h
>>> index 67ea6fc..0e27388 100644
>>> --- a/include/net/udp.h
>>> +++ b/include/net/udp.h
>>> @@ -105,7 +105,8 @@ static inline struct udp_hslot *udp_hashslot2(struct udp_table *table,
>>>
>>>  extern struct proto udp_prot;
>>>
>>> -extern atomic_long_t udp_memory_allocated;
>>> +atomic_long_t *memory_allocated_udp(struct kmem_cgroup *sg);
>>> +long *udp_sysctl_mem(struct kmem_cgroup *sg);
>>>
>>>  /* sysctl variables for udp */
>>>  extern long sysctl_udp_mem[3];
>>> diff --git a/include/trace/events/sock.h b/include/trace/events/sock.h
>>> index 779abb9..12a6083 100644
>>> --- a/include/trace/events/sock.h
>>> +++ b/include/trace/events/sock.h
>>> @@ -37,7 +37,7 @@ TRACE_EVENT(sock_exceed_buf_limit,
>>>
>>>        TP_STRUCT__entry(
>>>                __array(char, name, 32)
>>> -               __field(long *, sysctl_mem)
>>> +               __field(long *, prot_mem)
>>>                __field(long, allocated)
>>>                __field(int, sysctl_rmem)
>>>                __field(int, rmem_alloc)
>>> @@ -45,7 +45,7 @@ TRACE_EVENT(sock_exceed_buf_limit,
>>>
>>>        TP_fast_assign(
>>>                strncpy(__entry->name, prot->name, 32);
>>> -               __entry->sysctl_mem = prot->sysctl_mem;
>>> +               __entry->prot_mem = sk->sk_prot->prot_mem(sk->sk_cgrp);
>>>                __entry->allocated = allocated;
>>>                __entry->sysctl_rmem = prot->sysctl_rmem[0];
>>>                __entry->rmem_alloc = atomic_read(&sk->sk_rmem_alloc);
>>> @@ -54,9 +54,9 @@ TRACE_EVENT(sock_exceed_buf_limit,
>>>        TP_printk("proto:%s sysctl_mem=%ld,%ld,%ld allocated=%ld "
>>>                "sysctl_rmem=%d rmem_alloc=%d",
>>>                __entry->name,
>>> -               __entry->sysctl_mem[0],
>>> -               __entry->sysctl_mem[1],
>>> -               __entry->sysctl_mem[2],
>>> +               __entry->prot_mem[0],
>>> +               __entry->prot_mem[1],
>>> +               __entry->prot_mem[2],
>>>                __entry->allocated,
>>>                __entry->sysctl_rmem,
>>>                __entry->rmem_alloc)
>>> diff --git a/init/Kconfig b/init/Kconfig
>>> index d627783..5955ac2 100644
>>> --- a/init/Kconfig
>>> +++ b/init/Kconfig
>>> @@ -690,6 +690,17 @@ config CGROUP_MEM_RES_CTLR_SWAP_ENABLED
>>>          select this option (if, for some reason, they need to disable it
>>>          then swapaccount=0 does the trick).
>>>
>>> +config CGROUP_KMEM
>>> +       bool "Kernel Memory Resource Controller for Control Groups"
>>> +       depends on CGROUPS
>>> +       help
>>> +         The Kernel Memory cgroup can limit the amount of memory used by
>>> +         certain kernel objects in the system. Those are fundamentally
>>> +         different from the entities handled by the Memory Controller,
>>> +         which are page-based, and can be swapped. Users of the kmem
>>> +         cgroup can use it to guarantee that no group of processes will
>>> +         ever exhaust kernel resources alone.
>>> +
>>>  config CGROUP_PERF
>>>        bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
>>>        depends on PERF_EVENTS && CGROUPS
>>> diff --git a/mm/Makefile b/mm/Makefile
>>> index 836e416..1b1aa24 100644
>>> --- a/mm/Makefile
>>> +++ b/mm/Makefile
>>> @@ -45,6 +45,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
>>>  obj-$(CONFIG_QUICKLIST) += quicklist.o
>>>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
>>>  obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
>>> +obj-$(CONFIG_CGROUP_KMEM) += kmem_cgroup.o
>>>  obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
>>>  obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
>>>  obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
>>> diff --git a/net/core/sock.c b/net/core/sock.c
>>> index 3449df8..2d968ea 100644
>>> --- a/net/core/sock.c
>>> +++ b/net/core/sock.c
>>> @@ -134,6 +134,24 @@
>>>  #include <net/tcp.h>
>>>  #endif
>>>
>>> +static DEFINE_RWLOCK(proto_list_lock);
>>> +static LIST_HEAD(proto_list);
>>> +
>>> +int sockets_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
>>> +{
>>> +       struct proto *proto;
>>> +       int ret = 0;
>>> +
>>> +       read_lock(&proto_list_lock);
>>> +       list_for_each_entry(proto, &proto_list, node) {
>>> +               if (proto->init_cgroup)
>>> +                       ret |= proto->init_cgroup(cgrp, ss);
>>> +       }
>>> +       read_unlock(&proto_list_lock);
>>> +
>>> +       return ret;
>>> +}
>>> +
>>>  /*
>>>  * Each address family might have different locking rules, so we have
>>>  * one slock key per address family:
>>> @@ -1114,6 +1132,31 @@ void sock_update_classid(struct sock *sk)
>>>  EXPORT_SYMBOL(sock_update_classid);
>>>  #endif
>>>
>>> +void sock_update_kmem_cgrp(struct sock *sk)
>>> +{
>>> +#ifdef CONFIG_CGROUP_KMEM
>>> +       sk->sk_cgrp = kcg_from_task(current);
>>> +
>>> +       /*
>>> +        * We don't need to protect against anything task-related, because
>>> +        * we are basically stuck with the sock pointer that won't change,
>>> +        * even if the task that originated the socket changes cgroups.
>>> +        *
>>> +        * What we do have to guarantee, is that the chain leading us to
>>> +        * the top level won't change under our noses. Incrementing the
>>> +        * reference count via cgroup_exclude_rmdir guarantees that.
>>> +        */
>>> +       cgroup_exclude_rmdir(&sk->sk_cgrp->css);
>>> +#endif
>>> +}
>>> +
>>> +void sock_release_kmem_cgrp(struct sock *sk)
>>> +{
>>> +#ifdef CONFIG_CGROUP_KMEM
>>> +       cgroup_release_and_wakeup_rmdir(&sk->sk_cgrp->css);
>>> +#endif
>>> +}
>>> +
>>>  /**
>>>  *     sk_alloc - All socket objects are allocated here
>>>  *     @net: the applicable net namespace
>>> @@ -1139,6 +1182,7 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
>>>                atomic_set(&sk->sk_wmem_alloc, 1);
>>>
>>>                sock_update_classid(sk);
>>> +               sock_update_kmem_cgrp(sk);
>>>        }
>>>
>>>        return sk;
>>> @@ -1170,6 +1214,7 @@ static void __sk_free(struct sock *sk)
>>>                put_cred(sk->sk_peer_cred);
>>>        put_pid(sk->sk_peer_pid);
>>>        put_net(sock_net(sk));
>>> +       sock_release_kmem_cgrp(sk);
>>>        sk_prot_free(sk->sk_prot_creator, sk);
>>>  }
>>>
>>> @@ -1287,8 +1332,8 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
>>>                sk_set_socket(newsk, NULL);
>>>                newsk->sk_wq = NULL;
>>>
>>> -               if (newsk->sk_prot->sockets_allocated)
>>> -                       percpu_counter_inc(newsk->sk_prot->sockets_allocated);
>>> +               if (sk_sockets_allocated(sk))
>>> +                       percpu_counter_inc(sk_sockets_allocated(sk));
>>>
>>>                if (sock_flag(newsk, SOCK_TIMESTAMP) ||
>>>                    sock_flag(newsk, SOCK_TIMESTAMPING_RX_SOFTWARE))
>>> @@ -1676,29 +1721,51 @@ EXPORT_SYMBOL(sk_wait_data);
>>>  */
>>>  int __sk_mem_schedule(struct sock *sk, int size, int kind)
>>>  {
>>> -       struct proto *prot = sk->sk_prot;
>>>        int amt = sk_mem_pages(size);
>>> +       struct proto *prot = sk->sk_prot;
>>>        long allocated;
>>> +       int *memory_pressure;
>>> +       long *prot_mem;
>>> +       int parent_failure = 0;
>>> +       struct kmem_cgroup *sg;
>>>
>>>        sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
>>> -       allocated = atomic_long_add_return(amt, prot->memory_allocated);
>>> +
>>> +       memory_pressure = sk_memory_pressure(sk);
>>> +       prot_mem = sk_prot_mem(sk);
>>> +
>>> +       allocated = atomic_long_add_return(amt, sk_memory_allocated(sk));
>>> +
>>> +#ifdef CONFIG_CGROUP_KMEM
>>> +       for (sg = sk->sk_cgrp->parent; sg != NULL; sg = sg->parent) {
>>> +               long alloc;
>>> +               /*
>>> +                * Large nestings are not the common case, and stopping in the
>>> +                * middle would be complicated enough, that we bill it all the
>>> +                * way through the root, and if needed, unbill everything later
>>> +                */
>>> +               alloc = atomic_long_add_return(amt,
>>> +                                              sg_memory_allocated(prot, sg));
>>> +               parent_failure |= (alloc > sk_prot_mem(sk)[2]);
>>> +       }
>>> +#endif
>>> +
>>> +       /* Over hard limit (we, or our parents) */
>>> +       if (parent_failure || (allocated > prot_mem[2]))
>>> +               goto suppress_allocation;
>>>
>>>        /* Under limit. */
>>> -       if (allocated <= prot->sysctl_mem[0]) {
>>> -               if (prot->memory_pressure && *prot->memory_pressure)
>>> -                       *prot->memory_pressure = 0;
>>> +       if (allocated <= prot_mem[0]) {
>>> +               if (memory_pressure && *memory_pressure)
>>> +                       *memory_pressure = 0;
>>>                return 1;
>>>        }
>>>
>>>        /* Under pressure. */
>>> -       if (allocated > prot->sysctl_mem[1])
>>> +       if (allocated > prot_mem[1])
>>>                if (prot->enter_memory_pressure)
>>>                        prot->enter_memory_pressure(sk);
>>>
>>> -       /* Over hard limit. */
>>> -       if (allocated > prot->sysctl_mem[2])
>>> -               goto suppress_allocation;
>>> -
>>>        /* guarantee minimum buffer size under pressure */
>>>        if (kind == SK_MEM_RECV) {
>>>                if (atomic_read(&sk->sk_rmem_alloc) < prot->sysctl_rmem[0])
>>> @@ -1712,13 +1779,13 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
>>>                                return 1;
>>>        }
>>>
>>> -       if (prot->memory_pressure) {
>>> +       if (memory_pressure) {
>>>                int alloc;
>>>
>>> -               if (!*prot->memory_pressure)
>>> +               if (!*memory_pressure)
>>>                        return 1;
>>> -               alloc = percpu_counter_read_positive(prot->sockets_allocated);
>>> -               if (prot->sysctl_mem[2] > alloc *
>>> +               alloc = percpu_counter_read_positive(sk_sockets_allocated(sk));
>>> +               if (prot_mem[2] > alloc *
>>>                    sk_mem_pages(sk->sk_wmem_queued +
>>>                                 atomic_read(&sk->sk_rmem_alloc) +
>>>                                 sk->sk_forward_alloc))
>>> @@ -1741,7 +1808,13 @@ suppress_allocation:
>>>
>>>        /* Alas. Undo changes. */
>>>        sk->sk_forward_alloc -= amt * SK_MEM_QUANTUM;
>>> -       atomic_long_sub(amt, prot->memory_allocated);
>>> +
>>> +       atomic_long_sub(amt, sk_memory_allocated(sk));
>>> +
>>> +#ifdef CONFIG_CGROUP_KMEM
>>> +       for (sg = sk->sk_cgrp->parent; sg != NULL; sg = sg->parent)
>>> +               atomic_long_sub(amt, sg_memory_allocated(prot, sg));
>>> +#endif
>>>        return 0;
>>>  }
>>>  EXPORT_SYMBOL(__sk_mem_schedule);
>>> @@ -1753,14 +1826,24 @@ EXPORT_SYMBOL(__sk_mem_schedule);
>>>  void __sk_mem_reclaim(struct sock *sk)
>>>  {
>>>        struct proto *prot = sk->sk_prot;
>>> +       struct kmem_cgroup *sg = sk->sk_cgrp;
>>> +       int *memory_pressure = sk_memory_pressure(sk);
>>>
>>>        atomic_long_sub(sk->sk_forward_alloc >> SK_MEM_QUANTUM_SHIFT,
>>> -                  prot->memory_allocated);
>>> +                  sk_memory_allocated(sk));
>>> +
>>> +#ifdef CONFIG_CGROUP_KMEM
>>> +       for (sg = sk->sk_cgrp->parent; sg != NULL; sg = sg->parent) {
>>> +               atomic_long_sub(sk->sk_forward_alloc >> SK_MEM_QUANTUM_SHIFT,
>>> +                                               sg_memory_allocated(prot, sg));
>>> +       }
>>> +#endif
>>> +
>>>        sk->sk_forward_alloc &= SK_MEM_QUANTUM - 1;
>>>
>>> -       if (prot->memory_pressure && *prot->memory_pressure &&
>>> -           (atomic_long_read(prot->memory_allocated) < prot->sysctl_mem[0]))
>>> -               *prot->memory_pressure = 0;
>>> +       if (memory_pressure && *memory_pressure &&
>>> +           (atomic_long_read(sk_memory_allocated(sk)) < sk_prot_mem(sk)[0]))
>>> +               *memory_pressure = 0;
>>>  }
>>>  EXPORT_SYMBOL(__sk_mem_reclaim);
>>>
>>> @@ -2252,9 +2335,6 @@ void sk_common_release(struct sock *sk)
>>>  }
>>>  EXPORT_SYMBOL(sk_common_release);
>>>
>>> -static DEFINE_RWLOCK(proto_list_lock);
>>> -static LIST_HEAD(proto_list);
>>> -
>>>  #ifdef CONFIG_PROC_FS
>>>  #define PROTO_INUSE_NR 64      /* should be enough for the first time */
>>>  struct prot_inuse {
>>> @@ -2479,13 +2559,17 @@ static char proto_method_implemented(const void *method)
>>>
>>>  static void proto_seq_printf(struct seq_file *seq, struct proto *proto)
>>>  {
>>> +       struct kmem_cgroup *sg = kcg_from_task(current);
>>> +
>>>        seq_printf(seq, "%-9s %4u %6d  %6ld   %-3s %6u   %-3s  %-10s "
>>>                        "%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
>>>                   proto->name,
>>>                   proto->obj_size,
>>>                   sock_prot_inuse_get(seq_file_net(seq), proto),
>>> -                  proto->memory_allocated != NULL ? atomic_long_read(proto->memory_allocated) : -1L,
>>> -                  proto->memory_pressure != NULL ? *proto->memory_pressure ? "yes" : "no" : "NI",
>>> +                  proto->memory_allocated != NULL ?
>>> +                       atomic_long_read(sg_memory_allocated(proto, sg)) : -1L,
>>> +                  proto->memory_pressure != NULL ?
>>> +                       *sg_memory_pressure(proto, sg) ? "yes" : "no" : "NI",
>>>                   proto->max_header,
>>>                   proto->slab == NULL ? "no" : "yes",
>>>                   module_name(proto->owner),
>>> diff --git a/net/decnet/af_decnet.c b/net/decnet/af_decnet.c
>>> index 19acd00..463b299 100644
>>> --- a/net/decnet/af_decnet.c
>>> +++ b/net/decnet/af_decnet.c
>>> @@ -458,13 +458,28 @@ static void dn_enter_memory_pressure(struct sock *sk)
>>>        }
>>>  }
>>>
>>> +static atomic_long_t *memory_allocated_dn(struct kmem_cgroup *sg)
>>> +{
>>> +       return &decnet_memory_allocated;
>>> +}
>>> +
>>> +static int *memory_pressure_dn(struct kmem_cgroup *sg)
>>> +{
>>> +       return &dn_memory_pressure;
>>> +}
>>> +
>>> +static long *dn_sysctl_mem(struct kmem_cgroup *sg)
>>> +{
>>> +       return sysctl_decnet_mem;
>>> +}
>>> +
>>>  static struct proto dn_proto = {
>>>        .name                   = "NSP",
>>>        .owner                  = THIS_MODULE,
>>>        .enter_memory_pressure  = dn_enter_memory_pressure,
>>> -       .memory_pressure        = &dn_memory_pressure,
>>> -       .memory_allocated       = &decnet_memory_allocated,
>>> -       .sysctl_mem             = sysctl_decnet_mem,
>>> +       .memory_pressure        = memory_pressure_dn,
>>> +       .memory_allocated       = memory_allocated_dn,
>>> +       .prot_mem               = dn_sysctl_mem,
>>>        .sysctl_wmem            = sysctl_decnet_wmem,
>>>        .sysctl_rmem            = sysctl_decnet_rmem,
>>>        .max_header             = DN_MAX_NSP_DATA_HEADER + 64,
>>> diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
>>> index b14ec7d..9c80acf 100644
>>> --- a/net/ipv4/proc.c
>>> +++ b/net/ipv4/proc.c
>>> @@ -52,20 +52,22 @@ static int sockstat_seq_show(struct seq_file *seq, void *v)
>>>  {
>>>        struct net *net = seq->private;
>>>        int orphans, sockets;
>>> +       struct kmem_cgroup *sg = kcg_from_task(current);
>>> +       struct percpu_counter *allocated = sg_sockets_allocated(&tcp_prot, sg);
>>>
>>>        local_bh_disable();
>>>        orphans = percpu_counter_sum_positive(&tcp_orphan_count);
>>> -       sockets = percpu_counter_sum_positive(&tcp_sockets_allocated);
>>> +       sockets = percpu_counter_sum_positive(allocated);
>>>        local_bh_enable();
>>>
>>>        socket_seq_show(seq);
>>>        seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %ld\n",
>>>                   sock_prot_inuse_get(net, &tcp_prot), orphans,
>>>                   tcp_death_row.tw_count, sockets,
>>> -                  atomic_long_read(&tcp_memory_allocated));
>>> +                  atomic_long_read(sg_memory_allocated((&tcp_prot), sg)));
>>>        seq_printf(seq, "UDP: inuse %d mem %ld\n",
>>>                   sock_prot_inuse_get(net, &udp_prot),
>>> -                  atomic_long_read(&udp_memory_allocated));
>>> +                  atomic_long_read(sg_memory_allocated((&udp_prot), sg)));
>>>        seq_printf(seq, "UDPLITE: inuse %d\n",
>>>                   sock_prot_inuse_get(net, &udplite_prot));
>>>        seq_printf(seq, "RAW: inuse %d\n",
>>> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
>>> index 69fd720..5e89480 100644
>>> --- a/net/ipv4/sysctl_net_ipv4.c
>>> +++ b/net/ipv4/sysctl_net_ipv4.c
>>> @@ -14,6 +14,8 @@
>>>  #include <linux/init.h>
>>>  #include <linux/slab.h>
>>>  #include <linux/nsproxy.h>
>>> +#include <linux/kmem_cgroup.h>
>>> +#include <linux/swap.h>
>>>  #include <net/snmp.h>
>>>  #include <net/icmp.h>
>>>  #include <net/ip.h>
>>> @@ -174,6 +176,43 @@ static int proc_allowed_congestion_control(ctl_table
>>> *ctl,
>>>        return ret;
>>>  }
>>>
>>> +static int ipv4_tcp_mem(ctl_table *ctl, int write,
>>> +                          void __user *buffer, size_t *lenp,
>>> +                          loff_t *ppos)
>>> +{
>>> +       int ret;
>>> +       unsigned long vec[3];
>>> +       struct kmem_cgroup *kmem = kcg_from_task(current);
>>> +       struct net *net = current->nsproxy->net_ns;
>>> +       int i;
>>> +
>>> +       ctl_table tmp = {
>>> +               .data =&vec,
>>> +               .maxlen = sizeof(vec),
>>> +               .mode = ctl->mode,
>>> +       };
>>> +
>>> +       if (!write) {
>>> +               ctl->data =&net->ipv4.sysctl_tcp_mem;
>>> +               return proc_doulongvec_minmax(ctl, write, buffer, lenp,
>>> ppos);
>>> +       }
>>> +
>>> +       ret = proc_doulongvec_minmax(&tmp, write, buffer, lenp, ppos);
>>> +       if (ret)
>>> +               return ret;
>>> +
>>> +       for (i = 0; i<  3; i++)
>>> +               if (vec[i]>  kmem->tcp_max_memory)
>>> +                       return -EINVAL;
>>> +
>>> +       for (i = 0; i<  3; i++) {
>>> +               net->ipv4.sysctl_tcp_mem[i] = vec[i];
>>> +               kmem->tcp_prot_mem[i] = net->ipv4.sysctl_tcp_mem[i];
>>> +       }
>>> +
>>> +       return 0;
>>> +}
>>> +
>>>  static struct ctl_table ipv4_table[] = {
>>>        {
>>>                .procname       = "tcp_timestamps",
>>> @@ -433,13 +472,6 @@ static struct ctl_table ipv4_table[] = {
>>>                .proc_handler   = proc_dointvec
>>>        },
>>>        {
>>> -               .procname       = "tcp_mem",
>>> -               .data           =&sysctl_tcp_mem,
>>> -               .maxlen         = sizeof(sysctl_tcp_mem),
>>> -               .mode           = 0644,
>>> -               .proc_handler   = proc_doulongvec_minmax
>>> -       },
>>> -       {
>>>                .procname       = "tcp_wmem",
>>>                .data           =&sysctl_tcp_wmem,
>>>                .maxlen         = sizeof(sysctl_tcp_wmem),
>>> @@ -721,6 +753,12 @@ static struct ctl_table ipv4_net_table[] = {
>>>                .mode           = 0644,
>>>                .proc_handler   = ipv4_ping_group_range,
>>>        },
>>> +       {
>>> +               .procname       = "tcp_mem",
>>> +               .maxlen         = sizeof(init_net.ipv4.sysctl_tcp_mem),
>>> +               .mode           = 0644,
>>> +               .proc_handler   = ipv4_tcp_mem,
>>> +       },
>>>        { }
>>>  };
>>>
>>> @@ -734,6 +772,7 @@ EXPORT_SYMBOL_GPL(net_ipv4_ctl_path);
>>>  static __net_init int ipv4_sysctl_init_net(struct net *net)
>>>  {
>>>        struct ctl_table *table;
>>> +       unsigned long limit;
>>>
>>>        table = ipv4_net_table;
>>>        if (!net_eq(net,&init_net)) {
>>> @@ -769,6 +808,12 @@ static __net_init int ipv4_sysctl_init_net(struct
>>> net *net)
>>>
>>>        net->ipv4.sysctl_rt_cache_rebuild_count = 4;
>>>
>>> +       limit = nr_free_buffer_pages() / 8;
>>> +       limit = max(limit, 128UL);
>>> +       net->ipv4.sysctl_tcp_mem[0] = limit / 4 * 3;
>>> +       net->ipv4.sysctl_tcp_mem[1] = limit;
>>> +       net->ipv4.sysctl_tcp_mem[2] = net->ipv4.sysctl_tcp_mem[0] * 2;
>>> +
>>>        net->ipv4.ipv4_hdr = register_net_sysctl_table(net,
>>>                        net_ipv4_ctl_path, table);
>>>        if (net->ipv4.ipv4_hdr == NULL)
>>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>>> index 46febca..e1918fa 100644
>>> --- a/net/ipv4/tcp.c
>>> +++ b/net/ipv4/tcp.c
>>> @@ -266,6 +266,7 @@
>>>  #include<linux/crypto.h>
>>>  #include<linux/time.h>
>>>  #include<linux/slab.h>
>>> +#include<linux/nsproxy.h>
>>>
>>>  #include<net/icmp.h>
>>>  #include<net/tcp.h>
>>> @@ -282,23 +283,12 @@ int sysctl_tcp_fin_timeout __read_mostly =
>>> TCP_FIN_TIMEOUT;
>>>  struct percpu_counter tcp_orphan_count;
>>>  EXPORT_SYMBOL_GPL(tcp_orphan_count);
>>>
>>> -long sysctl_tcp_mem[3] __read_mostly;
>>>  int sysctl_tcp_wmem[3] __read_mostly;
>>>  int sysctl_tcp_rmem[3] __read_mostly;
>>>
>>> -EXPORT_SYMBOL(sysctl_tcp_mem);
>>>  EXPORT_SYMBOL(sysctl_tcp_rmem);
>>>  EXPORT_SYMBOL(sysctl_tcp_wmem);
>>>
>>> -atomic_long_t tcp_memory_allocated;    /* Current allocated memory. */
>>> -EXPORT_SYMBOL(tcp_memory_allocated);
>>> -
>>> -/*
>>> - * Current number of TCP sockets.
>>> - */
>>> -struct percpu_counter tcp_sockets_allocated;
>>> -EXPORT_SYMBOL(tcp_sockets_allocated);
>>> -
>>>  /*
>>>  * TCP splice context
>>>  */
>>> @@ -308,23 +298,157 @@ struct tcp_splice_state {
>>>        unsigned int flags;
>>>  };
>>>
>>> +#ifdef CONFIG_CGROUP_KMEM
>>>  /*
>>>  * Pressure flag: try to collapse.
>>>  * Technical note: it is used by multiple contexts non atomically.
>>>  * All the __sk_mem_schedule() is of this nature: accounting
>>>  * is strict, actions are advisory and have some latency.
>>>  */
>>> -int tcp_memory_pressure __read_mostly;
>>> -EXPORT_SYMBOL(tcp_memory_pressure);
>>> -
>>>  void tcp_enter_memory_pressure(struct sock *sk)
>>>  {
>>> +       struct kmem_cgroup *sg = sk->sk_cgrp;
>>> +       if (!sg->tcp_memory_pressure) {
>>> +               NET_INC_STATS(sock_net(sk),
>>> LINUX_MIB_TCPMEMORYPRESSURES);
>>> +               sg->tcp_memory_pressure = 1;
>>> +       }
>>> +}
>>> +
>>> +long *tcp_sysctl_mem(struct kmem_cgroup *sg)
>>> +{
>>> +       return sg->tcp_prot_mem;
>>> +}
>>> +
>>> +atomic_long_t *memory_allocated_tcp(struct kmem_cgroup *sg)
>>> +{
>>> +       return&(sg->tcp_memory_allocated);
>>> +}
>>> +
>>> +static int tcp_write_maxmem(struct cgroup *cgrp, struct cftype *cft, u64
>>> val)
>>> +{
>>> +       struct kmem_cgroup *sg = kcg_from_cgroup(cgrp);
>>> +       struct net *net = current->nsproxy->net_ns;
>>> +       int i;
>>> +
>>> +       if (!cgroup_lock_live_group(cgrp))
>>> +               return -ENODEV;
>>> +
>>> +       /*
>>> +        * We can't allow more memory than our parents. Since this
>>> +        * will be tested for all calls, by induction, there is no need
>>> +        * to test any parent other than our own
>>> +        * */
>>> +       if (sg->parent&&  (val>  sg->parent->tcp_max_memory))
>>> +               val = sg->parent->tcp_max_memory;
>>> +
>>> +       sg->tcp_max_memory = val;
>>> +
>>> +       for (i = 0; i<  3; i++)
>>> +               sg->tcp_prot_mem[i]  = min_t(long, val,
>>> +
>>>  net->ipv4.sysctl_tcp_mem[i]);
>>> +
>>> +       cgroup_unlock();
>>> +
>>> +       return 0;
>>> +}
>>> +
>>> +static u64 tcp_read_maxmem(struct cgroup *cgrp, struct cftype *cft)
>>> +{
>>> +       struct kmem_cgroup *sg = kcg_from_cgroup(cgrp);
>>> +       u64 ret;
>>> +
>>> +       if (!cgroup_lock_live_group(cgrp))
>>> +               return -ENODEV;
>>> +       ret = sg->tcp_max_memory;
>>> +
>>> +       cgroup_unlock();
>>> +       return ret;
>>> +}
>>> +
>>> +static struct cftype tcp_files[] = {
>>> +       {
>>> +               .name = "tcp_maxmem",
>>> +               .write_u64 = tcp_write_maxmem,
>>> +               .read_u64 = tcp_read_maxmem,
>>> +       },
>>> +};
>>> +
>>> +int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss)
>>> +{
>>> +       struct kmem_cgroup *sg = kcg_from_cgroup(cgrp);
>>> +       unsigned long limit;
>>> +       struct net *net = current->nsproxy->net_ns;
>>> +
>>> +       sg->tcp_memory_pressure = 0;
>>> +       atomic_long_set(&sg->tcp_memory_allocated, 0);
>>> +       percpu_counter_init(&sg->tcp_sockets_allocated, 0);
>>> +
>>> +       limit = nr_free_buffer_pages() / 8;
>>> +       limit = max(limit, 128UL);
>>> +
>>> +       if (sg->parent)
>>> +               sg->tcp_max_memory = sg->parent->tcp_max_memory;
>>> +       else
>>> +               sg->tcp_max_memory = limit * 2;
>>> +
>>> +       sg->tcp_prot_mem[0] = net->ipv4.sysctl_tcp_mem[0];
>>> +       sg->tcp_prot_mem[1] = net->ipv4.sysctl_tcp_mem[1];
>>> +       sg->tcp_prot_mem[2] = net->ipv4.sysctl_tcp_mem[2];
>>> +
>>> +       return cgroup_add_files(cgrp, ss, tcp_files,
>>> ARRAY_SIZE(tcp_files));
>>> +}
>>> +EXPORT_SYMBOL(tcp_init_cgroup);
>>> +
>>> +int *memory_pressure_tcp(struct kmem_cgroup *sg)
>>> +{
>>> +       return&sg->tcp_memory_pressure;
>>> +}
>>> +
>>> +struct percpu_counter *sockets_allocated_tcp(struct kmem_cgroup *sg)
>>> +{
>>> +       return&sg->tcp_sockets_allocated;
>>> +}
>>> +#else
>>> +
>>> +/* Current number of TCP sockets. */
>>> +struct percpu_counter tcp_sockets_allocated;
>>> +atomic_long_t tcp_memory_allocated;    /* Current allocated memory. */
>>> +int tcp_memory_pressure;
>>> +
>>> +int *memory_pressure_tcp(struct kmem_cgroup *sg)
>>> +{
>>> +       return&tcp_memory_pressure;
>>> +}
>>> +
>>> +struct percpu_counter *sockets_allocated_tcp(struct kmem_cgroup *sg)
>>> +{
>>> +       return&tcp_sockets_allocated;
>>> +}
>>> +
>>> +void tcp_enter_memory_pressure(struct sock *sock)
>>> +{
>>>        if (!tcp_memory_pressure) {
>>>                NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPMEMORYPRESSURES);
>>>                tcp_memory_pressure = 1;
>>>        }
>>>  }
>>> +
>>> +long *tcp_sysctl_mem(struct kmem_cgroup *sg)
>>> +{
>>> +       return init_net.ipv4.sysctl_tcp_mem;
>>> +}
>>> +
>>> +atomic_long_t *memory_allocated_tcp(struct kmem_cgroup *sg)
>>> +{
>>> +       return&tcp_memory_allocated;
>>> +}
>>> +#endif /* CONFIG_CGROUP_KMEM */
>>> +
>>> +EXPORT_SYMBOL(memory_pressure_tcp);
>>> +EXPORT_SYMBOL(sockets_allocated_tcp);
>>>  EXPORT_SYMBOL(tcp_enter_memory_pressure);
>>> +EXPORT_SYMBOL(tcp_sysctl_mem);
>>> +EXPORT_SYMBOL(memory_allocated_tcp);
>>>
>>>  /* Convert seconds to retransmits based on initial and max timeout */
>>>  static u8 secs_to_retrans(int seconds, int timeout, int rto_max)
>>> @@ -3226,7 +3350,9 @@ void __init tcp_init(void)
>>>
>>>        BUILD_BUG_ON(sizeof(struct tcp_skb_cb)>  sizeof(skb->cb));
>>>
>>> +#ifndef CONFIG_CGROUP_KMEM
>>>        percpu_counter_init(&tcp_sockets_allocated, 0);
>>> +#endif
>>>        percpu_counter_init(&tcp_orphan_count, 0);
>>>        tcp_hashinfo.bind_bucket_cachep =
>>>                kmem_cache_create("tcp_bind_bucket",
>>> @@ -3277,14 +3403,10 @@ void __init tcp_init(void)
>>>        sysctl_tcp_max_orphans = cnt / 2;
>>>        sysctl_max_syn_backlog = max(128, cnt / 256);
>>>
>>> -       limit = nr_free_buffer_pages() / 8;
>>> -       limit = max(limit, 128UL);
>>> -       sysctl_tcp_mem[0] = limit / 4 * 3;
>>> -       sysctl_tcp_mem[1] = limit;
>>> -       sysctl_tcp_mem[2] = sysctl_tcp_mem[0] * 2;
>>> -
>>>        /* Set per-socket limits to no more than 1/128 the pressure
>>> threshold */
>>> -       limit = ((unsigned long)sysctl_tcp_mem[1])<<  (PAGE_SHIFT - 7);
>>> +       limit = (unsigned long)init_net.ipv4.sysctl_tcp_mem[1];
>>> +       limit<<= (PAGE_SHIFT - 7);
>>> +
>>>        max_share = min(4UL*1024*1024, limit);
>>>
>>>        sysctl_tcp_wmem[0] = SK_MEM_QUANTUM;
>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>>> index ea0d218..c44e830 100644
>>> --- a/net/ipv4/tcp_input.c
>>> +++ b/net/ipv4/tcp_input.c
>>> @@ -316,7 +316,7 @@ static void tcp_grow_window(struct sock *sk, struct
>>> sk_buff *skb)
>>>        /* Check #1 */
>>>        if (tp->rcv_ssthresh<  tp->window_clamp&&
>>>            (int)tp->rcv_ssthresh<  tcp_space(sk)&&
>>> -           !tcp_memory_pressure) {
>>> +           !sk_memory_pressure(sk)) {
>>>                int incr;
>>>
>>>                /* Check #2. Increase window, if skb with such overhead
>>> @@ -393,15 +393,16 @@ static void tcp_clamp_window(struct sock *sk)
>>>  {
>>>        struct tcp_sock *tp = tcp_sk(sk);
>>>        struct inet_connection_sock *icsk = inet_csk(sk);
>>> +       struct proto *prot = sk->sk_prot;
>>>
>>>        icsk->icsk_ack.quick = 0;
>>>
>>> -       if (sk->sk_rcvbuf<  sysctl_tcp_rmem[2]&&
>>> +       if (sk->sk_rcvbuf<  prot->sysctl_rmem[2]&&
>>>            !(sk->sk_userlocks&  SOCK_RCVBUF_LOCK)&&
>>> -           !tcp_memory_pressure&&
>>> -           atomic_long_read(&tcp_memory_allocated)<  sysctl_tcp_mem[0])
>>> {
>>> +           !sk_memory_pressure(sk)&&
>>> +           atomic_long_read(sk_memory_allocated(sk))<
>>>  sk_prot_mem(sk)[0]) {
>>>                sk->sk_rcvbuf = min(atomic_read(&sk->sk_rmem_alloc),
>>> -                                   sysctl_tcp_rmem[2]);
>>> +                                   prot->sysctl_rmem[2]);
>>>        }
>>>        if (atomic_read(&sk->sk_rmem_alloc)>  sk->sk_rcvbuf)
>>>                tp->rcv_ssthresh = min(tp->window_clamp, 2U * tp->advmss);
>>> @@ -4806,7 +4807,7 @@ static int tcp_prune_queue(struct sock *sk)
>>>
>>>        if (atomic_read(&sk->sk_rmem_alloc)>= sk->sk_rcvbuf)
>>>                tcp_clamp_window(sk);
>>> -       else if (tcp_memory_pressure)
>>> +       else if (sk_memory_pressure(sk))
>>>                tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss);
>>>
>>>        tcp_collapse_ofo_queue(sk);
>>> @@ -4872,11 +4873,11 @@ static int tcp_should_expand_sndbuf(struct sock
>>> *sk)
>>>                return 0;
>>>
>>>        /* If we are under global TCP memory pressure, do not expand.  */
>>> -       if (tcp_memory_pressure)
>>> +       if (sk_memory_pressure(sk))
>>>                return 0;
>>>
>>>        /* If we are under soft global TCP memory pressure, do not expand.
>>>  */
>>> -       if (atomic_long_read(&tcp_memory_allocated)>= sysctl_tcp_mem[0])
>>> +       if (atomic_long_read(sk_memory_allocated(sk))>=
>>> sk_prot_mem(sk)[0])
>>>                return 0;
>>>
>>>        /* If we filled the congestion window, do not expand.  */
>>> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
>>> index 1c12b8e..af6c095 100644
>>> --- a/net/ipv4/tcp_ipv4.c
>>> +++ b/net/ipv4/tcp_ipv4.c
>>> @@ -1848,6 +1848,7 @@ static int tcp_v4_init_sock(struct sock *sk)
>>>  {
>>>        struct inet_connection_sock *icsk = inet_csk(sk);
>>>        struct tcp_sock *tp = tcp_sk(sk);
>>> +       struct kmem_cgroup *sg;
>>>
>>>        skb_queue_head_init(&tp->out_of_order_queue);
>>>        tcp_init_xmit_timers(sk);
>>> @@ -1901,7 +1902,13 @@ static int tcp_v4_init_sock(struct sock *sk)
>>>        sk->sk_rcvbuf = sysctl_tcp_rmem[1];
>>>
>>>        local_bh_disable();
>>> -       percpu_counter_inc(&tcp_sockets_allocated);
>>> +       percpu_counter_inc(sk_sockets_allocated(sk));
>>> +
>>> +#ifdef CONFIG_CGROUP_KMEM
>>> +       for (sg = sk->sk_cgrp->parent; sg; sg = sg->parent)
>>> +               percpu_counter_inc(sg_sockets_allocated(sk->sk_prot,
>>> sg));
>>> +#endif
>>> +
>>>        local_bh_enable();
>>>
>>>        return 0;
>>> @@ -1910,6 +1917,7 @@ static int tcp_v4_init_sock(struct sock *sk)
>>>  void tcp_v4_destroy_sock(struct sock *sk)
>>>  {
>>>        struct tcp_sock *tp = tcp_sk(sk);
>>> +       struct kmem_cgroup *sg;
>>>
>>>        tcp_clear_xmit_timers(sk);
>>>
>>> @@ -1957,7 +1965,11 @@ void tcp_v4_destroy_sock(struct sock *sk)
>>>                tp->cookie_values = NULL;
>>>        }
>>>
>>> -       percpu_counter_dec(&tcp_sockets_allocated);
>>> +       percpu_counter_dec(sk_sockets_allocated(sk));
>>> +#ifdef CONFIG_CGROUP_KMEM
>>> +       for (sg = sk->sk_cgrp->parent; sg; sg = sg->parent)
>>> +               percpu_counter_dec(sg_sockets_allocated(sk->sk_prot,
>>> sg));
>>> +#endif
>>>  }
>>>  EXPORT_SYMBOL(tcp_v4_destroy_sock);
>>>
>>> @@ -2598,11 +2610,14 @@ struct proto tcp_prot = {
>>>        .unhash                 = inet_unhash,
>>>        .get_port               = inet_csk_get_port,
>>>        .enter_memory_pressure  = tcp_enter_memory_pressure,
>>> -       .sockets_allocated      =&tcp_sockets_allocated,
>>> +       .memory_pressure        = memory_pressure_tcp,
>>> +       .sockets_allocated      = sockets_allocated_tcp,
>>>        .orphan_count           =&tcp_orphan_count,
>>> -       .memory_allocated       =&tcp_memory_allocated,
>>> -       .memory_pressure        =&tcp_memory_pressure,
>>> -       .sysctl_mem             = sysctl_tcp_mem,
>>> +       .memory_allocated       = memory_allocated_tcp,
>>> +#ifdef CONFIG_CGROUP_KMEM
>>> +       .init_cgroup            = tcp_init_cgroup,
>>> +#endif
>>> +       .prot_mem               = tcp_sysctl_mem,
>>>        .sysctl_wmem            = sysctl_tcp_wmem,
>>>        .sysctl_rmem            = sysctl_tcp_rmem,
>>>        .max_header             = MAX_TCP_HEADER,
>>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>>> index 882e0b0..06aeb31 100644
>>> --- a/net/ipv4/tcp_output.c
>>> +++ b/net/ipv4/tcp_output.c
>>> @@ -1912,7 +1912,7 @@ u32 __tcp_select_window(struct sock *sk)
>>>        if (free_space<  (full_space>>  1)) {
>>>                icsk->icsk_ack.quick = 0;
>>>
>>> -               if (tcp_memory_pressure)
>>> +               if (sk_memory_pressure(sk))
>>>                        tp->rcv_ssthresh = min(tp->rcv_ssthresh,
>>>                                               4U * tp->advmss);
>>>
>>> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
>>> index ecd44b0..2c67617 100644
>>> --- a/net/ipv4/tcp_timer.c
>>> +++ b/net/ipv4/tcp_timer.c
>>> @@ -261,7 +261,7 @@ static void tcp_delack_timer(unsigned long data)
>>>        }
>>>
>>>  out:
>>> -       if (tcp_memory_pressure)
>>> +       if (sk_memory_pressure(sk))
>>>                sk_mem_reclaim(sk);
>>>  out_unlock:
>>>        bh_unlock_sock(sk);
>>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>>> index 1b5a193..6c08c65 100644
>>> --- a/net/ipv4/udp.c
>>> +++ b/net/ipv4/udp.c
>>> @@ -120,9 +120,6 @@ EXPORT_SYMBOL(sysctl_udp_rmem_min);
>>>  int sysctl_udp_wmem_min __read_mostly;
>>>  EXPORT_SYMBOL(sysctl_udp_wmem_min);
>>>
>>> -atomic_long_t udp_memory_allocated;
>>> -EXPORT_SYMBOL(udp_memory_allocated);
>>> -
>>>  #define MAX_UDP_PORTS 65536
>>>  #define PORTS_PER_CHAIN (MAX_UDP_PORTS / UDP_HTABLE_SIZE_MIN)
>>>
>>> @@ -1918,6 +1915,19 @@ unsigned int udp_poll(struct file *file, struct
>>> socket *sock, poll_table *wait)
>>>  }
>>>  EXPORT_SYMBOL(udp_poll);
>>>
>>> +static atomic_long_t udp_memory_allocated;
>>> +atomic_long_t *memory_allocated_udp(struct kmem_cgroup *sg)
>>> +{
>>> +       return&udp_memory_allocated;
>>> +}
>>> +EXPORT_SYMBOL(memory_allocated_udp);
>>> +
>>> +long *udp_sysctl_mem(struct kmem_cgroup *sg)
>>> +{
>>> +       return sysctl_udp_mem;
>>> +}
>>> +EXPORT_SYMBOL(udp_sysctl_mem);
>>> +
>>>  struct proto udp_prot = {
>>>        .name              = "UDP",
>>>        .owner             = THIS_MODULE,
>>> @@ -1936,8 +1946,8 @@ struct proto udp_prot = {
>>>        .unhash            = udp_lib_unhash,
>>>        .rehash            = udp_v4_rehash,
>>>        .get_port          = udp_v4_get_port,
>>> -       .memory_allocated  =&udp_memory_allocated,
>>> -       .sysctl_mem        = sysctl_udp_mem,
>>> +       .memory_allocated  =&memory_allocated_udp,
>>> +       .prot_mem          = udp_sysctl_mem,
>>>        .sysctl_wmem       =&sysctl_udp_wmem_min,
>>>        .sysctl_rmem       =&sysctl_udp_rmem_min,
>>>        .obj_size          = sizeof(struct udp_sock),
>>> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
>>> index d1fb63f..0762e68 100644
>>> --- a/net/ipv6/tcp_ipv6.c
>>> +++ b/net/ipv6/tcp_ipv6.c
>>> @@ -1959,6 +1959,7 @@ static int tcp_v6_init_sock(struct sock *sk)
>>>  {
>>>        struct inet_connection_sock *icsk = inet_csk(sk);
>>>        struct tcp_sock *tp = tcp_sk(sk);
>>> +       struct kmem_cgroup *sg;
>>>
>>>        skb_queue_head_init(&tp->out_of_order_queue);
>>>        tcp_init_xmit_timers(sk);
>>> @@ -2012,7 +2013,12 @@ static int tcp_v6_init_sock(struct sock *sk)
>>>        sk->sk_rcvbuf = sysctl_tcp_rmem[1];
>>>
>>>        local_bh_disable();
>>> -       percpu_counter_inc(&tcp_sockets_allocated);
>>> +       percpu_counter_inc(sk_sockets_allocated(sk));
>>> +#ifdef CONFIG_CGROUP_KMEM
>>> +       for (sg = sk->sk_cgrp->parent; sg; sg = sg->parent)
>>> +               percpu_counter_dec(sg_sockets_allocated(sk->sk_prot,
>>> sg));
>>> +#endif
>>> +
>>>        local_bh_enable();
>>>
>>>        return 0;
>>> @@ -2221,11 +2227,11 @@ struct proto tcpv6_prot = {
>>>        .unhash                 = inet_unhash,
>>>        .get_port               = inet_csk_get_port,
>>>        .enter_memory_pressure  = tcp_enter_memory_pressure,
>>> -       .sockets_allocated      =&tcp_sockets_allocated,
>>> -       .memory_allocated       =&tcp_memory_allocated,
>>> -       .memory_pressure        =&tcp_memory_pressure,
>>> +       .sockets_allocated      = sockets_allocated_tcp,
>>> +       .memory_allocated       = memory_allocated_tcp,
>>> +       .memory_pressure        = memory_pressure_tcp,
>>>        .orphan_count           =&tcp_orphan_count,
>>> -       .sysctl_mem             = sysctl_tcp_mem,
>>> +       .prot_mem               = tcp_sysctl_mem,
>>>        .sysctl_wmem            = sysctl_tcp_wmem,
>>>        .sysctl_rmem            = sysctl_tcp_rmem,
>>>        .max_header             = MAX_TCP_HEADER,
>>> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
>>> index 29213b5..ef4b5b3 100644
>>> --- a/net/ipv6/udp.c
>>> +++ b/net/ipv6/udp.c
>>> @@ -1465,8 +1465,8 @@ struct proto udpv6_prot = {
>>>        .unhash            = udp_lib_unhash,
>>>        .rehash            = udp_v6_rehash,
>>>        .get_port          = udp_v6_get_port,
>>> -       .memory_allocated  =&udp_memory_allocated,
>>> -       .sysctl_mem        = sysctl_udp_mem,
>>> +       .memory_allocated  = memory_allocated_udp,
>>> +       .prot_mem          = udp_sysctl_mem,
>>>        .sysctl_wmem       =&sysctl_udp_wmem_min,
>>>        .sysctl_rmem       =&sysctl_udp_rmem_min,
>>>        .obj_size          = sizeof(struct udp6_sock),
>>> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
>>> index 836aa63..1b0300d 100644
>>> --- a/net/sctp/socket.c
>>> +++ b/net/sctp/socket.c
>>> @@ -119,11 +119,30 @@ static int sctp_memory_pressure;
>>>  static atomic_long_t sctp_memory_allocated;
>>>  struct percpu_counter sctp_sockets_allocated;
>>>
>>> +static long *sctp_sysctl_mem(struct kmem_cgroup *sg)
>>> +{
>>> +       return sysctl_sctp_mem;
>>> +}
>>> +
>>>  static void sctp_enter_memory_pressure(struct sock *sk)
>>>  {
>>>        sctp_memory_pressure = 1;
>>>  }
>>>
>>> +static int *memory_pressure_sctp(struct kmem_cgroup *sg)
>>> +{
>>> +       return&sctp_memory_pressure;
>>> +}
>>> +
>>> +static atomic_long_t *memory_allocated_sctp(struct kmem_cgroup *sg)
>>> +{
>>> +       return&sctp_memory_allocated;
>>> +}
>>> +
>>> +static struct percpu_counter *sockets_allocated_sctp(struct kmem_cgroup
>>> *sg)
>>> +{
>>> +       return&sctp_sockets_allocated;
>>> +}
>>>
>>>  /* Get the sndbuf space available at the time on the association.  */
>>>  static inline int sctp_wspace(struct sctp_association *asoc)
>>> @@ -6831,13 +6850,13 @@ struct proto sctp_prot = {
>>>        .unhash      =  sctp_unhash,
>>>        .get_port    =  sctp_get_port,
>>>        .obj_size    =  sizeof(struct sctp_sock),
>>> -       .sysctl_mem  =  sysctl_sctp_mem,
>>> +       .prot_mem    =  sctp_sysctl_mem,
>>>        .sysctl_rmem =  sysctl_sctp_rmem,
>>>        .sysctl_wmem =  sysctl_sctp_wmem,
>>> -       .memory_pressure =&sctp_memory_pressure,
>>> +       .memory_pressure = memory_pressure_sctp,
>>>        .enter_memory_pressure = sctp_enter_memory_pressure,
>>> -       .memory_allocated =&sctp_memory_allocated,
>>> -       .sockets_allocated =&sctp_sockets_allocated,
>>> +       .memory_allocated = memory_allocated_sctp,
>>> +       .sockets_allocated = sockets_allocated_sctp,
>>>  };
>>>
>>>  #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
>>> @@ -6863,12 +6882,12 @@ struct proto sctpv6_prot = {
>>>        .unhash         = sctp_unhash,
>>>        .get_port       = sctp_get_port,
>>>        .obj_size       = sizeof(struct sctp6_sock),
>>> -       .sysctl_mem     = sysctl_sctp_mem,
>>> +       .prot_mem       = sctp_sysctl_mem,
>>>        .sysctl_rmem    = sysctl_sctp_rmem,
>>>        .sysctl_wmem    = sysctl_sctp_wmem,
>>> -       .memory_pressure =&sctp_memory_pressure,
>>> +       .memory_pressure = memory_pressure_sctp,
>>>        .enter_memory_pressure = sctp_enter_memory_pressure,
>>> -       .memory_allocated =&sctp_memory_allocated,
>>> -       .sockets_allocated =&sctp_sockets_allocated,
>>> +       .memory_allocated = memory_allocated_sctp,
>>> +       .sockets_allocated = sockets_allocated_sctp,
>>>  };
>>>  #endif /* defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) */
>>> --
>>> 1.7.6
>>>
>>> --
>>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>>> the body to majordomo@kvack.org.  For more info on Linux MM,
>>> see: http://www.linux-mm.org/ .
>>> Fight unfair telecom internet charges in Canada: sign
>>> http://stopthemeter.ca/
>>> Don't email:<a href=mailto:"dont@kvack.org">  email@kvack.org</a>
>>>
>
>


^ permalink raw reply

* Re: [PATCH] per-cgroup tcp buffer limitation
From: Glauber Costa @ 2011-09-06 22:37 UTC (permalink / raw)
  To: Greg Thelen
  Cc: linux-kernel, linux-mm, containers, netdev, xemul,
	David S. Miller, Hiroyouki Kamezawa, Eric W. Biederman
In-Reply-To: <CAHH2K0YJA7vZZ3QNAf63TZOnWhsRUwfuZYfntBL4muZ0G_Vt2w@mail.gmail.com>

On 09/06/2011 07:12 PM, Greg Thelen wrote:
> On Tue, Sep 6, 2011 at 9:16 AM, Glauber Costa<glommer@parallels.com>  wrote:
>> On 09/06/2011 01:08 PM, Greg Thelen wrote:
>>>
>>> On Mon, Sep 5, 2011 at 7:35 PM, Glauber Costa<glommer@parallels.com>
>>>   wrote:
>>>>
>>>> This patch introduces per-cgroup tcp buffers limitation. This allows
>>>> sysadmins to specify a maximum amount of kernel memory that
>>>> tcp connections can use at any point in time. TCP is the main interest
>>>> in this work, but extending it to other protocols would be easy.
>>
>> Hello Greg,
>>
>>> With this approach we would be giving admins the ability to
>>> independently limit user memory with memcg and kernel memory with this
>>> new kmem cgroup.
>>>
>>> At least in some situations admins prefer to give a particular
>>> container X bytes without thinking about the kernel vs user split.
>>> Sometimes the admin would prefer the kernel to keep the total
>>> user+kernel memory below a certain threshold.  To achieve this with
>>> this approach would we need a user space agent to monitor both kernel
>>> and user usage for a container and grow/shrink memcg/kmem limits?
>>
>> Yes, I believe so. And this is not only valid for containers: the
>> information we expose in proc, sys, cgroups, etc, is always much more fine
>> grained than a considerable part of the users want. Tools come to fill this
>> gap.
>
> In your use cases do jobs separately specify independent kmem usage
> limits and user memory usage limits?

Yes, because they are different in nature: user memory can be
overcommitted, while kernel memory is pinned by its objects and can't go
to swap.

> I presume for people who want to simply dedicate X bytes of memory to
> container C that a user-space agent would need to poll both
> memcg/X/memory.usage_in_bytes and kmem/X/kmem.usage_in_bytes (or some
> other file) to determine if memory limits should be adjusted (i.e. if
> kernel memory is growing, then user memory would need to shrink).
Ok.

I think memcg's usage is really all you need here. At the end of the
day, it tells you how many pages your container has available. The whole
point of kmem cgroup is not any kind of reservation or accounting.

Once a container (or cgroup) reaches a number of objects *pinned* in 
memory (therefore, non-reclaimable), you won't be able to grab anything 
from it.
> So
> far my use cases involve a single memory limit which includes both
> kernel and user memory.  So I would need a user space agent to poll
> {memcg,kmem}.usage_in_bytes to apply pressure to memcg if kmem grows
> and vice versa.

Maybe not.
If userspace memory works for you today (supposing it does), why change?
Right now you assign X bytes of user memory to a container, and the 
kernel memory is shared among all of them. If this works for you, 
kmem_cgroup won't change that. It will just impose limits beyond which
your kernel objects can't grow.

So you don't *need* a userspace agent doing this calculation, because 
fundamentally, nothing changed: I am not unbilling memory in memcg to 
bill it back in kmem_cg. Of course, once it is in, you will be able to 
do it in such a fine grained fashion if you decide to do so.

> Do you foresee instantiation of multiple kmem cgroups, so that a
> process could be added into kmem/K1 or kmem/K2?  If so do you plan on
> supporting migration between cgroups and/or migration of kmem charge
> between K1 to K2?
Yes, each container should have its own cgroup, so at least in the use
cases I am concerned with, we will have a lot of them. But the usual
lifecycle is create, execute, and die. Mobility between them
is not something I am overly concerned about right now.


>>> Do you foresee the kmem cgroup growing to include reclaimable slab,
>>> where freeing one type of memory allows for reclaim of the other?
>>
>> Yes, absolutely.
>
> Small comments below.
>

>>>>   };
>>>>
>>>> +#define sk_memory_pressure(sk)                                         \
>>>> +({                                                                     \
>>>> +       int *__ret = NULL;                                              \
>>>> +       if ((sk)->sk_prot->memory_pressure)                             \
>>>> +               __ret = (sk)->sk_prot->memory_pressure(sk->sk_cgrp);    \
>>>> +       __ret;                                                          \
>>>> +})
>>>> +
>>>> +#define sk_sockets_allocated(sk)                               \
>>>> +({                                                             \
>>>> +       struct percpu_counter *__p;                             \
>>>> +       __p = (sk)->sk_prot->sockets_allocated(sk->sk_cgrp);    \
>>>> +       __p;                                                    \
>>>> +})
>
> Could this be simplified as (same applies to following few macros):
>
> static inline struct percpu_counter *sk_sockets_allocated(struct sock *sk)
> {
>          return sk->sk_prot->sockets_allocated(sk->sk_cgrp);
> }
Yes and no. Right now, I need them to be valid lvalues. But in the 
upcoming version of the patch, I will drop this requirement. Then
I will move to inline functions.
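
For illustration only, a minimal userspace sketch of the inline-function
form discussed above, applied to the sk_memory_pressure() macro. The
struct layouts are simplified stand-ins invented for this example (only
the fields needed here), not the kernel's real definitions; the NULL
check mirrors the one in the quoted macro.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel types; hypothetical, for illustration. */
struct kmem_cgroup {
	int tcp_memory_pressure;
};

struct proto {
	/* Optional hook: protocols without pressure tracking leave it NULL. */
	int *(*memory_pressure)(struct kmem_cgroup *sg);
};

struct sock {
	struct proto *sk_prot;
	struct kmem_cgroup *sk_cgrp;
};

/* Per-cgroup accessor, shaped like memory_pressure_tcp() in the patch. */
static int *memory_pressure_tcp(struct kmem_cgroup *sg)
{
	return &sg->tcp_memory_pressure;
}

/* Inline-function form of the sk_memory_pressure() macro. */
static inline int *sk_memory_pressure(const struct sock *sk)
{
	if (!sk->sk_prot->memory_pressure)
		return NULL;
	return sk->sk_prot->memory_pressure(sk->sk_cgrp);
}
```

In this sketch the accessor returns a pointer, so a caller has to check
it for NULL and then dereference it to read or set the flag; testing the
return value alone only says whether the protocol provides the hook.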


^ permalink raw reply

* [PATCH net] gianfar: Fix overflow check and return value for gfar_get_cls_all()
From: Ben Hutchings @ 2011-09-06 22:44 UTC (permalink / raw)
  To: David Miller, Sebastian Poehn; +Cc: netdev

This function may currently fill one entry beyond the end of the
array it is given.  It also doesn't return an error code in case
it does detect overflow.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
This code is new for 3.1, so no stable update required.

Not tested in any way, as this driver depends on FSL_SOC.

Ben.

 drivers/net/gianfar_ethtool.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/gianfar_ethtool.c b/drivers/net/gianfar_ethtool.c
index 25a8c2a..0caf3c3 100644
--- a/drivers/net/gianfar_ethtool.c
+++ b/drivers/net/gianfar_ethtool.c
@@ -1669,10 +1669,10 @@ static int gfar_get_cls_all(struct gfar_private *priv,
 	u32 i = 0;
 
 	list_for_each_entry(comp, &priv->rx_list.list, list) {
-		if (i <= cmd->rule_cnt) {
-			rule_locs[i] = comp->fs.location;
-			i++;
-		}
+		if (i == cmd->rule_cnt)
+			return -EMSGSIZE;
+		rule_locs[i] = comp->fs.location;
+		i++;
 	}
 
 	cmd->data = MAX_FILER_IDX;
-- 
1.7.4.4


-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply related

* [net-next-2.6 PATCH 0/3 RFC] macvlan: MAC Address filtering support for passthru mode
From: Roopa Prabhu @ 2011-09-06 22:35 UTC (permalink / raw)
  To: netdev; +Cc: dragos.tatulea, arnd, mst, dwang2, benve, kaber, sri

This patch is an attempt at providing address filtering support for macvtap
devices in PASSTHRU mode. It's still a work in progress and has only been
briefly tested for basic functionality. I wanted to get some feedback on the
direction before proceeding.

I have hopefully CC'ed all concerned people.

PASSTHRU mode today sets the lowerdev in promiscuous mode. In PASSTHRU mode
there is a 1-1 mapping between the macvtap device and the physical NIC or VF,
and all filtering is done in lowerdev hw. The lowerdev does not need to be in
promiscuous mode as long as the guest filters are passed down to the lowerdev.
This patch tries to remove the need for putting the lowerdev in promiscuous mode.
I have also referred to the thread below where TUNSETTXFILTER was mentioned in
this context:
 http://patchwork.ozlabs.org/patch/69297/

This patch basically passes the addresses received via TUNSETTXFILTER to the
macvlan lowerdev.

I have looked at previous work and discussions on this for qemu-kvm 
by Michael Tsirkin, Alex Williamson and Dragos Tatulea
http://patchwork.ozlabs.org/patch/78595/
http://patchwork.ozlabs.org/patch/47160/
https://patchwork.kernel.org/patch/474481/

Red Hat Bugzilla entry by Michael Tsirkin:
https://bugzilla.redhat.com/show_bug.cgi?id=655013

I used Michael's qemu-kvm patch for testing the changes with KVM.

I would like to cover both MAC and vlan filtering in this work.

Open Questions/Issues:
- There is a need for vlan filtering to complete the patch. It will require 
  a new tap ioctl cmd for vlans. 
  Some ideas on this are: 

  a) TUNSETVLANFILTER: This entails sending the whole VLAN bitmap filter
	(similar to tun_filter for addresses). Passing the VLAN IDs to the lower
	device then means walking the whole list of VLANs every time.

  OR

  b) TUNSETVLAN with a VLAN ID and a flag to set/unset

  Does option (b) sound OK?

- In this implementation we make the macvlan address list the same as the
  address list that came in the filter with TUNSETTXFILTER. This will not cover
  cases where the macvlan device needs to have other addresses that are not
  necessarily in the filter. Is this a problem?

- The patch currently only supports passing the IFF_PROMISC and IFF_ALLMULTI
  filter flags to the lowerdev.
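
To make the VLAN question above more concrete, here is a userspace sketch of
what the two request shapes could look like. Everything in it is hypothetical:
neither TUNSETVLANFILTER nor TUNSETVLAN exists, and the struct names, field
layout and helper functions are assumptions chosen only to compare the two
options.

```c
#include <limits.h>
#include <stdint.h>

#define VLAN_N_VID 4096	/* 12-bit VLAN ID space */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define VLAN_WORDS ((VLAN_N_VID + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Option (a): the caller sends the complete bitmap on every update. */
struct vlan_bitmap_filter {
	unsigned long bits[VLAN_WORDS];
};

/* Option (b): the caller sends one VLAN ID plus a set/unset flag. */
struct vlan_req {
	uint16_t vid;
	uint16_t set;	/* nonzero = add, zero = remove */
};

/* Apply a single option-(b) request to the stored filter state. */
static int vlan_apply(struct vlan_bitmap_filter *f, const struct vlan_req *r)
{
	if (r->vid >= VLAN_N_VID)
		return -1;
	if (r->set)
		f->bits[r->vid / BITS_PER_WORD] |= 1UL << (r->vid % BITS_PER_WORD);
	else
		f->bits[r->vid / BITS_PER_WORD] &= ~(1UL << (r->vid % BITS_PER_WORD));
	return 0;
}

/* Report whether a VLAN ID is currently allowed by the filter. */
static int vlan_allowed(const struct vlan_bitmap_filter *f, uint16_t vid)
{
	return (int)((f->bits[vid / BITS_PER_WORD] >> (vid % BITS_PER_WORD)) & 1UL);
}
```

With option (b) each change touches a single bit, whereas option (a) would
require comparing the old and new bitmaps to find out which IDs changed.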

This patch series implements the following:
01/3 - macvlan: Add support for unicast filtering in macvlan 
02/3 - macvlan: Add function to set addr filter on lower device in passthru mode
03/3 - macvtap: Add support for TUNSETTXFILTER

Please comment. Thanks.

Signed-off-by: Roopa Prabhu <roprabhu@cisco.com>
Signed-off-by: Christian Benvenuti <benve@cisco.com>
Signed-off-by: David Wang <dwang2@cisco.com>

^ permalink raw reply

* [net-next-2.6 PATCH 1/3 RFC] macvlan: Add support for unicast filtering in macvlan
From: Roopa Prabhu @ 2011-09-06 22:35 UTC (permalink / raw)
  To: netdev; +Cc: dragos.tatulea, arnd, mst, dwang2, benve, kaber, sri
In-Reply-To: <20110906223536.6552.2062.stgit@savbu-pc100.cisco.com>

From: Roopa Prabhu <roprabhu@cisco.com>

This patch adds support for ndo_set_rx_mode and sets IFF_UNICAST_FLT to enable
unicast filtering in macvlan. This is to support unicast and multicast
address filtering when this device is used in PASSTHRU mode.

Signed-off-by: Roopa Prabhu <roprabhu@cisco.com>
Signed-off-by: Christian Benvenuti <benve@cisco.com>
Signed-off-by: David Wang <dwang2@cisco.com>
---
 drivers/net/macvlan.c |   20 ++++++++++++++++++--
 1 files changed, 18 insertions(+), 2 deletions(-)


diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 836e13f..528924f 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -316,11 +316,19 @@ static int macvlan_open(struct net_device *dev)
 		if (err < 0)
 			goto del_unicast;
 	}
+	if (dev->flags & IFF_PROMISC) {
+		err = dev_set_promiscuity(lowerdev, 1);
+		if (err < 0)
+			goto unset_allmulti;
+	}
 
 hash_add:
 	macvlan_hash_add(vlan);
 	return 0;
 
+unset_allmulti:
+	dev_set_allmulti(lowerdev, -1);
+
 del_unicast:
 	dev_uc_del(lowerdev, dev->dev_addr);
 out:
@@ -337,9 +345,12 @@ static int macvlan_stop(struct net_device *dev)
 		goto hash_del;
 	}
 
+	dev_uc_unsync(lowerdev, dev);
 	dev_mc_unsync(lowerdev, dev);
 	if (dev->flags & IFF_ALLMULTI)
 		dev_set_allmulti(lowerdev, -1);
+	if (dev->flags & IFF_PROMISC)
+		dev_set_promiscuity(lowerdev, -1);
 
 	dev_uc_del(lowerdev, dev->dev_addr);
 
@@ -384,12 +395,16 @@ static void macvlan_change_rx_flags(struct net_device *dev, int change)
 
 	if (change & IFF_ALLMULTI)
 		dev_set_allmulti(lowerdev, dev->flags & IFF_ALLMULTI ? 1 : -1);
+	if (change & IFF_PROMISC)
+		dev_set_promiscuity(lowerdev,
+			dev->flags & IFF_PROMISC ? 1 : -1);
 }
 
-static void macvlan_set_multicast_list(struct net_device *dev)
+static void macvlan_set_rx_mode(struct net_device *dev)
 {
 	struct macvlan_dev *vlan = netdev_priv(dev);
 
+	dev_uc_sync(vlan->lowerdev, dev);
 	dev_mc_sync(vlan->lowerdev, dev);
 }
 
@@ -561,7 +576,7 @@ static const struct net_device_ops macvlan_netdev_ops = {
 	.ndo_change_mtu		= macvlan_change_mtu,
 	.ndo_change_rx_flags	= macvlan_change_rx_flags,
 	.ndo_set_mac_address	= macvlan_set_mac_address,
-	.ndo_set_rx_mode	= macvlan_set_multicast_list,
+	.ndo_set_rx_mode	= macvlan_set_rx_mode,
 	.ndo_get_stats64	= macvlan_dev_get_stats64,
 	.ndo_validate_addr	= eth_validate_addr,
 	.ndo_vlan_rx_add_vid	= macvlan_vlan_rx_add_vid,
@@ -573,6 +588,7 @@ void macvlan_common_setup(struct net_device *dev)
 	ether_setup(dev);
 
 	dev->priv_flags	       &= ~(IFF_XMIT_DST_RELEASE | IFF_TX_SKB_SHARING);
+	dev->priv_flags	       |= IFF_UNICAST_FLT;
 	dev->netdev_ops		= &macvlan_netdev_ops;
 	dev->destructor		= free_netdev;
 	dev->header_ops		= &macvlan_hard_header_ops,

^ permalink raw reply related

* [net-next-2.6 PATCH 2/3 RFC] macvlan: Add function to set addr filters for device in passthru mode
From: Roopa Prabhu @ 2011-09-06 22:35 UTC (permalink / raw)
  To: netdev; +Cc: dragos.tatulea, arnd, mst, dwang2, benve, kaber, sri
In-Reply-To: <20110906223536.6552.2062.stgit@savbu-pc100.cisco.com>

From: Roopa Prabhu <roprabhu@cisco.com>

This patch introduces a function that accepts an address list and filter flags
and sets them on the macvlan netdev. Currently it only supports PASSTHRU mode.
It also removes the code that puts the device in promiscuous mode for PASSTHRU.

Signed-off-by: Roopa Prabhu <roprabhu@cisco.com>
Signed-off-by: Christian Benvenuti <benve@cisco.com>
Signed-off-by: David Wang <dwang2@cisco.com>
---
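
As a reading aid, a userspace sketch of the reconcile step this patch
performs: drop addresses that are no longer in the requested list, then add
the ones that are missing. The flat struct addr_set array, MAX_ADDRS and the
helper names are assumptions for this sketch; the patch itself works on the
netdev unicast and multicast lists via dev_uc_add/dev_uc_del and
dev_mc_add/dev_mc_del.

```c
#include <stddef.h>
#include <string.h>

#define ADDR_LEN 6	/* same length as ETH_ALEN */
#define MAX_ADDRS 16	/* arbitrary capacity for this sketch */

struct addr_set {
	unsigned char addr[MAX_ADDRS][ADDR_LEN];
	size_t n;
};

static int set_contains(const struct addr_set *s, const unsigned char *a)
{
	size_t i;

	for (i = 0; i < s->n; i++)
		if (memcmp(s->addr[i], a, ADDR_LEN) == 0)
			return 1;
	return 0;
}

/*
 * Make cur equal to want: drop entries missing from want, then add
 * entries missing from cur. Returns 0, or -1 if cur runs out of room.
 */
static int set_reconcile(struct addr_set *cur, const struct addr_set *want)
{
	size_t i, kept = 0;

	/* Pass 1: keep only addresses that are still wanted. */
	for (i = 0; i < cur->n; i++) {
		if (set_contains(want, cur->addr[i])) {
			memmove(cur->addr[kept], cur->addr[i], ADDR_LEN);
			kept++;
		}
	}
	cur->n = kept;

	/* Pass 2: append wanted addresses that are not present yet. */
	for (i = 0; i < want->n; i++) {
		if (set_contains(cur, want->addr[i]))
			continue;
		if (cur->n == MAX_ADDRS)
			return -1;
		memcpy(cur->addr[cur->n], want->addr[i], ADDR_LEN);
		cur->n++;
	}
	return 0;
}
```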
 drivers/net/macvlan.c      |  130 ++++++++++++++++++++++++++++++++++++++------
 include/linux/if_macvlan.h |    3 +
 2 files changed, 114 insertions(+), 19 deletions(-)


diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 528924f..ba8b5a6 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -299,18 +299,16 @@ static int macvlan_open(struct net_device *dev)
 	struct net_device *lowerdev = vlan->lowerdev;
 	int err;
 
-	if (vlan->port->passthru) {
-		dev_set_promiscuity(lowerdev, 1);
-		goto hash_add;
-	}
+	if (!vlan->port->passthru) {
+		err = -EBUSY;
+		if (macvlan_addr_busy(vlan->port, dev->dev_addr))
+			goto out;
 
-	err = -EBUSY;
-	if (macvlan_addr_busy(vlan->port, dev->dev_addr))
-		goto out;
+		err = dev_uc_add(lowerdev, dev->dev_addr);
+		if (err < 0)
+			goto out;
+	}
 
-	err = dev_uc_add(lowerdev, dev->dev_addr);
-	if (err < 0)
-		goto out;
 	if (dev->flags & IFF_ALLMULTI) {
 		err = dev_set_allmulti(lowerdev, 1);
 		if (err < 0)
@@ -322,7 +320,6 @@ static int macvlan_open(struct net_device *dev)
 			goto unset_allmulti;
 	}
 
-hash_add:
 	macvlan_hash_add(vlan);
 	return 0;
 
@@ -330,7 +327,8 @@ unset_allmulti:
 	dev_set_allmulti(lowerdev, -1);
 
 del_unicast:
-	dev_uc_del(lowerdev, dev->dev_addr);
+	if (!vlan->port->passthru)
+		dev_uc_del(lowerdev, dev->dev_addr);
 out:
 	return err;
 }
@@ -340,11 +338,6 @@ static int macvlan_stop(struct net_device *dev)
 	struct macvlan_dev *vlan = netdev_priv(dev);
 	struct net_device *lowerdev = vlan->lowerdev;
 
-	if (vlan->port->passthru) {
-		dev_set_promiscuity(lowerdev, -1);
-		goto hash_del;
-	}
-
 	dev_uc_unsync(lowerdev, dev);
 	dev_mc_unsync(lowerdev, dev);
 	if (dev->flags & IFF_ALLMULTI)
@@ -352,9 +345,9 @@ static int macvlan_stop(struct net_device *dev)
 	if (dev->flags & IFF_PROMISC)
 		dev_set_promiscuity(lowerdev, -1);
 
-	dev_uc_del(lowerdev, dev->dev_addr);
+	if (!vlan->port->passthru)
+		dev_uc_del(lowerdev, dev->dev_addr);
 
-hash_del:
 	macvlan_hash_del(vlan, !dev->dismantle);
 	return 0;
 }
@@ -854,6 +847,105 @@ static int macvlan_device_event(struct notifier_block *unused,
 	return NOTIFY_DONE;
 }
 
+static int macvlan_set_filter_passthru(struct net_device *dev,
+					unsigned int flags, int count,
+					u8 *addrlist)
+{
+	struct netdev_hw_addr *ha;
+	u8 *addr;
+	int i, found;
+	unsigned int flags_changed;
+
+	rtnl_lock();
+
+	/*
+	 *	Only look for changes in IFF_PROMISC and IFF_ALLMULTI for now.
+	 *	Ideally this list for flags to look must come from the caller
+	 */
+	flags_changed = (dev->flags ^ flags) & (IFF_PROMISC | IFF_ALLMULTI);
+	if (flags_changed)
+		dev_change_flags(dev, dev->flags ^ flags_changed);
+
+	if (!count) {
+		rtnl_unlock();
+		return 0;
+	}
+
+	/* Delete all multicast addresses not in use */
+	netdev_for_each_mc_addr(ha, dev) {
+		for (i = 0, addr = addrlist; i < count; i++) {
+			if (is_multicast_ether_addr(addr)
+				&& !compare_ether_addr(addr,
+				ha->addr))
+				break;
+			addr += ETH_ALEN;
+		}
+		if (i == count)
+			dev_mc_del(dev, ha->addr);
+	}
+
+	/* Delete all unicast addresses not in use */
+	netdev_for_each_uc_addr(ha, dev) {
+		for (i = 0, addr = addrlist; i < count; i++) {
+			if (!is_multicast_ether_addr(addr)
+				&& !compare_ether_addr(addr,
+				ha->addr))
+				break;
+			addr += ETH_ALEN;
+		}
+		if (i == count)
+			dev_uc_del(dev, ha->addr);
+	}
+
+	/* Add new addresses */
+	for (i = 0, addr = addrlist; i < count; i++) {
+		found = 0;
+		if (is_multicast_ether_addr(addr)) {
+			netdev_for_each_mc_addr(ha, dev) {
+				if (compare_ether_addr(addr,
+					ha->addr) == 0) {
+					found = 1;
+					break;
+				}
+			}
+			if (!found)
+				dev_mc_add(dev, addr);
+		} else {
+			netdev_for_each_uc_addr(ha, dev) {
+				if (compare_ether_addr(addr,
+					ha->addr) == 0) {
+					found = 1;
+					break;
+				}
+			}
+			if (!found)
+				dev_uc_add(dev, addr);
+		}
+		addr += ETH_ALEN;
+	}
+	rtnl_unlock();
+
+	return count;
+}
+
+int macvlan_set_filter(struct net_device *dev, unsigned int flags,
+			int count, u8 *addrlist)
+{
+	struct macvlan_dev *vlan = netdev_priv(dev);
+
+	if (!vlan)
+		return -EINVAL;
+
+	switch (vlan->mode) {
+	case MACVLAN_MODE_PASSTHRU:
+		return macvlan_set_filter_passthru(dev, flags, count,
+							addrlist);
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+EXPORT_SYMBOL_GPL(macvlan_set_filter);
+
 static struct notifier_block macvlan_notifier_block __read_mostly = {
 	.notifier_call	= macvlan_device_event,
 };
diff --git a/include/linux/if_macvlan.h b/include/linux/if_macvlan.h
index e28b2e4..d55d4e6 100644
--- a/include/linux/if_macvlan.h
+++ b/include/linux/if_macvlan.h
@@ -104,4 +104,7 @@ extern int macvlan_link_register(struct rtnl_link_ops *ops);
 extern netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
 				      struct net_device *dev);
 
+extern int macvlan_set_filter(struct net_device *dev, unsigned int flags,
+			      int count, u8 *addrlist);
+
 #endif /* _LINUX_IF_MACVLAN_H */

^ permalink raw reply related

* [net-next-2.6 PATCH 3/3 RFC] macvtap: Add support for TUNSETTXFILTER
From: Roopa Prabhu @ 2011-09-06 22:35 UTC (permalink / raw)
  To: netdev; +Cc: dragos.tatulea, arnd, mst, dwang2, benve, kaber, sri
In-Reply-To: <20110906223536.6552.2062.stgit@savbu-pc100.cisco.com>

From: Roopa Prabhu <roprabhu@cisco.com>

This patch adds support for TUNSETTXFILTER. It calls the macvlan set filter
function (macvlan_set_filter) with the address list and flags received via
TUNSETTXFILTER.

Signed-off-by: Roopa Prabhu <roprabhu@cisco.com>
Signed-off-by: Christian Benvenuti <benve@cisco.com>
Signed-off-by: David Wang <dwang2@cisco.com>
---
 drivers/net/macvtap.c |   49 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 49 insertions(+), 0 deletions(-)


diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index ab96c31..9943683 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -825,6 +825,10 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
 	int __user *sp = argp;
 	int s;
 	int ret;
+	u8 *addrs;
+	struct tun_filter tf;
+	unsigned int flags = 0;
+	int alen;
 
 	switch (cmd) {
 	case TUNSETIFF:
@@ -896,6 +900,51 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
 			return  -EINVAL;
 		return 0;
 
+	case TUNSETTXFILTER:
+		rcu_read_lock_bh();
+		vlan = rcu_dereference_bh(q->vlan);
+		if (vlan)
+			dev_hold(vlan->dev);
+		rcu_read_unlock_bh();
+
+		if (!vlan)
+			return -ENOLINK;
+
+		if (copy_from_user(&tf, argp, sizeof(tf))) {
+			dev_put(vlan->dev);
+			return -EFAULT;
+		}
+
+		/* XXX: If broadcast address present, set IFF_BROADCAST */
+		/* XXX: If multicast address present, set IFF_MULTICAST */
+		flags |= (tf.flags & TUN_FLT_ALLMULTI ? IFF_ALLMULTI : 0) |
+			 (!tf.count ? IFF_PROMISC : 0);
+		ret = 0;
+		if (tf.count > 0) {
+			alen = ETH_ALEN * tf.count;
+			addrs = kmalloc(alen, GFP_KERNEL);
+			if (!addrs) {
+				dev_put(vlan->dev);
+				return -ENOMEM;
+			}
+
+			if (copy_from_user(addrs, argp + sizeof(tf), alen)) {
+				kfree(addrs);
+				dev_put(vlan->dev);
+				return -EFAULT;
+			}
+		}
+
+		if (tf.count > 0 || flags)
+			ret = macvlan_set_filter(vlan->dev, flags,
+				tf.count, addrs);
+		dev_put(vlan->dev);
+
+		if (tf.count > 0)
+			kfree(addrs);
+
+		return ret;
+
 	default:
 		return -EINVAL;
 	}

^ permalink raw reply related

* Re: [PATCH] per-cgroup tcp buffer limitation
From: Glauber Costa @ 2011-09-06 22:52 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, linux-mm, containers, netdev, xemul,
	David S. Miller, Eric W. Biederman
In-Reply-To: <20110906190038.7a0a8807.kamezawa.hiroyu@jp.fujitsu.com>

On 09/06/2011 07:00 AM, KAMEZAWA Hiroyuki wrote:
> On Mon,  5 Sep 2011 23:35:56 -0300
> Glauber Costa <glommer@parallels.com> wrote:
>
>> This patch introduces per-cgroup tcp buffers limitation. This allows
>> sysadmins to specify a maximum amount of kernel memory that
>> tcp connections can use at any point in time. TCP is the main interest
>> in this work, but extending it to other protocols would be easy.
>>
>> It piggybacks in the memory control mechanism already present in
>> /proc/sys/net/ipv4/tcp_mem. There is a soft limit, and a hard limit,
>> that will suppress allocation when reached. For each cgroup, however,
>> the file kmem.tcp_maxmem will be used to cap those values.
>>
>> The usage I have in mind here is containers. Each container will
>> define its own values for soft and hard limits, but none of them will
>> be possibly bigger than the value the box' sysadmin specified from
>> the outside.
>>
>> To test for any performance impacts of this patch, I used netperf's
>> TCP_RR benchmark on localhost, so we can have both recv and snd in action.
>>
>> Command line used was ./src/netperf -t TCP_RR -H localhost, and the
>> results:
>>
>> Without the patch
>> =================
>>
>> Socket Size   Request  Resp.   Elapsed  Trans.
>> Send   Recv   Size     Size    Time     Rate
>> bytes  Bytes  bytes    bytes   secs.    per sec
>>
>> 16384  87380  1        1       10.00    26996.35
>> 16384  87380
>>
>> With the patch
>> ===============
>>
>> Local /Remote
>> Socket Size   Request  Resp.   Elapsed  Trans.
>> Send   Recv   Size     Size    Time     Rate
>> bytes  Bytes  bytes    bytes   secs.    per sec
>>
>> 16384  87380  1        1       10.00    27291.86
>> 16384  87380
>>
>> The difference is within a one-percent range.
>>
>> Nesting cgroups doesn't seem to be the dominating factor as well,
>> with nestings up to 10 levels not showing a significant performance
>> difference.
>>
>> Signed-off-by: Glauber Costa <glommer@parallels.com>
>> CC: David S. Miller <davem@davemloft.net>
>> CC: Hiroyuki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
>> CC: Eric W. Biederman <ebiederm@xmission.com>
>> ---
>>   crypto/af_alg.c               |    7 ++-
>>   include/linux/cgroup_subsys.h |    4 +
>>   include/net/netns/ipv4.h      |    1 +
>>   include/net/sock.h            |   66 +++++++++++++++-
>>   include/net/tcp.h             |   12 ++-
>>   include/net/udp.h             |    3 +-
>>   include/trace/events/sock.h   |   10 +-
>>   init/Kconfig                  |   11 +++
>>   mm/Makefile                   |    1 +
>>   net/core/sock.c               |  136 +++++++++++++++++++++++++++-------
>>   net/decnet/af_decnet.c        |   21 +++++-
>>   net/ipv4/proc.c               |    8 +-
>>   net/ipv4/sysctl_net_ipv4.c    |   59 +++++++++++++--
>>   net/ipv4/tcp.c                |  164 +++++++++++++++++++++++++++++++++++-----
>>   net/ipv4/tcp_input.c          |   17 ++--
>>   net/ipv4/tcp_ipv4.c           |   27 +++++--
>>   net/ipv4/tcp_output.c         |    2 +-
>>   net/ipv4/tcp_timer.c          |    2 +-
>>   net/ipv4/udp.c                |   20 ++++-
>>   net/ipv6/tcp_ipv6.c           |   16 +++-
>>   net/ipv6/udp.c                |    4 +-
>>   net/sctp/socket.c             |   35 +++++++--
>>   22 files changed, 514 insertions(+), 112 deletions(-)
>
> Hmm...could you please divide this patch into a few smaller patches?
>
> If I were you, I'd divide it into
>
>   - Kconfig/Makefile/kmem_cgroup.c skeleton.
>   - changes to struct sock and macro definition
>   - hooks to tcp.
>   - hooks to udp
>   - hooks to sctp

BTW: one thing to keep in mind is that the hooks can't go
entirely in separate patches. Most of them have to go in the same
patch in which we change struct sock; otherwise bisectability is
gone.

> And why isn't mm/kmem_cgroup.c included?
>
> some comments below.
>
>
>>
>> diff --git a/crypto/af_alg.c b/crypto/af_alg.c
>> index ac33d5f..df168d8 100644
>> --- a/crypto/af_alg.c
>> +++ b/crypto/af_alg.c
>> @@ -29,10 +29,15 @@ struct alg_type_list {
>>
>>   static atomic_long_t alg_memory_allocated;
>>
>> +static atomic_long_t *memory_allocated_alg(struct kmem_cgroup *sg)
>> +{
>> +	return &alg_memory_allocated;
>> +}
>> +
>>   static struct proto alg_proto = {
>>   	.name			= "ALG",
>>   	.owner			= THIS_MODULE,
>> -	.memory_allocated	= &alg_memory_allocated,
>> +	.memory_allocated	= memory_allocated_alg,
>>   	.obj_size		= sizeof(struct alg_sock),
>>   };
>>
>> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
>> index ac663c1..363b8e8 100644
>> --- a/include/linux/cgroup_subsys.h
>> +++ b/include/linux/cgroup_subsys.h
>> @@ -35,6 +35,10 @@ SUBSYS(cpuacct)
>>   SUBSYS(mem_cgroup)
>>   #endif
>>
>> +#ifdef CONFIG_CGROUP_KMEM
>> +SUBSYS(kmem)
>> +#endif
>> +
>>   /* */
>>
>>   #ifdef CONFIG_CGROUP_DEVICE
>> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
>> index d786b4f..bbd023a 100644
>> --- a/include/net/netns/ipv4.h
>> +++ b/include/net/netns/ipv4.h
>> @@ -55,6 +55,7 @@ struct netns_ipv4 {
>>   	int current_rt_cache_rebuild_count;
>>
>>   	unsigned int sysctl_ping_group_range[2];
>> +	long sysctl_tcp_mem[3];
>>
>>   	atomic_t rt_genid;
>>   	atomic_t dev_addr_genid;
>> diff --git a/include/net/sock.h b/include/net/sock.h
>> index 8e4062f..e085148 100644
>> --- a/include/net/sock.h
>> +++ b/include/net/sock.h
>> @@ -62,7 +62,9 @@
>>   #include <linux/atomic.h>
>>   #include <net/dst.h>
>>   #include <net/checksum.h>
>> +#include <linux/kmem_cgroup.h>
>>
>> +int sockets_populate(struct cgroup_subsys *ss, struct cgroup *cgrp);
>>   /*
>>    * This structure really needs to be cleaned up.
>>    * Most of it is for TCP, and not used by any of
>> @@ -339,6 +341,7 @@ struct sock {
>>   #endif
>>   	__u32			sk_mark;
>>   	u32			sk_classid;
>> +	struct kmem_cgroup	*sk_cgrp;
>>   	void			(*sk_state_change)(struct sock *sk);
>>   	void			(*sk_data_ready)(struct sock *sk, int bytes);
>>   	void			(*sk_write_space)(struct sock *sk);
>> @@ -786,16 +789,21 @@ struct proto {
>>
>>   	/* Memory pressure */
>>   	void			(*enter_memory_pressure)(struct sock *sk);
>> -	atomic_long_t		*memory_allocated;	/* Current allocated memory. */
>> -	struct percpu_counter	*sockets_allocated;	/* Current number of sockets. */
>> +	/* Current allocated memory. */
>> +	atomic_long_t		*(*memory_allocated)(struct kmem_cgroup *sg);
>> +	/* Current number of sockets. */
>> +	struct percpu_counter	*(*sockets_allocated)(struct kmem_cgroup *sg);
>> +
>> +	int			(*init_cgroup)(struct cgroup *cgrp,
>> +					       struct cgroup_subsys *ss);
>>   	/*
>>   	 * Pressure flag: try to collapse.
>>   	 * Technical note: it is used by multiple contexts non atomically.
>>   	 * All the __sk_mem_schedule() is of this nature: accounting
>>   	 * is strict, actions are advisory and have some latency.
>>   	 */
>> -	int			*memory_pressure;
>> -	long			*sysctl_mem;
>> +	int			*(*memory_pressure)(struct kmem_cgroup *sg);
>> +	long			*(*prot_mem)(struct kmem_cgroup *sg);
>
> Hmm. Don't the socket interface callbacks have documentation?
> Adding an explanation under Documentation/ would be better, wouldn't it?
>
>
>>   	int			*sysctl_wmem;
>>   	int			*sysctl_rmem;
>>   	int			max_header;
>> @@ -826,6 +834,56 @@ struct proto {
>>   #endif
>>   };
>>
>> +#define sk_memory_pressure(sk)						\
>> +({									\
>> +	int *__ret = NULL;						\
>> +	if ((sk)->sk_prot->memory_pressure)				\
>> +		__ret = (sk)->sk_prot->memory_pressure(sk->sk_cgrp);	\
>> +	__ret;								\
>> +})
>> +
>> +#define sk_sockets_allocated(sk)				\
>> +({								\
>> +	struct percpu_counter *__p;				\
>> +	__p = (sk)->sk_prot->sockets_allocated(sk->sk_cgrp);	\
>> +	__p;							\
>> +})
>> +
>> +#define sk_memory_allocated(sk)					\
>> +({								\
>> +	atomic_long_t *__mem;					\
>> +	__mem = (sk)->sk_prot->memory_allocated(sk->sk_cgrp);	\
>> +	__mem;							\
>> +})
>> +
>> +#define sk_prot_mem(sk)						\
>> +({								\
>> +	long *__mem = (sk)->sk_prot->prot_mem(sk->sk_cgrp);	\
>> +	__mem;							\
>> +})
>> +
>> +#define sg_memory_pressure(prot, sg)				\
>> +({								\
>> +	int *__ret = NULL;					\
>> +	if (prot->memory_pressure)				\
>> +		__ret = (prot)->memory_pressure(sg);		\
>> +	__ret;							\
>> +})
>> +
>> +#define sg_memory_allocated(prot, sg)				\
>> +({								\
>> +	atomic_long_t *__mem;					\
>> +	__mem = (prot)->memory_allocated(sg);			\
>> +	__mem;							\
>> +})
>> +
>> +#define sg_sockets_allocated(prot, sg)				\
>> +({								\
>> +	struct percpu_counter *__p;				\
>> +	__p = (prot)->sockets_allocated(sg);			\
>> +	__p;							\
>> +})
>> +
>
> Are all of these functions worth inlining?
>
>
>
>>   extern int proto_register(struct proto *prot, int alloc_slab);
>>   extern void proto_unregister(struct proto *prot);
>>
>> diff --git a/include/net/tcp.h b/include/net/tcp.h
>> index 149a415..8e1ec4a 100644
>> --- a/include/net/tcp.h
>> +++ b/include/net/tcp.h
>> @@ -230,7 +230,6 @@ extern int sysctl_tcp_fack;
>>   extern int sysctl_tcp_reordering;
>>   extern int sysctl_tcp_ecn;
>>   extern int sysctl_tcp_dsack;
>> -extern long sysctl_tcp_mem[3];
>>   extern int sysctl_tcp_wmem[3];
>>   extern int sysctl_tcp_rmem[3];
>>   extern int sysctl_tcp_app_win;
>> @@ -253,9 +252,12 @@ extern int sysctl_tcp_cookie_size;
>>   extern int sysctl_tcp_thin_linear_timeouts;
>>   extern int sysctl_tcp_thin_dupack;
>>
>> -extern atomic_long_t tcp_memory_allocated;
>> -extern struct percpu_counter tcp_sockets_allocated;
>> -extern int tcp_memory_pressure;
>> +struct kmem_cgroup;
>> +extern long *tcp_sysctl_mem(struct kmem_cgroup *sg);
>> +struct percpu_counter *sockets_allocated_tcp(struct kmem_cgroup *sg);
>> +int *memory_pressure_tcp(struct kmem_cgroup *sg);
>> +int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss);
>> +atomic_long_t *memory_allocated_tcp(struct kmem_cgroup *sg);
>>
>>   /*
>>    * The next routines deal with comparing 32 bit unsigned ints
>> @@ -286,7 +288,7 @@ static inline bool tcp_too_many_orphans(struct sock *sk, int shift)
>>   	}
>>
>>   	if (sk->sk_wmem_queued > SOCK_MIN_SNDBUF &&
>> -	    atomic_long_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])
>> +	    atomic_long_read(sk_memory_allocated(sk)) > sk_prot_mem(sk)[2])
>
> Why doesn't sk_memory_allocated() return the value?
> Is it required to return a pointer?
>
>
>>   		return true;
>>   	return false;
>>   }
>> diff --git a/include/net/udp.h b/include/net/udp.h
>> index 67ea6fc..0e27388 100644
>> --- a/include/net/udp.h
>> +++ b/include/net/udp.h
>> @@ -105,7 +105,8 @@ static inline struct udp_hslot *udp_hashslot2(struct udp_table *table,
>>
>>   extern struct proto udp_prot;
>>
>> -extern atomic_long_t udp_memory_allocated;
>> +atomic_long_t *memory_allocated_udp(struct kmem_cgroup *sg);
>> +long *udp_sysctl_mem(struct kmem_cgroup *sg);
>>
>>   /* sysctl variables for udp */
>>   extern long sysctl_udp_mem[3];
>> diff --git a/include/trace/events/sock.h b/include/trace/events/sock.h
>> index 779abb9..12a6083 100644
>> --- a/include/trace/events/sock.h
>> +++ b/include/trace/events/sock.h
>> @@ -37,7 +37,7 @@ TRACE_EVENT(sock_exceed_buf_limit,
>>
>>   	TP_STRUCT__entry(
>>   		__array(char, name, 32)
>> -		__field(long *, sysctl_mem)
>> +		__field(long *, prot_mem)
>>   		__field(long, allocated)
>>   		__field(int, sysctl_rmem)
>>   		__field(int, rmem_alloc)
>> @@ -45,7 +45,7 @@ TRACE_EVENT(sock_exceed_buf_limit,
>>
>>   	TP_fast_assign(
>>   		strncpy(__entry->name, prot->name, 32);
>> -		__entry->sysctl_mem = prot->sysctl_mem;
>> +		__entry->prot_mem = sk->sk_prot->prot_mem(sk->sk_cgrp);
>>   		__entry->allocated = allocated;
>>   		__entry->sysctl_rmem = prot->sysctl_rmem[0];
>>   		__entry->rmem_alloc = atomic_read(&sk->sk_rmem_alloc);
>> @@ -54,9 +54,9 @@ TRACE_EVENT(sock_exceed_buf_limit,
>>   	TP_printk("proto:%s sysctl_mem=%ld,%ld,%ld allocated=%ld "
>>   		"sysctl_rmem=%d rmem_alloc=%d",
>>   		__entry->name,
>> -		__entry->sysctl_mem[0],
>> -		__entry->sysctl_mem[1],
>> -		__entry->sysctl_mem[2],
>> +		__entry->prot_mem[0],
>> +		__entry->prot_mem[1],
>> +		__entry->prot_mem[2],
>>   		__entry->allocated,
>>   		__entry->sysctl_rmem,
>>   		__entry->rmem_alloc)
>> diff --git a/init/Kconfig b/init/Kconfig
>> index d627783..5955ac2 100644
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -690,6 +690,17 @@ config CGROUP_MEM_RES_CTLR_SWAP_ENABLED
>>   	  select this option (if, for some reason, they need to disable it
>>   	  then swapaccount=0 does the trick).
>>
>> +config CGROUP_KMEM
>> +	bool "Kernel Memory Resource Controller for Control Groups"
>> +	depends on CGROUPS
>> +	help
>> +	  The Kernel Memory cgroup can limit the amount of memory used by
>> +	  certain kernel objects in the system. Those are fundamentally
>> +	  different from the entities handled by the Memory Controller,
>> +	  which are page-based, and can be swapped. Users of the kmem
>> +	  cgroup can use it to guarantee that no group of processes will
>> +	  ever exhaust kernel resources alone.
>> +
>
> This help seems nice but please add Documentation/cgroup/kmem.
>
>
>
>
>>   config CGROUP_PERF
>>   	bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
>>   	depends on PERF_EVENTS && CGROUPS
>> diff --git a/mm/Makefile b/mm/Makefile
>> index 836e416..1b1aa24 100644
>> --- a/mm/Makefile
>> +++ b/mm/Makefile
>> @@ -45,6 +45,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
>>   obj-$(CONFIG_QUICKLIST) += quicklist.o
>>   obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
>>   obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
>> +obj-$(CONFIG_CGROUP_KMEM) += kmem_cgroup.o
>>   obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
>>   obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
>>   obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
>> diff --git a/net/core/sock.c b/net/core/sock.c
>> index 3449df8..2d968ea 100644
>> --- a/net/core/sock.c
>> +++ b/net/core/sock.c
>> @@ -134,6 +134,24 @@
>>   #include <net/tcp.h>
>>   #endif
>>
>> +static DEFINE_RWLOCK(proto_list_lock);
>> +static LIST_HEAD(proto_list);
>> +
>> +int sockets_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
>> +{
>> +	struct proto *proto;
>> +	int ret = 0;
>> +
>> +	read_lock(&proto_list_lock);
>> +	list_for_each_entry(proto, &proto_list, node) {
>> +		if (proto->init_cgroup)
>> +			ret |= proto->init_cgroup(cgrp, ss);
>> +	}
>> +	read_unlock(&proto_list_lock);
>> +
>> +	return ret;
>> +}
>> +
>>   /*
>>    * Each address family might have different locking rules, so we have
>>    * one slock key per address family:
>> @@ -1114,6 +1132,31 @@ void sock_update_classid(struct sock *sk)
>>   EXPORT_SYMBOL(sock_update_classid);
>>   #endif
>>
>> +void sock_update_kmem_cgrp(struct sock *sk)
>> +{
>> +#ifdef CONFIG_CGROUP_KMEM
>> +	sk->sk_cgrp = kcg_from_task(current);
>> +
>> +	/*
>> +	 * We don't need to protect against anything task-related, because
>> +	 * we are basically stuck with the sock pointer that won't change,
>> +	 * even if the task that originated the socket changes cgroups.
>> +	 *
>> +	 * What we do have to guarantee, is that the chain leading us to
>> +	 * the top level won't change under our noses. Incrementing the
>> +	 * reference count via cgroup_exclude_rmdir guarantees that.
>> +	 */
>> +	cgroup_exclude_rmdir(&sk->sk_cgrp->css);
>> +#endif
>> +}
>
> I'm not sure this kind of bare cgroup code in core/sock.c will be
> welcomed by network guys.
>
>
>
>
>> +
>> +void sock_release_kmem_cgrp(struct sock *sk)
>> +{
>> +#ifdef CONFIG_CGROUP_KMEM
>> +	cgroup_release_and_wakeup_rmdir(&sk->sk_cgrp->css);
>> +#endif
>> +}
>> +
>>   /**
>>    *	sk_alloc - All socket objects are allocated here
>>    *	@net: the applicable net namespace
>> @@ -1139,6 +1182,7 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
>>   		atomic_set(&sk->sk_wmem_alloc, 1);
>>
>>   		sock_update_classid(sk);
>> +		sock_update_kmem_cgrp(sk);
>>   	}
>>
>>   	return sk;
>> @@ -1170,6 +1214,7 @@ static void __sk_free(struct sock *sk)
>>   		put_cred(sk->sk_peer_cred);
>>   	put_pid(sk->sk_peer_pid);
>>   	put_net(sock_net(sk));
>> +	sock_release_kmem_cgrp(sk);
>>   	sk_prot_free(sk->sk_prot_creator, sk);
>>   }
>>
>> @@ -1287,8 +1332,8 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
>>   		sk_set_socket(newsk, NULL);
>>   		newsk->sk_wq = NULL;
>>
>> -		if (newsk->sk_prot->sockets_allocated)
>> -			percpu_counter_inc(newsk->sk_prot->sockets_allocated);
>> +		if (sk_sockets_allocated(sk))
>> +			percpu_counter_inc(sk_sockets_allocated(sk));
> How about
> 	sk_sockets_allocated_inc(sk);
> ?
>
>
>>
>>   		if (sock_flag(newsk, SOCK_TIMESTAMP) ||
>>   		    sock_flag(newsk, SOCK_TIMESTAMPING_RX_SOFTWARE))
>> @@ -1676,29 +1721,51 @@ EXPORT_SYMBOL(sk_wait_data);
>>    */
>>   int __sk_mem_schedule(struct sock *sk, int size, int kind)
>>   {
>> -	struct proto *prot = sk->sk_prot;
>>   	int amt = sk_mem_pages(size);
>> +	struct proto *prot = sk->sk_prot;
>>   	long allocated;
>> +	int *memory_pressure;
>> +	long *prot_mem;
>> +	int parent_failure = 0;
>> +	struct kmem_cgroup *sg;
>>
>>   	sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
>> -	allocated = atomic_long_add_return(amt, prot->memory_allocated);
>> +
>> +	memory_pressure = sk_memory_pressure(sk);
>> +	prot_mem = sk_prot_mem(sk);
>> +
>> +	allocated = atomic_long_add_return(amt, sk_memory_allocated(sk));
>> +
>> +#ifdef CONFIG_CGROUP_KMEM
>> +	for (sg = sk->sk_cgrp->parent; sg != NULL; sg = sg->parent) {
>> +		long alloc;
>> +		/*
>> +		 * Large nestings are not the common case, and stopping in the
>> +		 * middle would be complicated enough, that we bill it all the
>> +		 * way through the root, and if needed, unbill everything later
>> +		 */
>> +		alloc = atomic_long_add_return(amt,
>> +					       sg_memory_allocated(prot, sg));
>> +		parent_failure |= (alloc > sk_prot_mem(sk)[2]);
>> +	}
>> +#endif
>> +
>> +	/* Over hard limit (we, or our parents) */
>> +	if (parent_failure || (allocated > prot_mem[2]))
>> +		goto suppress_allocation;
>>
>>   	/* Under limit. */
>> -	if (allocated <= prot->sysctl_mem[0]) {
>> -		if (prot->memory_pressure && *prot->memory_pressure)
>> -			*prot->memory_pressure = 0;
>> +	if (allocated <= prot_mem[0]) {
>> +		if (memory_pressure && *memory_pressure)
>> +			*memory_pressure = 0;
>>   		return 1;
>>   	}
>>
>>   	/* Under pressure. */
>> -	if (allocated > prot->sysctl_mem[1])
>> +	if (allocated > prot_mem[1])
>>   		if (prot->enter_memory_pressure)
>>   			prot->enter_memory_pressure(sk);
>>
>> -	/* Over hard limit. */
>> -	if (allocated > prot->sysctl_mem[2])
>> -		goto suppress_allocation;
>> -
>>   	/* guarantee minimum buffer size under pressure */
>>   	if (kind == SK_MEM_RECV) {
>>   		if (atomic_read(&sk->sk_rmem_alloc) < prot->sysctl_rmem[0])
>> @@ -1712,13 +1779,13 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
>>   				return 1;
>>   	}
>>
>> -	if (prot->memory_pressure) {
>> +	if (memory_pressure) {
>>   		int alloc;
>>
>> -		if (!*prot->memory_pressure)
>> +		if (!*memory_pressure)
>>   			return 1;
>> -		alloc = percpu_counter_read_positive(prot->sockets_allocated);
>> -		if (prot->sysctl_mem[2] > alloc *
>> +		alloc = percpu_counter_read_positive(sk_sockets_allocated(sk));
>> +		if (prot_mem[2] > alloc *
>>   		    sk_mem_pages(sk->sk_wmem_queued +
>>   				 atomic_read(&sk->sk_rmem_alloc) +
>>   				 sk->sk_forward_alloc))
>> @@ -1741,7 +1808,13 @@ suppress_allocation:
>>
>>   	/* Alas. Undo changes. */
>>   	sk->sk_forward_alloc -= amt * SK_MEM_QUANTUM;
>> -	atomic_long_sub(amt, prot->memory_allocated);
>> +
>> +	atomic_long_sub(amt, sk_memory_allocated(sk));
>> +
>> +#ifdef CONFIG_CGROUP_KMEM
>> +	for (sg = sk->sk_cgrp->parent; sg != NULL; sg = sg->parent)
>> +		atomic_long_sub(amt, sg_memory_allocated(prot, sg));
>> +#endif
>>   	return 0;
>>   }
>>   EXPORT_SYMBOL(__sk_mem_schedule);
>> @@ -1753,14 +1826,24 @@ EXPORT_SYMBOL(__sk_mem_schedule);
>>   void __sk_mem_reclaim(struct sock *sk)
>>   {
>>   	struct proto *prot = sk->sk_prot;
>> +	struct kmem_cgroup *sg = sk->sk_cgrp;
>> +	int *memory_pressure = sk_memory_pressure(sk);
>>
>>   	atomic_long_sub(sk->sk_forward_alloc >> SK_MEM_QUANTUM_SHIFT,
>> -		   prot->memory_allocated);
>> +		   sk_memory_allocated(sk));
>> +
>> +#ifdef CONFIG_CGROUP_KMEM
>> +	for (sg = sk->sk_cgrp->parent; sg != NULL; sg = sg->parent) {
>> +		atomic_long_sub(sk->sk_forward_alloc >> SK_MEM_QUANTUM_SHIFT,
>> +						sg_memory_allocated(prot, sg));
>> +	}
>> +#endif
>> +
>>   	sk->sk_forward_alloc &= SK_MEM_QUANTUM - 1;
>>
>> -	if (prot->memory_pressure && *prot->memory_pressure &&
>> -	    (atomic_long_read(prot->memory_allocated) < prot->sysctl_mem[0]))
>> -		*prot->memory_pressure = 0;
>> +	if (memory_pressure && *memory_pressure &&
>> +	    (atomic_long_read(sk_memory_allocated(sk)) < sk_prot_mem(sk)[0]))
>> +		*memory_pressure = 0;
>>   }
>>   EXPORT_SYMBOL(__sk_mem_reclaim);
>>
>
> IMHO, I'd like to hide the atomic_long_xxxx ops under kmem cgroup ops.
>
> And use callbacks like
> 	kmem_cgroup_read(SOCKET_MEM_ALLOCATED, sk)
>
> If other components use kmem_cgroup, a generic interface will be
> helpful because an optimization or fix in the generic interface will
> improve all users of kmem_cgroup.
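
For readers following the hierarchy accounting in the quoted
__sk_mem_schedule() hunk, here is a minimal userspace sketch of the
"bill everything up to the root, unbill on failure" pattern. It is not
code from the patch: struct acct_group and charge_hierarchy() are
hypothetical names, plain longs stand in for atomic_long_t, and each
level is checked against its own limit, which is a simplification of
what the patch does.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one cgroup level; not the kernel's struct. */
struct acct_group {
	struct acct_group *parent;
	long allocated;		/* pages currently charged to this level */
	long hard_limit;	/* plays the role of prot_mem[2] */
};

/*
 * Charge amt pages to grp and to every ancestor. If any level ends up
 * over its hard limit, undo the charge on all levels and report failure.
 */
static bool charge_hierarchy(struct acct_group *grp, long amt)
{
	struct acct_group *g;
	bool failed = false;

	/* Bill all the way to the root, remembering whether any level overflowed. */
	for (g = grp; g != NULL; g = g->parent) {
		g->allocated += amt;
		failed |= (g->allocated > g->hard_limit);
	}
	if (!failed)
		return true;

	/* Alas. Undo changes on every level that was billed. */
	for (g = grp; g != NULL; g = g->parent)
		g->allocated -= amt;
	return false;
}
```

Billing unconditionally and rolling back afterwards keeps the walk
simple at the cost of briefly over-counting on the failure path.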
>
>
>
>> @@ -2252,9 +2335,6 @@ void sk_common_release(struct sock *sk)
>>   }
>>   EXPORT_SYMBOL(sk_common_release);
>>
>> -static DEFINE_RWLOCK(proto_list_lock);
>> -static LIST_HEAD(proto_list);
>> -
>>   #ifdef CONFIG_PROC_FS
>>   #define PROTO_INUSE_NR	64	/* should be enough for the first time */
>>   struct prot_inuse {
>> @@ -2479,13 +2559,17 @@ static char proto_method_implemented(const void *method)
>>
>>   static void proto_seq_printf(struct seq_file *seq, struct proto *proto)
>>   {
>> +	struct kmem_cgroup *sg = kcg_from_task(current);
>> +
>>   	seq_printf(seq, "%-9s %4u %6d  %6ld   %-3s %6u   %-3s  %-10s "
>>   			"%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
>>   		   proto->name,
>>   		   proto->obj_size,
>>   		   sock_prot_inuse_get(seq_file_net(seq), proto),
>> -		   proto->memory_allocated != NULL ? atomic_long_read(proto->memory_allocated) : -1L,
>> -		   proto->memory_pressure != NULL ? *proto->memory_pressure ? "yes" : "no" : "NI",
>> +		   proto->memory_allocated != NULL ?
>> +			atomic_long_read(sg_memory_allocated(proto, sg)) : -1L,
>> +		   proto->memory_pressure != NULL ?
>> +			*sg_memory_pressure(proto, sg) ? "yes" : "no" : "NI",
>>   		   proto->max_header,
>>   		   proto->slab == NULL ? "no" : "yes",
>>   		   module_name(proto->owner),
>> diff --git a/net/decnet/af_decnet.c b/net/decnet/af_decnet.c
>> index 19acd00..463b299 100644
>> --- a/net/decnet/af_decnet.c
>> +++ b/net/decnet/af_decnet.c
>> @@ -458,13 +458,28 @@ static void dn_enter_memory_pressure(struct sock *sk)
>>   	}
>>   }
>>
>> +static atomic_long_t *memory_allocated_dn(struct kmem_cgroup *sg)
>> +{
>> +	return &decnet_memory_allocated;
>> +}
>> +
>> +static int *memory_pressure_dn(struct kmem_cgroup *sg)
>> +{
>> +	return &dn_memory_pressure;
>> +}
>> +
>> +static long *dn_sysctl_mem(struct kmem_cgroup *sg)
>> +{
>> +	return sysctl_decnet_mem;
>> +}
>> +
>>   static struct proto dn_proto = {
>>   	.name			= "NSP",
>>   	.owner			= THIS_MODULE,
>>   	.enter_memory_pressure	= dn_enter_memory_pressure,
>> -	.memory_pressure	= &dn_memory_pressure,
>> -	.memory_allocated	= &decnet_memory_allocated,
>> -	.sysctl_mem		= sysctl_decnet_mem,
>> +	.memory_pressure	= memory_pressure_dn,
>> +	.memory_allocated	= memory_allocated_dn,
>> +	.prot_mem		= dn_sysctl_mem,
>>   	.sysctl_wmem		= sysctl_decnet_wmem,
>>   	.sysctl_rmem		= sysctl_decnet_rmem,
>>   	.max_header		= DN_MAX_NSP_DATA_HEADER + 64,
>> diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
>> index b14ec7d..9c80acf 100644
>> --- a/net/ipv4/proc.c
>> +++ b/net/ipv4/proc.c
>> @@ -52,20 +52,22 @@ static int sockstat_seq_show(struct seq_file *seq, void *v)
>>   {
>>   	struct net *net = seq->private;
>>   	int orphans, sockets;
>> +	struct kmem_cgroup *sg = kcg_from_task(current);
>> +	struct percpu_counter *allocated = sg_sockets_allocated(&tcp_prot, sg);
>>
>>   	local_bh_disable();
>>   	orphans = percpu_counter_sum_positive(&tcp_orphan_count);
>> -	sockets = percpu_counter_sum_positive(&tcp_sockets_allocated);
>> +	sockets = percpu_counter_sum_positive(allocated);
>>   	local_bh_enable();
>>
>>   	socket_seq_show(seq);
>>   	seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %ld\n",
>>   		   sock_prot_inuse_get(net, &tcp_prot), orphans,
>>   		   tcp_death_row.tw_count, sockets,
>> -		   atomic_long_read(&tcp_memory_allocated));
>> +		   atomic_long_read(sg_memory_allocated((&tcp_prot), sg)));
>>   	seq_printf(seq, "UDP: inuse %d mem %ld\n",
>>   		   sock_prot_inuse_get(net, &udp_prot),
>> -		   atomic_long_read(&udp_memory_allocated));
>> +		   atomic_long_read(sg_memory_allocated((&udp_prot), sg)));
>>   	seq_printf(seq, "UDPLITE: inuse %d\n",
>>   		   sock_prot_inuse_get(net, &udplite_prot));
>>   	seq_printf(seq, "RAW: inuse %d\n",
>> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
>> index 69fd720..5e89480 100644
>> --- a/net/ipv4/sysctl_net_ipv4.c
>> +++ b/net/ipv4/sysctl_net_ipv4.c
>> @@ -14,6 +14,8 @@
>>   #include <linux/init.h>
>>   #include <linux/slab.h>
>>   #include <linux/nsproxy.h>
>> +#include <linux/kmem_cgroup.h>
>> +#include <linux/swap.h>
>>   #include <net/snmp.h>
>>   #include <net/icmp.h>
>>   #include <net/ip.h>
>> @@ -174,6 +176,43 @@ static int proc_allowed_congestion_control(ctl_table *ctl,
>>   	return ret;
>>   }
>>
>> +static int ipv4_tcp_mem(ctl_table *ctl, int write,
>> +			   void __user *buffer, size_t *lenp,
>> +			   loff_t *ppos)
>> +{
>> +	int ret;
>> +	unsigned long vec[3];
>> +	struct kmem_cgroup *kmem = kcg_from_task(current);
>> +	struct net *net = current->nsproxy->net_ns;
>> +	int i;
>> +
>> +	ctl_table tmp = {
>> +		.data = &vec,
>> +		.maxlen = sizeof(vec),
>> +		.mode = ctl->mode,
>> +	};
>> +
>> +	if (!write) {
>> +		ctl->data = &net->ipv4.sysctl_tcp_mem;
>> +		return proc_doulongvec_minmax(ctl, write, buffer, lenp, ppos);
>> +	}
>> +
>> +	ret = proc_doulongvec_minmax(&tmp, write, buffer, lenp, ppos);
>> +	if (ret)
>> +		return ret;
>> +
>> +	for (i = 0; i < 3; i++)
>> +		if (vec[i] > kmem->tcp_max_memory)
>> +			return -EINVAL;
>> +
>> +	for (i = 0; i < 3; i++) {
>> +		net->ipv4.sysctl_tcp_mem[i] = vec[i];
>> +		kmem->tcp_prot_mem[i] = net->ipv4.sysctl_tcp_mem[i];
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>>   static struct ctl_table ipv4_table[] = {
>>   	{
>>   		.procname	= "tcp_timestamps",
>> @@ -433,13 +472,6 @@ static struct ctl_table ipv4_table[] = {
>>   		.proc_handler	= proc_dointvec
>>   	},
>>   	{
>> -		.procname	= "tcp_mem",
>> -		.data		= &sysctl_tcp_mem,
>> -		.maxlen		= sizeof(sysctl_tcp_mem),
>> -		.mode		= 0644,
>> -		.proc_handler	= proc_doulongvec_minmax
>> -	},
>> -	{
>>   		.procname	= "tcp_wmem",
>>   		.data		= &sysctl_tcp_wmem,
>>   		.maxlen		= sizeof(sysctl_tcp_wmem),
>> @@ -721,6 +753,12 @@ static struct ctl_table ipv4_net_table[] = {
>>   		.mode		= 0644,
>>   		.proc_handler	= ipv4_ping_group_range,
>>   	},
>> +	{
>> +		.procname	= "tcp_mem",
>> +		.maxlen		= sizeof(init_net.ipv4.sysctl_tcp_mem),
>> +		.mode		= 0644,
>> +		.proc_handler	= ipv4_tcp_mem,
>> +	},
>>   	{ }
>>   };
>>
>> @@ -734,6 +772,7 @@ EXPORT_SYMBOL_GPL(net_ipv4_ctl_path);
>>   static __net_init int ipv4_sysctl_init_net(struct net *net)
>>   {
>>   	struct ctl_table *table;
>> +	unsigned long limit;
>>
>>   	table = ipv4_net_table;
>>   	if (!net_eq(net, &init_net)) {
>> @@ -769,6 +808,12 @@ static __net_init int ipv4_sysctl_init_net(struct net *net)
>>
>>   	net->ipv4.sysctl_rt_cache_rebuild_count = 4;
>>
>> +	limit = nr_free_buffer_pages() / 8;
>> +	limit = max(limit, 128UL);
>> +	net->ipv4.sysctl_tcp_mem[0] = limit / 4 * 3;
>> +	net->ipv4.sysctl_tcp_mem[1] = limit;
>> +	net->ipv4.sysctl_tcp_mem[2] = net->ipv4.sysctl_tcp_mem[0] * 2;
>> +
>
> What does this calculation mean? Is it documented somewhere?
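
As a worked illustration of the arithmetic in the hunk above (a sketch
only; tcp_mem_defaults() is a hypothetical helper, not a function from
the patch): the three values are counted in pages, and
__sk_mem_schedule() in the same patch treats index 0 as the level at or
below which pressure is cleared, index 1 as the level above which
memory pressure is entered, and index 2 as the hard limit above which
the allocation is suppressed.

```c
/*
 * Hypothetical helper reproducing the arithmetic of the quoted hunk.
 * free_pages stands in for the value of nr_free_buffer_pages().
 */
static void tcp_mem_defaults(unsigned long free_pages, unsigned long mem[3])
{
	unsigned long limit = free_pages / 8;

	if (limit < 128UL)		/* same effect as max(limit, 128UL) */
		limit = 128UL;

	mem[0] = limit / 4 * 3;		/* at or below this: not under pressure */
	mem[1] = limit;			/* above this: enter memory pressure */
	mem[2] = mem[0] * 2;		/* above this: suppress the allocation */
}
```

With 1048576 free pages (4 GiB of 4 KiB pages) this gives 98304, 131072
and 196608 pages; with very little free memory the floor of 128 yields
96, 128 and 192.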
>
>
>
>>   	net->ipv4.ipv4_hdr = register_net_sysctl_table(net,
>>   			net_ipv4_ctl_path, table);
>>   	if (net->ipv4.ipv4_hdr == NULL)
>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>> index 46febca..e1918fa 100644
>> --- a/net/ipv4/tcp.c
>> +++ b/net/ipv4/tcp.c
>> @@ -266,6 +266,7 @@
>>   #include <linux/crypto.h>
>>   #include <linux/time.h>
>>   #include <linux/slab.h>
>> +#include <linux/nsproxy.h>
>>
>>   #include <net/icmp.h>
>>   #include <net/tcp.h>
>> @@ -282,23 +283,12 @@ int sysctl_tcp_fin_timeout __read_mostly = TCP_FIN_TIMEOUT;
>>   struct percpu_counter tcp_orphan_count;
>>   EXPORT_SYMBOL_GPL(tcp_orphan_count);
>>
>> -long sysctl_tcp_mem[3] __read_mostly;
>>   int sysctl_tcp_wmem[3] __read_mostly;
>>   int sysctl_tcp_rmem[3] __read_mostly;
>>
>> -EXPORT_SYMBOL(sysctl_tcp_mem);
>>   EXPORT_SYMBOL(sysctl_tcp_rmem);
>>   EXPORT_SYMBOL(sysctl_tcp_wmem);
>>
>> -atomic_long_t tcp_memory_allocated;	/* Current allocated memory. */
>> -EXPORT_SYMBOL(tcp_memory_allocated);
>> -
>> -/*
>> - * Current number of TCP sockets.
>> - */
>> -struct percpu_counter tcp_sockets_allocated;
>> -EXPORT_SYMBOL(tcp_sockets_allocated);
>> -
>>   /*
>>    * TCP splice context
>>    */
>> @@ -308,23 +298,157 @@ struct tcp_splice_state {
>>   	unsigned int flags;
>>   };
>>
>> +#ifdef CONFIG_CGROUP_KMEM
>>   /*
>>    * Pressure flag: try to collapse.
>>    * Technical note: it is used by multiple contexts non atomically.
>>    * All the __sk_mem_schedule() is of this nature: accounting
>>    * is strict, actions are advisory and have some latency.
>>    */
>> -int tcp_memory_pressure __read_mostly;
>> -EXPORT_SYMBOL(tcp_memory_pressure);
>> -
>>   void tcp_enter_memory_pressure(struct sock *sk)
>>   {
>> +	struct kmem_cgroup *sg = sk->sk_cgrp;
>> +	if (!sg->tcp_memory_pressure) {
>> +		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPMEMORYPRESSURES);
>> +		sg->tcp_memory_pressure = 1;
>> +	}
>> +}
>> +
>> +long *tcp_sysctl_mem(struct kmem_cgroup *sg)
>> +{
>> +	return sg->tcp_prot_mem;
>> +}
>> +
>> +atomic_long_t *memory_allocated_tcp(struct kmem_cgroup *sg)
>> +{
>> +	return &(sg->tcp_memory_allocated);
>> +}
>> +
>> +static int tcp_write_maxmem(struct cgroup *cgrp, struct cftype *cft, u64 val)
>> +{
>> +	struct kmem_cgroup *sg = kcg_from_cgroup(cgrp);
>> +	struct net *net = current->nsproxy->net_ns;
>> +	int i;
>> +
>> +	if (!cgroup_lock_live_group(cgrp))
>> +		return -ENODEV;
>> +
>> +	/*
>> +	 * We can't allow more memory than our parents. Since this
>> +	 * will be tested for all calls, by induction, there is no need
>> +	 * to test any parent other than our own
>> +	 * */
>> +	if (sg->parent && (val > sg->parent->tcp_max_memory))
>> +		val = sg->parent->tcp_max_memory;
>> +
>> +	sg->tcp_max_memory = val;
>> +
>> +	for (i = 0; i < 3; i++)
>> +		sg->tcp_prot_mem[i]  = min_t(long, val,
>> +					     net->ipv4.sysctl_tcp_mem[i]);
>> +
>> +	cgroup_unlock();
>> +
>> +	return 0;
>> +}
>> +
>
> Do we really need cgroup_lock/unlock?
>
>
>
>> +static u64 tcp_read_maxmem(struct cgroup *cgrp, struct cftype *cft)
>> +{
>> +	struct kmem_cgroup *sg = kcg_from_cgroup(cgrp);
>> +	u64 ret;
>> +
>> +	if (!cgroup_lock_live_group(cgrp))
>> +		return -ENODEV;
>> +	ret = sg->tcp_max_memory;
>> +
>> +	cgroup_unlock();
>> +	return ret;
>> +}
>> +
>
>
> Hmm, can't you implement this function as
>
> 	kmem_cgroup_read(SOCK_TCP_MAXMEM, sg);
>
> ? What do you think?
>
> Thanks,
> -Kame
>


* Re: [PATCH -next v2] unix stream: Fix use-after-free crashes
From: Yan, Zheng @ 2011-09-06 23:09 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tim Chen, Yan, Zheng, netdev@vger.kernel.org, davem@davemloft.net,
	sfr@canb.auug.org.au, jirislaby@gmail.com, sedat.dilek@gmail.com,
	alex.shi
In-Reply-To: <1315340388.3400.28.camel@edumazet-laptop>

On Wed, Sep 7, 2011 at 4:19 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tuesday, 06 September 2011 at 12:59 -0700, Tim Chen wrote:
>> On Tue, 2011-09-06 at 21:43 +0200, Eric Dumazet wrote:
>> > On Tuesday, 06 September 2011 at 12:33 -0700, Tim Chen wrote:
>> >
>> > > Yes, I think locking the sendmsg for the entire duration of
>> > > unix_stream_sendmsg makes a lot of sense.  It simplifies the logic a lot
>> > > more.  I'll try to cook something up in the next couple of days.
>> >
>> > That's not really possible, we can't hold a spinlock and call
>> > sock_alloc_send_skb() and/or memcpy_fromiovec(), which might sleep.
>> >
>> > You would need to prepare the full skb list, then :
>> > - stick the ref on the last skb of the list.
>> >
>> > Transfer the whole skb list to other->sk_receive_queue in one go,
>> > instead of one after another.
>> >
>> > Unfortunately, this would break streaming (big send(), and another
>> > thread doing the receive)
>> >
>> > Listen, I am wondering why hackbench even triggers SCM code. This is
>> > really odd. We should not have a _single_ pid/cred ref/unref at all.
>> >
>>
>> Hackbench triggers the code because it has a bunch of threads sending
>> msgs on UNIX socket.
>> >
>>
>> Well, if the lock socket approach doesn't work, then my original patch
>> plus Yan Zheng's fix should still work.  I'll try to answer your
>> objections below:
>>
>>
>> > I was discussing things after the proposed patch, not current net-next.
>> >
>> > This reads :
>> >
>> > err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, scm_ref);
>> >
>> > So first skb is sent without ref taken, as mentioned in Changelog ?
>> >
>>
>> No. the first skb is sent *with* ref taken, as scm_ref is set to true for
>> first skb.
>>
>> >
>> > If second skb cannot be built, we exit this system call with an already
>> > queued skb. Receiver can then access to freed memory.
>> >
>>
>> No, we do have reference set.  For first skb, in unix_scm_to_skb.  For the
>> second skb (which is the last skb), in scm_sent.  Should the second skb alloc fail,
>> we'll release the ref in scm_destroy.  Otherwise, the receiver will release
>> the references when consuming the skb.
>>
>
> This is crap. This is not the intent of the code I read from the patch.
>
> unless scm_ref really means scm_noref ?
>
> I really hate this patch. I mean it.
>
> I read it 10 times, spent 2 hours and still don't understand it.
>

Sorry, scm_ref means "the sender holds an scm reference". I should add a comment for it.
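
To make the intended meaning of scm_ref concrete, here is a small
userspace model of the reference counting, assuming the reading given
in this thread: one reference is taken up front, every skb except the
last takes its own reference, and the last skb inherits the sender's.
This is only a sketch; min_refs_while_sending() is a hypothetical name,
a plain int stands in for the pid/cred refcounts, and the error path
(scm_destroy) is not modelled.

```c
#include <stdbool.h>

/*
 * Model sending len bytes in skbs of at most chunk bytes, with a
 * receiver that consumes (and releases) every skb as soon as it is
 * queued. Returns the lowest reference count seen at any point where
 * the sender still uses the credentials.
 */
static int min_refs_while_sending(int len, int chunk)
{
	int refs = 1;		/* reference obtained before the loop */
	int min_refs = refs;
	int sent = 0;

	while (sent < len) {
		int size = (len - sent < chunk) ? len - sent : chunk;
		/* true while the sender must keep its own reference */
		bool scm_ref = (sent + size < len);

		if (scm_ref)
			refs++;	/* skb takes a reference of its own */
		/* else: the last skb takes over the sender's reference */

		if (refs < min_refs)
			min_refs = refs;

		refs--;		/* worst case: receiver frees the skb at once */
		sent += size;
	}
	return min_refs;
}
```

In this model the count never drops below 1 while the sender is still
building skbs, which is the property the fix is meant to provide.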

>
> @@ -1577,6 +1577,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
>        int sent = 0;
>        struct scm_cookie tmp_scm;
>        bool fds_sent = false;
> +       bool scm_ref = true;
>        int max_level;
>
>        if (NULL == siocb->scm)
> @@ -1637,12 +1638,15 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
>                 */
>                size = min_t(int, size, skb_tailroom(skb));
>
> +               /* pass the scm reference to the very last skb */
>
> HERE: I understand : on the last skb, set scm_ref to false.
> So comment is wrong.
>
> +               if (sent + size >= len)
> +                       scm_ref = false;
>
> -               /* Only send the fds and no ref to pid in the first buffer */
> -               err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, fds_sent);
> +               /* Only send the fds in the first buffer */
> +               err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, scm_ref);
>                if (err < 0) {
>                        kfree_skb(skb);
> -                       goto out;
> +                       goto out_err;
>                }
>
>
>
> As I said, we should revert the buggy patch, and rewrite a performance
> fix from scratch, with not a single get_pid()/put_pid() in fast path.
>
> read()/write() on AF_UNIX sockets should not use a single
> get_pid()/put_pid().
>
> This is a serious regression we should fix at 100%, not 50% or even 75%,
> adding serious bugs.
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


* [PATCH net-next 0/4] ethtool: Fixes for RX NFC API
From: Ben Hutchings @ 2011-09-06 23:44 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Alexander Duyck, Sebastian Poehn, Santwona Behera

This series fixes some minor flaws in the documentation and the API itself.

Ben.

Ben Hutchings (4):
  ethtool: Make struct ethtool_rxnfc kernel-doc more self-consistent
  ethtool: Explicitly state that RX NFC rule locations are priorities
  ethtool: Clean up definitions of rule location arrays in RX NFC
  ethtool: Update ethtool_rxnfc::rule_cnt on return from
    ETHTOOL_GRXCLSRLALL

 .../net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c    |    2 +-
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c    |    2 +-
 drivers/net/ethernet/freescale/gianfar_ethtool.c   |    5 ++-
 drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c   |    7 ++--
 drivers/net/ethernet/sfc/ethtool.c                 |    2 +-
 drivers/net/ethernet/sun/niu.c                     |    6 ++-
 drivers/net/vmxnet3/vmxnet3_ethtool.c              |    2 +-
 include/linux/ethtool.h                            |   33 +++++++++----------
 8 files changed, 31 insertions(+), 28 deletions(-)

-- 
1.7.4.4


-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


* [PATCH net-next 1/4] ethtool: Make struct ethtool_rxnfc kernel-doc more self-consistent
From: Ben Hutchings @ 2011-09-06 23:48 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Alexander Duyck, Sebastian Poehn, Santwona Behera
In-Reply-To: <1315352693.2788.20.camel@bwh-desktop>

Refer consistently to 'classification rules' or just 'rules' rather
than 'filter specifications' or 'filter rules'.

Refer consistently to rule 'locations' and not 'indices'.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 include/linux/ethtool.h |   23 +++++++++++------------
 1 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 3829712..dffab51 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -446,7 +446,7 @@ struct ethtool_flow_ext {
 };
 
 /**
- * struct ethtool_rx_flow_spec - specification for RX flow filter
+ * struct ethtool_rx_flow_spec - classification rule for RX flows
  * @flow_type: Type of match to perform, e.g. %TCP_V4_FLOW
  * @h_u: Flow fields to match (dependent on @flow_type)
  * @h_ext: Additional fields to match
@@ -456,7 +456,7 @@ struct ethtool_flow_ext {
  *	includes the %FLOW_EXT flag.
  * @ring_cookie: RX ring/queue index to deliver to, or %RX_CLS_FLOW_DISC
  *	if packets should be discarded
- * @location: Index of filter in hardware table
+ * @location: Location of rule in the table
  */
 struct ethtool_rx_flow_spec {
 	__u32		flow_type;
@@ -475,9 +475,9 @@ struct ethtool_rx_flow_spec {
  *	%ETHTOOL_GRXCLSRLALL, %ETHTOOL_SRXCLSRLDEL or %ETHTOOL_SRXCLSRLINS
  * @flow_type: Type of flow to be affected, e.g. %TCP_V4_FLOW
  * @data: Command-dependent value
- * @fs: Flow filter specification
+ * @fs: Flow classification rule
  * @rule_cnt: Number of rules to be affected
- * @rule_locs: Array of valid rule indices
+ * @rule_locs: Array of valid rule locations
  *
  * For %ETHTOOL_GRXFH and %ETHTOOL_SRXFH, @data is a bitmask indicating
  * the fields included in the flow hash, e.g. %RXH_IP_SRC.  The following
@@ -489,20 +489,19 @@ struct ethtool_rx_flow_spec {
  * For %ETHTOOL_GRXCLSRLCNT, @rule_cnt is set to the number of defined
  * rules on return.
  *
- * For %ETHTOOL_GRXCLSRULE, @fs.@location specifies the index of an
- * existing filter rule on entry and @fs contains the rule on return.
+ * For %ETHTOOL_GRXCLSRULE, @fs.@location specifies the location of an
+ * existing rule on entry and @fs contains the rule on return.
  *
  * For %ETHTOOL_GRXCLSRLALL, @rule_cnt specifies the array size of the
  * user buffer for @rule_locs on entry.  On return, @data is the size
- * of the filter table and @rule_locs contains the indices of the
+ * of the rule table and @rule_locs contains the locations of the
  * defined rules.
  *
- * For %ETHTOOL_SRXCLSRLINS, @fs specifies the filter rule to add or
- * update.  @fs.@location specifies the index to use and must not be
- * ignored.
+ * For %ETHTOOL_SRXCLSRLINS, @fs specifies the rule to add or update.
+ * @fs.@location specifies the location to use and must not be ignored.
  *
- * For %ETHTOOL_SRXCLSRLDEL, @fs.@location specifies the index of an
- * existing filter rule on entry.
+ * For %ETHTOOL_SRXCLSRLDEL, @fs.@location specifies the location of an
+ * existing rule on entry.
  *
  * Implementation of indexed classification rules generally requires a
  * TCAM.
-- 
1.7.4.4



-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


* [PATCH net-next 2/4] ethtool: Explicitly state that RX NFC rule locations are priorities
From: Ben Hutchings @ 2011-09-06 23:48 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Alexander Duyck, Sebastian Poehn, Santwona Behera
In-Reply-To: <1315352693.2788.20.camel@bwh-desktop>

The location of an RX flow classification rule is needed to identify
it for retrieval, replacement or deletion.  However it also defines
the priority of the rule in the case that a flow is matched by
multiple rules.  This is what I intended to imply by referring to the
use of a TCAM, commonly used to implement that behaviour.

However there are other ways this can be done, and it is better to
specify this explicitly.  Further, I want to add the option for
automatic selection of rule locations.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
I hope we're all in agreement that this is how locations are supposed to
be numbered.

Ben.
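
A minimal sketch of the "lowest numbered matching rule wins" semantics
described in the changelog, for illustration only. struct sketch_rule
and classify() are hypothetical and far simpler than a real
ethtool_rx_flow_spec; a dst_port of 0 is used here as a wildcard.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified rule: matches on destination port only. */
struct sketch_rule {
	uint32_t location;	/* position in the table, also the priority */
	uint16_t dst_port;	/* 0 means "match any port" in this sketch */
	uint32_t ring;		/* RX ring to deliver to */
};

/*
 * Return the matching rule with the lowest location, or NULL if no
 * rule matches. The table does not need to be sorted.
 */
static const struct sketch_rule *classify(const struct sketch_rule *tbl,
					  size_t n, uint16_t dst_port)
{
	const struct sketch_rule *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (tbl[i].dst_port != 0 && tbl[i].dst_port != dst_port)
			continue;
		if (best == NULL || tbl[i].location < best->location)
			best = &tbl[i];
	}
	return best;
}
```

A flow that matches both a specific rule at location 2 and a wildcard
rule at location 7 is classified by the rule at location 2.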

 include/linux/ethtool.h |    7 +++----
 1 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index dffab51..58407e3 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -456,7 +456,9 @@ struct ethtool_flow_ext {
  *	includes the %FLOW_EXT flag.
  * @ring_cookie: RX ring/queue index to deliver to, or %RX_CLS_FLOW_DISC
  *	if packets should be discarded
- * @location: Location of rule in the table
+ * @location: Location of rule in the table.  Locations must be
+ *	numbered such that a flow matching multiple rules will be
+ *	classified according to the first (lowest numbered) rule.
  */
 struct ethtool_rx_flow_spec {
 	__u32		flow_type;
@@ -502,9 +504,6 @@ struct ethtool_rx_flow_spec {
  *
  * For %ETHTOOL_SRXCLSRLDEL, @fs.@location specifies the location of an
  * existing rule on entry.
- *
- * Implementation of indexed classification rules generally requires a
- * TCAM.
  */
 struct ethtool_rxnfc {
 	__u32				cmd;
-- 
1.7.4.4



-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


* [PATCH net-next 3/4] ethtool: Clean up definitions of rule location arrays in RX NFC
From: Ben Hutchings @ 2011-09-06 23:49 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Alexander Duyck, Sebastian Poehn, Santwona Behera
In-Reply-To: <1315352693.2788.20.camel@bwh-desktop>

Correct the description of ethtool_rxnfc::rule_locs; it is an array
of currently used locations, not all possible valid locations.

Add note that drivers must not use ethtool_rxnfc::rule_locs.

The rule_locs argument to ethtool_ops::get_rxnfc is either NULL or a
pointer to an array of u32, so change the parameter type accordingly.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 .../net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c    |    2 +-
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c    |    2 +-
 drivers/net/ethernet/freescale/gianfar_ethtool.c   |    4 ++--
 drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c   |    5 ++---
 drivers/net/ethernet/sfc/ethtool.c                 |    2 +-
 drivers/net/ethernet/sun/niu.c                     |    4 ++--
 drivers/net/vmxnet3/vmxnet3_ethtool.c              |    2 +-
 include/linux/ethtool.h                            |    7 ++++---
 8 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
index 767c229..ce14f11 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
@@ -2241,7 +2241,7 @@ static int bnx2x_set_phys_id(struct net_device *dev,
 }
 
 static int bnx2x_get_rxnfc(struct net_device *dev, struct ethtool_rxnfc *info,
-			   void *rules __always_unused)
+			   u32 *rules __always_unused)
 {
 	struct bnx2x *bp = netdev_priv(dev);
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 90b4921..40b395f 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -1902,7 +1902,7 @@ static int set_rss_table(struct net_device *dev,
 }
 
 static int get_rxnfc(struct net_device *dev, struct ethtool_rxnfc *info,
-		     void *rules)
+		     u32 *rules)
 {
 	const struct port_info *pi = netdev_priv(dev);
 
diff --git a/drivers/net/ethernet/freescale/gianfar_ethtool.c b/drivers/net/ethernet/freescale/gianfar_ethtool.c
index 25a8c2a..4223830 100644
--- a/drivers/net/ethernet/freescale/gianfar_ethtool.c
+++ b/drivers/net/ethernet/freescale/gianfar_ethtool.c
@@ -1712,7 +1712,7 @@ static int gfar_set_nfc(struct net_device *dev, struct ethtool_rxnfc *cmd)
 }
 
 static int gfar_get_nfc(struct net_device *dev, struct ethtool_rxnfc *cmd,
-		void *rule_locs)
+		u32 *rule_locs)
 {
 	struct gfar_private *priv = netdev_priv(dev);
 	int ret = 0;
@@ -1728,7 +1728,7 @@ static int gfar_get_nfc(struct net_device *dev, struct ethtool_rxnfc *cmd,
 		ret = gfar_get_cls(priv, cmd);
 		break;
 	case ETHTOOL_GRXCLSRLALL:
-		ret = gfar_get_cls_all(priv, cmd, (u32 *) rule_locs);
+		ret = gfar_get_cls_all(priv, cmd, rule_locs);
 		break;
 	default:
 		ret = -EINVAL;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
index 82d4244..bad2d27 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
@@ -2283,7 +2283,7 @@ static int ixgbe_get_ethtool_fdir_all(struct ixgbe_adapter *adapter,
 }
 
 static int ixgbe_get_rxnfc(struct net_device *dev, struct ethtool_rxnfc *cmd,
-			   void *rule_locs)
+			   u32 *rule_locs)
 {
 	struct ixgbe_adapter *adapter = netdev_priv(dev);
 	int ret = -EOPNOTSUPP;
@@ -2301,8 +2301,7 @@ static int ixgbe_get_rxnfc(struct net_device *dev, struct ethtool_rxnfc *cmd,
 		ret = ixgbe_get_ethtool_fdir_entry(adapter, cmd);
 		break;
 	case ETHTOOL_GRXCLSRLALL:
-		ret = ixgbe_get_ethtool_fdir_all(adapter, cmd,
-						 (u32 *)rule_locs);
+		ret = ixgbe_get_ethtool_fdir_all(adapter, cmd, rule_locs);
 		break;
 	default:
 		break;
diff --git a/drivers/net/ethernet/sfc/ethtool.c b/drivers/net/ethernet/sfc/ethtool.c
index 93f1fb9..9536925 100644
--- a/drivers/net/ethernet/sfc/ethtool.c
+++ b/drivers/net/ethernet/sfc/ethtool.c
@@ -824,7 +824,7 @@ static int efx_ethtool_reset(struct net_device *net_dev, u32 *flags)
 
 static int
 efx_ethtool_get_rxnfc(struct net_device *net_dev,
-		      struct ethtool_rxnfc *info, void *rules __always_unused)
+		      struct ethtool_rxnfc *info, u32 *rules __always_unused)
 {
 	struct efx_nic *efx = netdev_priv(net_dev);
 
diff --git a/drivers/net/ethernet/sun/niu.c b/drivers/net/ethernet/sun/niu.c
index 3c9ef1c..8037059 100644
--- a/drivers/net/ethernet/sun/niu.c
+++ b/drivers/net/ethernet/sun/niu.c
@@ -7306,7 +7306,7 @@ static int niu_get_ethtool_tcam_all(struct niu *np,
 }
 
 static int niu_get_nfc(struct net_device *dev, struct ethtool_rxnfc *cmd,
-		       void *rule_locs)
+		       u32 *rule_locs)
 {
 	struct niu *np = netdev_priv(dev);
 	int ret = 0;
@@ -7325,7 +7325,7 @@ static int niu_get_nfc(struct net_device *dev, struct ethtool_rxnfc *cmd,
 		ret = niu_get_ethtool_tcam_entry(np, cmd);
 		break;
 	case ETHTOOL_GRXCLSRLALL:
-		ret = niu_get_ethtool_tcam_all(np, cmd, (u32 *)rule_locs);
+		ret = niu_get_ethtool_tcam_all(np, cmd, rule_locs);
 		break;
 	default:
 		ret = -EINVAL;
diff --git a/drivers/net/vmxnet3/vmxnet3_ethtool.c b/drivers/net/vmxnet3/vmxnet3_ethtool.c
index 27400ed..e662cbc 100644
--- a/drivers/net/vmxnet3/vmxnet3_ethtool.c
+++ b/drivers/net/vmxnet3/vmxnet3_ethtool.c
@@ -558,7 +558,7 @@ out:
 
 static int
 vmxnet3_get_rxnfc(struct net_device *netdev, struct ethtool_rxnfc *info,
-		  void *rules)
+		  u32 *rules)
 {
 	struct vmxnet3_adapter *adapter = netdev_priv(netdev);
 	switch (info->cmd) {
diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 58407e3..3df1f3b 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -479,7 +479,7 @@ struct ethtool_rx_flow_spec {
  * @data: Command-dependent value
  * @fs: Flow classification rule
  * @rule_cnt: Number of rules to be affected
- * @rule_locs: Array of valid rule locations
+ * @rule_locs: Array of used rule locations
  *
  * For %ETHTOOL_GRXFH and %ETHTOOL_SRXFH, @data is a bitmask indicating
  * the fields included in the flow hash, e.g. %RXH_IP_SRC.  The following
@@ -497,7 +497,8 @@ struct ethtool_rx_flow_spec {
  * For %ETHTOOL_GRXCLSRLALL, @rule_cnt specifies the array size of the
  * user buffer for @rule_locs on entry.  On return, @data is the size
  * of the rule table and @rule_locs contains the locations of the
- * defined rules.
+ * defined rules.  Drivers must use the second parameter to get_rxnfc()
+ * instead of @rule_locs.
  *
  * For %ETHTOOL_SRXCLSRLINS, @fs specifies the rule to add or update.
  * @fs.@location specifies the location to use and must not be ignored.
@@ -936,7 +937,7 @@ struct ethtool_ops {
 	int	(*set_priv_flags)(struct net_device *, u32);
 	int	(*get_sset_count)(struct net_device *, int);
 	int	(*get_rxnfc)(struct net_device *,
-			     struct ethtool_rxnfc *, void *);
+			     struct ethtool_rxnfc *, u32 *rule_locs);
 	int	(*set_rxnfc)(struct net_device *, struct ethtool_rxnfc *);
 	int	(*flash_device)(struct net_device *, struct ethtool_flash *);
 	int	(*reset)(struct net_device *, u32 *);
-- 
1.7.4.4



-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


* [PATCH net-next 4/4] ethtool: Update ethtool_rxnfc::rule_cnt on return from ETHTOOL_GRXCLSRLALL
From: Ben Hutchings @ 2011-09-06 23:52 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Alexander Duyck, Sebastian Poehn, Santwona Behera
In-Reply-To: <1315352693.2788.20.camel@bwh-desktop>

A user-space process must use ETHTOOL_GRXCLSRLCNT to find the number
of classification rules, then allocate a buffer of the right size,
then use ETHTOOL_GRXCLSRLALL to fill the buffer.  If some other
process inserts or deletes a rule between those two operations,
the user buffer might turn out to be the wrong size.

If the buffer is too small, the return value will be -EMSGSIZE.  But
if it's too large, nothing tells the caller how many entries were
actually filled in.  Fix this by updating the rule_cnt field on return.
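
The count/allocate/fetch sequence described above can be sketched as a
small user-space model.  This is only an illustration under stated
assumptions, not the kernel or ethtool code: struct rule_query,
struct rule_table, get_all_rules() and fetch_rules() are hypothetical
names, and the rule table is a plain array rather than a device reached
through an ioctl.  It shows why writing rule_cnt back matters when the
caller's count is stale.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the relevant fields of struct ethtool_rxnfc. */
struct rule_query {
	uint32_t data;		/* on return: size of the rule table */
	uint32_t rule_cnt;	/* on entry: buffer capacity; on return: rules written */
};

/* Hypothetical rule table standing in for the device state. */
struct rule_table {
	uint32_t size;		/* total slots in the table */
	uint32_t used;		/* number of defined rules */
	uint32_t locs[8];	/* locations of the defined rules */
};

/* Models the "get all rules" step with the rule_cnt update applied. */
static int get_all_rules(const struct rule_table *tbl, struct rule_query *q,
			 uint32_t *rule_locs)
{
	uint32_t i;

	assert(tbl && q && rule_locs);
	if (tbl->used > q->rule_cnt)
		return -EMSGSIZE;	/* caller's buffer is too small */
	for (i = 0; i < tbl->used; i++)
		rule_locs[i] = tbl->locs[i];
	q->data = tbl->size;
	q->rule_cnt = tbl->used;	/* report how many entries were written */
	return 0;
}

/*
 * Caller side: allocate from a possibly stale count, fetch, and retry
 * with a fresh count if the buffer turned out to be too small.
 */
static int fetch_rules(const struct rule_table *tbl, uint32_t stale_cnt,
		       uint32_t **locs_out, uint32_t *cnt_out)
{
	uint32_t cap = stale_cnt;

	for (;;) {
		struct rule_query q = { .data = 0, .rule_cnt = cap };
		uint32_t *locs = calloc(cap ? cap : 1, sizeof(*locs));
		int ret;

		if (!locs)
			return -ENOMEM;
		ret = get_all_rules(tbl, &q, locs);
		if (ret == 0) {
			*locs_out = locs;
			*cnt_out = q.rule_cnt;	/* valid entries, not capacity */
			return 0;
		}
		free(locs);
		if (ret != -EMSGSIZE)
			return ret;
		cap = tbl->used;	/* models re-issuing the count query */
	}
}
```

With a table holding three rules, calling fetch_rules() with a stale
count of five succeeds and reports three valid entries; without the
write-back the caller would have no way to tell that the last two slots
of its buffer were never filled.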

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
The changes to ixgbe and niu are compile-tested only.  The change to
gianfar is untested.

Ben.

 drivers/net/ethernet/freescale/gianfar_ethtool.c |    1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c |    2 ++
 drivers/net/ethernet/sun/niu.c                   |    2 ++
 include/linux/ethtool.h                          |    6 +++---
 4 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/freescale/gianfar_ethtool.c b/drivers/net/ethernet/freescale/gianfar_ethtool.c
index 4223830..f30b96f 100644
--- a/drivers/net/ethernet/freescale/gianfar_ethtool.c
+++ b/drivers/net/ethernet/freescale/gianfar_ethtool.c
@@ -1676,6 +1676,7 @@ static int gfar_get_cls_all(struct gfar_private *priv,
 	}
 
 	cmd->data = MAX_FILER_IDX;
+	cmd->rule_cnt = i;
 
 	return 0;
 }
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
index bad2d27..34591c3 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
@@ -2279,6 +2279,8 @@ static int ixgbe_get_ethtool_fdir_all(struct ixgbe_adapter *adapter,
 		cnt++;
 	}
 
+	cmd->rule_cnt = cnt;
+
 	return 0;
 }
 
diff --git a/drivers/net/ethernet/sun/niu.c b/drivers/net/ethernet/sun/niu.c
index 8037059..fff0f8b 100644
--- a/drivers/net/ethernet/sun/niu.c
+++ b/drivers/net/ethernet/sun/niu.c
@@ -7302,6 +7302,8 @@ static int niu_get_ethtool_tcam_all(struct niu *np,
 	}
 	niu_unlock_parent(np, flags);
 
+	nfc->rule_cnt = cnt;
+
 	return ret;
 }
 
diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 3df1f3b..ddd536d 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -496,9 +496,9 @@ struct ethtool_rx_flow_spec {
  *
  * For %ETHTOOL_GRXCLSRLALL, @rule_cnt specifies the array size of the
  * user buffer for @rule_locs on entry.  On return, @data is the size
- * of the rule table and @rule_locs contains the locations of the
- * defined rules.  Drivers must use the second parameter to get_rxnfc()
- * instead of @rule_locs.
+ * of the rule table, @rule_cnt is the number of defined rules, and
+ * @rule_locs contains the locations of the defined rules.  Drivers
+ * must use the second parameter to get_rxnfc() instead of @rule_locs.
  *
  * For %ETHTOOL_SRXCLSRLINS, @fs specifies the rule to add or update.
  * @fs.@location specifies the location to use and must not be ignored.
-- 
1.7.4.4


-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
