Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [RFC net-next 0/4] ynl: YAML netlink protocol descriptions
From: Jakub Kicinski @ 2022-08-12 22:26 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: netdev, davem, edumazet, pabeni, sdf, jacob.e.keller, vadfed,
	johannes, jiri, dsahern, stephen, fw, linux-doc
In-Reply-To: <9208fec1-60e9-dd2b-af27-ada3dfa50121@gmail.com>

On Fri, 12 Aug 2022 10:00:52 -0700 Florian Fainelli wrote:
> > The ability for a high level language like Python to talk to the kernel
> > so easily, without ctypes, manually packing structs, copy'n'pasting
> > values for defines etc. excites me more than C codegen, anyway.  
> 
> This is really cool BTW, and it makes a lot of sense to me that we are 
> moving that way, especially with Rust knocking at the door. I will try 
> to do a more thorough review, than "cool, I like it".

Thanks! Feel free to ping me for the latest version whenever you want
to take a look, because the code will hopefully be under very active
development for a few more weeks..

^ permalink raw reply

* Re: [RFC net-next 1/4] ynl: add intro docs for the concept
From: Jakub Kicinski @ 2022-08-12 22:23 UTC (permalink / raw)
  To: Edward Cree
  Cc: netdev, davem, edumazet, pabeni, sdf, jacob.e.keller, vadfed,
	johannes, jiri, dsahern, stephen, fw, linux-doc
In-Reply-To: <999354bc-4e79-73fc-e195-9b8d17b3d3b5@gmail.com>

On Thu, 11 Aug 2022 21:17:37 +0100 Edward Cree wrote:
> > +direction (e.g. a ``dump`` which doesn't not accept filter, or a ``do``  
> 
> Double negative.  I think you just meant "doesn't accept filter" here?

Yup, thanks!

^ permalink raw reply

* [linux-firmware][pull request] ice: Update package to 1.3.30.0
From: Tony Nguyen @ 2022-08-12 22:22 UTC (permalink / raw)
  To: linux-firmware; +Cc: Tony Nguyen, netdev, Gurucharan

Update package file and WHENCE entry to 1.3.30.0.

Changes include:
 - Selectable Tx Scheduler Topology support
 - Flow Director support on multicast packets

Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel)
---
The following are changes since commit ad5ae82019480dc1feffb538c118397a40934e62:
  Merge branch 'main' of https://github.com/suraj714/BT-Upstream
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/firmware dev-queue

 WHENCE                                        |   4 ++--
 .../{ice-1.3.28.0.pkg => ice-1.3.30.0.pkg}    | Bin 684408 -> 692660 bytes
 2 files changed, 2 insertions(+), 2 deletions(-)
 rename intel/ice/ddp/{ice-1.3.28.0.pkg => ice-1.3.30.0.pkg} (97%)

diff --git a/WHENCE b/WHENCE
index ef43a8242308..011630b2a4cc 100644
--- a/WHENCE
+++ b/WHENCE
@@ -5829,8 +5829,8 @@ Licence: Redistributable. See LICENSE.amlogic_vdec for details.

 Driver: ice - Intel(R) Ethernet Connection E800 Series

-File: intel/ice/ddp/ice-1.3.28.0.pkg
-Link: intel/ice/ddp/ice.pkg -> ice-1.3.28.0.pkg
+File: intel/ice/ddp/ice-1.3.30.0.pkg
+Link: intel/ice/ddp/ice.pkg -> ice-1.3.30.0.pkg

 License: Redistributable. See LICENSE.ice for details

diff --git a/intel/ice/ddp/ice-1.3.28.0.pkg b/intel/ice/ddp/ice-1.3.30.0.pkg
similarity index 97%
rename from intel/ice/ddp/ice-1.3.28.0.pkg
rename to intel/ice/ddp/ice-1.3.30.0.pkg
index fdef8007694c112c4ded77395f87b4e8575fa144..454a2a6ea1930c0abcce83b1f2bf7a58061fa3b6 100644
GIT binary patch
delta 1979
zcmezINOQ|#Eonvu1_o9jmI2}dAie^`E`|&YHHBOg#q}BGCb~DX8k!n&F-$x^k(r0}
z+2+}Xxs3IvO~3ADO_LV1x~rJI;4{N{wj07<Ri7BYv_Ahtr=6?jeZj)E3I9~Stxq}-
z)>L-n>s!CBr%~13v0Vub=fjp&Z!VaUJO40OqTavyXJ5}A)15zS_O$<rNlEw1Ro*b=
zJzjJzdc*l|+wRSOb9alu0?|_k1Xy@Hp7+i-I$6_xp?coNJtp-@_wW4sbXUM=&cv?!
zU-pUKI(7Ejw)HD-p4`gxpY8IkQ1^99Ug}@}GmEWYZ{3&6#z&{m50n1!eI0ky581F!
zwHmpJmEOMTC-0y8wzM;M_N~j&W;ecV&M5t1;Li0&M9^iL+h&!W({s{{+OnOFbl%WT
zv7hnTy7B7#n6oc@J>NZBShw2fnI15hP(V*zV~poZovyPjnM+>J{<W^y>b6MXJhlRf
zM-k1_wY)i}H-<9jY&Wpt^kCoIz|zAy*+Ec1!AfNM?l($nx~i_&JC^;FG&RY6_)dDZ
zv2c>n7LVE1`G%Vd1P`z&IPAQ}>91nZxAu3V%1NWb`k%TUMN4k|DsoRLag&KFnz}ha
z!hu=g`k8+%{C2aZdu)1>=->G3i0N|crz+ROd}k_Z>|A4a)n)SrX`rh0&S%@(&ADFA
zU*6R7%enU6{#DYqg}Bbid|CXCYeUNC?#%-79YUk=JQCvBNN96aqB~PPI2p4turo3=
zI7Dnnx|(KMEvBpO;0Y=j7#zB>iy!F0F7D8eLmY=W4&82~YQV03vR|6m<T)wwAn!mN
z-EeD7+WAGPtQwkN9|8%8qcO!Hj>aMmb~I_?It&fOYuKEYYQV@O$1u@Hrg=+R`<66D
zAZFUWC5>63Z2FFe%oDa>xx@T{dHRG6j4JIF51F@DJY<PkjVUvcb$i7`w$@cl85z_6
zTeCX>>Fv%o?B-vWS2=PBEUz-;5ZGQ7$RWhZXf?ezio=o7di(h(j=lUy+$;{&;OT`|
zIc&EZ?&G-3$l@6gW;R{%HizYQ#{(Rejly5}l^GZsL>SiassQN&yhco>+j|{37xQf2
z;>XFs$jGsMuRrHf#_4T=oIN1-@=dP@<P_L062$4pKfNyzA}A)x#NfgKGC+iZg;^Cy
zOE7D(a&MPR;*95?3@jG6R}^wGB}m`kvjob?FihdH0@53}rZHIqX%F5$M$PS0FLHKs
zZeMbn^9n0aM+-1WIFGZk`}n(iIy?GI7kt5KIB^2&_L}FMe>%k#fRPSL#uh+^r?aa<
zP-$LCW^U?ae?!@NCa_!ug8-1`VaQ^TWHe@oROV3NP;dZ31|<*+1phMv0TT;{fCAYx
z&<)7&KL~7R=in^l1XcwiY=VpoJ|Jf{@Gv+qFfqt8C`>Oj;t=IZU=Rhm4a~pi#38eJ
MiOCfnpoRxb0GjpPJpcdz

delta 1892
zcmdn8SnJ0lO>ssB1_l-&mH^@$Aie;^4u%X9rS%zQCVDrsnwxPk{GWJvBC{~p)6Me@
za~bPzF1NkA%+IjYcE`&75nQ5C9z4;UR+GzPqqn|R=AQaFo%!>LCnkruycW0L7GAEf
zeVg|7do!xudTi0Ub@Sb)zC(*<t(Fj&_hR-Z9u^-Jsg+$Su7xH$YZf=u=httlHrKAs
zv@_hjQfI+7uUy9^5zmXgvTF+BOE+pQ-G9D4y5vF6?RrIrvra+>A3Tlyd)-ICX7(8u
zy|c6QxLH5U+|7EmLHwBW%Lz@bNpjaC&R<Gx2yA_R?ql^CtBLouS8Lw8k<^r(d~SVc
ztkRisp>Uo6ix%SrMG@6ai5q6j(W$jM;o?0r*mb7vhFy!=nr1CFeLQ#d>h_I%Welc;
zFC3IqPHkWOyy&c^5lbd8h){s}$wU2xzS*<1mp<Ioc6-)&>+)4a>ko5kerUFIdoGZ?
zWO}1MbI#@hmM1(49{fLk`&7JLac?hkuVV21<a-}N_jTUQd0MqIU&p5LJlp0qjtAHj
zej3EH7Yioe+cr1Q?G>-HtLVIfQ#bc***Wij>|L!Fwpp8dJRF!6jMtplK1sPVgW;Qq
ztY4~q!`Xx*BJP_ESk~=$H}Upg{<6()yn(82=3S89@$};7gj=!_`ELRo#Va)LIT@%%
z80B+qu|3yxW3!LHk<er}!?4j99=<VLDN@e_O0ujB?2HTz4iOuYuBMq*i|J}R00lQh
zi#v2<7eCO0UEHA`hd2&%9J<{|)qq|7<gOIcdboodZmmf>zbKVeLlfjVkVha6#uSG*
z7>hXA!K8`nFf<UaVRKfh0V9(P!^8lY<{fG6JJJ||m}&ctG-ide?HPBOKQK?<^N@MM
z<N_9z_L_&x+iM=O#H_}Un8>=lW+GebDyHn*>C86lPC$CQyA8Yf*X4DN90JSh3^@e0
z*9CG2aWa}upBTmA$Ya3};2a<99un^w5;=XN52x97k!X$@emv6I98ZEci&BeIi^@_{
zjHff+;nbcufoHqLK8_2FaB-9AhM^qh+dU3&m^BK&;a6s0Xb@pA;8Ow8E_}-v&9_f<
z<Xp@nXv*N`<LDmm9ug4m>JcB{5(+8-4Mok+h2woZ{iYW>acWMSz_Q)MkCTIuk!`!J
zKj%_LV0eg4uL<N7nEoVyvuk@-AZGwSNKBN8;Q|N9IuV8m%&I_o39}X}@AR7qoW0v8
zC342`Z?7rjWJ-{Z;kN`z$}mW9TLEbk?rBWcK>7@CAEWm6nHM>`IVGTO3~&qzb`3&v
zs+J7YsTi^_7prYoxy^Y67%LAskF)ZIg!=ip`oxF%IQqqVPIrvtG}zwoobyj70MMh>
AV*mgE

-- 
2.35.1

^ permalink raw reply related

* Re: [syzbot] memory leak in netlink_policy_dump_add_policy
From: syzbot @ 2022-08-12 21:04 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, edumazet, kuba, linux-kernel, netdev, pabeni,
	syzkaller-bugs
In-Reply-To: <20220812140439.6bb2bb17@kernel.org>

> On Fri, 12 Aug 2022 10:04:26 -0700 syzbot wrote:
>> Hello,
>> 
>> syzbot found the following issue on:
>> 
>> HEAD commit:    4e23eeebb2e5 Merge tag 'bitmap-6.0-rc1' of https://github...
>> git tree:       upstream
>> console output: https://syzkaller.appspot.com/x/log.txt?x=165f4f6a080000
>> kernel config:  https://syzkaller.appspot.com/x/.config?x=3a433c7a2539f51c
>> dashboard link: https://syzkaller.appspot.com/bug?extid=dc54d9ba8153b216cae0
>> compiler:       gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
>> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=1443be71080000
>> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=11e5918e080000
>> 
>> IMPORTANT: if you fix the issue, please add the following tag to the commit:
>> Reported-by: syzbot+dc54d9ba8153b216cae0@syzkaller.appspotmail.com
>
> Let's see if attaching a patch works...
>
> #syz test

want 2 args (repo, branch), got 15

>
> -- 
> You received this message because you are subscribed to the Google Groups "syzkaller-bugs" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to syzkaller-bugs+unsubscribe@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/syzkaller-bugs/20220812140439.6bb2bb17%40kernel.org.

^ permalink raw reply

* Re: [syzbot] memory leak in netlink_policy_dump_add_policy
From: Jakub Kicinski @ 2022-08-12 21:04 UTC (permalink / raw)
  To: syzbot; +Cc: davem, edumazet, linux-kernel, netdev, pabeni, syzkaller-bugs
In-Reply-To: <0000000000003fcafc05e60e466e@google.com>

[-- Attachment #1: Type: text/plain, Size: 892 bytes --]

On Fri, 12 Aug 2022 10:04:26 -0700 syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:    4e23eeebb2e5 Merge tag 'bitmap-6.0-rc1' of https://github...
> git tree:       upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=165f4f6a080000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=3a433c7a2539f51c
> dashboard link: https://syzkaller.appspot.com/bug?extid=dc54d9ba8153b216cae0
> compiler:       gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=1443be71080000
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=11e5918e080000
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+dc54d9ba8153b216cae0@syzkaller.appspotmail.com

Let's see if attaching a patch works...

#syz test

[-- Attachment #2: 0001-net-genl-fix-error-path-memory-leak-in-policy-dumpin.patch --]
[-- Type: text/x-patch, Size: 1180 bytes --]

From 7b3f410d5c49568deffcc8ee9881a7d3de730699 Mon Sep 17 00:00:00 2001
From: Jakub Kicinski <kuba@kernel.org>
Date: Fri, 12 Aug 2022 13:56:48 -0700
Subject: [--remember tree name--] net: genl: fix error path memory leak in
 policy dumping

If construction of the array of policies fails when recording
non-first policy we need to unwind.

Reported-by: syzbot+dc54d9ba8153b216cae0@syzkaller.appspotmail.com
Fixes: 50a896cf2d6f ("genetlink: properly support per-op policy dumping")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 net/netlink/genetlink.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index 1afca2a6c2ac..57010927e20a 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -1174,13 +1174,17 @@ static int ctrl_dumppolicy_start(struct netlink_callback *cb)
 							     op.policy,
 							     op.maxattr);
 			if (err)
-				return err;
+				goto err_free_state;
 		}
 	}
 
 	if (!ctx->state)
 		return -ENODATA;
 	return 0;
+
+err_free_state:
+	netlink_policy_dump_free(ctx->state);
+	return err;
 }
 
 static void *ctrl_dumppolicy_prep(struct sk_buff *skb,
-- 
2.37.1


^ permalink raw reply related

* Re: [syzbot] memory leak in netlink_policy_dump_add_policy
From: syzbot @ 2022-08-12 21:04 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, edumazet, kuba, linux-kernel, netdev, pabeni,
	syzkaller-bugs
In-Reply-To: <20220812140439.6bb2bb17@kernel.org>

> On Fri, 12 Aug 2022 10:04:26 -0700 syzbot wrote:
>> Hello,
>> 
>> syzbot found the following issue on:
>> 
>> HEAD commit:    4e23eeebb2e5 Merge tag 'bitmap-6.0-rc1' of https://github...
>> git tree:       upstream
>> console output: https://syzkaller.appspot.com/x/log.txt?x=165f4f6a080000
>> kernel config:  https://syzkaller.appspot.com/x/.config?x=3a433c7a2539f51c
>> dashboard link: https://syzkaller.appspot.com/bug?extid=dc54d9ba8153b216cae0
>> compiler:       gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
>> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=1443be71080000
>> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=11e5918e080000
>> 
>> IMPORTANT: if you fix the issue, please add the following tag to the commit:
>> Reported-by: syzbot+dc54d9ba8153b216cae0@syzkaller.appspotmail.com
>
> Let's see if attaching a patch works...
>
> #syz test

want 2 args (repo, branch), got 1


^ permalink raw reply

* Re: [RFC v2] net: introduce OpenVPN Data Channel Offload (ovpn-dco)
From: Antonio Quartulli @ 2022-08-12 21:05 UTC (permalink / raw)
  To: Sergey Ryazanov; +Cc: netdev, David Miller, Jakub Kicinski, open list
In-Reply-To: <CAHNKnsQnHAdxC-XhC9RP-cFp0d-E4YGb+7ie3WymXVL9N-QS6A@mail.gmail.com>

Hello Sergey,

Thank you VERY much for your feedback!

Please, see my comments inline

On 12/08/2022 20:34, Sergey Ryazanov wrote:
> Hello Antonio,
> 
> first of all, I would like to say thank you for taking care of this.
> Such useful software as OpenVPN deserves a kernel accelerated rate,
> and probably a lot of users, including me, are waiting for such
> feature.
> 
> On Wed, Aug 3, 2022 at 6:37 PM Antonio Quartulli <antonio@openvpn.net> wrote:
>> OpenVPN is a userspace software existing since around 2005 that allows
>> users to create secure tunnels.
>>
>> So far OpenVPN has implemented all operations in userspace, which
>> implies several back and forth between kernel and user land in order to
>> process packets (encapsulate/decapsulate, encrypt/decrypt, rerouting..).
>>
>> With ovpn-dco, we intend to move the fast path (data channel) entirely
>> in kernel space and thus improve user measured throughput over the
>> tunnel.
>>
>> ovpn-dco is implemented as a simple virtual network device driver, that
>> can be manipulated by means of the standard RTNL APIs. A device of kind
>> 'ovpn-dco' allows only IPv4/6 traffic and can be of type:
>> * P2P (peer-to-peer): any packet sent over the interface will be
>>    encapsulated and transmitted to the other side (typical OpenVPN
>>    client behaviour);
>> * P2MP (point-to-multipoint): packets sent over the interface are
>>    transmitted to peers based on existing routes (typical OpenVPN
>>    server behaviour).
>>
>> After the interface has been created, OpenVPN in userspace can
>> configure it using a new Netlink API. Specifically it is possible
>> to manage peers, configure per-peer keys and exchange packets with
>> userspace.
>>
>> The OpenVPN control channel is multiplexed over the same transport
>> socket by means of OP codes. Anything that is not DATA_V2 (OpenVPN
>> OP code for data traffic) is sent to userspace and handled there.
>> This way the ovpn-dco codebase is kept as compact as possible while
>> focusing on handling data traffic only.
>>
>> Any OpenVPN control feature (like cipher negotiation, TLS handshake,
>> rekeying, etc.) is still fully handled by the userspace process.
>>
>> When userspace establishes a new connection with a peer, it first
>> performs the handshake and then passes the socket to ovpn-dco, which
>> takes ownership. From this moment on ovpn-dco will handle data traffic
>> for the new peer. When control packets are received on the link, they
>> are forwarded to userspace via Netlink.
>>
>> (this approach is somewhat inspired by hostapd+mac80211)
>>
>> Some events (like peer deletion) are sent to a Netlink multicast group.
>>
>> Although it wasn't easy to convince the community, ovpn-dco implements
>> only a limited number of the data-channel features supported by the
>> userspace program.
>>
>> Each feature that made it to ovpn-dco was attentively vetted to
>> avoid carrying too much legacy along with us (and to give a clear cut to
>> old and probalby-not-so-useful features).
>>
>> Notably, only encryption using AEAD ciphers (specifically
>> ChaCha20Poly1305 and AES-GCM) was implemented. Supporting any other
>> cipher out there was not deemed useful.
>>
>> As explained above, in case of P2MP mode, OpenVPN will use the main system
>> routing table to decide which packet goes to which peer. This implies
>> that no routing table was re-implemented in ovpn-dco.
>>
>> This kernel module can be enabled by selecting the CONFIG_OVPN_DCO entry
>> in the networking drivers section.
>>
>> Cc: David Miller <davem@davemloft.net>
>> Cc: Jakub Kicinski <kuba@kernel.org>
>> Cc: netdev@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
>> ---
>>
>> Changes from v1:
>> * use net/netdev print helpers when possible
>> * properly set min/max_mtu
>> * get rid of ndo_change_mtu
>> * don't set version in ethtool output
>> * ensure can be compiled also when no IPv6 support exists
>>
>> ---
>>   MAINTAINERS                        |    8 +
>>   drivers/net/Kconfig                |   19 +
>>   drivers/net/Makefile               |    1 +
>>   drivers/net/ovpn-dco/Makefile      |   21 +
>>   drivers/net/ovpn-dco/addr.h        |   41 +
>>   drivers/net/ovpn-dco/bind.c        |   62 ++
>>   drivers/net/ovpn-dco/bind.h        |   67 ++
>>   drivers/net/ovpn-dco/crypto.c      |  154 ++++
>>   drivers/net/ovpn-dco/crypto.h      |  144 ++++
>>   drivers/net/ovpn-dco/crypto_aead.c |  367 +++++++++
>>   drivers/net/ovpn-dco/crypto_aead.h |   27 +
>>   drivers/net/ovpn-dco/main.c        |  271 +++++++
>>   drivers/net/ovpn-dco/main.h        |   32 +
>>   drivers/net/ovpn-dco/netlink.c     | 1143 ++++++++++++++++++++++++++++
>>   drivers/net/ovpn-dco/netlink.h     |   22 +
>>   drivers/net/ovpn-dco/ovpn.c        |  600 +++++++++++++++
>>   drivers/net/ovpn-dco/ovpn.h        |   43 ++
>>   drivers/net/ovpn-dco/ovpnstruct.h  |   59 ++
>>   drivers/net/ovpn-dco/peer.c        |  906 ++++++++++++++++++++++
>>   drivers/net/ovpn-dco/peer.h        |  168 ++++
>>   drivers/net/ovpn-dco/pktid.c       |  127 ++++
>>   drivers/net/ovpn-dco/pktid.h       |  116 +++
>>   drivers/net/ovpn-dco/proto.h       |  101 +++
>>   drivers/net/ovpn-dco/rcu.h         |   21 +
>>   drivers/net/ovpn-dco/skb.h         |   54 ++
>>   drivers/net/ovpn-dco/sock.c        |  134 ++++
>>   drivers/net/ovpn-dco/sock.h        |   54 ++
>>   drivers/net/ovpn-dco/stats.c       |   20 +
>>   drivers/net/ovpn-dco/stats.h       |   67 ++
>>   drivers/net/ovpn-dco/tcp.c         |  326 ++++++++
>>   drivers/net/ovpn-dco/tcp.h         |   38 +
>>   drivers/net/ovpn-dco/udp.c         |  343 +++++++++
>>   drivers/net/ovpn-dco/udp.h         |   25 +
>>   include/net/netlink.h              |    1 +
>>   include/uapi/linux/ovpn_dco.h      |  265 +++++++
>>   include/uapi/linux/udp.h           |    1 +
>>   36 files changed, 5848 insertions(+)
> 
> It is very hard to review almost 6 thousand lines of code at once.
> Some issues may be overlooked. I would like to ask you to do the
> following submission as a series of patches. It is easier to review
> the code separated into patches by logically completed functional
> blocks. E.g., module skeleton, management API framework, crypto
> framework, minimal data handling utils, UDP transport support, TCP
> transport support, other supplementary features like ethtools API,
> statistics, etc. This was just an example, you better know internal
> code dependencies and a best way to introduce the functionality.

You are right about the review being a bit harder, but I did not know 
what was the preferred approach.

It seemed that other modules are also introduced in one commit, so I 
just did the same.

But if it's accepted to have "compiling" commits that do not really 
deliver a working driver, I am fine with splitting this up.

> 
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 1920d82db83e..7cb16007dd5c 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -15103,6 +15103,14 @@ T:     git git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git
>>   F:     Documentation/filesystems/overlayfs.rst
>>   F:     fs/overlayfs/
>>
>> +OVPN-DCO NETWORK DRIVER
>> +M:     Antonio Quartulli <antonio@openvpn.net>
>> +L:     openvpn-devel@lists.sourceforge.net (moderated for non-subscribers)
>> +L:     netdev@vger.kernel.org
>> +S:     Maintained
>> +F:     drivers/net/ovpn-dco/
>> +F:     include/uapi/linux/ovpn_dco.h
>> +
>>   P54 WIRELESS DRIVER
>>   M:     Christian Lamparter <chunkeey@googlemail.com>
>>   L:     linux-wireless@vger.kernel.org
>> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
>> index 94c889802566..349866bd4448 100644
>> --- a/drivers/net/Kconfig
>> +++ b/drivers/net/Kconfig
>> @@ -116,6 +116,25 @@ config WIREGUARD_DEBUG
>>
>>            Say N here unless you know what you're doing.
>>
>> +config OVPN_DCO
>> +       tristate "OpenVPN data channel offload"
> 
> Just curious, why do you need this "DCO" suffix? It is not some
> commonly recognized abbreviation. Why not just call this module
> "OpenVPN"? Or are you planning to move some other components into the
> kernel and reserve some naming space?

This device driver just implements data channel offloading and it is not 
a fully fledged OpenVPN impementation, so I thought it was meaningful to 
make this explicit.

Calling the driver just openvpn/ovpn may give the false impression that 
this is a reimplementation of the whole OpenVPN protocol in kernelspace.

Does it make sense?

The "DCO" acronym may not be so common in other areas, but in the 
OpenVPN community we have been talking about DCO (Data Channel Offload) 
since a while, so it was pretty natural for us to use this suffix.

> 
>> +       depends on NET && INET
>> +       select NET_UDP_TUNNEL
>> +       select DST_CACHE
>> +       select CRYPTO
>> +       select CRYPTO_AES
>> +       select CRYPTO_GCM
>> +       select CRYPTO_CHACHA20POLY1305
>> +       help
>> +         This module enhances the performance of an OpenVPN connection by
>> +         allowing the user to offload the data channel processing to
>> +         kernelspace.
>> +         Connection handshake, parameters negotiation and other non-data
>> +         related mechanisms are still performed in userspace.
>> +
>> +         The OpenVPN userspace software at version 2.6 or higher is required
>> +         to use this functionality.
>> +
>>   config EQUALIZER
>>          tristate "EQL (serial line load balancing) support"
>>          help
> 
> [skipped]
> 
>> +/**
>> + * ovpn_num_queues - define number of queues to allocate per device
>> + *
>> + * The value returned by this function is used to decide how many RX and TX
>> + * queues to allocate when creating the netdev object
>> + *
>> + * Return the number of queues to allocate
>> + */
>> +static unsigned int ovpn_num_queues(void)
>> +{
>> +       return num_online_cpus();
>> +}
>> +
>> +static struct rtnl_link_ops ovpn_link_ops __read_mostly = {
>> +       .kind                   = DRV_NAME,
>> +       .priv_size              = sizeof(struct ovpn_struct),
>> +       .setup                  = ovpn_setup,
>> +       .policy                 = ovpn_policy,
>> +       .maxtype                = IFLA_OVPN_MAX,
>> +       .newlink                = ovpn_newlink,
>> +       .dellink                = ovpn_dellink,
> 
> What is the purpose of creating and destroying interfaces via RTNL,
> but performing all other operations using the dedicated netlink
> protocol?
> 
> RTNL interface usually implemented for some standalone interface
> types, e.g. VLAN, GRE, etc. Here we need a userspace application
> anyway to be able to use the network device to forward traffic, and
> the module implements the dedicated GENL protocol. So why not just
> introduce OVPN_CMD_NEW_IFACE and OVPN_CMD_DEL_IFACE commands to the
> GENL interface? It looks like this will simplify the userspace part by
> using the single GENL interface for any management operations.

As Stephen also said in his reply, I tried to stick to the standard 
approach of creating interface via RTNL (which is also netlink).

With this implementation you can already create an interface as:

ip link add vpn0 type ovpn-dco

Eventually I will patch iproute2 to support some options as well (we 
have one only for now).


> 
>> +       .get_num_tx_queues      = ovpn_num_queues,
>> +       .get_num_rx_queues      = ovpn_num_queues,
> 
> What is the benefit of requesting multiple queues if the xmit callback
> places all packets from those kernel queues into the single internal
> queue anyway?

Good point. This is one of those aspects where I was hoping to get some 
guidance.
In any case, I will double check if having more than one queue is really 
what we want or not.

> 
>> +};
>> +
>> +static int __init ovpn_init(void)
>> +{
>> +       int err = 0;
>> +
>> +       pr_info("%s %s -- %s\n", DRV_DESCRIPTION, DRV_VERSION, DRV_COPYRIGHT);
> 
> Is this log line really necessary for the regular module usage?

Well, it's a reasonable way to give users feedback about the module 
loading successfully. I see it's pretty common across drivers, so I 
thought to use the same approach.

> 
>> +
>> +       /* init RTNL link ops */
>> +       err = rtnl_link_register(&ovpn_link_ops);
>> +       if (err) {
>> +               pr_err("ovpn: can't register RTNL link ops\n");
>> +               goto err;
>> +       }
>> +
>> +       err = ovpn_netlink_register();
>> +       if (err) {
>> +               pr_err("ovpn: can't register netlink family\n");
>> +               goto err_rtnl_unregister;
>> +       }
>> +
>> +       return 0;
>> +
>> +err_rtnl_unregister:
>> +       rtnl_link_unregister(&ovpn_link_ops);
>> +err:
>> +       pr_err("ovpn: initialization failed, error status=%d\n", err);
>> +       return err;
>> +}
> 
> [skipped]
> 
>> +static const struct genl_ops ovpn_netlink_ops[] = {
>> +       {
>> +               .cmd = OVPN_CMD_NEW_PEER,
>> +               .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
> 
> AFAIK, the "don't validate strict" flag is for compatibility with old
> users of earlier existing subsystems. For new GENL families, it is
> better to avoid using this flag and strictly implement the netlink
> protocol.

Good point. I can probably just get rid of this flag then.

> 
>> +               .flags = GENL_ADMIN_PERM,
>> +               .doit = ovpn_netlink_new_peer,
>> +       },
>>
>>          ...
>>
>> +};
> 
> [skipped]
> 
>> +static int ovpn_transport_to_userspace(struct ovpn_struct *ovpn, const struct ovpn_peer *peer,
>> +                                      struct sk_buff *skb)
>> +{
>> +       int ret;
>> +
>> +       ret = skb_linearize(skb);
>> +       if (ret < 0)
>> +               return ret;
>> +
>> +       ret = ovpn_netlink_send_packet(ovpn, peer, skb->data, skb->len);
>> +       if (ret < 0)
>> +               return ret;
> 
> Another interesting decision. Why are you transporting the control
> messages via Netlink? Why not just pass them to userspace via an
> already existing TCP/UDP socket, like the LT2P module do, for example?
> Such design usually requires less changes to the userspace application
> since it is still able to process control messages as earlier by
> reading them from the socket.

As far as I understand the L2TP module implementes its own protocol 
(IPPROTO_L2TP).

In the ovpn-dco case userspace simply opens a normal UDP or TCP socket 
(as it always used to do), which is later passed to the kernel.

In ovpn-dco (kernel) we then use udp_tunnel to take over the UDP socket, 
while we change the socket ops in case of TCP.

While in the UDP case it would be possible to still use the socket to 
send control packets, in the TCP case this would not work and we would 
need another method (which is netlink) to send/receive packets from 
userspace.

At this point we did not want to treat TCP and UDP sockets differently, 
so we decided to always use netlink to send control packets to/from 
userspace.

> 
>> +       consume_skb(skb);
>> +       return 0;
>> +}
> 
> [skipped]
> 
>> +/* Net device start xmit
>> + */
>> +netdev_tx_t ovpn_net_xmit(struct sk_buff *skb, struct net_device *dev)
>> +{
>> +       struct ovpn_struct *ovpn = netdev_priv(dev);
>> +       struct sk_buff *segments, *tmp, *curr, *next;
>> +       struct sk_buff_head skb_list;
>> +       __be16 proto;
>> +       int ret;
>> +
>> +       /* reset netfilter state */
>> +       nf_reset_ct(skb);
>> +
>> +       /* verify IP header size in network packet */
>> +       proto = ovpn_ip_check_protocol(skb);
>> +       if (unlikely(!proto || skb->protocol != proto)) {
>> +               net_dbg_ratelimited("%s: dropping malformed payload packet\n",
>> +                                   dev->name);
>> +               goto drop;
>> +       }
>> +
>> +       if (skb_is_gso(skb)) {
>> +               segments = skb_gso_segment(skb, 0);
>> +               if (IS_ERR(segments)) {
>> +                       ret = PTR_ERR(segments);
>> +                       net_dbg_ratelimited("%s: cannot segment packet: %d\n", dev->name, ret);
>> +                       goto drop;
>> +               }
>> +
>> +               consume_skb(skb);
>> +               skb = segments;
>> +       }
>> +
>> +       /* from this moment on, "skb" might be a list */
>> +
>> +       __skb_queue_head_init(&skb_list);
>> +       skb_list_walk_safe(skb, curr, next) {
>> +               skb_mark_not_on_list(curr);
>> +
>> +               tmp = skb_share_check(curr, GFP_ATOMIC);
>> +               if (unlikely(!tmp)) {
>> +                       kfree_skb_list(next);
>> +                       net_dbg_ratelimited("%s: skb_share_check failed\n", dev->name);
>> +                       goto drop_list;
>> +               }
>> +
>> +               __skb_queue_tail(&skb_list, tmp);
>> +       }
>> +       skb_list.prev->next = NULL;
>> +
>> +       ovpn_queue_skb(ovpn, skb_list.next, NULL);
>> +
>> +       return NETDEV_TX_OK;
>> +
>> +drop_list:
>> +       skb_queue_walk_safe(&skb_list, curr, next)
>> +               kfree_skb(curr);
>> +drop:
>> +       skb_tx_error(skb);
>> +       kfree_skb_list(skb);
>> +       return NET_XMIT_DROP;
>> +}
> 
> [skipped]
> 
>> diff --git a/include/uapi/linux/ovpn_dco.h b/include/uapi/linux/ovpn_dco.h
>> new file mode 100644
>> index 000000000000..6afee8b3fedd
>> --- /dev/null
>> +++ b/include/uapi/linux/ovpn_dco.h
>> @@ -0,0 +1,265 @@
>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>> +/*
>> + *  OpenVPN data channel accelerator
>> + *
>> + *  Copyright (C) 2019-2022 OpenVPN, Inc.
>> + *
>> + *  Author:    James Yonan <james@openvpn.net>
>> + *             Antonio Quartulli <antonio@openvpn.net>
>> + */
>> +
>> +#ifndef _UAPI_LINUX_OVPN_DCO_H_
>> +#define _UAPI_LINUX_OVPN_DCO_H_
>> +
>> +#define OVPN_NL_NAME "ovpn-dco"
>> +
>> +#define OVPN_NL_MULTICAST_GROUP_PEERS "peers"
>> +
>> +/**
>> + * enum ovpn_nl_commands - supported netlink commands
>> + */
>> +enum ovpn_nl_commands {
>> +       /**
>> +        * @OVPN_CMD_UNSPEC: unspecified command to catch errors
>> +        */
>> +       OVPN_CMD_UNSPEC = 0,
>> +
>> +       /**
>> +        * @OVPN_CMD_NEW_PEER: Configure peer with its crypto keys
>> +        */
>> +       OVPN_CMD_NEW_PEER,
>> +
>> +       /**
>> +        * @OVPN_CMD_SET_PEER: Tweak parameters for an existing peer
>> +        */
>> +       OVPN_CMD_SET_PEER,
>> +
>> +       /**
>> +        * @OVPN_CMD_DEL_PEER: Remove peer from internal table
>> +        */
>> +       OVPN_CMD_DEL_PEER,
>> +
>> +       OVPN_CMD_NEW_KEY,
>> +
>> +       OVPN_CMD_SWAP_KEYS,
>> +
>> +       OVPN_CMD_DEL_KEY,
>> +
>> +       /**
>> +        * @OVPN_CMD_REGISTER_PACKET: Register for specific packet types to be
>> +        * forwarded to userspace
>> +        */
>> +       OVPN_CMD_REGISTER_PACKET,
>> +
>> +       /**
>> +        * @OVPN_CMD_PACKET: Send a packet from userspace to kernelspace. Also
>> +        * used to send to userspace packets for which a process had registered
>> +        * with OVPN_CMD_REGISTER_PACKET
>> +        */
>> +       OVPN_CMD_PACKET,
>> +
>> +       /**
>> +        * @OVPN_CMD_GET_PEER: Retrieve the status of a peer or all peers
>> +        */
>> +       OVPN_CMD_GET_PEER,
>> +};
>> +
>> +enum ovpn_cipher_alg {
>> +       /**
>> +        * @OVPN_CIPHER_ALG_NONE: No encryption - reserved for debugging only
>> +        */
>> +       OVPN_CIPHER_ALG_NONE = 0,
>> +       /**
>> +        * @OVPN_CIPHER_ALG_AES_GCM: AES-GCM AEAD cipher with any allowed key size
>> +        */
>> +       OVPN_CIPHER_ALG_AES_GCM,
>> +       /**
>> +        * @OVPN_CIPHER_ALG_CHACHA20_POLY1305: ChaCha20Poly1305 AEAD cipher
>> +        */
>> +       OVPN_CIPHER_ALG_CHACHA20_POLY1305,
>> +};
>> +
>> +enum ovpn_del_peer_reason {
>> +       __OVPN_DEL_PEER_REASON_FIRST,
>> +       OVPN_DEL_PEER_REASON_TEARDOWN = __OVPN_DEL_PEER_REASON_FIRST,
>> +       OVPN_DEL_PEER_REASON_USERSPACE,
>> +       OVPN_DEL_PEER_REASON_EXPIRED,
>> +       OVPN_DEL_PEER_REASON_TRANSPORT_ERROR,
>> +       __OVPN_DEL_PEER_REASON_AFTER_LAST
>> +};
>> +
>> +enum ovpn_key_slot {
>> +       __OVPN_KEY_SLOT_FIRST,
>> +       OVPN_KEY_SLOT_PRIMARY = __OVPN_KEY_SLOT_FIRST,
>> +       OVPN_KEY_SLOT_SECONDARY,
>> +       __OVPN_KEY_SLOT_AFTER_LAST,
>> +};
>> +
>> +enum ovpn_netlink_attrs {
>> +       OVPN_ATTR_UNSPEC = 0,
>> +       OVPN_ATTR_IFINDEX,
>> +       OVPN_ATTR_NEW_PEER,
>> +       OVPN_ATTR_SET_PEER,
>> +       OVPN_ATTR_DEL_PEER,
> 
> What is the purpose of introducing separate attributes for each
> NEW/SET/GET/DEL operation? Why not just use a single OVPN_ATTR_PEER
> attribute?

The idea is to have a subobject for each operation. Each specific 
subobject would then contain only the specific attributes allowed for 
that object. This way attributes from different operations are not mixed.

> 
> BTW, generic netlink for some time allows you to have a dedicated set
> of attributes (and corresponding policies) for each message. So, if
> you have different object types (e.g. peers, interfaces, keys) you can
> avoid creating a common set of attributes to cover them all at once,
> but just create several attribute sets, one set per each object type
> with corresponding policies (see policy field of the genl_ops struct).

mh interesting. Any module I could look at that implements this approach?

> 
>> +       OVPN_ATTR_NEW_KEY,
>> +       OVPN_ATTR_SWAP_KEYS,
>> +       OVPN_ATTR_DEL_KEY,
>> +       OVPN_ATTR_PACKET,
>> +       OVPN_ATTR_GET_PEER,
>> +
>> +       __OVPN_ATTR_AFTER_LAST,
>> +       OVPN_ATTR_MAX = __OVPN_ATTR_AFTER_LAST - 1,
>> +};
>> +
>> +enum ovpn_netlink_key_dir_attrs {
>> +       OVPN_KEY_DIR_ATTR_UNSPEC = 0,
>> +       OVPN_KEY_DIR_ATTR_CIPHER_KEY,
>> +       OVPN_KEY_DIR_ATTR_NONCE_TAIL,
>> +
>> +       __OVPN_KEY_DIR_ATTR_AFTER_LAST,
>> +       OVPN_KEY_DIR_ATTR_MAX = __OVPN_KEY_DIR_ATTR_AFTER_LAST - 1,
>> +};
>> +
>> +enum ovpn_netlink_new_key_attrs {
>> +       OVPN_NEW_KEY_ATTR_UNSPEC = 0,
>> +       OVPN_NEW_KEY_ATTR_PEER_ID,
>> +       OVPN_NEW_KEY_ATTR_KEY_SLOT,
>> +       OVPN_NEW_KEY_ATTR_KEY_ID,
>> +       OVPN_NEW_KEY_ATTR_CIPHER_ALG,
>> +       OVPN_NEW_KEY_ATTR_ENCRYPT_KEY,
>> +       OVPN_NEW_KEY_ATTR_DECRYPT_KEY,
>> +
>> +       __OVPN_NEW_KEY_ATTR_AFTER_LAST,
>> +       OVPN_NEW_KEY_ATTR_MAX = __OVPN_NEW_KEY_ATTR_AFTER_LAST - 1,
>> +};
>> +
>> +enum ovpn_netlink_del_key_attrs {
>> +       OVPN_DEL_KEY_ATTR_UNSPEC = 0,
>> +       OVPN_DEL_KEY_ATTR_PEER_ID,
>> +       OVPN_DEL_KEY_ATTR_KEY_SLOT,
>> +
>> +       __OVPN_DEL_KEY_ATTR_AFTER_LAST,
>> +       OVPN_DEL_KEY_ATTR_MAX = __OVPN_DEL_KEY_ATTR_AFTER_LAST - 1,
>> +};
>> +
>> +enum ovpn_netlink_swap_keys_attrs {
>> +       OVPN_SWAP_KEYS_ATTR_UNSPEC = 0,
>> +       OVPN_SWAP_KEYS_ATTR_PEER_ID,
>> +
>> +       __OVPN_SWAP_KEYS_ATTR_AFTER_LAST,
>> +       OVPN_SWAP_KEYS_ATTR_MAX = __OVPN_SWAP_KEYS_ATTR_AFTER_LAST - 1,
>> +
>> +};
>> +
>> +enum ovpn_netlink_new_peer_attrs {
>> +       OVPN_NEW_PEER_ATTR_UNSPEC = 0,
>> +       OVPN_NEW_PEER_ATTR_PEER_ID,
>> +       OVPN_NEW_PEER_ATTR_SOCKADDR_REMOTE,
>> +       OVPN_NEW_PEER_ATTR_SOCKET,
>> +       OVPN_NEW_PEER_ATTR_IPV4,
>> +       OVPN_NEW_PEER_ATTR_IPV6,
>> +       OVPN_NEW_PEER_ATTR_LOCAL_IP,
>> +
>> +       __OVPN_NEW_PEER_ATTR_AFTER_LAST,
>> +       OVPN_NEW_PEER_ATTR_MAX = __OVPN_NEW_PEER_ATTR_AFTER_LAST - 1,
>> +};
>> +
>> +enum ovpn_netlink_set_peer_attrs {
>> +       OVPN_SET_PEER_ATTR_UNSPEC = 0,
>> +       OVPN_SET_PEER_ATTR_PEER_ID,
>> +       OVPN_SET_PEER_ATTR_KEEPALIVE_INTERVAL,
>> +       OVPN_SET_PEER_ATTR_KEEPALIVE_TIMEOUT,
>> +
>> +       __OVPN_SET_PEER_ATTR_AFTER_LAST,
>> +       OVPN_SET_PEER_ATTR_MAX = __OVPN_SET_PEER_ATTR_AFTER_LAST - 1,
>> +};
>> +
>> +enum ovpn_netlink_del_peer_attrs {
>> +       OVPN_DEL_PEER_ATTR_UNSPEC = 0,
>> +       OVPN_DEL_PEER_ATTR_REASON,
>> +       OVPN_DEL_PEER_ATTR_PEER_ID,
>> +
>> +       __OVPN_DEL_PEER_ATTR_AFTER_LAST,
>> +       OVPN_DEL_PEER_ATTR_MAX = __OVPN_DEL_PEER_ATTR_AFTER_LAST - 1,
>> +};
>> +
>> +enum ovpn_netlink_get_peer_attrs {
>> +       OVPN_GET_PEER_ATTR_UNSPEC = 0,
>> +       OVPN_GET_PEER_ATTR_PEER_ID,
>> +
>> +       __OVPN_GET_PEER_ATTR_AFTER_LAST,
>> +       OVPN_GET_PEER_ATTR_MAX = __OVPN_GET_PEER_ATTR_AFTER_LAST - 1,
>> +};
> 
> What is the reason to create a separate set of attributes per each
> operation? In my experience, it is easier to use a common set of
> attributes for all operations on the same object type. At least it is
> easier to manage one enum instead of four. And you are always sure
> that attributes with the same semantics (e.g. remote IP) have the same
> id in any GET/SET message.

The reason is mostly documentation.

In my experience (which is mostly wifi related), when using a single set 
of attributes is (almost) impossible to understand which attributes have 
to be sent along with a specific command.
I always had to look at the actual netlink handlers implementation.

With this approach, instead, I can point out immediately what attributes 
are related to which command.

Maybe this is not the best approach, but I wanted a way that allows a 
developer to immediately understand the ovpn-dco API without having to 
read the code in ovpn-dco/netlink.c

What do you think?

> 
>> +enum ovpn_netlink_get_peer_response_attrs {
>> +       OVPN_GET_PEER_RESP_ATTR_UNSPEC = 0,
>> +       OVPN_GET_PEER_RESP_ATTR_PEER_ID,
>> +       OVPN_GET_PEER_RESP_ATTR_SOCKADDR_REMOTE,
>> +       OVPN_GET_PEER_RESP_ATTR_IPV4,
>> +       OVPN_GET_PEER_RESP_ATTR_IPV6,
>> +       OVPN_GET_PEER_RESP_ATTR_LOCAL_IP,
>> +       OVPN_GET_PEER_RESP_ATTR_LOCAL_PORT,
>> +       OVPN_GET_PEER_RESP_ATTR_KEEPALIVE_INTERVAL,
>> +       OVPN_GET_PEER_RESP_ATTR_KEEPALIVE_TIMEOUT,
>> +       OVPN_GET_PEER_RESP_ATTR_RX_BYTES,
>> +       OVPN_GET_PEER_RESP_ATTR_TX_BYTES,
>> +       OVPN_GET_PEER_RESP_ATTR_RX_PACKETS,
>> +       OVPN_GET_PEER_RESP_ATTR_TX_PACKETS,
>> +
>> +       __OVPN_GET_PEER_RESP_ATTR_AFTER_LAST,
>> +       OVPN_GET_PEER_RESP_ATTR_MAX = __OVPN_GET_PEER_RESP_ATTR_AFTER_LAST - 1,
>> +};
>> +
>> +enum ovpn_netlink_peer_stats_attrs {
>> +       OVPN_PEER_STATS_ATTR_UNSPEC = 0,
>> +       OVPN_PEER_STATS_BYTES,
>> +       OVPN_PEER_STATS_PACKETS,
>> +
>> +       __OVPN_PEER_STATS_ATTR_AFTER_LAST,
>> +       OVPN_PEER_STATS_ATTR_MAX = __OVPN_PEER_STATS_ATTR_AFTER_LAST - 1,
>> +};
>> +
>> +enum ovpn_netlink_peer_attrs {
>> +       OVPN_PEER_ATTR_UNSPEC = 0,
>> +       OVPN_PEER_ATTR_PEER_ID,
>> +       OVPN_PEER_ATTR_SOCKADDR_REMOTE,
>> +       OVPN_PEER_ATTR_IPV4,
>> +       OVPN_PEER_ATTR_IPV6,
>> +       OVPN_PEER_ATTR_LOCAL_IP,
>> +       OVPN_PEER_ATTR_KEEPALIVE_INTERVAL,
>> +       OVPN_PEER_ATTR_KEEPALIVE_TIMEOUT,
>> +       OVPN_PEER_ATTR_ENCRYPT_KEY,
>> +       OVPN_PEER_ATTR_DECRYPT_KEY,
>> +       OVPN_PEER_ATTR_RX_STATS,
>> +       OVPN_PEER_ATTR_TX_STATS,
>> +
>> +       __OVPN_PEER_ATTR_AFTER_LAST,
>> +       OVPN_PEER_ATTR_MAX = __OVPN_PEER_ATTR_AFTER_LAST - 1,
>> +};
>> +
>> +enum ovpn_netlink_packet_attrs {
>> +       OVPN_PACKET_ATTR_UNSPEC = 0,
>> +       OVPN_PACKET_ATTR_PACKET,
>> +       OVPN_PACKET_ATTR_PEER_ID,
>> +
>> +       __OVPN_PACKET_ATTR_AFTER_LAST,
>> +       OVPN_PACKET_ATTR_MAX = __OVPN_PACKET_ATTR_AFTER_LAST - 1,
>> +};
>> +
>> +enum ovpn_ifla_attrs {
>> +       IFLA_OVPN_UNSPEC = 0,
>> +       IFLA_OVPN_MODE,
>> +
>> +       __IFLA_OVPN_AFTER_LAST,
>> +       IFLA_OVPN_MAX = __IFLA_OVPN_AFTER_LAST - 1,
>> +};
>> +
>> +enum ovpn_mode {
>> +       __OVPN_MODE_FIRST = 0,
>> +       OVPN_MODE_P2P = __OVPN_MODE_FIRST,
>> +       OVPN_MODE_MP,
>> +
>> +       __OVPN_MODE_AFTER_LAST,
>> +};
>> +
>> +#endif /* _UAPI_LINUX_OVPN_DCO_H_ */
> 
> --
> BR,
> Sergey

Thanks again!

Best Regards,


-- 
Antonio Quartulli
OpenVPN Inc.

^ permalink raw reply

* [PATCH net-next v4 3/3] selftests/net: Add sk_bind_sendto_listen and sk_connect_zero_addr
From: Joanne Koong @ 2022-08-12 20:29 UTC (permalink / raw)
  To: netdev; +Cc: edumazet, kafai, kuba, davem, pabeni, dccp, Joanne Koong
In-Reply-To: <20220812202905.1599234-1-joannelkoong@gmail.com>

This patch adds 2 new tests: sk_bind_sendto_listen and
sk_connect_zero_addr.

The sk_bind_sendto_listen test exercises the path where a socket's
rcv saddr changes after it has been added to the binding tables,
and then a listen() on the socket is invoked. The listen() should
succeed.

The sk_bind_sendto_listen test is copied over from one of syzbot's
tests: https://syzkaller.appspot.com/x/repro.c?x=1673a38df00000

The sk_connect_zero_addr test exercises the path where the socket was
never previously added to the binding tables and it gets assigned a
saddr upon a connect() to address 0.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
 tools/testing/selftests/net/.gitignore        |  2 +
 tools/testing/selftests/net/Makefile          |  2 +
 .../selftests/net/sk_bind_sendto_listen.c     | 80 +++++++++++++++++++
 .../selftests/net/sk_connect_zero_addr.c      | 62 ++++++++++++++
 4 files changed, 146 insertions(+)
 create mode 100644 tools/testing/selftests/net/sk_bind_sendto_listen.c
 create mode 100644 tools/testing/selftests/net/sk_connect_zero_addr.c

diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore
index 89e2d4aa812a..bec5cf96984c 100644
--- a/tools/testing/selftests/net/.gitignore
+++ b/tools/testing/selftests/net/.gitignore
@@ -41,3 +41,5 @@ cmsg_sender
 unix_connect
 tap
 bind_bhash
+sk_bind_sendto_listen
+sk_connect_zero_addr
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index a3a26d8c29df..e9f02850106f 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -65,6 +65,8 @@ TEST_GEN_FILES += stress_reuseport_listen
 TEST_PROGS += test_vxlan_vnifiltering.sh
 TEST_GEN_FILES += io_uring_zerocopy_tx
 TEST_GEN_FILES += bind_bhash
+TEST_GEN_PROGS += sk_bind_sendto_listen
+TEST_GEN_PROGS += sk_connect_zero_addr
 
 TEST_FILES := settings
 
diff --git a/tools/testing/selftests/net/sk_bind_sendto_listen.c b/tools/testing/selftests/net/sk_bind_sendto_listen.c
new file mode 100644
index 000000000000..b420d830f72c
--- /dev/null
+++ b/tools/testing/selftests/net/sk_bind_sendto_listen.c
@@ -0,0 +1,80 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <arpa/inet.h>
+#include <error.h>
+#include <errno.h>
+#include <unistd.h>
+
+int main(void)
+{
+	int fd1, fd2, one = 1;
+	struct sockaddr_in6 bind_addr = {
+		.sin6_family = AF_INET6,
+		.sin6_port = htons(20000),
+		.sin6_flowinfo = htonl(0),
+		.sin6_addr = {},
+		.sin6_scope_id = 0,
+	};
+
+	inet_pton(AF_INET6, "::", &bind_addr.sin6_addr);
+
+	fd1 = socket(AF_INET6, SOCK_STREAM, IPPROTO_IP);
+	if (fd1 < 0) {
+		error(1, errno, "socket fd1");
+		return -1;
+	}
+
+	if (setsockopt(fd1, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one))) {
+		error(1, errno, "setsockopt(SO_REUSEADDR) fd1");
+		goto out_err1;
+	}
+
+	if (bind(fd1, (struct sockaddr *)&bind_addr, sizeof(bind_addr))) {
+		error(1, errno, "bind fd1");
+		goto out_err1;
+	}
+
+	if (sendto(fd1, NULL, 0, MSG_FASTOPEN, (struct sockaddr *)&bind_addr,
+		   sizeof(bind_addr))) {
+		error(1, errno, "sendto fd1");
+		goto out_err1;
+	}
+
+	fd2 = socket(AF_INET6, SOCK_STREAM, IPPROTO_IP);
+	if (fd2 < 0) {
+		error(1, errno, "socket fd2");
+		goto out_err1;
+	}
+
+	if (setsockopt(fd2, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one))) {
+		error(1, errno, "setsockopt(SO_REUSEADDR) fd2");
+		goto out_err2;
+	}
+
+	if (bind(fd2, (struct sockaddr *)&bind_addr, sizeof(bind_addr))) {
+		error(1, errno, "bind fd2");
+		goto out_err2;
+	}
+
+	if (sendto(fd2, NULL, 0, MSG_FASTOPEN, (struct sockaddr *)&bind_addr,
+		   sizeof(bind_addr)) != -1) {
+		error(1, errno, "sendto fd2");
+		goto out_err2;
+	}
+
+	if (listen(fd2, 0)) {
+		error(1, errno, "listen");
+		goto out_err2;
+	}
+
+	close(fd2);
+	close(fd1);
+	return 0;
+
+out_err2:
+	close(fd2);
+
+out_err1:
+	close(fd1);
+	return -1;
+}
diff --git a/tools/testing/selftests/net/sk_connect_zero_addr.c b/tools/testing/selftests/net/sk_connect_zero_addr.c
new file mode 100644
index 000000000000..4be418aefd9f
--- /dev/null
+++ b/tools/testing/selftests/net/sk_connect_zero_addr.c
@@ -0,0 +1,62 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <arpa/inet.h>
+#include <error.h>
+#include <errno.h>
+#include <unistd.h>
+
+int main(void)
+{
+	int fd1, fd2, one = 1;
+	struct sockaddr_in6 bind_addr = {
+		.sin6_family = AF_INET6,
+		.sin6_port = htons(20000),
+		.sin6_flowinfo = htonl(0),
+		.sin6_addr = {},
+		.sin6_scope_id = 0,
+	};
+
+	inet_pton(AF_INET6, "::", &bind_addr.sin6_addr);
+
+	fd1 = socket(AF_INET6, SOCK_STREAM, IPPROTO_IP);
+	if (fd1 < 0) {
+		error(1, errno, "socket fd1");
+		return -1;
+	}
+
+	if (setsockopt(fd1, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one))) {
+		error(1, errno, "setsockopt(SO_REUSEADDR) fd1");
+		goto out_err1;
+	}
+
+	if (bind(fd1, (struct sockaddr *)&bind_addr, sizeof(bind_addr))) {
+		error(1, errno, "bind fd1");
+		goto out_err1;
+	}
+
+	if (listen(fd1, 0)) {
+		error(1, errno, "listen");
+		goto out_err1;
+	}
+
+	fd2 = socket(AF_INET6, SOCK_STREAM, IPPROTO_IP);
+	if (fd2 < 0) {
+		error(1, errno, "socket fd2");
+		goto out_err1;
+	}
+
+	if (connect(fd2, (struct sockaddr *)&bind_addr, sizeof(bind_addr))) {
+		error(1, errno, "bind fd2");
+		goto out_err2;
+	}
+
+	close(fd2);
+	close(fd1);
+	return 0;
+
+out_err2:
+	close(fd2);
+out_err1:
+	close(fd1);
+	return -1;
+}
-- 
2.30.2


^ permalink raw reply related

* [PATCH net-next v4 2/3] selftests/net: Add test for timing a bind request to a port with a populated bhash entry
From: Joanne Koong @ 2022-08-12 20:29 UTC (permalink / raw)
  To: netdev; +Cc: edumazet, kafai, kuba, davem, pabeni, dccp, Joanne Koong
In-Reply-To: <20220812202905.1599234-1-joannelkoong@gmail.com>

This test populates the bhash table for a given port with
MAX_THREADS * MAX_CONNECTIONS sockets, and then times how long
a bind request on the port takes.

When populating the bhash table, we create the sockets and then bind
the sockets to the same address and port (SO_REUSEADDR and SO_REUSEPORT
are set). When timing how long a bind on the port takes, we bind on a
different address without SO_REUSEPORT set. We do not set SO_REUSEPORT
because we are interested in the case where the bind request does not
go through the tb->fastreuseport path, which is fragile (eg
tb->fastreuseport path does not work if binding with a different uid).

To run the script:
    Usage: ./bind_bhash.sh [-6 | -4] [-p port] [-a address]
	    6: use ipv6
	    4: use ipv4
	    port: Port number
	    address: ip address

Without any arguments, ./bind_bhash.sh defaults to ipv6 using ip address
"2001:0db8:0:f101::1" on port 443.

On my local machine, I see:
ipv4:
before - 0.002317 seconds
with bhash2 - 0.000020 seconds

ipv6:
before - 0.002431 seconds
with bhash2 - 0.000021 seconds

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
 tools/testing/selftests/net/.gitignore    |   3 +-
 tools/testing/selftests/net/Makefile      |   3 +
 tools/testing/selftests/net/bind_bhash.c  | 144 ++++++++++++++++++++++
 tools/testing/selftests/net/bind_bhash.sh |  66 ++++++++++
 4 files changed, 215 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/net/bind_bhash.c
 create mode 100755 tools/testing/selftests/net/bind_bhash.sh

diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore
index 0e5751af6247..89e2d4aa812a 100644
--- a/tools/testing/selftests/net/.gitignore
+++ b/tools/testing/selftests/net/.gitignore
@@ -39,4 +39,5 @@ toeplitz
 tun
 cmsg_sender
 unix_connect
-tap
\ No newline at end of file
+tap
+bind_bhash
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index c0ee2955fe54..a3a26d8c29df 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -42,6 +42,7 @@ TEST_PROGS += arp_ndisc_evict_nocarrier.sh
 TEST_PROGS += ndisc_unsolicited_na_test.sh
 TEST_PROGS += arp_ndisc_untracked_subnets.sh
 TEST_PROGS += stress_reuseport_listen.sh
+TEST_PROGS += bind_bhash.sh
 TEST_PROGS_EXTENDED := in_netns.sh setup_loopback.sh setup_veth.sh
 TEST_PROGS_EXTENDED += toeplitz_client.sh toeplitz.sh
 TEST_GEN_FILES =  socket nettest
@@ -63,6 +64,7 @@ TEST_GEN_FILES += cmsg_sender
 TEST_GEN_FILES += stress_reuseport_listen
 TEST_PROGS += test_vxlan_vnifiltering.sh
 TEST_GEN_FILES += io_uring_zerocopy_tx
+TEST_GEN_FILES += bind_bhash
 
 TEST_FILES := settings
 
@@ -73,3 +75,4 @@ include bpf/Makefile
 $(OUTPUT)/reuseport_bpf_numa: LDLIBS += -lnuma
 $(OUTPUT)/tcp_mmap: LDLIBS += -lpthread
 $(OUTPUT)/tcp_inq: LDLIBS += -lpthread
+$(OUTPUT)/bind_bhash: LDLIBS += -lpthread
diff --git a/tools/testing/selftests/net/bind_bhash.c b/tools/testing/selftests/net/bind_bhash.c
new file mode 100644
index 000000000000..57ff67a3751e
--- /dev/null
+++ b/tools/testing/selftests/net/bind_bhash.c
@@ -0,0 +1,144 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This times how long it takes to bind to a port when the port already
+ * has multiple sockets in its bhash table.
+ *
+ * In the setup(), we populate the port's bhash table with
+ * MAX_THREADS * MAX_CONNECTIONS number of entries.
+ */
+
+#include <unistd.h>
+#include <stdio.h>
+#include <netdb.h>
+#include <pthread.h>
+#include <string.h>
+#include <stdbool.h>
+
+#define MAX_THREADS 600
+#define MAX_CONNECTIONS 40
+
+static const char *setup_addr_v6 = "::1";
+static const char *setup_addr_v4 = "127.0.0.1";
+static const char *setup_addr;
+static const char *bind_addr;
+static const char *port;
+bool use_v6;
+int ret;
+
+static int fd_array[MAX_THREADS][MAX_CONNECTIONS];
+
+static int bind_socket(int opt, const char *addr)
+{
+	struct addrinfo *res, hint = {};
+	int sock_fd, reuse = 1, err;
+	int domain = use_v6 ? AF_INET6 : AF_INET;
+
+	sock_fd = socket(domain, SOCK_STREAM, 0);
+	if (sock_fd < 0) {
+		perror("socket fd err");
+		return sock_fd;
+	}
+
+	hint.ai_family = domain;
+	hint.ai_socktype = SOCK_STREAM;
+
+	err = getaddrinfo(addr, port, &hint, &res);
+	if (err) {
+		perror("getaddrinfo failed");
+		goto cleanup;
+	}
+
+	if (opt) {
+		err = setsockopt(sock_fd, SOL_SOCKET, opt, &reuse, sizeof(reuse));
+		if (err) {
+			perror("setsockopt failed");
+			goto cleanup;
+		}
+	}
+
+	err = bind(sock_fd, res->ai_addr, res->ai_addrlen);
+	if (err) {
+		perror("failed to bind to port");
+		goto cleanup;
+	}
+
+	return sock_fd;
+
+cleanup:
+	close(sock_fd);
+	return err;
+}
+
+static void *setup(void *arg)
+{
+	int sock_fd, i;
+	int *array = (int *)arg;
+
+	for (i = 0; i < MAX_CONNECTIONS; i++) {
+		sock_fd = bind_socket(SO_REUSEADDR | SO_REUSEPORT, setup_addr);
+		if (sock_fd < 0) {
+			ret = sock_fd;
+			pthread_exit(&ret);
+		}
+		array[i] = sock_fd;
+	}
+
+	return NULL;
+}
+
+int main(int argc, const char *argv[])
+{
+	int listener_fd, sock_fd, i, j;
+	pthread_t tid[MAX_THREADS];
+	clock_t begin, end;
+
+	if (argc != 4) {
+		printf("Usage: listener <port> <ipv6 | ipv4> <bind-addr>\n");
+		return -1;
+	}
+
+	port = argv[1];
+	use_v6 = strcmp(argv[2], "ipv6") == 0;
+	bind_addr = argv[3];
+
+	setup_addr = use_v6 ? setup_addr_v6 : setup_addr_v4;
+
+	listener_fd = bind_socket(SO_REUSEADDR | SO_REUSEPORT, setup_addr);
+	if (listen(listener_fd, 100) < 0) {
+		perror("listen failed");
+		return -1;
+	}
+
+	/* Set up threads to populate the bhash table entry for the port */
+	for (i = 0; i < MAX_THREADS; i++)
+		pthread_create(&tid[i], NULL, setup, fd_array[i]);
+
+	for (i = 0; i < MAX_THREADS; i++)
+		pthread_join(tid[i], NULL);
+
+	if (ret)
+		goto done;
+
+	begin = clock();
+
+	/* Bind to the same port on a different address */
+	sock_fd  = bind_socket(0, bind_addr);
+	if (sock_fd < 0)
+		goto done;
+
+	end = clock();
+
+	printf("time spent = %f\n", (double)(end - begin) / CLOCKS_PER_SEC);
+
+	/* clean up */
+	close(sock_fd);
+
+done:
+	close(listener_fd);
+	for (i = 0; i < MAX_THREADS; i++) {
+		for (j = 0; i < MAX_THREADS; i++)
+			close(fd_array[i][j]);
+	}
+
+	return 0;
+}
diff --git a/tools/testing/selftests/net/bind_bhash.sh b/tools/testing/selftests/net/bind_bhash.sh
new file mode 100755
index 000000000000..ca0292d4b441
--- /dev/null
+++ b/tools/testing/selftests/net/bind_bhash.sh
@@ -0,0 +1,66 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+NR_FILES=32768
+SAVED_NR_FILES=$(ulimit -n)
+
+# default values
+port=443
+addr_v6="2001:0db8:0:f101::1"
+addr_v4="10.8.8.8"
+use_v6=true
+addr=""
+
+usage() {
+    echo "Usage: $0 [-6 | -4] [-p port] [-a address]"
+    echo -e "\t6: use ipv6"
+    echo -e "\t4: use ipv4"
+    echo -e "\tport: Port number"
+    echo -e "\taddress: ip address"
+}
+
+while getopts "ha:p:64" opt; do
+    case ${opt} in
+	h)
+	    usage $0
+	    exit 0
+	    ;;
+	a)  addr=$OPTARG;;
+	p)
+	    port=$OPTARG;;
+	6)
+	    use_v6=true;;
+	4)
+	    use_v6=false;;
+    esac
+done
+
+setup() {
+    if [[ "$use_v6" == true ]]; then
+	ip addr add $addr_v6 nodad dev eth0
+    else
+	ip addr add $addr_v4 dev lo
+    fi
+	ulimit -n $NR_FILES
+}
+
+cleanup() {
+    if [[ "$use_v6" == true ]]; then
+	ip addr del $addr_v6 dev eth0
+    else
+	ip addr del $addr_v4/32 dev lo
+    fi
+    ulimit -n $SAVED_NR_FILES
+}
+
+if [[ "$addr" != "" ]]; then
+    addr_v4=$addr;
+    addr_v6=$addr;
+fi
+setup
+if [[ "$use_v6" == true ]] ; then
+    ./bind_bhash $port "ipv6" $addr_v6
+else
+    ./bind_bhash $port "ipv4" $addr_v4
+fi
+cleanup
-- 
2.30.2


^ permalink raw reply related

* [PATCH net-next v4 1/3] net: Add a bhash2 table hashed by port + address
From: Joanne Koong @ 2022-08-12 20:29 UTC (permalink / raw)
  To: netdev; +Cc: edumazet, kafai, kuba, davem, pabeni, dccp, Joanne Koong
In-Reply-To: <20220812202905.1599234-1-joannelkoong@gmail.com>

The current bind hashtable (bhash) is hashed by port only.
In the socket bind path, we have to check for bind conflicts by
traversing the specified port's inet_bind_bucket while holding the
hashbucket's spinlock (see inet_csk_get_port() and
inet_csk_bind_conflict()). In instances where there are tons of
sockets hashed to the same port at different addresses, the bind
conflict check is time-intensive and can cause softirq cpu lockups,
as well as stops new tcp connections since __inet_inherit_port()
also contests for the spinlock.

This patch adds a second bind table, bhash2, that hashes by
port and sk->sk_rcv_saddr (ipv4) and sk->sk_v6_rcv_saddr (ipv6).
Searching the bhash2 table leads to significantly faster conflict
resolution and less time holding the hashbucket spinlock.

Please note a few things:
* There can be the case where the a socket's address changes after it
has been bound. There are two cases where this happens:

  1) The case where there is a bind() call on INADDR_ANY (ipv4) or
  IPV6_ADDR_ANY (ipv6) and then a connect() call. The kernel will
  assign the socket an address when it handles the connect()

  2) In inet_sk_reselect_saddr(), which is called when rebuilding the
  sk header and a few pre-conditions are met (eg rerouting fails).

In these two cases, we need to update the bhash2 table by removing the
entry for the old address, and add a new entry reflecting the updated
address.

* The bhash2 table must have its own lock, even though concurrent
accesses on the same port are protected by the bhash lock. Bhash2 must
have its own lock to protect against cases where sockets on different
ports hash to different bhash hashbuckets but to the same bhash2
hashbucket.

This brings up a few stipulations:
  1) When acquiring both the bhash and the bhash2 lock, the bhash2 lock
  will always be acquired after the bhash lock and released before the
  bhash lock is released.

  2) There are no nested bhash2 hashbucket locks. A bhash2 lock is always
  acquired+released before another bhash2 lock is acquired+released.

* The bhash table cannot be superseded by the bhash2 table because for
bind requests on INADDR_ANY (ipv4) or IPV6_ADDR_ANY (ipv6), every socket
bound to that port must be checked for a potential conflict. The bhash
table is the only source of port->socket associations.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
 include/net/inet_connection_sock.h |   3 +
 include/net/inet_hashtables.h      |  80 ++++++++-
 include/net/sock.h                 |  14 ++
 net/dccp/ipv4.c                    |  25 ++-
 net/dccp/ipv6.c                    |  18 ++
 net/dccp/proto.c                   |  34 +++-
 net/ipv4/af_inet.c                 |  26 ++-
 net/ipv4/inet_connection_sock.c    | 275 ++++++++++++++++++++++-------
 net/ipv4/inet_hashtables.c         | 268 ++++++++++++++++++++++++++--
 net/ipv4/tcp.c                     |  11 +-
 net/ipv4/tcp_ipv4.c                |  23 ++-
 net/ipv6/tcp_ipv6.c                |  17 ++
 12 files changed, 700 insertions(+), 94 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index ee88f0f1350f..c2b15f7e5516 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -25,6 +25,7 @@
 #undef INET_CSK_CLEAR_TIMERS
 
 struct inet_bind_bucket;
+struct inet_bind2_bucket;
 struct tcp_congestion_ops;
 
 /*
@@ -57,6 +58,7 @@ struct inet_connection_sock_af_ops {
  *
  * @icsk_accept_queue:	   FIFO of established children
  * @icsk_bind_hash:	   Bind node
+ * @icsk_bind2_hash:	   Bind node in the bhash2 table
  * @icsk_timeout:	   Timeout
  * @icsk_retransmit_timer: Resend (no ack)
  * @icsk_rto:		   Retransmit timeout
@@ -83,6 +85,7 @@ struct inet_connection_sock {
 	struct inet_sock	  icsk_inet;
 	struct request_sock_queue icsk_accept_queue;
 	struct inet_bind_bucket	  *icsk_bind_hash;
+	struct inet_bind2_bucket  *icsk_bind2_hash;
 	unsigned long		  icsk_timeout;
  	struct timer_list	  icsk_retransmit_timer;
  	struct timer_list	  icsk_delack_timer;
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index e9cf2157ed8a..44a419b9e3d5 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -23,6 +23,7 @@
 
 #include <net/inet_connection_sock.h>
 #include <net/inet_sock.h>
+#include <net/ip.h>
 #include <net/sock.h>
 #include <net/route.h>
 #include <net/tcp_states.h>
@@ -90,7 +91,28 @@ struct inet_bind_bucket {
 	struct hlist_head	owners;
 };
 
-static inline struct net *ib_net(struct inet_bind_bucket *ib)
+struct inet_bind2_bucket {
+	possible_net_t		ib_net;
+	int			l3mdev;
+	unsigned short		port;
+	union {
+#if IS_ENABLED(CONFIG_IPV6)
+		struct in6_addr		v6_rcv_saddr;
+#endif
+		__be32			rcv_saddr;
+	};
+	/* Node in the bhash2 inet_bind_hashbucket chain */
+	struct hlist_node	node;
+	/* List of sockets hashed to this bucket */
+	struct hlist_head	owners;
+};
+
+static inline struct net *ib_net(const struct inet_bind_bucket *ib)
+{
+	return read_pnet(&ib->ib_net);
+}
+
+static inline struct net *ib2_net(const struct inet_bind2_bucket *ib)
 {
 	return read_pnet(&ib->ib_net);
 }
@@ -133,7 +155,14 @@ struct inet_hashinfo {
 	 * TCP hash as well as the others for fast bind/connect.
 	 */
 	struct kmem_cache		*bind_bucket_cachep;
+	/* This bind table is hashed by local port */
 	struct inet_bind_hashbucket	*bhash;
+	struct kmem_cache		*bind2_bucket_cachep;
+	/* This bind table is hashed by local port and sk->sk_rcv_saddr (ipv4)
+	 * or sk->sk_v6_rcv_saddr (ipv6). This 2nd bind table is used
+	 * primarily for expediting bind conflict resolution.
+	 */
+	struct inet_bind_hashbucket	*bhash2;
 	unsigned int			bhash_size;
 
 	/* The 2nd listener table hashed by local port and address */
@@ -182,14 +211,61 @@ inet_bind_bucket_create(struct kmem_cache *cachep, struct net *net,
 void inet_bind_bucket_destroy(struct kmem_cache *cachep,
 			      struct inet_bind_bucket *tb);
 
+bool inet_bind_bucket_match(const struct inet_bind_bucket *tb,
+			    const struct net *net, unsigned short port,
+			    int l3mdev);
+
+struct inet_bind2_bucket *
+inet_bind2_bucket_create(struct kmem_cache *cachep, struct net *net,
+			 struct inet_bind_hashbucket *head,
+			 unsigned short port, int l3mdev,
+			 const struct sock *sk);
+
+void inet_bind2_bucket_destroy(struct kmem_cache *cachep,
+			       struct inet_bind2_bucket *tb);
+
+struct inet_bind2_bucket *
+inet_bind2_bucket_find(const struct inet_bind_hashbucket *head,
+		       const struct net *net,
+		       unsigned short port, int l3mdev,
+		       const struct sock *sk);
+
+bool inet_bind2_bucket_match_addr_any(const struct inet_bind2_bucket *tb,
+				      const struct net *net, unsigned short port,
+				      int l3mdev, const struct sock *sk);
+
 static inline u32 inet_bhashfn(const struct net *net, const __u16 lport,
 			       const u32 bhash_size)
 {
 	return (lport + net_hash_mix(net)) & (bhash_size - 1);
 }
 
+static inline struct inet_bind_hashbucket *
+inet_bhashfn_portaddr(const struct inet_hashinfo *hinfo, const struct sock *sk,
+		      const struct net *net, unsigned short port)
+{
+	u32 hash;
+
+#if IS_ENABLED(CONFIG_IPV6)
+	if (sk->sk_family == AF_INET6)
+		hash = ipv6_portaddr_hash(net, &sk->sk_v6_rcv_saddr, port);
+	else
+#endif
+		hash = ipv4_portaddr_hash(net, sk->sk_rcv_saddr, port);
+	return &hinfo->bhash2[hash & (hinfo->bhash_size - 1)];
+}
+
+struct inet_bind_hashbucket *
+inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, int port);
+
+/* This should be called whenever a socket's sk_rcv_saddr (ipv4) or
+ * sk_v6_rcv_saddr (ipv6) changes after it has been binded. The socket's
+ * rcv_saddr field should already have been updated when this is called.
+ */
+int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct sock *sk);
+
 void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
-		    const unsigned short snum);
+		    struct inet_bind2_bucket *tb2, unsigned short port);
 
 /* Caller must disable local BH processing. */
 int __inet_inherit_port(const struct sock *sk, struct sock *child);
diff --git a/include/net/sock.h b/include/net/sock.h
index 05a1bbdf5805..5efcd0dd9a93 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -348,6 +348,7 @@ struct sk_filter;
   *	@sk_txtime_report_errors: set report errors mode for SO_TXTIME
   *	@sk_txtime_unused: unused txtime flags
   *	@ns_tracker: tracker for netns reference
+  *	@sk_bind2_node: bind node in the bhash2 table
   */
 struct sock {
 	/*
@@ -537,6 +538,7 @@ struct sock {
 #endif
 	struct rcu_head		sk_rcu;
 	netns_tracker		ns_tracker;
+	struct hlist_node	sk_bind2_node;
 };
 
 enum sk_pacing {
@@ -845,6 +847,16 @@ static inline void sk_add_bind_node(struct sock *sk,
 	hlist_add_head(&sk->sk_bind_node, list);
 }
 
+static inline void __sk_del_bind2_node(struct sock *sk)
+{
+	__hlist_del(&sk->sk_bind2_node);
+}
+
+static inline void sk_add_bind2_node(struct sock *sk, struct hlist_head *list)
+{
+	hlist_add_head(&sk->sk_bind2_node, list);
+}
+
 #define sk_for_each(__sk, list) \
 	hlist_for_each_entry(__sk, list, sk_node)
 #define sk_for_each_rcu(__sk, list) \
@@ -862,6 +874,8 @@ static inline void sk_add_bind_node(struct sock *sk,
 	hlist_for_each_entry_safe(__sk, tmp, list, sk_node)
 #define sk_for_each_bound(__sk, list) \
 	hlist_for_each_entry(__sk, list, sk_bind_node)
+#define sk_for_each_bound_bhash2(__sk, list) \
+	hlist_for_each_entry(__sk, list, sk_bind2_node)
 
 /**
  * sk_for_each_entry_offset_rcu - iterate over a list at a given struct offset
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index da6e3b20cd75..6a6e121dc00c 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -45,10 +45,11 @@ static unsigned int dccp_v4_pernet_id __read_mostly;
 int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 {
 	const struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
+	struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
+	__be32 daddr, nexthop, prev_sk_rcv_saddr;
 	struct inet_sock *inet = inet_sk(sk);
 	struct dccp_sock *dp = dccp_sk(sk);
 	__be16 orig_sport, orig_dport;
-	__be32 daddr, nexthop;
 	struct flowi4 *fl4;
 	struct rtable *rt;
 	int err;
@@ -89,9 +90,29 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	if (inet_opt == NULL || !inet_opt->opt.srr)
 		daddr = fl4->daddr;
 
-	if (inet->inet_saddr == 0)
+	if (inet->inet_saddr == 0) {
+		if (inet_csk(sk)->icsk_bind2_hash) {
+			prev_addr_hashbucket =
+				inet_bhashfn_portaddr(&dccp_hashinfo, sk,
+						      sock_net(sk),
+						      inet->inet_num);
+			prev_sk_rcv_saddr = sk->sk_rcv_saddr;
+		}
 		inet->inet_saddr = fl4->saddr;
+	}
+
 	sk_rcv_saddr_set(sk, inet->inet_saddr);
+
+	if (prev_addr_hashbucket) {
+		err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
+		if (err) {
+			inet->inet_saddr = 0;
+			sk_rcv_saddr_set(sk, prev_sk_rcv_saddr);
+			ip_rt_put(rt);
+			return err;
+		}
+	}
+
 	inet->inet_dport = usin->sin_port;
 	sk_daddr_set(sk, daddr);
 
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index fd44638ec16b..e57b43006074 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -934,8 +934,26 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	}
 
 	if (saddr == NULL) {
+		struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
+		struct in6_addr prev_v6_rcv_saddr;
+
+		if (icsk->icsk_bind2_hash) {
+			prev_addr_hashbucket = inet_bhashfn_portaddr(&dccp_hashinfo,
+								     sk, sock_net(sk),
+								     inet->inet_num);
+			prev_v6_rcv_saddr = sk->sk_v6_rcv_saddr;
+		}
+
 		saddr = &fl6.saddr;
 		sk->sk_v6_rcv_saddr = *saddr;
+
+		if (prev_addr_hashbucket) {
+			err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
+			if (err) {
+				sk->sk_v6_rcv_saddr = prev_v6_rcv_saddr;
+				goto failure;
+			}
+		}
 	}
 
 	/* set the source address */
diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index e13641c65f88..7cd4a6cc99fc 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -1120,6 +1120,12 @@ static int __init dccp_init(void)
 				  SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL);
 	if (!dccp_hashinfo.bind_bucket_cachep)
 		goto out_free_hashinfo2;
+	dccp_hashinfo.bind2_bucket_cachep =
+		kmem_cache_create("dccp_bind2_bucket",
+				  sizeof(struct inet_bind2_bucket), 0,
+				  SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL);
+	if (!dccp_hashinfo.bind2_bucket_cachep)
+		goto out_free_bind_bucket_cachep;
 
 	/*
 	 * Size and allocate the main established and bind bucket
@@ -1150,7 +1156,7 @@ static int __init dccp_init(void)
 
 	if (!dccp_hashinfo.ehash) {
 		DCCP_CRIT("Failed to allocate DCCP established hash table");
-		goto out_free_bind_bucket_cachep;
+		goto out_free_bind2_bucket_cachep;
 	}
 
 	for (i = 0; i <= dccp_hashinfo.ehash_mask; i++)
@@ -1176,14 +1182,24 @@ static int __init dccp_init(void)
 		goto out_free_dccp_locks;
 	}
 
+	dccp_hashinfo.bhash2 = (struct inet_bind_hashbucket *)
+		__get_free_pages(GFP_ATOMIC | __GFP_NOWARN, bhash_order);
+
+	if (!dccp_hashinfo.bhash2) {
+		DCCP_CRIT("Failed to allocate DCCP bind2 hash table");
+		goto out_free_dccp_bhash;
+	}
+
 	for (i = 0; i < dccp_hashinfo.bhash_size; i++) {
 		spin_lock_init(&dccp_hashinfo.bhash[i].lock);
 		INIT_HLIST_HEAD(&dccp_hashinfo.bhash[i].chain);
+		spin_lock_init(&dccp_hashinfo.bhash2[i].lock);
+		INIT_HLIST_HEAD(&dccp_hashinfo.bhash2[i].chain);
 	}
 
 	rc = dccp_mib_init();
 	if (rc)
-		goto out_free_dccp_bhash;
+		goto out_free_dccp_bhash2;
 
 	rc = dccp_ackvec_init();
 	if (rc)
@@ -1207,30 +1223,38 @@ static int __init dccp_init(void)
 	dccp_ackvec_exit();
 out_free_dccp_mib:
 	dccp_mib_exit();
+out_free_dccp_bhash2:
+	free_pages((unsigned long)dccp_hashinfo.bhash2, bhash_order);
 out_free_dccp_bhash:
 	free_pages((unsigned long)dccp_hashinfo.bhash, bhash_order);
 out_free_dccp_locks:
 	inet_ehash_locks_free(&dccp_hashinfo);
 out_free_dccp_ehash:
 	free_pages((unsigned long)dccp_hashinfo.ehash, ehash_order);
+out_free_bind2_bucket_cachep:
+	kmem_cache_destroy(dccp_hashinfo.bind2_bucket_cachep);
 out_free_bind_bucket_cachep:
 	kmem_cache_destroy(dccp_hashinfo.bind_bucket_cachep);
 out_free_hashinfo2:
 	inet_hashinfo2_free_mod(&dccp_hashinfo);
 out_fail:
 	dccp_hashinfo.bhash = NULL;
+	dccp_hashinfo.bhash2 = NULL;
 	dccp_hashinfo.ehash = NULL;
 	dccp_hashinfo.bind_bucket_cachep = NULL;
+	dccp_hashinfo.bind2_bucket_cachep = NULL;
 	return rc;
 }
 
 static void __exit dccp_fini(void)
 {
+	int bhash_order = get_order(dccp_hashinfo.bhash_size *
+				    sizeof(struct inet_bind_hashbucket));
+
 	ccid_cleanup_builtins();
 	dccp_mib_exit();
-	free_pages((unsigned long)dccp_hashinfo.bhash,
-		   get_order(dccp_hashinfo.bhash_size *
-			     sizeof(struct inet_bind_hashbucket)));
+	free_pages((unsigned long)dccp_hashinfo.bhash, bhash_order);
+	free_pages((unsigned long)dccp_hashinfo.bhash2, bhash_order);
 	free_pages((unsigned long)dccp_hashinfo.ehash,
 		   get_order((dccp_hashinfo.ehash_mask + 1) *
 			     sizeof(struct inet_ehash_bucket)));
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 3ca0cc467886..2e6a3cb06815 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1219,6 +1219,7 @@ EXPORT_SYMBOL(inet_unregister_protosw);
 
 static int inet_sk_reselect_saddr(struct sock *sk)
 {
+	struct inet_bind_hashbucket *prev_addr_hashbucket;
 	struct inet_sock *inet = inet_sk(sk);
 	__be32 old_saddr = inet->inet_saddr;
 	__be32 daddr = inet->inet_daddr;
@@ -1226,6 +1227,7 @@ static int inet_sk_reselect_saddr(struct sock *sk)
 	struct rtable *rt;
 	__be32 new_saddr;
 	struct ip_options_rcu *inet_opt;
+	int err;
 
 	inet_opt = rcu_dereference_protected(inet->inet_opt,
 					     lockdep_sock_is_held(sk));
@@ -1240,20 +1242,34 @@ static int inet_sk_reselect_saddr(struct sock *sk)
 	if (IS_ERR(rt))
 		return PTR_ERR(rt);
 
-	sk_setup_caps(sk, &rt->dst);
-
 	new_saddr = fl4->saddr;
 
-	if (new_saddr == old_saddr)
+	if (new_saddr == old_saddr) {
+		sk_setup_caps(sk, &rt->dst);
 		return 0;
+	}
+
+	prev_addr_hashbucket =
+		inet_bhashfn_portaddr(sk->sk_prot->h.hashinfo, sk,
+				      sock_net(sk), inet->inet_num);
+
+	inet->inet_saddr = inet->inet_rcv_saddr = new_saddr;
+
+	err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
+	if (err) {
+		inet->inet_saddr = old_saddr;
+		inet->inet_rcv_saddr = old_saddr;
+		ip_rt_put(rt);
+		return err;
+	}
+
+	sk_setup_caps(sk, &rt->dst);
 
 	if (READ_ONCE(sock_net(sk)->ipv4.sysctl_ip_dynaddr) > 1) {
 		pr_info("%s(): shifting inet->saddr from %pI4 to %pI4\n",
 			__func__, &old_saddr, &new_saddr);
 	}
 
-	inet->inet_saddr = inet->inet_rcv_saddr = new_saddr;
-
 	/*
 	 * XXX The only one ugly spot where we need to
 	 * XXX really change the sockets identity after
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index eb31c7158b39..f0038043b661 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -130,14 +130,75 @@ void inet_get_local_port_range(struct net *net, int *low, int *high)
 }
 EXPORT_SYMBOL(inet_get_local_port_range);
 
+static bool inet_use_bhash2_on_bind(const struct sock *sk)
+{
+#if IS_ENABLED(CONFIG_IPV6)
+	if (sk->sk_family == AF_INET6) {
+		int addr_type = ipv6_addr_type(&sk->sk_v6_rcv_saddr);
+
+		return addr_type != IPV6_ADDR_ANY &&
+			addr_type != IPV6_ADDR_MAPPED;
+	}
+#endif
+	return sk->sk_rcv_saddr != htonl(INADDR_ANY);
+}
+
+static bool inet_bind_conflict(const struct sock *sk, struct sock *sk2,
+			       kuid_t sk_uid, bool relax,
+			       bool reuseport_cb_ok, bool reuseport_ok)
+{
+	int bound_dev_if2;
+
+	if (sk == sk2)
+		return false;
+
+	bound_dev_if2 = READ_ONCE(sk2->sk_bound_dev_if);
+
+	if (!sk->sk_bound_dev_if || !bound_dev_if2 ||
+	    sk->sk_bound_dev_if == bound_dev_if2) {
+		if (sk->sk_reuse && sk2->sk_reuse &&
+		    sk2->sk_state != TCP_LISTEN) {
+			if (!relax || (!reuseport_ok && sk->sk_reuseport &&
+				       sk2->sk_reuseport && reuseport_cb_ok &&
+				       (sk2->sk_state == TCP_TIME_WAIT ||
+					uid_eq(sk_uid, sock_i_uid(sk2)))))
+				return true;
+		} else if (!reuseport_ok || !sk->sk_reuseport ||
+			   !sk2->sk_reuseport || !reuseport_cb_ok ||
+			   (sk2->sk_state != TCP_TIME_WAIT &&
+			    !uid_eq(sk_uid, sock_i_uid(sk2)))) {
+			return true;
+		}
+	}
+	return false;
+}
+
+static bool inet_bhash2_conflict(const struct sock *sk,
+				 const struct inet_bind2_bucket *tb2,
+				 kuid_t sk_uid,
+				 bool relax, bool reuseport_cb_ok,
+				 bool reuseport_ok)
+{
+	struct sock *sk2;
+
+	sk_for_each_bound_bhash2(sk2, &tb2->owners) {
+		if (sk->sk_family == AF_INET && ipv6_only_sock(sk2))
+			continue;
+
+		if (inet_bind_conflict(sk, sk2, sk_uid, relax,
+				       reuseport_cb_ok, reuseport_ok))
+			return true;
+	}
+	return false;
+}
+
+/* This should be called only when the tb and tb2 hashbuckets' locks are held */
 static int inet_csk_bind_conflict(const struct sock *sk,
 				  const struct inet_bind_bucket *tb,
+				  const struct inet_bind2_bucket *tb2, /* may be null */
 				  bool relax, bool reuseport_ok)
 {
-	struct sock *sk2;
 	bool reuseport_cb_ok;
-	bool reuse = sk->sk_reuse;
-	bool reuseport = !!sk->sk_reuseport;
 	struct sock_reuseport *reuseport_cb;
 	kuid_t uid = sock_i_uid((struct sock *)sk);
 
@@ -150,55 +211,87 @@ static int inet_csk_bind_conflict(const struct sock *sk,
 	/*
 	 * Unlike other sk lookup places we do not check
 	 * for sk_net here, since _all_ the socks listed
-	 * in tb->owners list belong to the same net - the
-	 * one this bucket belongs to.
+	 * in tb->owners and tb2->owners list belong
+	 * to the same net - the one this bucket belongs to.
 	 */
 
-	sk_for_each_bound(sk2, &tb->owners) {
-		int bound_dev_if2;
+	if (!inet_use_bhash2_on_bind(sk)) {
+		struct sock *sk2;
 
-		if (sk == sk2)
-			continue;
-		bound_dev_if2 = READ_ONCE(sk2->sk_bound_dev_if);
-		if ((!sk->sk_bound_dev_if ||
-		     !bound_dev_if2 ||
-		     sk->sk_bound_dev_if == bound_dev_if2)) {
-			if (reuse && sk2->sk_reuse &&
-			    sk2->sk_state != TCP_LISTEN) {
-				if ((!relax ||
-				     (!reuseport_ok &&
-				      reuseport && sk2->sk_reuseport &&
-				      reuseport_cb_ok &&
-				      (sk2->sk_state == TCP_TIME_WAIT ||
-				       uid_eq(uid, sock_i_uid(sk2))))) &&
-				    inet_rcv_saddr_equal(sk, sk2, true))
-					break;
-			} else if (!reuseport_ok ||
-				   !reuseport || !sk2->sk_reuseport ||
-				   !reuseport_cb_ok ||
-				   (sk2->sk_state != TCP_TIME_WAIT &&
-				    !uid_eq(uid, sock_i_uid(sk2)))) {
-				if (inet_rcv_saddr_equal(sk, sk2, true))
-					break;
-			}
-		}
+		sk_for_each_bound(sk2, &tb->owners)
+			if (inet_bind_conflict(sk, sk2, uid, relax,
+					       reuseport_cb_ok, reuseport_ok) &&
+			    inet_rcv_saddr_equal(sk, sk2, true))
+				return true;
+
+		return false;
+	}
+
+	/* Conflicts with an existing IPV6_ADDR_ANY (if ipv6) or INADDR_ANY (if
+	 * ipv4) should have been checked already. We need to do these two
+	 * checks separately because their spinlocks have to be acquired/released
+	 * independently of each other, to prevent possible deadlocks
+	 */
+	return tb2 && inet_bhash2_conflict(sk, tb2, uid, relax, reuseport_cb_ok,
+					   reuseport_ok);
+}
+
+/* Determine if there is a bind conflict with an existing IPV6_ADDR_ANY (if ipv6) or
+ * INADDR_ANY (if ipv4) socket.
+ *
+ * Caller must hold bhash hashbucket lock with local bh disabled, to protect
+ * against concurrent binds on the port for addr any
+ */
+static bool inet_bhash2_addr_any_conflict(const struct sock *sk, int port, int l3mdev,
+					  bool relax, bool reuseport_ok)
+{
+	kuid_t uid = sock_i_uid((struct sock *)sk);
+	const struct net *net = sock_net(sk);
+	struct sock_reuseport *reuseport_cb;
+	struct inet_bind_hashbucket *head2;
+	struct inet_bind2_bucket *tb2;
+	bool reuseport_cb_ok;
+
+	rcu_read_lock();
+	reuseport_cb = rcu_dereference(sk->sk_reuseport_cb);
+	/* paired with WRITE_ONCE() in __reuseport_(add|detach)_closed_sock */
+	reuseport_cb_ok = !reuseport_cb || READ_ONCE(reuseport_cb->num_closed_socks);
+	rcu_read_unlock();
+
+	head2 = inet_bhash2_addr_any_hashbucket(sk, net, port);
+
+	spin_lock(&head2->lock);
+
+	inet_bind_bucket_for_each(tb2, &head2->chain)
+		if (inet_bind2_bucket_match_addr_any(tb2, net, port, l3mdev, sk))
+			break;
+
+	if (tb2 && inet_bhash2_conflict(sk, tb2, uid, relax, reuseport_cb_ok,
+					reuseport_ok)) {
+		spin_unlock(&head2->lock);
+		return true;
 	}
-	return sk2 != NULL;
+
+	spin_unlock(&head2->lock);
+	return false;
 }
 
 /*
  * Find an open port number for the socket.  Returns with the
- * inet_bind_hashbucket lock held.
+ * inet_bind_hashbucket locks held if successful.
  */
 static struct inet_bind_hashbucket *
-inet_csk_find_open_port(struct sock *sk, struct inet_bind_bucket **tb_ret, int *port_ret)
+inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret,
+			struct inet_bind2_bucket **tb2_ret,
+			struct inet_bind_hashbucket **head2_ret, int *port_ret)
 {
 	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
 	int port = 0;
-	struct inet_bind_hashbucket *head;
+	struct inet_bind_hashbucket *head, *head2;
 	struct net *net = sock_net(sk);
 	bool relax = false;
 	int i, low, high, attempt_half;
+	struct inet_bind2_bucket *tb2;
 	struct inet_bind_bucket *tb;
 	u32 remaining, offset;
 	int l3mdev;
@@ -239,11 +332,20 @@ inet_csk_find_open_port(struct sock *sk, struct inet_bind_bucket **tb_ret, int *
 		head = &hinfo->bhash[inet_bhashfn(net, port,
 						  hinfo->bhash_size)];
 		spin_lock_bh(&head->lock);
+		if (inet_use_bhash2_on_bind(sk)) {
+			if (inet_bhash2_addr_any_conflict(sk, port, l3mdev, relax, false))
+				goto next_port;
+		}
+
+		head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
+		spin_lock(&head2->lock);
+		tb2 = inet_bind2_bucket_find(head2, net, port, l3mdev, sk);
 		inet_bind_bucket_for_each(tb, &head->chain)
-			if (net_eq(ib_net(tb), net) && tb->l3mdev == l3mdev &&
-			    tb->port == port) {
-				if (!inet_csk_bind_conflict(sk, tb, relax, false))
+			if (inet_bind_bucket_match(tb, net, port, l3mdev)) {
+				if (!inet_csk_bind_conflict(sk, tb, tb2,
+							    relax, false))
 					goto success;
+				spin_unlock(&head2->lock);
 				goto next_port;
 			}
 		tb = NULL;
@@ -272,6 +374,8 @@ inet_csk_find_open_port(struct sock *sk, struct inet_bind_bucket **tb_ret, int *
 success:
 	*port_ret = port;
 	*tb_ret = tb;
+	*tb2_ret = tb2;
+	*head2_ret = head2;
 	return head;
 }
 
@@ -368,53 +472,95 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 	bool reuse = sk->sk_reuse && sk->sk_state != TCP_LISTEN;
 	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
 	int ret = 1, port = snum;
-	struct inet_bind_hashbucket *head;
 	struct net *net = sock_net(sk);
+	bool found_port = false, check_bind_conflict = true;
+	bool bhash_created = false, bhash2_created = false;
+	struct inet_bind_hashbucket *head, *head2;
+	struct inet_bind2_bucket *tb2 = NULL;
 	struct inet_bind_bucket *tb = NULL;
+	bool head2_lock_acquired = false;
 	int l3mdev;
 
 	l3mdev = inet_sk_bound_l3mdev(sk);
 
 	if (!port) {
-		head = inet_csk_find_open_port(sk, &tb, &port);
+		head = inet_csk_find_open_port(sk, &tb, &tb2, &head2, &port);
 		if (!head)
 			return ret;
+
+		head2_lock_acquired = true;
+
+		if (tb && tb2)
+			goto success;
+		found_port = true;
+	} else {
+		head = &hinfo->bhash[inet_bhashfn(net, port,
+						  hinfo->bhash_size)];
+		spin_lock_bh(&head->lock);
+		inet_bind_bucket_for_each(tb, &head->chain)
+			if (inet_bind_bucket_match(tb, net, port, l3mdev))
+				break;
+	}
+
+	if (!tb) {
+		tb = inet_bind_bucket_create(hinfo->bind_bucket_cachep, net,
+					     head, port, l3mdev);
 		if (!tb)
-			goto tb_not_found;
-		goto success;
+			goto fail_unlock;
+		bhash_created = true;
 	}
-	head = &hinfo->bhash[inet_bhashfn(net, port,
-					  hinfo->bhash_size)];
-	spin_lock_bh(&head->lock);
-	inet_bind_bucket_for_each(tb, &head->chain)
-		if (net_eq(ib_net(tb), net) && tb->l3mdev == l3mdev &&
-		    tb->port == port)
-			goto tb_found;
-tb_not_found:
-	tb = inet_bind_bucket_create(hinfo->bind_bucket_cachep,
-				     net, head, port, l3mdev);
-	if (!tb)
-		goto fail_unlock;
-tb_found:
-	if (!hlist_empty(&tb->owners)) {
-		if (sk->sk_reuse == SK_FORCE_REUSE)
-			goto success;
 
-		if ((tb->fastreuse > 0 && reuse) ||
-		    sk_reuseport_match(tb, sk))
-			goto success;
-		if (inet_csk_bind_conflict(sk, tb, true, true))
+	if (!found_port) {
+		if (!hlist_empty(&tb->owners)) {
+			if (sk->sk_reuse == SK_FORCE_REUSE ||
+			    (tb->fastreuse > 0 && reuse) ||
+			    sk_reuseport_match(tb, sk))
+				check_bind_conflict = false;
+		}
+
+		if (check_bind_conflict && inet_use_bhash2_on_bind(sk)) {
+			if (inet_bhash2_addr_any_conflict(sk, port, l3mdev, true, true))
+				goto fail_unlock;
+		}
+
+		head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
+		spin_lock(&head2->lock);
+		head2_lock_acquired = true;
+		tb2 = inet_bind2_bucket_find(head2, net, port, l3mdev, sk);
+	}
+
+	if (!tb2) {
+		tb2 = inet_bind2_bucket_create(hinfo->bind2_bucket_cachep,
+					       net, head2, port, l3mdev, sk);
+		if (!tb2)
 			goto fail_unlock;
+		bhash2_created = true;
 	}
+
+	if (!found_port && check_bind_conflict) {
+		if (inet_csk_bind_conflict(sk, tb, tb2, true, true))
+			goto fail_unlock;
+	}
+
 success:
 	inet_csk_update_fastreuse(tb, sk);
 
 	if (!inet_csk(sk)->icsk_bind_hash)
-		inet_bind_hash(sk, tb, port);
+		inet_bind_hash(sk, tb, tb2, port);
 	WARN_ON(inet_csk(sk)->icsk_bind_hash != tb);
+	WARN_ON(inet_csk(sk)->icsk_bind2_hash != tb2);
 	ret = 0;
 
 fail_unlock:
+	if (ret) {
+		if (bhash_created)
+			inet_bind_bucket_destroy(hinfo->bind_bucket_cachep, tb);
+		if (bhash2_created)
+			inet_bind2_bucket_destroy(hinfo->bind2_bucket_cachep,
+						  tb2);
+	}
+	if (head2_lock_acquired)
+		spin_unlock(&head2->lock);
 	spin_unlock_bh(&head->lock);
 	return ret;
 }
@@ -962,6 +1108,7 @@ struct sock *inet_csk_clone_lock(const struct sock *sk,
 
 		inet_sk_set_state(newsk, TCP_SYN_RECV);
 		newicsk->icsk_bind_hash = NULL;
+		newicsk->icsk_bind2_hash = NULL;
 
 		inet_sk(newsk)->inet_dport = inet_rsk(req)->ir_rmt_port;
 		inet_sk(newsk)->inet_num = inet_rsk(req)->ir_num;
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index b9d995b5ce24..60d77e234a68 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -92,12 +92,75 @@ void inet_bind_bucket_destroy(struct kmem_cache *cachep, struct inet_bind_bucket
 	}
 }
 
+bool inet_bind_bucket_match(const struct inet_bind_bucket *tb, const struct net *net,
+			    unsigned short port, int l3mdev)
+{
+	return net_eq(ib_net(tb), net) && tb->port == port &&
+		tb->l3mdev == l3mdev;
+}
+
+static void inet_bind2_bucket_init(struct inet_bind2_bucket *tb,
+				   struct net *net,
+				   struct inet_bind_hashbucket *head,
+				   unsigned short port, int l3mdev,
+				   const struct sock *sk)
+{
+	write_pnet(&tb->ib_net, net);
+	tb->l3mdev    = l3mdev;
+	tb->port      = port;
+#if IS_ENABLED(CONFIG_IPV6)
+	if (sk->sk_family == AF_INET6)
+		tb->v6_rcv_saddr = sk->sk_v6_rcv_saddr;
+	else
+#endif
+		tb->rcv_saddr = sk->sk_rcv_saddr;
+	INIT_HLIST_HEAD(&tb->owners);
+	hlist_add_head(&tb->node, &head->chain);
+}
+
+struct inet_bind2_bucket *inet_bind2_bucket_create(struct kmem_cache *cachep,
+						   struct net *net,
+						   struct inet_bind_hashbucket *head,
+						   unsigned short port,
+						   int l3mdev,
+						   const struct sock *sk)
+{
+	struct inet_bind2_bucket *tb = kmem_cache_alloc(cachep, GFP_ATOMIC);
+
+	if (tb)
+		inet_bind2_bucket_init(tb, net, head, port, l3mdev, sk);
+
+	return tb;
+}
+
+/* Caller must hold hashbucket lock for this tb with local BH disabled */
+void inet_bind2_bucket_destroy(struct kmem_cache *cachep, struct inet_bind2_bucket *tb)
+{
+	if (hlist_empty(&tb->owners)) {
+		__hlist_del(&tb->node);
+		kmem_cache_free(cachep, tb);
+	}
+}
+
+static bool inet_bind2_bucket_addr_match(const struct inet_bind2_bucket *tb2,
+					 const struct sock *sk)
+{
+#if IS_ENABLED(CONFIG_IPV6)
+	if (sk->sk_family == AF_INET6)
+		return ipv6_addr_equal(&tb2->v6_rcv_saddr,
+				       &sk->sk_v6_rcv_saddr);
+#endif
+	return tb2->rcv_saddr == sk->sk_rcv_saddr;
+}
+
 void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
-		    const unsigned short snum)
+		    struct inet_bind2_bucket *tb2, unsigned short port)
 {
-	inet_sk(sk)->inet_num = snum;
+	inet_sk(sk)->inet_num = port;
 	sk_add_bind_node(sk, &tb->owners);
 	inet_csk(sk)->icsk_bind_hash = tb;
+	sk_add_bind2_node(sk, &tb2->owners);
+	inet_csk(sk)->icsk_bind2_hash = tb2;
 }
 
 /*
@@ -109,6 +172,9 @@ static void __inet_put_port(struct sock *sk)
 	const int bhash = inet_bhashfn(sock_net(sk), inet_sk(sk)->inet_num,
 			hashinfo->bhash_size);
 	struct inet_bind_hashbucket *head = &hashinfo->bhash[bhash];
+	struct inet_bind_hashbucket *head2 =
+		inet_bhashfn_portaddr(hashinfo, sk, sock_net(sk),
+				      inet_sk(sk)->inet_num);
 	struct inet_bind_bucket *tb;
 
 	spin_lock(&head->lock);
@@ -117,6 +183,17 @@ static void __inet_put_port(struct sock *sk)
 	inet_csk(sk)->icsk_bind_hash = NULL;
 	inet_sk(sk)->inet_num = 0;
 	inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, tb);
+
+	spin_lock(&head2->lock);
+	if (inet_csk(sk)->icsk_bind2_hash) {
+		struct inet_bind2_bucket *tb2 = inet_csk(sk)->icsk_bind2_hash;
+
+		__sk_del_bind2_node(sk);
+		inet_csk(sk)->icsk_bind2_hash = NULL;
+		inet_bind2_bucket_destroy(hashinfo->bind2_bucket_cachep, tb2);
+	}
+	spin_unlock(&head2->lock);
+
 	spin_unlock(&head->lock);
 }
 
@@ -135,12 +212,21 @@ int __inet_inherit_port(const struct sock *sk, struct sock *child)
 	const int bhash = inet_bhashfn(sock_net(sk), port,
 			table->bhash_size);
 	struct inet_bind_hashbucket *head = &table->bhash[bhash];
+	struct inet_bind_hashbucket *head2 =
+		inet_bhashfn_portaddr(table, child, sock_net(sk), port);
+	bool created_inet_bind_bucket = false;
+	bool update_fastreuse = false;
+	struct net *net = sock_net(sk);
+	struct inet_bind2_bucket *tb2;
 	struct inet_bind_bucket *tb;
 	int l3mdev;
 
 	spin_lock(&head->lock);
+	spin_lock(&head2->lock);
 	tb = inet_csk(sk)->icsk_bind_hash;
-	if (unlikely(!tb)) {
+	tb2 = inet_csk(sk)->icsk_bind2_hash;
+	if (unlikely(!tb || !tb2)) {
+		spin_unlock(&head2->lock);
 		spin_unlock(&head->lock);
 		return -ENOENT;
 	}
@@ -153,25 +239,49 @@ int __inet_inherit_port(const struct sock *sk, struct sock *child)
 		 * as that of the child socket. We have to look up or
 		 * create a new bind bucket for the child here. */
 		inet_bind_bucket_for_each(tb, &head->chain) {
-			if (net_eq(ib_net(tb), sock_net(sk)) &&
-			    tb->l3mdev == l3mdev && tb->port == port)
+			if (inet_bind_bucket_match(tb, net, port, l3mdev))
 				break;
 		}
 		if (!tb) {
 			tb = inet_bind_bucket_create(table->bind_bucket_cachep,
-						     sock_net(sk), head, port,
-						     l3mdev);
+						     net, head, port, l3mdev);
 			if (!tb) {
+				spin_unlock(&head2->lock);
 				spin_unlock(&head->lock);
 				return -ENOMEM;
 			}
+			created_inet_bind_bucket = true;
+		}
+		update_fastreuse = true;
+
+		goto bhash2_find;
+	} else if (!inet_bind2_bucket_addr_match(tb2, child)) {
+		l3mdev = inet_sk_bound_l3mdev(sk);
+
+bhash2_find:
+		tb2 = inet_bind2_bucket_find(head2, net, port, l3mdev, child);
+		if (!tb2) {
+			tb2 = inet_bind2_bucket_create(table->bind2_bucket_cachep,
+						       net, head2, port,
+						       l3mdev, child);
+			if (!tb2)
+				goto error;
 		}
-		inet_csk_update_fastreuse(tb, child);
 	}
-	inet_bind_hash(child, tb, port);
+	if (update_fastreuse)
+		inet_csk_update_fastreuse(tb, child);
+	inet_bind_hash(child, tb, tb2, port);
+	spin_unlock(&head2->lock);
 	spin_unlock(&head->lock);
 
 	return 0;
+
+error:
+	if (created_inet_bind_bucket)
+		inet_bind_bucket_destroy(table->bind_bucket_cachep, tb);
+	spin_unlock(&head2->lock);
+	spin_unlock(&head->lock);
+	return -ENOMEM;
 }
 EXPORT_SYMBOL_GPL(__inet_inherit_port);
 
@@ -675,6 +785,112 @@ void inet_unhash(struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(inet_unhash);
 
+static bool inet_bind2_bucket_match(const struct inet_bind2_bucket *tb,
+				    const struct net *net, unsigned short port,
+				    int l3mdev, const struct sock *sk)
+{
+#if IS_ENABLED(CONFIG_IPV6)
+	if (sk->sk_family == AF_INET6)
+		return net_eq(ib2_net(tb), net) && tb->port == port &&
+			tb->l3mdev == l3mdev &&
+			ipv6_addr_equal(&tb->v6_rcv_saddr, &sk->sk_v6_rcv_saddr);
+	else
+#endif
+		return net_eq(ib2_net(tb), net) && tb->port == port &&
+			tb->l3mdev == l3mdev && tb->rcv_saddr == sk->sk_rcv_saddr;
+}
+
+bool inet_bind2_bucket_match_addr_any(const struct inet_bind2_bucket *tb, const struct net *net,
+				      unsigned short port, int l3mdev, const struct sock *sk)
+{
+#if IS_ENABLED(CONFIG_IPV6)
+	struct in6_addr addr_any = {};
+
+	if (sk->sk_family == AF_INET6)
+		return net_eq(ib2_net(tb), net) && tb->port == port &&
+			tb->l3mdev == l3mdev &&
+			ipv6_addr_equal(&tb->v6_rcv_saddr, &addr_any);
+	else
+#endif
+		return net_eq(ib2_net(tb), net) && tb->port == port &&
+			tb->l3mdev == l3mdev && tb->rcv_saddr == 0;
+}
+
+/* The socket's bhash2 hashbucket spinlock must be held when this is called */
+struct inet_bind2_bucket *
+inet_bind2_bucket_find(const struct inet_bind_hashbucket *head, const struct net *net,
+		       unsigned short port, int l3mdev, const struct sock *sk)
+{
+	struct inet_bind2_bucket *bhash2 = NULL;
+
+	inet_bind_bucket_for_each(bhash2, &head->chain)
+		if (inet_bind2_bucket_match(bhash2, net, port, l3mdev, sk))
+			break;
+
+	return bhash2;
+}
+
+struct inet_bind_hashbucket *
+inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, int port)
+{
+	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
+	u32 hash;
+#if IS_ENABLED(CONFIG_IPV6)
+	struct in6_addr addr_any = {};
+
+	if (sk->sk_family == AF_INET6)
+		hash = ipv6_portaddr_hash(net, &addr_any, port);
+	else
+#endif
+		hash = ipv4_portaddr_hash(net, 0, port);
+
+	return &hinfo->bhash2[hash & (hinfo->bhash_size - 1)];
+}
+
+int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct sock *sk)
+{
+	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
+	struct inet_bind2_bucket *tb2, *new_tb2;
+	int l3mdev = inet_sk_bound_l3mdev(sk);
+	struct inet_bind_hashbucket *head2;
+	int port = inet_sk(sk)->inet_num;
+	struct net *net = sock_net(sk);
+
+	/* Allocate a bind2 bucket ahead of time to avoid permanently putting
+	 * the bhash2 table in an inconsistent state if a new tb2 bucket
+	 * allocation fails.
+	 */
+	new_tb2 = kmem_cache_alloc(hinfo->bind2_bucket_cachep, GFP_ATOMIC);
+	if (!new_tb2)
+		return -ENOMEM;
+
+	head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
+
+	if (prev_saddr) {
+		spin_lock_bh(&prev_saddr->lock);
+		__sk_del_bind2_node(sk);
+		inet_bind2_bucket_destroy(hinfo->bind2_bucket_cachep,
+					  inet_csk(sk)->icsk_bind2_hash);
+		spin_unlock_bh(&prev_saddr->lock);
+	}
+
+	spin_lock_bh(&head2->lock);
+	tb2 = inet_bind2_bucket_find(head2, net, port, l3mdev, sk);
+	if (!tb2) {
+		tb2 = new_tb2;
+		inet_bind2_bucket_init(tb2, net, head2, port, l3mdev, sk);
+	}
+	sk_add_bind2_node(sk, &tb2->owners);
+	inet_csk(sk)->icsk_bind2_hash = tb2;
+	spin_unlock_bh(&head2->lock);
+
+	if (tb2 != new_tb2)
+		kmem_cache_free(hinfo->bind2_bucket_cachep, new_tb2);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(inet_bhash2_update_saddr);
+
 /* RFC 6056 3.3.4.  Algorithm 4: Double-Hash Port Selection Algorithm
  * Note that we use 32bit integers (vs RFC 'short integers')
  * because 2^16 is not a multiple of num_ephemeral and this
@@ -694,11 +910,13 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 			struct sock *, __u16, struct inet_timewait_sock **))
 {
 	struct inet_hashinfo *hinfo = death_row->hashinfo;
+	struct inet_bind_hashbucket *head, *head2;
 	struct inet_timewait_sock *tw = NULL;
-	struct inet_bind_hashbucket *head;
 	int port = inet_sk(sk)->inet_num;
 	struct net *net = sock_net(sk);
+	struct inet_bind2_bucket *tb2;
 	struct inet_bind_bucket *tb;
+	bool tb_created = false;
 	u32 remaining, offset;
 	int ret, i, low, high;
 	int l3mdev;
@@ -755,8 +973,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 		 * the established check is already unique enough.
 		 */
 		inet_bind_bucket_for_each(tb, &head->chain) {
-			if (net_eq(ib_net(tb), net) && tb->l3mdev == l3mdev &&
-			    tb->port == port) {
+			if (inet_bind_bucket_match(tb, net, port, l3mdev)) {
 				if (tb->fastreuse >= 0 ||
 				    tb->fastreuseport >= 0)
 					goto next_port;
@@ -774,6 +991,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 			spin_unlock_bh(&head->lock);
 			return -ENOMEM;
 		}
+		tb_created = true;
 		tb->fastreuse = -1;
 		tb->fastreuseport = -1;
 		goto ok;
@@ -789,6 +1007,20 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 	return -EADDRNOTAVAIL;
 
 ok:
+	/* Find the corresponding tb2 bucket since we need to
+	 * add the socket to the bhash2 table as well
+	 */
+	head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
+	spin_lock(&head2->lock);
+
+	tb2 = inet_bind2_bucket_find(head2, net, port, l3mdev, sk);
+	if (!tb2) {
+		tb2 = inet_bind2_bucket_create(hinfo->bind2_bucket_cachep, net,
+					       head2, port, l3mdev, sk);
+		if (!tb2)
+			goto error;
+	}
+
 	/* Here we want to add a little bit of randomness to the next source
 	 * port that will be chosen. We use a max() with a random here so that
 	 * on low contention the randomness is maximal and on high contention
@@ -798,7 +1030,10 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 	WRITE_ONCE(table_perturb[index], READ_ONCE(table_perturb[index]) + i + 2);
 
 	/* Head lock still held and bh's disabled */
-	inet_bind_hash(sk, tb, port);
+	inet_bind_hash(sk, tb, tb2, port);
+
+	spin_unlock(&head2->lock);
+
 	if (sk_unhashed(sk)) {
 		inet_sk(sk)->inet_sport = htons(port);
 		inet_ehash_nolisten(sk, (struct sock *)tw, NULL);
@@ -810,6 +1045,13 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 		inet_twsk_deschedule_put(tw);
 	local_bh_enable();
 	return 0;
+
+error:
+	spin_unlock(&head2->lock);
+	if (tb_created)
+		inet_bind_bucket_destroy(hinfo->bind_bucket_cachep, tb);
+	spin_unlock_bh(&head->lock);
+	return -ENOMEM;
 }
 
 /*
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 970e9a2cca4a..76484451f586 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4739,6 +4739,12 @@ void __init tcp_init(void)
 				  SLAB_HWCACHE_ALIGN | SLAB_PANIC |
 				  SLAB_ACCOUNT,
 				  NULL);
+	tcp_hashinfo.bind2_bucket_cachep =
+		kmem_cache_create("tcp_bind2_bucket",
+				  sizeof(struct inet_bind2_bucket), 0,
+				  SLAB_HWCACHE_ALIGN | SLAB_PANIC |
+				  SLAB_ACCOUNT,
+				  NULL);
 
 	/* Size and allocate the main established and bind bucket
 	 * hash tables.
@@ -4762,7 +4768,7 @@ void __init tcp_init(void)
 		panic("TCP: failed to alloc ehash_locks");
 	tcp_hashinfo.bhash =
 		alloc_large_system_hash("TCP bind",
-					sizeof(struct inet_bind_hashbucket),
+					2 * sizeof(struct inet_bind_hashbucket),
 					tcp_hashinfo.ehash_mask + 1,
 					17, /* one slot per 128 KB of memory */
 					0,
@@ -4771,9 +4777,12 @@ void __init tcp_init(void)
 					0,
 					64 * 1024);
 	tcp_hashinfo.bhash_size = 1U << tcp_hashinfo.bhash_size;
+	tcp_hashinfo.bhash2 = tcp_hashinfo.bhash + tcp_hashinfo.bhash_size;
 	for (i = 0; i < tcp_hashinfo.bhash_size; i++) {
 		spin_lock_init(&tcp_hashinfo.bhash[i].lock);
 		INIT_HLIST_HEAD(&tcp_hashinfo.bhash[i].chain);
+		spin_lock_init(&tcp_hashinfo.bhash2[i].lock);
+		INIT_HLIST_HEAD(&tcp_hashinfo.bhash2[i].chain);
 	}
 
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 0c83780dc9bf..cc2ad67f75be 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -199,11 +199,12 @@ static int tcp_v4_pre_connect(struct sock *sk, struct sockaddr *uaddr,
 /* This will initiate an outgoing connection. */
 int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 {
+	struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
 	struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
+	__be32 daddr, nexthop, prev_sk_rcv_saddr;
 	struct inet_sock *inet = inet_sk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 	__be16 orig_sport, orig_dport;
-	__be32 daddr, nexthop;
 	struct flowi4 *fl4;
 	struct rtable *rt;
 	int err;
@@ -246,10 +247,28 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	if (!inet_opt || !inet_opt->opt.srr)
 		daddr = fl4->daddr;
 
-	if (!inet->inet_saddr)
+	if (!inet->inet_saddr) {
+		if (inet_csk(sk)->icsk_bind2_hash) {
+			prev_addr_hashbucket = inet_bhashfn_portaddr(&tcp_hashinfo,
+								     sk, sock_net(sk),
+								     inet->inet_num);
+			prev_sk_rcv_saddr = sk->sk_rcv_saddr;
+		}
 		inet->inet_saddr = fl4->saddr;
+	}
+
 	sk_rcv_saddr_set(sk, inet->inet_saddr);
 
+	if (prev_addr_hashbucket) {
+		err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
+		if (err) {
+			inet->inet_saddr = 0;
+			sk_rcv_saddr_set(sk, prev_sk_rcv_saddr);
+			ip_rt_put(rt);
+			return err;
+		}
+	}
+
 	if (tp->rx_opt.ts_recent_stamp && inet->inet_daddr != daddr) {
 		/* Reset inherited state */
 		tp->rx_opt.ts_recent	   = 0;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index e54eee80ce5f..ff5c4fc135fc 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -287,8 +287,25 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	}
 
 	if (!saddr) {
+		struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
+		struct in6_addr prev_v6_rcv_saddr;
+
+		if (icsk->icsk_bind2_hash) {
+			prev_addr_hashbucket = inet_bhashfn_portaddr(&tcp_hashinfo,
+								     sk, sock_net(sk),
+								     inet->inet_num);
+			prev_v6_rcv_saddr = sk->sk_v6_rcv_saddr;
+		}
 		saddr = &fl6.saddr;
 		sk->sk_v6_rcv_saddr = *saddr;
+
+		if (prev_addr_hashbucket) {
+			err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
+			if (err) {
+				sk->sk_v6_rcv_saddr = prev_v6_rcv_saddr;
+				goto failure;
+			}
+		}
 	}
 
 	/* set the source address */
-- 
2.30.2


^ permalink raw reply related

* [PATCH net-next v4 0/3] Add a second bind table hashed by port + address
From: Joanne Koong @ 2022-08-12 20:29 UTC (permalink / raw)
  To: netdev; +Cc: edumazet, kafai, kuba, davem, pabeni, dccp, Joanne Koong

Currently, there is one bind hashtable (bhash) that hashes by port only.
This patchset adds a second bind table (bhash2) that hashes by port and
address.

The motivation for adding bhash2 is to expedite bind requests in situations
where the port has many sockets in its bhash table entry (eg a large number
of sockets bound to different addresses on the same port), which makes checking
bind conflicts costly especially given that we acquire the table entry spinlock
while doing so, which can cause softirq cpu lockups and can prevent new tcp
connections.

We ran into this problem at Meta where the traffic team binds a large number
of IPs to port 443 and the bind() call took a significant amount of time
which led to cpu softirq lockups, which caused packet drops and other failures
on the machine.

When experimentally testing this on a local server for ~24k sockets bound to
the port, the results seen were:

ipv4:
before - 0.002317 seconds
with bhash2 - 0.000020 seconds

ipv6:
before - 0.002431 seconds
with bhash2 - 0.000021 seconds

The additions to the initial bhash2 submission [0] are:
* Updating bhash2 in the cases where a socket's rcv saddr changes after it has
* been bound
* Adding locks for bhash2 hashbuckets

[0] https://lore.kernel.org/netdev/20220520001834.2247810-1-kuba@kernel.org/

---
Changelog

v3 -> v4
v3: https://lore.kernel.org/netdev/20220722195406.1304948-2-joannelkoong@gmail.com/
  * Address kernel test robot bad page map:
    - only update the bhash2 table in connect() on addr 0, if the socket is bound

v2 -> v3
v2: https://lore.kernel.org/netdev/20220712235310.1935121-1-joannelkoong@gmail.com/
  * Address Paolo's feedback
    1/3:
        - Move inet_bhashfn_portaddr down in inet_csk_find_open_port()
        - Remove unused "head" in inet_bhash2_update_saddr
    2/3:
        - Make tests work for ipv4, make address configurable from command line
        - Use 'nodad' option for ip addr add in script
    3/3:
        - Add sk_bind_sendto_listen to Makefile for it to run automatically

  * Check if the icsk_bind2_hash was set before finding the prev_addr_hashbucket.
    If the icsk_bind2_hash wasn't set, this means the prev address was never
    added to the bhash2, so pass in NULL "prev_saddr" to inet_bhash2_update_saddr().
    This addresses the kernel_NULL_pointer_dereference report [1].

  * Add sk_connect_zero_addr test (tests that the kernel_NULL_pointer_dereference bug
    is fixed).

  [1] https://lore.kernel.org/netdev/YtLJMxChUupbAa+U@xsang-OptiPlex-9020/

v1 -> v2
v1: https://lore.kernel.org/netdev/20220623234242.2083895-2-joannelkoong@gmail.com/
  * Drop formatting change to sk_add_bind_node()


Joanne Koong (3):
  net: Add a bhash2 table hashed by port + address
  selftests/net: Add test for timing a bind request to a port with a
    populated bhash entry
  selftests/net: Add sk_bind_sendto_listen and sk_connect_zero_addr

 include/net/inet_connection_sock.h            |   3 +
 include/net/inet_hashtables.h                 |  80 ++++-
 include/net/sock.h                            |  14 +
 net/dccp/ipv4.c                               |  25 +-
 net/dccp/ipv6.c                               |  18 ++
 net/dccp/proto.c                              |  34 ++-
 net/ipv4/af_inet.c                            |  26 +-
 net/ipv4/inet_connection_sock.c               | 275 ++++++++++++++----
 net/ipv4/inet_hashtables.c                    | 268 ++++++++++++++++-
 net/ipv4/tcp.c                                |  11 +-
 net/ipv4/tcp_ipv4.c                           |  23 +-
 net/ipv6/tcp_ipv6.c                           |  17 ++
 tools/testing/selftests/net/.gitignore        |   5 +-
 tools/testing/selftests/net/Makefile          |   5 +
 tools/testing/selftests/net/bind_bhash.c      | 144 +++++++++
 tools/testing/selftests/net/bind_bhash.sh     |  66 +++++
 .../selftests/net/sk_bind_sendto_listen.c     |  80 +++++
 .../selftests/net/sk_connect_zero_addr.c      |  62 ++++
 18 files changed, 1061 insertions(+), 95 deletions(-)
 create mode 100644 tools/testing/selftests/net/bind_bhash.c
 create mode 100755 tools/testing/selftests/net/bind_bhash.sh
 create mode 100644 tools/testing/selftests/net/sk_bind_sendto_listen.c
 create mode 100644 tools/testing/selftests/net/sk_connect_zero_addr.c

-- 
2.30.2


^ permalink raw reply

* [PATCH bpf-next v8 5/5] selftests/bpf: add a selftest for cgroup hierarchical stats collection
From: Hao Luo @ 2022-08-12 20:28 UTC (permalink / raw)
  To: linux-kernel, bpf, cgroups, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, Tejun Heo, Zefan Li,
	KP Singh, Johannes Weiner, Michal Hocko, John Fastabend,
	Michal Koutny, Roman Gushchin, David Rientjes, Stanislav Fomichev,
	Shakeel Butt, Yosry Ahmed, Hao Luo
In-Reply-To: <20220812202802.3774257-1-haoluo@google.com>

From: Yosry Ahmed <yosryahmed@google.com>

Add a selftest that tests the whole workflow for collecting,
aggregating (flushing), and displaying cgroup hierarchical stats.

TL;DR:
- Userspace program creates a cgroup hierarchy and induces memcg reclaim
  in parts of it.
- Whenever reclaim happens, vmscan_start and vmscan_end update
  per-cgroup percpu readings, and tell rstat which (cgroup, cpu) pairs
  have updates.
- When userspace tries to read the stats, vmscan_dump calls rstat to flush
  the stats, and outputs the stats in text format to userspace (similar
  to cgroupfs stats).
- rstat calls vmscan_flush once for every (cgroup, cpu) pair that has
  updates, vmscan_flush aggregates cpu readings and propagates updates
  to parents.
- Userspace program makes sure the stats are aggregated and read
  correctly.

Detailed explanation:
- The test loads tracing bpf programs, vmscan_start and vmscan_end, to
  measure the latency of cgroup reclaim. Per-cgroup readings are stored in
  percpu maps for efficiency. When a cgroup reading is updated on a cpu,
  cgroup_rstat_updated(cgroup, cpu) is called to add the cgroup to the
  rstat updated tree on that cpu.

- A cgroup_iter program, vmscan_dump, is loaded and pinned to a file, for
  each cgroup. Reading this file invokes the program, which calls
  cgroup_rstat_flush(cgroup) to ask rstat to propagate the updates for all
  cpus and cgroups that have updates in this cgroup's subtree. Afterwards,
  the stats are exposed to the user. vmscan_dump returns 1 to terminate
  iteration early, so that we only expose stats for one cgroup per read.

- An ftrace program, vmscan_flush, is also loaded and attached to
  bpf_rstat_flush. When rstat flushing is ongoing, vmscan_flush is invoked
  once for each (cgroup, cpu) pair that has updates. cgroups are popped
  from the rstat tree in a bottom-up fashion, so calls will always be
  made for cgroups that have updates before their parents. The program
  aggregates percpu readings to a total per-cgroup reading, and also
  propagates them to the parent cgroup. After rstat flushing is over, all
  cgroups will have correct updated hierarchical readings (including all
  cpus and all their descendants).

- Finally, the test creates a cgroup hierarchy and induces memcg reclaim
  in parts of it, and makes sure that the stats collection, aggregation,
  and reading workflow works as expected.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Hao Luo <haoluo@google.com>
---
 .../prog_tests/cgroup_hierarchical_stats.c    | 358 ++++++++++++++++++
 .../bpf/progs/cgroup_hierarchical_stats.c     | 226 +++++++++++
 2 files changed, 584 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c

diff --git a/tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c b/tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
new file mode 100644
index 000000000000..66cc124f5e5a
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
@@ -0,0 +1,358 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Functions to manage eBPF programs attached to cgroup subsystems
+ *
+ * Copyright 2022 Google LLC.
+ */
+#include <asm-generic/errno.h>
+#include <errno.h>
+#include <sys/types.h>
+#include <sys/mount.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include <test_progs.h>
+#include <bpf/libbpf.h>
+#include <bpf/bpf.h>
+
+#include "cgroup_helpers.h"
+#include "cgroup_hierarchical_stats.skel.h"
+
+#define PAGE_SIZE 4096
+#define MB(x) (x << 20)
+
+#define BPFFS_ROOT "/sys/fs/bpf/"
+#define BPFFS_VMSCAN BPFFS_ROOT"vmscan/"
+
+#define CG_ROOT_NAME "root"
+#define CG_ROOT_ID 1
+
+#define CGROUP_PATH(p, n) {.path = p"/"n, .name = n}
+
+static struct {
+	const char *path, *name;
+	unsigned long long id;
+	int fd;
+} cgroups[] = {
+	CGROUP_PATH("/", "test"),
+	CGROUP_PATH("/test", "child1"),
+	CGROUP_PATH("/test", "child2"),
+	CGROUP_PATH("/test/child1", "child1_1"),
+	CGROUP_PATH("/test/child1", "child1_2"),
+	CGROUP_PATH("/test/child2", "child2_1"),
+	CGROUP_PATH("/test/child2", "child2_2"),
+};
+
+#define N_CGROUPS ARRAY_SIZE(cgroups)
+#define N_NON_LEAF_CGROUPS 3
+
+static int root_cgroup_fd;
+static bool mounted_bpffs;
+
+/* reads file at 'path' to 'buf', returns 0 on success. */
+static int read_from_file(const char *path, char *buf, size_t size)
+{
+	int fd, len;
+
+	fd = open(path, O_RDONLY);
+	if (fd < 0)
+		return fd;
+
+	len = read(fd, buf, size);
+	close(fd);
+	if (len < 0)
+		return len;
+
+	buf[len] = 0;
+	return 0;
+}
+
+/* mounts bpffs and mkdir for reading stats, returns 0 on success. */
+static int setup_bpffs(void)
+{
+	int err;
+
+	/* Mount bpffs */
+	err = mount("bpf", BPFFS_ROOT, "bpf", 0, NULL);
+	mounted_bpffs = !err;
+	if (ASSERT_FALSE(err && errno != EBUSY, "mount"))
+		return err;
+
+	/* Create a directory to contain stat files in bpffs */
+	err = mkdir(BPFFS_VMSCAN, 0755);
+	if (!ASSERT_OK(err, "mkdir"))
+		return err;
+
+	return 0;
+}
+
+static void cleanup_bpffs(void)
+{
+	/* Remove created directory in bpffs */
+	ASSERT_OK(rmdir(BPFFS_VMSCAN), "rmdir "BPFFS_VMSCAN);
+
+	/* Unmount bpffs, if it wasn't already mounted when we started */
+	if (mounted_bpffs)
+		return;
+
+	ASSERT_OK(umount(BPFFS_ROOT), "unmount bpffs");
+}
+
+/* sets up cgroups, returns 0 on success. */
+static int setup_cgroups(void)
+{
+	int i, fd, err;
+
+	err = setup_cgroup_environment();
+	if (!ASSERT_OK(err, "setup_cgroup_environment"))
+		return err;
+
+	root_cgroup_fd = get_root_cgroup();
+	if (!ASSERT_GE(root_cgroup_fd, 0, "get_root_cgroup"))
+		return root_cgroup_fd;
+
+	for (i = 0; i < N_CGROUPS; i++) {
+		fd = create_and_get_cgroup(cgroups[i].path);
+		if (!ASSERT_GE(fd, 0, "create_and_get_cgroup"))
+			return fd;
+
+		cgroups[i].fd = fd;
+		cgroups[i].id = get_cgroup_id(cgroups[i].path);
+
+		/*
+		 * Enable memcg controller for the entire hierarchy.
+		 * Note that stats are collected for all cgroups in a hierarchy
+		 * with memcg enabled anyway, but are only exposed for cgroups
+		 * that have memcg enabled.
+		 */
+		if (i < N_NON_LEAF_CGROUPS) {
+			err = enable_controllers(cgroups[i].path, "memory");
+			if (!ASSERT_OK(err, "enable_controllers"))
+				return err;
+		}
+	}
+	return 0;
+}
+
+static void cleanup_cgroups(void)
+{
+	close(root_cgroup_fd);
+	for (int i = 0; i < N_CGROUPS; i++)
+		close(cgroups[i].fd);
+	cleanup_cgroup_environment();
+}
+
+/* Sets up cgroup hiearchary, returns 0 on success. */
+static int setup_hierarchy(void)
+{
+	return setup_bpffs() || setup_cgroups();
+}
+
+static void destroy_hierarchy(void)
+{
+	cleanup_cgroups();
+	cleanup_bpffs();
+}
+
+static int reclaimer(const char *cgroup_path, size_t size)
+{
+	static char size_buf[128];
+	char *buf, *ptr;
+	int err;
+
+	/* Join cgroup in the parent process workdir */
+	if (join_parent_cgroup(cgroup_path))
+		return EACCES;
+
+	/* Allocate memory */
+	buf = malloc(size);
+	if (!buf)
+		return ENOMEM;
+
+	/* Write to memory to make sure it's actually allocated */
+	for (ptr = buf; ptr < buf + size; ptr += PAGE_SIZE)
+		*ptr = 1;
+
+	/* Try to reclaim memory */
+	snprintf(size_buf, 128, "%lu", size);
+	err = write_cgroup_file_parent(cgroup_path, "memory.reclaim", size_buf);
+
+	free(buf);
+	/* memory.reclaim returns EAGAIN if the amount is not fully reclaimed */
+	if (err && errno != EAGAIN)
+		return errno;
+
+	return 0;
+}
+
+static int induce_vmscan(void)
+{
+	int i, status;
+
+	/*
+	 * In every leaf cgroup, run a child process that allocates some memory
+	 * and attempts to reclaim some of it.
+	 */
+	for (i = N_NON_LEAF_CGROUPS; i < N_CGROUPS; i++) {
+		pid_t pid;
+
+		/* Create reclaimer child */
+		pid = fork();
+		if (pid == 0) {
+			status = reclaimer(cgroups[i].path, MB(5));
+			exit(status);
+		}
+
+		/* Cleanup reclaimer child */
+		waitpid(pid, &status, 0);
+		ASSERT_TRUE(WIFEXITED(status), "reclaimer exited");
+		ASSERT_EQ(WEXITSTATUS(status), 0, "reclaim exit code");
+	}
+	return 0;
+}
+
+static unsigned long long
+get_cgroup_vmscan_delay(unsigned long long cgroup_id, const char *file_name)
+{
+	unsigned long long vmscan = 0, id = 0;
+	static char buf[128], path[128];
+
+	/* For every cgroup, read the file generated by cgroup_iter */
+	snprintf(path, 128, "%s%s", BPFFS_VMSCAN, file_name);
+	if (!ASSERT_OK(read_from_file(path, buf, 128), "read cgroup_iter"))
+		return 0;
+
+	/* Check the output file formatting */
+	ASSERT_EQ(sscanf(buf, "cg_id: %llu, total_vmscan_delay: %llu\n",
+			 &id, &vmscan), 2, "output format");
+
+	/* Check that the cgroup_id is displayed correctly */
+	ASSERT_EQ(id, cgroup_id, "cgroup_id");
+	/* Check that the vmscan reading is non-zero */
+	ASSERT_GT(vmscan, 0, "vmscan_reading");
+	return vmscan;
+}
+
+static void check_vmscan_stats(void)
+{
+	unsigned long long vmscan_readings[N_CGROUPS], vmscan_root;
+	int i;
+
+	for (i = 0; i < N_CGROUPS; i++) {
+		vmscan_readings[i] = get_cgroup_vmscan_delay(cgroups[i].id,
+							     cgroups[i].name);
+	}
+
+	/* Read stats for root too */
+	vmscan_root = get_cgroup_vmscan_delay(CG_ROOT_ID, CG_ROOT_NAME);
+
+	/* Check that child1 == child1_1 + child1_2 */
+	ASSERT_EQ(vmscan_readings[1], vmscan_readings[3] + vmscan_readings[4],
+		  "child1_vmscan");
+	/* Check that child2 == child2_1 + child2_2 */
+	ASSERT_EQ(vmscan_readings[2], vmscan_readings[5] + vmscan_readings[6],
+		  "child2_vmscan");
+	/* Check that test == child1 + child2 */
+	ASSERT_EQ(vmscan_readings[0], vmscan_readings[1] + vmscan_readings[2],
+		  "test_vmscan");
+	/* Check that root >= test */
+	ASSERT_GE(vmscan_root, vmscan_readings[1], "root_vmscan");
+}
+
+/* Creates iter link and pins in bpffs, returns 0 on success, -errno on failure.
+ */
+static int setup_cgroup_iter(struct cgroup_hierarchical_stats *obj,
+			     int cgroup_fd, const char *file_name)
+{
+	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+	union bpf_iter_link_info linfo = {};
+	struct bpf_link *link;
+	static char path[128];
+	int err;
+
+	/*
+	 * Create an iter link, parameterized by cgroup_fd.
+	 * We only want to traverse one cgroup, so set the traversal order to
+	 * "pre", and return 1 from dump_vmscan to stop iteration after the
+	 * first cgroup.
+	 */
+	linfo.cgroup.cgroup_fd = cgroup_fd;
+	linfo.cgroup.order = BPF_ITER_SELF_ONLY;
+	opts.link_info = &linfo;
+	opts.link_info_len = sizeof(linfo);
+	link = bpf_program__attach_iter(obj->progs.dump_vmscan, &opts);
+	if (!ASSERT_OK_PTR(link, "attach iter"))
+		return -EFAULT;
+
+	/* Pin the link to a bpffs file */
+	snprintf(path, 128, "%s%s", BPFFS_VMSCAN, file_name);
+	err = bpf_link__pin(link, path);
+	ASSERT_OK(err, "pin cgroup_iter");
+
+	/* Remove the link, leaving only the ref held by the pinned file */
+	bpf_link__destroy(link);
+	return err;
+}
+
+/* Sets up programs for collecting stats, returns 0 on success. */
+static int setup_progs(struct cgroup_hierarchical_stats **skel)
+{
+	int i, err;
+
+	*skel = cgroup_hierarchical_stats__open_and_load();
+	if (!ASSERT_OK_PTR(*skel, "open_and_load"))
+		return 1;
+
+	/* Attach cgroup_iter program that will dump the stats to cgroups */
+	for (i = 0; i < N_CGROUPS; i++) {
+		err = setup_cgroup_iter(*skel, cgroups[i].fd, cgroups[i].name);
+		if (!ASSERT_OK(err, "setup_cgroup_iter"))
+			return err;
+	}
+
+	/* Also dump stats for root */
+	err = setup_cgroup_iter(*skel, root_cgroup_fd, CG_ROOT_NAME);
+	if (!ASSERT_OK(err, "setup_cgroup_iter"))
+		return err;
+
+	err = cgroup_hierarchical_stats__attach(*skel);
+	if (!ASSERT_OK(err, "attach"))
+		return err;
+
+	return 0;
+}
+
+static void destroy_progs(struct cgroup_hierarchical_stats *skel)
+{
+	static char path[128];
+	int i;
+
+	for (i = 0; i < N_CGROUPS; i++) {
+		/* Delete files in bpffs that cgroup_iters are pinned in */
+		snprintf(path, 128, "%s%s", BPFFS_VMSCAN,
+			 cgroups[i].name);
+		ASSERT_OK(remove(path), "remove cgroup_iter pin");
+	}
+
+	/* Delete root file in bpffs */
+	snprintf(path, 128, "%s%s", BPFFS_VMSCAN, CG_ROOT_NAME);
+	ASSERT_OK(remove(path), "remove cgroup_iter root pin");
+	cgroup_hierarchical_stats__destroy(skel);
+}
+
+void test_cgroup_hierarchical_stats(void)
+{
+	struct cgroup_hierarchical_stats *skel = NULL;
+
+	if (setup_hierarchy())
+		goto hierarchy_cleanup;
+	if (setup_progs(&skel))
+		goto cleanup;
+	if (induce_vmscan())
+		goto cleanup;
+	check_vmscan_stats();
+cleanup:
+	destroy_progs(skel);
+hierarchy_cleanup:
+	destroy_hierarchy();
+}
diff --git a/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c b/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
new file mode 100644
index 000000000000..8ab4253a1592
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
@@ -0,0 +1,226 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Functions to manage eBPF programs attached to cgroup subsystems
+ *
+ * Copyright 2022 Google LLC.
+ */
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_core_read.h>
+
+char _license[] SEC("license") = "GPL";
+
+/*
+ * Start times are stored per-task, not per-cgroup, as multiple tasks in one
+ * cgroup can perform reclaim concurrently.
+ */
+struct {
+	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, __u64);
+} vmscan_start_time SEC(".maps");
+
+struct vmscan_percpu {
+	/* Previous percpu state, to figure out if we have new updates */
+	__u64 prev;
+	/* Current percpu state */
+	__u64 state;
+};
+
+struct vmscan {
+	/* State propagated through children, pending aggregation */
+	__u64 pending;
+	/* Total state, including all cpus and all children */
+	__u64 state;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_HASH);
+	__uint(max_entries, 100);
+	__type(key, __u64);
+	__type(value, struct vmscan_percpu);
+} pcpu_cgroup_vmscan_elapsed SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 100);
+	__type(key, __u64);
+	__type(value, struct vmscan);
+} cgroup_vmscan_elapsed SEC(".maps");
+
+extern void cgroup_rstat_updated(struct cgroup *cgrp, int cpu) __ksym;
+extern void cgroup_rstat_flush(struct cgroup *cgrp) __ksym;
+
+static struct cgroup *task_memcg(struct task_struct *task)
+{
+	int cgrp_id;
+
+#if __has_builtin(__builtin_preserve_enum_value)
+	cgrp_id = bpf_core_enum_value(enum cgroup_subsys_id, memory_cgrp_id);
+#else
+	cgrp_id = memory_cgrp_id;
+#endif
+	return task->cgroups->subsys[cgrp_id]->cgroup;
+}
+
+static uint64_t cgroup_id(struct cgroup *cgrp)
+{
+	return cgrp->kn->id;
+}
+
+static int create_vmscan_percpu_elem(__u64 cg_id, __u64 state)
+{
+	struct vmscan_percpu pcpu_init = {.state = state, .prev = 0};
+
+	return bpf_map_update_elem(&pcpu_cgroup_vmscan_elapsed, &cg_id,
+				   &pcpu_init, BPF_NOEXIST);
+}
+
+static int create_vmscan_elem(__u64 cg_id, __u64 state, __u64 pending)
+{
+	struct vmscan init = {.state = state, .pending = pending};
+
+	return bpf_map_update_elem(&cgroup_vmscan_elapsed, &cg_id,
+				   &init, BPF_NOEXIST);
+}
+
+SEC("tp_btf/mm_vmscan_memcg_reclaim_begin")
+int BPF_PROG(vmscan_start, int order, gfp_t gfp_flags)
+{
+	struct task_struct *task = bpf_get_current_task_btf();
+	__u64 *start_time_ptr;
+
+	start_time_ptr = bpf_task_storage_get(&vmscan_start_time, task, 0,
+					      BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (start_time_ptr)
+		*start_time_ptr = bpf_ktime_get_ns();
+	return 0;
+}
+
+SEC("tp_btf/mm_vmscan_memcg_reclaim_end")
+int BPF_PROG(vmscan_end, unsigned long nr_reclaimed)
+{
+	struct vmscan_percpu *pcpu_stat;
+	struct task_struct *current = bpf_get_current_task_btf();
+	struct cgroup *cgrp;
+	__u64 *start_time_ptr;
+	__u64 current_elapsed, cg_id;
+	__u64 end_time = bpf_ktime_get_ns();
+
+	/*
+	 * cgrp is the first parent cgroup of current that has memcg enabled in
+	 * its subtree_control, or NULL if memcg is disabled in the entire tree.
+	 * In a cgroup hierarchy like this:
+	 *                               a
+	 *                              / \
+	 *                             b   c
+	 *  If "a" has memcg enabled, while "b" doesn't, then processes in "b"
+	 *  will accumulate their stats directly to "a". This makes sure that no
+	 *  stats are lost from processes in leaf cgroups that don't have memcg
+	 *  enabled, but only exposes stats for cgroups that have memcg enabled.
+	 */
+	cgrp = task_memcg(current);
+	if (!cgrp)
+		return 0;
+
+	cg_id = cgroup_id(cgrp);
+	start_time_ptr = bpf_task_storage_get(&vmscan_start_time, current, 0,
+					      BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (!start_time_ptr)
+		return 0;
+
+	current_elapsed = end_time - *start_time_ptr;
+	pcpu_stat = bpf_map_lookup_elem(&pcpu_cgroup_vmscan_elapsed,
+					&cg_id);
+	if (pcpu_stat)
+		pcpu_stat->state += current_elapsed;
+	else if (create_vmscan_percpu_elem(cg_id, current_elapsed))
+		return 0;
+
+	cgroup_rstat_updated(cgrp, bpf_get_smp_processor_id());
+	return 0;
+}
+
+SEC("fentry/bpf_rstat_flush")
+int BPF_PROG(vmscan_flush, struct cgroup *cgrp, struct cgroup *parent, int cpu)
+{
+	struct vmscan_percpu *pcpu_stat;
+	struct vmscan *total_stat, *parent_stat;
+	__u64 cg_id = cgroup_id(cgrp);
+	__u64 parent_cg_id = parent ? cgroup_id(parent) : 0;
+	__u64 *pcpu_vmscan;
+	__u64 state;
+	__u64 delta = 0;
+
+	/* Add CPU changes on this level since the last flush */
+	pcpu_stat = bpf_map_lookup_percpu_elem(&pcpu_cgroup_vmscan_elapsed,
+					       &cg_id, cpu);
+	if (pcpu_stat) {
+		state = pcpu_stat->state;
+		delta += state - pcpu_stat->prev;
+		pcpu_stat->prev = state;
+	}
+
+	total_stat = bpf_map_lookup_elem(&cgroup_vmscan_elapsed, &cg_id);
+	if (!total_stat) {
+		if (create_vmscan_elem(cg_id, delta, 0))
+			return 0;
+
+		goto update_parent;
+	}
+
+	/* Collect pending stats from subtree */
+	if (total_stat->pending) {
+		delta += total_stat->pending;
+		total_stat->pending = 0;
+	}
+
+	/* Propagate changes to this cgroup's total */
+	total_stat->state += delta;
+
+update_parent:
+	/* Skip if there are no changes to propagate, or no parent */
+	if (!delta || !parent_cg_id)
+		return 0;
+
+	/* Propagate changes to cgroup's parent */
+	parent_stat = bpf_map_lookup_elem(&cgroup_vmscan_elapsed,
+					  &parent_cg_id);
+	if (parent_stat)
+		parent_stat->pending += delta;
+	else
+		create_vmscan_elem(parent_cg_id, 0, delta);
+	return 0;
+}
+
+SEC("iter.s/cgroup")
+int BPF_PROG(dump_vmscan, struct bpf_iter_meta *meta, struct cgroup *cgrp)
+{
+	struct seq_file *seq = meta->seq;
+	struct vmscan *total_stat;
+	__u64 cg_id = cgrp ? cgroup_id(cgrp) : 0;
+
+	/* Do nothing for the terminal call */
+	if (!cg_id)
+		return 1;
+
+	/* Flush the stats to make sure we get the most updated numbers */
+	cgroup_rstat_flush(cgrp);
+
+	total_stat = bpf_map_lookup_elem(&cgroup_vmscan_elapsed, &cg_id);
+	if (!total_stat) {
+		BPF_SEQ_PRINTF(seq, "cg_id: %llu, total_vmscan_delay: 0\n",
+			       cg_id);
+	} else {
+		BPF_SEQ_PRINTF(seq, "cg_id: %llu, total_vmscan_delay: %llu\n",
+			       cg_id, total_stat->state);
+	}
+
+	/*
+	 * We only dump stats for one cgroup here, so return 1 to stop
+	 * iteration after the first cgroup.
+	 */
+	return 1;
+}
-- 
2.37.1.595.g718a3a8f04-goog


^ permalink raw reply related

* [PATCH bpf-next v8 4/5] selftests/bpf: extend cgroup helpers
From: Hao Luo @ 2022-08-12 20:28 UTC (permalink / raw)
  To: linux-kernel, bpf, cgroups, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, Tejun Heo, Zefan Li,
	KP Singh, Johannes Weiner, Michal Hocko, John Fastabend,
	Michal Koutny, Roman Gushchin, David Rientjes, Stanislav Fomichev,
	Shakeel Butt, Yosry Ahmed, Hao Luo
In-Reply-To: <20220812202802.3774257-1-haoluo@google.com>

From: Yosry Ahmed <yosryahmed@google.com>

This patch extends bpf selft cgroup_helpers [ID] n various ways:
- Add enable_controllers() that allows tests to enable all or a
  subset of controllers for a specific cgroup.
- Add join_cgroup_parent(). The cgroup workdir is based on the pid,
  therefore a spawned child cannot join the same cgroup hierarchy of the
  test through join_cgroup(). join_cgroup_parent() is used in child
  processes to join a cgroup under the parent's workdir.
- Add write_cgroup_file() and write_cgroup_file_parent() (similar to
  join_cgroup_parent() above).
- Add get_root_cgroup() for tests that need to do checks on root cgroup.
- Distinguish relative and absolute cgroup paths in function arguments.
  Now relative paths are called relative_path, and absolute paths are
  called cgroup_path.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Hao Luo <haoluo@google.com>
---
 tools/testing/selftests/bpf/cgroup_helpers.c | 202 +++++++++++++++----
 tools/testing/selftests/bpf/cgroup_helpers.h |  19 +-
 2 files changed, 174 insertions(+), 47 deletions(-)

diff --git a/tools/testing/selftests/bpf/cgroup_helpers.c b/tools/testing/selftests/bpf/cgroup_helpers.c
index 9d59c3990ca8..e914cc45b766 100644
--- a/tools/testing/selftests/bpf/cgroup_helpers.c
+++ b/tools/testing/selftests/bpf/cgroup_helpers.c
@@ -33,49 +33,52 @@
 #define CGROUP_MOUNT_DFLT		"/sys/fs/cgroup"
 #define NETCLS_MOUNT_PATH		CGROUP_MOUNT_DFLT "/net_cls"
 #define CGROUP_WORK_DIR			"/cgroup-test-work-dir"
-#define format_cgroup_path(buf, path) \
+
+#define format_cgroup_path_pid(buf, path, pid) \
 	snprintf(buf, sizeof(buf), "%s%s%d%s", CGROUP_MOUNT_PATH, \
-	CGROUP_WORK_DIR, getpid(), path)
+	CGROUP_WORK_DIR, pid, path)
+
+#define format_cgroup_path(buf, path) \
+	format_cgroup_path_pid(buf, path, getpid())
+
+#define format_parent_cgroup_path(buf, path) \
+	format_cgroup_path_pid(buf, path, getppid())
 
 #define format_classid_path(buf)				\
 	snprintf(buf, sizeof(buf), "%s%s", NETCLS_MOUNT_PATH,	\
 		 CGROUP_WORK_DIR)
 
-/**
- * enable_all_controllers() - Enable all available cgroup v2 controllers
- *
- * Enable all available cgroup v2 controllers in order to increase
- * the code coverage.
- *
- * If successful, 0 is returned.
- */
-static int enable_all_controllers(char *cgroup_path)
+static int __enable_controllers(const char *cgroup_path, const char *controllers)
 {
 	char path[PATH_MAX + 1];
-	char buf[PATH_MAX];
+	char enable[PATH_MAX + 1];
 	char *c, *c2;
 	int fd, cfd;
 	ssize_t len;
 
-	snprintf(path, sizeof(path), "%s/cgroup.controllers", cgroup_path);
-	fd = open(path, O_RDONLY);
-	if (fd < 0) {
-		log_err("Opening cgroup.controllers: %s", path);
-		return 1;
-	}
-
-	len = read(fd, buf, sizeof(buf) - 1);
-	if (len < 0) {
+	/* If not controllers are passed, enable all available controllers */
+	if (!controllers) {
+		snprintf(path, sizeof(path), "%s/cgroup.controllers",
+			 cgroup_path);
+		fd = open(path, O_RDONLY);
+		if (fd < 0) {
+			log_err("Opening cgroup.controllers: %s", path);
+			return 1;
+		}
+		len = read(fd, enable, sizeof(enable) - 1);
+		if (len < 0) {
+			close(fd);
+			log_err("Reading cgroup.controllers: %s", path);
+			return 1;
+		} else if (len == 0) { /* No controllers to enable */
+			close(fd);
+			return 0;
+		}
+		enable[len] = 0;
 		close(fd);
-		log_err("Reading cgroup.controllers: %s", path);
-		return 1;
+	} else {
+		strncpy(enable, controllers, sizeof(enable));
 	}
-	buf[len] = 0;
-	close(fd);
-
-	/* No controllers available? We're probably on cgroup v1. */
-	if (len == 0)
-		return 0;
 
 	snprintf(path, sizeof(path), "%s/cgroup.subtree_control", cgroup_path);
 	cfd = open(path, O_RDWR);
@@ -84,7 +87,7 @@ static int enable_all_controllers(char *cgroup_path)
 		return 1;
 	}
 
-	for (c = strtok_r(buf, " ", &c2); c; c = strtok_r(NULL, " ", &c2)) {
+	for (c = strtok_r(enable, " ", &c2); c; c = strtok_r(NULL, " ", &c2)) {
 		if (dprintf(cfd, "+%s\n", c) <= 0) {
 			log_err("Enabling controller %s: %s", c, path);
 			close(cfd);
@@ -95,6 +98,87 @@ static int enable_all_controllers(char *cgroup_path)
 	return 0;
 }
 
+/**
+ * enable_controllers() - Enable cgroup v2 controllers
+ * @relative_path: The cgroup path, relative to the workdir
+ * @controllers: List of controllers to enable in cgroup.controllers format
+ *
+ *
+ * Enable given cgroup v2 controllers, if @controllers is NULL, enable all
+ * available controllers.
+ *
+ * If successful, 0 is returned.
+ */
+int enable_controllers(const char *relative_path, const char *controllers)
+{
+	char cgroup_path[PATH_MAX + 1];
+
+	format_cgroup_path(cgroup_path, relative_path);
+	return __enable_controllers(cgroup_path, controllers);
+}
+
+static int __write_cgroup_file(const char *cgroup_path, const char *file,
+			       const char *buf)
+{
+	char file_path[PATH_MAX + 1];
+	int fd;
+
+	snprintf(file_path, sizeof(file_path), "%s/%s", cgroup_path, file);
+	fd = open(file_path, O_RDWR);
+	if (fd < 0) {
+		log_err("Opening %s", file_path);
+		return 1;
+	}
+
+	if (dprintf(fd, "%s", buf) <= 0) {
+		log_err("Writing to %s", file_path);
+		close(fd);
+		return 1;
+	}
+	close(fd);
+	return 0;
+}
+
+/**
+ * write_cgroup_file() - Write to a cgroup file
+ * @relative_path: The cgroup path, relative to the workdir
+ * @file: The name of the file in cgroupfs to write to
+ * @buf: Buffer to write to the file
+ *
+ * Write to a file in the given cgroup's directory.
+ *
+ * If successful, 0 is returned.
+ */
+int write_cgroup_file(const char *relative_path, const char *file,
+		      const char *buf)
+{
+	char cgroup_path[PATH_MAX - 24];
+
+	format_cgroup_path(cgroup_path, relative_path);
+	return __write_cgroup_file(cgroup_path, file, buf);
+}
+
+/**
+ * write_cgroup_file_parent() - Write to a cgroup file in the parent process
+ *                              workdir
+ * @relative_path: The cgroup path, relative to the parent process workdir
+ * @file: The name of the file in cgroupfs to write to
+ * @buf: Buffer to write to the file
+ *
+ * Write to a file in the given cgroup's directory under the parent process
+ * workdir.
+ *
+ * If successful, 0 is returned.
+ */
+int write_cgroup_file_parent(const char *relative_path, const char *file,
+			     const char *buf)
+{
+	char cgroup_path[PATH_MAX - 24];
+
+	format_parent_cgroup_path(cgroup_path, relative_path);
+	return __write_cgroup_file(cgroup_path, file, buf);
+}
+
 /**
  * setup_cgroup_environment() - Setup the cgroup environment
  *
@@ -133,7 +217,9 @@ int setup_cgroup_environment(void)
 		return 1;
 	}
 
-	if (enable_all_controllers(cgroup_workdir))
+	/* Enable all available controllers to increase test coverage */
+	if (__enable_controllers(CGROUP_MOUNT_PATH, NULL) ||
+	    __enable_controllers(cgroup_workdir, NULL))
 		return 1;
 
 	return 0;
@@ -173,7 +259,7 @@ static int join_cgroup_from_top(const char *cgroup_path)
 
 /**
  * join_cgroup() - Join a cgroup
- * @path: The cgroup path, relative to the workdir, to join
+ * @relative_path: The cgroup path, relative to the workdir, to join
  *
  * This function expects a cgroup to already be created, relative to the cgroup
  * work dir, and it joins it. For example, passing "/my-cgroup" as the path
@@ -182,11 +268,27 @@ static int join_cgroup_from_top(const char *cgroup_path)
  *
  * On success, it returns 0, otherwise on failure it returns 1.
  */
-int join_cgroup(const char *path)
+int join_cgroup(const char *relative_path)
+{
+	char cgroup_path[PATH_MAX + 1];
+
+	format_cgroup_path(cgroup_path, relative_path);
+	return join_cgroup_from_top(cgroup_path);
+}
+
+/**
+ * join_parent_cgroup() - Join a cgroup in the parent process workdir
+ * @relative_path: The cgroup path, relative to parent process workdir, to join
+ *
+ * See join_cgroup().
+ *
+ * On success, it returns 0, otherwise on failure it returns 1.
+ */
+int join_parent_cgroup(const char *relative_path)
 {
 	char cgroup_path[PATH_MAX + 1];
 
-	format_cgroup_path(cgroup_path, path);
+	format_parent_cgroup_path(cgroup_path, relative_path);
 	return join_cgroup_from_top(cgroup_path);
 }
 
@@ -212,9 +314,27 @@ void cleanup_cgroup_environment(void)
 	nftw(cgroup_workdir, nftwfunc, WALK_FD_LIMIT, FTW_DEPTH | FTW_MOUNT);
 }
 
+/**
+ * get_root_cgroup() - Get the FD of the root cgroup
+ *
+ * On success, it returns the file descriptor. On failure, it returns -1.
+ * If there is a failure, it prints the error to stderr.
+ */
+int get_root_cgroup(void)
+{
+	int fd;
+
+	fd = open(CGROUP_MOUNT_PATH, O_RDONLY);
+	if (fd < 0) {
+		log_err("Opening root cgroup");
+		return -1;
+	}
+	return fd;
+}
+
 /**
  * create_and_get_cgroup() - Create a cgroup, relative to workdir, and get the FD
- * @path: The cgroup path, relative to the workdir, to join
+ * @relative_path: The cgroup path, relative to the workdir, to join
  *
  * This function creates a cgroup under the top level workdir and returns the
  * file descriptor. It is idempotent.
@@ -222,14 +342,14 @@ void cleanup_cgroup_environment(void)
  * On success, it returns the file descriptor. On failure it returns -1.
  * If there is a failure, it prints the error to stderr.
  */
-int create_and_get_cgroup(const char *path)
+int create_and_get_cgroup(const char *relative_path)
 {
 	char cgroup_path[PATH_MAX + 1];
 	int fd;
 
-	format_cgroup_path(cgroup_path, path);
+	format_cgroup_path(cgroup_path, relative_path);
 	if (mkdir(cgroup_path, 0777) && errno != EEXIST) {
-		log_err("mkdiring cgroup %s .. %s", path, cgroup_path);
+		log_err("mkdiring cgroup %s .. %s", relative_path, cgroup_path);
 		return -1;
 	}
 
@@ -244,13 +364,13 @@ int create_and_get_cgroup(const char *path)
 
 /**
  * get_cgroup_id() - Get cgroup id for a particular cgroup path
- * @path: The cgroup path, relative to the workdir, to join
+ * @relative_path: The cgroup path, relative to the workdir, to join
  *
  * On success, it returns the cgroup id. On failure it returns 0,
  * which is an invalid cgroup id.
  * If there is a failure, it prints the error to stderr.
  */
-unsigned long long get_cgroup_id(const char *path)
+unsigned long long get_cgroup_id(const char *relative_path)
 {
 	int dirfd, err, flags, mount_id, fhsize;
 	union {
@@ -261,7 +381,7 @@ unsigned long long get_cgroup_id(const char *path)
 	struct file_handle *fhp, *fhp2;
 	unsigned long long ret = 0;
 
-	format_cgroup_path(cgroup_workdir, path);
+	format_cgroup_path(cgroup_workdir, relative_path);
 
 	dirfd = AT_FDCWD;
 	flags = 0;
diff --git a/tools/testing/selftests/bpf/cgroup_helpers.h b/tools/testing/selftests/bpf/cgroup_helpers.h
index fcc9cb91b211..3358734356ab 100644
--- a/tools/testing/selftests/bpf/cgroup_helpers.h
+++ b/tools/testing/selftests/bpf/cgroup_helpers.h
@@ -10,11 +10,18 @@
 	__FILE__, __LINE__, clean_errno(), ##__VA_ARGS__)
 
 /* cgroupv2 related */
-int cgroup_setup_and_join(const char *path);
-int create_and_get_cgroup(const char *path);
-unsigned long long get_cgroup_id(const char *path);
-
-int join_cgroup(const char *path);
+int enable_controllers(const char *relative_path, const char *controllers);
+int write_cgroup_file(const char *relative_path, const char *file,
+		      const char *buf);
+int write_cgroup_file_parent(const char *relative_path, const char *file,
+			     const char *buf);
+int cgroup_setup_and_join(const char *relative_path);
+int get_root_cgroup(void);
+int create_and_get_cgroup(const char *relative_path);
+unsigned long long get_cgroup_id(const char *relative_path);
+
+int join_cgroup(const char *relative_path);
+int join_parent_cgroup(const char *relative_path);
 
 int setup_cgroup_environment(void);
 void cleanup_cgroup_environment(void);
@@ -26,4 +33,4 @@ int join_classid(void);
 int setup_classid_environment(void);
 void cleanup_classid_environment(void);
 
-#endif /* __CGROUP_HELPERS_H */
\ No newline at end of file
+#endif /* __CGROUP_HELPERS_H */
-- 
2.37.1.595.g718a3a8f04-goog


^ permalink raw reply related

* [PATCH bpf-next v8 3/5] cgroup: bpf: enable bpf programs to integrate with rstat
From: Hao Luo @ 2022-08-12 20:28 UTC (permalink / raw)
  To: linux-kernel, bpf, cgroups, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, Tejun Heo, Zefan Li,
	KP Singh, Johannes Weiner, Michal Hocko, John Fastabend,
	Michal Koutny, Roman Gushchin, David Rientjes, Stanislav Fomichev,
	Shakeel Butt, Yosry Ahmed, Hao Luo
In-Reply-To: <20220812202802.3774257-1-haoluo@google.com>

From: Yosry Ahmed <yosryahmed@google.com>

Enable bpf programs to make use of rstat to collect cgroup hierarchical
stats efficiently:
- Add cgroup_rstat_updated() kfunc, for bpf progs that collect stats.
- Add cgroup_rstat_flush() sleepable kfunc, for bpf progs that read stats.
- Add an empty bpf_rstat_flush() hook that is called during rstat
  flushing, for bpf progs that flush stats to attach to. Attaching a bpf
  prog to this hook effectively registers it as a flush callback.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Hao Luo <haoluo@google.com>
---
 kernel/cgroup/rstat.c | 48 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index feb59380c896..793ecff29038 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -3,6 +3,10 @@
 
 #include <linux/sched/cputime.h>
 
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/btf_ids.h>
+
 static DEFINE_SPINLOCK(cgroup_rstat_lock);
 static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
 
@@ -141,6 +145,31 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos,
 	return pos;
 }
 
+/*
+ * A hook for bpf stat collectors to attach to and flush their stats.
+ * Together with providing bpf kfuncs for cgroup_rstat_updated() and
+ * cgroup_rstat_flush(), this enables a complete workflow where bpf progs that
+ * collect cgroup stats can integrate with rstat for efficient flushing.
+ *
+ * A static noinline declaration here could cause the compiler to optimize away
+ * the function. A global noinline declaration will keep the definition, but may
+ * optimize away the callsite. Therefore, __weak is needed to ensure that the
+ * call is still emitted, by telling the compiler that we don't know what the
+ * function might eventually be.
+ *
+ * __diag_* below are needed to dismiss the missing prototype warning.
+ */
+__diag_push();
+__diag_ignore_all("-Wmissing-prototypes",
+		  "kfuncs which will be used in BPF programs");
+
+__weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
+				     struct cgroup *parent, int cpu)
+{
+}
+
+__diag_pop();
+
 /* see cgroup_rstat_flush() */
 static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
 	__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
@@ -168,6 +197,7 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
 			struct cgroup_subsys_state *css;
 
 			cgroup_base_stat_flush(pos, cpu);
+			bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
 
 			rcu_read_lock();
 			list_for_each_entry_rcu(css, &pos->rstat_css_list,
@@ -501,3 +531,21 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
 	seq_printf(seq, "core_sched.force_idle_usec %llu\n", forceidle_time);
 #endif
 }
+
+/* Add bpf kfuncs for cgroup_rstat_updated() and cgroup_rstat_flush() */
+BTF_SET8_START(bpf_rstat_kfunc_ids)
+BTF_ID_FLAGS(func, cgroup_rstat_updated)
+BTF_ID_FLAGS(func, cgroup_rstat_flush, KF_SLEEPABLE)
+BTF_SET8_END(bpf_rstat_kfunc_ids)
+
+static const struct btf_kfunc_id_set bpf_rstat_kfunc_set = {
+	.owner          = THIS_MODULE,
+	.set            = &bpf_rstat_kfunc_ids,
+};
+
+static int __init bpf_rstat_kfunc_init(void)
+{
+	return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING,
+					 &bpf_rstat_kfunc_set);
+}
+late_initcall(bpf_rstat_kfunc_init);
-- 
2.37.1.595.g718a3a8f04-goog


^ permalink raw reply related

* [PATCH bpf-next v8 2/5] selftests/bpf: Test cgroup_iter.
From: Hao Luo @ 2022-08-12 20:27 UTC (permalink / raw)
  To: linux-kernel, bpf, cgroups, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, Tejun Heo, Zefan Li,
	KP Singh, Johannes Weiner, Michal Hocko, John Fastabend,
	Michal Koutny, Roman Gushchin, David Rientjes, Stanislav Fomichev,
	Shakeel Butt, Yosry Ahmed, Hao Luo
In-Reply-To: <20220812202802.3774257-1-haoluo@google.com>

Add a selftest for cgroup_iter. The selftest creates a mini cgroup tree
of the following structure:

    ROOT (working cgroup)
     |
   PARENT
  /      \
CHILD1  CHILD2

and tests the following scenarios:

 - invalid cgroup fd.
 - pre-order walk over descendants from PARENT.
 - post-order walk over descendants from PARENT.
 - walk of ancestors from PARENT.
 - process only a single object (i.e. PARENT).
 - early termination.

Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Hao Luo <haoluo@google.com>
---
 .../selftests/bpf/prog_tests/cgroup_iter.c    | 224 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/bpf_iter.h  |   7 +
 .../testing/selftests/bpf/progs/cgroup_iter.c |  39 +++
 3 files changed, 270 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_iter.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_iter.c

diff --git a/tools/testing/selftests/bpf/prog_tests/cgroup_iter.c b/tools/testing/selftests/bpf/prog_tests/cgroup_iter.c
new file mode 100644
index 000000000000..38958c37b9ce
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/cgroup_iter.c
@@ -0,0 +1,224 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Google */
+
+#include <test_progs.h>
+#include <bpf/libbpf.h>
+#include <bpf/btf.h>
+#include "cgroup_iter.skel.h"
+#include "cgroup_helpers.h"
+
+#define ROOT           0
+#define PARENT         1
+#define CHILD1         2
+#define CHILD2         3
+#define NUM_CGROUPS    4
+
+#define PROLOGUE       "prologue\n"
+#define EPILOGUE       "epilogue\n"
+
+static const char *cg_path[] = {
+	"/", "/parent", "/parent/child1", "/parent/child2"
+};
+
+static int cg_fd[] = {-1, -1, -1, -1};
+static unsigned long long cg_id[] = {0, 0, 0, 0};
+static char expected_output[64];
+
+static int setup_cgroups(void)
+{
+	int fd, i = 0;
+
+	for (i = 0; i < NUM_CGROUPS; i++) {
+		fd = create_and_get_cgroup(cg_path[i]);
+		if (fd < 0)
+			return fd;
+
+		cg_fd[i] = fd;
+		cg_id[i] = get_cgroup_id(cg_path[i]);
+	}
+	return 0;
+}
+
+static void cleanup_cgroups(void)
+{
+	int i;
+
+	for (i = 0; i < NUM_CGROUPS; i++)
+		close(cg_fd[i]);
+}
+
+static void read_from_cgroup_iter(struct bpf_program *prog, int cgroup_fd,
+				  int order, const char *testname)
+{
+	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+	union bpf_iter_link_info linfo;
+	struct bpf_link *link;
+	int len, iter_fd;
+	static char buf[128];
+	size_t left;
+	char *p;
+
+	memset(&linfo, 0, sizeof(linfo));
+	linfo.cgroup.cgroup_fd = cgroup_fd;
+	linfo.cgroup.order = order;
+	opts.link_info = &linfo;
+	opts.link_info_len = sizeof(linfo);
+
+	link = bpf_program__attach_iter(prog, &opts);
+	if (!ASSERT_OK_PTR(link, "attach_iter"))
+		return;
+
+	iter_fd = bpf_iter_create(bpf_link__fd(link));
+	if (iter_fd < 0)
+		goto free_link;
+
+	memset(buf, 0, sizeof(buf));
+	left = ARRAY_SIZE(buf);
+	p = buf;
+	while ((len = read(iter_fd, p, left)) > 0) {
+		p += len;
+		left -= len;
+	}
+
+	ASSERT_STREQ(buf, expected_output, testname);
+
+	/* read() after iter finishes should be ok. */
+	if (len == 0)
+		ASSERT_OK(read(iter_fd, buf, sizeof(buf)), "second_read");
+
+	close(iter_fd);
+free_link:
+	bpf_link__destroy(link);
+}
+
+/* Invalid cgroup. */
+static void test_invalid_cgroup(struct cgroup_iter *skel)
+{
+	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+	union bpf_iter_link_info linfo;
+	struct bpf_link *link;
+
+	memset(&linfo, 0, sizeof(linfo));
+	linfo.cgroup.cgroup_fd = (__u32)-1;
+	opts.link_info = &linfo;
+	opts.link_info_len = sizeof(linfo);
+
+	link = bpf_program__attach_iter(skel->progs.cgroup_id_printer, &opts);
+	ASSERT_ERR_PTR(link, "attach_iter");
+	bpf_link__destroy(link);
+}
+
+/* Specifying both cgroup_fd and cgroup_id is invalid. */
+static void test_invalid_cgroup_spec(struct cgroup_iter *skel)
+{
+	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+	union bpf_iter_link_info linfo;
+	struct bpf_link *link;
+
+	memset(&linfo, 0, sizeof(linfo));
+	linfo.cgroup.cgroup_fd = (__u32)cg_fd[PARENT];
+	linfo.cgroup.cgroup_id = (__u64)cg_id[PARENT];
+	opts.link_info = &linfo;
+	opts.link_info_len = sizeof(linfo);
+
+	link = bpf_program__attach_iter(skel->progs.cgroup_id_printer, &opts);
+	ASSERT_ERR_PTR(link, "attach_iter");
+	bpf_link__destroy(link);
+}
+
+/* Preorder walk prints parent and child in order. */
+static void test_walk_preorder(struct cgroup_iter *skel)
+{
+	snprintf(expected_output, sizeof(expected_output),
+		 PROLOGUE "%8llu\n%8llu\n%8llu\n" EPILOGUE,
+		 cg_id[PARENT], cg_id[CHILD1], cg_id[CHILD2]);
+
+	read_from_cgroup_iter(skel->progs.cgroup_id_printer, cg_fd[PARENT],
+			      BPF_ITER_DESCENDANTS_PRE, "preorder");
+}
+
+/* Postorder walk prints child and parent in order. */
+static void test_walk_postorder(struct cgroup_iter *skel)
+{
+	snprintf(expected_output, sizeof(expected_output),
+		 PROLOGUE "%8llu\n%8llu\n%8llu\n" EPILOGUE,
+		 cg_id[CHILD1], cg_id[CHILD2], cg_id[PARENT]);
+
+	read_from_cgroup_iter(skel->progs.cgroup_id_printer, cg_fd[PARENT],
+			      BPF_ITER_DESCENDANTS_POST, "postorder");
+}
+
+/* Walking parents prints parent and then root. */
+static void test_walk_ancestors_up(struct cgroup_iter *skel)
+{
+	/* terminate the walk when ROOT is met. */
+	skel->bss->terminal_cgroup = cg_id[ROOT];
+
+	snprintf(expected_output, sizeof(expected_output),
+		 PROLOGUE "%8llu\n%8llu\n" EPILOGUE,
+		 cg_id[PARENT], cg_id[ROOT]);
+
+	read_from_cgroup_iter(skel->progs.cgroup_id_printer, cg_fd[PARENT],
+			      BPF_ITER_ANCESTORS_UP, "ancestors_up");
+
+	skel->bss->terminal_cgroup = 0;
+}
+
+/* Early termination prints parent only. */
+static void test_early_termination(struct cgroup_iter *skel)
+{
+	/* terminate the walk after the first element is processed. */
+	skel->bss->terminate_early = 1;
+
+	snprintf(expected_output, sizeof(expected_output),
+		 PROLOGUE "%8llu\n" EPILOGUE, cg_id[PARENT]);
+
+	read_from_cgroup_iter(skel->progs.cgroup_id_printer, cg_fd[PARENT],
+			      BPF_ITER_DESCENDANTS_PRE, "early_termination");
+
+	skel->bss->terminate_early = 0;
+}
+
+/* Waling self prints self only. */
+static void test_walk_self_only(struct cgroup_iter *skel)
+{
+	snprintf(expected_output, sizeof(expected_output),
+		 PROLOGUE "%8llu\n" EPILOGUE, cg_id[PARENT]);
+
+	read_from_cgroup_iter(skel->progs.cgroup_id_printer, cg_fd[PARENT],
+			      BPF_ITER_SELF_ONLY, "self_only");
+}
+
+void test_cgroup_iter(void)
+{
+	struct cgroup_iter *skel = NULL;
+
+	if (setup_cgroup_environment())
+		return;
+
+	if (setup_cgroups())
+		goto out;
+
+	skel = cgroup_iter__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "cgroup_iter__open_and_load"))
+		goto out;
+
+	if (test__start_subtest("cgroup_iter__invalid_cgroup"))
+		test_invalid_cgroup(skel);
+	if (test__start_subtest("cgroup_iter__invalid_cgroup_spec"))
+		test_invalid_cgroup_spec(skel);
+	if (test__start_subtest("cgroup_iter__preorder"))
+		test_walk_preorder(skel);
+	if (test__start_subtest("cgroup_iter__postorder"))
+		test_walk_postorder(skel);
+	if (test__start_subtest("cgroup_iter__ancestors_up_walk"))
+		test_walk_ancestors_up(skel);
+	if (test__start_subtest("cgroup_iter__early_termination"))
+		test_early_termination(skel);
+	if (test__start_subtest("cgroup_iter__self_only"))
+		test_walk_self_only(skel);
+out:
+	cgroup_iter__destroy(skel);
+	cleanup_cgroups();
+	cleanup_cgroup_environment();
+}
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter.h b/tools/testing/selftests/bpf/progs/bpf_iter.h
index e9846606690d..c41ee80533ca 100644
--- a/tools/testing/selftests/bpf/progs/bpf_iter.h
+++ b/tools/testing/selftests/bpf/progs/bpf_iter.h
@@ -17,6 +17,7 @@
 #define bpf_iter__bpf_sk_storage_map bpf_iter__bpf_sk_storage_map___not_used
 #define bpf_iter__sockmap bpf_iter__sockmap___not_used
 #define bpf_iter__bpf_link bpf_iter__bpf_link___not_used
+#define bpf_iter__cgroup bpf_iter__cgroup___not_used
 #define btf_ptr btf_ptr___not_used
 #define BTF_F_COMPACT BTF_F_COMPACT___not_used
 #define BTF_F_NONAME BTF_F_NONAME___not_used
@@ -40,6 +41,7 @@
 #undef bpf_iter__bpf_sk_storage_map
 #undef bpf_iter__sockmap
 #undef bpf_iter__bpf_link
+#undef bpf_iter__cgroup
 #undef btf_ptr
 #undef BTF_F_COMPACT
 #undef BTF_F_NONAME
@@ -141,6 +143,11 @@ struct bpf_iter__bpf_link {
 	struct bpf_link *link;
 };
 
+struct bpf_iter__cgroup {
+	struct bpf_iter_meta *meta;
+	struct cgroup *cgroup;
+} __attribute__((preserve_access_index));
+
 struct btf_ptr {
 	void *ptr;
 	__u32 type_id;
diff --git a/tools/testing/selftests/bpf/progs/cgroup_iter.c b/tools/testing/selftests/bpf/progs/cgroup_iter.c
new file mode 100644
index 000000000000..de03997322a7
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/cgroup_iter.c
@@ -0,0 +1,39 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Google */
+
+#include "bpf_iter.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+int terminate_early = 0;
+u64 terminal_cgroup = 0;
+
+static inline u64 cgroup_id(struct cgroup *cgrp)
+{
+	return cgrp->kn->id;
+}
+
+SEC("iter/cgroup")
+int cgroup_id_printer(struct bpf_iter__cgroup *ctx)
+{
+	struct seq_file *seq = ctx->meta->seq;
+	struct cgroup *cgrp = ctx->cgroup;
+
+	/* epilogue */
+	if (cgrp == NULL) {
+		BPF_SEQ_PRINTF(seq, "epilogue\n");
+		return 0;
+	}
+
+	/* prologue */
+	if (ctx->meta->seq_num == 0)
+		BPF_SEQ_PRINTF(seq, "prologue\n");
+
+	BPF_SEQ_PRINTF(seq, "%8llu\n", cgroup_id(cgrp));
+
+	if (terminal_cgroup == cgroup_id(cgrp))
+		return 1;
+
+	return terminate_early ? 1 : 0;
+}
-- 
2.37.1.595.g718a3a8f04-goog


^ permalink raw reply related

* [PATCH bpf-next v8 1/5] bpf: Introduce cgroup iter
From: Hao Luo @ 2022-08-12 20:27 UTC (permalink / raw)
  To: linux-kernel, bpf, cgroups, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, Tejun Heo, Zefan Li,
	KP Singh, Johannes Weiner, Michal Hocko, John Fastabend,
	Michal Koutny, Roman Gushchin, David Rientjes, Stanislav Fomichev,
	Shakeel Butt, Yosry Ahmed, Hao Luo
In-Reply-To: <20220812202802.3774257-1-haoluo@google.com>

Cgroup_iter is a type of bpf_iter. It walks over cgroups in four modes:

 - walking a cgroup's descendants in pre-order.
 - walking a cgroup's descendants in post-order.
 - walking a cgroup's ancestors.
 - process only the given cgroup.

When attaching cgroup_iter, one can set a cgroup to the iter_link
created from attaching. This cgroup is passed as a file descriptor
or cgroup id and serves as the starting point of the walk. If no
cgroup is specified, the starting point will be the root cgroup v2.

For walking descendants, one can specify the order: either pre-order or
post-order. For walking ancestors, the walk starts at the specified
cgroup and ends at the root.

One can also terminate the walk early by returning 1 from the iter
program.

Note that because walking cgroup hierarchy holds cgroup_mutex, the iter
program is called with cgroup_mutex held.

Currently only one session is supported, which means, depending on the
volume of data bpf program intends to send to user space, the number
of cgroups that can be walked is limited. For example, given the current
buffer size is 8 * PAGE_SIZE, if the program sends 64B data for each
cgroup, assuming PAGE_SIZE is 4kb, the total number of cgroups that can
be walked is 512. This is a limitation of cgroup_iter. If the output
data is larger than the kernel buffer size, after all data in the
kernel buffer is consumed by user space, the subsequent read() syscall
will signal EOPNOTSUPP. In order to work around, the user may have to
update their program to reduce the volume of data sent to output. For
example, skip some uninteresting cgroups. In future, we may extend
bpf_iter flags to allow customizing buffer size.

Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Hao Luo <haoluo@google.com>
---
 include/linux/bpf.h                           |   8 +
 include/uapi/linux/bpf.h                      |  35 +++
 kernel/bpf/Makefile                           |   3 +
 kernel/bpf/cgroup_iter.c                      | 283 ++++++++++++++++++
 tools/include/uapi/linux/bpf.h                |  35 +++
 .../selftests/bpf/prog_tests/btf_dump.c       |   4 +-
 6 files changed, 366 insertions(+), 2 deletions(-)
 create mode 100644 kernel/bpf/cgroup_iter.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index a627a02cf8ab..ecb8c61178a1 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -48,6 +48,7 @@ struct mem_cgroup;
 struct module;
 struct bpf_func_state;
 struct ftrace_ops;
+struct cgroup;
 
 extern struct idr btf_idr;
 extern spinlock_t btf_idr_lock;
@@ -1730,7 +1731,14 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
 	int __init bpf_iter_ ## target(args) { return 0; }
 
 struct bpf_iter_aux_info {
+	/* for map_elem iter */
 	struct bpf_map *map;
+
+	/* for cgroup iter */
+	struct {
+		struct cgroup *start; /* starting cgroup */
+		int order;
+	} cgroup;
 };
 
 typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 7d1e2794d83e..bc3c901b9f70 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -87,10 +87,34 @@ struct bpf_cgroup_storage_key {
 	__u32	attach_type;		/* program attach type (enum bpf_attach_type) */
 };
 
+enum bpf_iter_order {
+	BPF_ITER_DESCENDANTS_PRE = 0,	/* walk descendants in pre-order. */
+	BPF_ITER_DESCENDANTS_POST,	/* walk descendants in post-order. */
+	BPF_ITER_ANCESTORS_UP,		/* walk ancestors upward. */
+	BPF_ITER_SELF_ONLY,		/* process only a single object. */
+};
+
 union bpf_iter_link_info {
 	struct {
 		__u32	map_fd;
 	} map;
+	struct {
+		/* Users must specify order using one of the following values:
+		 *  - BPF_ITER_DESCENDANTS_PRE
+		 *  - BPF_ITER_DESCENDANTS_POST
+		 *  - BPF_ITER_ANCESTORS_UP
+		 *  - BPF_ITER_SELF_ONLY
+		 */
+		__u32	order;
+
+		/* At most one of cgroup_fd and cgroup_id can be non-zero. If
+		 * both are zero, the walk starts from the default cgroup v2
+		 * root. For walking v1 hierarchy, one should always explicitly
+		 * specify cgroup_fd.
+		 */
+		__u32	cgroup_fd;
+		__u64	cgroup_id;
+	} cgroup;
 };
 
 /* BPF syscall commands, see bpf(2) man-page for more details. */
@@ -6157,11 +6181,22 @@ struct bpf_link_info {
 		struct {
 			__aligned_u64 target_name; /* in/out: target_name buffer ptr */
 			__u32 target_name_len;	   /* in/out: target_name buffer len */
+
+			/* If the iter specific field is 32 bits, it can be put
+			 * in the first or second union. Otherwise it should be
+			 * put in the second union.
+			 */
 			union {
 				struct {
 					__u32 map_id;
 				} map;
 			};
+			union {
+				struct {
+					__u64 cgroup_id;
+					__u32 order;
+				} cgroup;
+			};
 		} iter;
 		struct  {
 			__u32 netns_ino;
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 057ba8e01e70..00e05b69a4df 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -24,6 +24,9 @@ endif
 ifeq ($(CONFIG_PERF_EVENTS),y)
 obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
 endif
+ifeq ($(CONFIG_CGROUPS),y)
+obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o
+endif
 obj-$(CONFIG_CGROUP_BPF) += cgroup.o
 ifeq ($(CONFIG_INET),y)
 obj-$(CONFIG_BPF_SYSCALL) += reuseport_array.o
diff --git a/kernel/bpf/cgroup_iter.c b/kernel/bpf/cgroup_iter.c
new file mode 100644
index 000000000000..6bdd11952783
--- /dev/null
+++ b/kernel/bpf/cgroup_iter.c
@@ -0,0 +1,283 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2022 Google */
+#include <linux/bpf.h>
+#include <linux/btf_ids.h>
+#include <linux/cgroup.h>
+#include <linux/kernel.h>
+#include <linux/seq_file.h>
+
+#include "../cgroup/cgroup-internal.h"  /* cgroup_mutex and cgroup_is_dead */
+
+/* cgroup_iter provides four modes of traversal to the cgroup hierarchy.
+ *
+ *  1. Walk the descendants of a cgroup in pre-order.
+ *  2. Walk the descendants of a cgroup in post-order.
+ *  3. Walk the ancestors of a cgroup.
+ *  4. Show the given cgroup only.
+ *
+ * For walking descendants, cgroup_iter can walk in either pre-order or
+ * post-order. For walking ancestors, the iter walks up from a cgroup to
+ * the root.
+ *
+ * The iter program can terminate the walk early by returning 1. Walk
+ * continues if prog returns 0.
+ *
+ * The prog can check (seq->num == 0) to determine whether this is
+ * the first element. The prog may also be passed a NULL cgroup,
+ * which means the walk has completed and the prog has a chance to
+ * do post-processing, such as outputting an epilogue.
+ *
+ * Note: the iter_prog is called with cgroup_mutex held.
+ *
+ * Currently only one session is supported, which means, depending on the
+ * volume of data bpf program intends to send to user space, the number
+ * of cgroups that can be walked is limited. For example, given the current
+ * buffer size is 8 * PAGE_SIZE, if the program sends 64B data for each
+ * cgroup, assuming PAGE_SIZE is 4kb, the total number of cgroups that can
+ * be walked is 512. This is a limitation of cgroup_iter. If the output data
+ * is larger than the kernel buffer size, after all data in the kernel buffer
+ * is consumed by user space, the subsequent read() syscall will signal
+ * EOPNOTSUPP. In order to work around, the user may have to update their
+ * program to reduce the volume of data sent to output. For example, skip
+ * some uninteresting cgroups.
+ */
+
+struct bpf_iter__cgroup {
+	__bpf_md_ptr(struct bpf_iter_meta *, meta);
+	__bpf_md_ptr(struct cgroup *, cgroup);
+};
+
+struct cgroup_iter_priv {
+	struct cgroup_subsys_state *start_css;
+	bool visited_all;
+	bool terminate;
+	int order;
+};
+
+static void *cgroup_iter_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	struct cgroup_iter_priv *p = seq->private;
+
+	mutex_lock(&cgroup_mutex);
+
+	/* cgroup_iter doesn't support read across multiple sessions. */
+	if (*pos > 0) {
+		if (p->visited_all)
+			return NULL;
+
+		/* Haven't visited all, but because cgroup_mutex has dropped,
+		 * return -EOPNOTSUPP to indicate incomplete iteration.
+		 */
+		return ERR_PTR(-EOPNOTSUPP);
+	}
+
+	++*pos;
+	p->terminate = false;
+	p->visited_all = false;
+	if (p->order == BPF_ITER_DESCENDANTS_PRE)
+		return css_next_descendant_pre(NULL, p->start_css);
+	else if (p->order == BPF_ITER_DESCENDANTS_POST)
+		return css_next_descendant_post(NULL, p->start_css);
+	else if (p->order == BPF_ITER_ANCESTORS_UP)
+		return p->start_css;
+	else /* BPF_ITER_SELF_ONLY */
+		return p->start_css;
+}
+
+static int __cgroup_iter_seq_show(struct seq_file *seq,
+				  struct cgroup_subsys_state *css, int in_stop);
+
+static void cgroup_iter_seq_stop(struct seq_file *seq, void *v)
+{
+	struct cgroup_iter_priv *p = seq->private;
+
+	mutex_unlock(&cgroup_mutex);
+
+	/* pass NULL to the prog for post-processing */
+	if (!v) {
+		__cgroup_iter_seq_show(seq, NULL, true);
+		p->visited_all = true;
+	}
+}
+
+static void *cgroup_iter_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct cgroup_subsys_state *curr = (struct cgroup_subsys_state *)v;
+	struct cgroup_iter_priv *p = seq->private;
+
+	++*pos;
+	if (p->terminate)
+		return NULL;
+
+	if (p->order == BPF_ITER_DESCENDANTS_PRE)
+		return css_next_descendant_pre(curr, p->start_css);
+	else if (p->order == BPF_ITER_DESCENDANTS_POST)
+		return css_next_descendant_post(curr, p->start_css);
+	else if (p->order == BPF_ITER_ANCESTORS_UP)
+		return curr->parent;
+	else  /* BPF_ITER_SELF_ONLY */
+		return NULL;
+}
+
+static int __cgroup_iter_seq_show(struct seq_file *seq,
+				  struct cgroup_subsys_state *css, int in_stop)
+{
+	struct cgroup_iter_priv *p = seq->private;
+	struct bpf_iter__cgroup ctx;
+	struct bpf_iter_meta meta;
+	struct bpf_prog *prog;
+	int ret = 0;
+
+	/* cgroup is dead, skip this element */
+	if (css && cgroup_is_dead(css->cgroup))
+		return 0;
+
+	ctx.meta = &meta;
+	ctx.cgroup = css ? css->cgroup : NULL;
+	meta.seq = seq;
+	prog = bpf_iter_get_info(&meta, in_stop);
+	if (prog)
+		ret = bpf_iter_run_prog(prog, &ctx);
+
+	/* if prog returns > 0, terminate after this element. */
+	if (ret != 0)
+		p->terminate = true;
+
+	return 0;
+}
+
+static int cgroup_iter_seq_show(struct seq_file *seq, void *v)
+{
+	return __cgroup_iter_seq_show(seq, (struct cgroup_subsys_state *)v,
+				      false);
+}
+
+static const struct seq_operations cgroup_iter_seq_ops = {
+	.start  = cgroup_iter_seq_start,
+	.next   = cgroup_iter_seq_next,
+	.stop   = cgroup_iter_seq_stop,
+	.show   = cgroup_iter_seq_show,
+};
+
+BTF_ID_LIST_SINGLE(bpf_cgroup_btf_id, struct, cgroup)
+
+static int cgroup_iter_seq_init(void *priv, struct bpf_iter_aux_info *aux)
+{
+	struct cgroup_iter_priv *p = (struct cgroup_iter_priv *)priv;
+	struct cgroup *cgrp = aux->cgroup.start;
+
+	p->start_css = &cgrp->self;
+	p->terminate = false;
+	p->visited_all = false;
+	p->order = aux->cgroup.order;
+	return 0;
+}
+
+static const struct bpf_iter_seq_info cgroup_iter_seq_info = {
+	.seq_ops		= &cgroup_iter_seq_ops,
+	.init_seq_private	= cgroup_iter_seq_init,
+	.seq_priv_size		= sizeof(struct cgroup_iter_priv),
+};
+
+static int bpf_iter_attach_cgroup(struct bpf_prog *prog,
+				  union bpf_iter_link_info *linfo,
+				  struct bpf_iter_aux_info *aux)
+{
+	int fd = linfo->cgroup.cgroup_fd;
+	u64 id = linfo->cgroup.cgroup_id;
+	int order = linfo->cgroup.order;
+	struct cgroup *cgrp;
+
+	if (order != BPF_ITER_DESCENDANTS_PRE &&
+	    order != BPF_ITER_DESCENDANTS_POST &&
+	    order != BPF_ITER_ANCESTORS_UP &&
+	    order != BPF_ITER_SELF_ONLY)
+		return -EINVAL;
+
+	if (fd && id)
+		return -EINVAL;
+
+	if (fd)
+		cgrp = cgroup_get_from_fd(fd);
+	else if (id)
+		cgrp = cgroup_get_from_id(id);
+	else /* walk the entire hierarchy by default. */
+		cgrp = cgroup_get_from_path("/");
+
+	if (IS_ERR(cgrp))
+		return PTR_ERR(cgrp);
+
+	aux->cgroup.start = cgrp;
+	aux->cgroup.order = order;
+	return 0;
+}
+
+static void bpf_iter_detach_cgroup(struct bpf_iter_aux_info *aux)
+{
+	cgroup_put(aux->cgroup.start);
+}
+
+static void bpf_iter_cgroup_show_fdinfo(const struct bpf_iter_aux_info *aux,
+					struct seq_file *seq)
+{
+	char *buf;
+
+	buf = kzalloc(PATH_MAX, GFP_KERNEL);
+	if (!buf) {
+		seq_puts(seq, "cgroup_path:\t<unknown>\n");
+		goto show_order;
+	}
+
+	/* If cgroup_path_ns() fails, buf will be an empty string, cgroup_path
+	 * will print nothing.
+	 *
+	 * Path is in the calling process's cgroup namespace.
+	 */
+	cgroup_path_ns(aux->cgroup.start, buf, PATH_MAX,
+		       current->nsproxy->cgroup_ns);
+	seq_printf(seq, "cgroup_path:\t%s\n", buf);
+	kfree(buf);
+
+show_order:
+	if (aux->cgroup.order == BPF_ITER_DESCENDANTS_PRE)
+		seq_puts(seq, "order: descendants_pre\n");
+	else if (aux->cgroup.order == BPF_ITER_DESCENDANTS_POST)
+		seq_puts(seq, "order: descendants_post\n");
+	else if (aux->cgroup.order == BPF_ITER_ANCESTORS_UP)
+		seq_puts(seq, "order: ancestors_up\n");
+	else /* BPF_ITER_SELF_ONLY */
+		seq_puts(seq, "order: self_only\n");
+}
+
+static int bpf_iter_cgroup_fill_link_info(const struct bpf_iter_aux_info *aux,
+					  struct bpf_link_info *info)
+{
+	info->iter.cgroup.order = aux->cgroup.order;
+	info->iter.cgroup.cgroup_id = cgroup_id(aux->cgroup.start);
+	return 0;
+}
+
+DEFINE_BPF_ITER_FUNC(cgroup, struct bpf_iter_meta *meta,
+		     struct cgroup *cgroup)
+
+static struct bpf_iter_reg bpf_cgroup_reg_info = {
+	.target			= "cgroup",
+	.attach_target		= bpf_iter_attach_cgroup,
+	.detach_target		= bpf_iter_detach_cgroup,
+	.show_fdinfo		= bpf_iter_cgroup_show_fdinfo,
+	.fill_link_info		= bpf_iter_cgroup_fill_link_info,
+	.ctx_arg_info_size	= 1,
+	.ctx_arg_info		= {
+		{ offsetof(struct bpf_iter__cgroup, cgroup),
+		  PTR_TO_BTF_ID_OR_NULL },
+	},
+	.seq_info		= &cgroup_iter_seq_info,
+};
+
+static int __init bpf_cgroup_iter_init(void)
+{
+	bpf_cgroup_reg_info.ctx_arg_info[0].btf_id = bpf_cgroup_btf_id[0];
+	return bpf_iter_reg_target(&bpf_cgroup_reg_info);
+}
+
+late_initcall(bpf_cgroup_iter_init);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index e174ad28aeb7..57cf7a302bd7 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -87,10 +87,34 @@ struct bpf_cgroup_storage_key {
 	__u32	attach_type;		/* program attach type (enum bpf_attach_type) */
 };
 
+enum bpf_iter_order {
+	BPF_ITER_DESCENDANTS_PRE = 0,	/* walk descendants in pre-order. */
+	BPF_ITER_DESCENDANTS_POST,	/* walk descendants in post-order. */
+	BPF_ITER_ANCESTORS_UP,		/* walk ancestors upward. */
+	BPF_ITER_SELF_ONLY,		/* process only a single object. */
+};
+
 union bpf_iter_link_info {
 	struct {
 		__u32	map_fd;
 	} map;
+	struct {
+		/* Users must specify order using one of the following values:
+		 *  - BPF_ITER_DESCENDANTS_PRE
+		 *  - BPF_ITER_DESCENDANTS_POST
+		 *  - BPF_ITER_ANCESTORS_UP
+		 *  - BPF_ITER_SELF_ONLY
+		 */
+		__u32	order;
+
+		/* At most one of cgroup_fd and cgroup_id can be non-zero. If
+		 * both are zero, the walk starts from the default cgroup v2
+		 * root. For walking v1 hierarchy, one should always explicitly
+		 * specify cgroup_fd.
+		 */
+		__u32	cgroup_fd;
+		__u64	cgroup_id;
+	} cgroup;
 };
 
 /* BPF syscall commands, see bpf(2) man-page for more details. */
@@ -6157,11 +6181,22 @@ struct bpf_link_info {
 		struct {
 			__aligned_u64 target_name; /* in/out: target_name buffer ptr */
 			__u32 target_name_len;	   /* in/out: target_name buffer len */
+
+			/* If the iter specific field is 32 bits, it can be put
+			 * in the first or second union. Otherwise it should be
+			 * put in the second union.
+			 */
 			union {
 				struct {
 					__u32 map_id;
 				} map;
 			};
+			union {
+				struct {
+					__u64 cgroup_id;
+					__u32 order;
+				} cgroup;
+			};
 		} iter;
 		struct  {
 			__u32 netns_ino;
diff --git a/tools/testing/selftests/bpf/prog_tests/btf_dump.c b/tools/testing/selftests/bpf/prog_tests/btf_dump.c
index 5fce7008d1ff..84c1cfaa2b02 100644
--- a/tools/testing/selftests/bpf/prog_tests/btf_dump.c
+++ b/tools/testing/selftests/bpf/prog_tests/btf_dump.c
@@ -764,8 +764,8 @@ static void test_btf_dump_struct_data(struct btf *btf, struct btf_dump *d,
 
 	/* union with nested struct */
 	TEST_BTF_DUMP_DATA(btf, d, "union", str, union bpf_iter_link_info, BTF_F_COMPACT,
-			   "(union bpf_iter_link_info){.map = (struct){.map_fd = (__u32)1,},}",
-			   { .map = { .map_fd = 1 }});
+			   "(union bpf_iter_link_info){.map = (struct){.map_fd = (__u32)1,},.cgroup = (struct){.order = (__u32)1,.cgroup_fd = (__u32)1,},}",
+			   { .cgroup = { .order = 1, .cgroup_fd = 1, }});
 
 	/* struct skb with nested structs/unions; because type output is so
 	 * complex, we don't do a string comparison, just verify we return
-- 
2.37.1.595.g718a3a8f04-goog


^ permalink raw reply related

* [PATCH bpf-next v8 0/5] bpf: rstat: cgroup hierarchical stats
From: Hao Luo @ 2022-08-12 20:27 UTC (permalink / raw)
  To: linux-kernel, bpf, cgroups, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, Tejun Heo, Zefan Li,
	KP Singh, Johannes Weiner, Michal Hocko, John Fastabend,
	Michal Koutny, Roman Gushchin, David Rientjes, Stanislav Fomichev,
	Shakeel Butt, Yosry Ahmed, Hao Luo

This patch series allows for using bpf to collect hierarchical cgroup
stats efficiently by integrating with the rstat framework. The rstat
framework provides an efficient way to collect cgroup stats percpu and
propagate them through the cgroup hierarchy.

The stats are exposed to userspace in textual form by reading files in
bpffs, similar to cgroupfs stats by using a cgroup_iter program.
cgroup_iter is a type of bpf_iter. It walks over cgroups in four modes:
- walking a cgroup's descendants in pre-order.
- walking a cgroup's descendants in post-order.
- walking a cgroup's ancestors.
- process only a single object.

When attaching cgroup_iter, one needs to set a cgroup to the iter_link
created from attaching. This cgroup can be passed either as a file
descriptor or a cgroup id. That cgroup serves as the starting point of
the walk.

One can also terminate the walk early by returning 1 from the iter
program.

Note that because walking cgroup hierarchy holds cgroup_mutex, the iter
program is called with cgroup_mutex held.

** Background on rstat for stats collection **
(I am using a subscriber analogy that is not commonly used)

The rstat framework maintains a tree of cgroups that have updates and
which cpus have updates. A subscriber to the rstat framework maintains
their own stats. The framework is used to tell the subscriber when
and what to flush, for the most efficient stats propagation. The
workflow is as follows:

- When a subscriber updates a cgroup on a cpu, it informs the rstat
  framework by calling cgroup_rstat_updated(cgrp, cpu).

- When a subscriber wants to read some stats for a cgroup, it asks
  the rstat framework to initiate a stats flush (propagation) by calling
  cgroup_rstat_flush(cgrp).

- When the rstat framework initiates a flush, it makes callbacks to
  subscribers to aggregate stats on cpus that have updates, and
  propagate updates to their parent.

Currently, the main subscribers to the rstat framework are cgroup
subsystems (e.g. memory, block). This patch series allow bpf programs to
become subscribers as well.

Patches in this series are organized as follows:
* Patches 1-2 introduce cgroup_iter prog, and a selftest.
* Patches 3-5 allow bpf programs to integrate with rstat by adding the
  necessary hook points and kfunc. A comprehensive selftest that
  demonstrates the entire workflow for using bpf and rstat to
  efficiently collect and output cgroup stats is added.

---
Changelog:
v7 -> v8:
- Removed the confusing BPF_ITER_DEFAULT (Andrii)
- s/SELF/SELF_ONLY/g
- Fixed typo (e.g. outputing) (Andrii)
- Use "descendants_pre", "descendants_post" etc. instead of "pre",
  "post" (Andrii)

v6 -> v7:
- Updated commit/comments in cgroup_iter for read() behavior (Yonghong)
- Extracted BPF_ITER_SELF and other options out of cgroup_iter, so
  that they can be used in other iters. Also renamed them. (Andrii)
- Supports both cgroup_fd and cgroup_id when specifying target cgroup.
  (Andrii)
- Avoided using macro for formatting expected output in cgroup_iter
  selftest. (Andrii)
- Applied 'static' on all vars and functions in cgroup_iter selftest.
  (Andrii)
- Fixed broken buf reading in cgroup_iter selftest. (Andrii)
- Switched to use bpf_link__destroy() unconditionally. (Andrii)
- Removed 'volatile' for non-const global vars in selftests. (Andrii)
- Started using bpf_core_enum_value() to get memory_cgrp_id. (Andrii)

v5 -> v6:
- Rebased on bpf-next
- Tidy up cgroup_hierarchical_stats test (Andrii)
  * 'static' and 'inline'
  * avoid using libbpf_get_error()
  * string literals of cgroup paths.
- Rename patch 8/8 to 'selftests/bpf' (Yonghong)
- Fix cgroup_iter comments (e.g. PAGE_SIZE and uapi) (Yonghong)
- Make sure further read() returns OK after previous read() finished
  properly (Yonghong)
- Release cgroup_mutex before the last call of show() (Kumar)

v4 -> v5:
- Rebased on top of new kfunc flags infrastructure, updated patch 1 and
  patch 6 accordingly.
- Added docs for sleepable kfuncs.

v3 -> v4:
- cgroup_iter:
  * reorder fields in bpf_link_info to avoid break uapi (Yonghong)
  * comment the behavior when cgroup_fd=0 (Yonghong)
  * comment on the limit of number of cgroups supported by cgroup_iter.
    (Yonghong)
- cgroup_hierarchical_stats selftest:
  * Do not return -1 if stats are not found (causes overflow in userspace).
  * Check if child process failed to join cgroup.
  * Make buf and path arrays in get_cgroup_vmscan_delay() static.
  * Increase the test map sizes to accomodate cgroups that are not
    created by the test.

v2 -> v3:
- cgroup_iter:
  * Added conditional compilation of cgroup_iter.c in kernel/bpf/Makefile
    (kernel test) and dropped the !CONFIG_CGROUP patch.
  * Added validation of traversal_order when attaching (Yonghong).
  * Fixed previous wording "two modes" to "three modes" (Yonghong).
  * Fixed the btf_dump selftest broken by this patch (Yonghong).
  * Fixed ctx_arg_info[0] to use "PTR_TO_BTF_ID_OR_NULL" instead of
    "PTR_TO_BTF_ID", because the "cgroup" pointer passed to iter prog can
     be null.
- Use __diag_push to eliminate __weak noinline warning in
  bpf_rstat_flush().
- cgroup_hierarchical_stats selftest:
  * Added write_cgroup_file_parent() helper.
  * Added error handling for failed map updates.
  * Added null check for cgroup in vmscan_flush.
  * Fixed the signature of vmscan_[start/end].
  * Correctly return error code when attaching trace programs fail.
  * Make sure all links are destroyed correctly and not leaking in
    cgroup_hierarchical_stats selftest.
  * Use memory.reclaim instead of memory.high as a more reliable way to
    invoke reclaim.
  * Eliminated sleeps, the test now runs faster.

v1 -> v2:
- Redesign of cgroup_iter from v1, based on Alexei's idea [1]:
  * supports walking cgroup subtree.
  * supports walking ancestors of a cgroup. (Andrii)
  * supports terminating the walk early.
  * uses fd instead of cgroup_id as parameter for iter_link. Using fd is
    a convention in bpf.
  * gets cgroup's ref at attach time and deref at detach.
  * brought back cgroup1 support for cgroup_iter.
- Squashed the patches adding the rstat flush hook points and kfuncs
  (Tejun).
- Added a comment explaining why bpf_rstat_flush() needs to be weak
  (Tejun).
- Updated the final selftest with the new cgroup_iter design.
- Changed CHECKs in the selftest with ASSERTs (Yonghong, Andrii).
- Removed empty line at the end of the selftest (Yonghong).
- Renamed test files to cgroup_hierarchical_stats.c.
- Reordered CGROUP_PATH params order to match struct declaration
  in the selftest (Michal).
- Removed memory_subsys_enabled() and made sure memcg controller
  enablement checks make sense and are documented (Michal).


RFC v2 -> v1:
- Instead of introducing a new program type for rstat flushing, add an
  empty hook point, bpf_rstat_flush(), and use fentry bpf programs to
  attach to it and flush bpf stats.
- Instead of using helpers, use kfuncs for rstat functions.
- These changes simplify the patchset greatly, with minimal changes to
  uapi.

RFC v1 -> RFC v2:
- Instead of rstat flush programs attach to subsystems, they now attach
  to rstat (global flushers, not per-subsystem), based on discussions
  with Tejun. The first patch is entirely rewritten.
- Pass cgroup pointers to rstat flushers instead of cgroup ids. This is
  much more flexibility and less likely to need a uapi update later.
- rstat helpers are now only defined if CGROUP_CONFIG.
- Most of the code is now only defined if CGROUP_CONFIG and
  CONFIG_BPF_SYSCALL.
- Move rstat helper protos from bpf_base_func_proto() to
  tracing_prog_func_proto().
- rstat helpers argument (cgroup pointer) is now ARG_PTR_TO_BTF_ID, not
  ARG_ANYTHING.
- Rewrote the selftest to use the cgroup helpers.
- Dropped bpf_map_lookup_percpu_elem (already added by Feng).
- Dropped patch to support cgroup v1 for cgroup_iter.
- Dropped patch to define some cgroup_put() when !CONFIG_CGROUP. The
  code that calls it is no longer compiled when !CONFIG_CGROUP.

cgroup_iter was originally introduced in a different patch series[2].
Hao and I agreed that it fits better as part of this series.
RFC v1 of this patch series had the following changes from [2]:
- Getting the cgroup's reference at the time at attaching, instead of
  at the time when iterating. (Yonghong)
- Remove .init_seq_private and .fini_seq_private callbacks for
  cgroup_iter. They are not needed now. (Yonghong)

[1] https://lore.kernel.org/bpf/20220520221919.jnqgv52k4ajlgzcl@MBP-98dd607d3435.dhcp.thefacebook.com/
[2] https://lore.kernel.org/lkml/20220225234339.2386398-9-haoluo@google.com/

Hao Luo (2):
  bpf: Introduce cgroup iter
  selftests/bpf: Test cgroup_iter.

Yosry Ahmed (3):
  cgroup: bpf: enable bpf programs to integrate with rstat
  selftests/bpf: extend cgroup helpers
  selftests/bpf: add a selftest for cgroup hierarchical stats collection

 include/linux/bpf.h                           |   8 +
 include/uapi/linux/bpf.h                      |  35 ++
 kernel/bpf/Makefile                           |   3 +
 kernel/bpf/cgroup_iter.c                      | 283 ++++++++++++++
 kernel/cgroup/rstat.c                         |  48 +++
 tools/include/uapi/linux/bpf.h                |  35 ++
 tools/testing/selftests/bpf/cgroup_helpers.c  | 202 ++++++++--
 tools/testing/selftests/bpf/cgroup_helpers.h  |  19 +-
 .../selftests/bpf/prog_tests/btf_dump.c       |   4 +-
 .../prog_tests/cgroup_hierarchical_stats.c    | 358 ++++++++++++++++++
 .../selftests/bpf/prog_tests/cgroup_iter.c    | 224 +++++++++++
 tools/testing/selftests/bpf/progs/bpf_iter.h  |   7 +
 .../bpf/progs/cgroup_hierarchical_stats.c     | 226 +++++++++++
 .../testing/selftests/bpf/progs/cgroup_iter.c |  39 ++
 14 files changed, 1442 insertions(+), 49 deletions(-)
 create mode 100644 kernel/bpf/cgroup_iter.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_iter.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_iter.c

-- 
2.37.1.595.g718a3a8f04-goog


^ permalink raw reply

* Re: igc: missing HW timestamps at TX
From: Vladimir Oltean @ 2022-08-12 20:16 UTC (permalink / raw)
  To: Ferenc Fejes
  Cc: vinicius.gomes@intel.com, netdev@vger.kernel.org,
	marton12050@gmail.com, peti.antal99@gmail.com
In-Reply-To: <252755c5f3b83c86fac5cb60c70931204b0ed6df.camel@ericsson.com>

Hi Ferenc,

On Fri, Aug 12, 2022 at 02:13:52PM +0000, Ferenc Fejes wrote:
> Ethtool after the measurement:
> ethtool -S enp3s0 | grep hwtstamp
>      tx_hwtstamp_timeouts: 1
>      tx_hwtstamp_skipped: 419
>      rx_hwtstamp_cleared: 0
> 
> Which is inline with what the isochron see.
> 
> But thats only happens if I forcingly put the affinity of the sender
> different CPU core than the ptp worker of the igc. If those running on
> the same core I doesnt lost any HW timestam even for 10 million
> packets. Worth to mention actually I see many lost timestamp which
> confused me a little bit but those are lost because of the small
> MSG_ERRQUEUE. When I increased that from few kbytes to 20 mbytes I got
> every timestamp successfully.

I have zero knowledge of Intel hardware. That being said, I've looked at
the driver for about 5 minutes, and the design seems to be that where
the timestamp is not available in band from the TX completion NAPI as
part of BD ring metadata, but rather, a TX timestamp complete is raised,
and this results in igc_tsync_interrupt() being called. However there
are 2 paths in the driver which call this, one is igc_msix_other() and
the other is igc_intr_msi() - this latter one is also the interrupt that
triggers the napi_schedule(). It would be interesting to see exactly
which MSI-X interrupt is the one that triggers igc_tsync_interrupt().

It's also interesting to understand what you mean precisely by affinity
of isochron. It has a main thread (used for PTP monitoring and for TX
timestamps) and a pthread for the sending process. The main thread's
affinity is controlled via taskset; the sender thread via --cpu-mask.
Is it the *sender* thread the one who makes the TX timestamps be
available quicker to user space, rather than the main thread, who
actually dequeues them from the error queue? If so, it might be because
the TX packets will trigger the TX completion interrupt, and this will
accelerate the processing of the TX timestamps. I'm unclear what happens
when the sender thread runs on a different CPU core than the TX
timestamp thread.

Your need to increase the SO_RCVBUF is also interesting. Keep in mind
that isochron at that scheduling priority and policy is a CPU hog, and
that igc_tsync_interrupt() calls schedule_work() - which uses the system
workqueue that runs at a very low priority (this begs the question, how
do you know how to match the CPU on which isochron runs with the CPU of
the system workqueue?). So isochron, high priority, competes for CPU
time with igc_ptp_tx_work(), low priority. One produces data, one
consumes it; queues are bound to get full at some point.
On the other hand, other drivers use the ptp_aux_kworker() that the PTP
core creates specifically for this purpose. It is a dedicated kthread
whose scheduling policy and priority can be adjusted using chrt. I think
it would be interesting to see how things behave when you replace
schedule_work() with ptp_schedule_worker().

^ permalink raw reply

* Re: [PATCH vhost 0/2] virtio_net: fix for stuck when change ring size with dev down
From: Jakub Kicinski @ 2022-08-12 19:47 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: virtualization, Jason Wang, David S. Miller, Eric Dumazet,
	Paolo Abeni, netdev, Michael S. Tsirkin
In-Reply-To: <1660267838.1945586-1-xuanzhuo@linux.alibaba.com>

On Fri, 12 Aug 2022 09:30:38 +0800 Xuan Zhuo wrote:
> Yes, the commits fixed by this patch are currently in Michael's vhost branch.
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git/log/?h=linux-next
> 
> So I mean that by "vhost" here, not into the net/net-next branch. Or should I use
> a more accurate term next time?

The tagging seems good, perhaps add a note to the commit message next
time. Maybe a lore link to the series you are fixing up?
The virtio_net.c patches usually go via netdev AFAIU and we're lacking
the vhost context.

^ permalink raw reply

* Re: [PATCH net] l2tp: Serialize access to sk_user_data with sock lock
From: Jakub Kicinski @ 2022-08-12 19:40 UTC (permalink / raw)
  To: Jakub Sitnicki; +Cc: netdev, kernel-team, van fantasy
In-Reply-To: <87edxlu6kd.fsf@cloudflare.com>

On Fri, 12 Aug 2022 11:54:43 +0200 Jakub Sitnicki wrote:
> On Thu, Aug 11, 2022 at 10:23 AM -07, Jakub Kicinski wrote:
> > On Wed, 10 Aug 2022 12:28:48 +0200 Jakub Sitnicki wrote:  
> >> Fixes: fd558d186df2 ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")  
> >
> > That tag immediately sets off red flags. Please find the commit where
> > to code originates, not where it was last moved.  
> 
> The code move happened in v2.6.35. There's no point in digging further, IMHO.

We can discuss a new "fixes-all-stable-trees" tag but until then let's
just stick to the existing rules.

As luck would have it in this case I think the tag is actually correct,
AFAICT the socket _was_ locked before the code move / refactoring?

> >> Reported-by: van fantasy <g1042620637@gmail.com>
> >> Tested-by: van fantasy <g1042620637@gmail.com>  
> >
> > Can we get real names? Otherwise let's just drop those tags.
> > I know that the legal name requirement is only for S-o-b tags,
> > technically, but it feels silly.  
> 
> I don't make the rules. There is already a precendent in the git log:

Ack, I'm aware, that's why I complained. If it's a single case meh, but
Haowei Yan seems to be quite prolific in finding bugs so switching to 
the real name is preferable.

Thanks!

^ permalink raw reply

* Re: [PATCH v2] wifi: cfg80211: Fix UAF in ieee80211_scan_rx()
From: Jakub Kicinski @ 2022-08-12 19:25 UTC (permalink / raw)
  To: Siddh Raman Pant
  Cc: Greg KH, johannes berg, david s. miller, eric dumazet,
	paolo abeni, netdev, syzbot+6cb476b7c69916a0caca, linux-wireless,
	linux-kernel, syzbot+f9acff9bf08a845f225d,
	syzbot+9250865a55539d384347, linux-kernel-mentees
In-Reply-To: <18292e1dcd8.2359a549180213.8185874405406307019@siddh.me>

On Fri, 12 Aug 2022 21:57:31 +0530 Siddh Raman Pant wrote:
> On Fri, 12 Aug 2022 20:57:58 +0530  Greg KH  wrote:
> > rcu just delays freeing of an object, you might just be delaying the
> > race condition.  Just moving a single object to be freed with rcu feels
> > very odd if you don't have another reference somewhere.  
> 
> As mentioned in patch message, in net/mac80211/scan.c, we have:
>         void ieee80211_scan_rx(struct ieee80211_local *local, struct sk_buff *skb)
>         {
>                 ...
>                 scan_req = rcu_dereference(local->scan_req);
>                 sched_scan_req = rcu_dereference(local->sched_scan_req);
> 
>                 if (scan_req)
>                         scan_req_flags = scan_req->flags;
>                 ...
>         }
> 
> So scan_req is probably supposed to be protected by RCU.
> 
> Also, in ieee80211_local's definition at net/mac80211/ieee80211_i.h, we have:
>         struct cfg80211_scan_request __rcu *scan_req;
> 
> Thus, scan_req is indeed supposed to be protected by RCU, which this patch
> addresses by adding a RCU head to the type's struct, and using kfree_rcu().
> 
> The above snippet is where the UAF happens (you can refer to syzkaller's log),
> because __cfg80211_scan_done() is called and frees the pointer.

Similarly to Greg, I'm not very familiar with the code base but one
sure way to move things forward would be to point out a commit which
broke things and put it in a Fixes tag. Much easier to validate a fix
by looking at where things went wrong.

^ permalink raw reply

* Re: [PATCH v5 3/5] Bluetooth: Add support for hci devcoredump
From: kernel test robot @ 2022-08-12 19:19 UTC (permalink / raw)
  To: Manish Mandlik, Arend van Spriel, Greg Kroah-Hartman, marcel,
	luiz.dentz
  Cc: llvm, kbuild-all, Johannes Berg, Dan Williams, Jason Gunthorpe,
	Signed-off-by : Manish Mandlik, linux-bluetooth, Thomas Gleixner,
	Rafael J . Wysocki, chromeos-bluetooth-upstreaming, Won Chung,
	Abhishek Pandit-Subedi, Eric Dumazet, Jakub Kicinski,
	Johan Hedberg, Paolo Abeni, linux-kernel, netdev
In-Reply-To: <20220810085753.v5.3.Iaf638bb9f885f5880ab1b4e7ae2f73dd53a54661@changeid>

Hi Manish,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on bluetooth/master]
[also build test WARNING on bluetooth-next/master driver-core/driver-core-testing linus/master next-20220812]
[cannot apply to v5.19]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Manish-Mandlik/sysfs-Add-attribute-info-for-sys-devices-coredump_disabled/20220811-000313
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth.git master
config: x86_64-randconfig-a003 (https://download.01.org/0day-ci/archive/20220813/202208130346.2UmF05bO-lkp@intel.com/config)
compiler: clang version 16.0.0 (https://github.com/llvm/llvm-project 5f1c7e2cc5a3c07cbc2412e851a7283c1841f520)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/6fe2192077ebdca91aef91e907f79d9e38960a21
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Manish-Mandlik/sysfs-Add-attribute-info-for-sys-devices-coredump_disabled/20220811-000313
        git checkout 6fe2192077ebdca91aef91e907f79d9e38960a21
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash net/bluetooth/

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> net/bluetooth/coredump.c:301:20: warning: format specifies type 'unsigned int' but the argument has type 'size_t' (aka 'unsigned long') [-Wformat]
                                       dump_size, hdev->dump.alloc_size);
                                                  ^~~~~~~~~~~~~~~~~~~~~
   include/net/bluetooth/bluetooth.h:255:43: note: expanded from macro 'bt_dev_info'
           BT_INFO("%s: " fmt, bt_dev_name(hdev), ##__VA_ARGS__)
                          ~~~                       ^~~~~~~~~~~
   include/net/bluetooth/bluetooth.h:242:47: note: expanded from macro 'BT_INFO'
   #define BT_INFO(fmt, ...)       bt_info(fmt "\n", ##__VA_ARGS__)
                                           ~~~         ^~~~~~~~~~~
   net/bluetooth/coredump.c:320:20: warning: format specifies type 'unsigned int' but the argument has type 'size_t' (aka 'unsigned long') [-Wformat]
                                       dump_size, hdev->dump.alloc_size);
                                                  ^~~~~~~~~~~~~~~~~~~~~
   include/net/bluetooth/bluetooth.h:255:43: note: expanded from macro 'bt_dev_info'
           BT_INFO("%s: " fmt, bt_dev_name(hdev), ##__VA_ARGS__)
                          ~~~                       ^~~~~~~~~~~
   include/net/bluetooth/bluetooth.h:242:47: note: expanded from macro 'BT_INFO'
   #define BT_INFO(fmt, ...)       bt_info(fmt "\n", ##__VA_ARGS__)
                                           ~~~         ^~~~~~~~~~~
   net/bluetooth/coredump.c:365:18: warning: format specifies type 'unsigned int' but the argument has type 'size_t' (aka 'unsigned long') [-Wformat]
                       dump_size, hdev->dump.alloc_size);
                                  ^~~~~~~~~~~~~~~~~~~~~
   include/net/bluetooth/bluetooth.h:255:43: note: expanded from macro 'bt_dev_info'
           BT_INFO("%s: " fmt, bt_dev_name(hdev), ##__VA_ARGS__)
                          ~~~                       ^~~~~~~~~~~
   include/net/bluetooth/bluetooth.h:242:47: note: expanded from macro 'BT_INFO'
   #define BT_INFO(fmt, ...)       bt_info(fmt "\n", ##__VA_ARGS__)
                                           ~~~         ^~~~~~~~~~~
   3 warnings generated.


vim +301 net/bluetooth/coredump.c

   182	
   183	/* Bluetooth devcoredump state machine.
   184	 *
   185	 * Devcoredump states:
   186	 *
   187	 *      HCI_DEVCOREDUMP_IDLE: The default state.
   188	 *
   189	 *      HCI_DEVCOREDUMP_ACTIVE: A devcoredump will be in this state once it has
   190	 *              been initialized using hci_devcoredump_init(). Once active, the
   191	 *              driver can append data using hci_devcoredump_append() or insert
   192	 *              a pattern using hci_devcoredump_append_pattern().
   193	 *
   194	 *      HCI_DEVCOREDUMP_DONE: Once the dump collection is complete, the drive
   195	 *              can signal the completion using hci_devcoredump_complete(). A
   196	 *              devcoredump is generated indicating the completion event and
   197	 *              then the state machine is reset to the default state.
   198	 *
   199	 *      HCI_DEVCOREDUMP_ABORT: The driver can cancel ongoing dump collection in
   200	 *              case of any error using hci_devcoredump_abort(). A devcoredump
   201	 *              is still generated with the available data indicating the abort
   202	 *              event and then the state machine is reset to the default state.
   203	 *
   204	 *      HCI_DEVCOREDUMP_TIMEOUT: A timeout timer for HCI_DEVCOREDUMP_TIMEOUT sec
   205	 *              is started during devcoredump initialization. Once the timeout
   206	 *              occurs, the driver is notified, a devcoredump is generated with
   207	 *              the available data indicating the timeout event and then the
   208	 *              state machine is reset to the default state.
   209	 *
   210	 * The driver must register using hci_devcoredump_register() before using the
   211	 * hci devcoredump APIs.
   212	 */
   213	void hci_devcoredump_rx(struct work_struct *work)
   214	{
   215		struct hci_dev *hdev = container_of(work, struct hci_dev, dump.dump_rx);
   216		struct sk_buff *skb;
   217		struct hci_devcoredump_skb_pattern *pattern;
   218		u32 dump_size;
   219		int start_state;
   220	
   221	#define DBG_UNEXPECTED_STATE() \
   222			bt_dev_dbg(hdev, \
   223				   "Unexpected packet (%d) for state (%d). ", \
   224				   hci_dmp_cb(skb)->pkt_type, hdev->dump.state)
   225	
   226		while ((skb = skb_dequeue(&hdev->dump.dump_q))) {
   227			hci_dev_lock(hdev);
   228			start_state = hdev->dump.state;
   229	
   230			switch (hci_dmp_cb(skb)->pkt_type) {
   231			case HCI_DEVCOREDUMP_PKT_INIT:
   232				if (hdev->dump.state != HCI_DEVCOREDUMP_IDLE) {
   233					DBG_UNEXPECTED_STATE();
   234					goto loop_continue;
   235				}
   236	
   237				if (skb->len != sizeof(dump_size)) {
   238					bt_dev_dbg(hdev, "Invalid dump init pkt");
   239					goto loop_continue;
   240				}
   241	
   242				dump_size = *((u32 *)skb->data);
   243				if (!dump_size) {
   244					bt_dev_err(hdev, "Zero size dump init pkt");
   245					goto loop_continue;
   246				}
   247	
   248				if (hci_devcoredump_prepare(hdev, dump_size)) {
   249					bt_dev_err(hdev, "Failed to prepare for dump");
   250					goto loop_continue;
   251				}
   252	
   253				hci_devcoredump_update_state(hdev,
   254							     HCI_DEVCOREDUMP_ACTIVE);
   255				queue_delayed_work(hdev->workqueue,
   256						   &hdev->dump.dump_timeout,
   257						   DEVCOREDUMP_TIMEOUT);
   258				break;
   259	
   260			case HCI_DEVCOREDUMP_PKT_SKB:
   261				if (hdev->dump.state != HCI_DEVCOREDUMP_ACTIVE) {
   262					DBG_UNEXPECTED_STATE();
   263					goto loop_continue;
   264				}
   265	
   266				if (!hci_devcoredump_copy(hdev, skb->data, skb->len))
   267					bt_dev_dbg(hdev, "Failed to insert skb");
   268				break;
   269	
   270			case HCI_DEVCOREDUMP_PKT_PATTERN:
   271				if (hdev->dump.state != HCI_DEVCOREDUMP_ACTIVE) {
   272					DBG_UNEXPECTED_STATE();
   273					goto loop_continue;
   274				}
   275	
   276				if (skb->len != sizeof(*pattern)) {
   277					bt_dev_dbg(hdev, "Invalid pattern skb");
   278					goto loop_continue;
   279				}
   280	
   281				pattern = (void *)skb->data;
   282	
   283				if (!hci_devcoredump_memset(hdev, pattern->pattern,
   284							    pattern->len))
   285					bt_dev_dbg(hdev, "Failed to set pattern");
   286				break;
   287	
   288			case HCI_DEVCOREDUMP_PKT_COMPLETE:
   289				if (hdev->dump.state != HCI_DEVCOREDUMP_ACTIVE) {
   290					DBG_UNEXPECTED_STATE();
   291					goto loop_continue;
   292				}
   293	
   294				hci_devcoredump_update_state(hdev,
   295							     HCI_DEVCOREDUMP_DONE);
   296				dump_size = hdev->dump.tail - hdev->dump.head;
   297	
   298				bt_dev_info(hdev,
   299					    "Devcoredump complete with size %u "
   300					    "(expect %u)",
 > 301					    dump_size, hdev->dump.alloc_size);
   302	
   303				dev_coredumpv(&hdev->dev, hdev->dump.head, dump_size,
   304					      GFP_KERNEL);
   305				break;
   306	
   307			case HCI_DEVCOREDUMP_PKT_ABORT:
   308				if (hdev->dump.state != HCI_DEVCOREDUMP_ACTIVE) {
   309					DBG_UNEXPECTED_STATE();
   310					goto loop_continue;
   311				}
   312	
   313				hci_devcoredump_update_state(hdev,
   314							     HCI_DEVCOREDUMP_ABORT);
   315				dump_size = hdev->dump.tail - hdev->dump.head;
   316	
   317				bt_dev_info(hdev,
   318					    "Devcoredump aborted with size %u "
   319					    "(expect %u)",
   320					    dump_size, hdev->dump.alloc_size);
   321	
   322				/* Emit a devcoredump with the available data */
   323				dev_coredumpv(&hdev->dev, hdev->dump.head, dump_size,
   324					      GFP_KERNEL);
   325				break;
   326	
   327			default:
   328				bt_dev_dbg(hdev,
   329					   "Unknown packet (%d) for state (%d). ",
   330					   hci_dmp_cb(skb)->pkt_type, hdev->dump.state);
   331				break;
   332			}
   333	
   334	loop_continue:
   335			kfree_skb(skb);
   336			hci_dev_unlock(hdev);
   337	
   338			if (start_state != hdev->dump.state)
   339				hci_devcoredump_notify(hdev, hdev->dump.state);
   340	
   341			hci_dev_lock(hdev);
   342			if (hdev->dump.state == HCI_DEVCOREDUMP_DONE ||
   343			    hdev->dump.state == HCI_DEVCOREDUMP_ABORT)
   344				hci_devcoredump_reset(hdev);
   345			hci_dev_unlock(hdev);
   346		}
   347	}
   348	EXPORT_SYMBOL(hci_devcoredump_rx);
   349	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply

* Re: [PATCH net-next v2 0/2] sfc: Add EF100 BAR config support
From: Jakub Kicinski @ 2022-08-12 19:18 UTC (permalink / raw)
  To: Martin Habets
  Cc: Bjorn Helgaas, davem, Paolo Abeni, Eric Dumazet, netdev,
	ecree.xilinx, linux-pci, virtualization, mst
In-Reply-To: <YvYfmw44gpuqexYz@gmail.com>

On Fri, 12 Aug 2022 10:38:35 +0100 Martin Habets wrote:
> FYI, during my holiday my colleagues found a way to use the vdpa tool for this.
> That means we should not need this series, at least for vDPA.
> So we can drop this series.

🎉 small victories :)

^ permalink raw reply

* Re: [PATCH net 1/3] netfilter: nf_conntrack_tcp: re-init for syn packets only
From: Jakub Kicinski @ 2022-08-12 19:17 UTC (permalink / raw)
  To: Neal Cardwell, stable
  Cc: patchwork-bot+netdevbpf, Pablo Neira Ayuso, netfilter-devel,
	davem, netdev, Yuchung Cheng, Eric Dumazet
In-Reply-To: <CADVnQykD5NRcjmrbP9bgNaVuhpOaSiC1dxCOF03bL5nTo2HP7g@mail.gmail.com>

On Fri, 12 Aug 2022 09:34:14 -0400 Neal Cardwell wrote:
> This first commit is an important bug fix for a serious bug that causes
> TCP connection hangs for users of TCP fast open and nf_conntrack:
> 
>   c7aab4f17021b netfilter: nf_conntrack_tcp: re-init for syn packets only
> 
> We are continuing to get reports about the bug that this commit fixes.
> 
> It seems this fix was only backported to v5.17 stable release, and not further,
> due to a cherry-pick conflict, because this fix implicitly depends on a
> slightly earlier v5.17 fix in the same spot:
> 
>   82b72cb94666 netfilter: conntrack: re-init state for retransmitted syn-ack
> 
> I manually verified that the fix c7aab4f17021b can be cleanly cherry-picked
> into the oldest (v4.9.325) and newest (v5.15.60) longterm release kernels as
> long as we first cherry-pick that related fix that it implicitly depends on:
> 
> 82b72cb94666b3dbd7152bb9f441b068af7a921b
> netfilter: conntrack: re-init state for retransmitted syn-ack
> 
> c7aab4f17021b636a0ee75bcf28e06fb7c94ab48
> netfilter: nf_conntrack_tcp: re-init for syn packets only
> 
> So would it be possible to backport both of those fixes with the following
> cherry-picks, to all LTS stable releases?
> 
> git cherry-pick 82b72cb94666b3dbd7152bb9f441b068af7a921b
> git cherry-pick c7aab4f17021b636a0ee75bcf28e06fb7c94ab48

Thanks a lot Neal! FWIW we have recently changed our process and no
longer handle stable submissions ourselves, so in the future feel free
to talk directly to stable@ (and add CC: stable@ tags to patches).

I'm adding stable@, let's see if Greg & team can pick things up based
on your instructions :)

^ permalink raw reply

* Re: [net] 03d56978dd: BUG:Bad_page_map_in_process
From: Joanne Koong @ 2022-08-12 19:01 UTC (permalink / raw)
  To: Yujie Liu
  Cc: 0day robot, LKML, netdev, dccp, lkp, Paolo Abeni, Eric Dumazet,
	Jakub Kicinski, Martin KaFai Lau, David Miller, kernel test robot
In-Reply-To: <CAJnrk1Y4WOHohQFp_+_OUuAQfKS7BewgmKp4V+MF3EMGTxcR=w@mail.gmail.com>

On Tue, Aug 9, 2022 at 9:52 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Fri, Aug 5, 2022 at 12:30 AM Yujie Liu <yujie.liu@intel.com> wrote:
> >
> > Hi Joanne,
> >
> > On 7/28/2022 07:41, Joanne Koong wrote:
> > > On Sun, Jul 24, 2022 at 7:05 AM kernel test robot <oliver.sang@intel.com> wrote:
> > >>
> > >>
> > >>
> > >> Greeting,
> > >>
> > >> FYI, we noticed the following commit (built with gcc-11):
> > >>
> > >> commit: 03d56978dd246147e151916e4dc72af7bc24d5c9 ("[PATCH net-next v3 1/3] net: Add a bhash2 table hashed by port + address")
> > >> url: https://github.com/intel-lab-lkp/linux/commits/Joanne-Koong/Add-a-second-bind-table-hashed-by-port-address/20220723-035903
> > >> base: https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git 949d6b405e6160ae44baea39192d67b39cb7eeac
> > >> patch link: https://lore.kernel.org/netdev/20220722195406.1304948-2-joannelkoong@gmail.com
> > >>
> > >> in testcase: boot
> > >>
> > >> on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> > >>
> > >> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
> > >>
> > >>
> > >>
> > >> If you fix the issue, kindly add following tag
> > >> Reported-by: kernel test robot <oliver.sang@intel.com>
> > >>
> > >>
> > >> [  103.871133][  T486] BUG: Bad page map in process rsync  pte:ffff92f93b759508 pmd:13fc1e067
> > >> [  103.873143][  T486] addr:00007f9fe52a2000 vm_flags:00000075 anon_vma:0000000000000000 mapping:ffff92f928adcb58 index:1a1
> > >> [  103.875128][  T486] file:libcrypto.so.1.1 fault:filemap_fault mmap:generic_file_mmap read_folio:simple_read_folio
> > >> [  103.877339][  T486] CPU: 0 PID: 486 Comm: rsync Not tainted 5.19.0-rc7-01443-g03d56978dd24 #1
> > >> [  103.879032][  T486] Call Trace:
> > >> [  103.879742][  T486]  <TASK>
> > >> [  103.880329][  T486]  ? simple_write_end+0x140/0x140
> > >> [  103.881338][  T486]  dump_stack_lvl+0x3b/0x53
> > >> [  103.882274][  T486]  ? __filemap_get_folio+0x780/0x780
> > >> [  103.883270][  T486]  print_bad_pte.cold+0x15b/0x1c5
> > >> [  103.884202][  T486]  vm_normal_page+0x65/0x140
> > >> [  103.885062][  T486]  zap_pte_range+0x23b/0x9c0
> > >> [  103.885897][  T486]  unmap_page_range+0x263/0x5c0
> > >> [  103.886846][  T486]  unmap_vmas+0x121/0x200
> > >> [  103.887628][  T486]  exit_mmap+0xb5/0x240
> > >> [  103.888401][  T486]  mmput+0x3b/0x140
> > >> [  103.889134][  T486]  exit_mm+0xff/0x180
> > >> [  103.889877][  T486]  do_exit+0x100/0x400
> > >> [  103.890661][  T486]  do_group_exit+0x3e/0x100
> > >> [  103.891514][  T486]  __x64_sys_exit_group+0x18/0x40
> > >> [  103.892494][  T486]  do_syscall_64+0x5d/0x80
> > >> [  103.893294][  T486]  ? do_user_addr_fault+0x257/0x6c0
> > >> [  103.894238][  T486]  ? lock_release+0x6e/0x100
> > >> [  103.895171][  T486]  ? up_read+0x12/0x40
> > >> [  103.896036][  T486]  ? exc_page_fault+0xb2/0x2c0
> > >> [  103.897021][  T486]  entry_SYSCALL_64_after_hwframe+0x5d/0xc7
> > >> [  103.898243][  T486] RIP: 0033:0x7f9fe5007699
> > >> [  103.899149][  T486] Code: Unable to access opcode bytes at RIP 0x7f9fe500766f.
> > >> [  103.900511][  T486] RSP: 002b:00007fff7e32c3a8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
> > >> [  103.902027][  T486] RAX: ffffffffffffffda RBX: 00007f9fe50fc610 RCX: 00007f9fe5007699
> > >> [  103.903477][  T486] RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000000
> > >> [  103.904943][  T486] RBP: 0000000000000000 R08: ffffffffffffff80 R09: 0000000000000001
> > >> [  103.906384][  T486] R10: 000000000000000b R11: 0000000000000246 R12: 00007f9fe50fc610
> > >> [  103.907823][  T486] R13: 0000000000000001 R14: 00007f9fe50fcae8 R15: 0000000000000000
> > >> [  103.909290][  T486]  </TASK>
> > >> [  103.910423][  T486] Disabling lock debugging due to kernel taint
> > >> [  107.503093][  T508] BUG: Bad page map in process rsync  pte:ffff92f93b7fe508 pmd:13aa1c067
> > >> [  107.504948][  T508] addr:00007fced9aa2000 vm_flags:00000075 anon_vma:0000000000000000 mapping:ffff92f92891ab58 index:9a
> > >> [  107.507070][  T508] file:libzstd.so.1.4.8 fault:filemap_fault mmap:generic_file_mmap read_folio:simple_read_folio
> > >> [  107.508825][  T508] CPU: 0 PID: 508 Comm: rsync Tainted: G    B             5.19.0-rc7-01443-g03d56978dd24 #1
> > >> [  107.510762][  T508] Call Trace:
> > >> [  107.511458][  T508]  <TASK>
> > >> [  107.512058][  T508]  ? simple_write_end+0x140/0x140
> > >> [  107.513072][  T508]  dump_stack_lvl+0x3b/0x53
> > >> [  107.513990][  T508]  ? __filemap_get_folio+0x780/0x780
> > >> [  107.519166][  T508]  print_bad_pte.cold+0x15b/0x1c5
> > >> [  107.520032][  T508]  vm_normal_page+0x65/0x140
> > >> [  107.520802][  T508]  zap_pte_range+0x23b/0x9c0
> > >> [  107.521548][  T508]  unmap_page_range+0x263/0x5c0
> > >> [  107.522355][  T508]  unmap_vmas+0x121/0x200
> > >> [  107.523247][  T508]  exit_mmap+0xb5/0x240
> > >> [  107.524107][  T508]  mmput+0x3b/0x140
> > >> [  107.524908][  T508]  exit_mm+0xff/0x180
> > >> [  107.525716][  T508]  do_exit+0x100/0x400
> > >> [  107.526613][  T508]  do_group_exit+0x3e/0x100
> > >> [  107.527541][  T508]  __x64_sys_exit_group+0x18/0x40
> > >> [  107.528450][  T508]  do_syscall_64+0x5d/0x80
> > >> [  107.529368][  T508]  ? up_read+0x12/0x40
> > >> [  107.530228][  T508]  ? do_user_addr_fault+0x257/0x6c0
> > >> [  107.531121][  T508]  ? rcu_read_lock_sched_held+0x5/0x40
> > >> [  107.532046][  T508]  ? exc_page_fault+0xb2/0x2c0
> > >> [  107.532843][  T508]  entry_SYSCALL_64_after_hwframe+0x5d/0xc7
> > >> [  107.533866][  T508] RIP: 0033:0x7fced95ff699
> > >> [  107.534781][  T508] Code: Unable to access opcode bytes at RIP 0x7fced95ff66f.
> > >> [  107.536225][  T508] RSP: 002b:00007fff162474c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
> > >> [  107.537871][  T508] RAX: ffffffffffffffda RBX: 00007fced96f4610 RCX: 00007fced95ff699
> > >> [  107.539506][  T508] RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000000
> > >> [  107.541126][  T508] RBP: 0000000000000000 R08: ffffffffffffff80 R09: 0000000000000001
> > >> [  107.542743][  T508] R10: 000000000000000b R11: 0000000000000246 R12: 00007fced96f4610
> > >> [  107.544310][  T508] R13: 0000000000000001 R14: 00007fced96f4ae8 R15: 0000000000000000
> > >> [  107.545881][  T508]  </TASK>
> > >>
> > >>
> > >>
> > >> To reproduce:
> > >>
> > >>          # build kernel
> > >>          cd linux
> > >>          cp config-5.19.0-rc7-01443-g03d56978dd24 .config
> > >>          make HOSTCC=gcc-11 CC=gcc-11 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage modules
> > >>          make HOSTCC=gcc-11 CC=gcc-11 ARCH=x86_64 INSTALL_MOD_PATH=<mod-install-dir> modules_install
> > >>          cd <mod-install-dir>
> > >>          find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz
> > >>
> > >>
> > >>          git clone https://github.com/intel/lkp-tests.git
> > >>          cd lkp-tests
> > >>          bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email
> > >>
> > >>          # if come across any failure that blocks the test,
> > >>          # please remove ~/.lkp and /lkp dir to run from a clean state.
> > >>
> > > I ran this in a loop ~20 times but I'm not able to repro the crash.
> > > This is a snippet of what I see (and I can also attach or paste the
> > > entire log if that would be helpful):
> > >
> > > I examined more closely the changes between v2 and v3 and I don't see
> > > anything that would lead to this error either (I'm assuming  v2 is
> > > okay because this report wasn't generated for it). Looking at the
> > > stack trace too, I'm not seeing anything that sticks out (eg this
> > > looks like a memory mapping failure and bhash2 didn't modify mapping
> > > or paging code).
> >
> > We chose commit 949d6b405e61 (net: add missing includes and forward
> > declarations under net/) as base, which used to be the head of
> > net-next/master branch then, and apply your v3 patches on top of it.
> > So the test result is a comparison between 949d6b405e61 and v3.
> >
> > Refer to the bug info:
> >
> > [  103.871133][  T486] BUG: Bad page map in process rsync  pte:ffff92f93b759508 pmd:13fc1e067
> >
> > The BUG happens in rsync, and it reminds me that we have some extra
> > steps when running the test in our infrastructure. We will use some
> > commands such as `wget` and `rsync` to transfer the test result to
> > our server, but these steps are not included when reproducing locally.
> >
> > Then I come up with an idea that maybe the kernel can boot successfully,
> > but the v3 patch may have some impacts on the command involving network
> > operations.
> >
> > Could you please help to apply below hack on the latest version of
> > lkp-tests, and retry to see if can reproduce the crash? It is just
> > a meaningless `wget` command to involve network in local test and align
> > with the steps in our testing environment.
>
> I will try to repro this this week. I'll let you know what I find.

I applied the wget change you suggested and was able to reproduce the crash.

This is happening because in the case where there is a connect() call
on address 0 on an unbound socket, the socket gets added to the bind
bucket twice. The first happens in inet_bhash2_update_saddr() and the
second happens when __inet_hash_connect() calls inet_bind_hash(). The
fix is to update the bhash2 table only if the socket is already bound.

I will submit v4 with this fix added. There is already a selftest
("sk_connect_zero_addr") in the 3rd patch that simulates this case but
it doesn't trigger the bad page table entry state when unmapping.

Thanks for reporting.

>
> >
> > diff --git a/lib/upload.sh b/lib/upload.sh
> > index 257b498db..e8801736e 100755
> > --- a/lib/upload.sh
> > +++ b/lib/upload.sh
> > @@ -181,7 +181,8 @@ upload_files()
> >                  fi
> >          else
> >                  # 9pfs, copy directly
> > -               upload_files_copy "$@"
> > +               wget 127.0.0.1
> >                  return
> >          fi
> >   }
> >
> > After applying above hack, I've tried to run 20 times on base and v3 patch
> > respectively. All runs of base are good, but there are 8 crash runs of v3.
> >
> > Reproducing steps:
> >
> >         cd linux
> >         git remote add net-next https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git
> >         git fetch net-next master
> >         git checkout 949d6b405e61 # checkout to base
> >         git am <v3.patch>
> >
> >         cp config-5.19.0-rc7-01443-g03d56978dd24 .config # config file is attached
> >         make ARCH=x86_64 olddefconfig prepare modules_prepare bzImage modules
> >         mkdir <mod-install-dir>
> >         make ARCH=x86_64 INSTALL_MOD_PATH=<mod-install-dir> modules_install
> >         cd <mod-install-dir>
> >         find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz
> >
> >          git clone https://github.com/intel/lkp-tests.git
> >          cd lkp-tests
> >         # apply the hack mentioned above
> >          bin/lkp qemu -k <bzImage> -m <mod-install-dir>/modules.cgz job-script # job-script is attached in this email
> >
> > --
> > Best Regards,
> > Yujie
> >
> > >
> > > I don't think this bug report is related to the bhash2 changes. But
> > > please let me know if you disagree.
> > >
> > > Thanks,
> > > Joanne
> > >
> > >>
> > >>
> > >> --
> > >> 0-DAY CI Kernel Test Service
> > >> https://01.org/lkp
> > >>
> > >>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox