Netdev List
 help / color / mirror / Atom feed
* [PATCH iproute2-next v2] ip-link: put types on man page in alphabetic order
From: Stephen Hemminger @ 2022-04-20  3:11 UTC (permalink / raw)
  To: netdev; +Cc: Stephen Hemminger

Lets try and keep man pages using alpha order, it looks like
it started that way then drifted.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
 man/man8/ip-link.8.in | 175 +++++++++++++++++++++---------------------
 1 file changed, 89 insertions(+), 86 deletions(-)

diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
index ec3cc4297da5..fc214a10c4e7 100644
--- a/man/man8/ip-link.8.in
+++ b/man/man8/ip-link.8.in
@@ -209,43 +209,43 @@ ip-link \- network device configuration
 .ti -8
 .IR TYPE " := [ "
 .BR amt " | "
-.BR bridge " | "
+.BR bareudp " |"
 .BR bond " | "
+.BR bridge " | "
 .BR can " | "
 .BR dummy " | "
-.BR hsr " | "
-.BR ifb " | "
-.BR ipoib " |"
-.BR macvlan  " | "
-.BR macvtap  " | "
-.BR vcan " | "
-.BR vxcan " | "
-.BR veth " | "
-.BR vlan " | "
-.BR vxlan " |"
-.BR ip6tnl " |"
-.BR ipip " |"
-.BR sit " |"
+.BR erspan " |"
+.BR geneve " |"
 .BR gre " |"
 .BR gretap " |"
-.BR erspan " |"
+.BR gtp " |"
+.BR hsr " | "
+.BR ifb " | "
+.BR ip6erspan " |"
 .BR ip6gre " |"
 .BR ip6gretap " |"
-.BR ip6erspan " |"
-.BR vti " |"
-.BR nlmon " |"
+.BR ip6tnl " |"
+.BR ipip " |"
+.BR ipoib " |"
 .BR ipvlan " |"
 .BR ipvtap " |"
 .BR lowpan " |"
-.BR geneve " |"
-.BR bareudp " |"
-.BR vrf " |"
 .BR macsec " |"
+.BR macvlan  " | "
+.BR macvtap  " | "
 .BR netdevsim " |"
+.BR nlmon " |"
 .BR rmnet " |"
-.BR xfrm " |"
-.BR gtp " |"
-.BR virt_wifi " ]"
+.BR sit " |"
+.BR vcan " | "
+.BR veth " | "
+.BR virt_wifi " |"
+.BR vlan " | "
+.BR vrf " |"
+.BR vti " |"
+.BR vxcan " | "
+.BR vxlan " |"
+.BR xfrm " ]"
 
 .ti -8
 .IR ETYPE " := [ " TYPE " |"
@@ -290,44 +290,52 @@ specifies the type of the new device.
 Link types:
 
 .in +8
-.B bridge
-- Ethernet Bridge device
+.BR amt
+- Automatic Multicast Tunneling (AMT)
+.sp
+.BR bareudp
+- Bare UDP L3 encapsulation support
 .sp
 .B bond
 - Bonding device
+.B bridge
+- Ethernet Bridge device
+.sp
+.B can
+- Controller Area Network
 .sp
 .B dummy
 - Dummy network interface
 .sp
-.B hsr
-- High-availability Seamless Redundancy device
+.BR erspan
+- Encapsulated Remote SPAN over GRE and IPv4
 .sp
-.B ifb
-- Intermediate Functional Block device
+.B geneve
+- GEneric NEtwork Virtualization Encapsulation
 .sp
-.B ipoib
-- IP over Infiniband device
+.B gre
+- Virtual tunnel interface GRE over IPv4
 .sp
-.B macvlan
-- Virtual interface base on link layer address (MAC)
+.BR gretap
+- Virtual L2 tunnel interface GRE over IPv4
 .sp
-.B macvtap
-- Virtual interface based on link layer address (MAC) and TAP.
+.BR gtp
+- GPRS Tunneling Protocol
 .sp
-.B vcan
-- Virtual Controller Area Network interface
+.B hsr
+- High-availability Seamless Redundancy device
 .sp
-.B vxcan
-- Virtual Controller Area Network tunnel interface
+.B ifb
+- Intermediate Functional Block device
 .sp
-.B veth
-- Virtual ethernet interface
+.BR ip6erspan
+- Encapsulated Remote SPAN over GRE and IPv6
 .sp
-.BR vlan
-- 802.1q tagged virtual LAN interface
+.BR ip6gre
+- Virtual tunnel interface GRE over IPv6
 .sp
-.BR vxlan
-- Virtual eXtended LAN
+.BR ip6gretap
+- Virtual L2 tunnel interface GRE over IPv6
 .sp
 .BR ip6tnl
 - Virtual tunnel interface IPv4|IPv6 over IPv6
@@ -335,71 +343,66 @@ Link types:
 .BR ipip
 - Virtual tunnel interface IPv4 over IPv4
 .sp
-.BR sit
-- Virtual tunnel interface IPv6 over IPv4
+.B ipoib
+- IP over Infiniband device
 .sp
-.BR gre
-- Virtual tunnel interface GRE over IPv4
+.BR ipvlan
+- Interface for L3 (IPv6/IPv4) based VLANs
 .sp
-.BR gretap
-- Virtual L2 tunnel interface GRE over IPv4
+.BR ipvtap
+- Interface for L3 (IPv6/IPv4) based VLANs and TAP
 .sp
-.BR erspan
-- Encapsulated Remote SPAN over GRE and IPv4
+.BR lowpan
+- Interface for 6LoWPAN (IPv6) over IEEE 802.15.4 / Bluetooth
 .sp
-.BR ip6gre
-- Virtual tunnel interface GRE over IPv6
+.BR macsec
+- Interface for IEEE 802.1AE MAC Security (MACsec)
 .sp
-.BR ip6gretap
-- Virtual L2 tunnel interface GRE over IPv6
+.B macvlan
+- Virtual interface base on link layer address (MAC)
 .sp
-.BR ip6erspan
-- Encapsulated Remote SPAN over GRE and IPv6
+.B macvtap
+- Virtual interface based on link layer address (MAC) and TAP.
 .sp
-.BR vti
-- Virtual tunnel interface
+.BR netdevsim
+- Interface for netdev API tests
 .sp
 .BR nlmon
 - Netlink monitoring device
 .sp
-.BR ipvlan
-- Interface for L3 (IPv6/IPv4) based VLANs
-.sp
-.BR ipvtap
-- Interface for L3 (IPv6/IPv4) based VLANs and TAP
+.BR rmnet
+- Qualcomm rmnet device
 .sp
-.BR lowpan
-- Interface for 6LoWPAN (IPv6) over IEEE 802.15.4 / Bluetooth
+.BR sit
+- Virtual tunnel interface IPv6 over IPv4
 .sp
-.BR geneve
-- GEneric NEtwork Virtualization Encapsulation
+.B vcan
+- Virtual Controller Area Network interface
 .sp
-.BR bareudp
-- Bare UDP L3 encapsulation support
+.B veth
+- Virtual ethernet interface
 .sp
-.BR amt
-- Automatic Multicast Tunneling (AMT)
+.BR virt_wifi
+- rtnetlink wifi simulation device
 .sp
-.BR macsec
-- Interface for IEEE 802.1AE MAC Security (MACsec)
+.BR vlan
+- 802.1q tagged virtual LAN interface
 .sp
 .BR vrf
 - Interface for L3 VRF domains
 .sp
-.BR netdevsim
-- Interface for netdev API tests
+.BR vti
+- Virtual tunnel interface
 .sp
-.BR rmnet
-- Qualcomm rmnet device
+.B vxcan
+- Virtual Controller Area Network tunnel interface
+.sp
+.BR vxlan
+- Virtual eXtended LAN
 .sp
 .BR xfrm
 - Virtual xfrm interface
 .sp
-.BR gtp
-- GPRS Tunneling Protocol
-.sp
-.BR virt_wifi
-- rtnetlink wifi simulation device
 .in -8
 
 .TP
-- 
2.35.1


^ permalink raw reply related

* [PATCH iproute2-next] ip-link: put types on man page in alphabetic order
From: Stephen Hemminger @ 2022-04-20  2:54 UTC (permalink / raw)
  To: netdev; +Cc: Stephen Hemminger

Lets try and keep man pages using alpha order, it looks like
it started that way then drifted.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
 man/man8/ip-link.8.in | 48 +++++++++++++++++++++----------------------
 1 file changed, 24 insertions(+), 24 deletions(-)

diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
index ec3cc4297da5..477e5ae24772 100644
--- a/man/man8/ip-link.8.in
+++ b/man/man8/ip-link.8.in
@@ -209,43 +209,43 @@ ip-link \- network device configuration
 .ti -8
 .IR TYPE " := [ "
 .BR amt " | "
-.BR bridge " | "
+.BR bareudp " |"
 .BR bond " | "
+.BR bridge " | "
 .BR can " | "
 .BR dummy " | "
-.BR hsr " | "
-.BR ifb " | "
-.BR ipoib " |"
-.BR macvlan  " | "
-.BR macvtap  " | "
-.BR vcan " | "
-.BR vxcan " | "
-.BR veth " | "
-.BR vlan " | "
-.BR vxlan " |"
-.BR ip6tnl " |"
-.BR ipip " |"
-.BR sit " |"
+.BR erspan " |"
+.BR geneve " |"
 .BR gre " |"
 .BR gretap " |"
-.BR erspan " |"
+.BR gtp " |"
+.BR hsr " | "
+.BR ifb " | "
+.BR ip6erspan " |"
 .BR ip6gre " |"
 .BR ip6gretap " |"
-.BR ip6erspan " |"
-.BR vti " |"
-.BR nlmon " |"
+.BR ip6tnl " |"
+.BR ipip " |"
+.BR ipoib " |"
 .BR ipvlan " |"
 .BR ipvtap " |"
 .BR lowpan " |"
-.BR geneve " |"
-.BR bareudp " |"
-.BR vrf " |"
 .BR macsec " |"
+.BR macvlan  " | "
+.BR macvtap  " | "
 .BR netdevsim " |"
+.BR nlmon " |"
 .BR rmnet " |"
-.BR xfrm " |"
-.BR gtp " |"
-.BR virt_wifi " ]"
+.BR sit " |"
+.BR vcan " | "
+.BR veth " | "
+.BR virt_wifi " |"
+.BR vlan " | "
+.BR vrf " |"
+.BR vti " |"
+.BR vxcan " | "
+.BR vxlan " |"
+.BR xfrm " ]"
 
 .ti -8
 .IR ETYPE " := [ " TYPE " |"
-- 
2.35.1


^ permalink raw reply related

* Re: [PATCH net 1/2] net/af_packet: adjust network header position for VLAN tagged packets
From: Jason Wang @ 2022-04-20  2:47 UTC (permalink / raw)
  To: Hangbin Liu
  Cc: Michael S. Tsirkin, Willem de Bruijn, Maxim Mikityanskiy,
	Mike Pattrick, netdev, Eric Dumazet, virtualization,
	Balazs Nemeth, Flavio Leitner, Jakub Kicinski, Paolo Abeni,
	David S . Miller
In-Reply-To: <Yl9bDrDFZhc04MiY@Laptop-X1>

On Wed, Apr 20, 2022 at 9:00 AM Hangbin Liu <liuhangbin@gmail.com> wrote:
>
> On Tue, Apr 19, 2022 at 10:26:09AM -0400, Michael S. Tsirkin wrote:
> > > > There are also some duplicated codes in these *_snd functions.
> > > > I think we can move them out to one single function.
> > >
> > > Please don't refactor this code. It will complicate future backports
> > > of stable fixes.
> >
> > Hmm I don't know offhand which duplication this refers to specifically
> > so maybe it's not worth addressing specifically but generally not
> > cleaning up code just because of backports seems wrong ...
>
> Yes, packet_snd() and tpacket_snd() share same addr/msg checking logic that
> I think we can clean up.
>
> > > > > stretching the definition of the flags to include VLAN is acceptable
> > > > > (unlike outright tunnels), but even then I would suggest for net-next.
> > > >
> > > > As I asked, I'm not familiar with virtio code. Do you think if I should
> > > > add a new VIRTIO_NET_HDR_GSO_VLAN flag? It's only a L2 flag without any L3
> > > > info. If I add something like VIRTIO_NET_HDR_GSO_VLAN_TCPV4/TCPV6/UDP. That
> > > > would add more combinations. Which doesn't like a good idea.
> > >
> > > I would prefer a new flag to denote this type, so that we can be
> > > strict and only change the datapath for packets that have this flag
> > > set (and thus express the intent).
> > >
> > > But the VIRTIO_NET_HDR types are defined in the virtio spec. The
> > > maintainers should probably chime in.
> >
> > Yes, it's a UAPI extension, not to be done lightly. In this case IIUC
> > gso_type in the header is only u8 - 8 bits and 5 of these are already
> > used.  So I don't think the virtio TC will be all that happy to burn up
> > a bit unless a clear benefit can be demonstrated.
> >
> > I agree with the net-next proposal, I think it's more a feature than a
> > bugfix. In particular I think a Fixes tag can also be dropped in that
> > IIUC GSO for vlan packets didn't work even before that commit - right?

It should work, we initialize vlan_features since ("4fda830263c5
virtio-net: initialize vlan_features").

What we don't support is vlan offload.

>
> Right. virtio_net_hdr GSO with vlan doesn't work before.

It doesn't work since we don't support that feature.

> I will post this to net-next.

If you want to do that, you need a spec patch as well.

Thanks

>
> Thanks
> Hangbin
>


^ permalink raw reply

* Re: [PATCH v4] ip/iplink_virt_wifi: add support for virt_wifi
From: patchwork-bot+netdevbpf @ 2022-04-20  2:40 UTC (permalink / raw)
  To: Baligh GASMI; +Cc: netdev, stephen, dsahern
In-Reply-To: <20220418232507.4047-1-gasmibal@gmail.com>

Hello:

This patch was applied to iproute2/iproute2-next.git (main)
by David Ahern <dsahern@kernel.org>:

On Tue, 19 Apr 2022 01:25:07 +0200 you wrote:
> Add support for creating virt_wifi devices type.
> 
> Syntax:
> $ ip link add link eth0 name wlan0 type virt_wifi
> 
> Signed-off-by: Baligh Gasmi <gasmibal@gmail.com>
> 
> [...]

Here is the summary with links:
  - [v4] ip/iplink_virt_wifi: add support for virt_wifi
    https://git.kernel.org/pub/scm/network/iproute2/iproute2-next.git/commit/?id=ee53174bd977

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* [PATCH net v3] tcp: ensure to use the most recently sent skb when filling the rate sample
From: Pengcheng Yang @ 2022-04-20  2:34 UTC (permalink / raw)
  To: Eric Dumazet, Neal Cardwell, netdev
  Cc: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
	Paolo Abeni, Pengcheng Yang

If an ACK (s)acks multiple skbs, we favor the information
from the most recently sent skb by choosing the skb with
the highest prior_delivered count. But in the interval
between receiving ACKs, we send multiple skbs with the same
prior_delivered, because the tp->delivered only changes
when we receive an ACK.

We used RACK's solution, copying tcp_rack_sent_after() as
tcp_skb_sent_after() helper to determine "which packet was
sent last?". Later, we will use tcp_skb_sent_after() instead
in RACK.

Fixes: b9f64820fb22 ("tcp: track data delivery rate for a TCP connection")
Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
---
 include/net/tcp.h   |  6 ++++++
 net/ipv4/tcp_rate.c | 11 ++++++++---
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6d50a66..fcd69fc 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1042,6 +1042,7 @@ struct rate_sample {
 	int  losses;		/* number of packets marked lost upon ACK */
 	u32  acked_sacked;	/* number of packets newly (S)ACKed upon ACK */
 	u32  prior_in_flight;	/* in flight before this ACK */
+	u32  last_end_seq;	/* end_seq of most recently ACKed packet */
 	bool is_app_limited;	/* is sample from packet with bubble in pipe? */
 	bool is_retrans;	/* is sample from retransmission? */
 	bool is_ack_delayed;	/* is this (likely) a delayed ACK? */
@@ -1158,6 +1159,11 @@ void tcp_rate_gen(struct sock *sk, u32 delivered, u32 lost,
 		  bool is_sack_reneg, struct rate_sample *rs);
 void tcp_rate_check_app_limited(struct sock *sk);
 
+static inline bool tcp_skb_sent_after(u64 t1, u64 t2, u32 seq1, u32 seq2)
+{
+	return t1 > t2 || (t1 == t2 && after(seq1, seq2));
+}
+
 /* These functions determine how the current flow behaves in respect of SACK
  * handling. SACK is negotiated with the peer, and therefore it can vary
  * between different flows.
diff --git a/net/ipv4/tcp_rate.c b/net/ipv4/tcp_rate.c
index 617b818..a8f6d9d 100644
--- a/net/ipv4/tcp_rate.c
+++ b/net/ipv4/tcp_rate.c
@@ -74,27 +74,32 @@ void tcp_rate_skb_sent(struct sock *sk, struct sk_buff *skb)
  *
  * If an ACK (s)acks multiple skbs (e.g., stretched-acks), this function is
  * called multiple times. We favor the information from the most recently
- * sent skb, i.e., the skb with the highest prior_delivered count.
+ * sent skb, i.e., the skb with the most recently sent time and the highest
+ * sequence.
  */
 void tcp_rate_skb_delivered(struct sock *sk, struct sk_buff *skb,
 			    struct rate_sample *rs)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct tcp_skb_cb *scb = TCP_SKB_CB(skb);
+	u64 tx_tstamp;
 
 	if (!scb->tx.delivered_mstamp)
 		return;
 
+	tx_tstamp = tcp_skb_timestamp_us(skb);
 	if (!rs->prior_delivered ||
-	    after(scb->tx.delivered, rs->prior_delivered)) {
+	    tcp_skb_sent_after(tx_tstamp, tp->first_tx_mstamp,
+			       scb->end_seq, rs->last_end_seq)) {
 		rs->prior_delivered_ce  = scb->tx.delivered_ce;
 		rs->prior_delivered  = scb->tx.delivered;
 		rs->prior_mstamp     = scb->tx.delivered_mstamp;
 		rs->is_app_limited   = scb->tx.is_app_limited;
 		rs->is_retrans	     = scb->sacked & TCPCB_RETRANS;
+		rs->last_end_seq     = scb->end_seq;
 
 		/* Record send time of most recently ACKed packet: */
-		tp->first_tx_mstamp  = tcp_skb_timestamp_us(skb);
+		tp->first_tx_mstamp  = tx_tstamp;
 		/* Find the duration of the "send phase" of this window: */
 		rs->interval_us = tcp_stamp_us_delta(tp->first_tx_mstamp,
 						     scb->tx.first_tx_mstamp);
-- 
1.8.3.1


^ permalink raw reply related

* [PATCH] wwan_hwsim: Avoid flush_scheduled_work() usage
From: Tetsuo Handa @ 2022-04-20  2:22 UTC (permalink / raw)
  To: Loic Poulain, Sergey Ryazanov, Johannes Berg, David S. Miller,
	Jakub Kicinski, Paolo Abeni
  Cc: Network Development

Flushing system-wide workqueues is dangerous and will be forbidden.
Replace system_wq with local wwan_wq.

Link: https://lkml.kernel.org/r/49925af7-78a8-a3dd-bce6-cfc02e1a9236@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
Note: This patch is only compile tested. By the way, don't you want to call
debugfs_remove(wwan_hwsim_debugfs_devcreate) at err_clean_devs label in
wwan_hwsim_init() like wwan_hwsim_exit() does, for debugfs_create_file("devcreate")
is called before "goto err_clean_devs" happens?

 drivers/net/wwan/wwan_hwsim.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/net/wwan/wwan_hwsim.c b/drivers/net/wwan/wwan_hwsim.c
index 5b62cf3b3c42..2136319f588f 100644
--- a/drivers/net/wwan/wwan_hwsim.c
+++ b/drivers/net/wwan/wwan_hwsim.c
@@ -33,6 +33,7 @@ static struct dentry *wwan_hwsim_debugfs_devcreate;
 static DEFINE_SPINLOCK(wwan_hwsim_devs_lock);
 static LIST_HEAD(wwan_hwsim_devs);
 static unsigned int wwan_hwsim_dev_idx;
+static struct workqueue_struct *wwan_wq;
 
 struct wwan_hwsim_dev {
 	struct list_head list;
@@ -371,7 +372,7 @@ static ssize_t wwan_hwsim_debugfs_portdestroy_write(struct file *file,
 	 * waiting this callback to finish in the debugfs_remove() call. So,
 	 * use workqueue.
 	 */
-	schedule_work(&port->del_work);
+	queue_work(wwan_wq, &port->del_work);
 
 	return count;
 }
@@ -416,7 +417,7 @@ static ssize_t wwan_hwsim_debugfs_devdestroy_write(struct file *file,
 	 * waiting this callback to finish in the debugfs_remove() call. So,
 	 * use workqueue.
 	 */
-	schedule_work(&dev->del_work);
+	queue_work(wwan_wq, &dev->del_work);
 
 	return count;
 }
@@ -506,9 +507,15 @@ static int __init wwan_hwsim_init(void)
 	if (wwan_hwsim_devsnum < 0 || wwan_hwsim_devsnum > 128)
 		return -EINVAL;
 
+	wwan_wq = alloc_workqueue("wwan_wq", 0, 0);
+	if (!wwan_wq)
+		return -ENOMEM;
+
 	wwan_hwsim_class = class_create(THIS_MODULE, "wwan_hwsim");
-	if (IS_ERR(wwan_hwsim_class))
+	if (IS_ERR(wwan_hwsim_class)) {
+		destroy_workqueue(wwan_wq);
 		return PTR_ERR(wwan_hwsim_class);
+	}
 
 	wwan_hwsim_debugfs_topdir = debugfs_create_dir("wwan_hwsim", NULL);
 	wwan_hwsim_debugfs_devcreate =
@@ -524,6 +531,7 @@ static int __init wwan_hwsim_init(void)
 
 err_clean_devs:
 	wwan_hwsim_free_devs();
+	destroy_workqueue(wwan_wq);
 	debugfs_remove(wwan_hwsim_debugfs_topdir);
 	class_destroy(wwan_hwsim_class);
 
@@ -534,7 +542,7 @@ static void __exit wwan_hwsim_exit(void)
 {
 	debugfs_remove(wwan_hwsim_debugfs_devcreate);	/* Avoid new devs */
 	wwan_hwsim_free_devs();
-	flush_scheduled_work();		/* Wait deletion works completion */
+	destroy_workqueue(wwan_wq);		/* Wait deletion works completion */
 	debugfs_remove(wwan_hwsim_debugfs_topdir);
 	class_destroy(wwan_hwsim_class);
 }
-- 
2.32.0

^ permalink raw reply related

* Re: [PATCH net-next v2 1/2] tcp: ensure to use the most recently sent skb when filling the rate sample
From: Neal Cardwell @ 2022-04-20  2:11 UTC (permalink / raw)
  To: Pengcheng Yang
  Cc: Paolo Abeni, Eric Dumazet, Yuchung Cheng, netdev, David S. Miller,
	Jakub Kicinski
In-Reply-To: <005c01d85458$bd0ef940$372cebc0$@wangsu.com>

 .

On Tue, Apr 19, 2022 at 9:48 PM Pengcheng Yang <yangpc@wangsu.com> wrote:
>
> On Tue, Apr 19, 2022 at 10:00 PM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > On Sun, 2022-04-17 at 14:51 -0400, Neal Cardwell wrote:
> > > On Sat, Apr 16, 2022 at 5:20 AM Pengcheng Yang <yangpc@wangsu.com> wrote:
> > > >
> > > > If an ACK (s)acks multiple skbs, we favor the information
> > > > from the most recently sent skb by choosing the skb with
> > > > the highest prior_delivered count. But in the interval
> > > > between receiving ACKs, we send multiple skbs with the same
> > > > prior_delivered, because the tp->delivered only changes
> > > > when we receive an ACK.
> > > >
> > > > We used RACK's solution, copying tcp_rack_sent_after() as
> > > > tcp_skb_sent_after() helper to determine "which packet was
> > > > sent last?". Later, we will use tcp_skb_sent_after() instead
> > > > in RACK.
> > > >
> > > > Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
> > > > Cc: Neal Cardwell <ncardwell@google.com>
> > > > ---
> > > >  include/net/tcp.h   |  6 ++++++
> > > >  net/ipv4/tcp_rate.c | 11 ++++++++---
> > > >  2 files changed, 14 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/include/net/tcp.h b/include/net/tcp.h
> > > > index 6d50a66..fcd69fc 100644
> > > > --- a/include/net/tcp.h
> > > > +++ b/include/net/tcp.h
> > > > @@ -1042,6 +1042,7 @@ struct rate_sample {
> > > >         int  losses;            /* number of packets marked lost upon ACK */
> > > >         u32  acked_sacked;      /* number of packets newly (S)ACKed upon ACK */
> > > >         u32  prior_in_flight;   /* in flight before this ACK */
> > > > +       u32  last_end_seq;      /* end_seq of most recently ACKed packet */
> > > >         bool is_app_limited;    /* is sample from packet with bubble in pipe? */
> > > >         bool is_retrans;        /* is sample from retransmission? */
> > > >         bool is_ack_delayed;    /* is this (likely) a delayed ACK? */
> > > > @@ -1158,6 +1159,11 @@ void tcp_rate_gen(struct sock *sk, u32 delivered, u32 lost,
> > > >                   bool is_sack_reneg, struct rate_sample *rs);
> > > >  void tcp_rate_check_app_limited(struct sock *sk);
> > > >
> > > > +static inline bool tcp_skb_sent_after(u64 t1, u64 t2, u32 seq1, u32 seq2)
> > > > +{
> > > > +       return t1 > t2 || (t1 == t2 && after(seq1, seq2));
> > > > +}
> > > > +
> > > >  /* These functions determine how the current flow behaves in respect of SACK
> > > >   * handling. SACK is negotiated with the peer, and therefore it can vary
> > > >   * between different flows.
> > > > diff --git a/net/ipv4/tcp_rate.c b/net/ipv4/tcp_rate.c
> > > > index 617b818..a8f6d9d 100644
> > > > --- a/net/ipv4/tcp_rate.c
> > > > +++ b/net/ipv4/tcp_rate.c
> > > > @@ -74,27 +74,32 @@ void tcp_rate_skb_sent(struct sock *sk, struct sk_buff *skb)
> > > >   *
> > > >   * If an ACK (s)acks multiple skbs (e.g., stretched-acks), this function is
> > > >   * called multiple times. We favor the information from the most recently
> > > > - * sent skb, i.e., the skb with the highest prior_delivered count.
> > > > + * sent skb, i.e., the skb with the most recently sent time and the highest
> > > > + * sequence.
> > > >   */
> > > >  void tcp_rate_skb_delivered(struct sock *sk, struct sk_buff *skb,
> > > >                             struct rate_sample *rs)
> > > >  {
> > > >         struct tcp_sock *tp = tcp_sk(sk);
> > > >         struct tcp_skb_cb *scb = TCP_SKB_CB(skb);
> > > > +       u64 tx_tstamp;
> > > >
> > > >         if (!scb->tx.delivered_mstamp)
> > > >                 return;
> > > >
> > > > +       tx_tstamp = tcp_skb_timestamp_us(skb);
> > > >         if (!rs->prior_delivered ||
> > > > -           after(scb->tx.delivered, rs->prior_delivered)) {
> > > > +           tcp_skb_sent_after(tx_tstamp, tp->first_tx_mstamp,
> > > > +                              scb->end_seq, rs->last_end_seq)) {
> > > >                 rs->prior_delivered_ce  = scb->tx.delivered_ce;
> > > >                 rs->prior_delivered  = scb->tx.delivered;
> > > >                 rs->prior_mstamp     = scb->tx.delivered_mstamp;
> > > >                 rs->is_app_limited   = scb->tx.is_app_limited;
> > > >                 rs->is_retrans       = scb->sacked & TCPCB_RETRANS;
> > > > +               rs->last_end_seq     = scb->end_seq;
> > > >
> > > >                 /* Record send time of most recently ACKed packet: */
> > > > -               tp->first_tx_mstamp  = tcp_skb_timestamp_us(skb);
> > > > +               tp->first_tx_mstamp  = tx_tstamp;
> > > >                 /* Find the duration of the "send phase" of this window: */
> > > >                 rs->interval_us = tcp_stamp_us_delta(tp->first_tx_mstamp,
> > > >                                                      scb->tx.first_tx_mstamp);
> > > > --
> > >
> > > Thanks for the patch! The change looks good to me, and it passes our
> > > team's packetdrill tests.
> > >
> > > One suggestion: currently this patch seems to be targeted to the
> > > net-next branch. However, since it's a bug fix my sense is that it
> > > would be best to target this to the net branch, so that it gets
> > > backported to stable releases.
> > >
> > > One complication is that the follow-on patch in this series ("tcp: use
> > > tcp_skb_sent_after() instead in RACK") is a pure re-factor/cleanup,
> > > which is more appropriate for net-next. So the plan I was trying to
> > > describe in the previous thread was that this series could be
> > > implemented as:
> > >
> > > (1) first, submit "tcp: ensure to use the most recently sent skb when
> > > filling the rate sample" to the net branch
> > > (2) wait for the fix in the net branch to be merged into the net-next branch
> > > (3) second, submit "tcp: use tcp_skb_sent_after() instead in RACK" to
> > > the net-next branch
> > >
> > > What do folks think?
> >
> > +1 for the above.
> >
> > @Pengcheng: please additionally provide a suitable 'fixes' tag for
> > patch 1/2.
>
> Fixes: b9f64820fb22 ("tcp: track data delivery rate for a TCP connection")

Thanks. That looks like the correct SHA1. However, I think there may
be a miscommunication. :-)

I think what Paolo and I are suggesting is:

(1) e-mail the patch "tcp: ensure to use the most recently sent skb
when filling the rate sample" as a submission to the net branch
("[PATCH net v3] tcp: ensure to use the most recently sent skb when
filling the rate sample"), with the "Fixes:" footer in the commit
message  in the line above your "Signed-off-by:" footer.

(2) wait for the fix in the net branch to be merged into the net-next branch

(3) submit "tcp: use tcp_skb_sent_after() instead in RACK" to the
net-next branch

thanks,
neal

^ permalink raw reply

* Re: [PATCH] octeon_ep: Remove unnecessary cast
From: Joe Perches @ 2022-04-20  2:07 UTC (permalink / raw)
  To: Haowen Bai, Veerasenareddy Burru, Abhijit Ayarekar,
	David S. Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, linux-kernel
In-Reply-To: <1650419232-7982-1-git-send-email-baihaowen@meizu.com>

On Wed, 2022-04-20 at 09:47 +0800, Haowen Bai wrote:
> Fix the following coccicheck warning:
> 
> ./drivers/net/ethernet/marvell/octeon_ep/octep_rx.c:161:18-40: WARNING:
> casting value returned by memory allocation function to (struct
> octep_rx_buffer *) is useless.
[]
> diff --git a/drivers/net/ethernet/marvell/octeon_ep/octep_rx.c b/drivers/net/ethernet/marvell/octeon_ep/octep_rx.c
[]
> @@ -158,8 +158,7 @@ static int octep_setup_oq(struct octep_device *oct, int q_no)
>  		goto desc_dma_alloc_err;
>  	}
>  
> -	oq->buff_info = (struct octep_rx_buffer *)
> -			vzalloc(oq->max_count * OCTEP_OQ_RECVBUF_SIZE);
> +	oq->buff_info = vzalloc(oq->max_count * OCTEP_OQ_RECVBUF_SIZE);
>  	if (unlikely(!oq->buff_info)) {
>  		dev_err(&oct->pdev->dev,
>  			"Failed to allocate buffer info for OQ-%d\n", q_no);

probably better to use kvcalloc or maybe vcalloc if oq->max_count is
always expected to be huge.

OCTEP_OQ_RECVBUF_SIZE is pretty small (just a pointer and a u64).



^ permalink raw reply

* [PATCH net-next 2/2] ipv6: Use ipv6_only_sock() helper in condition.
From: Kuniyuki Iwashima @ 2022-04-20  1:58 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, David Ahern
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
In-Reply-To: <20220420015851.50237-1-kuniyu@amazon.co.jp>

This patch replaces some sk_ipv6only tests with ipv6_only_sock().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
---
 drivers/net/bonding/bond_main.c                                | 2 +-
 drivers/net/ethernet/chelsio/inline_crypto/ch_ktls/chcr_ktls.c | 2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.c      | 2 +-
 drivers/net/ethernet/netronome/nfp/crypto/tls.c                | 2 +-
 net/core/filter.c                                              | 2 +-
 net/ipv6/af_inet6.c                                            | 2 +-
 6 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 15eddca7b4b6..00e6bc591f77 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -5221,7 +5221,7 @@ static void bond_sk_to_flow(struct sock *sk, struct flow_keys *flow)
 	switch (sk->sk_family) {
 #if IS_ENABLED(CONFIG_IPV6)
 	case AF_INET6:
-		if (sk->sk_ipv6only ||
+		if (ipv6_only_sock(sk) ||
 		    ipv6_addr_type(&sk->sk_v6_daddr) != IPV6_ADDR_MAPPED) {
 			flow->control.addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS;
 			flow->addrs.v6addrs.src = inet6_sk(sk)->saddr;
diff --git a/drivers/net/ethernet/chelsio/inline_crypto/ch_ktls/chcr_ktls.c b/drivers/net/ethernet/chelsio/inline_crypto/ch_ktls/chcr_ktls.c
index 59683f79959c..60b648b46f75 100644
--- a/drivers/net/ethernet/chelsio/inline_crypto/ch_ktls/chcr_ktls.c
+++ b/drivers/net/ethernet/chelsio/inline_crypto/ch_ktls/chcr_ktls.c
@@ -483,7 +483,7 @@ static int chcr_ktls_dev_add(struct net_device *netdev, struct sock *sk,
 		tx_info->ip_family = AF_INET;
 #if IS_ENABLED(CONFIG_IPV6)
 	} else {
-		if (!sk->sk_ipv6only &&
+		if (!ipv6_only_sock(sk) &&
 		    ipv6_addr_type(&sk->sk_v6_daddr) == IPV6_ADDR_MAPPED) {
 			memcpy(daaddr, &sk->sk_daddr, 4);
 			tx_info->ip_family = AF_INET;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.c
index 4c4ee524176c..3ae6067c7e6b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.c
@@ -102,7 +102,7 @@ struct mlx5_flow_handle *mlx5e_accel_fs_add_sk(struct mlx5e_priv *priv,
 		break;
 #if IS_ENABLED(CONFIG_IPV6)
 	case AF_INET6:
-		if (!sk->sk_ipv6only &&
+		if (!ipv6_only_sock(sk) &&
 		    ipv6_addr_type(&sk->sk_v6_daddr) == IPV6_ADDR_MAPPED) {
 			accel_fs_tcp_set_ipv4_flow(spec, sk);
 			ft = &fs_tcp->tables[ACCEL_FS_IPV4_TCP];
diff --git a/drivers/net/ethernet/netronome/nfp/crypto/tls.c b/drivers/net/ethernet/netronome/nfp/crypto/tls.c
index 84d66d138c3d..78368e71ce83 100644
--- a/drivers/net/ethernet/netronome/nfp/crypto/tls.c
+++ b/drivers/net/ethernet/netronome/nfp/crypto/tls.c
@@ -289,7 +289,7 @@ nfp_net_tls_add(struct net_device *netdev, struct sock *sk,
 	switch (sk->sk_family) {
 #if IS_ENABLED(CONFIG_IPV6)
 	case AF_INET6:
-		if (sk->sk_ipv6only ||
+		if (ipv6_only_sock(sk) ||
 		    ipv6_addr_type(&sk->sk_v6_daddr) != IPV6_ADDR_MAPPED) {
 			req_sz = sizeof(struct nfp_crypto_req_add_v6);
 			ipv6 = true;
diff --git a/net/core/filter.c b/net/core/filter.c
index 143f442a9505..2ed81c48c282 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -7099,7 +7099,7 @@ BPF_CALL_5(bpf_tcp_gen_syncookie, struct sock *, sk, void *, iph, u32, iph_len,
 	 */
 	switch (((struct iphdr *)iph)->version) {
 	case 4:
-		if (sk->sk_family == AF_INET6 && sk->sk_ipv6only)
+		if (sk->sk_family == AF_INET6 && ipv6_only_sock(sk))
 			return -EINVAL;
 
 		mss = tcp_v4_get_syncookie(sk, iph, th, &cookie);
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 6595a78672c8..70564ddccc46 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -318,7 +318,7 @@ static int __inet6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
 		/* Binding to v4-mapped address on a v6-only socket
 		 * makes no sense
 		 */
-		if (sk->sk_ipv6only) {
+		if (ipv6_only_sock(sk)) {
 			err = -EINVAL;
 			goto out;
 		}
-- 
2.30.2


^ permalink raw reply related

* [PATCH net-next 1/2] ipv6: Remove __ipv6_only_sock().
From: Kuniyuki Iwashima @ 2022-04-20  1:58 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, David Ahern
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
In-Reply-To: <20220420015851.50237-1-kuniyu@amazon.co.jp>

Since commit 9fe516ba3fb2 ("inet: move ipv6only in sock_common"),
ipv6_only_sock() and __ipv6_only_sock() are the same macro.  Let's
remove the one.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
---
 include/linux/ipv6.h | 4 +---
 net/dccp/ipv6.c      | 2 +-
 net/ipv6/datagram.c  | 4 ++--
 net/ipv6/tcp_ipv6.c  | 2 +-
 net/ipv6/udp.c       | 4 ++--
 net/sctp/ipv6.c      | 4 ++--
 6 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index 918bfea4ef5f..ec5ca392eaa3 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -340,8 +340,7 @@ static inline struct raw6_sock *raw6_sk(const struct sock *sk)
 	return (struct raw6_sock *)sk;
 }
 
-#define __ipv6_only_sock(sk)	(sk->sk_ipv6only)
-#define ipv6_only_sock(sk)	(__ipv6_only_sock(sk))
+#define ipv6_only_sock(sk)	(sk->sk_ipv6only)
 #define ipv6_sk_rxinfo(sk)	((sk)->sk_family == PF_INET6 && \
 				 inet6_sk(sk)->rxopt.bits.rxinfo)
 
@@ -358,7 +357,6 @@ static inline int inet_v6_ipv6only(const struct sock *sk)
 	return ipv6_only_sock(sk);
 }
 #else
-#define __ipv6_only_sock(sk)	0
 #define ipv6_only_sock(sk)	0
 #define ipv6_sk_rxinfo(sk)	0
 
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index eab3bd1ee9a0..4d95b6400915 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -892,7 +892,7 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 
 		SOCK_DEBUG(sk, "connect: ipv4 mapped\n");
 
-		if (__ipv6_only_sock(sk))
+		if (ipv6_only_sock(sk))
 			return -ENETUNREACH;
 
 		sin.sin_family = AF_INET;
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 206f66310a88..39b2327edc4e 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -145,7 +145,7 @@ int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr,
 	int			err;
 
 	if (usin->sin6_family == AF_INET) {
-		if (__ipv6_only_sock(sk))
+		if (ipv6_only_sock(sk))
 			return -EAFNOSUPPORT;
 		err = __ip4_datagram_connect(sk, uaddr, addr_len);
 		goto ipv4_connected;
@@ -178,7 +178,7 @@ int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr,
 	if (addr_type & IPV6_ADDR_MAPPED) {
 		struct sockaddr_in sin;
 
-		if (__ipv6_only_sock(sk)) {
+		if (ipv6_only_sock(sk)) {
 			err = -ENETUNREACH;
 			goto out;
 		}
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 782df529ff69..54277de7474b 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -230,7 +230,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 		u32 exthdrlen = icsk->icsk_ext_hdr_len;
 		struct sockaddr_in sin;
 
-		if (__ipv6_only_sock(sk))
+		if (ipv6_only_sock(sk))
 			return -ENETUNREACH;
 
 		sin.sin_family = AF_INET;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index db9449b52dbe..688af6f809fe 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1123,7 +1123,7 @@ static int udpv6_pre_connect(struct sock *sk, struct sockaddr *uaddr,
 	 * bytes that are out of the bound specified by user in addr_len.
 	 */
 	if (uaddr->sa_family == AF_INET) {
-		if (__ipv6_only_sock(sk))
+		if (ipv6_only_sock(sk))
 			return -EAFNOSUPPORT;
 		return udp_pre_connect(sk, uaddr, addr_len);
 	}
@@ -1359,7 +1359,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 			msg->msg_name = &sin;
 			msg->msg_namelen = sizeof(sin);
 do_udp_sendmsg:
-			if (__ipv6_only_sock(sk))
+			if (ipv6_only_sock(sk))
 				return -ENETUNREACH;
 			return udp_sendmsg(sk, msg, len);
 		}
diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
index 470dbdc27d58..d081858c2d07 100644
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -926,7 +926,7 @@ static int sctp_inet6_af_supported(sa_family_t family, struct sctp_sock *sp)
 		return 1;
 	/* v4-mapped-v6 addresses */
 	case AF_INET:
-		if (!__ipv6_only_sock(sctp_opt2sk(sp)))
+		if (!ipv6_only_sock(sctp_opt2sk(sp)))
 			return 1;
 		fallthrough;
 	default:
@@ -952,7 +952,7 @@ static int sctp_inet6_cmp_addr(const union sctp_addr *addr1,
 		return 0;
 
 	/* If the socket is IPv6 only, v4 addrs will not match */
-	if (__ipv6_only_sock(sk) && af1 != af2)
+	if (ipv6_only_sock(sk) && af1 != af2)
 		return 0;
 
 	/* Today, wildcard AF_INET/AF_INET6. */
-- 
2.30.2


^ permalink raw reply related

* Re: [PATCH net-next] PCI: add Corigine vendor ID into pci_ids.h
From: Yinjun Zhang @ 2022-04-20  1:59 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: davem, kuba, netdev, Bjorn Helgaas, linux-pci, Simon Horman
In-Reply-To: <Yl8w5XK54fB/rx9c@lunn.ch>

On Wed, Apr 20, 2022 at 12:00:05AM +0200, Andrew Lunn wrote:
> On Tue, Apr 19, 2022 at 06:02:48PM +0800, Yinjun Zhang wrote:
> > Cc: Bjorn Helgaas <bhelgaas@google.com>
> > Cc: linux-pci@vger.kernel.org
> > Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com>
> > Signed-off-by: Simon Horman <simon.horman@corigine.com>
> > ---
> >  include/linux/pci_ids.h | 2 ++
> 
> The very top of this file says:
> 
>  *      Do not add new entries to this file unless the definitions
>  *      are shared between multiple drivers.
> 
> Please add to the commit messages the two or more drivers which share
> this definition.

It will be used by ethernet and infiniband driver as we can see now,
I'll update the commit message if you think it's a good practice.

> 
> Thanks
> 
>      Andrew

^ permalink raw reply

* [PATCH net-next 0/2] ipv6: Use ipv6_only_sock helper function.
From: Kuniyuki Iwashima @ 2022-04-20  1:58 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, David Ahern
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

The first patch removes __ipv6_only_sock(), and the second replaces
ipv6only tests with ipv6_only_sock().


Kuniyuki Iwashima (2):
  ipv6: Remove __ipv6_only_sock().
  ipv6: Use ipv6_only_sock() helper in condition.

 drivers/net/bonding/bond_main.c                               | 2 +-
 .../net/ethernet/chelsio/inline_crypto/ch_ktls/chcr_ktls.c    | 2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.c     | 2 +-
 drivers/net/ethernet/netronome/nfp/crypto/tls.c               | 2 +-
 include/linux/ipv6.h                                          | 4 +---
 net/core/filter.c                                             | 2 +-
 net/dccp/ipv6.c                                               | 2 +-
 net/ipv6/af_inet6.c                                           | 2 +-
 net/ipv6/datagram.c                                           | 4 ++--
 net/ipv6/tcp_ipv6.c                                           | 2 +-
 net/ipv6/udp.c                                                | 4 ++--
 net/sctp/ipv6.c                                               | 4 ++--
 12 files changed, 15 insertions(+), 17 deletions(-)

-- 
2.30.2


^ permalink raw reply

* Re: [PATCH net-next v2 1/2] tcp: ensure to use the most recently sent skb when filling the rate sample
From: Pengcheng Yang @ 2022-04-20  1:48 UTC (permalink / raw)
  To: 'Paolo Abeni', 'Neal Cardwell'
  Cc: 'Eric Dumazet', 'Yuchung Cheng', netdev,
	'David S. Miller', 'Jakub Kicinski'
In-Reply-To: <862fb2570c4350f0fd3bb3ad153d37b528564ed1.camel@redhat.com>

On Tue, Apr 19, 2022 at 10:00 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Sun, 2022-04-17 at 14:51 -0400, Neal Cardwell wrote:
> > On Sat, Apr 16, 2022 at 5:20 AM Pengcheng Yang <yangpc@wangsu.com> wrote:
> > >
> > > If an ACK (s)acks multiple skbs, we favor the information
> > > from the most recently sent skb by choosing the skb with
> > > the highest prior_delivered count. But in the interval
> > > between receiving ACKs, we send multiple skbs with the same
> > > prior_delivered, because the tp->delivered only changes
> > > when we receive an ACK.
> > >
> > > We used RACK's solution, copying tcp_rack_sent_after() as
> > > tcp_skb_sent_after() helper to determine "which packet was
> > > sent last?". Later, we will use tcp_skb_sent_after() instead
> > > in RACK.
> > >
> > > Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
> > > Cc: Neal Cardwell <ncardwell@google.com>
> > > ---
> > >  include/net/tcp.h   |  6 ++++++
> > >  net/ipv4/tcp_rate.c | 11 ++++++++---
> > >  2 files changed, 14 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/include/net/tcp.h b/include/net/tcp.h
> > > index 6d50a66..fcd69fc 100644
> > > --- a/include/net/tcp.h
> > > +++ b/include/net/tcp.h
> > > @@ -1042,6 +1042,7 @@ struct rate_sample {
> > >         int  losses;            /* number of packets marked lost upon ACK */
> > >         u32  acked_sacked;      /* number of packets newly (S)ACKed upon ACK */
> > >         u32  prior_in_flight;   /* in flight before this ACK */
> > > +       u32  last_end_seq;      /* end_seq of most recently ACKed packet */
> > >         bool is_app_limited;    /* is sample from packet with bubble in pipe? */
> > >         bool is_retrans;        /* is sample from retransmission? */
> > >         bool is_ack_delayed;    /* is this (likely) a delayed ACK? */
> > > @@ -1158,6 +1159,11 @@ void tcp_rate_gen(struct sock *sk, u32 delivered, u32 lost,
> > >                   bool is_sack_reneg, struct rate_sample *rs);
> > >  void tcp_rate_check_app_limited(struct sock *sk);
> > >
> > > +static inline bool tcp_skb_sent_after(u64 t1, u64 t2, u32 seq1, u32 seq2)
> > > +{
> > > +       return t1 > t2 || (t1 == t2 && after(seq1, seq2));
> > > +}
> > > +
> > >  /* These functions determine how the current flow behaves in respect of SACK
> > >   * handling. SACK is negotiated with the peer, and therefore it can vary
> > >   * between different flows.
> > > diff --git a/net/ipv4/tcp_rate.c b/net/ipv4/tcp_rate.c
> > > index 617b818..a8f6d9d 100644
> > > --- a/net/ipv4/tcp_rate.c
> > > +++ b/net/ipv4/tcp_rate.c
> > > @@ -74,27 +74,32 @@ void tcp_rate_skb_sent(struct sock *sk, struct sk_buff *skb)
> > >   *
> > >   * If an ACK (s)acks multiple skbs (e.g., stretched-acks), this function is
> > >   * called multiple times. We favor the information from the most recently
> > > - * sent skb, i.e., the skb with the highest prior_delivered count.
> > > + * sent skb, i.e., the skb with the most recently sent time and the highest
> > > + * sequence.
> > >   */
> > >  void tcp_rate_skb_delivered(struct sock *sk, struct sk_buff *skb,
> > >                             struct rate_sample *rs)
> > >  {
> > >         struct tcp_sock *tp = tcp_sk(sk);
> > >         struct tcp_skb_cb *scb = TCP_SKB_CB(skb);
> > > +       u64 tx_tstamp;
> > >
> > >         if (!scb->tx.delivered_mstamp)
> > >                 return;
> > >
> > > +       tx_tstamp = tcp_skb_timestamp_us(skb);
> > >         if (!rs->prior_delivered ||
> > > -           after(scb->tx.delivered, rs->prior_delivered)) {
> > > +           tcp_skb_sent_after(tx_tstamp, tp->first_tx_mstamp,
> > > +                              scb->end_seq, rs->last_end_seq)) {
> > >                 rs->prior_delivered_ce  = scb->tx.delivered_ce;
> > >                 rs->prior_delivered  = scb->tx.delivered;
> > >                 rs->prior_mstamp     = scb->tx.delivered_mstamp;
> > >                 rs->is_app_limited   = scb->tx.is_app_limited;
> > >                 rs->is_retrans       = scb->sacked & TCPCB_RETRANS;
> > > +               rs->last_end_seq     = scb->end_seq;
> > >
> > >                 /* Record send time of most recently ACKed packet: */
> > > -               tp->first_tx_mstamp  = tcp_skb_timestamp_us(skb);
> > > +               tp->first_tx_mstamp  = tx_tstamp;
> > >                 /* Find the duration of the "send phase" of this window: */
> > >                 rs->interval_us = tcp_stamp_us_delta(tp->first_tx_mstamp,
> > >                                                      scb->tx.first_tx_mstamp);
> > > --
> >
> > Thanks for the patch! The change looks good to me, and it passes our
> > team's packetdrill tests.
> >
> > One suggestion: currently this patch seems to be targeted to the
> > net-next branch. However, since it's a bug fix my sense is that it
> > would be best to target this to the net branch, so that it gets
> > backported to stable releases.
> >
> > One complication is that the follow-on patch in this series ("tcp: use
> > tcp_skb_sent_after() instead in RACK") is a pure re-factor/cleanup,
> > which is more appropriate for net-next. So the plan I was trying to
> > describe in the previous thread was that this series could be
> > implemented as:
> >
> > (1) first, submit "tcp: ensure to use the most recently sent skb when
> > filling the rate sample" to the net branch
> > (2) wait for the fix in the net branch to be merged into the net-next branch
> > (3) second, submit "tcp: use tcp_skb_sent_after() instead in RACK" to
> > the net-next branch
> >
> > What do folks think?
> 
> +1 for the above.
> 
> @Pengcheng: please additionally provide a suitable 'fixes' tag for
> patch 1/2.

Fixes: b9f64820fb22 ("tcp: track data delivery rate for a TCP connection")

> 
> Thanks!
> 
> Paolo


^ permalink raw reply

* [PATCH] octeon_ep: Remove unnecessary cast
From: Haowen Bai @ 2022-04-20  1:47 UTC (permalink / raw)
  To: Veerasenareddy Burru, Abhijit Ayarekar, David S. Miller,
	Jakub Kicinski, Paolo Abeni
  Cc: Haowen Bai, netdev, linux-kernel

Fix the following coccicheck warning:

./drivers/net/ethernet/marvell/octeon_ep/octep_rx.c:161:18-40: WARNING:
casting value returned by memory allocation function to (struct
octep_rx_buffer *) is useless.

Signed-off-by: Haowen Bai <baihaowen@meizu.com>
---
 drivers/net/ethernet/marvell/octeon_ep/octep_rx.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/marvell/octeon_ep/octep_rx.c b/drivers/net/ethernet/marvell/octeon_ep/octep_rx.c
index 945947ec7723..96768823cced 100644
--- a/drivers/net/ethernet/marvell/octeon_ep/octep_rx.c
+++ b/drivers/net/ethernet/marvell/octeon_ep/octep_rx.c
@@ -158,8 +158,7 @@ static int octep_setup_oq(struct octep_device *oct, int q_no)
 		goto desc_dma_alloc_err;
 	}
 
-	oq->buff_info = (struct octep_rx_buffer *)
-			vzalloc(oq->max_count * OCTEP_OQ_RECVBUF_SIZE);
+	oq->buff_info = vzalloc(oq->max_count * OCTEP_OQ_RECVBUF_SIZE);
 	if (unlikely(!oq->buff_info)) {
 		dev_err(&oct->pdev->dev,
 			"Failed to allocate buffer info for OQ-%d\n", q_no);
-- 
2.7.4


^ permalink raw reply related

* Re: [PATCH net-next] PCI: add Corigine vendor ID into pci_ids.h
From: Yinjun Zhang @ 2022-04-20  1:46 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: davem, kuba, netdev, Bjorn Helgaas, linux-pci, Simon Horman
In-Reply-To: <20220419142310.GA1197793@bhelgaas>

On Tue, Apr 19, 2022 at 09:23:10AM -0500, Bjorn Helgaas wrote:
> On Tue, Apr 19, 2022 at 06:02:48PM +0800, Yinjun Zhang wrote:
> > Cc: Bjorn Helgaas <bhelgaas@google.com>
> > Cc: linux-pci@vger.kernel.org
> > Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com>
> > Signed-off-by: Simon Horman <simon.horman@corigine.com>
> 
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> 
> I'd be happy to apply this, but since I'm in the cc: line, I assume it
> will be applied with other patches that use this.  Let me know if
> otherwise.

Yes, it will be used by coming-up patches which will be submitted to
net-next tree, so I'll appreciate that if it can be applied to net-next
tree first.

> 
> I see that you also added the ID at
> https://pci-ids.ucw.cz/read/PC/1da8; thank you for that!  
> 
> But it looks like the "name" part isn't quite correct -- at
> https://pci-ids.ucw.cz/read/PC?restrict=1, "Corigine" isn't shown, so
> I think lspci won't show the right thing yet.

Yeah, I presume that it's not approved yet by the administrator.

> 
> > ---
> >  include/linux/pci_ids.h | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
> > index 0178823ce8c2..6d12b6d71c61 100644
> > --- a/include/linux/pci_ids.h
> > +++ b/include/linux/pci_ids.h
> > @@ -2568,6 +2568,8 @@
> >  
> >  #define PCI_VENDOR_ID_HYGON		0x1d94
> >  
> > +#define PCI_VENDOR_ID_CORIGINE		0x1da8
> > +
> >  #define PCI_VENDOR_ID_FUNGIBLE		0x1dad
> >  
> >  #define PCI_VENDOR_ID_HXT		0x1dbf
> > -- 
> > 1.8.3.1
> > 

^ permalink raw reply

* [PATCH v2 2/2] igc: Trigger proper interrupts in igc_xsk_wakeup
From: jeff.evanson @ 2022-04-20  1:26 UTC (permalink / raw)
  To: Jesse Brandeburg, Tony Nguyen, David S. Miller, Jakub Kicinski,
	Vedang Patel, Andre Guedes, Maciej Fijalkowski, Jithu Joseph,
	moderated list:INTEL ETHERNET DRIVERS,
	open list:NETWORKING DRIVERS, open list
  Cc: Jeff Evanson
In-Reply-To: <20220420012635.13733-1-jeff.evanson@qsc.com>

From: Jeff Evanson <jeff.evanson@qsc.com>

In igc_xsk_wakeup, trigger the proper interrupt based on whether flags
contains XDP_WAKEUP_RX and/or XDP_WAKEUP_TX.

Consider a scenario where the transmit queue interrupt is mapped to a
different irq from the receive queue. If XDP_WAKEUP_TX is set in the
flags argument, the interrupt for transmit queue must be triggered,
otherwise the transmit queue's napi_struct will never be scheduled.

In the case where both XDP_WAKEUP_TX and XDP_WAKEUP_RX are both set,
the receive interrupt should always be triggered, but the transmit
interrupt should only be triggered if its q_vector differs from the
receive queue's interrupt.

Fixes: fc9df2a0b520 ("igc: Enable RX via AF_XDP zero-copy")
---
 drivers/net/ethernet/intel/igc/igc_main.c | 40 ++++++++++++++++++-----
 1 file changed, 31 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
index a36a18c84aeb..41b5d1ac8bc1 100644
--- a/drivers/net/ethernet/intel/igc/igc_main.c
+++ b/drivers/net/ethernet/intel/igc/igc_main.c
@@ -6072,8 +6072,8 @@ static void igc_trigger_rxtxq_interrupt(struct igc_adapter *adapter,
 
 int igc_xsk_wakeup(struct net_device *dev, u32 queue_id, u32 flags)
 {
+	struct igc_q_vector *txq_vector = NULL, *rxq_vector = NULL;
 	struct igc_adapter *adapter = netdev_priv(dev);
-	struct igc_q_vector *q_vector;
 	struct igc_ring *ring;
 
 	if (test_bit(__IGC_DOWN, &adapter->state))
@@ -6082,17 +6082,39 @@ int igc_xsk_wakeup(struct net_device *dev, u32 queue_id, u32 flags)
 	if (!igc_xdp_is_enabled(adapter))
 		return -ENXIO;
 
-	if (queue_id >= adapter->num_rx_queues)
-		return -EINVAL;
+	if (flags & XDP_WAKEUP_RX) {
+		if (queue_id >= adapter->num_rx_queues)
+			return -EINVAL;
 
-	ring = adapter->rx_ring[queue_id];
+		ring = adapter->rx_ring[queue_id];
+		if (!ring->xsk_pool)
+			return -ENXIO;
 
-	if (!ring->xsk_pool)
-		return -ENXIO;
+		rxq_vector = ring->q_vector;
+	}
+
+	if (flags & XDP_WAKEUP_TX) {
+		if (queue_id >= adapter->num_tx_queues)
+			return -EINVAL;
+
+		ring = adapter->tx_ring[queue_id];
+		if (!ring->xsk_pool)
+			return -ENXIO;
+
+		txq_vector = ring->q_vector;
+	}
+
+	if (rxq_vector != NULL &&
+	    !napi_if_scheduled_mark_missed(&rxq_vector->napi))
+		igc_trigger_rxtxq_interrupt(adapter, rxq_vector);
 
-	q_vector = adapter->q_vector[queue_id];
-	if (!napi_if_scheduled_mark_missed(&q_vector->napi))
-		igc_trigger_rxtxq_interrupt(adapter, q_vector);
+	/* only trigger tx interrupt if the receive interrupt was not
+	 * triggered or if its irq differs from the receive queue's irq
+	 */
+	if (txq_vector != NULL &&
+            txq_vector != rxq_vector &&
+	    !napi_if_scheduled_mark_missed(&txq_vector->napi))
+		igc_trigger_rxtxq_interrupt(adapter, txq_vector);
 
 	return 0;
 }
-- 
2.17.1


^ permalink raw reply related

* [PATCH v2 1/2] igc: Fix race in igc_xdp_xmit_zc
From: jeff.evanson @ 2022-04-20  1:26 UTC (permalink / raw)
  To: Jesse Brandeburg, Tony Nguyen, David S. Miller, Jakub Kicinski,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Vedang Patel, Andre Guedes, Maciej Fijalkowski,
	Jithu Joseph, moderated list:INTEL ETHERNET DRIVERS,
	open list:NETWORKING DRIVERS, open list,
	open list:XDP (eXpress Data Path)
  Cc: Jeff Evanson
In-Reply-To: <20220420012635.13733-1-jeff.evanson@qsc.com>

From: Jeff Evanson <jeff.evanson@qsc.com>

In igc_xdp_xmit_zc, initialize next_to_use while holding the netif_tx_lock
to prevent racing with other users of the tx ring.

Fixes: 9acf59a752d4 (igc: Enable TX via AF_XDP zero-copy)
---
 drivers/net/ethernet/intel/igc/igc_main.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
index 1c00ee310c19..a36a18c84aeb 100644
--- a/drivers/net/ethernet/intel/igc/igc_main.c
+++ b/drivers/net/ethernet/intel/igc/igc_main.c
@@ -2598,7 +2598,7 @@ static void igc_xdp_xmit_zc(struct igc_ring *ring)
 	struct netdev_queue *nq = txring_txq(ring);
 	union igc_adv_tx_desc *tx_desc = NULL;
 	int cpu = smp_processor_id();
-	u16 ntu = ring->next_to_use;
+	u16 ntu;
 	struct xdp_desc xdp_desc;
 	u16 budget;
 
@@ -2607,6 +2607,8 @@ static void igc_xdp_xmit_zc(struct igc_ring *ring)
 
 	__netif_tx_lock(nq, cpu);
 
+	ntu = ring->next_to_use;
+
 	budget = igc_desc_unused(ring);
 
 	while (xsk_tx_peek_desc(pool, &xdp_desc) && budget--) {
-- 
2.17.1


^ permalink raw reply related

* [PATCH v2 0/2] AF_XDP patches for zero-copy in igc driver
From: jeff.evanson @ 2022-04-20  1:26 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
	open list:XDP (eXpress Data Path),
	open list:XDP (eXpress Data Path)
  Cc: Jeff Evanson

From: Jeff Evanson <jeff.evanson@qsc.com>

Patch 1/2
In igc_xdp_xmit_zc, the local variable ntu is initialized from ring->next_to_use 
without holding the __netif_tx_lock. If another thread already held the lock, the 
ntu value is potentially incorrect. Various bad behaviors were observed ranging
from transmit timeouts to an outright hard-lock of the host. The behavior is
corrected by simply initializing ntu while __netif_tx_lock is held.

Patch 2/2
In igc_xsk_wakeup, only the q_vector interrupt for the passed queue_id was
being triggered. This worked fine so long as both the tx and rx queues shared
the same interrupt. If the tx and rx queues use separate interrupts, only
the rx interrupt would be triggered, regardless of whether flags contained
XDP_WAKEUP_TX. This patch changes the behavior so that q_vectors (rxq_vector 
and txq_vector) are looked up from the referenced tx and/or rx queues instead
of from the adapter q_vector array. If only XDP_WAKEUP_RX is set, rxq_vector
is triggered. If only XDP_WAKEUP_TX is set txq_vector is triggered. If both
are set, rxq_vector is triggered and txq_vector is triggered only if it is
not equal to rxq_vector.
The bad behavior here is apparent when packets are queued up to an AF_XDP
socket and then sendmsg is called on the socket. The packets would not be
sent until the q_vector was triggered via some other mechanism (IE
igc_xmit_frame).


Jeff Evanson (2):
  igc: Fix race in igc_xdp_xmit_zc
  igc: Trigger proper interrupts in igc_xsk_wakeup

 drivers/net/ethernet/intel/igc/igc_main.c | 44 +++++++++++++++++------
 1 file changed, 34 insertions(+), 10 deletions(-)

-- 
2.17.1


^ permalink raw reply

* [PATCH bpf-next v1 2/2] selftests/bpf: handle batch operations for map-in-map bpf-maps
From: Takshak Chahande @ 2022-04-20  1:17 UTC (permalink / raw)
  To: netdev; +Cc: ast, kernel-team, ctakshak, ndixit, kafai, andriin
In-Reply-To: <20220420011734.605177-1-ctakshak@fb.com>

This patch adds up test cases that handles 4 combinations:
a) outer map: BPF_MAP_TYPE_ARRAY_OF_MAPS
   inner maps: BPF_MAP_TYPE_ARRAY and BPF_MAP_TYPE_HASH
b) outer map: BPF_MAP_TYPE_HASH_OF_MAPS
   inner maps: BPF_MAP_TYPE_ARRAY and BPF_MAP_TYPE_HASH

Signed-off-by: Takshak Chahande <ctakshak@fb.com>
---
 .../bpf/map_tests/map_in_map_batch_ops.c      | 232 ++++++++++++++++++
 1 file changed, 232 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/map_tests/map_in_map_batch_ops.c

diff --git a/tools/testing/selftests/bpf/map_tests/map_in_map_batch_ops.c b/tools/testing/selftests/bpf/map_tests/map_in_map_batch_ops.c
new file mode 100644
index 000000000000..5a0f7f9a6ddd
--- /dev/null
+++ b/tools/testing/selftests/bpf/map_tests/map_in_map_batch_ops.c
@@ -0,0 +1,232 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <stdio.h>
+#include <errno.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#include <test_maps.h>
+
+#define OUTER_MAP_ENTRIES 10
+
+static __u32 get_map_id_from_fd(int map_fd)
+{
+	struct bpf_map_info map_info = {};
+	uint32_t info_len = sizeof(map_info);
+	int ret;
+
+	ret = bpf_obj_get_info_by_fd(map_fd, &map_info, &info_len);
+	CHECK(ret < 0, "Finding map info failed, error:%s\n",
+	      strerror(errno));
+
+	return map_info.id;
+}
+
+/* This creates number of OUTER_MAP_ENTRIES maps that will be stored
+ * in outer map and return the created map_fds
+ */
+static void create_inner_maps(enum bpf_map_type map_type,
+			      __u32 *inner_map_fds)
+{
+	int map_fd, map_index, ret;
+	__u32 map_key = 0, map_id;
+	char map_name[15];
+
+	for (map_index = 0; map_index < OUTER_MAP_ENTRIES; map_index++) {
+		memset(map_name, 0, sizeof(map_name));
+		sprintf(map_name, "inner_map_fd_%d", map_index);
+		map_fd = bpf_map_create(map_type, map_name, sizeof(__u32),
+					sizeof(__u32), 1, NULL);
+		CHECK(map_fd < 0,
+		      "inner bpf_map_create() failed",
+		      "map_type=(%d) map_name(%s), error:%s\n",
+		      map_type, map_name, strerror(errno));
+
+		/* keep track of the inner map fd as it is required
+		 * to add records in outer map
+		 */
+		inner_map_fds[map_index] = map_fd;
+
+		/* Add entry into this created map
+		 * eg: map1 key = 0, value = map1's map id
+		 *     map2 key = 0, value = map2's map id
+		 */
+		map_id = get_map_id_from_fd(map_fd);
+		ret = bpf_map_update_elem(map_fd, &map_key, &map_id, 0);
+		CHECK(ret != 0,
+		      "bpf_map_update_elem failed",
+		      "map_type=(%d) map_name(%s), error:%s\n",
+		      map_type, map_name, strerror(errno));
+	}
+}
+
+static int create_outer_map(enum bpf_map_type map_type, __u32 inner_map_fd)
+{
+	int outer_map_fd;
+
+	LIBBPF_OPTS(bpf_map_create_opts, attr);
+	attr.inner_map_fd = inner_map_fd;
+	outer_map_fd = bpf_map_create(map_type, "outer_map", sizeof(__u32),
+				      sizeof(__u32), OUTER_MAP_ENTRIES,
+				      &attr);
+	CHECK(outer_map_fd < 0,
+	      "outer bpf_map_create()",
+	      "map_type=(%d), error:%s\n",
+	      map_type, strerror(errno));
+
+	return outer_map_fd;
+}
+
+static void validate_fetch_results(int outer_map_fd, __u32 *inner_map_fds,
+				   __u32 *fetched_keys, __u32 *fetched_values,
+				   __u32 max_entries_fetched)
+{
+	__u32 inner_map_key, inner_map_value;
+	int inner_map_fd, entry, err;
+	__u32 outer_map_value;
+
+	for (entry = 0; entry < max_entries_fetched; ++entry) {
+		outer_map_value = fetched_values[entry];
+		inner_map_fd = bpf_map_get_fd_by_id(outer_map_value);
+		CHECK(inner_map_fd < 0,
+		      "Failed to get inner map fd",
+		      "from id(%d), error=%s\n",
+		      outer_map_value, strerror(errno));
+		err = bpf_map_get_next_key(inner_map_fd, NULL, &inner_map_key);
+		CHECK(err != 0,
+		      "Failed to get inner map key",
+		      "error=%s\n", strerror(errno));
+
+		err = bpf_map_lookup_elem(inner_map_fd, &inner_map_key,
+					  &inner_map_value);
+		CHECK(err != 0,
+		      "Failed to get inner map value",
+		      "for key(%d), error=%s\n",
+		      inner_map_key, strerror(errno));
+
+		/* Actual value validation */
+		CHECK(outer_map_value != inner_map_value,
+		      "Failed to validate inner map value",
+		      "fetched(%d) and lookedup(%d)!\n",
+		      outer_map_value, inner_map_value);
+	}
+}
+
+static void fetch_and_validate(int outer_map_fd,
+			       __u32 *inner_map_fds,
+			       struct bpf_map_batch_opts *opts,
+			       __u32 batch_size, bool delete_entries)
+{
+	__u32 *fetched_keys, *fetched_values, fetched_entries = 0;
+	__u32 next_batch_key = 0, step_size = 5;
+	int err, retries = 0, max_retries = 3;
+	__u32 value_size = sizeof(__u32);
+
+	fetched_keys = calloc(batch_size, value_size);
+	fetched_values = calloc(batch_size, value_size);
+
+	while (fetched_entries < batch_size) {
+		err = delete_entries
+		      ? bpf_map_lookup_and_delete_batch(outer_map_fd,
+				      fetched_entries ? &next_batch_key : NULL,
+				      &next_batch_key,
+				      fetched_keys + fetched_entries,
+				      fetched_values + fetched_entries,
+				      &step_size, opts)
+		      : bpf_map_lookup_batch(outer_map_fd,
+				      fetched_entries ? &next_batch_key : NULL,
+				      &next_batch_key,
+				      fetched_keys + fetched_entries,
+				      fetched_values + fetched_entries,
+				      &step_size, opts);
+		CHECK((err < 0 && (errno != ENOENT && errno != ENOSPC)),
+		      "lookup with steps failed",
+		      "error: %s\n", strerror(errno));
+
+		fetched_entries += step_size;
+		/* retry for max_retries if ENOSPC */
+		if (errno == ENOSPC)
+			++retries;
+
+		if (retries >= max_retries)
+			break;
+	}
+
+	CHECK((fetched_entries != batch_size && err != ENOSPC),
+	      "Unable to fetch expected entries !",
+	      "fetched_entries(%d) and batch_size(%d) error: (%d):%s\n",
+	      fetched_entries, batch_size, errno, strerror(errno));
+
+	/* validate the fetched entries */
+	validate_fetch_results(outer_map_fd, inner_map_fds, fetched_keys,
+			       fetched_values, fetched_entries);
+	printf("batch_op is successful for batch_size(%d)\n", batch_size);
+
+	free(fetched_keys);
+	free(fetched_values);
+}
+
+static void _map_in_map_batch_ops(enum bpf_map_type outer_map_type,
+				  enum bpf_map_type inner_map_type)
+{
+	__u32 *outer_map_keys, *inner_map_fds;
+	__u32 max_entries = OUTER_MAP_ENTRIES;
+	__u32 value_size = sizeof(__u32);
+	int batch_size[2] = {5, 10};
+	__u32 map_index, op_index;
+	int outer_map_fd, ret;
+	DECLARE_LIBBPF_OPTS(bpf_map_batch_opts, opts,
+			    .elem_flags = 0,
+			    .flags = 0,
+	);
+
+	outer_map_keys = calloc(max_entries, value_size);
+	inner_map_fds = calloc(max_entries, value_size);
+	create_inner_maps(inner_map_type, inner_map_fds);
+
+	outer_map_fd = create_outer_map(outer_map_type, *inner_map_fds);
+	/* create outer map keys */
+	for (map_index = 0; map_index < max_entries; map_index++)
+		outer_map_keys[map_index] =
+			((outer_map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS)
+			 ? 9 : 1000) - map_index;
+
+	/* batch operation - map_update */
+	ret = bpf_map_update_batch(outer_map_fd, outer_map_keys,
+				   inner_map_fds, &max_entries, &opts);
+	CHECK(ret != 0,
+	      "Failed to update the outer map batch ops",
+	      "error=%s\n", strerror(errno));
+
+	/* batch operation - map_lookup */
+	for (op_index = 0; op_index < 2; ++op_index)
+		fetch_and_validate(outer_map_fd, inner_map_fds, &opts,
+				   batch_size[op_index], false);
+
+	/* batch operation - map_lookup_delete */
+	if (outer_map_type == BPF_MAP_TYPE_HASH_OF_MAPS)
+		fetch_and_validate(outer_map_fd, inner_map_fds, &opts,
+				   max_entries, true /*delete*/);
+
+	free(inner_map_fds);
+	free(outer_map_keys);
+}
+
+void test_map_in_map_batch_ops_array(void)
+{
+	_map_in_map_batch_ops(BPF_MAP_TYPE_ARRAY_OF_MAPS, BPF_MAP_TYPE_ARRAY);
+	printf("%s:PASS with inner ARRAY map\n", __func__);
+	_map_in_map_batch_ops(BPF_MAP_TYPE_ARRAY_OF_MAPS, BPF_MAP_TYPE_HASH);
+	printf("%s:PASS with inner HASH map\n", __func__);
+}
+
+void test_map_in_map_batch_ops_hash(void)
+{
+	_map_in_map_batch_ops(BPF_MAP_TYPE_HASH_OF_MAPS, BPF_MAP_TYPE_ARRAY);
+	printf("%s:PASS with inner ARRAY map\n", __func__);
+	_map_in_map_batch_ops(BPF_MAP_TYPE_HASH_OF_MAPS, BPF_MAP_TYPE_HASH);
+	printf("%s:PASS with inner HASH map\n", __func__);
+}
-- 
2.30.2


^ permalink raw reply related

* [PATCH bpf-next v1 1/2] bpf: Extend batch operations for map-in-map bpf-maps
From: Takshak Chahande @ 2022-04-20  1:17 UTC (permalink / raw)
  To: netdev; +Cc: ast, kernel-team, ctakshak, ndixit, kafai, andriin

This patch extends batch operations support for map-in-map map-types:
BPF_MAP_TYPE_HASH_OF_MAPS and BPF_MAP_TYPE_ARRAY_OF_MAPS

A usecase where outer HASH map holds hundred of VIP entries and its
associated reuse-ports per VIP stored in REUSEPORT_SOCKARRAY type
inner map, needs to do batch operation for performance gain.

This patch leverages the exiting generic functions for most of the batch
operations. As map-in-map's value contains the actual reference of the inner map,
for BPF_MAP_TYPE_HASH_OF_MAPS type, it needed an extra step to fetch the
map_id from the reference value.

selftests are added in next patch 2/2

Signed-off-by: Takshak Chahande <ctakshak@fb.com>
---
 kernel/bpf/arraymap.c |  2 ++
 kernel/bpf/hashtab.c  | 12 ++++++++++--
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 7f145aefbff8..f0852b6617cc 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -1344,6 +1344,8 @@ const struct bpf_map_ops array_of_maps_map_ops = {
 	.map_fd_put_ptr = bpf_map_fd_put_ptr,
 	.map_fd_sys_lookup_elem = bpf_map_fd_sys_lookup_elem,
 	.map_gen_lookup = array_of_map_gen_lookup,
+	.map_lookup_batch = generic_map_lookup_batch,
+	.map_update_batch = generic_map_update_batch,
 	.map_check_btf = map_check_no_btf,
 	.map_btf_name = "bpf_array",
 	.map_btf_id = &array_of_maps_map_btf_id,
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index c68fbebc8c00..fd537bfba84c 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -139,7 +139,7 @@ static inline bool htab_use_raw_lock(const struct bpf_htab *htab)
 
 static void htab_init_buckets(struct bpf_htab *htab)
 {
-	unsigned i;
+	unsigned int i;
 
 	for (i = 0; i < htab->n_buckets; i++) {
 		INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i);
@@ -1594,7 +1594,7 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
 	void __user *uvalues = u64_to_user_ptr(attr->batch.values);
 	void __user *ukeys = u64_to_user_ptr(attr->batch.keys);
 	void __user *ubatch = u64_to_user_ptr(attr->batch.in_batch);
-	u32 batch, max_count, size, bucket_size;
+	u32 batch, max_count, size, bucket_size, map_id;
 	struct htab_elem *node_to_free = NULL;
 	u64 elem_map_flags, map_flags;
 	struct hlist_nulls_head *head;
@@ -1719,6 +1719,13 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
 			}
 		} else {
 			value = l->key + roundup_key_size;
+			if (map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS) {
+				struct bpf_map **inner_map = value;
+				 /* Actual value is the id of the inner map */
+				map_id = map->ops->map_fd_sys_lookup_elem(*inner_map);
+				value = &map_id;
+			}
+
 			if (elem_map_flags & BPF_F_LOCK)
 				copy_map_value_locked(map, dst_val, value,
 						      true);
@@ -2425,6 +2432,7 @@ const struct bpf_map_ops htab_of_maps_map_ops = {
 	.map_fd_sys_lookup_elem = bpf_map_fd_sys_lookup_elem,
 	.map_gen_lookup = htab_of_map_gen_lookup,
 	.map_check_btf = map_check_no_btf,
+	BATCH_OPS(htab),
 	.map_btf_name = "bpf_htab",
 	.map_btf_id = &htab_of_maps_map_btf_id,
 };
-- 
2.30.2


^ permalink raw reply related

* Re: [PATCH net 2/2] virtio_net: check L3 protocol for VLAN packets
From: Hangbin Liu @ 2022-04-20  1:11 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Maxim Mikityanskiy, Mike Pattrick, Michael S . Tsirkin, netdev,
	Eric Dumazet, virtualization, Balazs Nemeth, Jakub Kicinski,
	Paolo Abeni, David S . Miller
In-Reply-To: <CA+FuTSdLGUgbkP3U+zmqoFzrewnUUN3pci8q8oNfHzo11ZhRZg@mail.gmail.com>

Hi Willem,

On Tue, Apr 19, 2022 at 09:52:46AM -0400, Willem de Bruijn wrote:
> Segmentation offload requires checksum offload. Packets that request

OK, makes sense.

> GSO but not NEEDS_CSUM are an aberration. We had to go out of our way
> to handle them because the original implementation did not explicitly
> flag and drop these. But we should not extend that to new types.

So do you mean, the current gso types are enough, we should not extend to
handle VLAN headers if no NEEDS_CSUM flag. This patch can be dropped, right?

Although I don't understand why we should not extend to support VLAN GSO.
I'm OK if you think this patch should be dropped when I re-post patch 1/2 to
net-next.

Thanks
Hangbin

^ permalink raw reply

* Re: [PATCH net 1/2] net/af_packet: adjust network header position for VLAN tagged packets
From: Hangbin Liu @ 2022-04-20  0:59 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Willem de Bruijn, Maxim Mikityanskiy, Mike Pattrick, netdev,
	Eric Dumazet, virtualization, Balazs Nemeth, Flavio Leitner,
	Jakub Kicinski, Paolo Abeni, David S . Miller, Jason Wang
In-Reply-To: <20220419101325-mutt-send-email-mst@kernel.org>

On Tue, Apr 19, 2022 at 10:26:09AM -0400, Michael S. Tsirkin wrote:
> > > There are also some duplicated codes in these *_snd functions.
> > > I think we can move them out to one single function.
> > 
> > Please don't refactor this code. It will complicate future backports
> > of stable fixes.
> 
> Hmm I don't know offhand which duplication this refers to specifically
> so maybe it's not worth addressing specifically but generally not
> cleaning up code just because of backports seems wrong ...

Yes, packet_snd() and tpacket_snd() share same addr/msg checking logic that
I think we can clean up.

> > > > stretching the definition of the flags to include VLAN is acceptable
> > > > (unlike outright tunnels), but even then I would suggest for net-next.
> > >
> > > As I asked, I'm not familiar with virtio code. Do you think if I should
> > > add a new VIRTIO_NET_HDR_GSO_VLAN flag? It's only a L2 flag without any L3
> > > info. If I add something like VIRTIO_NET_HDR_GSO_VLAN_TCPV4/TCPV6/UDP. That
> > > would add more combinations. Which doesn't like a good idea.
> > 
> > I would prefer a new flag to denote this type, so that we can be
> > strict and only change the datapath for packets that have this flag
> > set (and thus express the intent).
> > 
> > But the VIRTIO_NET_HDR types are defined in the virtio spec. The
> > maintainers should probably chime in.
> 
> Yes, it's a UAPI extension, not to be done lightly. In this case IIUC
> gso_type in the header is only u8 - 8 bits and 5 of these are already
> used.  So I don't think the virtio TC will be all that happy to burn up
> a bit unless a clear benefit can be demonstrated. 
> 
> I agree with the net-next proposal, I think it's more a feature than a
> bugfix. In particular I think a Fixes tag can also be dropped in that
> IIUC GSO for vlan packets didn't work even before that commit - right?

Right. virtio_net_hdr GSO with vlan doesn't work before.
I will post this to net-next.

Thanks
Hangbin

^ permalink raw reply

* Re: [PATCH 1/2] ath10k: search for default BDF name provided in DT
From: Abhishek Kumar @ 2022-04-20  0:43 UTC (permalink / raw)
  To: Kalle Valo
  Cc: Doug Anderson, Rakesh Pillai, LKML, linux-wireless,
	David S. Miller, Jakub Kicinski, ath10k, netdev
In-Reply-To: <87fsmio9y8.fsf@kernel.org>

On Tue, Apr 12, 2022 at 3:47 AM Kalle Valo <kvalo@kernel.org> wrote:
>
> Abhishek Kumar <kuabhs@chromium.org> writes:
>
> > Hi All,
> >
> > Apologies for the late reply, too many things at the same time.
>
> Trust me, I know the feeling :)
>
> > Just a quick note, Qcomm provided a new BDF file with support for
> > 'bus=snoc,qmi-board-id=ff' as well, so even without this patch, the
> > non-programmed devices(with board-id=0xff) would work. However, for
> > optimization of the BDF search strategy, the above mentioned strategy
> > would still not work: - The stripping of full Board name would not
> > work if board-id itself is not programmed i.e. =0xff e.g. for
> > 'bus=snoc,qmi-board-id=ff,qmi-chip-id=320,variant=GO_LAZOR' => no
> > match 'bus=snoc,qmi-board-id=ff,qmi-chip-id=320' => no match
> > 'bus=snoc,qmi-board-id=ff' => no match 'bus=snoc' => no match because
> > all the BDFs contains board-id=67
>
> Sorry, not fully following your here. Are you saying that the problem is
> that WCN3990/hw1.0/board-2.bin is missing board file for 'bus=snoc'?
Ya, that is what I meant here, the board-2.bin file does not contain
an entry for 'bus=snoc' and so if board-id=oxff then still there
cannot be any BDF that matches. So adding BDF for 'bus=snoc' would
simplify the approach.
> That's easy to add, each board file within board-2.bin has multiple
> names so we can easily select one board file for which we add
> 'bus=snoc'.
>
> > So with this DT patch specifically for case 'bus=snoc,qmi-board-id=ff'
> > => no match, we fallback to default case ( the one provided in DT)
> > i.e. 'bus=snoc,qmi-board-id=67' => match . I do not see how exactly
> > the driver can know that it should check for a board-id of 67.
>
> Sorry, not following you here either. Why would the driver need to use
> board-id 67?
>
> > However, to still remove dependency on the DT, we can make the
> > board-id as no-op if it is not programmed i.e. if the board-id=ff then
> > we would pick any BDF that match 'bus=snoc,qmi-board-id=<XX>' where XX
> > can be any arbitrary number. Thoughts ?
>
> To me using just 'bus=snoc' is more logical than picking up an arbitrary
> number. But I might be missing something here.
The reason I mentioned that if the board-id=oxff then pick any
available BDF entry which matches
'bus=snoc,qmi-board-id=<XX>' is because:
- This will atleast let the wlan chip to boot.
- There is no BDF for 'bus=snoc' , so further stripping of boardname
will not find any match, but as you mentioned that BDF for 'bus=snoc'
can be added, then this will make the logic simpler and we don't have
to pick 'bus=snoc,qmi-board-id=<XX>' with any arbitrary board-id.
I will rollout a patch with this approach.
>
> --
> https://patchwork.kernel.org/project/linux-wireless/list/
>
> https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

Thanks
Abhishek

^ permalink raw reply

* Re: [PATCH RFC 08/15] SUNRPC: Add RPC_TASK_CORK flag
From: Chuck Lever III @ 2022-04-20  0:34 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: linux-cifs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-nvme@lists.infradead.org, netdev@vger.kernel.org,
	simo@redhat.com, ak@tempesta-tech.com, borisp@nvidia.com,
	linux-nfs@vger.kernel.org
In-Reply-To: <04ca809b6dc34bc300b5c39e69ad8dc79f8d8cfd.camel@hammerspace.com>


> On Apr 19, 2022, at 6:09 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> On Tue, 2022-04-19 at 19:40 +0000, Chuck Lever III wrote:
>> 
>> 
>>> On Apr 19, 2022, at 3:04 PM, Trond Myklebust
>>> <trondmy@hammerspace.com> wrote:
>>> 
>>> On Tue, 2022-04-19 at 18:16 +0000, Chuck Lever III wrote:
>>>> 
>>>> 
>>>>> On Apr 18, 2022, at 10:57 PM, Trond Myklebust
>>>>> <trondmy@hammerspace.com> wrote:
>>>>> 
>>>>>> On Mon, 2022-04-18 at 12:52 -0400, Chuck Lever wrote:
>>>>>>> Introduce a mechanism to cause xprt_transmit() to break out
>>>>>>> of
>>>>>>> its
>>>>>>> sending loop at a specific rpc_rqst, rather than draining the
>>>>>>> whole
>>>>>>> transmit queue.
>>>>>>> 
>>>>>>> This enables the client to send just an RPC TLS probe and
>>>>>>> then
>>>>>>> wait
>>>>>>> for the response before proceeding with the rest of the
>>>>>>> queue.
>>>>>>> 
>>>>>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>>>>>> ---
>>>>>>>  include/linux/sunrpc/sched.h  |    2 ++
>>>>>>>  include/trace/events/sunrpc.h |    1 +
>>>>>>>  net/sunrpc/xprt.c             |    2 ++
>>>>>>>  3 files changed, 5 insertions(+)
>>>>>>> 
>>>>>>> diff --git a/include/linux/sunrpc/sched.h
>>>>>>> b/include/linux/sunrpc/sched.h
>>>>>>> index 599133fb3c63..f8c09638fa69 100644
>>>>>>> --- a/include/linux/sunrpc/sched.h
>>>>>>> +++ b/include/linux/sunrpc/sched.h
>>>>>>> @@ -125,6 +125,7 @@ struct rpc_task_setup {
>>>>>>>  #define RPC_TASK_TLSCRED               0x00000008      /*
>>>>>>> Use
>>>>>>> AUTH_TLS credential */
>>>>>>>  #define RPC_TASK_NULLCREDS             0x00000010      /*
>>>>>>> Use
>>>>>>> AUTH_NULL credential */
>>>>>>>  #define RPC_CALL_MAJORSEEN             0x00000020      /*
>>>>>>> major
>>>>>>> timeout seen */
>>>>>>> +#define RPC_TASK_CORK                  0x00000040      /*
>>>>>>> cork
>>>>>>> the
>>>>>>> xmit queue */
>>>>>>>  #define RPC_TASK_DYNAMIC               0x00000080      /*
>>>>>>> task
>>>>>>> was
>>>>>>> kmalloc'ed */
>>>>>>>  #define        RPC_TASK_NO_ROUND_ROBIN        
>>>>>>> 0x00000100     
>>>>>>> /*
>>>>>>> send requests on "main" xprt */
>>>>>>>  #define RPC_TASK_SOFT                  0x00000200      /*
>>>>>>> Use
>>>>>>> soft
>>>>>>> timeouts */
>>>>>>> @@ -137,6 +138,7 @@ struct rpc_task_setup {
>>>>>>>  
>>>>>>>  #define RPC_IS_ASYNC(t)                ((t)->tk_flags &
>>>>>>> RPC_TASK_ASYNC)
>>>>>>>  #define RPC_IS_SWAPPER(t)      ((t)->tk_flags &
>>>>>>> RPC_TASK_SWAPPER)
>>>>>>> +#define RPC_IS_CORK(t)         ((t)->tk_flags &
>>>>>>> RPC_TASK_CORK)
>>>>>>>  #define RPC_IS_SOFT(t)         ((t)->tk_flags &
>>>>>>> (RPC_TASK_SOFT|RPC_TASK_TIMEOUT))
>>>>>>>  #define RPC_IS_SOFTCONN(t)     ((t)->tk_flags &
>>>>>>> RPC_TASK_SOFTCONN)
>>>>>>>  #define RPC_WAS_SENT(t)                ((t)->tk_flags &
>>>>>>> RPC_TASK_SENT)
>>>>>>> diff --git a/include/trace/events/sunrpc.h
>>>>>>> b/include/trace/events/sunrpc.h
>>>>>>> index 811187c47ebb..e8d6adff1a50 100644
>>>>>>> --- a/include/trace/events/sunrpc.h
>>>>>>> +++ b/include/trace/events/sunrpc.h
>>>>>>> @@ -312,6 +312,7 @@ TRACE_EVENT(rpc_request,
>>>>>>>                 { RPC_TASK_TLSCRED, "TLSCRED"
>>>>>>> },                        \
>>>>>>>                 { RPC_TASK_NULLCREDS, "NULLCREDS"
>>>>>>> },                    \
>>>>>>>                 { RPC_CALL_MAJORSEEN, "MAJORSEEN"
>>>>>>> },                    \
>>>>>>> +               { RPC_TASK_CORK, "CORK"
>>>>>>> },                              \
>>>>>>>                 { RPC_TASK_DYNAMIC, "DYNAMIC"
>>>>>>> },                        \
>>>>>>>                 { RPC_TASK_NO_ROUND_ROBIN, "NO_ROUND_ROBIN"
>>>>>>> },          \
>>>>>>>                 { RPC_TASK_SOFT, "SOFT"
>>>>>>> },                              \
>>>>>>> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
>>>>>>> index 86d62cffba0d..4b303b945b51 100644
>>>>>>> --- a/net/sunrpc/xprt.c
>>>>>>> +++ b/net/sunrpc/xprt.c
>>>>>>> @@ -1622,6 +1622,8 @@ xprt_transmit(struct rpc_task *task)
>>>>>>>                 if (xprt_request_data_received(task) &&
>>>>>>>                     !test_bit(RPC_TASK_NEED_XMIT, &task-
>>>>>>>> tk_runstate))
>>>>>>>                         break;
>>>>>>> +               if (RPC_IS_CORK(task))
>>>>>>> +                       break;
>>>>>>>                 cond_resched_lock(&xprt->queue_lock);
>>>>>>>         }
>>>>>>>         spin_unlock(&xprt->queue_lock);
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> This is entirely the wrong place for this kind of control
>>>>>> mechanism.
>>>>> 
>>>>> I'm not sure I entirely understand your concern, so bear with
>>>>> me while I try to clarify.
>>>>> 
>>>>> 
>>>>>> TLS vs not-TLS needs to be decided up front when we initialise
>>>>>> the
>>>>>> transport (i.e. at mount time or whenever the pNFS channels are
>>>>>> set
>>>>>> up). Otherwise, we're vulnerable to downgrade attacks.
>>>>> 
>>>>> Downgrade attacks are prevented by using "xprtsec=tls" because
>>>>> in that case, transport creation fails if either the AUTH_TLS
>>>>> fails or the handshake fails.
>>>>> 
>>>>> The TCP connection has to be established first, though. Then the
>>>>> client can send the RPC_AUTH_TLS probe, which is the same as the
>>>>> NULL ping that it already sends. That mechanism is independent
>>>>> of the lower layer transport (TCP in this case).
>>>>> 
>>>>> Therefore, RPC traffic must be stoppered while the client:
>>>>> 
>>>>> 1. waits for the AUTH_TLS probe's reply, and
>>>>> 
>>>>> 2. waits for the handshake to complete
>>>>> 
>>>>> Because an RPC message is involved in this interaction, I didn't
>>>>> see a way to implement it completely within xprtsock's TCP
>>>>> connection logic. IMO, driving the handshake has to be done by
>>>>> the generic RPC client.
>>>>> 
>>>>> So, do you mean that I need to replace RPC_TASK_CORK with a
>>>>> special return code from xs_tcp_send_request() ?
>>> 
>>> 
>>> I mean the right mechanism for controlling whether or not the
>>> transport
>>> is ready to serve RPC requests is through the XPRT_CONNECTED flag.
>>> All
>>> the existing generic RPC error handling, congestion handling, etc
>>> depends on that flag being set correctly.
>>> 
>>> Until the TLS socket has completed its handshake protocol and is
>>> ready
>>> to transmit data, it should not be declared connected. The
>>> distinction
>>> between the two states 'TCP is unconnected' and 'TLS handshake is
>>> incomplete' is a socket/transport setup detail as far as the RPC
>>> xprt
>>> layer is concerned: just another set of intermediate states between
>>> SYN_SENT and ESTABLISHED.
>> 
>> First, TLS is technically an upper layer protocol. It's not
>> part of the transport protocol. This is exactly how it's
>> implemented in the Linux kernel. And, TLS works on transports
>> other than TCP, so that makes it a reasonable candidate for
>> treatment in the generic client rather than in a particular
>> transport mechanism.
> 
> Sorry, but no! As far as the RPC layer is concerned, there is no
> difference between a TLS socket and a TCP socket. The xprt layer should
> not have to know or care about the existence of TLS other that as a
> transport option to be configured at connection time.
> 
>> 
>> Second, the "intermediate states" would be /outside/ of SYN_SENT
>> and ESTABLISHED. A TCP transport has to be in the ESTABLISHED
>> state (ie, the transport's connection handshake has to be
>> complete) before any TLS traffic can go over it.
>> 
> 
> My point is we don't give a damn about the intermediate states in the
> RPC layer.
> 
>> Most importantly, the client has to send an RPC message first
>> before it can start a TLS handshake. The RPC-with-TLS protocol
>> specification requires that the handshake be preceded with the
>> NULL AUTH_TLS request, which is an RPC. Otherwise, there's no
>> way for the server end to know when to expect a handshake.
>> 
> 
> Sure, but those are 2 non-overlapping states. The socket is first in a
> state where it needs to do a NULL ping using regular RPC/TCP. Then it
> needs to do the TLS handshake. Then it transitions into the state where
> it can act like any other transport.

Understood: this architecture more-or-less mimics what the RPC client would see for a transport like QUIC where connection establishment and the security handshake are integrated and handled concurrently.

The reality is that the RPC client’s transport layer will have to deal with these steps separately for TLS-on-UDP and TLS-on-TCP, but hiding the details under the transport switch is fine with me as long as there is a way to send the AUTH_TLS probe before the transport is marked “connected” (see below).


>> In today's RPC client, the underlying connection has to be in
>> the XPRT_CONNECTED state before the RPC client can exchange any
>> RPC transaction, including AUTH_TLS NULL.
>> 
>> To make it work the way you've suggested, we would have to build
>> a mechanism that could send the AUTH_TLS NULL and receive and
>> parse its reply /before/ the client has put the transport into
>> the XPRT_CONNECTED state, and that NULL request would have to
>> be driven from inside the transport instance (not via the FSM
>> where all other RPC traffic originates).
>> 
>> Do you have any suggestions about how to make this last point
>> less painful?
> 
> This isn't too different from what we already do with the rpcbind call
> for performing port discovery. The only difference is that the NULL
> ping needs to happen on the same transport as the one being constructed
> and that it needs to happen after the TCP connection is complete.
> 
> So I'd suggest that TLS/TCP needs to be a different xprt_class than the
> base TCP, then doing the whole "do-NULL-ping-and-TLS-handshake" in the
> connect() callback for that new class.
> 
> The connect() callback can set up a private rpc client and do the NULL
> call asynchronously just like we do in rpcb_getport_async().

“Set up a separate rpc_client to handle the AUTH_TLS probe” was the piece I was missing. The rest of this is pretty much what the RPC changes in this patch series already do, but organized a little differently (for example, they use a “done” callback as you describe below, so we’re already about halfway there).

The idea is to stopper the stream of RPC messages by leaving the xprt marked “not connected” instead of by adding and setting an RPC_TASK_CORK flag. The FSM will no longer be involved at all in dealing with TLS.


> When the
> RPC call completes, we steal the resulting socket from that private rpc
> client and kick off the TLS handshake on it. All that can be done in
> the rpc_call_done callback (i.e. the equivalent of rpcb_getport_done).
> 
> Once the TLS handshake is done, you can set the XPRT_CONNECTED state
> and call xprt_wake_pending_tasks().

^ permalink raw reply

* [PATCH v5] igb: Convert kmap() to kmap_local_page()
From: Alaa Mohamed @ 2022-04-19 23:43 UTC (permalink / raw)
  To: outreachy
  Cc: jesse.brandeburg, anthony.l.nguyen, davem, kuba, pabeni,
	intel-wired-lan, netdev, linux-kernel, ira.weiny,
	eng.alaamohamedsoliman.am

kmap() is being deprecated and these usages are all local to the thread
so there is no reason kmap_local_page() can't be used.

Replace kmap() calls with kmap_local_page().

Signed-off-by: Alaa Mohamed <eng.alaamohamedsoliman.am@gmail.com>
---
changes in V2:
	fix kunmap_local path value to take address of the mapped page.
---
changes in V3:
	edit commit message to be clearer
---
changes in V4:
	edit the commit message
---
changes in V5:
	-edit commit subject
	-edit commit message
---
 drivers/net/ethernet/intel/igb/igb_ethtool.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_ethtool.c b/drivers/net/ethernet/intel/igb/igb_ethtool.c
index 2a5782063f4c..c14fc871dd41 100644
--- a/drivers/net/ethernet/intel/igb/igb_ethtool.c
+++ b/drivers/net/ethernet/intel/igb/igb_ethtool.c
@@ -1798,14 +1798,14 @@ static int igb_check_lbtest_frame(struct igb_rx_buffer *rx_buffer,
 
 	frame_size >>= 1;
 
-	data = kmap(rx_buffer->page);
+	data = kmap_local_page(rx_buffer->page);
 
 	if (data[3] != 0xFF ||
 	    data[frame_size + 10] != 0xBE ||
 	    data[frame_size + 12] != 0xAF)
 		match = false;
 
-	kunmap(rx_buffer->page);
+	kunmap_local(data);
 
 	return match;
 }
-- 
2.35.2


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox