Netdev List
 help / color / mirror / Atom feed
* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Cornelia Huck @ 2018-06-27 10:10 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Siwei Liu, Samudrala, Sridhar, Alexander Duyck, virtio-dev,
	aaron.f.brown, Jiri Pirko, Jakub Kicinski, Netdev, qemu-devel,
	virtualization, konrad.wilk, boris.ostrovsky, Joao Martins,
	Venu Busireddy, vijay.balakrishna
In-Reply-To: <20180623003545-mutt-send-email-mst@kernel.org>

On Sat, 23 Jun 2018 00:43:24 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Fri, Jun 22, 2018 at 05:09:55PM +0200, Cornelia Huck wrote:
> > Would it be more helpful to focus on generic
> > migration support for vfio instead of going about it device by device?  
> 
> Just to note this approach is actually device by device *type*.  It's
> mostly device agnostic for a given device type so you can migrate
> between hosts with very different hardware.

This enables heterogeneous environments, yes.

But one drawback of that is that you cannot exploit any hardware
specialities - it seems you're limited to what the paravirtual device
supports. This is limiting for more homogeneous environments.

> 
> And support for more PV device types has other advantages
> such as security and forward compatibility to future hosts.

But again the drawback is that we can't exploit new capabilities
easily, can we?

> 
> Finally, it all can happen mostly within QEMU. User is currently
> required to enable it but it's pretty lightweight.
> 
> OTOH vfio migration generally requires actual device-specific work, and
> only works when hosts are mostly identical. When they aren't it's easy
> to blame the user, but tools for checking host compatiblity are
> currently non-existent. Upper layer management will also have to learn
> about host and device compatibility wrt migration. At the moment they
> can't even figure it out wrt software versions of vhost in kernel and
> dpdk so I won't hold my breath for all of this happening quickly.

Yes, that's a real problem.

I think one issue here is that we want to support really different
environments. For the case here, we have a lot of different networking
adapters, but the guests are basically interested in one thing: doing
network traffic. On the other hand, I'm thinking of the mainframe
environment, where we have a very limited set of devices to support,
but at the same time want to exploit their specialities, so the pv
approach is limiting. For that use case, generic migration looks more
useful.

^ permalink raw reply

* Re: s390x BPF JIT failures with test_bpf
From: Kleber Souza @ 2018-06-27 10:13 UTC (permalink / raw)
  To: Daniel Borkmann, linux-s390, netdev; +Cc: Alexei Starovoitov
In-Reply-To: <33983d2e-905b-e05b-67e3-11eb9bc6f030@iogearbox.net>

On 06/27/18 12:01, Daniel Borkmann wrote:
> Hi Kleber,
> 
> On 06/27/2018 11:40 AM, Kleber Souza wrote:
> [...]
>> When I load the test_bpf module from mainline (v4.18-rc2) with
>> CONFIG_BPF_JIT_ALWAYS_ON=y on a s390x system I get the following errors:
>>
>> test_bpf: #289 BPF_MAXINSNS: Ctx heavy transformations FAIL to
>> prog_create err=-524 len=4096
>> test_bpf: #290 BPF_MAXINSNS: Call heavy transformations FAIL to
>> prog_create err=-524 len=4096
>> [...]
>> test_bpf: #296 BPF_MAXINSNS: exec all MSH FAIL to prog_create err=-524
>> len=4096
>> test_bpf: #297 BPF_MAXINSNS: ld_abs+get_processor_id FAIL to prog_create
>> err=-524 len=4096
>>
>> From a quick look at the code it seems that
>> arch/s390/net/bpf_jit_comp.c:bpf_int_jit_compile() is failing to JIT
>> compile the test code.
>>
>> Are those failures expected and could be flagged with FLAG_EXPECTED_FAIL
>> on lib/test_bpf.c or are those caused by some issue with the s390x JIT
>> compiler that needs to be fixed?
> 
> JIT doesn't guarantee in general to map really all programs to native insns,
> so some, mostly crafted corner cases could fail. E.g. x86-64 JIT doesn't converge
> on some programs in test_bpf.c and thus falls back to interpreter or simply
> rejects the program in case of CONFIG_BPF_JIT_ALWAYS_ON=y. Above would seem
> likely that it's hitting the BPF_SIZE_MAX that s390 would do. I think it might
> make sense to either have the FLAG_EXPECTED_FAIL in lib/test_bpf.c more fine
> grained as a flag per arch, so we could say it's expected to fail on e.g. s390
> but not on x86 and the like, or just denote it as 'could potentially fail but
> doesn't have to be the case everywhere'.

Hi Daniel,

Thank you for your reply. I will run some more tests to make sure we are
hitting BPF_SIZE_MAX or what exactly is failing and send a patch to flag
it conditionally for s390x.


Thanks,
Kleber

^ permalink raw reply

* [PATCH v2 net-next 0/3] rds: IPv6 support
From: Ka-Cheong Poon @ 2018-06-27 10:23 UTC (permalink / raw)
  To: netdev; +Cc: santosh.shilimkar, davem, rds-devel

This patch set adds IPv6 support to the kernel RDS and related
modules.  Existing RDS apps using IPv4 address continue to run without
any problem.  New RDS apps which want to use IPv6 address can do so by
passing the address in struct sockaddr_in6 to bind(), connect() or
sendmsg().  And those apps also need to use the new IPv6 equivalents
of some of the existing socket options as the existing options use a
32 bit integer to store IP address.

All RDS code now use struct in6_addr to store IP address.  IPv4
address is stored as an IPv4 mapped address.

Header file changes

There are many data structures (RDS socket options) used by RDS apps
which use a 32 bit integer to store IP address. To support IPv6,
struct in6_addr needs to be used. To ensure backward compatibility, a
new data structure is introduced for each of those data structures
which use a 32 bit integer to represent an IP address. And new socket
options are introduced to use those new structures. This means that
existing apps should work without a problem with the new RDS module.
For apps which want to use IPv6, those new data structures and socket
options can be used. IPv4 mapped address is used to represent IPv4
address in the new data structures.

Internally, all RDS data structures which contain an IP address are
changed to use struct in6_addr to store the address. IPv4 address is
stored as an IPv4 mapped address. All the functions which take an IP
address as argument are also changed to use struct in6_addr.

RDS/RDMA/IB uses a private data (struct rds_ib_connect_private)
exchange between endpoints at RDS connection establishment time to
support RDMA. This private data exchange uses a 32 bit integer to
represent an IP address. This needs to be changed in order to support
IPv6. A new private data struct rds6_ib_connect_private is introduced
to handle this. To ensure backward compatibility, an IPv6 capable RDS
stack uses another RDMA listener port (RDS_CM_PORT) to accept IPv6
connection. And it continues to use the original RDS_PORT for IPv4 RDS
connections. When it needs to communicate with an IPv6 peer, it uses
the RDS_TCP_PORT to send the connection set up request.

RDS/TCP changes

TCP related code is changed to support IPv6.  Note that only an IPv6
TCP listener on port RDS_TCP_PORT is created as it can accept both
IPv4 and IPv6 connection requests.

IB/RDMA changes

The initial private data exchange between IB endpoints using RDMA is
changed to support IPv6 address instead, if the peer address is IPv6.
To ensure backward compatibility, annother RDMA listener port
(RDS_CM_PORT) is used to accept IPv6 connection. An IPv6 capable RDS
module continues to use the original RDS_PORT for IPv4 RDS
connections. When it needs to communicate with an IPv6 peer, it uses
the RDS_CM_PORT to send the connection set up request.

Ka-Cheong Poon (3):
  rds: Changing IP address internal representation to struct in6_addr
  rds: Enable RDS IPv6 support
  rds: Extend RDS API for IPv6 support

 include/uapi/linux/rds.h |  71 ++++++++++-
 net/rds/af_rds.c         | 164 ++++++++++++++++++++------
 net/rds/bind.c           | 119 ++++++++++++++-----
 net/rds/cong.c           |  23 ++--
 net/rds/connection.c     | 249 +++++++++++++++++++++++++++++----------
 net/rds/ib.c             | 114 ++++++++++++++++--
 net/rds/ib.h             |  45 +++++--
 net/rds/ib_cm.c          | 301 +++++++++++++++++++++++++++++++++++------------
 net/rds/ib_mr.h          |   2 +
 net/rds/ib_rdma.c        |  24 ++--
 net/rds/ib_recv.c        |  18 +--
 net/rds/ib_send.c        |  10 +-
 net/rds/loop.c           |   7 +-
 net/rds/rdma.c           |   6 +-
 net/rds/rdma_transport.c |  84 ++++++++++---
 net/rds/rdma_transport.h |   2 +
 net/rds/rds.h            |  85 ++++++++-----
 net/rds/recv.c           |  76 +++++++++---
 net/rds/send.c           |  95 ++++++++++++---
 net/rds/tcp.c            | 128 ++++++++++++++++----
 net/rds/tcp.h            |   4 +-
 net/rds/tcp_connect.c    |  68 ++++++++---
 net/rds/tcp_listen.c     |  58 ++++++---
 net/rds/tcp_recv.c       |   9 +-
 net/rds/tcp_send.c       |   4 +-
 net/rds/threads.c        |  69 +++++++++--
 net/rds/transport.c      |  15 ++-
 27 files changed, 1426 insertions(+), 424 deletions(-)

-- 
1.8.3.1

^ permalink raw reply

* [PATCH v2 net-next 3/3] rds: Extend RDS API for IPv6 support
From: Ka-Cheong Poon @ 2018-06-27 10:23 UTC (permalink / raw)
  To: netdev; +Cc: santosh.shilimkar, davem, rds-devel
In-Reply-To: <cover.1530086216.git.ka-cheong.poon@oracle.com>

There are many data structures (RDS socket options) used by RDS apps
which use a 32 bit integer to store IP address. To support IPv6,
struct in6_addr needs to be used. To ensure backward compatibility, a
new data structure is introduced for each of those data structures
which use a 32 bit integer to represent an IP address. And new socket
options are introduced to use those new structures. This means that
existing apps should work without a problem with the new RDS module.
For apps which want to use IPv6, those new data structures and socket
options can be used. IPv4 mapped address is used to represent IPv4
address in the new data structures.

Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
---
 include/uapi/linux/rds.h |  71 +++++++++++++++++++++++++++++++--
 net/rds/connection.c     | 100 +++++++++++++++++++++++++++++++++++++++++++----
 net/rds/ib.c             |  52 ++++++++++++++++++++++++
 net/rds/ib_mr.h          |   2 +
 net/rds/ib_rdma.c        |  11 +++++-
 net/rds/rds.h            |   4 ++
 net/rds/recv.c           |  25 ++++++++++++
 net/rds/tcp.c            |  44 +++++++++++++++++++++
 8 files changed, 298 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/rds.h b/include/uapi/linux/rds.h
index 20c6bd0..518d40f 100644
--- a/include/uapi/linux/rds.h
+++ b/include/uapi/linux/rds.h
@@ -1,6 +1,6 @@
 /* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR Linux-OpenIB) */
 /*
- * Copyright (c) 2008 Oracle.  All rights reserved.
+ * Copyright (c) 2008, 2018 Oracle and/or its affiliates. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -52,7 +52,7 @@
 #define RDS_RECVERR			5
 #define RDS_CONG_MONITOR		6
 #define RDS_GET_MR_FOR_DEST		7
-#define SO_RDS_TRANSPORT		8
+#define SO_RDS_TRANSPORT		9
 
 /* Socket option to tap receive path latency
  *	SO_RDS: SO_RDS_MSG_RXPATH_LATENCY
@@ -118,7 +118,17 @@
 #define RDS_INFO_IB_CONNECTIONS		10008
 #define RDS_INFO_CONNECTION_STATS	10009
 #define RDS_INFO_IWARP_CONNECTIONS	10010
-#define RDS_INFO_LAST			10010
+
+/* PF_RDS6 options */
+#define RDS6_INFO_CONNECTIONS		10011
+#define RDS6_INFO_SEND_MESSAGES		10012
+#define RDS6_INFO_RETRANS_MESSAGES	10013
+#define RDS6_INFO_RECV_MESSAGES		10014
+#define RDS6_INFO_SOCKETS		10015
+#define RDS6_INFO_TCP_SOCKETS		10016
+#define RDS6_INFO_IB_CONNECTIONS	10017
+
+#define RDS_INFO_LAST			10017
 
 struct rds_info_counter {
 	__u8	name[32];
@@ -140,6 +150,15 @@ struct rds_info_connection {
 	__u8		flags;
 } __attribute__((packed));
 
+struct rds6_info_connection {
+	__u64		next_tx_seq;
+	__u64		next_rx_seq;
+	struct in6_addr	laddr;
+	struct in6_addr	faddr;
+	__u8		transport[TRANSNAMSIZ];		/* null term ascii */
+	__u8		flags;
+} __attribute__((packed));
+
 #define RDS_INFO_MESSAGE_FLAG_ACK               0x01
 #define RDS_INFO_MESSAGE_FLAG_FAST_ACK          0x02
 
@@ -153,6 +172,17 @@ struct rds_info_message {
 	__u8		flags;
 } __attribute__((packed));
 
+struct rds6_info_message {
+	__u64	seq;
+	__u32	len;
+	struct in6_addr	laddr;
+	struct in6_addr	faddr;
+	__be16		lport;
+	__be16		fport;
+	__u8		flags;
+	__u8		tos;
+} __attribute__((packed));
+
 struct rds_info_socket {
 	__u32		sndbuf;
 	__be32		bound_addr;
@@ -163,6 +193,16 @@ struct rds_info_socket {
 	__u64		inum;
 } __attribute__((packed));
 
+struct rds6_info_socket {
+	__u32		sndbuf;
+	struct in6_addr	bound_addr;
+	struct in6_addr	connected_addr;
+	__be16		bound_port;
+	__be16		connected_port;
+	__u32		rcvbuf;
+	__u64		inum;
+} __attribute__((packed));
+
 struct rds_info_tcp_socket {
 	__be32          local_addr;
 	__be16          local_port;
@@ -175,6 +215,18 @@ struct rds_info_tcp_socket {
 	__u32           last_seen_una;
 } __attribute__((packed));
 
+struct rds6_info_tcp_socket {
+	struct in6_addr	local_addr;
+	__be16		local_port;
+	struct in6_addr	peer_addr;
+	__be16		peer_port;
+	__u64		hdr_rem;
+	__u64		data_rem;
+	__u32		last_sent_nxt;
+	__u32		last_expected_una;
+	__u32		last_seen_una;
+} __attribute__((packed));
+
 #define RDS_IB_GID_LEN	16
 struct rds_info_rdma_connection {
 	__be32		src_addr;
@@ -189,6 +241,19 @@ struct rds_info_rdma_connection {
 	__u32		rdma_mr_size;
 };
 
+struct rds6_info_rdma_connection {
+	struct in6_addr	src_addr;
+	struct in6_addr	dst_addr;
+	__u8		src_gid[RDS_IB_GID_LEN];
+	__u8		dst_gid[RDS_IB_GID_LEN];
+
+	__u32		max_send_wr;
+	__u32		max_recv_wr;
+	__u32		max_send_sge;
+	__u32		rdma_mr_max;
+	__u32		rdma_mr_size;
+};
+
 /* RDS message Receive Path Latency points */
 enum rds_message_rxpath_latency {
 	RDS_MSG_RX_HDR_TO_DGRAM_START = 0,
diff --git a/net/rds/connection.c b/net/rds/connection.c
index 8c5d093..7151527 100644
--- a/net/rds/connection.c
+++ b/net/rds/connection.c
@@ -488,15 +488,19 @@ void rds_conn_destroy(struct rds_connection *conn)
 
 static void __rds_inc_msg_cp(struct rds_incoming *inc,
 			     struct rds_info_iterator *iter,
-			     void *saddr, void *daddr, int flip)
+			     void *saddr, void *daddr, int flip, bool isv6)
 {
-	rds_inc_info_copy(inc, iter, *(__be32 *)saddr, *(__be32 *)daddr, flip);
+	if (isv6)
+		rds6_inc_info_copy(inc, iter, saddr, daddr, flip);
+	else
+		rds_inc_info_copy(inc, iter, *(__be32 *)saddr,
+				  *(__be32 *)daddr, flip);
 }
 
 static void rds_conn_message_info_cmn(struct socket *sock, unsigned int len,
 				      struct rds_info_iterator *iter,
 				      struct rds_info_lengths *lens,
-				      int want_send)
+				      int want_send, bool isv6)
 {
 	struct hlist_head *head;
 	struct list_head *list;
@@ -507,7 +511,10 @@ static void rds_conn_message_info_cmn(struct socket *sock, unsigned int len,
 	size_t i;
 	int j;
 
-	len /= sizeof(struct rds_info_message);
+	if (isv6)
+		len /= sizeof(struct rds6_info_message);
+	else
+		len /= sizeof(struct rds_info_message);
 
 	rcu_read_lock();
 
@@ -517,6 +524,9 @@ static void rds_conn_message_info_cmn(struct socket *sock, unsigned int len,
 			struct rds_conn_path *cp;
 			int npaths;
 
+			if (!isv6 && conn->c_isv6)
+				continue;
+
 			npaths = (conn->c_trans->t_mp_capable ?
 				 RDS_MPATH_WORKERS : 1);
 
@@ -537,7 +547,7 @@ static void rds_conn_message_info_cmn(struct socket *sock, unsigned int len,
 								 iter,
 								 &conn->c_laddr,
 								 &conn->c_faddr,
-								 0);
+								 0, isv6);
 				}
 
 				spin_unlock_irqrestore(&cp->cp_lock, flags);
@@ -547,7 +557,10 @@ static void rds_conn_message_info_cmn(struct socket *sock, unsigned int len,
 	rcu_read_unlock();
 
 	lens->nr = total;
-	lens->each = sizeof(struct rds_info_message);
+	if (isv6)
+		lens->each = sizeof(struct rds6_info_message);
+	else
+		lens->each = sizeof(struct rds_info_message);
 }
 
 static void rds_conn_message_info(struct socket *sock, unsigned int len,
@@ -555,7 +568,15 @@ static void rds_conn_message_info(struct socket *sock, unsigned int len,
 				  struct rds_info_lengths *lens,
 				  int want_send)
 {
-	rds_conn_message_info_cmn(sock, len, iter, lens, want_send);
+	rds_conn_message_info_cmn(sock, len, iter, lens, want_send, false);
+}
+
+static void rds6_conn_message_info(struct socket *sock, unsigned int len,
+				   struct rds_info_iterator *iter,
+				   struct rds_info_lengths *lens,
+				   int want_send)
+{
+	rds_conn_message_info_cmn(sock, len, iter, lens, want_send, true);
 }
 
 static void rds_conn_message_info_send(struct socket *sock, unsigned int len,
@@ -565,6 +586,13 @@ static void rds_conn_message_info_send(struct socket *sock, unsigned int len,
 	rds_conn_message_info(sock, len, iter, lens, 1);
 }
 
+static void rds6_conn_message_info_send(struct socket *sock, unsigned int len,
+					struct rds_info_iterator *iter,
+					struct rds_info_lengths *lens)
+{
+	rds6_conn_message_info(sock, len, iter, lens, 1);
+}
+
 static void rds_conn_message_info_retrans(struct socket *sock,
 					  unsigned int len,
 					  struct rds_info_iterator *iter,
@@ -573,6 +601,14 @@ static void rds_conn_message_info_retrans(struct socket *sock,
 	rds_conn_message_info(sock, len, iter, lens, 0);
 }
 
+static void rds6_conn_message_info_retrans(struct socket *sock,
+					   unsigned int len,
+					   struct rds_info_iterator *iter,
+					   struct rds_info_lengths *lens)
+{
+	rds6_conn_message_info(sock, len, iter, lens, 0);
+}
+
 void rds_for_each_conn_info(struct socket *sock, unsigned int len,
 			  struct rds_info_iterator *iter,
 			  struct rds_info_lengths *lens,
@@ -688,6 +724,34 @@ static int rds_conn_info_visitor(struct rds_conn_path *cp, void *buffer)
 	return 1;
 }
 
+static int rds6_conn_info_visitor(struct rds_conn_path *cp, void *buffer)
+{
+	struct rds6_info_connection *cinfo6 = buffer;
+	struct rds_connection *conn = cp->cp_conn;
+
+	cinfo6->next_tx_seq = cp->cp_next_tx_seq;
+	cinfo6->next_rx_seq = cp->cp_next_rx_seq;
+	cinfo6->laddr = conn->c_laddr;
+	cinfo6->faddr = conn->c_faddr;
+	strncpy(cinfo6->transport, conn->c_trans->t_name,
+		sizeof(cinfo6->transport));
+	cinfo6->flags = 0;
+
+	rds_conn_info_set(cinfo6->flags, test_bit(RDS_IN_XMIT, &cp->cp_flags),
+			  SENDING);
+	/* XXX Future: return the state rather than these funky bits */
+	rds_conn_info_set(cinfo6->flags,
+			  atomic_read(&cp->cp_state) == RDS_CONN_CONNECTING,
+			  CONNECTING);
+	rds_conn_info_set(cinfo6->flags,
+			  atomic_read(&cp->cp_state) == RDS_CONN_UP,
+			  CONNECTED);
+	/* Just return 1 as there is no error case. This is a helper function
+	 * for rds_walk_conn_path_info() and it wants a return value.
+	 */
+	return 1;
+}
+
 static void rds_conn_info(struct socket *sock, unsigned int len,
 			  struct rds_info_iterator *iter,
 			  struct rds_info_lengths *lens)
@@ -700,6 +764,18 @@ static void rds_conn_info(struct socket *sock, unsigned int len,
 				sizeof(struct rds_info_connection));
 }
 
+static void rds6_conn_info(struct socket *sock, unsigned int len,
+			   struct rds_info_iterator *iter,
+			   struct rds_info_lengths *lens)
+{
+	u64 buffer[(sizeof(struct rds6_info_connection) + 7) / 8];
+
+	rds_walk_conn_path_info(sock, len, iter, lens,
+				rds6_conn_info_visitor,
+				buffer,
+				sizeof(struct rds6_info_connection));
+}
+
 int rds_conn_init(void)
 {
 	rds_conn_slab = kmem_cache_create("rds_connection",
@@ -713,6 +789,11 @@ int rds_conn_init(void)
 			       rds_conn_message_info_send);
 	rds_info_register_func(RDS_INFO_RETRANS_MESSAGES,
 			       rds_conn_message_info_retrans);
+	rds_info_register_func(RDS6_INFO_CONNECTIONS, rds6_conn_info);
+	rds_info_register_func(RDS6_INFO_SEND_MESSAGES,
+			       rds6_conn_message_info_send);
+	rds_info_register_func(RDS6_INFO_RETRANS_MESSAGES,
+			       rds6_conn_message_info_retrans);
 
 	return 0;
 }
@@ -730,6 +811,11 @@ void rds_conn_exit(void)
 				 rds_conn_message_info_send);
 	rds_info_deregister_func(RDS_INFO_RETRANS_MESSAGES,
 				 rds_conn_message_info_retrans);
+	rds_info_deregister_func(RDS6_INFO_CONNECTIONS, rds6_conn_info);
+	rds_info_deregister_func(RDS6_INFO_SEND_MESSAGES,
+				 rds6_conn_message_info_send);
+	rds_info_deregister_func(RDS6_INFO_RETRANS_MESSAGES,
+				 rds6_conn_message_info_retrans);
 }
 
 /*
diff --git a/net/rds/ib.c b/net/rds/ib.c
index 756225c..63d95ea 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -321,6 +321,43 @@ static int rds_ib_conn_info_visitor(struct rds_connection *conn,
 	return 1;
 }
 
+/* IPv6 version of rds_ib_conn_info_visitor(). */
+static int rds6_ib_conn_info_visitor(struct rds_connection *conn,
+				     void *buffer)
+{
+	struct rds6_info_rdma_connection *iinfo6 = buffer;
+	struct rds_ib_connection *ic;
+
+	/* We will only ever look at IB transports */
+	if (conn->c_trans != &rds_ib_transport)
+		return 0;
+
+	iinfo6->src_addr = conn->c_laddr;
+	iinfo6->dst_addr = conn->c_faddr;
+
+	memset(&iinfo6->src_gid, 0, sizeof(iinfo6->src_gid));
+	memset(&iinfo6->dst_gid, 0, sizeof(iinfo6->dst_gid));
+
+	if (rds_conn_state(conn) == RDS_CONN_UP) {
+		struct rds_ib_device *rds_ibdev;
+		struct rdma_dev_addr *dev_addr;
+
+		ic = conn->c_transport_data;
+		dev_addr = &ic->i_cm_id->route.addr.dev_addr;
+		rdma_addr_get_sgid(dev_addr,
+				   (union ib_gid *)&iinfo6->src_gid);
+		rdma_addr_get_dgid(dev_addr,
+				   (union ib_gid *)&iinfo6->dst_gid);
+
+		rds_ibdev = ic->rds_ibdev;
+		iinfo6->max_send_wr = ic->i_send_ring.w_nr;
+		iinfo6->max_recv_wr = ic->i_recv_ring.w_nr;
+		iinfo6->max_send_sge = rds_ibdev->max_sge;
+		rds6_ib_get_mr_info(rds_ibdev, iinfo6);
+	}
+	return 1;
+}
+
 static void rds_ib_ic_info(struct socket *sock, unsigned int len,
 			   struct rds_info_iterator *iter,
 			   struct rds_info_lengths *lens)
@@ -333,6 +370,19 @@ static void rds_ib_ic_info(struct socket *sock, unsigned int len,
 				sizeof(struct rds_info_rdma_connection));
 }
 
+/* IPv6 version of rds_ib_ic_info(). */
+static void rds6_ib_ic_info(struct socket *sock, unsigned int len,
+			    struct rds_info_iterator *iter,
+			    struct rds_info_lengths *lens)
+{
+	u64 buffer[(sizeof(struct rds6_info_rdma_connection) + 7) / 8];
+
+	rds_for_each_conn_info(sock, len, iter, lens,
+			       rds6_ib_conn_info_visitor,
+			       buffer,
+			       sizeof(struct rds6_info_rdma_connection));
+}
+
 /*
  * Early RDS/IB was built to only bind to an address if there is an IPoIB
  * device with that address set.
@@ -441,6 +491,7 @@ void rds_ib_exit(void)
 	rds_ib_set_unloading();
 	synchronize_rcu();
 	rds_info_deregister_func(RDS_INFO_IB_CONNECTIONS, rds_ib_ic_info);
+	rds_info_deregister_func(RDS6_INFO_IB_CONNECTIONS, rds6_ib_ic_info);
 	rds_ib_unregister_client();
 	rds_ib_destroy_nodev_conns();
 	rds_ib_sysctl_exit();
@@ -502,6 +553,7 @@ int rds_ib_init(void)
 	rds_trans_register(&rds_ib_transport);
 
 	rds_info_register_func(RDS_INFO_IB_CONNECTIONS, rds_ib_ic_info);
+	rds_info_register_func(RDS6_INFO_IB_CONNECTIONS, rds6_ib_ic_info);
 
 	goto out;
 
diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index 0ea4ab0..f440ace 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -113,6 +113,8 @@ struct rds_ib_mr_pool *rds_ib_create_mr_pool(struct rds_ib_device *rds_dev,
 					     int npages);
 void rds_ib_get_mr_info(struct rds_ib_device *rds_ibdev,
 			struct rds_info_rdma_connection *iinfo);
+void rds6_ib_get_mr_info(struct rds_ib_device *rds_ibdev,
+			 struct rds6_info_rdma_connection *iinfo6);
 void rds_ib_destroy_mr_pool(struct rds_ib_mr_pool *);
 void *rds_ib_get_mr(struct scatterlist *sg, unsigned long nents,
 		    struct rds_sock *rs, u32 *key_ret);
diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index 0ec9df0..e3c8bbb 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2006, 2017 Oracle and/or its affiliates. All rights reserved.
+ * Copyright (c) 2006, 2018 Oracle and/or its affiliates. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -180,6 +180,15 @@ void rds_ib_get_mr_info(struct rds_ib_device *rds_ibdev, struct rds_info_rdma_co
 	iinfo->rdma_mr_size = pool_1m->fmr_attr.max_pages;
 }
 
+void rds6_ib_get_mr_info(struct rds_ib_device *rds_ibdev,
+			 struct rds6_info_rdma_connection *iinfo6)
+{
+	struct rds_ib_mr_pool *pool_1m = rds_ibdev->mr_1m_pool;
+
+	iinfo6->rdma_mr_max = pool_1m->max_items;
+	iinfo6->rdma_mr_size = pool_1m->fmr_attr.max_pages;
+}
+
 struct rds_ib_mr *rds_ib_reuse_mr(struct rds_ib_mr_pool *pool)
 {
 	struct rds_ib_mr *ibmr = NULL;
diff --git a/net/rds/rds.h b/net/rds/rds.h
index f5f99d1..fd63b86 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -875,6 +875,10 @@ int rds_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 void rds_inc_info_copy(struct rds_incoming *inc,
 		       struct rds_info_iterator *iter,
 		       __be32 saddr, __be32 daddr, int flip);
+void rds6_inc_info_copy(struct rds_incoming *inc,
+			struct rds_info_iterator *iter,
+			struct in6_addr *saddr, struct in6_addr *daddr,
+			int flip);
 
 /* send.c */
 int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len);
diff --git a/net/rds/recv.c b/net/rds/recv.c
index 4217961..10a2d0f 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -792,3 +792,28 @@ void rds_inc_info_copy(struct rds_incoming *inc,
 
 	rds_info_copy(iter, &minfo, sizeof(minfo));
 }
+
+void rds6_inc_info_copy(struct rds_incoming *inc,
+			struct rds_info_iterator *iter,
+			struct in6_addr *saddr, struct in6_addr *daddr,
+			int flip)
+{
+	struct rds6_info_message minfo6;
+
+	minfo6.seq = be64_to_cpu(inc->i_hdr.h_sequence);
+	minfo6.len = be32_to_cpu(inc->i_hdr.h_len);
+
+	if (flip) {
+		minfo6.laddr = *daddr;
+		minfo6.faddr = *saddr;
+		minfo6.lport = inc->i_hdr.h_dport;
+		minfo6.fport = inc->i_hdr.h_sport;
+	} else {
+		minfo6.laddr = *saddr;
+		minfo6.faddr = *daddr;
+		minfo6.lport = inc->i_hdr.h_sport;
+		minfo6.fport = inc->i_hdr.h_dport;
+	}
+
+	rds_info_copy(iter, &minfo6, sizeof(minfo6));
+}
diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index 890d0e1..7028d6e 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -273,6 +273,48 @@ static void rds_tcp_tc_info(struct socket *rds_sock, unsigned int len,
 	spin_unlock_irqrestore(&rds_tcp_tc_list_lock, flags);
 }
 
+/* Handle RDS6_INFO_TCP_SOCKETS socket option. It returns both IPv4 and
+ * IPv6 connections. IPv4 connection address is returned in an IPv4 mapped
+ * address.
+ */
+static void rds6_tcp_tc_info(struct socket *sock, unsigned int len,
+			     struct rds_info_iterator *iter,
+			     struct rds_info_lengths *lens)
+{
+	struct rds6_info_tcp_socket tsinfo6;
+	struct rds_tcp_connection *tc;
+	unsigned long flags;
+
+	spin_lock_irqsave(&rds_tcp_tc_list_lock, flags);
+
+	if (len / sizeof(tsinfo6) < rds6_tcp_tc_count)
+		goto out;
+
+	list_for_each_entry(tc, &rds_tcp_tc_list, t_list_item) {
+		struct sock *sk = tc->t_sock->sk;
+		struct inet_sock *inet = inet_sk(sk);
+
+		tsinfo6.local_addr = sk->sk_v6_rcv_saddr;
+		tsinfo6.local_port = inet->inet_sport;
+		tsinfo6.peer_addr = sk->sk_v6_daddr;
+		tsinfo6.peer_port = inet->inet_dport;
+
+		tsinfo6.hdr_rem = tc->t_tinc_hdr_rem;
+		tsinfo6.data_rem = tc->t_tinc_data_rem;
+		tsinfo6.last_sent_nxt = tc->t_last_sent_nxt;
+		tsinfo6.last_expected_una = tc->t_last_expected_una;
+		tsinfo6.last_seen_una = tc->t_last_seen_una;
+
+		rds_info_copy(iter, &tsinfo6, sizeof(tsinfo6));
+	}
+
+out:
+	lens->nr = rds6_tcp_tc_count;
+	lens->each = sizeof(tsinfo6);
+
+	spin_unlock_irqrestore(&rds_tcp_tc_list_lock, flags);
+}
+
 static int rds_tcp_laddr_check(struct net *net, const struct in6_addr *addr,
 			       __u32 scope_id)
 {
@@ -628,6 +670,7 @@ static void rds_tcp_exit(void)
 	rds_tcp_set_unloading();
 	synchronize_rcu();
 	rds_info_deregister_func(RDS_INFO_TCP_SOCKETS, rds_tcp_tc_info);
+	rds_info_deregister_func(RDS6_INFO_TCP_SOCKETS, rds6_tcp_tc_info);
 	unregister_pernet_device(&rds_tcp_net_ops);
 	rds_tcp_destroy_conns();
 	rds_trans_unregister(&rds_tcp_transport);
@@ -659,6 +702,7 @@ static int rds_tcp_init(void)
 	rds_trans_register(&rds_tcp_transport);
 
 	rds_info_register_func(RDS_INFO_TCP_SOCKETS, rds_tcp_tc_info);
+	rds_info_register_func(RDS6_INFO_TCP_SOCKETS, rds6_tcp_tc_info);
 
 	goto out;
 out_recv:
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH v2 net-next 1/3] rds: Changing IP address internal representation to struct in6_addr
From: Ka-Cheong Poon @ 2018-06-27 10:23 UTC (permalink / raw)
  To: netdev; +Cc: santosh.shilimkar, davem, rds-devel
In-Reply-To: <cover.1530086216.git.ka-cheong.poon@oracle.com>

This patch changes the internal representation of an IP address to use
struct in6_addr.  IPv4 address is stored as an IPv4 mapped address.
All the functions which take an IP address as argument are also
changed to use struct in6_addr.  But RDS socket layer is not modified
such that it still does not accept IPv6 address from an application.
And RDS layer does not accept nor initiate IPv6 connections.

v2: Fixed sparse warnings.

Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
---
 net/rds/af_rds.c         | 138 ++++++++++++++++------
 net/rds/bind.c           |  91 +++++++++-----
 net/rds/cong.c           |  23 ++--
 net/rds/connection.c     | 132 +++++++++++++--------
 net/rds/ib.c             |  17 +--
 net/rds/ib.h             |  45 +++++--
 net/rds/ib_cm.c          | 300 +++++++++++++++++++++++++++++++++++------------
 net/rds/ib_rdma.c        |  15 +--
 net/rds/ib_recv.c        |  18 +--
 net/rds/ib_send.c        |  10 +-
 net/rds/loop.c           |   7 +-
 net/rds/rdma.c           |   6 +-
 net/rds/rdma_transport.c |  56 ++++++---
 net/rds/rds.h            |  69 +++++++----
 net/rds/recv.c           |  51 +++++---
 net/rds/send.c           |  67 ++++++++---
 net/rds/tcp.c            |  32 ++++-
 net/rds/tcp_connect.c    |  34 +++---
 net/rds/tcp_listen.c     |  18 +--
 net/rds/tcp_recv.c       |   9 +-
 net/rds/tcp_send.c       |   4 +-
 net/rds/threads.c        |  69 +++++++++--
 net/rds/transport.c      |  15 ++-
 23 files changed, 857 insertions(+), 369 deletions(-)

diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index ab751a1..fc1a5c6 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2006 Oracle.  All rights reserved.
+ * Copyright (c) 2006, 2018 Oracle and/or its affiliates. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -35,6 +35,7 @@
 #include <linux/kernel.h>
 #include <linux/gfp.h>
 #include <linux/in.h>
+#include <linux/ipv6.h>
 #include <linux/poll.h>
 #include <net/sock.h>
 
@@ -113,26 +114,63 @@ void rds_wake_sk_sleep(struct rds_sock *rs)
 static int rds_getname(struct socket *sock, struct sockaddr *uaddr,
 		       int peer)
 {
-	struct sockaddr_in *sin = (struct sockaddr_in *)uaddr;
 	struct rds_sock *rs = rds_sk_to_rs(sock->sk);
-
-	memset(sin->sin_zero, 0, sizeof(sin->sin_zero));
+	struct sockaddr_in6 *sin6;
+	struct sockaddr_in *sin;
+	int uaddr_len;
 
 	/* racey, don't care */
 	if (peer) {
-		if (!rs->rs_conn_addr)
+		if (ipv6_addr_any(&rs->rs_conn_addr))
 			return -ENOTCONN;
 
-		sin->sin_port = rs->rs_conn_port;
-		sin->sin_addr.s_addr = rs->rs_conn_addr;
+		if (ipv6_addr_v4mapped(&rs->rs_conn_addr)) {
+			sin = (struct sockaddr_in *)uaddr;
+			memset(sin->sin_zero, 0, sizeof(sin->sin_zero));
+			sin->sin_family = AF_INET;
+			sin->sin_port = rs->rs_conn_port;
+			sin->sin_addr.s_addr = rs->rs_conn_addr_v4;
+			uaddr_len = sizeof(*sin);
+		} else {
+			sin6 = (struct sockaddr_in6 *)uaddr;
+			sin6->sin6_family = AF_INET6;
+			sin6->sin6_port = rs->rs_conn_port;
+			sin6->sin6_addr = rs->rs_conn_addr;
+			sin6->sin6_flowinfo = 0;
+			/* scope_id is the same as in the bound address. */
+			sin6->sin6_scope_id = rs->rs_bound_scope_id;
+			uaddr_len = sizeof(*sin6);
+		}
 	} else {
-		sin->sin_port = rs->rs_bound_port;
-		sin->sin_addr.s_addr = rs->rs_bound_addr;
+		/* If socket is not yet bound, set the return address family
+		 * to be AF_UNSPEC (value 0) and the address size to be that
+		 * of an IPv4 address.
+		 */
+		if (ipv6_addr_any(&rs->rs_bound_addr)) {
+			sin = (struct sockaddr_in *)uaddr;
+			memset(sin, 0, sizeof(*sin));
+			sin->sin_family = AF_UNSPEC;
+			return sizeof(*sin);
+		}
+		if (ipv6_addr_v4mapped(&rs->rs_bound_addr)) {
+			sin = (struct sockaddr_in *)uaddr;
+			memset(sin->sin_zero, 0, sizeof(sin->sin_zero));
+			sin->sin_family = AF_INET;
+			sin->sin_port = rs->rs_bound_port;
+			sin->sin_addr.s_addr = rs->rs_bound_addr_v4;
+			uaddr_len = sizeof(*sin);
+		} else {
+			sin6 = (struct sockaddr_in6 *)uaddr;
+			sin6->sin6_family = AF_INET6;
+			sin6->sin6_port = rs->rs_bound_port;
+			sin6->sin6_addr = rs->rs_bound_addr;
+			sin6->sin6_flowinfo = 0;
+			sin6->sin6_scope_id = rs->rs_bound_scope_id;
+			uaddr_len = sizeof(*sin6);
+		}
 	}
 
-	sin->sin_family = AF_INET;
-
-	return sizeof(*sin);
+	return uaddr_len;
 }
 
 /*
@@ -203,11 +241,12 @@ static int rds_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
 static int rds_cancel_sent_to(struct rds_sock *rs, char __user *optval,
 			      int len)
 {
+	struct sockaddr_in6 sin6;
 	struct sockaddr_in sin;
 	int ret = 0;
 
 	/* racing with another thread binding seems ok here */
-	if (rs->rs_bound_addr == 0) {
+	if (ipv6_addr_any(&rs->rs_bound_addr)) {
 		ret = -ENOTCONN; /* XXX not a great errno */
 		goto out;
 	}
@@ -215,14 +254,23 @@ static int rds_cancel_sent_to(struct rds_sock *rs, char __user *optval,
 	if (len < sizeof(struct sockaddr_in)) {
 		ret = -EINVAL;
 		goto out;
+	} else if (len < sizeof(struct sockaddr_in6)) {
+		/* Assume IPv4 */
+		if (copy_from_user(&sin, optval, sizeof(struct sockaddr_in))) {
+			ret = -EFAULT;
+			goto out;
+		}
+		ipv6_addr_set_v4mapped(sin.sin_addr.s_addr, &sin6.sin6_addr);
+		sin6.sin6_port = sin.sin_port;
+	} else {
+		if (copy_from_user(&sin6, optval,
+				   sizeof(struct sockaddr_in6))) {
+			ret = -EFAULT;
+			goto out;
+		}
 	}
 
-	if (copy_from_user(&sin, optval, sizeof(sin))) {
-		ret = -EFAULT;
-		goto out;
-	}
-
-	rds_send_drop_to(rs, &sin);
+	rds_send_drop_to(rs, &sin6);
 out:
 	return ret;
 }
@@ -435,31 +483,41 @@ static int rds_connect(struct socket *sock, struct sockaddr *uaddr,
 		       int addr_len, int flags)
 {
 	struct sock *sk = sock->sk;
-	struct sockaddr_in *sin = (struct sockaddr_in *)uaddr;
+	struct sockaddr_in *sin;
 	struct rds_sock *rs = rds_sk_to_rs(sk);
 	int ret = 0;
 
 	lock_sock(sk);
 
-	if (addr_len != sizeof(struct sockaddr_in)) {
-		ret = -EINVAL;
-		goto out;
-	}
+	switch (addr_len) {
+	case sizeof(struct sockaddr_in):
+		sin = (struct sockaddr_in *)uaddr;
+		if (sin->sin_family != AF_INET) {
+			ret = -EAFNOSUPPORT;
+			break;
+		}
+		if (sin->sin_addr.s_addr == htonl(INADDR_ANY)) {
+			ret = -EDESTADDRREQ;
+			break;
+		}
+		if (IN_MULTICAST(ntohl(sin->sin_addr.s_addr)) ||
+		    sin->sin_addr.s_addr == htonl(INADDR_BROADCAST)) {
+			ret = -EINVAL;
+			break;
+		}
+		ipv6_addr_set_v4mapped(sin->sin_addr.s_addr, &rs->rs_conn_addr);
+		rs->rs_conn_port = sin->sin_port;
+		break;
 
-	if (sin->sin_family != AF_INET) {
-		ret = -EAFNOSUPPORT;
-		goto out;
-	}
+	case sizeof(struct sockaddr_in6):
+		ret = -EPROTONOSUPPORT;
+		break;
 
-	if (sin->sin_addr.s_addr == htonl(INADDR_ANY)) {
-		ret = -EDESTADDRREQ;
-		goto out;
+	default:
+		ret = -EINVAL;
+		break;
 	}
 
-	rs->rs_conn_addr = sin->sin_addr.s_addr;
-	rs->rs_conn_port = sin->sin_port;
-
-out:
 	release_sock(sk);
 	return ret;
 }
@@ -578,8 +636,10 @@ static void rds_sock_inc_info(struct socket *sock, unsigned int len,
 		list_for_each_entry(inc, &rs->rs_recv_queue, i_item) {
 			total++;
 			if (total <= len)
-				rds_inc_info_copy(inc, iter, inc->i_saddr,
-						  rs->rs_bound_addr, 1);
+				rds_inc_info_copy(inc, iter,
+						  inc->i_saddr.s6_addr32[3],
+						  rs->rs_bound_addr_v4,
+						  1);
 		}
 
 		read_unlock(&rs->rs_recv_lock);
@@ -608,8 +668,8 @@ static void rds_sock_info(struct socket *sock, unsigned int len,
 	list_for_each_entry(rs, &rds_sock_list, rs_item) {
 		sinfo.sndbuf = rds_sk_sndbuf(rs);
 		sinfo.rcvbuf = rds_sk_rcvbuf(rs);
-		sinfo.bound_addr = rs->rs_bound_addr;
-		sinfo.connected_addr = rs->rs_conn_addr;
+		sinfo.bound_addr = rs->rs_bound_addr_v4;
+		sinfo.connected_addr = rs->rs_conn_addr_v4;
 		sinfo.bound_port = rs->rs_bound_port;
 		sinfo.connected_port = rs->rs_conn_port;
 		sinfo.inum = sock_i_ino(rds_rs_to_sk(rs));
diff --git a/net/rds/bind.c b/net/rds/bind.c
index 5aa3a64..3822886 100644
--- a/net/rds/bind.c
+++ b/net/rds/bind.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2006 Oracle.  All rights reserved.
+ * Copyright (c) 2006, 2018 Oracle and/or its affiliates. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -33,6 +33,7 @@
 #include <linux/kernel.h>
 #include <net/sock.h>
 #include <linux/in.h>
+#include <linux/ipv6.h>
 #include <linux/if_arp.h>
 #include <linux/jhash.h>
 #include <linux/ratelimit.h>
@@ -42,42 +43,58 @@
 
 static const struct rhashtable_params ht_parms = {
 	.nelem_hint = 768,
-	.key_len = sizeof(u64),
+	.key_len = RDS_BOUND_KEY_LEN,
 	.key_offset = offsetof(struct rds_sock, rs_bound_key),
 	.head_offset = offsetof(struct rds_sock, rs_bound_node),
 	.max_size = 16384,
 	.min_size = 1024,
 };
 
+/* Create a key for the bind hash table manipulation.  Port is in network byte
+ * order.
+ */
+static inline void __rds_create_bind_key(u8 *key, const struct in6_addr *addr,
+					 __be16 port, __u32 scope_id)
+{
+	memcpy(key, addr, sizeof(*addr));
+	key += sizeof(*addr);
+	memcpy(key, &port, sizeof(port));
+	key += sizeof(port);
+	memcpy(key, &scope_id, sizeof(scope_id));
+}
+
 /*
  * Return the rds_sock bound at the given local address.
  *
  * The rx path can race with rds_release.  We notice if rds_release() has
  * marked this socket and don't return a rs ref to the rx path.
  */
-struct rds_sock *rds_find_bound(__be32 addr, __be16 port)
+struct rds_sock *rds_find_bound(const struct in6_addr *addr, __be16 port,
+				__u32 scope_id)
 {
-	u64 key = ((u64)addr << 32) | port;
+	u8 key[RDS_BOUND_KEY_LEN];
 	struct rds_sock *rs;
 
-	rs = rhashtable_lookup_fast(&bind_hash_table, &key, ht_parms);
+	__rds_create_bind_key(key, addr, port, scope_id);
+	rs = rhashtable_lookup_fast(&bind_hash_table, key, ht_parms);
 	if (rs && !sock_flag(rds_rs_to_sk(rs), SOCK_DEAD))
 		rds_sock_addref(rs);
 	else
 		rs = NULL;
 
-	rdsdebug("returning rs %p for %pI4:%u\n", rs, &addr,
-		ntohs(port));
+	rdsdebug("returning rs %p for %pI6c:%u\n", rs, addr,
+		 ntohs(port));
 
 	return rs;
 }
 
 /* returns -ve errno or +ve port */
-static int rds_add_bound(struct rds_sock *rs, __be32 addr, __be16 *port)
+static int rds_add_bound(struct rds_sock *rs, const struct in6_addr *addr,
+			 __be16 *port, __u32 scope_id)
 {
 	int ret = -EADDRINUSE;
 	u16 rover, last;
-	u64 key;
+	u8 key[RDS_BOUND_KEY_LEN];
 
 	if (*port != 0) {
 		rover = be16_to_cpu(*port);
@@ -95,12 +112,13 @@ static int rds_add_bound(struct rds_sock *rs, __be32 addr, __be16 *port)
 
 		if (rover == RDS_FLAG_PROBE_PORT)
 			continue;
-		key = ((u64)addr << 32) | cpu_to_be16(rover);
-		if (rhashtable_lookup_fast(&bind_hash_table, &key, ht_parms))
+		__rds_create_bind_key(key, addr, cpu_to_be16(rover),
+				      scope_id);
+		if (rhashtable_lookup_fast(&bind_hash_table, key, ht_parms))
 			continue;
 
-		rs->rs_bound_key = key;
-		rs->rs_bound_addr = addr;
+		memcpy(rs->rs_bound_key, key, sizeof(rs->rs_bound_key));
+		rs->rs_bound_addr = *addr;
 		net_get_random_once(&rs->rs_hash_initval,
 				    sizeof(rs->rs_hash_initval));
 		rs->rs_bound_port = cpu_to_be16(rover);
@@ -114,7 +132,7 @@ static int rds_add_bound(struct rds_sock *rs, __be32 addr, __be16 *port)
 			  rs, &addr, (int)ntohs(*port));
 			break;
 		} else {
-			rs->rs_bound_addr = 0;
+			rs->rs_bound_addr = in6addr_any;
 			rds_sock_put(rs);
 			ret = -ENOMEM;
 			break;
@@ -127,44 +145,61 @@ static int rds_add_bound(struct rds_sock *rs, __be32 addr, __be16 *port)
 void rds_remove_bound(struct rds_sock *rs)
 {
 
-	if (!rs->rs_bound_addr)
+	if (ipv6_addr_any(&rs->rs_bound_addr))
 		return;
 
-	rdsdebug("rs %p unbinding from %pI4:%d\n",
+	rdsdebug("rs %p unbinding from %pI6c:%d\n",
 		 rs, &rs->rs_bound_addr,
 		 ntohs(rs->rs_bound_port));
 
 	rhashtable_remove_fast(&bind_hash_table, &rs->rs_bound_node, ht_parms);
 	rds_sock_put(rs);
-	rs->rs_bound_addr = 0;
+	rs->rs_bound_addr = in6addr_any;
 }
 
 int rds_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 {
 	struct sock *sk = sock->sk;
-	struct sockaddr_in *sin = (struct sockaddr_in *)uaddr;
 	struct rds_sock *rs = rds_sk_to_rs(sk);
+	struct in6_addr v6addr, *binding_addr;
 	struct rds_transport *trans;
+	__u32 scope_id = 0;
 	int ret = 0;
+	__be16 port;
 
+	/* We only allow an RDS socket to be bound to and IPv4 address. IPv6
+	 * address support will be added later.
+	 */
+	if (addr_len == sizeof(struct sockaddr_in)) {
+		struct sockaddr_in *sin = (struct sockaddr_in *)uaddr;
+
+		if (sin->sin_family != AF_INET ||
+		    sin->sin_addr.s_addr == htonl(INADDR_ANY))
+			return -EINVAL;
+		ipv6_addr_set_v4mapped(sin->sin_addr.s_addr, &v6addr);
+		binding_addr = &v6addr;
+		port = sin->sin_port;
+	} else if (addr_len == sizeof(struct sockaddr_in6)) {
+		return -EPROTONOSUPPORT;
+	} else {
+		return -EINVAL;
+	}
 	lock_sock(sk);
 
-	if (addr_len != sizeof(struct sockaddr_in) ||
-	    sin->sin_family != AF_INET ||
-	    rs->rs_bound_addr ||
-	    sin->sin_addr.s_addr == htonl(INADDR_ANY)) {
+	/* RDS socket does not allow re-binding. */
+	if (!ipv6_addr_any(&rs->rs_bound_addr)) {
 		ret = -EINVAL;
 		goto out;
 	}
 
-	ret = rds_add_bound(rs, sin->sin_addr.s_addr, &sin->sin_port);
+	ret = rds_add_bound(rs, binding_addr, &port, scope_id);
 	if (ret)
 		goto out;
 
 	if (rs->rs_transport) { /* previously bound */
 		trans = rs->rs_transport;
 		if (trans->laddr_check(sock_net(sock->sk),
-				       sin->sin_addr.s_addr) != 0) {
+				       binding_addr, scope_id) != 0) {
 			ret = -ENOPROTOOPT;
 			rds_remove_bound(rs);
 		} else {
@@ -172,13 +207,13 @@ int rds_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 		}
 		goto out;
 	}
-	trans = rds_trans_get_preferred(sock_net(sock->sk),
-					sin->sin_addr.s_addr);
+	trans = rds_trans_get_preferred(sock_net(sock->sk), binding_addr,
+					scope_id);
 	if (!trans) {
 		ret = -EADDRNOTAVAIL;
 		rds_remove_bound(rs);
-		pr_info_ratelimited("RDS: %s could not find a transport for %pI4, load rds_tcp or rds_rdma?\n",
-				    __func__, &sin->sin_addr.s_addr);
+		pr_info_ratelimited("RDS: %s could not find a transport for %pI6c, load rds_tcp or rds_rdma?\n",
+				    __func__, binding_addr);
 		goto out;
 	}
 
diff --git a/net/rds/cong.c b/net/rds/cong.c
index 63da9d2..ccdff09 100644
--- a/net/rds/cong.c
+++ b/net/rds/cong.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2007 Oracle.  All rights reserved.
+ * Copyright (c) 2007, 2017 Oracle and/or its affiliates. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -101,7 +101,7 @@
 static DEFINE_SPINLOCK(rds_cong_lock);
 static struct rb_root rds_cong_tree = RB_ROOT;
 
-static struct rds_cong_map *rds_cong_tree_walk(__be32 addr,
+static struct rds_cong_map *rds_cong_tree_walk(const struct in6_addr *addr,
 					       struct rds_cong_map *insert)
 {
 	struct rb_node **p = &rds_cong_tree.rb_node;
@@ -109,12 +109,15 @@ static struct rds_cong_map *rds_cong_tree_walk(__be32 addr,
 	struct rds_cong_map *map;
 
 	while (*p) {
+		int diff;
+
 		parent = *p;
 		map = rb_entry(parent, struct rds_cong_map, m_rb_node);
 
-		if (addr < map->m_addr)
+		diff = rds_addr_cmp(addr, &map->m_addr);
+		if (diff < 0)
 			p = &(*p)->rb_left;
-		else if (addr > map->m_addr)
+		else if (diff > 0)
 			p = &(*p)->rb_right;
 		else
 			return map;
@@ -132,7 +135,7 @@ static struct rds_cong_map *rds_cong_tree_walk(__be32 addr,
  * these bitmaps in the process getting pointers to them.  The bitmaps are only
  * ever freed as the module is removed after all connections have been freed.
  */
-static struct rds_cong_map *rds_cong_from_addr(__be32 addr)
+static struct rds_cong_map *rds_cong_from_addr(const struct in6_addr *addr)
 {
 	struct rds_cong_map *map;
 	struct rds_cong_map *ret = NULL;
@@ -144,7 +147,7 @@ static struct rds_cong_map *rds_cong_from_addr(__be32 addr)
 	if (!map)
 		return NULL;
 
-	map->m_addr = addr;
+	map->m_addr = *addr;
 	init_waitqueue_head(&map->m_waitq);
 	INIT_LIST_HEAD(&map->m_conn_list);
 
@@ -171,7 +174,7 @@ static struct rds_cong_map *rds_cong_from_addr(__be32 addr)
 		kfree(map);
 	}
 
-	rdsdebug("map %p for addr %x\n", ret, be32_to_cpu(addr));
+	rdsdebug("map %p for addr %pI6c\n", ret, addr);
 
 	return ret;
 }
@@ -202,8 +205,8 @@ void rds_cong_remove_conn(struct rds_connection *conn)
 
 int rds_cong_get_maps(struct rds_connection *conn)
 {
-	conn->c_lcong = rds_cong_from_addr(conn->c_laddr);
-	conn->c_fcong = rds_cong_from_addr(conn->c_faddr);
+	conn->c_lcong = rds_cong_from_addr(&conn->c_laddr);
+	conn->c_fcong = rds_cong_from_addr(&conn->c_faddr);
 
 	if (!(conn->c_lcong && conn->c_fcong))
 		return -ENOMEM;
@@ -353,7 +356,7 @@ void rds_cong_remove_socket(struct rds_sock *rs)
 
 	/* update congestion map for now-closed port */
 	spin_lock_irqsave(&rds_cong_lock, flags);
-	map = rds_cong_tree_walk(rs->rs_bound_addr, NULL);
+	map = rds_cong_tree_walk(&rs->rs_bound_addr, NULL);
 	spin_unlock_irqrestore(&rds_cong_lock, flags);
 
 	if (map && rds_cong_test_bit(map, rs->rs_bound_port)) {
diff --git a/net/rds/connection.c b/net/rds/connection.c
index abef75d..ca72563 100644
--- a/net/rds/connection.c
+++ b/net/rds/connection.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2006 Oracle.  All rights reserved.
+ * Copyright (c) 2006, 2017 Oracle and/or its affiliates. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -34,7 +34,8 @@
 #include <linux/list.h>
 #include <linux/slab.h>
 #include <linux/export.h>
-#include <net/inet_hashtables.h>
+#include <net/ipv6.h>
+#include <net/inet6_hashtables.h>
 
 #include "rds.h"
 #include "loop.h"
@@ -49,18 +50,21 @@
 static struct hlist_head rds_conn_hash[RDS_CONNECTION_HASH_ENTRIES];
 static struct kmem_cache *rds_conn_slab;
 
-static struct hlist_head *rds_conn_bucket(__be32 laddr, __be32 faddr)
+static struct hlist_head *rds_conn_bucket(const struct in6_addr *laddr,
+					  const struct in6_addr *faddr)
 {
+	static u32 rds6_hash_secret __read_mostly;
 	static u32 rds_hash_secret __read_mostly;
 
-	unsigned long hash;
+	u32 lhash, fhash, hash;
 
 	net_get_random_once(&rds_hash_secret, sizeof(rds_hash_secret));
+	net_get_random_once(&rds6_hash_secret, sizeof(rds6_hash_secret));
+
+	lhash = (__force u32)laddr->s6_addr32[3];
+	fhash = __ipv6_addr_jhash(faddr, rds6_hash_secret);
+	hash = __inet6_ehashfn(lhash, 0, fhash, 0, rds_hash_secret);
 
-	/* Pass NULL, don't need struct net for hash */
-	hash = __inet_ehashfn(be32_to_cpu(laddr), 0,
-			      be32_to_cpu(faddr), 0,
-			      rds_hash_secret);
 	return &rds_conn_hash[hash & RDS_CONNECTION_HASH_MASK];
 }
 
@@ -72,20 +76,25 @@ static struct hlist_head *rds_conn_bucket(__be32 laddr, __be32 faddr)
 /* rcu read lock must be held or the connection spinlock */
 static struct rds_connection *rds_conn_lookup(struct net *net,
 					      struct hlist_head *head,
-					      __be32 laddr, __be32 faddr,
-					      struct rds_transport *trans)
+					      const struct in6_addr *laddr,
+					      const struct in6_addr *faddr,
+					      struct rds_transport *trans,
+					      int dev_if)
 {
 	struct rds_connection *conn, *ret = NULL;
 
 	hlist_for_each_entry_rcu(conn, head, c_hash_node) {
-		if (conn->c_faddr == faddr && conn->c_laddr == laddr &&
-		    conn->c_trans == trans && net == rds_conn_net(conn)) {
+		if (ipv6_addr_equal(&conn->c_faddr, faddr) &&
+		    ipv6_addr_equal(&conn->c_laddr, laddr) &&
+		    conn->c_trans == trans &&
+		    net == rds_conn_net(conn) &&
+		    conn->c_dev_if == dev_if) {
 			ret = conn;
 			break;
 		}
 	}
-	rdsdebug("returning conn %p for %pI4 -> %pI4\n", ret,
-		 &laddr, &faddr);
+	rdsdebug("returning conn %p for %pI6c -> %pI6c\n", ret,
+		 laddr, faddr);
 	return ret;
 }
 
@@ -99,8 +108,8 @@ static void rds_conn_path_reset(struct rds_conn_path *cp)
 {
 	struct rds_connection *conn = cp->cp_conn;
 
-	rdsdebug("connection %pI4 to %pI4 reset\n",
-	  &conn->c_laddr, &conn->c_faddr);
+	rdsdebug("connection %pI6c to %pI6c reset\n",
+		 &conn->c_laddr, &conn->c_faddr);
 
 	rds_stats_inc(s_conn_reset);
 	rds_send_path_reset(cp);
@@ -142,9 +151,12 @@ static void __rds_conn_path_init(struct rds_connection *conn,
  * are torn down as the module is removed, if ever.
  */
 static struct rds_connection *__rds_conn_create(struct net *net,
-						__be32 laddr, __be32 faddr,
-				       struct rds_transport *trans, gfp_t gfp,
-				       int is_outgoing)
+						const struct in6_addr *laddr,
+						const struct in6_addr *faddr,
+						struct rds_transport *trans,
+						gfp_t gfp,
+						int is_outgoing,
+						int dev_if)
 {
 	struct rds_connection *conn, *parent = NULL;
 	struct hlist_head *head = rds_conn_bucket(laddr, faddr);
@@ -154,9 +166,12 @@ static struct rds_connection *__rds_conn_create(struct net *net,
 	int npaths = (trans->t_mp_capable ? RDS_MPATH_WORKERS : 1);
 
 	rcu_read_lock();
-	conn = rds_conn_lookup(net, head, laddr, faddr, trans);
-	if (conn && conn->c_loopback && conn->c_trans != &rds_loop_transport &&
-	    laddr == faddr && !is_outgoing) {
+	conn = rds_conn_lookup(net, head, laddr, faddr, trans, dev_if);
+	if (conn &&
+	    conn->c_loopback &&
+	    conn->c_trans != &rds_loop_transport &&
+	    ipv6_addr_equal(laddr, faddr) &&
+	    !is_outgoing) {
 		/* This is a looped back IB connection, and we're
 		 * called by the code handling the incoming connect.
 		 * We need a second connection object into which we
@@ -181,8 +196,10 @@ static struct rds_connection *__rds_conn_create(struct net *net,
 	}
 
 	INIT_HLIST_NODE(&conn->c_hash_node);
-	conn->c_laddr = laddr;
-	conn->c_faddr = faddr;
+	conn->c_laddr = *laddr;
+	conn->c_isv6 = !ipv6_addr_v4mapped(laddr);
+	conn->c_faddr = *faddr;
+	conn->c_dev_if = dev_if;
 
 	rds_conn_net_set(conn, net);
 
@@ -199,7 +216,7 @@ static struct rds_connection *__rds_conn_create(struct net *net,
 	 * can bind to the destination address then we'd rather the messages
 	 * flow through loopback rather than either transport.
 	 */
-	loop_trans = rds_trans_get_preferred(net, faddr);
+	loop_trans = rds_trans_get_preferred(net, faddr, conn->c_dev_if);
 	if (loop_trans) {
 		rds_trans_put(loop_trans);
 		conn->c_loopback = 1;
@@ -233,10 +250,10 @@ static struct rds_connection *__rds_conn_create(struct net *net,
 		goto out;
 	}
 
-	rdsdebug("allocated conn %p for %pI4 -> %pI4 over %s %s\n",
-	  conn, &laddr, &faddr,
-	  strnlen(trans->t_name, sizeof(trans->t_name)) ? trans->t_name :
-	  "[unknown]", is_outgoing ? "(outgoing)" : "");
+	rdsdebug("allocated conn %p for %pI6c -> %pI6c over %s %s\n",
+		 conn, laddr, faddr,
+		 strnlen(trans->t_name, sizeof(trans->t_name)) ?
+		 trans->t_name : "[unknown]", is_outgoing ? "(outgoing)" : "");
 
 	/*
 	 * Since we ran without holding the conn lock, someone could
@@ -262,7 +279,8 @@ static struct rds_connection *__rds_conn_create(struct net *net,
 		/* Creating normal conn */
 		struct rds_connection *found;
 
-		found = rds_conn_lookup(net, head, laddr, faddr, trans);
+		found = rds_conn_lookup(net, head, laddr, faddr, trans,
+					dev_if);
 		if (found) {
 			struct rds_conn_path *cp;
 			int i;
@@ -295,18 +313,22 @@ static struct rds_connection *__rds_conn_create(struct net *net,
 }
 
 struct rds_connection *rds_conn_create(struct net *net,
-				       __be32 laddr, __be32 faddr,
-				       struct rds_transport *trans, gfp_t gfp)
+				       const struct in6_addr *laddr,
+				       const struct in6_addr *faddr,
+				       struct rds_transport *trans, gfp_t gfp,
+				       int dev_if)
 {
-	return __rds_conn_create(net, laddr, faddr, trans, gfp, 0);
+	return __rds_conn_create(net, laddr, faddr, trans, gfp, 0, dev_if);
 }
 EXPORT_SYMBOL_GPL(rds_conn_create);
 
 struct rds_connection *rds_conn_create_outgoing(struct net *net,
-						__be32 laddr, __be32 faddr,
-				       struct rds_transport *trans, gfp_t gfp)
+						const struct in6_addr *laddr,
+						const struct in6_addr *faddr,
+						struct rds_transport *trans,
+						gfp_t gfp, int dev_if)
 {
-	return __rds_conn_create(net, laddr, faddr, trans, gfp, 1);
+	return __rds_conn_create(net, laddr, faddr, trans, gfp, 1, dev_if);
 }
 EXPORT_SYMBOL_GPL(rds_conn_create_outgoing);
 
@@ -502,12 +524,17 @@ static void rds_conn_message_info(struct socket *sock, unsigned int len,
 
 				/* XXX too lazy to maintain counts.. */
 				list_for_each_entry(rm, list, m_conn_item) {
+					__be32 laddr;
+					__be32 faddr;
+
 					total++;
+					laddr = conn->c_laddr.s6_addr32[3];
+					faddr = conn->c_faddr.s6_addr32[3];
 					if (total <= len)
 						rds_inc_info_copy(&rm->m_inc,
 								  iter,
-								  conn->c_laddr,
-								  conn->c_faddr,
+								  laddr,
+								  faddr,
 								  0);
 				}
 
@@ -584,7 +611,6 @@ static void rds_walk_conn_path_info(struct socket *sock, unsigned int len,
 	struct hlist_head *head;
 	struct rds_connection *conn;
 	size_t i;
-	int j;
 
 	rcu_read_lock();
 
@@ -595,17 +621,20 @@ static void rds_walk_conn_path_info(struct socket *sock, unsigned int len,
 	     i++, head++) {
 		hlist_for_each_entry_rcu(conn, head, c_hash_node) {
 			struct rds_conn_path *cp;
-			int npaths;
 
-			npaths = (conn->c_trans->t_mp_capable ?
-				 RDS_MPATH_WORKERS : 1);
-			for (j = 0; j < npaths; j++) {
-				cp = &conn->c_path[j];
+			/* XXX We only copy the information from the first
+			 * path for now.  The problem is that if there are
+			 * more than one underlying paths, we cannot report
+			 * information of all of them using the exisitng
+			 * API.  For example, there is only one next_tx_seq,
+			 * which path's next_tx_seq should we report?  It is
+			 * a bug in the design of MPRDS.
+			 */
+			cp = conn->c_path;
 
-				/* XXX no cp_lock usage.. */
-				if (!visitor(cp, buffer))
-					continue;
-			}
+			/* XXX no cp_lock usage.. */
+			if (!visitor(cp, buffer))
+				continue;
 
 			/* We copy as much as we can fit in the buffer,
 			 * but we count all items so that the caller
@@ -624,12 +653,13 @@ static void rds_walk_conn_path_info(struct socket *sock, unsigned int len,
 static int rds_conn_info_visitor(struct rds_conn_path *cp, void *buffer)
 {
 	struct rds_info_connection *cinfo = buffer;
+	struct rds_connection *conn = cp->cp_conn;
 
 	cinfo->next_tx_seq = cp->cp_next_tx_seq;
 	cinfo->next_rx_seq = cp->cp_next_rx_seq;
-	cinfo->laddr = cp->cp_conn->c_laddr;
-	cinfo->faddr = cp->cp_conn->c_faddr;
-	strncpy(cinfo->transport, cp->cp_conn->c_trans->t_name,
+	cinfo->laddr = conn->c_laddr.s6_addr32[3];
+	cinfo->faddr = conn->c_faddr.s6_addr32[3];
+	strncpy(cinfo->transport, conn->c_trans->t_name,
 		sizeof(cinfo->transport));
 	cinfo->flags = 0;
 
diff --git a/net/rds/ib.c b/net/rds/ib.c
index b6ad38e..c712a84 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2006 Oracle.  All rights reserved.
+ * Copyright (c) 2006, 2017 Oracle and/or its affiliates. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -296,8 +296,8 @@ static int rds_ib_conn_info_visitor(struct rds_connection *conn,
 	if (conn->c_trans != &rds_ib_transport)
 		return 0;
 
-	iinfo->src_addr = conn->c_laddr;
-	iinfo->dst_addr = conn->c_faddr;
+	iinfo->src_addr = conn->c_laddr.s6_addr32[3];
+	iinfo->dst_addr = conn->c_faddr.s6_addr32[3];
 
 	memset(&iinfo->src_gid, 0, sizeof(iinfo->src_gid));
 	memset(&iinfo->dst_gid, 0, sizeof(iinfo->dst_gid));
@@ -341,7 +341,8 @@ static void rds_ib_ic_info(struct socket *sock, unsigned int len,
  * allowed to influence which paths have priority.  We could call userspace
  * asserting this policy "routing".
  */
-static int rds_ib_laddr_check(struct net *net, __be32 addr)
+static int rds_ib_laddr_check(struct net *net, const struct in6_addr *addr,
+			      __u32 scope_id)
 {
 	int ret;
 	struct rdma_cm_id *cm_id;
@@ -357,7 +358,7 @@ static int rds_ib_laddr_check(struct net *net, __be32 addr)
 
 	memset(&sin, 0, sizeof(sin));
 	sin.sin_family = AF_INET;
-	sin.sin_addr.s_addr = addr;
+	sin.sin_addr.s_addr = addr->s6_addr32[3];
 
 	/* rdma_bind_addr will only succeed for IB & iWARP devices */
 	ret = rdma_bind_addr(cm_id, (struct sockaddr *)&sin);
@@ -367,9 +368,9 @@ static int rds_ib_laddr_check(struct net *net, __be32 addr)
 	    cm_id->device->node_type != RDMA_NODE_IB_CA)
 		ret = -EADDRNOTAVAIL;
 
-	rdsdebug("addr %pI4 ret %d node type %d\n",
-		&addr, ret,
-		cm_id->device ? cm_id->device->node_type : -1);
+	rdsdebug("addr %pI6c ret %d node type %d\n",
+		 addr, ret,
+		 cm_id->device ? cm_id->device->node_type : -1);
 
 	rdma_destroy_id(cm_id);
 
diff --git a/net/rds/ib.h b/net/rds/ib.h
index a6f4d7d..12f96b3 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -57,16 +57,38 @@ struct rds_ib_refill_cache {
 	struct list_head	 *ready;
 };
 
+struct rds_ib_conn_priv_cmn {
+	u8			ricpc_protocol_major;
+	u8			ricpc_protocol_minor;
+	__be16			ricpc_protocol_minor_mask;	/* bitmask */
+	__be32			ricpc_reserved1;
+	__be64			ricpc_ack_seq;
+	__be32			ricpc_credit;	/* non-zero enables flow ctl */
+};
+
 struct rds_ib_connect_private {
 	/* Add new fields at the end, and don't permute existing fields. */
-	__be32			dp_saddr;
-	__be32			dp_daddr;
-	u8			dp_protocol_major;
-	u8			dp_protocol_minor;
-	__be16			dp_protocol_minor_mask; /* bitmask */
-	__be32			dp_reserved1;
-	__be64			dp_ack_seq;
-	__be32			dp_credit;		/* non-zero enables flow ctl */
+	__be32				dp_saddr;
+	__be32				dp_daddr;
+	struct rds_ib_conn_priv_cmn	dp_cmn;
+};
+
+struct rds6_ib_connect_private {
+	/* Add new fields at the end, and don't permute existing fields. */
+	struct in6_addr			dp_saddr;
+	struct in6_addr			dp_daddr;
+	struct rds_ib_conn_priv_cmn	dp_cmn;
+};
+
+#define dp_protocol_major	dp_cmn.ricpc_protocol_major
+#define dp_protocol_minor	dp_cmn.ricpc_protocol_minor
+#define dp_protocol_minor_mask	dp_cmn.ricpc_protocol_minor_mask
+#define dp_ack_seq		dp_cmn.ricpc_ack_seq
+#define dp_credit		dp_cmn.ricpc_credit
+
+union rds_ib_conn_priv {
+	struct rds_ib_connect_private	ricp_v4;
+	struct rds6_ib_connect_private	ricp_v6;
 };
 
 struct rds_ib_send_work {
@@ -351,8 +373,8 @@ static inline void rds_ib_dma_sync_sg_for_device(struct ib_device *dev,
 __printf(2, 3)
 void __rds_ib_conn_error(struct rds_connection *conn, const char *, ...);
 int rds_ib_cm_handle_connect(struct rdma_cm_id *cm_id,
-			     struct rdma_cm_event *event);
-int rds_ib_cm_initiate_connect(struct rdma_cm_id *cm_id);
+			     struct rdma_cm_event *event, bool isv6);
+int rds_ib_cm_initiate_connect(struct rdma_cm_id *cm_id, bool isv6);
 void rds_ib_cm_connect_complete(struct rds_connection *conn,
 				struct rdma_cm_event *event);
 
@@ -361,7 +383,8 @@ void rds_ib_cm_connect_complete(struct rds_connection *conn,
 	__rds_ib_conn_error(conn, KERN_WARNING "RDS/IB: " fmt)
 
 /* ib_rdma.c */
-int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev, __be32 ipaddr);
+int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev,
+			 struct in6_addr *ipaddr);
 void rds_ib_add_conn(struct rds_ib_device *rds_ibdev, struct rds_connection *conn);
 void rds_ib_remove_conn(struct rds_ib_device *rds_ibdev, struct rds_connection *conn);
 void rds_ib_destroy_nodev_conns(void);
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index f1684ae..5b8b181 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2006 Oracle.  All rights reserved.
+ * Copyright (c) 2006, 2018 Oracle and/or its affiliates. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -35,10 +35,12 @@
 #include <linux/slab.h>
 #include <linux/vmalloc.h>
 #include <linux/ratelimit.h>
+#include <net/addrconf.h>
 
 #include "rds_single_path.h"
 #include "rds.h"
 #include "ib.h"
+#include "tcp.h"
 
 /*
  * Set the selected protocol version
@@ -95,25 +97,45 @@ static void rds_ib_set_flow_control(struct rds_connection *conn, u32 credits)
  */
 void rds_ib_cm_connect_complete(struct rds_connection *conn, struct rdma_cm_event *event)
 {
-	const struct rds_ib_connect_private *dp = NULL;
 	struct rds_ib_connection *ic = conn->c_transport_data;
+	const union rds_ib_conn_priv *dp = NULL;
 	struct ib_qp_attr qp_attr;
+	__be64 ack_seq = 0;
+	__be32 credit = 0;
+	u8 major = 0;
+	u8 minor = 0;
 	int err;
 
-	if (event->param.conn.private_data_len >= sizeof(*dp)) {
-		dp = event->param.conn.private_data;
-
-		/* make sure it isn't empty data */
-		if (dp->dp_protocol_major) {
-			rds_ib_set_protocol(conn,
-				RDS_PROTOCOL(dp->dp_protocol_major,
-				dp->dp_protocol_minor));
-			rds_ib_set_flow_control(conn, be32_to_cpu(dp->dp_credit));
+	dp = event->param.conn.private_data;
+	if (conn->c_isv6) {
+		if (event->param.conn.private_data_len >=
+		    sizeof(struct rds6_ib_connect_private)) {
+			major = dp->ricp_v6.dp_protocol_major;
+			minor = dp->ricp_v6.dp_protocol_minor;
+			credit = dp->ricp_v6.dp_credit;
+			/* dp structure start is not guaranteed to be 8 bytes
+			 * aligned.  Since dp_ack_seq is 64-bit extended load
+			 * operations can be used so go through get_unaligned
+			 * to avoid unaligned errors.
+			 */
+			ack_seq = get_unaligned(&dp->ricp_v6.dp_ack_seq);
 		}
+	} else if (event->param.conn.private_data_len >=
+		   sizeof(struct rds_ib_connect_private)) {
+		major = dp->ricp_v4.dp_protocol_major;
+		minor = dp->ricp_v4.dp_protocol_minor;
+		credit = dp->ricp_v4.dp_credit;
+		ack_seq = get_unaligned(&dp->ricp_v4.dp_ack_seq);
+	}
+
+	/* make sure it isn't empty data */
+	if (major) {
+		rds_ib_set_protocol(conn, RDS_PROTOCOL(major, minor));
+		rds_ib_set_flow_control(conn, be32_to_cpu(credit));
 	}
 
 	if (conn->c_version < RDS_PROTOCOL(3, 1)) {
-		pr_notice("RDS/IB: Connection <%pI4,%pI4> version %u.%u no longer supported\n",
+		pr_notice("RDS/IB: Connection <%pI6c,%pI6c> version %u.%u no longer supported\n",
 			  &conn->c_laddr, &conn->c_faddr,
 			  RDS_PROTOCOL_MAJOR(conn->c_version),
 			  RDS_PROTOCOL_MINOR(conn->c_version));
@@ -121,7 +143,7 @@ void rds_ib_cm_connect_complete(struct rds_connection *conn, struct rdma_cm_even
 		rds_conn_destroy(conn);
 		return;
 	} else {
-		pr_notice("RDS/IB: %s conn connected <%pI4,%pI4> version %u.%u%s\n",
+		pr_notice("RDS/IB: %s conn connected <%pI6c,%pI6c> version %u.%u%s\n",
 			  ic->i_active_side ? "Active" : "Passive",
 			  &conn->c_laddr, &conn->c_faddr,
 			  RDS_PROTOCOL_MAJOR(conn->c_version),
@@ -150,7 +172,7 @@ void rds_ib_cm_connect_complete(struct rds_connection *conn, struct rdma_cm_even
 		printk(KERN_NOTICE "ib_modify_qp(IB_QP_STATE, RTS): err=%d\n", err);
 
 	/* update ib_device with this local ipaddr */
-	err = rds_ib_update_ipaddr(ic->rds_ibdev, conn->c_laddr);
+	err = rds_ib_update_ipaddr(ic->rds_ibdev, &conn->c_laddr);
 	if (err)
 		printk(KERN_ERR "rds_ib_update_ipaddr failed (%d)\n",
 			err);
@@ -158,14 +180,8 @@ void rds_ib_cm_connect_complete(struct rds_connection *conn, struct rdma_cm_even
 	/* If the peer gave us the last packet it saw, process this as if
 	 * we had received a regular ACK. */
 	if (dp) {
-		/* dp structure start is not guaranteed to be 8 bytes aligned.
-		 * Since dp_ack_seq is 64-bit extended load operations can be
-		 * used so go through get_unaligned to avoid unaligned errors.
-		 */
-		__be64 dp_ack_seq = get_unaligned(&dp->dp_ack_seq);
-
-		if (dp_ack_seq)
-			rds_send_drop_acked(conn, be64_to_cpu(dp_ack_seq),
+		if (ack_seq)
+			rds_send_drop_acked(conn, be64_to_cpu(ack_seq),
 					    NULL);
 	}
 
@@ -173,11 +189,12 @@ void rds_ib_cm_connect_complete(struct rds_connection *conn, struct rdma_cm_even
 }
 
 static void rds_ib_cm_fill_conn_param(struct rds_connection *conn,
-			struct rdma_conn_param *conn_param,
-			struct rds_ib_connect_private *dp,
-			u32 protocol_version,
-			u32 max_responder_resources,
-			u32 max_initiator_depth)
+				      struct rdma_conn_param *conn_param,
+				      union rds_ib_conn_priv *dp,
+				      u32 protocol_version,
+				      u32 max_responder_resources,
+				      u32 max_initiator_depth,
+				      bool isv6)
 {
 	struct rds_ib_connection *ic = conn->c_transport_data;
 	struct rds_ib_device *rds_ibdev = ic->rds_ibdev;
@@ -193,24 +210,49 @@ static void rds_ib_cm_fill_conn_param(struct rds_connection *conn,
 
 	if (dp) {
 		memset(dp, 0, sizeof(*dp));
-		dp->dp_saddr = conn->c_laddr;
-		dp->dp_daddr = conn->c_faddr;
-		dp->dp_protocol_major = RDS_PROTOCOL_MAJOR(protocol_version);
-		dp->dp_protocol_minor = RDS_PROTOCOL_MINOR(protocol_version);
-		dp->dp_protocol_minor_mask = cpu_to_be16(RDS_IB_SUPPORTED_PROTOCOLS);
-		dp->dp_ack_seq = cpu_to_be64(rds_ib_piggyb_ack(ic));
+		if (isv6) {
+			dp->ricp_v6.dp_saddr = conn->c_laddr;
+			dp->ricp_v6.dp_daddr = conn->c_faddr;
+			dp->ricp_v6.dp_protocol_major =
+			    RDS_PROTOCOL_MAJOR(protocol_version);
+			dp->ricp_v6.dp_protocol_minor =
+			    RDS_PROTOCOL_MINOR(protocol_version);
+			dp->ricp_v6.dp_protocol_minor_mask =
+			    cpu_to_be16(RDS_IB_SUPPORTED_PROTOCOLS);
+			dp->ricp_v6.dp_ack_seq =
+			    cpu_to_be64(rds_ib_piggyb_ack(ic));
+
+			conn_param->private_data = &dp->ricp_v6;
+			conn_param->private_data_len = sizeof(dp->ricp_v6);
+		} else {
+			dp->ricp_v4.dp_saddr = conn->c_laddr.s6_addr32[3];
+			dp->ricp_v4.dp_daddr = conn->c_faddr.s6_addr32[3];
+			dp->ricp_v4.dp_protocol_major =
+			    RDS_PROTOCOL_MAJOR(protocol_version);
+			dp->ricp_v4.dp_protocol_minor =
+			    RDS_PROTOCOL_MINOR(protocol_version);
+			dp->ricp_v4.dp_protocol_minor_mask =
+			    cpu_to_be16(RDS_IB_SUPPORTED_PROTOCOLS);
+			dp->ricp_v4.dp_ack_seq =
+			    cpu_to_be64(rds_ib_piggyb_ack(ic));
+
+			conn_param->private_data = &dp->ricp_v4;
+			conn_param->private_data_len = sizeof(dp->ricp_v4);
+		}
 
 		/* Advertise flow control */
 		if (ic->i_flowctl) {
 			unsigned int credits;
 
-			credits = IB_GET_POST_CREDITS(atomic_read(&ic->i_credits));
-			dp->dp_credit = cpu_to_be32(credits);
-			atomic_sub(IB_SET_POST_CREDITS(credits), &ic->i_credits);
+			credits = IB_GET_POST_CREDITS
+				(atomic_read(&ic->i_credits));
+			if (isv6)
+				dp->ricp_v6.dp_credit = cpu_to_be32(credits);
+			else
+				dp->ricp_v4.dp_credit = cpu_to_be32(credits);
+			atomic_sub(IB_SET_POST_CREDITS(credits),
+				   &ic->i_credits);
 		}
-
-		conn_param->private_data = dp;
-		conn_param->private_data_len = sizeof(*dp);
 	}
 }
 
@@ -349,7 +391,7 @@ static void rds_ib_qp_event_handler(struct ib_event *event, void *data)
 		break;
 	default:
 		rdsdebug("Fatal QP Event %u (%s) "
-			"- connection %pI4->%pI4, reconnecting\n",
+			"- connection %pI6c->%pI6c, reconnecting\n",
 			event->event, ib_event_msg(event->event),
 			&conn->c_laddr, &conn->c_faddr);
 		rds_conn_drop(conn);
@@ -580,11 +622,13 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 	return ret;
 }
 
-static u32 rds_ib_protocol_compatible(struct rdma_cm_event *event)
+static u32 rds_ib_protocol_compatible(struct rdma_cm_event *event, bool isv6)
 {
-	const struct rds_ib_connect_private *dp = event->param.conn.private_data;
-	u16 common;
+	const union rds_ib_conn_priv *dp = event->param.conn.private_data;
+	u8 data_len, major, minor;
 	u32 version = 0;
+	__be16 mask;
+	u16 common;
 
 	/*
 	 * rdma_cm private data is odd - when there is any private data in the
@@ -603,51 +647,126 @@ static u32 rds_ib_protocol_compatible(struct rdma_cm_event *event)
 		return 0;
 	}
 
+	if (isv6) {
+		data_len = sizeof(struct rds6_ib_connect_private);
+		major = dp->ricp_v6.dp_protocol_major;
+		minor = dp->ricp_v6.dp_protocol_minor;
+		mask = dp->ricp_v6.dp_protocol_minor_mask;
+	} else {
+		data_len = sizeof(struct rds_ib_connect_private);
+		major = dp->ricp_v4.dp_protocol_major;
+		minor = dp->ricp_v4.dp_protocol_minor;
+		mask = dp->ricp_v4.dp_protocol_minor_mask;
+	}
+
 	/* Even if len is crap *now* I still want to check it. -ASG */
-	if (event->param.conn.private_data_len < sizeof (*dp) ||
-	    dp->dp_protocol_major == 0)
+	if (event->param.conn.private_data_len < data_len || major == 0)
 		return RDS_PROTOCOL_3_0;
 
-	common = be16_to_cpu(dp->dp_protocol_minor_mask) & RDS_IB_SUPPORTED_PROTOCOLS;
-	if (dp->dp_protocol_major == 3 && common) {
+	common = be16_to_cpu(mask) & RDS_IB_SUPPORTED_PROTOCOLS;
+	if (major == 3 && common) {
 		version = RDS_PROTOCOL_3_0;
 		while ((common >>= 1) != 0)
 			version++;
-	} else
-		printk_ratelimited(KERN_NOTICE "RDS: Connection from %pI4 using incompatible protocol version %u.%u\n",
-				&dp->dp_saddr,
-				dp->dp_protocol_major,
-				dp->dp_protocol_minor);
+	} else {
+		if (isv6)
+			printk_ratelimited(KERN_NOTICE "RDS: Connection from %pI6c using incompatible protocol version %u.%u\n",
+					   &dp->ricp_v6.dp_saddr, major, minor);
+		else
+			printk_ratelimited(KERN_NOTICE "RDS: Connection from %pI4 using incompatible protocol version %u.%u\n",
+					   &dp->ricp_v4.dp_saddr, major, minor);
+	}
 	return version;
 }
 
+/* Given an IPv6 address, find the IB net_device which hosts that address and
+ * return its index.  This is used by the rds_ib_cm_handle_connect() code to
+ * find the interface index of where an incoming request comes from when
+ * the request is using a link local address.
+ *
+ * Note one problem in this search.  It is possible that two interfaces have
+ * the same link local address.  Unfortunately, this cannot be solved unless
+ * the underlying layer gives us the interface which an incoming RDMA connect
+ * request comes from.
+ */
+static u32 __rds_find_ifindex(struct net *net, const struct in6_addr *addr)
+{
+	struct net_device *dev;
+	int idx = 0;
+
+	rcu_read_lock();
+	for_each_netdev_rcu(net, dev) {
+		if (dev->type == ARPHRD_INFINIBAND &&
+		    ipv6_chk_addr(net, addr, dev, 0)) {
+			idx = dev->ifindex;
+			break;
+		}
+	}
+	rcu_read_unlock();
+
+	return idx;
+}
+
 int rds_ib_cm_handle_connect(struct rdma_cm_id *cm_id,
-				    struct rdma_cm_event *event)
+			     struct rdma_cm_event *event, bool isv6)
 {
 	__be64 lguid = cm_id->route.path_rec->sgid.global.interface_id;
 	__be64 fguid = cm_id->route.path_rec->dgid.global.interface_id;
-	const struct rds_ib_connect_private *dp = event->param.conn.private_data;
-	struct rds_ib_connect_private dp_rep;
+	const struct rds_ib_conn_priv_cmn *dp_cmn;
 	struct rds_connection *conn = NULL;
 	struct rds_ib_connection *ic = NULL;
 	struct rdma_conn_param conn_param;
+	const union rds_ib_conn_priv *dp;
+	union rds_ib_conn_priv dp_rep;
+	struct in6_addr s_mapped_addr;
+	struct in6_addr d_mapped_addr;
+	const struct in6_addr *saddr6;
+	const struct in6_addr *daddr6;
+	int destroy = 1;
+	u32 ifindex = 0;
 	u32 version;
-	int err = 1, destroy = 1;
+	int err = 1;
 
 	/* Check whether the remote protocol version matches ours. */
-	version = rds_ib_protocol_compatible(event);
+	version = rds_ib_protocol_compatible(event, isv6);
 	if (!version)
 		goto out;
 
-	rdsdebug("saddr %pI4 daddr %pI4 RDSv%u.%u lguid 0x%llx fguid "
-		 "0x%llx\n", &dp->dp_saddr, &dp->dp_daddr,
+	dp = event->param.conn.private_data;
+	if (isv6) {
+		dp_cmn = &dp->ricp_v6.dp_cmn;
+		saddr6 = &dp->ricp_v6.dp_saddr;
+		daddr6 = &dp->ricp_v6.dp_daddr;
+		/* If the local address is link local, need to find the
+		 * interface index in order to create a proper RDS
+		 * connection.
+		 */
+		if (ipv6_addr_type(daddr6) & IPV6_ADDR_LINKLOCAL) {
+			/* Using init_net for now ..  */
+			ifindex = __rds_find_ifindex(&init_net, daddr6);
+			/* No index found...  Need to bail out. */
+			if (ifindex == 0) {
+				err = -EOPNOTSUPP;
+				goto out;
+			}
+		}
+	} else {
+		dp_cmn = &dp->ricp_v4.dp_cmn;
+		ipv6_addr_set_v4mapped(dp->ricp_v4.dp_saddr, &s_mapped_addr);
+		ipv6_addr_set_v4mapped(dp->ricp_v4.dp_daddr, &d_mapped_addr);
+		saddr6 = &s_mapped_addr;
+		daddr6 = &d_mapped_addr;
+	}
+
+	rdsdebug("saddr %pI6c daddr %pI6c RDSv%u.%u lguid 0x%llx fguid "
+		 "0x%llx\n", saddr6, daddr6,
 		 RDS_PROTOCOL_MAJOR(version), RDS_PROTOCOL_MINOR(version),
 		 (unsigned long long)be64_to_cpu(lguid),
 		 (unsigned long long)be64_to_cpu(fguid));
 
 	/* RDS/IB is not currently netns aware, thus init_net */
-	conn = rds_conn_create(&init_net, dp->dp_daddr, dp->dp_saddr,
-			       &rds_ib_transport, GFP_KERNEL);
+	conn = rds_conn_create(&init_net, daddr6, saddr6,
+			       &rds_ib_transport, GFP_KERNEL, ifindex);
 	if (IS_ERR(conn)) {
 		rdsdebug("rds_conn_create failed (%ld)\n", PTR_ERR(conn));
 		conn = NULL;
@@ -678,12 +797,13 @@ int rds_ib_cm_handle_connect(struct rdma_cm_id *cm_id,
 	ic = conn->c_transport_data;
 
 	rds_ib_set_protocol(conn, version);
-	rds_ib_set_flow_control(conn, be32_to_cpu(dp->dp_credit));
+	rds_ib_set_flow_control(conn, be32_to_cpu(dp_cmn->ricpc_credit));
 
 	/* If the peer gave us the last packet it saw, process this as if
 	 * we had received a regular ACK. */
-	if (dp->dp_ack_seq)
-		rds_send_drop_acked(conn, be64_to_cpu(dp->dp_ack_seq), NULL);
+	if (dp_cmn->ricpc_ack_seq)
+		rds_send_drop_acked(conn, be64_to_cpu(dp_cmn->ricpc_ack_seq),
+				    NULL);
 
 	BUG_ON(cm_id->context);
 	BUG_ON(ic->i_cm_id);
@@ -702,8 +822,8 @@ int rds_ib_cm_handle_connect(struct rdma_cm_id *cm_id,
 	}
 
 	rds_ib_cm_fill_conn_param(conn, &conn_param, &dp_rep, version,
-		event->param.conn.responder_resources,
-		event->param.conn.initiator_depth);
+				  event->param.conn.responder_resources,
+				  event->param.conn.initiator_depth, isv6);
 
 	/* rdma_accept() calls rdma_reject() internally if it fails */
 	if (rdma_accept(cm_id, &conn_param))
@@ -718,12 +838,12 @@ int rds_ib_cm_handle_connect(struct rdma_cm_id *cm_id,
 }
 
 
-int rds_ib_cm_initiate_connect(struct rdma_cm_id *cm_id)
+int rds_ib_cm_initiate_connect(struct rdma_cm_id *cm_id, bool isv6)
 {
 	struct rds_connection *conn = cm_id->context;
 	struct rds_ib_connection *ic = conn->c_transport_data;
 	struct rdma_conn_param conn_param;
-	struct rds_ib_connect_private dp;
+	union rds_ib_conn_priv dp;
 	int ret;
 
 	/* If the peer doesn't do protocol negotiation, we must
@@ -738,7 +858,7 @@ int rds_ib_cm_initiate_connect(struct rdma_cm_id *cm_id)
 	}
 
 	rds_ib_cm_fill_conn_param(conn, &conn_param, &dp, RDS_PROTOCOL_VERSION,
-		UINT_MAX, UINT_MAX);
+				  UINT_MAX, UINT_MAX, isv6);
 	ret = rdma_connect(cm_id, &conn_param);
 	if (ret)
 		rds_ib_conn_error(conn, "rdma_connect failed (%d)\n", ret);
@@ -758,13 +878,17 @@ int rds_ib_cm_initiate_connect(struct rdma_cm_id *cm_id)
 int rds_ib_conn_path_connect(struct rds_conn_path *cp)
 {
 	struct rds_connection *conn = cp->cp_conn;
-	struct rds_ib_connection *ic = conn->c_transport_data;
-	struct sockaddr_in src, dest;
+	struct sockaddr_storage src, dest;
+	rdma_cm_event_handler handler;
+	struct rds_ib_connection *ic;
 	int ret;
 
+	ic = conn->c_transport_data;
+
 	/* XXX I wonder what affect the port space has */
 	/* delegate cm event handler to rdma_transport */
-	ic->i_cm_id = rdma_create_id(&init_net, rds_rdma_cm_event_handler, conn,
+	handler = rds_rdma_cm_event_handler;
+	ic->i_cm_id = rdma_create_id(&init_net, handler, conn,
 				     RDMA_PS_TCP, IB_QPT_RC);
 	if (IS_ERR(ic->i_cm_id)) {
 		ret = PTR_ERR(ic->i_cm_id);
@@ -775,13 +899,33 @@ int rds_ib_conn_path_connect(struct rds_conn_path *cp)
 
 	rdsdebug("created cm id %p for conn %p\n", ic->i_cm_id, conn);
 
-	src.sin_family = AF_INET;
-	src.sin_addr.s_addr = (__force u32)conn->c_laddr;
-	src.sin_port = (__force u16)htons(0);
+	if (ipv6_addr_v4mapped(&conn->c_faddr)) {
+		struct sockaddr_in *sin;
+
+		sin = (struct sockaddr_in *)&src;
+		sin->sin_family = AF_INET;
+		sin->sin_addr.s_addr = conn->c_laddr.s6_addr32[3];
+		sin->sin_port = 0;
 
-	dest.sin_family = AF_INET;
-	dest.sin_addr.s_addr = (__force u32)conn->c_faddr;
-	dest.sin_port = (__force u16)htons(RDS_PORT);
+		sin = (struct sockaddr_in *)&dest;
+		sin->sin_family = AF_INET;
+		sin->sin_addr.s_addr = conn->c_faddr.s6_addr32[3];
+		sin->sin_port = htons(RDS_PORT);
+	} else {
+		struct sockaddr_in6 *sin6;
+
+		sin6 = (struct sockaddr_in6 *)&src;
+		sin6->sin6_family = AF_INET6;
+		sin6->sin6_addr = conn->c_laddr;
+		sin6->sin6_port = 0;
+		sin6->sin6_scope_id = conn->c_dev_if;
+
+		sin6 = (struct sockaddr_in6 *)&dest;
+		sin6->sin6_family = AF_INET6;
+		sin6->sin6_addr = conn->c_faddr;
+		sin6->sin6_port = htons(RDS_TCP_PORT);
+		sin6->sin6_scope_id = conn->c_dev_if;
+	}
 
 	ret = rdma_resolve_addr(ic->i_cm_id, (struct sockaddr *)&src,
 				(struct sockaddr *)&dest,
diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index e678699..0ec9df0 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2006 Oracle.  All rights reserved.
+ * Copyright (c) 2006, 2017 Oracle and/or its affiliates. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -100,18 +100,19 @@ static void rds_ib_remove_ipaddr(struct rds_ib_device *rds_ibdev, __be32 ipaddr)
 		kfree_rcu(to_free, rcu);
 }
 
-int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev, __be32 ipaddr)
+int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev,
+			 struct in6_addr *ipaddr)
 {
 	struct rds_ib_device *rds_ibdev_old;
 
-	rds_ibdev_old = rds_ib_get_device(ipaddr);
+	rds_ibdev_old = rds_ib_get_device(ipaddr->s6_addr32[3]);
 	if (!rds_ibdev_old)
-		return rds_ib_add_ipaddr(rds_ibdev, ipaddr);
+		return rds_ib_add_ipaddr(rds_ibdev, ipaddr->s6_addr32[3]);
 
 	if (rds_ibdev_old != rds_ibdev) {
-		rds_ib_remove_ipaddr(rds_ibdev_old, ipaddr);
+		rds_ib_remove_ipaddr(rds_ibdev_old, ipaddr->s6_addr32[3]);
 		rds_ib_dev_put(rds_ibdev_old);
-		return rds_ib_add_ipaddr(rds_ibdev, ipaddr);
+		return rds_ib_add_ipaddr(rds_ibdev, ipaddr->s6_addr32[3]);
 	}
 	rds_ib_dev_put(rds_ibdev_old);
 
@@ -544,7 +545,7 @@ void *rds_ib_get_mr(struct scatterlist *sg, unsigned long nents,
 	struct rds_ib_connection *ic = rs->rs_conn->c_transport_data;
 	int ret;
 
-	rds_ibdev = rds_ib_get_device(rs->rs_bound_addr);
+	rds_ibdev = rds_ib_get_device(rs->rs_bound_addr.s6_addr32[3]);
 	if (!rds_ibdev) {
 		ret = -ENODEV;
 		goto out;
diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index b4e421a..f101bed 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2006 Oracle.  All rights reserved.
+ * Copyright (c) 2006, 2017 Oracle and/or its affiliates. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -266,7 +266,7 @@ static struct rds_ib_incoming *rds_ib_refill_one_inc(struct rds_ib_connection *i
 		rds_ib_stats_inc(s_ib_rx_total_incs);
 	}
 	INIT_LIST_HEAD(&ibinc->ii_frags);
-	rds_inc_init(&ibinc->ii_inc, ic->conn, ic->conn->c_faddr);
+	rds_inc_init(&ibinc->ii_inc, ic->conn, &ic->conn->c_faddr);
 
 	return ibinc;
 }
@@ -420,7 +420,7 @@ void rds_ib_recv_refill(struct rds_connection *conn, int prefill, gfp_t gfp)
 		ret = ib_post_recv(ic->i_cm_id->qp, &recv->r_wr, &failed_wr);
 		if (ret) {
 			rds_ib_conn_error(conn, "recv post on "
-			       "%pI4 returned %d, disconnecting and "
+			       "%pI6c returned %d, disconnecting and "
 			       "reconnecting\n", &conn->c_faddr,
 			       ret);
 			break;
@@ -850,7 +850,7 @@ static void rds_ib_process_recv(struct rds_connection *conn,
 
 	if (data_len < sizeof(struct rds_header)) {
 		rds_ib_conn_error(conn, "incoming message "
-		       "from %pI4 didn't include a "
+		       "from %pI6c didn't include a "
 		       "header, disconnecting and "
 		       "reconnecting\n",
 		       &conn->c_faddr);
@@ -863,7 +863,7 @@ static void rds_ib_process_recv(struct rds_connection *conn,
 	/* Validate the checksum. */
 	if (!rds_message_verify_checksum(ihdr)) {
 		rds_ib_conn_error(conn, "incoming message "
-		       "from %pI4 has corrupted header - "
+		       "from %pI6c has corrupted header - "
 		       "forcing a reconnect\n",
 		       &conn->c_faddr);
 		rds_stats_inc(s_recv_drop_bad_checksum);
@@ -943,10 +943,10 @@ static void rds_ib_process_recv(struct rds_connection *conn,
 		ic->i_recv_data_rem = 0;
 		ic->i_ibinc = NULL;
 
-		if (ibinc->ii_inc.i_hdr.h_flags == RDS_FLAG_CONG_BITMAP)
+		if (ibinc->ii_inc.i_hdr.h_flags == RDS_FLAG_CONG_BITMAP) {
 			rds_ib_cong_recv(conn, ibinc);
-		else {
-			rds_recv_incoming(conn, conn->c_faddr, conn->c_laddr,
+		} else {
+			rds_recv_incoming(conn, &conn->c_faddr, &conn->c_laddr,
 					  &ibinc->ii_inc, GFP_ATOMIC);
 			state->ack_next = be64_to_cpu(hdr->h_sequence);
 			state->ack_next_valid = 1;
@@ -990,7 +990,7 @@ void rds_ib_recv_cqe_handler(struct rds_ib_connection *ic,
 	} else {
 		/* We expect errors as the qp is drained during shutdown */
 		if (rds_conn_up(conn) || rds_conn_connecting(conn))
-			rds_ib_conn_error(conn, "recv completion on <%pI4,%pI4> had status %u (%s), disconnecting and reconnecting\n",
+			rds_ib_conn_error(conn, "recv completion on <%pI6c,%pI6c> had status %u (%s), disconnecting and reconnecting\n",
 					  &conn->c_laddr, &conn->c_faddr,
 					  wc->status,
 					  ib_wc_status_msg(wc->status));
diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
index 8557a1c..c4cdfe49 100644
--- a/net/rds/ib_send.c
+++ b/net/rds/ib_send.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2006 Oracle.  All rights reserved.
+ * Copyright (c) 2006, 2017 Oracle and/or its affiliates. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -305,7 +305,7 @@ void rds_ib_send_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc)
 
 	/* We expect errors as the qp is drained during shutdown */
 	if (wc->status != IB_WC_SUCCESS && rds_conn_up(conn)) {
-		rds_ib_conn_error(conn, "send completion on <%pI4,%pI4> had status %u (%s), disconnecting and reconnecting\n",
+		rds_ib_conn_error(conn, "send completion on <%pI6c,%pI6c> had status %u (%s), disconnecting and reconnecting\n",
 				  &conn->c_laddr, &conn->c_faddr, wc->status,
 				  ib_wc_status_msg(wc->status));
 	}
@@ -730,7 +730,7 @@ int rds_ib_xmit(struct rds_connection *conn, struct rds_message *rm,
 		 first, &first->s_wr, ret, failed_wr);
 	BUG_ON(failed_wr != &first->s_wr);
 	if (ret) {
-		printk(KERN_WARNING "RDS/IB: ib_post_send to %pI4 "
+		printk(KERN_WARNING "RDS/IB: ib_post_send to %pI6c "
 		       "returned %d\n", &conn->c_faddr, ret);
 		rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc);
 		rds_ib_sub_signaled(ic, nr_sig);
@@ -827,7 +827,7 @@ int rds_ib_xmit_atomic(struct rds_connection *conn, struct rm_atomic_op *op)
 		 send, &send->s_atomic_wr, ret, failed_wr);
 	BUG_ON(failed_wr != &send->s_atomic_wr.wr);
 	if (ret) {
-		printk(KERN_WARNING "RDS/IB: atomic ib_post_send to %pI4 "
+		printk(KERN_WARNING "RDS/IB: atomic ib_post_send to %pI6c "
 		       "returned %d\n", &conn->c_faddr, ret);
 		rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc);
 		rds_ib_sub_signaled(ic, nr_sig);
@@ -967,7 +967,7 @@ int rds_ib_xmit_rdma(struct rds_connection *conn, struct rm_rdma_op *op)
 		 first, &first->s_rdma_wr.wr, ret, failed_wr);
 	BUG_ON(failed_wr != &first->s_rdma_wr.wr);
 	if (ret) {
-		printk(KERN_WARNING "RDS/IB: rdma ib_post_send to %pI4 "
+		printk(KERN_WARNING "RDS/IB: rdma ib_post_send to %pI6c "
 		       "returned %d\n", &conn->c_faddr, ret);
 		rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc);
 		rds_ib_sub_signaled(ic, nr_sig);
diff --git a/net/rds/loop.c b/net/rds/loop.c
index dac6218..0dca895 100644
--- a/net/rds/loop.c
+++ b/net/rds/loop.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2006 Oracle.  All rights reserved.
+ * Copyright (c) 2006, 2017 Oracle and/or its affiliates. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -33,6 +33,7 @@
 #include <linux/kernel.h>
 #include <linux/slab.h>
 #include <linux/in.h>
+#include <linux/ipv6.h>
 
 #include "rds_single_path.h"
 #include "rds.h"
@@ -75,11 +76,11 @@ static int rds_loop_xmit(struct rds_connection *conn, struct rds_message *rm,
 
 	BUG_ON(hdr_off || sg || off);
 
-	rds_inc_init(&rm->m_inc, conn, conn->c_laddr);
+	rds_inc_init(&rm->m_inc, conn, &conn->c_laddr);
 	/* For the embedded inc. Matching put is in loop_inc_free() */
 	rds_message_addref(rm);
 
-	rds_recv_incoming(conn, conn->c_laddr, conn->c_faddr, &rm->m_inc,
+	rds_recv_incoming(conn, &conn->c_laddr, &conn->c_faddr, &rm->m_inc,
 			  GFP_KERNEL);
 
 	rds_send_drop_acked(conn, be64_to_cpu(rm->m_inc.i_hdr.h_sequence),
diff --git a/net/rds/rdma.c b/net/rds/rdma.c
index 634cfcb..7b39980 100644
--- a/net/rds/rdma.c
+++ b/net/rds/rdma.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2007 Oracle.  All rights reserved.
+ * Copyright (c) 2007, 2017 Oracle and/or its affiliates. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -183,7 +183,7 @@ static int __rds_rdma_map(struct rds_sock *rs, struct rds_get_mr_args *args,
 	long i;
 	int ret;
 
-	if (rs->rs_bound_addr == 0 || !rs->rs_transport) {
+	if (ipv6_addr_any(&rs->rs_bound_addr) || !rs->rs_transport) {
 		ret = -ENOTCONN; /* XXX not a great errno */
 		goto out;
 	}
@@ -574,7 +574,7 @@ int rds_cmsg_rdma_args(struct rds_sock *rs, struct rds_message *rm,
 
 	args = CMSG_DATA(cmsg);
 
-	if (rs->rs_bound_addr == 0) {
+	if (ipv6_addr_any(&rs->rs_bound_addr)) {
 		ret = -ENOTCONN; /* XXX not a great errno */
 		goto out_ret;
 	}
diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
index fc59821..aef73e7 100644
--- a/net/rds/rdma_transport.c
+++ b/net/rds/rdma_transport.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2009 Oracle.  All rights reserved.
+ * Copyright (c) 2009, 2018 Oracle and/or its affiliates. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -39,8 +39,9 @@
 
 static struct rdma_cm_id *rds_rdma_listen_id;
 
-int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id,
-			      struct rdma_cm_event *event)
+static int rds_rdma_cm_event_handler_cmn(struct rdma_cm_id *cm_id,
+					 struct rdma_cm_event *event,
+					 bool isv6)
 {
 	/* this can be null in the listening path */
 	struct rds_connection *conn = cm_id->context;
@@ -72,7 +73,7 @@ int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id,
 
 	switch (event->event) {
 	case RDMA_CM_EVENT_CONNECT_REQUEST:
-		ret = trans->cm_handle_connect(cm_id, event);
+		ret = trans->cm_handle_connect(cm_id, event, isv6);
 		break;
 
 	case RDMA_CM_EVENT_ADDR_RESOLVED:
@@ -90,7 +91,7 @@ int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id,
 
 			ibic = conn->c_transport_data;
 			if (ibic && ibic->i_cm_id == cm_id)
-				ret = trans->cm_initiate_connect(cm_id);
+				ret = trans->cm_initiate_connect(cm_id, isv6);
 			else
 				rds_conn_drop(conn);
 		}
@@ -116,14 +117,14 @@ int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id,
 
 	case RDMA_CM_EVENT_DISCONNECTED:
 		rdsdebug("DISCONNECT event - dropping connection "
-			"%pI4->%pI4\n", &conn->c_laddr,
+			 "%pI6c->%pI6c\n", &conn->c_laddr,
 			 &conn->c_faddr);
 		rds_conn_drop(conn);
 		break;
 
 	case RDMA_CM_EVENT_TIMEWAIT_EXIT:
 		if (conn) {
-			pr_info("RDS: RDMA_CM_EVENT_TIMEWAIT_EXIT event: dropping connection %pI4->%pI4\n",
+			pr_info("RDS: RDMA_CM_EVENT_TIMEWAIT_EXIT event: dropping connection %pI6c->%pI6c\n",
 				&conn->c_laddr, &conn->c_faddr);
 			rds_conn_drop(conn);
 		}
@@ -146,13 +147,20 @@ int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id,
 	return ret;
 }
 
-static int rds_rdma_listen_init(void)
+int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id,
+			      struct rdma_cm_event *event)
+{
+	return rds_rdma_cm_event_handler_cmn(cm_id, event, false);
+}
+
+static int rds_rdma_listen_init_common(rdma_cm_event_handler handler,
+				       struct sockaddr *sa,
+				       struct rdma_cm_id **ret_cm_id)
 {
-	struct sockaddr_in sin;
 	struct rdma_cm_id *cm_id;
 	int ret;
 
-	cm_id = rdma_create_id(&init_net, rds_rdma_cm_event_handler, NULL,
+	cm_id = rdma_create_id(&init_net, handler, NULL,
 			       RDMA_PS_TCP, IB_QPT_RC);
 	if (IS_ERR(cm_id)) {
 		ret = PTR_ERR(cm_id);
@@ -161,15 +169,11 @@ static int rds_rdma_listen_init(void)
 		return ret;
 	}
 
-	sin.sin_family = AF_INET;
-	sin.sin_addr.s_addr = (__force u32)htonl(INADDR_ANY);
-	sin.sin_port = (__force u16)htons(RDS_PORT);
-
 	/*
 	 * XXX I bet this binds the cm_id to a device.  If we want to support
 	 * fail-over we'll have to take this into consideration.
 	 */
-	ret = rdma_bind_addr(cm_id, (struct sockaddr *)&sin);
+	ret = rdma_bind_addr(cm_id, sa);
 	if (ret) {
 		printk(KERN_ERR "RDS/RDMA: failed to setup listener, "
 		       "rdma_bind_addr() returned %d\n", ret);
@@ -185,7 +189,7 @@ static int rds_rdma_listen_init(void)
 
 	rdsdebug("cm %p listening on port %u\n", cm_id, RDS_PORT);
 
-	rds_rdma_listen_id = cm_id;
+	*ret_cm_id = cm_id;
 	cm_id = NULL;
 out:
 	if (cm_id)
@@ -193,6 +197,26 @@ static int rds_rdma_listen_init(void)
 	return ret;
 }
 
+/* Initialize the RDS RDMA listeners.  We create two listeners for
+ * compatibility reason.  The one on RDS_PORT is used for IPv4
+ * requests only.  The one on RDS_TCP_PORT is used for IPv6 requests
+ * only.  So only IPv6 enabled RDS module will communicate using this
+ * port.
+ */
+static int rds_rdma_listen_init(void)
+{
+	int ret;
+	struct sockaddr_in sin;
+
+	sin.sin_family = PF_INET;
+	sin.sin_addr.s_addr = htonl(INADDR_ANY);
+	sin.sin_port = htons(RDS_PORT);
+	ret = rds_rdma_listen_init_common(rds_rdma_cm_event_handler,
+					  (struct sockaddr *)&sin,
+					  &rds_rdma_listen_id);
+	return ret;
+}
+
 static void rds_rdma_listen_stop(void)
 {
 	if (rds_rdma_listen_id) {
diff --git a/net/rds/rds.h b/net/rds/rds.h
index f2272fb..859808a 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -10,6 +10,7 @@
 #include <linux/rds.h>
 #include <linux/rhashtable.h>
 #include <linux/refcount.h>
+#include <linux/in6.h>
 
 #include "info.h"
 
@@ -61,7 +62,7 @@ void rdsdebug(char *fmt, ...)
 
 struct rds_cong_map {
 	struct rb_node		m_rb_node;
-	__be32			m_addr;
+	struct in6_addr		m_addr;
 	wait_queue_head_t	m_waitq;
 	struct list_head	m_conn_list;
 	unsigned long		m_page_addrs[RDS_CONG_MAP_PAGES];
@@ -136,11 +137,13 @@ struct rds_conn_path {
 /* One rds_connection per RDS address pair */
 struct rds_connection {
 	struct hlist_node	c_hash_node;
-	__be32			c_laddr;
-	__be32			c_faddr;
+	struct in6_addr		c_laddr;
+	struct in6_addr		c_faddr;
+	int			c_dev_if; /* c_laddrs's interface index */
 	unsigned int		c_loopback:1,
+				c_isv6:1,
 				c_ping_triggered:1,
-				c_pad_to_32:30;
+				c_pad_to_32:29;
 	int			c_npaths;
 	struct rds_connection	*c_passive;
 	struct rds_transport	*c_trans;
@@ -269,7 +272,7 @@ struct rds_incoming {
 	struct rds_conn_path	*i_conn_path;
 	struct rds_header	i_hdr;
 	unsigned long		i_rx_jiffies;
-	__be32			i_saddr;
+	struct in6_addr		i_saddr;
 
 	rds_rdma_cookie_t	i_rdma_cookie;
 	struct timeval		i_rx_tstamp;
@@ -386,7 +389,7 @@ struct rds_message {
 	struct list_head	m_conn_item;
 	struct rds_incoming	m_inc;
 	u64			m_ack_seq;
-	__be32			m_daddr;
+	struct in6_addr		m_daddr;
 	unsigned long		m_flags;
 
 	/* Never access m_rs without holding m_rs_lock.
@@ -519,7 +522,8 @@ struct rds_transport {
 				t_mp_capable:1;
 	unsigned int		t_type;
 
-	int (*laddr_check)(struct net *net, __be32 addr);
+	int (*laddr_check)(struct net *net, const struct in6_addr *addr,
+			   __u32 scope_id);
 	int (*conn_alloc)(struct rds_connection *conn, gfp_t gfp);
 	void (*conn_free)(void *data);
 	int (*conn_path_connect)(struct rds_conn_path *cp);
@@ -535,8 +539,8 @@ struct rds_transport {
 	void (*inc_free)(struct rds_incoming *inc);
 
 	int (*cm_handle_connect)(struct rdma_cm_id *cm_id,
-				 struct rdma_cm_event *event);
-	int (*cm_initiate_connect)(struct rdma_cm_id *cm_id);
+				 struct rdma_cm_event *event, bool isv6);
+	int (*cm_initiate_connect)(struct rdma_cm_id *cm_id, bool isv6);
 	void (*cm_connect_complete)(struct rds_connection *conn,
 				    struct rdma_cm_event *event);
 
@@ -551,6 +555,12 @@ struct rds_transport {
 	bool (*t_unloading)(struct rds_connection *conn);
 };
 
+/* Bind hash table key length.  It is the sum of the size of a struct
+ * in6_addr, a scope_id  and a port.
+ */
+#define RDS_BOUND_KEY_LEN \
+	(sizeof(struct in6_addr) + sizeof(__u32) + sizeof(__be16))
+
 struct rds_sock {
 	struct sock		rs_sk;
 
@@ -562,10 +572,14 @@ struct rds_sock {
 	 * support.
 	 */
 	struct rhash_head	rs_bound_node;
-	u64			rs_bound_key;
-	__be32			rs_bound_addr;
-	__be32			rs_conn_addr;
-	__be16			rs_bound_port;
+	u8			rs_bound_key[RDS_BOUND_KEY_LEN];
+	struct sockaddr_in6	rs_bound_sin6;
+#define rs_bound_addr		rs_bound_sin6.sin6_addr
+#define rs_bound_addr_v4	rs_bound_sin6.sin6_addr.s6_addr32[3]
+#define rs_bound_port		rs_bound_sin6.sin6_port
+#define rs_bound_scope_id	rs_bound_sin6.sin6_scope_id
+	struct in6_addr		rs_conn_addr;
+#define rs_conn_addr_v4		rs_conn_addr.s6_addr32[3]
 	__be16			rs_conn_port;
 	struct rds_transport    *rs_transport;
 
@@ -701,7 +715,8 @@ static inline void __rds_wake_sk_sleep(struct sock *sk)
 /* bind.c */
 int rds_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len);
 void rds_remove_bound(struct rds_sock *rs);
-struct rds_sock *rds_find_bound(__be32 addr, __be16 port);
+struct rds_sock *rds_find_bound(const struct in6_addr *addr, __be16 port,
+				__u32 scope_id);
 int rds_bind_lock_init(void);
 void rds_bind_lock_destroy(void);
 
@@ -725,11 +740,15 @@ static inline void __rds_wake_sk_sleep(struct sock *sk)
 int rds_conn_init(void);
 void rds_conn_exit(void);
 struct rds_connection *rds_conn_create(struct net *net,
-				       __be32 laddr, __be32 faddr,
-				       struct rds_transport *trans, gfp_t gfp);
+				       const struct in6_addr *laddr,
+				       const struct in6_addr *faddr,
+				       struct rds_transport *trans, gfp_t gfp,
+				       int dev_if);
 struct rds_connection *rds_conn_create_outgoing(struct net *net,
-						__be32 laddr, __be32 faddr,
-			       struct rds_transport *trans, gfp_t gfp);
+						const struct in6_addr *laddr,
+						const struct in6_addr *faddr,
+						struct rds_transport *trans,
+						gfp_t gfp, int dev_if);
 void rds_conn_shutdown(struct rds_conn_path *cpath);
 void rds_conn_destroy(struct rds_connection *conn);
 void rds_conn_drop(struct rds_connection *conn);
@@ -840,11 +859,12 @@ int rds_page_remainder_alloc(struct scatterlist *scat, unsigned long bytes,
 
 /* recv.c */
 void rds_inc_init(struct rds_incoming *inc, struct rds_connection *conn,
-		  __be32 saddr);
+		  struct in6_addr *saddr);
 void rds_inc_path_init(struct rds_incoming *inc, struct rds_conn_path *conn,
-		       __be32 saddr);
+		       struct in6_addr *saddr);
 void rds_inc_put(struct rds_incoming *inc);
-void rds_recv_incoming(struct rds_connection *conn, __be32 saddr, __be32 daddr,
+void rds_recv_incoming(struct rds_connection *conn, struct in6_addr *saddr,
+		       struct in6_addr *daddr,
 		       struct rds_incoming *inc, gfp_t gfp);
 int rds_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 		int msg_flags);
@@ -859,7 +879,7 @@ void rds_inc_info_copy(struct rds_incoming *inc,
 void rds_send_path_reset(struct rds_conn_path *conn);
 int rds_send_xmit(struct rds_conn_path *cp);
 struct sockaddr_in;
-void rds_send_drop_to(struct rds_sock *rs, struct sockaddr_in *dest);
+void rds_send_drop_to(struct rds_sock *rs, struct sockaddr_in6 *dest);
 typedef int (*is_acked_func)(struct rds_message *rm, uint64_t ack);
 void rds_send_drop_acked(struct rds_connection *conn, u64 ack,
 			 is_acked_func is_acked);
@@ -946,11 +966,14 @@ void rds_stats_info_copy(struct rds_info_iterator *iter,
 void rds_recv_worker(struct work_struct *);
 void rds_connect_path_complete(struct rds_conn_path *conn, int curr);
 void rds_connect_complete(struct rds_connection *conn);
+int rds_addr_cmp(const struct in6_addr *a1, const struct in6_addr *a2);
 
 /* transport.c */
 void rds_trans_register(struct rds_transport *trans);
 void rds_trans_unregister(struct rds_transport *trans);
-struct rds_transport *rds_trans_get_preferred(struct net *net, __be32 addr);
+struct rds_transport *rds_trans_get_preferred(struct net *net,
+					      const struct in6_addr *addr,
+					      __u32 scope_id);
 void rds_trans_put(struct rds_transport *trans);
 unsigned int rds_trans_stats_info_copy(struct rds_info_iterator *iter,
 				       unsigned int avail);
diff --git a/net/rds/recv.c b/net/rds/recv.c
index 192ac6f..4217961 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2006 Oracle.  All rights reserved.
+ * Copyright (c) 2006, 2018 Oracle and/or its affiliates. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -41,14 +41,14 @@
 #include "rds.h"
 
 void rds_inc_init(struct rds_incoming *inc, struct rds_connection *conn,
-		  __be32 saddr)
+		 struct in6_addr *saddr)
 {
 	int i;
 
 	refcount_set(&inc->i_refcount, 1);
 	INIT_LIST_HEAD(&inc->i_item);
 	inc->i_conn = conn;
-	inc->i_saddr = saddr;
+	inc->i_saddr = *saddr;
 	inc->i_rdma_cookie = 0;
 	inc->i_rx_tstamp.tv_sec = 0;
 	inc->i_rx_tstamp.tv_usec = 0;
@@ -59,13 +59,13 @@ void rds_inc_init(struct rds_incoming *inc, struct rds_connection *conn,
 EXPORT_SYMBOL_GPL(rds_inc_init);
 
 void rds_inc_path_init(struct rds_incoming *inc, struct rds_conn_path *cp,
-		       __be32 saddr)
+		       struct in6_addr  *saddr)
 {
 	refcount_set(&inc->i_refcount, 1);
 	INIT_LIST_HEAD(&inc->i_item);
 	inc->i_conn = cp->cp_conn;
 	inc->i_conn_path = cp;
-	inc->i_saddr = saddr;
+	inc->i_saddr = *saddr;
 	inc->i_rdma_cookie = 0;
 	inc->i_rx_tstamp.tv_sec = 0;
 	inc->i_rx_tstamp.tv_usec = 0;
@@ -110,7 +110,7 @@ static void rds_recv_rcvbuf_delta(struct rds_sock *rs, struct sock *sk,
 
 	now_congested = rs->rs_rcv_bytes > rds_sk_rcvbuf(rs);
 
-	rdsdebug("rs %p (%pI4:%u) recv bytes %d buf %d "
+	rdsdebug("rs %p (%pI6c:%u) recv bytes %d buf %d "
 	  "now_cong %d delta %d\n",
 	  rs, &rs->rs_bound_addr,
 	  ntohs(rs->rs_bound_port), rs->rs_rcv_bytes,
@@ -260,7 +260,7 @@ static void rds_start_mprds(struct rds_connection *conn)
 	struct rds_conn_path *cp;
 
 	if (conn->c_npaths > 1 &&
-	    IS_CANONICAL(conn->c_laddr, conn->c_faddr)) {
+	    rds_addr_cmp(&conn->c_laddr, &conn->c_faddr) < 0) {
 		for (i = 0; i < conn->c_npaths; i++) {
 			cp = &conn->c_path[i];
 			rds_conn_path_connect_if_down(cp);
@@ -284,7 +284,8 @@ static void rds_start_mprds(struct rds_connection *conn)
  * conn.  This lets loopback, who only has one conn for both directions,
  * tell us which roles the addrs in the conn are playing for this message.
  */
-void rds_recv_incoming(struct rds_connection *conn, __be32 saddr, __be32 daddr,
+void rds_recv_incoming(struct rds_connection *conn, struct in6_addr *saddr,
+		       struct in6_addr *daddr,
 		       struct rds_incoming *inc, gfp_t gfp)
 {
 	struct rds_sock *rs = NULL;
@@ -339,7 +340,8 @@ void rds_recv_incoming(struct rds_connection *conn, __be32 saddr, __be32 daddr,
 
 	if (rds_sysctl_ping_enable && inc->i_hdr.h_dport == 0) {
 		if (inc->i_hdr.h_sport == 0) {
-			rdsdebug("ignore ping with 0 sport from 0x%x\n", saddr);
+			rdsdebug("ignore ping with 0 sport from %pI6c\n",
+				 saddr);
 			goto out;
 		}
 		rds_stats_inc(s_recv_ping);
@@ -362,7 +364,7 @@ void rds_recv_incoming(struct rds_connection *conn, __be32 saddr, __be32 daddr,
 		goto out;
 	}
 
-	rs = rds_find_bound(daddr, inc->i_hdr.h_dport);
+	rs = rds_find_bound(daddr, inc->i_hdr.h_dport, conn->c_dev_if);
 	if (!rs) {
 		rds_stats_inc(s_recv_drop_no_sock);
 		goto out;
@@ -625,6 +627,7 @@ int rds_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 	struct rds_sock *rs = rds_sk_to_rs(sk);
 	long timeo;
 	int ret = 0, nonblock = msg_flags & MSG_DONTWAIT;
+	DECLARE_SOCKADDR(struct sockaddr_in6 *, sin6, msg->msg_name);
 	DECLARE_SOCKADDR(struct sockaddr_in *, sin, msg->msg_name);
 	struct rds_incoming *inc = NULL;
 
@@ -673,7 +676,7 @@ int rds_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 			break;
 		}
 
-		rdsdebug("copying inc %p from %pI4:%u to user\n", inc,
+		rdsdebug("copying inc %p from %pI6c:%u to user\n", inc,
 			 &inc->i_conn->c_faddr,
 			 ntohs(inc->i_hdr.h_sport));
 		ret = inc->i_conn->c_trans->inc_copy_to_user(inc, &msg->msg_iter);
@@ -707,12 +710,26 @@ int rds_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 
 		rds_stats_inc(s_recv_delivered);
 
-		if (sin) {
-			sin->sin_family = AF_INET;
-			sin->sin_port = inc->i_hdr.h_sport;
-			sin->sin_addr.s_addr = inc->i_saddr;
-			memset(sin->sin_zero, 0, sizeof(sin->sin_zero));
-			msg->msg_namelen = sizeof(*sin);
+		if (msg->msg_name) {
+			if (ipv6_addr_v4mapped(&inc->i_saddr)) {
+				sin = (struct sockaddr_in *)msg->msg_name;
+
+				sin->sin_family = AF_INET;
+				sin->sin_port = inc->i_hdr.h_sport;
+				sin->sin_addr.s_addr =
+				    inc->i_saddr.s6_addr32[3];
+				memset(sin->sin_zero, 0, sizeof(sin->sin_zero));
+				msg->msg_namelen = sizeof(*sin);
+			} else {
+				sin6 = (struct sockaddr_in6 *)msg->msg_name;
+
+				sin6->sin6_family = AF_INET6;
+				sin6->sin6_port = inc->i_hdr.h_sport;
+				sin6->sin6_addr = inc->i_saddr;
+				sin6->sin6_flowinfo = 0;
+				sin6->sin6_scope_id = rs->rs_bound_scope_id;
+				msg->msg_namelen = sizeof(*sin6);
+			}
 		}
 		break;
 	}
diff --git a/net/rds/send.c b/net/rds/send.c
index 94c7f74..6ed2e92 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -709,7 +709,7 @@ void rds_send_drop_acked(struct rds_connection *conn, u64 ack,
 }
 EXPORT_SYMBOL_GPL(rds_send_drop_acked);
 
-void rds_send_drop_to(struct rds_sock *rs, struct sockaddr_in *dest)
+void rds_send_drop_to(struct rds_sock *rs, struct sockaddr_in6 *dest)
 {
 	struct rds_message *rm, *tmp;
 	struct rds_connection *conn;
@@ -721,8 +721,9 @@ void rds_send_drop_to(struct rds_sock *rs, struct sockaddr_in *dest)
 	spin_lock_irqsave(&rs->rs_lock, flags);
 
 	list_for_each_entry_safe(rm, tmp, &rs->rs_send_queue, m_sock_item) {
-		if (dest && (dest->sin_addr.s_addr != rm->m_daddr ||
-			     dest->sin_port != rm->m_inc.i_hdr.h_dport))
+		if (dest &&
+		    (!ipv6_addr_equal(&dest->sin6_addr, &rm->m_daddr) ||
+		     dest->sin6_port != rm->m_inc.i_hdr.h_dport))
 			continue;
 
 		list_move(&rm->m_sock_item, &list);
@@ -1059,8 +1060,8 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
 {
 	struct sock *sk = sock->sk;
 	struct rds_sock *rs = rds_sk_to_rs(sk);
+	DECLARE_SOCKADDR(struct sockaddr_in6 *, sin6, msg->msg_name);
 	DECLARE_SOCKADDR(struct sockaddr_in *, usin, msg->msg_name);
-	__be32 daddr;
 	__be16 dport;
 	struct rds_message *rm = NULL;
 	struct rds_connection *conn;
@@ -1069,10 +1070,13 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
 	int nonblock = msg->msg_flags & MSG_DONTWAIT;
 	long timeo = sock_sndtimeo(sk, nonblock);
 	struct rds_conn_path *cpath;
+	struct in6_addr daddr;
+	__u32 scope_id = 0;
 	size_t total_payload_len = payload_len, rdma_payload_len = 0;
 	bool zcopy = ((msg->msg_flags & MSG_ZEROCOPY) &&
 		      sock_flag(rds_rs_to_sk(rs), SOCK_ZEROCOPY));
 	int num_sgs = ceil(payload_len, PAGE_SIZE);
+	int namelen;
 
 	/* Mirror Linux UDP mirror of BSD error message compatibility */
 	/* XXX: Perhaps MSG_MORE someday */
@@ -1081,27 +1085,59 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
 		goto out;
 	}
 
-	if (msg->msg_namelen) {
-		/* XXX fail non-unicast destination IPs? */
-		if (msg->msg_namelen < sizeof(*usin) || usin->sin_family != AF_INET) {
+	namelen = msg->msg_namelen;
+	if (namelen != 0) {
+		if (namelen < sizeof(*usin)) {
+			ret = -EINVAL;
+			goto out;
+		}
+		switch (namelen) {
+		case sizeof(*usin):
+			if (usin->sin_family != AF_INET ||
+			    usin->sin_addr.s_addr == htonl(INADDR_ANY) ||
+			    usin->sin_addr.s_addr == htonl(INADDR_BROADCAST) ||
+			    IN_MULTICAST(ntohl(usin->sin_addr.s_addr))) {
+				ret = -EINVAL;
+				goto out;
+			}
+			ipv6_addr_set_v4mapped(usin->sin_addr.s_addr, &daddr);
+			dport = usin->sin_port;
+			break;
+
+		case sizeof(*sin6): {
+			ret = -EPROTONOSUPPORT;
+			goto out;
+		}
+
+		default:
 			ret = -EINVAL;
 			goto out;
 		}
-		daddr = usin->sin_addr.s_addr;
-		dport = usin->sin_port;
 	} else {
 		/* We only care about consistency with ->connect() */
 		lock_sock(sk);
 		daddr = rs->rs_conn_addr;
 		dport = rs->rs_conn_port;
+		scope_id = rs->rs_bound_scope_id;
 		release_sock(sk);
 	}
 
 	lock_sock(sk);
-	if (daddr == 0 || rs->rs_bound_addr == 0) {
+	if (ipv6_addr_any(&rs->rs_bound_addr) || ipv6_addr_any(&daddr)) {
 		release_sock(sk);
-		ret = -ENOTCONN; /* XXX not a great errno */
+		ret = -ENOTCONN;
 		goto out;
+	} else if (namelen != 0) {
+		/* Cannot send to an IPv4 address using an IPv6 source
+		 * address and cannot send to an IPv6 address using an
+		 * IPv4 source address.
+		 */
+		if (ipv6_addr_v4mapped(&daddr) ^
+		    ipv6_addr_v4mapped(&rs->rs_bound_addr)) {
+			release_sock(sk);
+			ret = -EOPNOTSUPP;
+			goto out;
+		}
 	}
 	release_sock(sk);
 
@@ -1155,13 +1191,14 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
 
 	/* rds_conn_create has a spinlock that runs with IRQ off.
 	 * Caching the conn in the socket helps a lot. */
-	if (rs->rs_conn && rs->rs_conn->c_faddr == daddr)
+	if (rs->rs_conn && ipv6_addr_equal(&rs->rs_conn->c_faddr, &daddr))
 		conn = rs->rs_conn;
 	else {
 		conn = rds_conn_create_outgoing(sock_net(sock->sk),
-						rs->rs_bound_addr, daddr,
-					rs->rs_transport,
-					sock->sk->sk_allocation);
+						&rs->rs_bound_addr, &daddr,
+						rs->rs_transport,
+						sock->sk->sk_allocation,
+						scope_id);
 		if (IS_ERR(conn)) {
 			ret = PTR_ERR(conn);
 			goto out;
diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index 351a284..dadb337 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2006 Oracle.  All rights reserved.
+ * Copyright (c) 2006, 2017 Oracle and/or its affiliates. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -37,6 +37,8 @@
 #include <net/tcp.h>
 #include <net/net_namespace.h>
 #include <net/netns/generic.h>
+#include <net/tcp.h>
+#include <net/addrconf.h>
 
 #include "rds.h"
 #include "tcp.h"
@@ -262,9 +264,33 @@ static void rds_tcp_tc_info(struct socket *rds_sock, unsigned int len,
 	spin_unlock_irqrestore(&rds_tcp_tc_list_lock, flags);
 }
 
-static int rds_tcp_laddr_check(struct net *net, __be32 addr)
+static int rds_tcp_laddr_check(struct net *net, const struct in6_addr *addr,
+			       __u32 scope_id)
 {
-	if (inet_addr_type(net, addr) == RTN_LOCAL)
+	struct net_device *dev = NULL;
+	int ret;
+
+	if (ipv6_addr_v4mapped(addr)) {
+		if (inet_addr_type(net, addr->s6_addr32[3]) == RTN_LOCAL)
+			return 0;
+		return -EADDRNOTAVAIL;
+	}
+
+	/* If the scope_id is specified, check only those addresses
+	 * hosted on the specified interface.
+	 */
+	if (scope_id != 0) {
+		rcu_read_lock();
+		dev = dev_get_by_index_rcu(net, scope_id);
+		/* scope_id is not valid... */
+		if (!dev) {
+			rcu_read_unlock();
+			return -EADDRNOTAVAIL;
+		}
+		rcu_read_unlock();
+	}
+	ret = ipv6_chk_addr(net, addr, dev, 0);
+	if (ret)
 		return 0;
 	return -EADDRNOTAVAIL;
 }
diff --git a/net/rds/tcp_connect.c b/net/rds/tcp_connect.c
index d999e70..231ae92 100644
--- a/net/rds/tcp_connect.c
+++ b/net/rds/tcp_connect.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2006 Oracle.  All rights reserved.
+ * Copyright (c) 2006, 2017 Oracle and/or its affiliates. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -66,7 +66,8 @@ void rds_tcp_state_change(struct sock *sk)
 		 * RDS connection as RDS_CONN_UP until the reconnect,
 		 * to avoid RDS datagram loss.
 		 */
-		if (!IS_CANONICAL(cp->cp_conn->c_laddr, cp->cp_conn->c_faddr) &&
+		if (rds_addr_cmp(&cp->cp_conn->c_laddr,
+				 &cp->cp_conn->c_faddr) >= 0 &&
 		    rds_conn_path_transition(cp, RDS_CONN_CONNECTING,
 					     RDS_CONN_ERROR)) {
 			rds_conn_path_drop(cp, false);
@@ -88,7 +89,9 @@ void rds_tcp_state_change(struct sock *sk)
 int rds_tcp_conn_path_connect(struct rds_conn_path *cp)
 {
 	struct socket *sock = NULL;
-	struct sockaddr_in src, dest;
+	struct sockaddr_in sin;
+	struct sockaddr *addr;
+	int addrlen;
 	int ret;
 	struct rds_connection *conn = cp->cp_conn;
 	struct rds_tcp_connection *tc = cp->cp_transport_data;
@@ -112,30 +115,33 @@ int rds_tcp_conn_path_connect(struct rds_conn_path *cp)
 
 	rds_tcp_tune(sock);
 
-	src.sin_family = AF_INET;
-	src.sin_addr.s_addr = (__force u32)conn->c_laddr;
-	src.sin_port = (__force u16)htons(0);
+	sin.sin_family = AF_INET;
+	sin.sin_addr.s_addr = conn->c_laddr.s6_addr32[3];
+	sin.sin_port = 0;
+	addr = (struct sockaddr *)&sin;
+	addrlen = sizeof(sin);
 
-	ret = sock->ops->bind(sock, (struct sockaddr *)&src, sizeof(src));
+	ret = sock->ops->bind(sock, addr, addrlen);
 	if (ret) {
-		rdsdebug("bind failed with %d at address %pI4\n",
+		rdsdebug("bind failed with %d at address %pI6c\n",
 			 ret, &conn->c_laddr);
 		goto out;
 	}
 
-	dest.sin_family = AF_INET;
-	dest.sin_addr.s_addr = (__force u32)conn->c_faddr;
-	dest.sin_port = (__force u16)htons(RDS_TCP_PORT);
+	sin.sin_family = AF_INET;
+	sin.sin_addr.s_addr = conn->c_faddr.s6_addr32[3];
+	sin.sin_port = htons(RDS_TCP_PORT);
+	addr = (struct sockaddr *)&sin;
+	addrlen = sizeof(sin);
 
 	/*
 	 * once we call connect() we can start getting callbacks and they
 	 * own the socket
 	 */
 	rds_tcp_set_callbacks(sock, cp);
-	ret = sock->ops->connect(sock, (struct sockaddr *)&dest, sizeof(dest),
-				 O_NONBLOCK);
+	ret = sock->ops->connect(sock, addr, addrlen, O_NONBLOCK);
 
-	rdsdebug("connect to address %pI4 returned %d\n", &conn->c_faddr, ret);
+	rdsdebug("connect to address %pI6c returned %d\n", &conn->c_faddr, ret);
 	if (ret == -EINPROGRESS)
 		ret = 0;
 	if (ret == 0) {
diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
index 2257118..4fdf5b3 100644
--- a/net/rds/tcp_listen.c
+++ b/net/rds/tcp_listen.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2006, 2018 Oracle.  All rights reserved.
+ * Copyright (c) 2006, 2018 Oracle and/or its affiliates. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -83,13 +83,12 @@ int rds_tcp_keepalive(struct socket *sock)
 struct rds_tcp_connection *rds_tcp_accept_one_path(struct rds_connection *conn)
 {
 	int i;
-	bool peer_is_smaller = IS_CANONICAL(conn->c_faddr, conn->c_laddr);
 	int npaths = max_t(int, 1, conn->c_npaths);
 
 	/* for mprds, all paths MUST be initiated by the peer
 	 * with the smaller address.
 	 */
-	if (!peer_is_smaller) {
+	if (rds_addr_cmp(&conn->c_faddr, &conn->c_laddr) >= 0) {
 		/* Make sure we initiate at least one path if this
 		 * has not already been done; rds_start_mprds() will
 		 * take care of additional paths, if necessary.
@@ -164,13 +163,16 @@ int rds_tcp_accept_one(struct socket *sock)
 
 	inet = inet_sk(new_sock->sk);
 
-	rdsdebug("accepted tcp %pI4:%u -> %pI4:%u\n",
-		 &inet->inet_saddr, ntohs(inet->inet_sport),
-		 &inet->inet_daddr, ntohs(inet->inet_dport));
+	rdsdebug("accepted tcp %pI6c:%u -> %pI6c:%u\n",
+		 &new_sock->sk->sk_v6_rcv_saddr, ntohs(inet->inet_sport),
+		 &new_sock->sk->sk_v6_daddr, ntohs(inet->inet_dport));
 
 	conn = rds_conn_create(sock_net(sock->sk),
-			       inet->inet_saddr, inet->inet_daddr,
-			       &rds_tcp_transport, GFP_KERNEL);
+			       &new_sock->sk->sk_v6_rcv_saddr,
+			       &new_sock->sk->sk_v6_daddr,
+			       &rds_tcp_transport, GFP_KERNEL,
+			       new_sock->sk->sk_bound_dev_if);
+
 	if (IS_ERR(conn)) {
 		ret = PTR_ERR(conn);
 		goto out;
diff --git a/net/rds/tcp_recv.c b/net/rds/tcp_recv.c
index b9fbd2e..42c5ff1 100644
--- a/net/rds/tcp_recv.c
+++ b/net/rds/tcp_recv.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2006 Oracle.  All rights reserved.
+ * Copyright (c) 2006, 2017 Oracle and/or its affiliates. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -179,7 +179,7 @@ static int rds_tcp_data_recv(read_descriptor_t *desc, struct sk_buff *skb,
 			tc->t_tinc = tinc;
 			rdsdebug("alloced tinc %p\n", tinc);
 			rds_inc_path_init(&tinc->ti_inc, cp,
-					  cp->cp_conn->c_faddr);
+					  &cp->cp_conn->c_faddr);
 			tinc->ti_inc.i_rx_lat_trace[RDS_MSG_RX_HDR] =
 					local_clock();
 
@@ -239,8 +239,9 @@ static int rds_tcp_data_recv(read_descriptor_t *desc, struct sk_buff *skb,
 			if (tinc->ti_inc.i_hdr.h_flags == RDS_FLAG_CONG_BITMAP)
 				rds_tcp_cong_recv(conn, tinc);
 			else
-				rds_recv_incoming(conn, conn->c_faddr,
-						  conn->c_laddr, &tinc->ti_inc,
+				rds_recv_incoming(conn, &conn->c_faddr,
+						  &conn->c_laddr,
+						  &tinc->ti_inc,
 						  arg->gfp);
 
 			tc->t_tinc_hdr_rem = sizeof(struct rds_header);
diff --git a/net/rds/tcp_send.c b/net/rds/tcp_send.c
index 7df869d..78a2554 100644
--- a/net/rds/tcp_send.c
+++ b/net/rds/tcp_send.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2006 Oracle.  All rights reserved.
+ * Copyright (c) 2006, 2017 Oracle and/or its affiliates. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -153,7 +153,7 @@ int rds_tcp_xmit(struct rds_connection *conn, struct rds_message *rm,
 			 * an incoming RST.
 			 */
 			if (rds_conn_path_up(cp)) {
-				pr_warn("RDS/tcp: send to %pI4 on cp [%d]"
+				pr_warn("RDS/tcp: send to %pI6c on cp [%d]"
 					"returned %d, "
 					"disconnecting and reconnecting\n",
 					&conn->c_faddr, cp->cp_index, ret);
diff --git a/net/rds/threads.c b/net/rds/threads.c
index c52861d..e64f9e4 100644
--- a/net/rds/threads.c
+++ b/net/rds/threads.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2006 Oracle.  All rights reserved.
+ * Copyright (c) 2006, 2018 Oracle and/or its affiliates. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -82,8 +82,8 @@ void rds_connect_path_complete(struct rds_conn_path *cp, int curr)
 		return;
 	}
 
-	rdsdebug("conn %p for %pI4 to %pI4 complete\n",
-	  cp->cp_conn, &cp->cp_conn->c_laddr, &cp->cp_conn->c_faddr);
+	rdsdebug("conn %p for %pI6c to %pI6c complete\n",
+		 cp->cp_conn, &cp->cp_conn->c_laddr, &cp->cp_conn->c_faddr);
 
 	cp->cp_reconnect_jiffies = 0;
 	set_bit(0, &cp->cp_conn->c_map_queued);
@@ -125,13 +125,13 @@ void rds_queue_reconnect(struct rds_conn_path *cp)
 	unsigned long rand;
 	struct rds_connection *conn = cp->cp_conn;
 
-	rdsdebug("conn %p for %pI4 to %pI4 reconnect jiffies %lu\n",
-	  conn, &conn->c_laddr, &conn->c_faddr,
-	  cp->cp_reconnect_jiffies);
+	rdsdebug("conn %p for %pI6c to %pI6c reconnect jiffies %lu\n",
+		 conn, &conn->c_laddr, &conn->c_faddr,
+		 cp->cp_reconnect_jiffies);
 
 	/* let peer with smaller addr initiate reconnect, to avoid duels */
 	if (conn->c_trans->t_type == RDS_TRANS_TCP &&
-	    !IS_CANONICAL(conn->c_laddr, conn->c_faddr))
+	    rds_addr_cmp(&conn->c_laddr, &conn->c_faddr) >= 0)
 		return;
 
 	set_bit(RDS_RECONNECT_PENDING, &cp->cp_flags);
@@ -145,7 +145,7 @@ void rds_queue_reconnect(struct rds_conn_path *cp)
 	}
 
 	get_random_bytes(&rand, sizeof(rand));
-	rdsdebug("%lu delay %lu ceil conn %p for %pI4 -> %pI4\n",
+	rdsdebug("%lu delay %lu ceil conn %p for %pI6c -> %pI6c\n",
 		 rand % cp->cp_reconnect_jiffies, cp->cp_reconnect_jiffies,
 		 conn, &conn->c_laddr, &conn->c_faddr);
 	rcu_read_lock();
@@ -167,14 +167,14 @@ void rds_connect_worker(struct work_struct *work)
 	int ret;
 
 	if (cp->cp_index > 0 &&
-	    !IS_CANONICAL(cp->cp_conn->c_laddr, cp->cp_conn->c_faddr))
+	    rds_addr_cmp(&cp->cp_conn->c_laddr, &cp->cp_conn->c_faddr) >= 0)
 		return;
 	clear_bit(RDS_RECONNECT_PENDING, &cp->cp_flags);
 	ret = rds_conn_path_transition(cp, RDS_CONN_DOWN, RDS_CONN_CONNECTING);
 	if (ret) {
 		ret = conn->c_trans->conn_path_connect(cp);
-		rdsdebug("conn %p for %pI4 to %pI4 dispatched, ret %d\n",
-			conn, &conn->c_laddr, &conn->c_faddr, ret);
+		rdsdebug("conn %p for %pI6c to %pI6c dispatched, ret %d\n",
+			 conn, &conn->c_laddr, &conn->c_faddr, ret);
 
 		if (ret) {
 			if (rds_conn_path_transition(cp,
@@ -259,3 +259,50 @@ int rds_threads_init(void)
 
 	return 0;
 }
+
+/* Compare two IPv6 addresses.  Return 0 if the two addresses are equal.
+ * Return 1 if the first is greater.  Return -1 if the second is greater.
+ */
+int rds_addr_cmp(const struct in6_addr *addr1,
+		 const struct in6_addr *addr2)
+{
+#if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && BITS_PER_LONG == 64
+	const __be64 *a1, *a2;
+	u64 x, y;
+
+	a1 = (__be64 *)addr1;
+	a2 = (__be64 *)addr2;
+
+	if (*a1 != *a2) {
+		if (be64_to_cpu(*a1) < be64_to_cpu(*a2))
+			return -1;
+		else
+			return 1;
+	} else {
+		x = be64_to_cpu(*++a1);
+		y = be64_to_cpu(*++a2);
+		if (x < y)
+			return -1;
+		else if (x > y)
+			return 1;
+		else
+			return 0;
+	}
+#else
+	u32 a, b;
+	int i;
+
+	for (i = 0; i < 4; i++) {
+		if (addr1->s6_addr32[i] != addr2->s6_addr32[i]) {
+			a = ntohl(addr1->s6_addr32[i]);
+			b = ntohl(addr2->s6_addr32[i]);
+			if (a < b)
+				return -1;
+			else if (a > b)
+				return 1;
+		}
+	}
+	return 0;
+#endif
+}
+EXPORT_SYMBOL_GPL(rds_addr_cmp);
diff --git a/net/rds/transport.c b/net/rds/transport.c
index 0b188dd..c9788db 100644
--- a/net/rds/transport.c
+++ b/net/rds/transport.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2006 Oracle.  All rights reserved.
+ * Copyright (c) 2006, 2017 Oracle and/or its affiliates. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -33,6 +33,7 @@
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/in.h>
+#include <linux/ipv6.h>
 
 #include "rds.h"
 #include "loop.h"
@@ -75,20 +76,26 @@ void rds_trans_put(struct rds_transport *trans)
 		module_put(trans->t_owner);
 }
 
-struct rds_transport *rds_trans_get_preferred(struct net *net, __be32 addr)
+struct rds_transport *rds_trans_get_preferred(struct net *net,
+					      const struct in6_addr *addr,
+					      __u32 scope_id)
 {
 	struct rds_transport *ret = NULL;
 	struct rds_transport *trans;
 	unsigned int i;
 
-	if (IN_LOOPBACK(ntohl(addr)))
+	if (ipv6_addr_v4mapped(addr)) {
+		if (*(u_int8_t *)&addr->s6_addr32[3] == IN_LOOPBACKNET)
+			return &rds_loop_transport;
+	} else if (ipv6_addr_loopback(addr)) {
 		return &rds_loop_transport;
+	}
 
 	down_read(&rds_trans_sem);
 	for (i = 0; i < RDS_TRANS_COUNT; i++) {
 		trans = transports[i];
 
-		if (trans && (trans->laddr_check(net, addr) == 0) &&
+		if (trans && (trans->laddr_check(net, addr, scope_id) == 0) &&
 		    (!trans->t_owner || try_module_get(trans->t_owner))) {
 			ret = trans;
 			break;
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH v2 net-next 2/3] rds: Enable RDS IPv6 support
From: Ka-Cheong Poon @ 2018-06-27 10:23 UTC (permalink / raw)
  To: netdev; +Cc: santosh.shilimkar, davem, rds-devel
In-Reply-To: <cover.1530086216.git.ka-cheong.poon@oracle.com>

This patch enables RDS to use IPv6 addresses. For RDS/TCP, the
listener is now an IPv6 endpoint which accepts both IPv4 and IPv6
connection requests.  RDS/RDMA/IB uses a private data (struct
rds_ib_connect_private) exchange between endpoints at RDS connection
establishment time to support RDMA. This private data exchange uses a
32 bit integer to represent an IP address. This needs to be changed in
order to support IPv6. A new private data struct
rds6_ib_connect_private is introduced to handle this. To ensure
backward compatibility, an IPv6 capable RDS stack uses another RDMA
listener port (RDS_CM_PORT) to accept IPv6 connection. And it
continues to use the original RDS_PORT for IPv4 RDS connections. When
it needs to communicate with an IPv6 peer, it uses the RDS_CM_PORT to
send the connection set up request.

v2: Fixed bound and peer address scope mismatched issue.
    Added back rds_connect() IPv6 changes.

Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
---
 net/rds/af_rds.c         | 28 +++++++++++++++++++++++-
 net/rds/bind.c           | 34 +++++++++++++++++++++++++-----
 net/rds/connection.c     | 43 ++++++++++++++++++++++++-------------
 net/rds/ib.c             | 55 +++++++++++++++++++++++++++++++++++++++++-------
 net/rds/ib_cm.c          | 13 ++++++------
 net/rds/rdma_transport.c | 32 ++++++++++++++++++++++++++--
 net/rds/rdma_transport.h |  2 ++
 net/rds/rds.h            | 12 ++++++-----
 net/rds/send.c           | 32 ++++++++++++++++++++++++++--
 net/rds/tcp.c            | 54 +++++++++++++++++++++++++++++------------------
 net/rds/tcp.h            |  4 +---
 net/rds/tcp_connect.c    | 54 ++++++++++++++++++++++++++++++++++++-----------
 net/rds/tcp_listen.c     | 40 +++++++++++++++++++++++++++--------
 13 files changed, 315 insertions(+), 88 deletions(-)

diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index fc1a5c6..8ce2d92 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -484,7 +484,9 @@ static int rds_connect(struct socket *sock, struct sockaddr *uaddr,
 {
 	struct sock *sk = sock->sk;
 	struct sockaddr_in *sin;
+	struct sockaddr_in6 *sin6;
 	struct rds_sock *rs = rds_sk_to_rs(sk);
+	int addr_type;
 	int ret = 0;
 
 	lock_sock(sk);
@@ -510,7 +512,31 @@ static int rds_connect(struct socket *sock, struct sockaddr *uaddr,
 		break;
 
 	case sizeof(struct sockaddr_in6):
-		ret = -EPROTONOSUPPORT;
+		sin6 = (struct sockaddr_in6 *)uaddr;
+		if (sin6->sin6_family != AF_INET6) {
+			ret = -EAFNOSUPPORT;
+			break;
+		}
+		addr_type = ipv6_addr_type(&sin6->sin6_addr);
+		if (!(addr_type & IPV6_ADDR_UNICAST)) {
+			ret = -EPROTOTYPE;
+			break;
+		}
+		if (addr_type & IPV6_ADDR_LINKLOCAL) {
+			if (sin6->sin6_scope_id == 0 ||
+			    (!ipv6_addr_any(&rs->rs_bound_addr) &&
+			     sin6->sin6_scope_id != rs->rs_bound_scope_id)) {
+				ret = -EINVAL;
+				break;
+			}
+			/* Remember the connected address scope ID.  It will
+			 * be checked against the binding local address when
+			 * the socket is bound.
+			 */
+			rs->rs_bound_scope_id = sin6->sin6_scope_id;
+		}
+		rs->rs_conn_addr = sin6->sin6_addr;
+		rs->rs_conn_port = sin6->sin6_port;
 		break;
 
 	default:
diff --git a/net/rds/bind.c b/net/rds/bind.c
index 3822886..6e6e4ea 100644
--- a/net/rds/bind.c
+++ b/net/rds/bind.c
@@ -127,9 +127,10 @@ static int rds_add_bound(struct rds_sock *rs, const struct in6_addr *addr,
 		if (!rhashtable_insert_fast(&bind_hash_table,
 					    &rs->rs_bound_node, ht_parms)) {
 			*port = rs->rs_bound_port;
+			rs->rs_bound_scope_id = scope_id;
 			ret = 0;
-			rdsdebug("rs %p binding to %pI4:%d\n",
-			  rs, &addr, (int)ntohs(*port));
+			rdsdebug("rs %p binding to %pI6c:%d\n",
+				 rs, addr, (int)ntohs(*port));
 			break;
 		} else {
 			rs->rs_bound_addr = in6addr_any;
@@ -164,11 +165,12 @@ int rds_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 	struct in6_addr v6addr, *binding_addr;
 	struct rds_transport *trans;
 	__u32 scope_id = 0;
+	int addr_type;
 	int ret = 0;
 	__be16 port;
 
-	/* We only allow an RDS socket to be bound to and IPv4 address. IPv6
-	 * address support will be added later.
+	/* We allow an RDS socket to be bound to either IPv4 or IPv6
+	 * address.
 	 */
 	if (addr_len == sizeof(struct sockaddr_in)) {
 		struct sockaddr_in *sin = (struct sockaddr_in *)uaddr;
@@ -180,7 +182,21 @@ int rds_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 		binding_addr = &v6addr;
 		port = sin->sin_port;
 	} else if (addr_len == sizeof(struct sockaddr_in6)) {
-		return -EPROTONOSUPPORT;
+		struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)uaddr;
+
+		addr_type = ipv6_addr_type(&sin6->sin6_addr);
+		if (sin6->sin6_family != AF_INET6 ||
+		    !(addr_type & IPV6_ADDR_UNICAST)) {
+			return -EINVAL;
+		}
+		/* The scope ID must be specified for link local address. */
+		if (addr_type & IPV6_ADDR_LINKLOCAL) {
+			if (sin6->sin6_scope_id == 0)
+				return -EINVAL;
+			scope_id = sin6->sin6_scope_id;
+		}
+		binding_addr = &sin6->sin6_addr;
+		port = sin6->sin6_port;
 	} else {
 		return -EINVAL;
 	}
@@ -191,6 +207,14 @@ int rds_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 		ret = -EINVAL;
 		goto out;
 	}
+	/* Socket is connected.  The binding address should have the same
+	 * scope ID as the connected address.
+	 */
+	if (!ipv6_addr_any(&rs->rs_conn_addr) &&
+	    scope_id != rs->rs_bound_scope_id) {
+		ret = -EINVAL;
+		goto out;
+	}
 
 	ret = rds_add_bound(rs, binding_addr, &port, scope_id);
 	if (ret)
diff --git a/net/rds/connection.c b/net/rds/connection.c
index ca72563..8c5d093 100644
--- a/net/rds/connection.c
+++ b/net/rds/connection.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2006, 2017 Oracle and/or its affiliates. All rights reserved.
+ * Copyright (c) 2006, 2018 Oracle and/or its affiliates. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -486,10 +486,17 @@ void rds_conn_destroy(struct rds_connection *conn)
 }
 EXPORT_SYMBOL_GPL(rds_conn_destroy);
 
-static void rds_conn_message_info(struct socket *sock, unsigned int len,
-				  struct rds_info_iterator *iter,
-				  struct rds_info_lengths *lens,
-				  int want_send)
+static void __rds_inc_msg_cp(struct rds_incoming *inc,
+			     struct rds_info_iterator *iter,
+			     void *saddr, void *daddr, int flip)
+{
+	rds_inc_info_copy(inc, iter, *(__be32 *)saddr, *(__be32 *)daddr, flip);
+}
+
+static void rds_conn_message_info_cmn(struct socket *sock, unsigned int len,
+				      struct rds_info_iterator *iter,
+				      struct rds_info_lengths *lens,
+				      int want_send)
 {
 	struct hlist_head *head;
 	struct list_head *list;
@@ -524,18 +531,13 @@ static void rds_conn_message_info(struct socket *sock, unsigned int len,
 
 				/* XXX too lazy to maintain counts.. */
 				list_for_each_entry(rm, list, m_conn_item) {
-					__be32 laddr;
-					__be32 faddr;
-
 					total++;
-					laddr = conn->c_laddr.s6_addr32[3];
-					faddr = conn->c_faddr.s6_addr32[3];
 					if (total <= len)
-						rds_inc_info_copy(&rm->m_inc,
-								  iter,
-								  laddr,
-								  faddr,
-								  0);
+						__rds_inc_msg_cp(&rm->m_inc,
+								 iter,
+								 &conn->c_laddr,
+								 &conn->c_faddr,
+								 0);
 				}
 
 				spin_unlock_irqrestore(&cp->cp_lock, flags);
@@ -548,6 +550,14 @@ static void rds_conn_message_info(struct socket *sock, unsigned int len,
 	lens->each = sizeof(struct rds_info_message);
 }
 
+static void rds_conn_message_info(struct socket *sock, unsigned int len,
+				  struct rds_info_iterator *iter,
+				  struct rds_info_lengths *lens,
+				  int want_send)
+{
+	rds_conn_message_info_cmn(sock, len, iter, lens, want_send);
+}
+
 static void rds_conn_message_info_send(struct socket *sock, unsigned int len,
 				       struct rds_info_iterator *iter,
 				       struct rds_info_lengths *lens)
@@ -655,6 +665,9 @@ static int rds_conn_info_visitor(struct rds_conn_path *cp, void *buffer)
 	struct rds_info_connection *cinfo = buffer;
 	struct rds_connection *conn = cp->cp_conn;
 
+	if (conn->c_isv6)
+		return 0;
+
 	cinfo->next_tx_seq = cp->cp_next_tx_seq;
 	cinfo->next_rx_seq = cp->cp_next_rx_seq;
 	cinfo->laddr = conn->c_laddr.s6_addr32[3];
diff --git a/net/rds/ib.c b/net/rds/ib.c
index c712a84..756225c 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2006, 2017 Oracle and/or its affiliates. All rights reserved.
+ * Copyright (c) 2006, 2018 Oracle and/or its affiliates. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -39,6 +39,7 @@
 #include <linux/delay.h>
 #include <linux/slab.h>
 #include <linux/module.h>
+#include <net/addrconf.h>
 
 #include "rds_single_path.h"
 #include "rds.h"
@@ -295,6 +296,8 @@ static int rds_ib_conn_info_visitor(struct rds_connection *conn,
 	/* We will only ever look at IB transports */
 	if (conn->c_trans != &rds_ib_transport)
 		return 0;
+	if (conn->c_isv6)
+		return 0;
 
 	iinfo->src_addr = conn->c_laddr.s6_addr32[3];
 	iinfo->dst_addr = conn->c_faddr.s6_addr32[3];
@@ -330,7 +333,6 @@ static void rds_ib_ic_info(struct socket *sock, unsigned int len,
 				sizeof(struct rds_info_rdma_connection));
 }
 
-
 /*
  * Early RDS/IB was built to only bind to an address if there is an IPoIB
  * device with that address set.
@@ -346,8 +348,12 @@ static int rds_ib_laddr_check(struct net *net, const struct in6_addr *addr,
 {
 	int ret;
 	struct rdma_cm_id *cm_id;
+	struct sockaddr_in6 sin6;
 	struct sockaddr_in sin;
+	struct sockaddr *sa;
+	bool isv4;
 
+	isv4 = ipv6_addr_v4mapped(addr);
 	/* Create a CMA ID and try to bind it. This catches both
 	 * IB and iWARP capable NICs.
 	 */
@@ -356,20 +362,53 @@ static int rds_ib_laddr_check(struct net *net, const struct in6_addr *addr,
 	if (IS_ERR(cm_id))
 		return PTR_ERR(cm_id);
 
-	memset(&sin, 0, sizeof(sin));
-	sin.sin_family = AF_INET;
-	sin.sin_addr.s_addr = addr->s6_addr32[3];
+	if (isv4) {
+		memset(&sin, 0, sizeof(sin));
+		sin.sin_family = AF_INET;
+		sin.sin_addr.s_addr = addr->s6_addr32[3];
+		sa = (struct sockaddr *)&sin;
+	} else {
+		memset(&sin6, 0, sizeof(sin6));
+		sin6.sin6_family = AF_INET6;
+		sin6.sin6_addr = *addr;
+		sin6.sin6_scope_id = scope_id;
+		sa = (struct sockaddr *)&sin6;
+
+		/* XXX Do a special IPv6 link local address check here.  The
+		 * reason is that rdma_bind_addr() always succeeds with IPv6
+		 * link local address regardless it is indeed configured in a
+		 * system.
+		 */
+		if (ipv6_addr_type(addr) & IPV6_ADDR_LINKLOCAL) {
+			struct net_device *dev;
+
+			if (scope_id == 0)
+				return -EADDRNOTAVAIL;
+
+			/* Use init_net for now as RDS is not network
+			 * name space aware.
+			 */
+			dev = dev_get_by_index(&init_net, scope_id);
+			if (!dev)
+				return -EADDRNOTAVAIL;
+			if (!ipv6_chk_addr(&init_net, addr, dev, 1)) {
+				dev_put(dev);
+				return -EADDRNOTAVAIL;
+			}
+			dev_put(dev);
+		}
+	}
 
 	/* rdma_bind_addr will only succeed for IB & iWARP devices */
-	ret = rdma_bind_addr(cm_id, (struct sockaddr *)&sin);
+	ret = rdma_bind_addr(cm_id, sa);
 	/* due to this, we will claim to support iWARP devices unless we
 	   check node_type. */
 	if (ret || !cm_id->device ||
 	    cm_id->device->node_type != RDMA_NODE_IB_CA)
 		ret = -EADDRNOTAVAIL;
 
-	rdsdebug("addr %pI6c ret %d node type %d\n",
-		 addr, ret,
+	rdsdebug("addr %pI6c%%%u ret %d node type %d\n",
+		 addr, scope_id, ret,
 		 cm_id->device ? cm_id->device->node_type : -1);
 
 	rdma_destroy_id(cm_id);
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 5b8b181..250be1c 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -40,7 +40,6 @@
 #include "rds_single_path.h"
 #include "rds.h"
 #include "ib.h"
-#include "tcp.h"
 
 /*
  * Set the selected protocol version
@@ -679,7 +678,7 @@ static u32 rds_ib_protocol_compatible(struct rdma_cm_event *event, bool isv6)
 	return version;
 }
 
-/* Given an IPv6 address, find the IB net_device which hosts that address and
+/* Given an IPv6 address, find the net_device which hosts that address and
  * return its index.  This is used by the rds_ib_cm_handle_connect() code to
  * find the interface index of where an incoming request comes from when
  * the request is using a link local address.
@@ -696,8 +695,7 @@ static u32 __rds_find_ifindex(struct net *net, const struct in6_addr *addr)
 
 	rcu_read_lock();
 	for_each_netdev_rcu(net, dev) {
-		if (dev->type == ARPHRD_INFINIBAND &&
-		    ipv6_chk_addr(net, addr, dev, 0)) {
+		if (ipv6_chk_addr(net, addr, dev, 0)) {
 			idx = dev->ifindex;
 			break;
 		}
@@ -887,7 +885,10 @@ int rds_ib_conn_path_connect(struct rds_conn_path *cp)
 
 	/* XXX I wonder what affect the port space has */
 	/* delegate cm event handler to rdma_transport */
-	handler = rds_rdma_cm_event_handler;
+	if (conn->c_isv6)
+		handler = rds6_rdma_cm_event_handler;
+	else
+		handler = rds_rdma_cm_event_handler;
 	ic->i_cm_id = rdma_create_id(&init_net, handler, conn,
 				     RDMA_PS_TCP, IB_QPT_RC);
 	if (IS_ERR(ic->i_cm_id)) {
@@ -923,7 +924,7 @@ int rds_ib_conn_path_connect(struct rds_conn_path *cp)
 		sin6 = (struct sockaddr_in6 *)&dest;
 		sin6->sin6_family = AF_INET6;
 		sin6->sin6_addr = conn->c_faddr;
-		sin6->sin6_port = htons(RDS_TCP_PORT);
+		sin6->sin6_port = htons(RDS_CM_PORT);
 		sin6->sin6_scope_id = conn->c_dev_if;
 	}
 
diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
index aef73e7..bd67e55 100644
--- a/net/rds/rdma_transport.c
+++ b/net/rds/rdma_transport.c
@@ -37,7 +37,9 @@
 #include "rdma_transport.h"
 #include "ib.h"
 
+/* Global IPv4 and IPv6 RDS RDMA listener cm_id */
 static struct rdma_cm_id *rds_rdma_listen_id;
+static struct rdma_cm_id *rds6_rdma_listen_id;
 
 static int rds_rdma_cm_event_handler_cmn(struct rdma_cm_id *cm_id,
 					 struct rdma_cm_event *event,
@@ -153,6 +155,12 @@ int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id,
 	return rds_rdma_cm_event_handler_cmn(cm_id, event, false);
 }
 
+int rds6_rdma_cm_event_handler(struct rdma_cm_id *cm_id,
+			       struct rdma_cm_event *event)
+{
+	return rds_rdma_cm_event_handler_cmn(cm_id, event, true);
+}
+
 static int rds_rdma_listen_init_common(rdma_cm_event_handler handler,
 				       struct sockaddr *sa,
 				       struct rdma_cm_id **ret_cm_id)
@@ -199,13 +207,14 @@ static int rds_rdma_listen_init_common(rdma_cm_event_handler handler,
 
 /* Initialize the RDS RDMA listeners.  We create two listeners for
  * compatibility reason.  The one on RDS_PORT is used for IPv4
- * requests only.  The one on RDS_TCP_PORT is used for IPv6 requests
+ * requests only.  The one on RDS_CM_PORT is used for IPv6 requests
  * only.  So only IPv6 enabled RDS module will communicate using this
  * port.
  */
 static int rds_rdma_listen_init(void)
 {
 	int ret;
+	struct sockaddr_in6 sin6;
 	struct sockaddr_in sin;
 
 	sin.sin_family = PF_INET;
@@ -214,7 +223,21 @@ static int rds_rdma_listen_init(void)
 	ret = rds_rdma_listen_init_common(rds_rdma_cm_event_handler,
 					  (struct sockaddr *)&sin,
 					  &rds_rdma_listen_id);
-	return ret;
+	if (ret != 0)
+		return ret;
+
+	sin6.sin6_family = PF_INET6;
+	sin6.sin6_addr = in6addr_any;
+	sin6.sin6_port = htons(RDS_CM_PORT);
+	sin6.sin6_scope_id = 0;
+	sin6.sin6_flowinfo = 0;
+	ret = rds_rdma_listen_init_common(rds6_rdma_cm_event_handler,
+					  (struct sockaddr *)&sin6,
+					  &rds6_rdma_listen_id);
+	/* Keep going even when IPv6 is not enabled in the system. */
+	if (ret != 0)
+		rdsdebug("Cannot set up IPv6 RDMA listener\n");
+	return 0;
 }
 
 static void rds_rdma_listen_stop(void)
@@ -224,6 +247,11 @@ static void rds_rdma_listen_stop(void)
 		rdma_destroy_id(rds_rdma_listen_id);
 		rds_rdma_listen_id = NULL;
 	}
+	if (rds6_rdma_listen_id) {
+		rdsdebug("cm %p\n", rds6_rdma_listen_id);
+		rdma_destroy_id(rds6_rdma_listen_id);
+		rds6_rdma_listen_id = NULL;
+	}
 }
 
 static int rds_rdma_init(void)
diff --git a/net/rds/rdma_transport.h b/net/rds/rdma_transport.h
index d309c44..bc3c639 100644
--- a/net/rds/rdma_transport.h
+++ b/net/rds/rdma_transport.h
@@ -11,6 +11,8 @@
 int rds_rdma_conn_connect(struct rds_connection *conn);
 int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id,
 			      struct rdma_cm_event *event);
+int rds6_rdma_cm_event_handler(struct rdma_cm_id *cm_id,
+			       struct rdma_cm_event *event);
 
 /* from ib.c */
 extern struct rds_transport rds_ib_transport;
diff --git a/net/rds/rds.h b/net/rds/rds.h
index 859808a..f5f99d1 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -24,13 +24,15 @@
 #define RDS_PROTOCOL_MINOR(v)	((v) & 255)
 #define RDS_PROTOCOL(maj, min)	(((maj) << 8) | min)
 
-/*
- * XXX randomly chosen, but at least seems to be unused:
- * #               18464-18768 Unassigned
- * We should do better.  We want a reserved port to discourage unpriv'ed
- * userspace from listening.
+/* The following ports, 16385, 18634, 18635, are registered with IANA as
+ * the ports to be used for RDS over TCP and UDP.  18634 is the historical
+ * value used for the RDMA_CM listener port.  RDS/TCP uses port 16385.  After
+ * IPv6 work, RDMA_CM also uses 16385 as the listener port.  18634 is kept
+ * to ensure compatibility with older RDS modules.
  */
 #define RDS_PORT	18634
+#define RDS_CM_PORT	16385
+#define RDS_TCP_PORT	RDS_CM_PORT
 
 #ifdef ATOMIC64_INIT
 #define KERNEL_HAS_ATOMIC64
diff --git a/net/rds/send.c b/net/rds/send.c
index 6ed2e92..c0e4f0b 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -1105,8 +1105,28 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
 			break;
 
 		case sizeof(*sin6): {
-			ret = -EPROTONOSUPPORT;
-			goto out;
+			int addr_type;
+
+			if (sin6->sin6_family != AF_INET6) {
+				ret = -EINVAL;
+				goto out;
+			}
+			addr_type = ipv6_addr_type(&sin6->sin6_addr);
+			if (!(addr_type & IPV6_ADDR_UNICAST)) {
+				ret = -EINVAL;
+				goto out;
+			}
+			if (addr_type & IPV6_ADDR_LINKLOCAL) {
+				if (sin6->sin6_scope_id == 0) {
+					ret = -EINVAL;
+					goto out;
+				}
+				scope_id = sin6->sin6_scope_id;
+			}
+
+			daddr = sin6->sin6_addr;
+			dport = sin6->sin6_port;
+			break;
 		}
 
 		default:
@@ -1138,6 +1158,14 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
 			ret = -EOPNOTSUPP;
 			goto out;
 		}
+		/* If the socket is already bound to a link local address,
+		 * it can only send to peers on the same link.
+		 */
+		if (scope_id != rs->rs_bound_scope_id) {
+			release_sock(sk);
+			ret = -EINVAL;
+			goto out;
+		}
 	}
 	release_sock(sk);
 
diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index dadb337..890d0e1 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2006, 2017 Oracle and/or its affiliates. All rights reserved.
+ * Copyright (c) 2006, 2018 Oracle and/or its affiliates. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -46,7 +46,12 @@
 /* only for info exporting */
 static DEFINE_SPINLOCK(rds_tcp_tc_list_lock);
 static LIST_HEAD(rds_tcp_tc_list);
+
+/* rds_tcp_tc_count counts only IPv4 connections.
+ * rds6_tcp_tc_count counts both IPv4 and IPv6 connections.
+ */
 static unsigned int rds_tcp_tc_count;
+static unsigned int rds6_tcp_tc_count;
 
 /* Track rds_tcp_connection structs so they can be cleaned up */
 static DEFINE_SPINLOCK(rds_tcp_conn_lock);
@@ -113,7 +118,9 @@ void rds_tcp_restore_callbacks(struct socket *sock,
 	/* done under the callback_lock to serialize with write_space */
 	spin_lock(&rds_tcp_tc_list_lock);
 	list_del_init(&tc->t_list_item);
-	rds_tcp_tc_count--;
+	rds6_tcp_tc_count--;
+	if (!tc->t_cpath->cp_conn->c_isv6)
+		rds_tcp_tc_count--;
 	spin_unlock(&rds_tcp_tc_list_lock);
 
 	tc->t_sock = NULL;
@@ -200,7 +207,9 @@ void rds_tcp_set_callbacks(struct socket *sock, struct rds_conn_path *cp)
 	/* done under the callback_lock to serialize with write_space */
 	spin_lock(&rds_tcp_tc_list_lock);
 	list_add_tail(&tc->t_list_item, &rds_tcp_tc_list);
-	rds_tcp_tc_count++;
+	rds6_tcp_tc_count++;
+	if (!tc->t_cpath->cp_conn->c_isv6)
+		rds_tcp_tc_count++;
 	spin_unlock(&rds_tcp_tc_list_lock);
 
 	/* accepted sockets need our listen data ready undone */
@@ -221,6 +230,9 @@ void rds_tcp_set_callbacks(struct socket *sock, struct rds_conn_path *cp)
 	write_unlock_bh(&sock->sk->sk_callback_lock);
 }
 
+/* Handle RDS_INFO_TCP_SOCKETS socket option.  It only returns IPv4
+ * connections for backward compatibility.
+ */
 static void rds_tcp_tc_info(struct socket *rds_sock, unsigned int len,
 			    struct rds_info_iterator *iter,
 			    struct rds_info_lengths *lens)
@@ -228,8 +240,6 @@ static void rds_tcp_tc_info(struct socket *rds_sock, unsigned int len,
 	struct rds_info_tcp_socket tsinfo;
 	struct rds_tcp_connection *tc;
 	unsigned long flags;
-	struct sockaddr_in sin;
-	struct socket *sock;
 
 	spin_lock_irqsave(&rds_tcp_tc_list_lock, flags);
 
@@ -237,16 +247,15 @@ static void rds_tcp_tc_info(struct socket *rds_sock, unsigned int len,
 		goto out;
 
 	list_for_each_entry(tc, &rds_tcp_tc_list, t_list_item) {
+		struct inet_sock *inet = inet_sk(tc->t_sock->sk);
 
-		sock = tc->t_sock;
-		if (sock) {
-			sock->ops->getname(sock, (struct sockaddr *)&sin, 0);
-			tsinfo.local_addr = sin.sin_addr.s_addr;
-			tsinfo.local_port = sin.sin_port;
-			sock->ops->getname(sock, (struct sockaddr *)&sin, 1);
-			tsinfo.peer_addr = sin.sin_addr.s_addr;
-			tsinfo.peer_port = sin.sin_port;
-		}
+		if (tc->t_cpath->cp_conn->c_isv6)
+			continue;
+
+		tsinfo.local_addr = inet->inet_saddr;
+		tsinfo.local_port = inet->inet_sport;
+		tsinfo.peer_addr = inet->inet_daddr;
+		tsinfo.peer_port = inet->inet_dport;
 
 		tsinfo.hdr_rem = tc->t_tinc_hdr_rem;
 		tsinfo.data_rem = tc->t_tinc_data_rem;
@@ -494,13 +503,18 @@ static __net_init int rds_tcp_init_net(struct net *net)
 		err = -ENOMEM;
 		goto fail;
 	}
-	rtn->rds_tcp_listen_sock = rds_tcp_listen_init(net);
+	rtn->rds_tcp_listen_sock = rds_tcp_listen_init(net, true);
 	if (!rtn->rds_tcp_listen_sock) {
-		pr_warn("could not set up listen sock\n");
-		unregister_net_sysctl_table(rtn->rds_tcp_sysctl);
-		rtn->rds_tcp_sysctl = NULL;
-		err = -EAFNOSUPPORT;
-		goto fail;
+		pr_warn("could not set up IPv6 listen sock\n");
+
+		/* Try IPv4 as some systems disable IPv6 */
+		rtn->rds_tcp_listen_sock = rds_tcp_listen_init(net, false);
+		if (!rtn->rds_tcp_listen_sock) {
+			unregister_net_sysctl_table(rtn->rds_tcp_sysctl);
+			rtn->rds_tcp_sysctl = NULL;
+			err = -EAFNOSUPPORT;
+			goto fail;
+		}
 	}
 	INIT_WORK(&rtn->rds_tcp_accept_w, rds_tcp_accept_worker);
 	return 0;
diff --git a/net/rds/tcp.h b/net/rds/tcp.h
index c6fa080..6a948c1 100644
--- a/net/rds/tcp.h
+++ b/net/rds/tcp.h
@@ -2,8 +2,6 @@
 #ifndef _RDS_TCP_H
 #define _RDS_TCP_H
 
-#define RDS_TCP_PORT	16385
-
 struct rds_tcp_incoming {
 	struct rds_incoming	ti_inc;
 	struct sk_buff_head	ti_skb_list;
@@ -67,7 +65,7 @@ void rds_tcp_restore_callbacks(struct socket *sock,
 void rds_tcp_state_change(struct sock *sk);
 
 /* tcp_listen.c */
-struct socket *rds_tcp_listen_init(struct net *);
+struct socket *rds_tcp_listen_init(struct net *net, bool isv6);
 void rds_tcp_listen_stop(struct socket *sock, struct work_struct *acceptor);
 void rds_tcp_listen_data_ready(struct sock *sk);
 int rds_tcp_accept_one(struct socket *sock);
diff --git a/net/rds/tcp_connect.c b/net/rds/tcp_connect.c
index 231ae92..008f50f 100644
--- a/net/rds/tcp_connect.c
+++ b/net/rds/tcp_connect.c
@@ -89,9 +89,11 @@ void rds_tcp_state_change(struct sock *sk)
 int rds_tcp_conn_path_connect(struct rds_conn_path *cp)
 {
 	struct socket *sock = NULL;
+	struct sockaddr_in6 sin6;
 	struct sockaddr_in sin;
 	struct sockaddr *addr;
 	int addrlen;
+	bool isv6;
 	int ret;
 	struct rds_connection *conn = cp->cp_conn;
 	struct rds_tcp_connection *tc = cp->cp_transport_data;
@@ -108,18 +110,36 @@ int rds_tcp_conn_path_connect(struct rds_conn_path *cp)
 		mutex_unlock(&tc->t_conn_path_lock);
 		return 0;
 	}
-	ret = sock_create_kern(rds_conn_net(conn), PF_INET,
-			       SOCK_STREAM, IPPROTO_TCP, &sock);
+	if (ipv6_addr_v4mapped(&conn->c_laddr)) {
+		ret = sock_create_kern(rds_conn_net(conn), PF_INET,
+				       SOCK_STREAM, IPPROTO_TCP, &sock);
+		isv6 = false;
+	} else {
+		ret = sock_create_kern(rds_conn_net(conn), PF_INET6,
+				       SOCK_STREAM, IPPROTO_TCP, &sock);
+		isv6 = true;
+	}
+
 	if (ret < 0)
 		goto out;
 
 	rds_tcp_tune(sock);
 
-	sin.sin_family = AF_INET;
-	sin.sin_addr.s_addr = conn->c_laddr.s6_addr32[3];
-	sin.sin_port = 0;
-	addr = (struct sockaddr *)&sin;
-	addrlen = sizeof(sin);
+	if (isv6) {
+		sin6.sin6_family = AF_INET6;
+		sin6.sin6_addr = conn->c_laddr;
+		sin6.sin6_port = 0;
+		sin6.sin6_flowinfo = 0;
+		sin6.sin6_scope_id = conn->c_dev_if;
+		addr = (struct sockaddr *)&sin6;
+		addrlen = sizeof(sin6);
+	} else {
+		sin.sin_family = AF_INET;
+		sin.sin_addr.s_addr = conn->c_laddr.s6_addr32[3];
+		sin.sin_port = 0;
+		addr = (struct sockaddr *)&sin;
+		addrlen = sizeof(sin);
+	}
 
 	ret = sock->ops->bind(sock, addr, addrlen);
 	if (ret) {
@@ -128,11 +148,21 @@ int rds_tcp_conn_path_connect(struct rds_conn_path *cp)
 		goto out;
 	}
 
-	sin.sin_family = AF_INET;
-	sin.sin_addr.s_addr = conn->c_faddr.s6_addr32[3];
-	sin.sin_port = htons(RDS_TCP_PORT);
-	addr = (struct sockaddr *)&sin;
-	addrlen = sizeof(sin);
+	if (isv6) {
+		sin6.sin6_family = AF_INET6;
+		sin6.sin6_addr = conn->c_faddr;
+		sin6.sin6_port = htons(RDS_TCP_PORT);
+		sin6.sin6_flowinfo = 0;
+		sin6.sin6_scope_id = conn->c_dev_if;
+		addr = (struct sockaddr *)&sin6;
+		addrlen = sizeof(sin6);
+	} else {
+		sin.sin_family = AF_INET;
+		sin.sin_addr.s_addr = conn->c_faddr.s6_addr32[3];
+		sin.sin_port = htons(RDS_TCP_PORT);
+		addr = (struct sockaddr *)&sin;
+		addrlen = sizeof(sin);
+	}
 
 	/*
 	 * once we call connect() we can start getting callbacks and they
diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
index 4fdf5b3..0f996e4 100644
--- a/net/rds/tcp_listen.c
+++ b/net/rds/tcp_listen.c
@@ -256,15 +256,22 @@ void rds_tcp_listen_data_ready(struct sock *sk)
 		ready(sk);
 }
 
-struct socket *rds_tcp_listen_init(struct net *net)
+struct socket *rds_tcp_listen_init(struct net *net, bool isv6)
 {
-	struct sockaddr_in sin;
 	struct socket *sock = NULL;
+	struct sockaddr_storage ss;
+	struct sockaddr_in6 *sin6;
+	struct sockaddr_in *sin;
+	int addr_len;
 	int ret;
 
-	ret = sock_create_kern(net, PF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
-	if (ret < 0)
+	ret = sock_create_kern(net, isv6 ? PF_INET6 : PF_INET, SOCK_STREAM,
+			       IPPROTO_TCP, &sock);
+	if (ret < 0) {
+		rdsdebug("could not create %s listener socket: %d\n",
+			 isv6 ? "IPv6" : "IPv4", ret);
 		goto out;
+	}
 
 	sock->sk->sk_reuse = SK_CAN_REUSE;
 	rds_tcp_nonagle(sock);
@@ -274,13 +281,28 @@ struct socket *rds_tcp_listen_init(struct net *net)
 	sock->sk->sk_data_ready = rds_tcp_listen_data_ready;
 	write_unlock_bh(&sock->sk->sk_callback_lock);
 
-	sin.sin_family = PF_INET;
-	sin.sin_addr.s_addr = (__force u32)htonl(INADDR_ANY);
-	sin.sin_port = (__force u16)htons(RDS_TCP_PORT);
+	if (isv6) {
+		sin6 = (struct sockaddr_in6 *)&ss;
+		sin6->sin6_family = PF_INET6;
+		sin6->sin6_addr = in6addr_any;
+		sin6->sin6_port = (__force u16)htons(RDS_TCP_PORT);
+		sin6->sin6_scope_id = 0;
+		sin6->sin6_flowinfo = 0;
+		addr_len = sizeof(*sin6);
+	} else {
+		sin = (struct sockaddr_in *)&ss;
+		sin->sin_family = PF_INET;
+		sin->sin_addr.s_addr = INADDR_ANY;
+		sin->sin_port = (__force u16)htons(RDS_TCP_PORT);
+		addr_len = sizeof(*sin);
+	}
 
-	ret = sock->ops->bind(sock, (struct sockaddr *)&sin, sizeof(sin));
-	if (ret < 0)
+	ret = sock->ops->bind(sock, (struct sockaddr *)&ss, addr_len);
+	if (ret < 0) {
+		rdsdebug("could not bind %s listener socket: %d\n",
+			 isv6 ? "IPv6" : "IPv4", ret);
 		goto out;
+	}
 
 	ret = sock->ops->listen(sock, 64);
 	if (ret < 0)
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH net-next 2/3] rds: Enable RDS IPv6 support
From: Sowmini Varadhan @ 2018-06-27 10:29 UTC (permalink / raw)
  To: Ka-Cheong Poon; +Cc: netdev, santosh.shilimkar, davem, rds-devel
In-Reply-To: <e83dbf4a-b813-4edc-9245-d738faaedb4b@oracle.com>

On (06/27/18 18:07), Ka-Cheong Poon wrote:
> 
> There is a reason for that.  It is the way folks expect
> how IPv6 addresses are being used.

have you tried "traceoute6 -s abc::2 fe80::2" on linux?

> It is not just forwarding.  The simple case is that one
> picks a global address in a different link and then
> use it to send to a link local address in another link.

This is actually not any different than ipv4's strong/weak ES model.

Global addresses are supposed to be globally routable. For your
above example, if yuu do that, it is assumed that your routing
table has been set up suitably.

To state what may be well-known:
This does not work for link-locals, becuase, as the name 
suggests, those are local to the link and you may have the same
link-local on multiple links

> This does not work.  And the RDS connection created will
> be stuck forever.  

that is a different problem in the RDS implementation (that
it does not backoff and timeout a failing reconnect)

As you can see from the traceroute6 example, global <-> link-local 
is supported for udp (and probably also tcp sockets, I have not checked
that case)

> I don't expect RDS apps will want to use link local address
> in the first place.  In fact, most normal network apps don't.
   :
> Do you know of any IPv4 RDS app which uses IPv4 link local
> address?  In fact, IPv4 link local address is explicitly
> disallowed for active active bonding.

Are we talking about "why this ok for my particular use
of link-local, so I can slide my patch forward" or, 
"why this is correct IPv6 behavior"?

> Can you explain why DNS name resolution will return an IPv6
> link local address?  I'm surprised if it actually does.

It depends on how you set up your DNS.

It seems like this is all about "I dont want to deal with this
now", so I dont want to continue this discussion which is really
going nowhere.

Thanks

--Sowmini

^ permalink raw reply

* Re: [PATCH bpf-next 2/3] bpf: btf: add btf json print functionality
From: Daniel Borkmann @ 2018-06-27 10:34 UTC (permalink / raw)
  To: Jakub Kicinski, Martin KaFai Lau
  Cc: Okash Khawaja, Alexei Starovoitov, Yonghong Song, Quentin Monnet,
	David S. Miller, netdev, kernel-team, linux-kernel
In-Reply-To: <20180626153559.0511f709@cakuba.netronome.com>

On 06/27/2018 12:35 AM, Jakub Kicinski wrote:
> On Tue, 26 Jun 2018 15:27:09 -0700, Martin KaFai Lau wrote:
>> On Tue, Jun 26, 2018 at 01:31:33PM -0700, Jakub Kicinski wrote:
[...]
>>> Implementing both outputs in one series will help you structure your
>>> code to best suit both of the formats up front.  
>> hex and "formatted" are the only things missing?  As always, things
>> can be refactored when new use case comes up.  Lets wait for
>> Okash input.
>>
>> Regardless, plaintext is our current use case.  Having the current
>> patchset in does not stop us or others from contributing other use
>> cases (json, "bpftool map find"...etc),  and IMO it is actually
>> the opposite.  Others may help us get there faster than us alone.
>> We should not stop making forward progress and take this patch
>> as hostage because "abc" and "xyz" are not done together.
> 
> Parity between JSON and plain text output is non negotiable.

Longish discussion and some confusion in this thread. :-) First of all
thanks a lot for working on it, very useful! My $0.02 on it is that so far
great care has been taken in bpftool to indeed have feature parity between
JSON and plain text, so it would be highly desirable to keep continuing
this practice if the consensus is that it indeed is feasible and makes
sense wrt BTF data. There has been mentioned that given BTF data can be
dynamic depending on what the user loads via bpf(2) so a potential JSON
output may look different/break each time anyway. This however could all be
embedded under a container object that has a fixed key like 'formatted'
where tools like jq(1) can query into it. I think this would be fine since
the rest of the (non-dynamic) output is still retained as-is and then
wouldn't confuse or collide with existing users, and anyone programmatically
parsing deeper into the BTF data under such JSON container object needs
to have awareness of what specific data it wants to query from it; so
there's no conflict wrt breaking anything here. Imho, both outputs would
be very valuable.

Thanks,
Daniel

^ permalink raw reply

* Re: s390x BPF JIT failures with test_bpf
From: Daniel Borkmann @ 2018-06-27 10:36 UTC (permalink / raw)
  To: Kleber Souza, linux-s390, netdev; +Cc: Alexei Starovoitov
In-Reply-To: <908de5ba-3a65-dfa1-3d5e-de25b4e984a2@canonical.com>

On 06/27/2018 12:13 PM, Kleber Souza wrote:
> On 06/27/18 12:01, Daniel Borkmann wrote:
>> On 06/27/2018 11:40 AM, Kleber Souza wrote:
>> [...]
>>> When I load the test_bpf module from mainline (v4.18-rc2) with
>>> CONFIG_BPF_JIT_ALWAYS_ON=y on a s390x system I get the following errors:
>>>
>>> test_bpf: #289 BPF_MAXINSNS: Ctx heavy transformations FAIL to
>>> prog_create err=-524 len=4096
>>> test_bpf: #290 BPF_MAXINSNS: Call heavy transformations FAIL to
>>> prog_create err=-524 len=4096
>>> [...]
>>> test_bpf: #296 BPF_MAXINSNS: exec all MSH FAIL to prog_create err=-524
>>> len=4096
>>> test_bpf: #297 BPF_MAXINSNS: ld_abs+get_processor_id FAIL to prog_create
>>> err=-524 len=4096
>>>
>>> From a quick look at the code it seems that
>>> arch/s390/net/bpf_jit_comp.c:bpf_int_jit_compile() is failing to JIT
>>> compile the test code.
>>>
>>> Are those failures expected and could be flagged with FLAG_EXPECTED_FAIL
>>> on lib/test_bpf.c or are those caused by some issue with the s390x JIT
>>> compiler that needs to be fixed?
>>
>> JIT doesn't guarantee in general to map really all programs to native insns,
>> so some, mostly crafted corner cases could fail. E.g. x86-64 JIT doesn't converge
>> on some programs in test_bpf.c and thus falls back to interpreter or simply
>> rejects the program in case of CONFIG_BPF_JIT_ALWAYS_ON=y. Above would seem
>> likely that it's hitting the BPF_SIZE_MAX that s390 would do. I think it might
>> make sense to either have the FLAG_EXPECTED_FAIL in lib/test_bpf.c more fine
>> grained as a flag per arch, so we could say it's expected to fail on e.g. s390
>> but not on x86 and the like, or just denote it as 'could potentially fail but
>> doesn't have to be the case everywhere'.
> 
> Thank you for your reply. I will run some more tests to make sure we are
> hitting BPF_SIZE_MAX or what exactly is failing and send a patch to flag
> it conditionally for s390x.

Sounds good, thanks! In any case, please let us know your findings.

Best,
Daniel

^ permalink raw reply

* Re: [net-next PATCH v4 5/7] net: Enable Tx queue selection based on Rx queues
From: Willem de Bruijn @ 2018-06-27 10:47 UTC (permalink / raw)
  To: Amritha Nambiar
  Cc: Network Development, David Miller, Alexander Duyck,
	Samudrala, Sridhar, Alexander Duyck, Eric Dumazet,
	Hannes Frederic Sowa, Tom Herbert
In-Reply-To: <90d56882-26a9-0a6e-414b-02bd777c2854@intel.com>

> >> +static int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
> >>  {
> >>  #ifdef CONFIG_XPS
> >>         struct xps_dev_maps *dev_maps;
> >> -       struct xps_map *map;
> >> +       struct sock *sk = skb->sk;
> >>         int queue_index = -1;
> >>
> >>         if (!static_key_false(&xps_needed))
> >>                 return -1;
> >>
> >>         rcu_read_lock();
> >> -       dev_maps = rcu_dereference(dev->xps_cpus_map);
> >> +       if (!static_key_false(&xps_rxqs_needed))
> >> +               goto get_cpus_map;
> >> +
> >> +       dev_maps = rcu_dereference(dev->xps_rxqs_map);
> >>         if (dev_maps) {
> >> -               unsigned int tci = skb->sender_cpu - 1;
> >> +               int tci = sk_rx_queue_get(sk);
> >
> > What if the rx device differs from the tx device?
> >
> I think I have 3 options here:
> 1. Cache the ifindex in sock_common which will introduce a new
> additional field in sock_common.
> 2. Use dev_get_by_napi_id to get the device id. This could be expensive,
> if the rxqs_map is set, this will be done on every packet and involves
> walking through the hashlist for napi_id lookup.

The tx queue mapping is cached in the sk for connected sockets, but
indeed this would be expensive for many workloads.

> 3. Remove validating device id, similar to how it is in skb_tx_hash
> where rx_queue recorded is used and if not, fall through to flow hash
> calculation.
> What do you think is suitable here?

Alternatively, just accept the misprediction in this rare case. But do
make the caveat explicit in the documentation.

^ permalink raw reply

* Re: [RFC bpf-next 6/6] samples/bpf: Add meta data hash example to xdp_redirect_cpu
From: Jesper Dangaard Brouer @ 2018-06-27 10:59 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Alexei Starovoitov, Daniel Borkmann, neerav.parikh, pjwaskiewicz,
	ttoukan.linux, Tariq Toukan, alexander.h.duyck,
	peter.waskiewicz.jr, Opher Reviv, Rony Efraim, netdev,
	Saeed Mahameed, brouer
In-Reply-To: <20180627024615.17856-7-saeedm@mellanox.com>

On Tue, 26 Jun 2018 19:46:15 -0700
Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:

> Add a new program (prog_num = 4) that will not parse packets and will
> use the meta data hash to spread/redirect traffic into different cpus.

You cannot "steal" prognum 4, as it is already used for
"xdp_prognum4_ddos_filter_pktgen".  Please append your new prog as #5.

> For the new program we set on bpf_set_link_xdp_fd:
> 	xdp_flags |= XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN;
> 
> On mlx5 it will succeed since mlx5 already supports these flags.
> 
> The new program will read the value of the hash from the data_meta
> pointer from the xdp_md and will use it to compute the destination cpu.
> 
> Note: I didn't test this patch to show redirect works with the hash!
> I only used it to see that the hash and vlan values are set correctly
> by the driver and can be seen by the xdp program.
> 
> * I faced some difficulties to read the hash value using the helper
> functions defined in the previous patches, but once i used the same logic
> with out these functions it worked ! Will have to figure this out later.
> 
> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> ---
>  samples/bpf/xdp_redirect_cpu_kern.c | 67 +++++++++++++++++++++++++++++
>  samples/bpf/xdp_redirect_cpu_user.c |  7 +++
>  2 files changed, 74 insertions(+)
> 
> diff --git a/samples/bpf/xdp_redirect_cpu_kern.c b/samples/bpf/xdp_redirect_cpu_kern.c
> index 303e9e7161f3..d6b3f55f342a 100644
> --- a/samples/bpf/xdp_redirect_cpu_kern.c
> +++ b/samples/bpf/xdp_redirect_cpu_kern.c
> @@ -376,6 +376,73 @@ int  xdp_prognum3_proto_separate(struct xdp_md *ctx)
>  	return bpf_redirect_map(&cpu_map, cpu_dest, 0);
>  }
>  
> +#if 0
> +xdp_md_info_arr mdi = {
> +	[XDP_DATA_META_HASH] = {.offset = 0, .present = 1},
> +	[XDP_DATA_META_VLAN] = {.offset = sizeof(struct xdp_md_hash), .present = 1},
> +};
> +#endif

Sorry, no global variables avail in the generated BPF byte-code.

> +SEC("xdp_cpu_map4_hash_separate")
> +int  xdp_prognum4_hash_separate(struct xdp_md *ctx)
> +{
> +	void *data_meta = (void *)(long)ctx->data_meta;
> +	void *data_end  = (void *)(long)ctx->data_end;
> +	void *data      = (void *)(long)ctx->data;
> +	struct xdp_md_hash *hash;
> +	struct xdp_md_vlan *vlan;
> +	struct datarec *rec;
> +	u32 cpu_dest = 0;
> +	u32 cpu_idx = 0;
> +	u32 *cpu_lookup;
> +	u32 key = 0;
> +
> +	/* Count RX packet in map */
> +	rec = bpf_map_lookup_elem(&rx_cnt, &key);
> +	if (!rec)
> +		return XDP_ABORTED;
> +	rec->processed++;
> +
> +	/* for some reason this code fails to be verified */
> +#if 0
> +	hash = xdp_data_meta_get_hash(mdi, data_meta);

This will not work, because it is not implemented as a proper
BPF-helper call.

First, you currently store the xdp_md_info_arr inside the driver, which
makes it hard for a helper to access this.  For helper access, we could
store this in xdp_rxq_info.

Second, in your design it looks like you are introducing a helper per
possible item in xdp_md_info_arr.  I think we can reduce this to a
single helper, that takes a XDP_DATA_META_xxx flag, and returns an
offset.  (The helper could return a direct pointer, but I don't think
the verfier can handle that, as it need to "see" this is related to
data_meta pointer, and that we do the proper boundry checks.).

The BPF prog already have direct memory access to the data_meta area,
and all it really need is an offset.  Sure, the XDP/bpf programmer
could just calculate these offsets as constants, and remember to load
the XDP prog with the flags that corresponds to the calculated offsets.

But I think we can do something even smarter... 

It should be possible to convert/patch the BPF instructions, of the
helper call that returns an offset, to instead avoid the call and
either (1) provide the offset as a constant/IMM or (2) make BPF inst
doing the lookup in xdp_md_info_arr.


> +	if (hash + 1 > data)
> +		return XDP_ABORTED;
> +
> +	vlan = xdp_data_meta_get_vlan(mdi, data_meta);
> +	if (vlan + 1 > data)
> +		return XDP_ABORTED;
> +#endif
> +
> +	/* Work around for the above code */
> +	hash = data_meta; /* since we know hash will appear first */
> +        if (hash + 1 > data)
> +		return XDP_ABORTED;
> +
> +#if 0
> +	// Just for testing
> +	/* We know that vlan will appear after the hash */
> +	vlan = (void *)((char *)data_meta + sizeof(*hash));
> +	if (vlan + 1 > data) {
> +		return XDP_ABORTED;
> +	}
> +#endif

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [bpf-next PATCH 2/2] samples/bpf: xdp_rxq_info action XDP_TX must adjust MAC-addrs
From: Jesper Dangaard Brouer @ 2018-06-27 11:20 UTC (permalink / raw)
  To: Song Liu
  Cc: Networking, Daniel Borkmann, Toke Høiland-Jørgensen,
	Alexei Starovoitov, brouer
In-Reply-To: <CAPhsuW74ZzHawToa3YXxzT6ojk+ScdCvcO3KyT3EBcCxqPHDXg@mail.gmail.com>

On Tue, 26 Jun 2018 17:09:01 -0700
Song Liu <liu.song.a23@gmail.com> wrote:

> On Mon, Jun 25, 2018 at 7:27 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> > XDP_TX requires also changing the MAC-addrs, else some hardware
> > may drop the TX packet before reaching the wire.  This was
> > observed with driver mlx5.
> >
> > If xdp_rxq_info select --action XDP_TX the swapmac functionality
> > is activated.  It is also possible to manually enable via cmdline
> > option --swapmac.  This is practical if wanting to measure the
> > overhead of writing/updating payload for other action types.
> >
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
> > ---
[...]
> >
> > diff --git a/samples/bpf/xdp_rxq_info_kern.c b/samples/bpf/xdp_rxq_info_kern.c
> > index 61af6210df2f..222a83eed1cb 100644
> > --- a/samples/bpf/xdp_rxq_info_kern.c
> > +++ b/samples/bpf/xdp_rxq_info_kern.c
> > @@ -21,6 +21,7 @@ struct config {
> >  enum cfg_options_flags {
> >         NO_TOUCH = 0x0U,
> >         READ_MEM = 0x1U,
> > +       SWAP_MAC = 0x2U,
> >  };
[...]
> > @@ -98,7 +116,7 @@ int  xdp_prognum0(struct xdp_md *ctx)
> >                 rxq_rec->issue++;
> >
> >         /* Default: Don't touch packet data, only count packets */
> > -       if (unlikely(config->options & READ_MEM)) {
> > +       if (unlikely(config->options & (READ_MEM|SWAP_MAC))) {
> >                 struct ethhdr *eth = data;
> >
> >                 if (eth + 1 > data_end)
[...]

> > diff --git a/samples/bpf/xdp_rxq_info_user.c b/samples/bpf/xdp_rxq_info_user.c
> > index 435485d4f49e..248a7eab9531 100644
[...]
> > @@ -119,6 +121,8 @@ static char* options2str(enum cfg_options_flags flag)
> >  {
> >         if (flag == NO_TOUCH)
> >                 return "no_touch";
> > +       if (flag & SWAP_MAC)
> > +               return "swapmac";
> >         if (flag & READ_MEM)
> >                 return "read";  
> 
> I guess SWAP_MAC also reads the memory, so it "includes" READ_MEM?

True (see _kern side)

> It is OK for now. We may need to refactor this part when adding other
> flags in the future.

Sure, do remember that this is only a 'sample' program.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [bpf-next PATCH 1/2] samples/bpf: extend xdp_rxq_info to read packet payload
From: Jesper Dangaard Brouer @ 2018-06-27 11:23 UTC (permalink / raw)
  To: Song Liu
  Cc: Networking, Daniel Borkmann, Toke Høiland-Jørgensen,
	Alexei Starovoitov, brouer
In-Reply-To: <CAPhsuW7i_YsrFvtrFwAybOiFFMUjQzQwtOwVz-r0=d9=TPM=Cw@mail.gmail.com>

On Tue, 26 Jun 2018 16:53:15 -0700
Song Liu <liu.song.a23@gmail.com> wrote:

> > +static char* options2str(enum cfg_options_flags flag)
> > +{
> > +       if (flag == NO_TOUCH)
> > +               return "no_touch";
> > +       if (flag & READ_MEM)
> > +               return "read";
> > +       fprintf(stderr, "ERR: Unknown config option flags");
> > +       exit(EXIT_FAIL);
> > +}
> > +  
> 
> enum cfg_options_flags is used as a bitmap in other parts of the sample.
> So this function is a little weird (with more flags added).

Sure, and do I handle this correctly in the next patch.

I'm uncertain what you want me to change?
Do you want me to drop the enum, and use #define instead?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH bpf-next 2/3] bpf: btf: add btf json print functionality
From: Okash Khawaja @ 2018-06-27 11:47 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Jakub Kicinski, Martin KaFai Lau, Alexei Starovoitov,
	Yonghong Song, Quentin Monnet, David S. Miller, netdev,
	kernel-team, linux-kernel
In-Reply-To: <23968d53-2895-f0bc-a38c-5bc99e1846ab@iogearbox.net>

On Wed, Jun 27, 2018 at 12:34:35PM +0200, Daniel Borkmann wrote:
> On 06/27/2018 12:35 AM, Jakub Kicinski wrote:
> > On Tue, 26 Jun 2018 15:27:09 -0700, Martin KaFai Lau wrote:
> >> On Tue, Jun 26, 2018 at 01:31:33PM -0700, Jakub Kicinski wrote:
> [...]
> >>> Implementing both outputs in one series will help you structure your
> >>> code to best suit both of the formats up front.  
> >> hex and "formatted" are the only things missing?  As always, things
> >> can be refactored when new use case comes up.  Lets wait for
> >> Okash input.
> >>
> >> Regardless, plaintext is our current use case.  Having the current
> >> patchset in does not stop us or others from contributing other use
> >> cases (json, "bpftool map find"...etc),  and IMO it is actually
> >> the opposite.  Others may help us get there faster than us alone.
> >> We should not stop making forward progress and take this patch
> >> as hostage because "abc" and "xyz" are not done together.
> > 
> > Parity between JSON and plain text output is non negotiable.
> 
> Longish discussion and some confusion in this thread. :-) First of all
> thanks a lot for working on it, very useful! 
Thanks :)

> My $0.02 on it is that so far
> great care has been taken in bpftool to indeed have feature parity between
> JSON and plain text, so it would be highly desirable to keep continuing
> this practice if the consensus is that it indeed is feasible and makes
> sense wrt BTF data. There has been mentioned that given BTF data can be
> dynamic depending on what the user loads via bpf(2) so a potential JSON
> output may look different/break each time anyway. This however could all be
> embedded under a container object that has a fixed key like 'formatted'
> where tools like jq(1) can query into it. I think this would be fine since
> the rest of the (non-dynamic) output is still retained as-is and then
> wouldn't confuse or collide with existing users, and anyone programmatically
> parsing deeper into the BTF data under such JSON container object needs
> to have awareness of what specific data it wants to query from it; so
> there's no conflict wrt breaking anything here. Imho, both outputs would
> be very valuable.
Okay I can add "formatted" object under json output.

One thing to note here is that the fixed output will change if the map
itself changes. So someone writing a program that consumes that fixed
output will have to account for his program breaking in future, thus
breaking backward compatibility anyway as far as the developer is
concerned :)

I will go ahead with work on "formatted" object.

Thanks,
Okash

> 
> Thanks,
> Daniel

^ permalink raw reply

* [PATCH net-next] tcp: replace LINUX_MIB_TCPOFODROP with LINUX_MIB_TCPRMEMFULLDROP for drops due to receive buffer full
From: Yafang Shao @ 2018-06-27 11:50 UTC (permalink / raw)
  To: edumazet, davem; +Cc: netdev, linux-kernel, Yafang Shao

When sk_rmem_alloc is larger than the receive buffer and we can't
schedule more memory for it, the skb will be dropped.

In above situation, if this skb is put into the ofo queue,
LINUX_MIB_TCPOFODROP is incremented to track it,
while if this skb is put into the receive queue, there's no record.

So LINUX_MIB_TCPOFODROP is replaced with LINUX_MIB_TCPRMEMFULLDROP to track
this behavior.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/uapi/linux/snmp.h |  2 +-
 net/ipv4/proc.c           |  2 +-
 net/ipv4/tcp_input.c      | 10 +++++++---
 3 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index 97517f3..3a83322 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -243,7 +243,6 @@ enum
 	LINUX_MIB_TCPRETRANSFAIL,		/* TCPRetransFail */
 	LINUX_MIB_TCPRCVCOALESCE,		/* TCPRcvCoalesce */
 	LINUX_MIB_TCPOFOQUEUE,			/* TCPOFOQueue */
-	LINUX_MIB_TCPOFODROP,			/* TCPOFODrop */
 	LINUX_MIB_TCPOFOMERGE,			/* TCPOFOMerge */
 	LINUX_MIB_TCPCHALLENGEACK,		/* TCPChallengeACK */
 	LINUX_MIB_TCPSYNCHALLENGE,		/* TCPSYNChallenge */
@@ -280,6 +279,7 @@ enum
 	LINUX_MIB_TCPDELIVEREDCE,		/* TCPDeliveredCE */
 	LINUX_MIB_TCPACKCOMPRESSED,		/* TCPAckCompressed */
 	LINUX_MIB_TCPZEROWINDOWDROP,		/* TCPZeroWindowDrop */
+	LINUX_MIB_TCPRMEMFULLDROP,              /* TCPRmemFullDrop */
 	__LINUX_MIB_MAX
 };
 
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 225ef34..43ee02d 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -251,7 +251,6 @@ static int sockstat_seq_show(struct seq_file *seq, void *v)
 	SNMP_MIB_ITEM("TCPRetransFail", LINUX_MIB_TCPRETRANSFAIL),
 	SNMP_MIB_ITEM("TCPRcvCoalesce", LINUX_MIB_TCPRCVCOALESCE),
 	SNMP_MIB_ITEM("TCPOFOQueue", LINUX_MIB_TCPOFOQUEUE),
-	SNMP_MIB_ITEM("TCPOFODrop", LINUX_MIB_TCPOFODROP),
 	SNMP_MIB_ITEM("TCPOFOMerge", LINUX_MIB_TCPOFOMERGE),
 	SNMP_MIB_ITEM("TCPChallengeACK", LINUX_MIB_TCPCHALLENGEACK),
 	SNMP_MIB_ITEM("TCPSYNChallenge", LINUX_MIB_TCPSYNCHALLENGE),
@@ -288,6 +287,7 @@ static int sockstat_seq_show(struct seq_file *seq, void *v)
 	SNMP_MIB_ITEM("TCPDeliveredCE", LINUX_MIB_TCPDELIVEREDCE),
 	SNMP_MIB_ITEM("TCPAckCompressed", LINUX_MIB_TCPACKCOMPRESSED),
 	SNMP_MIB_ITEM("TCPZeroWindowDrop", LINUX_MIB_TCPZEROWINDOWDROP),
+	SNMP_MIB_ITEM("TCPRmemFullDrop", LINUX_MIB_TCPRMEMFULLDROP),
 	SNMP_MIB_SENTINEL
 };
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9c5b341..d11abb8 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4442,7 +4442,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 	tcp_ecn_check_ce(sk, skb);
 
 	if (unlikely(tcp_try_rmem_schedule(sk, skb, skb->truesize))) {
-		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPOFODROP);
+		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRMEMFULLDROP);
 		tcp_drop(sk, skb);
 		return;
 	}
@@ -4611,8 +4611,10 @@ int tcp_send_rcvq(struct sock *sk, struct msghdr *msg, size_t size)
 	skb->data_len = data_len;
 	skb->len = size;
 
-	if (tcp_try_rmem_schedule(sk, skb, skb->truesize))
+	if (tcp_try_rmem_schedule(sk, skb, skb->truesize)) {
+		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRMEMFULLDROP);
 		goto err_free;
+	}
 
 	err = skb_copy_datagram_from_iter(skb, 0, &msg->msg_iter, size);
 	if (err)
@@ -4677,8 +4679,10 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 queue_and_out:
 		if (skb_queue_len(&sk->sk_receive_queue) == 0)
 			sk_forced_mem_schedule(sk, skb->truesize);
-		else if (tcp_try_rmem_schedule(sk, skb, skb->truesize))
+		else if (tcp_try_rmem_schedule(sk, skb, skb->truesize)) {
+			NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRMEMFULLDROP);
 			goto drop;
+		}
 
 		eaten = tcp_queue_rcv(sk, skb, 0, &fragstolen);
 		tcp_rcv_nxt_update(tp, TCP_SKB_CB(skb)->end_seq);
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH net-next] mlx4: do not use rwlock in fast path
From: Tariq Toukan @ 2018-06-27 12:11 UTC (permalink / raw)
  To: Eric Dumazet, David Miller
  Cc: netdev, Tariq Toukan, Shawn Bohrer, Shay Agroskin,
	Eran Ben Elisha
In-Reply-To: <1486660204.7793.104.camel@edumazet-glaptop3.roam.corp.google.com>



On 09/02/2017 7:10 PM, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Using a reader-writer lock in fast path is silly, when we can
> instead use RCU or a seqlock.
> 
> For mlx4 hwstamp clock, a seqlock is the way to go, removing
> two atomic operations and false sharing.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Tariq Toukan <tariqt@mellanox.com>
> ---
>   drivers/net/ethernet/mellanox/mlx4/en_clock.c |   35 ++++++++--------
>   drivers/net/ethernet/mellanox/mlx4/mlx4_en.h  |    2
>   2 files changed, 19 insertions(+), 18 deletions(-)
> 

Hi Eric,

When my peer, Shay, modified mlx5 to adopt this same locking 
scheme/type, he noticed a degradation in packet rate.
He got back to testing mlx4 and also noticed a degradation introduced by 
this patch.

Perf numbers (single ring):

mlx4:
with rw-lock: ~8.54M pps
with seq-lock: ~8.51M pps

mlx5:
With rw-lock: ~14.94M pps
With seq-lock: ~14.48M pps

Actually, this can be explained by the analysis below.
In short, number of readers is significantly larger than of writers. 
Hence optimizing the readers flow would give better numbers. The issue 
is, the read/write lock might cause writers starvation. Maybe RCU fits 
best here?

Degradation analysis:
The patch changes the lock type which protects reads and updates of a 
variable ( (struct mlx4_en_dev).clock variable)
This variable is used to convert the hw timestamp into skb->hwtstamps.
This variable is read for each transmitted/received packet and updated 
only via ptp module and some overflow periodic work we have (maximum of 
10 times per second)
Meaning that there are much more readers than writers, and it’s best to 
optimize the readers flow.

Best,
Tariq

^ permalink raw reply

* Re: s390x BPF JIT failures with test_bpf
From: Kleber Souza @ 2018-06-27 12:30 UTC (permalink / raw)
  To: Daniel Borkmann, linux-s390, netdev; +Cc: Alexei Starovoitov
In-Reply-To: <0511ac38-beec-131f-5643-04eee3357cc4@iogearbox.net>

On 06/27/18 12:36, Daniel Borkmann wrote:
> On 06/27/2018 12:13 PM, Kleber Souza wrote:
>> On 06/27/18 12:01, Daniel Borkmann wrote:
>>> On 06/27/2018 11:40 AM, Kleber Souza wrote:
>>> [...]
>>>> When I load the test_bpf module from mainline (v4.18-rc2) with
>>>> CONFIG_BPF_JIT_ALWAYS_ON=y on a s390x system I get the following errors:
>>>>
>>>> test_bpf: #289 BPF_MAXINSNS: Ctx heavy transformations FAIL to
>>>> prog_create err=-524 len=4096
>>>> test_bpf: #290 BPF_MAXINSNS: Call heavy transformations FAIL to
>>>> prog_create err=-524 len=4096
>>>> [...]
>>>> test_bpf: #296 BPF_MAXINSNS: exec all MSH FAIL to prog_create err=-524
>>>> len=4096
>>>> test_bpf: #297 BPF_MAXINSNS: ld_abs+get_processor_id FAIL to prog_create
>>>> err=-524 len=4096
>>>>
>>>> From a quick look at the code it seems that
>>>> arch/s390/net/bpf_jit_comp.c:bpf_int_jit_compile() is failing to JIT
>>>> compile the test code.
>>>>
>>>> Are those failures expected and could be flagged with FLAG_EXPECTED_FAIL
>>>> on lib/test_bpf.c or are those caused by some issue with the s390x JIT
>>>> compiler that needs to be fixed?
>>>
>>> JIT doesn't guarantee in general to map really all programs to native insns,
>>> so some, mostly crafted corner cases could fail. E.g. x86-64 JIT doesn't converge
>>> on some programs in test_bpf.c and thus falls back to interpreter or simply
>>> rejects the program in case of CONFIG_BPF_JIT_ALWAYS_ON=y. Above would seem
>>> likely that it's hitting the BPF_SIZE_MAX that s390 would do. I think it might
>>> make sense to either have the FLAG_EXPECTED_FAIL in lib/test_bpf.c more fine
>>> grained as a flag per arch, so we could say it's expected to fail on e.g. s390
>>> but not on x86 and the like, or just denote it as 'could potentially fail but
>>> doesn't have to be the case everywhere'.
>>
>> Thank you for your reply. I will run some more tests to make sure we are
>> hitting BPF_SIZE_MAX or what exactly is failing and send a patch to flag
>> it conditionally for s390x.
> 
> Sounds good, thanks! In any case, please let us know your findings.
> 
> Best,
> Daniel
> 
Hi Daniel,

Your presumption was correct, all four tests are failing because they
exceed BPF_SIZE_MAX. I'll send a patch shortly.

Thanks!
Kleber

^ permalink raw reply

* Re: [PATCH net-next] net: preserve sock reference when scrubbing the skb.
From: Flavio Leitner @ 2018-06-27 12:31 UTC (permalink / raw)
  To: Cong Wang
  Cc: Eric Dumazet, Linux Kernel Network Developers, Paolo Abeni,
	David Miller, Florian Westphal, NetFilter
In-Reply-To: <CAM_iQpWxHb8VrHUQAEgOQ0YsSVt5MMZvGvAQVuA-JGcfjc=ubg@mail.gmail.com>

On Tue, Jun 26, 2018 at 06:28:27PM -0700, Cong Wang wrote:
> On Tue, Jun 26, 2018 at 5:39 PM Flavio Leitner <fbl@redhat.com> wrote:
> >
> > On Tue, Jun 26, 2018 at 05:29:51PM -0700, Cong Wang wrote:
> > > On Tue, Jun 26, 2018 at 4:33 PM Flavio Leitner <fbl@redhat.com> wrote:
> > > >
> > > > It is still isolated, the sk carries the netns info and it is
> > > > orphaned when it re-enters the stack.
> > >
> > > Then what difference does your patch make?
> >
> > Don't forget it is fixing two issues.
> 
> Sure. I am only talking about TSQ from the very beginning.
> Let me rephrase my above question:
> What difference does your patch make to TSQ?

It avoids burstiness.

> > > Before your patch:
> > > veth orphans skb in its xmit
> > >
> > > After your patch:
> > > RX orphans it when re-entering stack (as you claimed, I don't know)
> >
> > ip_rcv, and equivalents.
> 
> ip_rcv() is L3, we enter a stack from L1. So your above claim is incorrect. :)

Maybe you found a problem, could you please point me to where in
between L1 to L3 the socket is relevant?


> > > And for veth pair:
> > > xmit from one side is RX for the other side
> > > So, where is the queueing? Where is the buffer bloat? GRO list??
> >
> > CPU backlog.
> 
> Yeah, but this is never targeted by TSQ:
> 
>         tcp_limit_output_bytes limits the number of bytes on qdisc
>         or device to reduce artificial RTT/cwnd and reduce bufferbloat.
> 
> which means you have to update Documentation/networking/ip-sysctl.txt
> too.

How it is never targeted? Whole point is to avoid queueing traffic.
Would you be okay if I include this chunk?

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index ce8fbf5aa63c..f4c042be0216 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -733,11 +733,11 @@ tcp_limit_output_bytes - INTEGER
 	Controls TCP Small Queue limit per tcp socket.
 	TCP bulk sender tends to increase packets in flight until it
 	gets losses notifications. With SNDBUF autotuning, this can
-	result in a large amount of packets queued in qdisc/device
-	on the local machine, hurting latency of other flows, for
-	typical pfifo_fast qdiscs.
-	tcp_limit_output_bytes limits the number of bytes on qdisc
-	or device to reduce artificial RTT/cwnd and reduce bufferbloat.
+	result in a large amount of packets queued on the local machine
+	(e.g.: qdiscs, CPU backlog, or device) hurting latency of other
+	flows, for typical pfifo_fast qdiscs.  tcp_limit_output_bytes
+	limits the number of bytes on qdisc or device to reduce artificial
+	RTT/cwnd and reduce bufferbloat.
 	Default: 262144
 
 tcp_challenge_ack_limit - INTEGER


-- 
Flavio

^ permalink raw reply related

* Re: [RFC PATCH v2 net-next 07/12] net: ipv4: listified version of ip_rcv
From: Florian Westphal @ 2018-06-27 12:32 UTC (permalink / raw)
  To: Edward Cree; +Cc: linux-net-drivers, netdev, davem
In-Reply-To: <f4b330bf-a1b3-2849-4265-faa0aaa1e941@solarflare.com>

Edward Cree <ecree@solarflare.com> wrote:
> Also involved adding a way to run a netfilter hook over a list of packets.
>  Rather than attempting to make netfilter know about lists (which would be
>  a major project in itself) we just let it call the regular okfn (in this
>  case ip_rcv_finish()) for any packets it steals, and have it give us back
>  a list of packets it's synchronously accepted (which normally NF_HOOK
>  would automatically call okfn() on, but we want to be able to potentially
>  pass the list to a listified version of okfn().)

okfn() is only used during async reinject in NFQUEUE case,
skb is queued in kernel and we'll wait for a verdict from a userspace
process.  If thats ACCEPT, then okfn() gets called to reinject the skb
into the network stack.

A normal -j ACCEPT doesn't call okfn in the netfilter core, which is why
this occurs on '1' retval in NF_HOOK().

Only other user of okfn() is bridge netfilter, so listified version of
okfn() doesn't make too much sense to me, its not used normally
(unless such listified version makes the code simpler of course).

AFAICS its fine to unlink/free skbs from the list to handle
drops/queueing etc. so a future version of nf_hook() could propagate the
list into nf_hook_slow and mangle the list there to deal with hooks
that steal/drop/queue skbs.

Later on we can pass the list to the hook functions themselves.

We'll have to handle non-accept verdicts in-place in the hook functions
for this, but fortunately most hookfns only return NF_ACCEPT so I think
it is manageable.

I'll look into this once the series makes it to net-next.

^ permalink raw reply

* [PATCH net-next] net: stmmac: Add support for CBS QDISC
From: Jose Abreu @ 2018-06-27 12:43 UTC (permalink / raw)
  To: netdev
  Cc: Jose Abreu, David S. Miller, Joao Pinto, Vitor Soares,
	Giuseppe Cavallaro, Alexandre Torgue

This adds support for CBS reconfiguration using the TC application.

A new callback was added to TC ops struct and another one to DMA ops to
reconfigure the channel mode.

Tested in GMAC5.10.

Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Vitor Soares <soares@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
---
 drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c  | 15 ++++++
 drivers/net/ethernet/stmicro/stmmac/hwif.h        |  8 +++
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c |  2 +
 drivers/net/ethernet/stmicro/stmmac/stmmac_tc.c   | 62 +++++++++++++++++++++++
 4 files changed, 87 insertions(+)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c b/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c
index d37f17ca62fe..6e32f8a3710b 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c
@@ -407,6 +407,19 @@ static void dwmac4_enable_tso(void __iomem *ioaddr, bool en, u32 chan)
 	}
 }
 
+static void dwmac4_qmode(void __iomem *ioaddr, u32 channel, u8 qmode)
+{
+	u32 mtl_tx_op = readl(ioaddr + MTL_CHAN_TX_OP_MODE(channel));
+
+	mtl_tx_op &= ~MTL_OP_MODE_TXQEN_MASK;
+	if (qmode != MTL_QUEUE_AVB)
+		mtl_tx_op |= MTL_OP_MODE_TXQEN;
+	else
+		mtl_tx_op |= MTL_OP_MODE_TXQEN_AV;
+
+	writel(mtl_tx_op, ioaddr +  MTL_CHAN_TX_OP_MODE(channel));
+}
+
 const struct stmmac_dma_ops dwmac4_dma_ops = {
 	.reset = dwmac4_dma_reset,
 	.init = dwmac4_dma_init,
@@ -431,6 +444,7 @@ const struct stmmac_dma_ops dwmac4_dma_ops = {
 	.set_rx_tail_ptr = dwmac4_set_rx_tail_ptr,
 	.set_tx_tail_ptr = dwmac4_set_tx_tail_ptr,
 	.enable_tso = dwmac4_enable_tso,
+	.qmode = dwmac4_qmode,
 };
 
 const struct stmmac_dma_ops dwmac410_dma_ops = {
@@ -457,4 +471,5 @@ const struct stmmac_dma_ops dwmac410_dma_ops = {
 	.set_rx_tail_ptr = dwmac4_set_rx_tail_ptr,
 	.set_tx_tail_ptr = dwmac4_set_tx_tail_ptr,
 	.enable_tso = dwmac4_enable_tso,
+	.qmode = dwmac4_qmode,
 };
diff --git a/drivers/net/ethernet/stmicro/stmmac/hwif.h b/drivers/net/ethernet/stmicro/stmmac/hwif.h
index e44e7b26ce82..e2a965790648 100644
--- a/drivers/net/ethernet/stmicro/stmmac/hwif.h
+++ b/drivers/net/ethernet/stmicro/stmmac/hwif.h
@@ -183,6 +183,7 @@ struct stmmac_dma_ops {
 	void (*set_rx_tail_ptr)(void __iomem *ioaddr, u32 tail_ptr, u32 chan);
 	void (*set_tx_tail_ptr)(void __iomem *ioaddr, u32 tail_ptr, u32 chan);
 	void (*enable_tso)(void __iomem *ioaddr, bool en, u32 chan);
+	void (*qmode)(void __iomem *ioaddr, u32 channel, u8 qmode);
 };
 
 #define stmmac_reset(__priv, __args...) \
@@ -235,6 +236,8 @@ struct stmmac_dma_ops {
 	stmmac_do_void_callback(__priv, dma, set_tx_tail_ptr, __args)
 #define stmmac_enable_tso(__priv, __args...) \
 	stmmac_do_void_callback(__priv, dma, enable_tso, __args)
+#define stmmac_dma_qmode(__priv, __args...) \
+	stmmac_do_void_callback(__priv, dma, qmode, __args)
 
 struct mac_device_info;
 struct net_device;
@@ -441,17 +444,22 @@ struct stmmac_mode_ops {
 
 struct stmmac_priv;
 struct tc_cls_u32_offload;
+struct tc_cbs_qopt_offload;
 
 struct stmmac_tc_ops {
 	int (*init)(struct stmmac_priv *priv);
 	int (*setup_cls_u32)(struct stmmac_priv *priv,
 			     struct tc_cls_u32_offload *cls);
+	int (*setup_cbs)(struct stmmac_priv *priv,
+			 struct tc_cbs_qopt_offload *qopt);
 };
 
 #define stmmac_tc_init(__priv, __args...) \
 	stmmac_do_callback(__priv, tc, init, __args)
 #define stmmac_tc_setup_cls_u32(__priv, __args...) \
 	stmmac_do_callback(__priv, tc, setup_cls_u32, __args)
+#define stmmac_tc_setup_cbs(__priv, __args...) \
+	stmmac_do_callback(__priv, tc, setup_cbs, __args)
 
 struct stmmac_regs_off {
 	u32 ptp_off;
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index ccf3e2e161ca..a6be3b59a728 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -3778,6 +3778,8 @@ static int stmmac_setup_tc(struct net_device *ndev, enum tc_setup_type type,
 	switch (type) {
 	case TC_SETUP_BLOCK:
 		return stmmac_setup_tc_block(priv, type_data);
+	case TC_SETUP_QDISC_CBS:
+		return stmmac_tc_setup_cbs(priv, priv, type_data);
 	default:
 		return -EOPNOTSUPP;
 	}
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_tc.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_tc.c
index 881c94b73e2f..67542bd21010 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_tc.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_tc.c
@@ -289,7 +289,69 @@ static int tc_init(struct stmmac_priv *priv)
 	return 0;
 }
 
+static int tc_setup_cbs(struct stmmac_priv *priv,
+			struct tc_cbs_qopt_offload *qopt)
+{
+	u32 tx_queues_count = priv->plat->tx_queues_to_use;
+	u32 queue = qopt->queue;
+	u32 ptr, speed_div;
+	u32 mode_to_use;
+	s64 value;
+	int ret;
+
+	/* Queue 0 is not AVB capable */
+	if (queue <= 0 || queue >= tx_queues_count)
+		return -EINVAL;
+	if (priv->speed != SPEED_100 && priv->speed != SPEED_1000)
+		return -EOPNOTSUPP;
+
+	mode_to_use = priv->plat->tx_queues_cfg[queue].mode_to_use;
+	if (mode_to_use == MTL_QUEUE_DCB && qopt->enable) {
+		ret = stmmac_dma_qmode(priv, priv->ioaddr, queue, MTL_QUEUE_AVB);
+		if (ret)
+			return ret;
+
+		priv->plat->tx_queues_cfg[queue].mode_to_use = MTL_QUEUE_AVB;
+	} else if (!qopt->enable) {
+		return stmmac_dma_qmode(priv, priv->ioaddr, queue, MTL_QUEUE_DCB);
+	}
+
+	/* Port Transmit Rate and Speed Divider */
+	ptr = (priv->speed == SPEED_100) ? 4 : 8;
+	speed_div = (priv->speed == SPEED_100) ? 100000 : 1000000;
+
+	/* Final adjustments for HW */
+	value = qopt->idleslope * 1024 * ptr;
+	do_div(value, speed_div);
+	priv->plat->tx_queues_cfg[queue].idle_slope = value & GENMASK(31, 0);
+
+	value = -qopt->sendslope * 1024UL * ptr;
+	do_div(value, speed_div);
+	priv->plat->tx_queues_cfg[queue].send_slope = value & GENMASK(31, 0);
+
+	value = qopt->hicredit * 1024 * 8;
+	priv->plat->tx_queues_cfg[queue].high_credit = value & GENMASK(31, 0);
+
+	value = qopt->locredit * 1024 * 8;
+	priv->plat->tx_queues_cfg[queue].low_credit = value & GENMASK(31, 0);
+
+	ret = stmmac_config_cbs(priv, priv->hw,
+				priv->plat->tx_queues_cfg[queue].send_slope,
+				priv->plat->tx_queues_cfg[queue].idle_slope,
+				priv->plat->tx_queues_cfg[queue].high_credit,
+				priv->plat->tx_queues_cfg[queue].low_credit,
+				queue);
+	if (ret)
+		return ret;
+
+	dev_info(priv->device, "CBS queue %d: send %d, idle %d, hi %d, lo %d\n",
+			queue, qopt->sendslope, qopt->idleslope,
+			qopt->hicredit, qopt->locredit);
+	return 0;
+}
+
 const struct stmmac_tc_ops dwmac510_tc_ops = {
 	.init = tc_init,
 	.setup_cls_u32 = tc_setup_cls_u32,
+	.setup_cbs = tc_setup_cbs,
 };
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH bpf-next 2/3] bpf: btf: add btf json print functionality
From: Daniel Borkmann @ 2018-06-27 12:56 UTC (permalink / raw)
  To: Okash Khawaja
  Cc: Jakub Kicinski, Martin KaFai Lau, Alexei Starovoitov,
	Yonghong Song, Quentin Monnet, David S. Miller, netdev,
	kernel-team, linux-kernel
In-Reply-To: <20180627114737.GA1580@w1t1fb>

On 06/27/2018 01:47 PM, Okash Khawaja wrote:
> On Wed, Jun 27, 2018 at 12:34:35PM +0200, Daniel Borkmann wrote:
>> On 06/27/2018 12:35 AM, Jakub Kicinski wrote:
>>> On Tue, 26 Jun 2018 15:27:09 -0700, Martin KaFai Lau wrote:
>>>> On Tue, Jun 26, 2018 at 01:31:33PM -0700, Jakub Kicinski wrote:
>> [...]
>>>>> Implementing both outputs in one series will help you structure your
>>>>> code to best suit both of the formats up front.  
>>>> hex and "formatted" are the only things missing?  As always, things
>>>> can be refactored when new use case comes up.  Lets wait for
>>>> Okash input.
>>>>
>>>> Regardless, plaintext is our current use case.  Having the current
>>>> patchset in does not stop us or others from contributing other use
>>>> cases (json, "bpftool map find"...etc),  and IMO it is actually
>>>> the opposite.  Others may help us get there faster than us alone.
>>>> We should not stop making forward progress and take this patch
>>>> as hostage because "abc" and "xyz" are not done together.
>>>
>>> Parity between JSON and plain text output is non negotiable.
>>
>> Longish discussion and some confusion in this thread. :-) First of all
>> thanks a lot for working on it, very useful! 
> Thanks :)
> 
>> My $0.02 on it is that so far
>> great care has been taken in bpftool to indeed have feature parity between
>> JSON and plain text, so it would be highly desirable to keep continuing
>> this practice if the consensus is that it indeed is feasible and makes
>> sense wrt BTF data. There has been mentioned that given BTF data can be
>> dynamic depending on what the user loads via bpf(2) so a potential JSON
>> output may look different/break each time anyway. This however could all be
>> embedded under a container object that has a fixed key like 'formatted'
>> where tools like jq(1) can query into it. I think this would be fine since
>> the rest of the (non-dynamic) output is still retained as-is and then
>> wouldn't confuse or collide with existing users, and anyone programmatically
>> parsing deeper into the BTF data under such JSON container object needs
>> to have awareness of what specific data it wants to query from it; so
>> there's no conflict wrt breaking anything here. Imho, both outputs would
>> be very valuable.
> Okay I can add "formatted" object under json output.
> 
> One thing to note here is that the fixed output will change if the map
> itself changes. So someone writing a program that consumes that fixed
> output will have to account for his program breaking in future, thus

Yes, that aspect is fine though, any program/script parsing this would need
to be aware of the underlying map type to make sense of it (e.g. per-cpu vs
non per-cpu maps to name one). But that info it could query/verify already
beforehand via bpftool as well (via normal map info dump for a given id).

> breaking backward compatibility anyway as far as the developer is
> concerned :)
> 
> I will go ahead with work on "formatted" object.

Cool, thanks,
Daniel

^ permalink raw reply

* Re: [PATCH] infiniband: i40iw, nes: don't use wall time for TCP sequence numbers
From: Arnd Bergmann @ 2018-06-27 13:24 UTC (permalink / raw)
  To: Shiraz Saleem
  Cc: Ismail, Mustafa, Kees Cook, Latif, Faisal, y2038@lists.linaro.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Yuval Shaia,
	Orosco, Henry, Jason Gunthorpe, Doug Ledford, Jia-Ju Bai,
	Nikolova, Tatyana E, Bart Van Assche, David S. Miller,
	Reshetova, Elena, linux-rdma@vger.kernel.org
In-Reply-To: <20180623000137.GA17264@ssaleem-MOBL4.amr.corp.intel.com>

On Sat, Jun 23, 2018 at 2:01 AM, Shiraz Saleem <shiraz.saleem@intel.com> wrote:

>> @@ -2164,7 +2165,6 @@ static struct i40iw_cm_node *i40iw_make_cm_node(
>>                                  struct i40iw_cm_listener *listener)
>>  {
>>       struct i40iw_cm_node *cm_node;
>> -     struct timespec ts;
>>       int oldarpindex;
>>       int arpindex;
>>       struct net_device *netdev = iwdev->netdev;
>> @@ -2214,8 +2214,10 @@ static struct i40iw_cm_node *i40iw_make_cm_node(
>>       cm_node->tcp_cntxt.rcv_wscale = I40IW_CM_DEFAULT_RCV_WND_SCALE;
>>       cm_node->tcp_cntxt.rcv_wnd =
>>                       I40IW_CM_DEFAULT_RCV_WND_SCALED >> I40IW_CM_DEFAULT_RCV_WND_SCALE;
>> -     ts = current_kernel_time();
>> -     cm_node->tcp_cntxt.loc_seq_num = ts.tv_nsec;
>> +     cm_node->tcp_cntxt.loc_seq_num = secure_tcp_seq(htonl(cm_node->loc_addr[0]),
>> +                                                     htonl(cm_node->rem_addr[0]),
>> +                                                     htons(cm_node->loc_port),
>> +                                                     htons(cm_node->rem_port));
>
> Should we not be using secure_tcpv6_seq() when we are ipv6?

I had not realized that there is a difference, but yes, from looking
at that function it seems that we should. v2 coming now.

       Arnd
_______________________________________________
Y2038 mailing list
Y2038@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/y2038

^ permalink raw reply

* [PATCH] [v2] infiniband: i40iw, nes: don't use wall time for TCP sequence numbers
From: Arnd Bergmann @ 2018-06-27 13:26 UTC (permalink / raw)
  To: Faisal Latif, Shiraz Saleem, Doug Ledford, Jason Gunthorpe,
	David S. Miller
  Cc: Arnd Bergmann, Henry Orosco, Tatyana Nikolova, Mustafa Ismail,
	Bart Van Assche, Yuval Shaia, linux-rdma, linux-kernel, netdev

The nes infiniband driver uses current_kernel_time() to get a nanosecond
granunarity timestamp to initialize its tcp sequence counters. This is
one of only a few remaining users of that deprecated function, so we
should try to get rid of it.

Aside from using a deprecated API, there are several problems I see here:

- Using a CLOCK_REALTIME based time source makes it predictable in
  case the time base is synchronized.
- Using a coarse timestamp means it only gets updated once per jiffie,
  making it even more predictable in order to avoid having to access
  the hardware clock source
- The upper 2 bits are always zero because the nanoseconds are at most
  999999999.

For the Linux TCP implementation, we use secure_tcp_seq(), which appears
to be appropriate here as well, and solves all the above problems.

i40iw uses a variant of the same code, so I do that same thing there
for ipv4. Unlike nes, i40e also supports ipv6, which needs to call
secure_tcpv6_seq instead.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
v2: use secure_tcpv6_seq for IPv6 support as suggested by Shiraz Saleem.
---
 drivers/infiniband/hw/i40iw/i40iw_cm.c | 26 +++++++++++++++++++++-----
 drivers/infiniband/hw/nes/nes_cm.c     |  8 +++++---
 net/core/secure_seq.c                  |  1 +
 3 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/hw/i40iw/i40iw_cm.c b/drivers/infiniband/hw/i40iw/i40iw_cm.c
index 7b2655128b9f..e2590784b857 100644
--- a/drivers/infiniband/hw/i40iw/i40iw_cm.c
+++ b/drivers/infiniband/hw/i40iw/i40iw_cm.c
@@ -57,6 +57,7 @@
 #include <net/addrconf.h>
 #include <net/ip6_route.h>
 #include <net/ip_fib.h>
+#include <net/secure_seq.h>
 #include <net/tcp.h>
 #include <asm/checksum.h>
 
@@ -2164,7 +2165,6 @@ static struct i40iw_cm_node *i40iw_make_cm_node(
 				   struct i40iw_cm_listener *listener)
 {
 	struct i40iw_cm_node *cm_node;
-	struct timespec ts;
 	int oldarpindex;
 	int arpindex;
 	struct net_device *netdev = iwdev->netdev;
@@ -2214,10 +2214,26 @@ static struct i40iw_cm_node *i40iw_make_cm_node(
 	cm_node->tcp_cntxt.rcv_wscale = I40IW_CM_DEFAULT_RCV_WND_SCALE;
 	cm_node->tcp_cntxt.rcv_wnd =
 			I40IW_CM_DEFAULT_RCV_WND_SCALED >> I40IW_CM_DEFAULT_RCV_WND_SCALE;
-	ts = current_kernel_time();
-	cm_node->tcp_cntxt.loc_seq_num = ts.tv_nsec;
-	cm_node->tcp_cntxt.mss = (cm_node->ipv4) ? (iwdev->vsi.mtu - I40IW_MTU_TO_MSS_IPV4) :
-				 (iwdev->vsi.mtu - I40IW_MTU_TO_MSS_IPV6);
+	if (cm_node->ipv4) {
+		cm_node->tcp_cntxt.loc_seq_num = secure_tcp_seq(htonl(cm_node->loc_addr[0]),
+							htonl(cm_node->rem_addr[0]),
+							htons(cm_node->loc_port),
+							htons(cm_node->rem_port));
+		cm_node->tcp_cntxt.mss = iwdev->vsi.mtu - I40IW_MTU_TO_MSS_IPV4;
+	} else {
+		__be32 loc[4] = {
+			htonl(cm_node->loc_addr[0]), htonl(cm_node->loc_addr[1]),
+			htonl(cm_node->loc_addr[2]), htonl(cm_node->loc_addr[3])
+		};
+		__be32 rem[4] = {
+			htonl(cm_node->rem_addr[0]), htonl(cm_node->rem_addr[1]),
+			htonl(cm_node->rem_addr[2]), htonl(cm_node->rem_addr[3])
+		};
+		cm_node->tcp_cntxt.loc_seq_num = secure_tcpv6_seq(loc, rem,
+							htons(cm_node->loc_port),
+							htons(cm_node->rem_port));
+		cm_node->tcp_cntxt.mss = iwdev->vsi.mtu - I40IW_MTU_TO_MSS_IPV6;
+	}
 
 	cm_node->iwdev = iwdev;
 	cm_node->dev = &iwdev->sc_dev;
diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c
index 6cdfbf8c5674..2b67ace5b614 100644
--- a/drivers/infiniband/hw/nes/nes_cm.c
+++ b/drivers/infiniband/hw/nes/nes_cm.c
@@ -58,6 +58,7 @@
 #include <net/neighbour.h>
 #include <net/route.h>
 #include <net/ip_fib.h>
+#include <net/secure_seq.h>
 #include <net/tcp.h>
 #include <linux/fcntl.h>
 
@@ -1445,7 +1446,6 @@ static struct nes_cm_node *make_cm_node(struct nes_cm_core *cm_core,
 					struct nes_cm_listener *listener)
 {
 	struct nes_cm_node *cm_node;
-	struct timespec ts;
 	int oldarpindex = 0;
 	int arpindex = 0;
 	struct nes_device *nesdev;
@@ -1496,8 +1496,10 @@ static struct nes_cm_node *make_cm_node(struct nes_cm_core *cm_core,
 	cm_node->tcp_cntxt.rcv_wscale = NES_CM_DEFAULT_RCV_WND_SCALE;
 	cm_node->tcp_cntxt.rcv_wnd = NES_CM_DEFAULT_RCV_WND_SCALED >>
 				     NES_CM_DEFAULT_RCV_WND_SCALE;
-	ts = current_kernel_time();
-	cm_node->tcp_cntxt.loc_seq_num = htonl(ts.tv_nsec);
+	cm_node->tcp_cntxt.loc_seq_num = secure_tcp_seq(htonl(cm_node->loc_addr),
+							htonl(cm_node->rem_addr),
+							htons(cm_node->loc_port),
+							htons(cm_node->rem_port));
 	cm_node->tcp_cntxt.mss = nesvnic->max_frame_size - sizeof(struct iphdr) -
 				 sizeof(struct tcphdr) - ETH_HLEN - VLAN_HLEN;
 	cm_node->tcp_cntxt.rcv_nxt = 0;
diff --git a/net/core/secure_seq.c b/net/core/secure_seq.c
index 7232274de334..af6ad467ed61 100644
--- a/net/core/secure_seq.c
+++ b/net/core/secure_seq.c
@@ -140,6 +140,7 @@ u32 secure_tcp_seq(__be32 saddr, __be32 daddr,
 			    &net_secret);
 	return seq_scale(hash);
 }
+EXPORT_SYMBOL_GPL(secure_tcp_seq);
 
 u32 secure_ipv4_port_ephemeral(__be32 saddr, __be32 daddr, __be16 dport)
 {
-- 
2.9.0

^ permalink raw reply related

* [PATCH v2 net-next 0/2] net: preserve sock reference when scrubbing the skb.
From: Flavio Leitner @ 2018-06-27 13:34 UTC (permalink / raw)
  To: netdev
  Cc: Eric Dumazet, Paolo Abeni, David Miller, Florian Westphal,
	netfilter-devel, Flavio Leitner

The sock reference is lost when scrubbing the packet and that breaks
TSQ (TCP Small Queues) and XPS (Transmit Packet Steering) causing
performance impacts of about 50% in a single TCP stream when crossing
network namespaces.

XPS breaks because the queue mapping stored in the socket is not
available, so another random queue might be selected when the stack
needs to transmit something like a TCP ACK, or TCP Retransmissions.
That causes packet re-ordering and/or performance issues.

TSQ breaks because it orphans the packet while it is still in the
host, so packets are queued contributing to the buffer bloat problem.

Preserving the sock reference fixes both issues. The socket is
orphaned anyways in the receiving path before any relevant action,
but the transmit side needs some extra checking included in the
first patch.

The first patch will update netfilter to check if the socket
netns is local before use it.

The second patch removes the skb_orphan() from the skb_scrub_packet()
and improve the documentation.

ChangeLog:
- split into two (Eric)
- addressed Paolo's offline feedback to swap the checks in xt_socket.c
  to preserve original behavior.
- improved ip-sysctl.txt (reported by Cong)

Flavio Leitner (2):
  netfilter: check if the socket netns is correct.
  skbuff: preserve sock reference when scrubbing the skb.

 Documentation/networking/ip-sysctl.txt | 10 +++++-----
 include/net/netfilter/nf_log.h         |  3 ++-
 net/core/skbuff.c                      |  1 -
 net/ipv4/netfilter/nf_log_ipv4.c       |  8 ++++----
 net/ipv6/netfilter/nf_log_ipv6.c       |  8 ++++----
 net/netfilter/nf_conntrack_broadcast.c |  2 +-
 net/netfilter/nf_log_common.c          |  5 +++--
 net/netfilter/nf_nat_core.c            |  6 +++++-
 net/netfilter/nft_meta.c               |  9 ++++++---
 net/netfilter/nft_socket.c             |  5 ++++-
 net/netfilter/xt_cgroup.c              |  6 ++++--
 net/netfilter/xt_owner.c               |  2 +-
 net/netfilter/xt_recent.c              |  3 ++-
 net/netfilter/xt_socket.c              |  8 ++++++++
 14 files changed, 49 insertions(+), 27 deletions(-)

-- 
2.14.3

^ permalink raw reply

* [PATCH v2 net-next 1/2] netfilter: check if the socket netns is correct.
From: Flavio Leitner @ 2018-06-27 13:34 UTC (permalink / raw)
  To: netdev
  Cc: Eric Dumazet, Paolo Abeni, David Miller, Florian Westphal,
	netfilter-devel, Flavio Leitner
In-Reply-To: <20180627133426.3858-1-fbl@redhat.com>

Netfilter assumes that if the socket is present in the skb, then
it can be used because that reference is cleaned up while the skb
is crossing netns.

We want to change that to preserve the socket reference in a future
patch, so this is a preparation updating netfilter to check if the
socket netns matches before use it.

Signed-off-by: Flavio Leitner <fbl@redhat.com>
---
 include/net/netfilter/nf_log.h         | 3 ++-
 net/ipv4/netfilter/nf_log_ipv4.c       | 8 ++++----
 net/ipv6/netfilter/nf_log_ipv6.c       | 8 ++++----
 net/netfilter/nf_conntrack_broadcast.c | 2 +-
 net/netfilter/nf_log_common.c          | 5 +++--
 net/netfilter/nf_nat_core.c            | 6 +++++-
 net/netfilter/nft_meta.c               | 9 ++++++---
 net/netfilter/nft_socket.c             | 5 ++++-
 net/netfilter/xt_cgroup.c              | 6 ++++--
 net/netfilter/xt_owner.c               | 2 +-
 net/netfilter/xt_recent.c              | 3 ++-
 net/netfilter/xt_socket.c              | 8 ++++++++
 12 files changed, 44 insertions(+), 21 deletions(-)

diff --git a/include/net/netfilter/nf_log.h b/include/net/netfilter/nf_log.h
index e811ac07ea94..0d3920896d50 100644
--- a/include/net/netfilter/nf_log.h
+++ b/include/net/netfilter/nf_log.h
@@ -106,7 +106,8 @@ int nf_log_dump_udp_header(struct nf_log_buf *m, const struct sk_buff *skb,
 int nf_log_dump_tcp_header(struct nf_log_buf *m, const struct sk_buff *skb,
 			   u8 proto, int fragment, unsigned int offset,
 			   unsigned int logflags);
-void nf_log_dump_sk_uid_gid(struct nf_log_buf *m, struct sock *sk);
+void nf_log_dump_sk_uid_gid(struct net *net, struct nf_log_buf *m,
+			    struct sock *sk);
 void nf_log_dump_packet_common(struct nf_log_buf *m, u_int8_t pf,
 			       unsigned int hooknum, const struct sk_buff *skb,
 			       const struct net_device *in,
diff --git a/net/ipv4/netfilter/nf_log_ipv4.c b/net/ipv4/netfilter/nf_log_ipv4.c
index 4388de0e5380..1e6f28c97d3a 100644
--- a/net/ipv4/netfilter/nf_log_ipv4.c
+++ b/net/ipv4/netfilter/nf_log_ipv4.c
@@ -35,7 +35,7 @@ static const struct nf_loginfo default_loginfo = {
 };
 
 /* One level of recursion won't kill us */
-static void dump_ipv4_packet(struct nf_log_buf *m,
+static void dump_ipv4_packet(struct net *net, struct nf_log_buf *m,
 			     const struct nf_loginfo *info,
 			     const struct sk_buff *skb, unsigned int iphoff)
 {
@@ -183,7 +183,7 @@ static void dump_ipv4_packet(struct nf_log_buf *m,
 			/* Max length: 3+maxlen */
 			if (!iphoff) { /* Only recurse once. */
 				nf_log_buf_add(m, "[");
-				dump_ipv4_packet(m, info, skb,
+				dump_ipv4_packet(net, m, info, skb,
 					    iphoff + ih->ihl*4+sizeof(_icmph));
 				nf_log_buf_add(m, "] ");
 			}
@@ -251,7 +251,7 @@ static void dump_ipv4_packet(struct nf_log_buf *m,
 
 	/* Max length: 15 "UID=4294967295 " */
 	if ((logflags & NF_LOG_UID) && !iphoff)
-		nf_log_dump_sk_uid_gid(m, skb->sk);
+		nf_log_dump_sk_uid_gid(net, m, skb->sk);
 
 	/* Max length: 16 "MARK=0xFFFFFFFF " */
 	if (!iphoff && skb->mark)
@@ -333,7 +333,7 @@ static void nf_log_ip_packet(struct net *net, u_int8_t pf,
 	if (in != NULL)
 		dump_ipv4_mac_header(m, loginfo, skb);
 
-	dump_ipv4_packet(m, loginfo, skb, 0);
+	dump_ipv4_packet(net, m, loginfo, skb, 0);
 
 	nf_log_buf_close(m);
 }
diff --git a/net/ipv6/netfilter/nf_log_ipv6.c b/net/ipv6/netfilter/nf_log_ipv6.c
index b397a8fe88b9..c6bf580d0f33 100644
--- a/net/ipv6/netfilter/nf_log_ipv6.c
+++ b/net/ipv6/netfilter/nf_log_ipv6.c
@@ -36,7 +36,7 @@ static const struct nf_loginfo default_loginfo = {
 };
 
 /* One level of recursion won't kill us */
-static void dump_ipv6_packet(struct nf_log_buf *m,
+static void dump_ipv6_packet(struct net *net, struct nf_log_buf *m,
 			     const struct nf_loginfo *info,
 			     const struct sk_buff *skb, unsigned int ip6hoff,
 			     int recurse)
@@ -258,7 +258,7 @@ static void dump_ipv6_packet(struct nf_log_buf *m,
 			/* Max length: 3+maxlen */
 			if (recurse) {
 				nf_log_buf_add(m, "[");
-				dump_ipv6_packet(m, info, skb,
+				dump_ipv6_packet(net, m, info, skb,
 						 ptr + sizeof(_icmp6h), 0);
 				nf_log_buf_add(m, "] ");
 			}
@@ -278,7 +278,7 @@ static void dump_ipv6_packet(struct nf_log_buf *m,
 
 	/* Max length: 15 "UID=4294967295 " */
 	if ((logflags & NF_LOG_UID) && recurse)
-		nf_log_dump_sk_uid_gid(m, skb->sk);
+		nf_log_dump_sk_uid_gid(net, m, skb->sk);
 
 	/* Max length: 16 "MARK=0xFFFFFFFF " */
 	if (recurse && skb->mark)
@@ -365,7 +365,7 @@ static void nf_log_ip6_packet(struct net *net, u_int8_t pf,
 	if (in != NULL)
 		dump_ipv6_mac_header(m, loginfo, skb);
 
-	dump_ipv6_packet(m, loginfo, skb, skb_network_offset(skb), 1);
+	dump_ipv6_packet(net, m, loginfo, skb, skb_network_offset(skb), 1);
 
 	nf_log_buf_close(m);
 }
diff --git a/net/netfilter/nf_conntrack_broadcast.c b/net/netfilter/nf_conntrack_broadcast.c
index a1086bdec242..5423b197d98a 100644
--- a/net/netfilter/nf_conntrack_broadcast.c
+++ b/net/netfilter/nf_conntrack_broadcast.c
@@ -32,7 +32,7 @@ int nf_conntrack_broadcast_help(struct sk_buff *skb,
 	__be32 mask = 0;
 
 	/* we're only interested in locally generated packets */
-	if (skb->sk == NULL)
+	if (skb->sk == NULL || !net_eq(nf_ct_net(ct), sock_net(skb->sk)))
 		goto out;
 	if (rt == NULL || !(rt->rt_flags & RTCF_BROADCAST))
 		goto out;
diff --git a/net/netfilter/nf_log_common.c b/net/netfilter/nf_log_common.c
index dc61399e30be..a8c5c846aec1 100644
--- a/net/netfilter/nf_log_common.c
+++ b/net/netfilter/nf_log_common.c
@@ -132,9 +132,10 @@ int nf_log_dump_tcp_header(struct nf_log_buf *m, const struct sk_buff *skb,
 }
 EXPORT_SYMBOL_GPL(nf_log_dump_tcp_header);
 
-void nf_log_dump_sk_uid_gid(struct nf_log_buf *m, struct sock *sk)
+void nf_log_dump_sk_uid_gid(struct net *net, struct nf_log_buf *m,
+			    struct sock *sk)
 {
-	if (!sk || !sk_fullsock(sk))
+	if (!sk || !sk_fullsock(sk) || !net_eq(net, sock_net(sk)))
 		return;
 
 	read_lock_bh(&sk->sk_callback_lock);
diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c
index 46f9df99d276..86df2a1666fd 100644
--- a/net/netfilter/nf_nat_core.c
+++ b/net/netfilter/nf_nat_core.c
@@ -108,6 +108,7 @@ int nf_xfrm_me_harder(struct net *net, struct sk_buff *skb, unsigned int family)
 	struct flowi fl;
 	unsigned int hh_len;
 	struct dst_entry *dst;
+	struct sock *sk = skb->sk;
 	int err;
 
 	err = xfrm_decode_session(skb, &fl, family);
@@ -119,7 +120,10 @@ int nf_xfrm_me_harder(struct net *net, struct sk_buff *skb, unsigned int family)
 		dst = ((struct xfrm_dst *)dst)->route;
 	dst_hold(dst);
 
-	dst = xfrm_lookup(net, dst, &fl, skb->sk, 0);
+	if (sk && !net_eq(net, sock_net(sk)))
+		sk = NULL;
+
+	dst = xfrm_lookup(net, dst, &fl, sk, 0);
 	if (IS_ERR(dst))
 		return PTR_ERR(dst);
 
diff --git a/net/netfilter/nft_meta.c b/net/netfilter/nft_meta.c
index 1105a23bda5e..2b94dcc43456 100644
--- a/net/netfilter/nft_meta.c
+++ b/net/netfilter/nft_meta.c
@@ -107,7 +107,8 @@ static void nft_meta_get_eval(const struct nft_expr *expr,
 		break;
 	case NFT_META_SKUID:
 		sk = skb_to_full_sk(skb);
-		if (!sk || !sk_fullsock(sk))
+		if (!sk || !sk_fullsock(sk) ||
+		    !net_eq(nft_net(pkt), sock_net(sk)))
 			goto err;
 
 		read_lock_bh(&sk->sk_callback_lock);
@@ -123,7 +124,8 @@ static void nft_meta_get_eval(const struct nft_expr *expr,
 		break;
 	case NFT_META_SKGID:
 		sk = skb_to_full_sk(skb);
-		if (!sk || !sk_fullsock(sk))
+		if (!sk || !sk_fullsock(sk) ||
+		    !net_eq(nft_net(pkt), sock_net(sk)))
 			goto err;
 
 		read_lock_bh(&sk->sk_callback_lock);
@@ -214,7 +216,8 @@ static void nft_meta_get_eval(const struct nft_expr *expr,
 #ifdef CONFIG_CGROUP_NET_CLASSID
 	case NFT_META_CGROUP:
 		sk = skb_to_full_sk(skb);
-		if (!sk || !sk_fullsock(sk))
+		if (!sk || !sk_fullsock(sk) ||
+		    !net_eq(nft_net(pkt), sock_net(sk)))
 			goto err;
 		*dest = sock_cgroup_classid(&sk->sk_cgrp_data);
 		break;
diff --git a/net/netfilter/nft_socket.c b/net/netfilter/nft_socket.c
index 74e1b3bd6954..998c2b546f6d 100644
--- a/net/netfilter/nft_socket.c
+++ b/net/netfilter/nft_socket.c
@@ -23,6 +23,9 @@ static void nft_socket_eval(const struct nft_expr *expr,
 	struct sock *sk = skb->sk;
 	u32 *dest = &regs->data[priv->dreg];
 
+	if (sk && !net_eq(nft_net(pkt), sock_net(sk)))
+		sk = NULL;
+
 	if (!sk)
 		switch(nft_pf(pkt)) {
 		case NFPROTO_IPV4:
@@ -39,7 +42,7 @@ static void nft_socket_eval(const struct nft_expr *expr,
 			return;
 		}
 
-	if(!sk) {
+	if (!sk) {
 		nft_reg_store8(dest, 0);
 		return;
 	}
diff --git a/net/netfilter/xt_cgroup.c b/net/netfilter/xt_cgroup.c
index 7df2dece57d3..5d92e1781980 100644
--- a/net/netfilter/xt_cgroup.c
+++ b/net/netfilter/xt_cgroup.c
@@ -72,8 +72,9 @@ static bool
 cgroup_mt_v0(const struct sk_buff *skb, struct xt_action_param *par)
 {
 	const struct xt_cgroup_info_v0 *info = par->matchinfo;
+	struct sock *sk = skb->sk;
 
-	if (skb->sk == NULL || !sk_fullsock(skb->sk))
+	if (!sk || !sk_fullsock(sk) || !net_eq(xt_net(par), sock_net(sk)))
 		return false;
 
 	return (info->id == sock_cgroup_classid(&skb->sk->sk_cgrp_data)) ^
@@ -85,8 +86,9 @@ static bool cgroup_mt_v1(const struct sk_buff *skb, struct xt_action_param *par)
 	const struct xt_cgroup_info_v1 *info = par->matchinfo;
 	struct sock_cgroup_data *skcd = &skb->sk->sk_cgrp_data;
 	struct cgroup *ancestor = info->priv;
+	struct sock *sk = skb->sk;
 
-	if (!skb->sk || !sk_fullsock(skb->sk))
+	if (!sk || !sk_fullsock(sk) || !net_eq(xt_net(par), sock_net(sk)))
 		return false;
 
 	if (ancestor)
diff --git a/net/netfilter/xt_owner.c b/net/netfilter/xt_owner.c
index 3d705c688a27..46686fb73784 100644
--- a/net/netfilter/xt_owner.c
+++ b/net/netfilter/xt_owner.c
@@ -67,7 +67,7 @@ owner_mt(const struct sk_buff *skb, struct xt_action_param *par)
 	struct sock *sk = skb_to_full_sk(skb);
 	struct net *net = xt_net(par);
 
-	if (sk == NULL || sk->sk_socket == NULL)
+	if (!sk || !sk->sk_socket || !net_eq(net, sock_net(sk)))
 		return (info->match ^ info->invert) == 0;
 	else if (info->match & info->invert & XT_OWNER_SOCKET)
 		/*
diff --git a/net/netfilter/xt_recent.c b/net/netfilter/xt_recent.c
index 07085c22b19c..f44de4bc2100 100644
--- a/net/netfilter/xt_recent.c
+++ b/net/netfilter/xt_recent.c
@@ -265,7 +265,8 @@ recent_mt(const struct sk_buff *skb, struct xt_action_param *par)
 	}
 
 	/* use TTL as seen before forwarding */
-	if (xt_out(par) != NULL && skb->sk == NULL)
+	if (xt_out(par) != NULL &&
+	    (!skb->sk || !net_eq(net, sock_net(skb->sk))))
 		ttl++;
 
 	spin_lock_bh(&recent_lock);
diff --git a/net/netfilter/xt_socket.c b/net/netfilter/xt_socket.c
index 5c0779c4fa3c..0472f3472842 100644
--- a/net/netfilter/xt_socket.c
+++ b/net/netfilter/xt_socket.c
@@ -56,8 +56,12 @@ socket_match(const struct sk_buff *skb, struct xt_action_param *par,
 	struct sk_buff *pskb = (struct sk_buff *)skb;
 	struct sock *sk = skb->sk;
 
+	if (!net_eq(xt_net(par), sock_net(sk)))
+		sk = NULL;
+
 	if (!sk)
 		sk = nf_sk_lookup_slow_v4(xt_net(par), skb, xt_in(par));
+
 	if (sk) {
 		bool wildcard;
 		bool transparent = true;
@@ -113,8 +117,12 @@ socket_mt6_v1_v2_v3(const struct sk_buff *skb, struct xt_action_param *par)
 	struct sk_buff *pskb = (struct sk_buff *)skb;
 	struct sock *sk = skb->sk;
 
+	if (!net_eq(xt_net(par), sock_net(sk)))
+		sk = NULL;
+
 	if (!sk)
 		sk = nf_sk_lookup_slow_v6(xt_net(par), skb, xt_in(par));
+
 	if (sk) {
 		bool wildcard;
 		bool transparent = true;
-- 
2.14.3

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox