Netdev List
* Re: [PATCH net-next 1/1] tc-testing: initial version of tunnel_key unit tests
From: Davide Caratti @ 2018-06-28 17:26 UTC
  To: Lucas Bates
  Cc: Keara Leibovitz, David Miller, Linux Kernel Network Developers,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko
In-Reply-To: <CAMDBHYLmhtmtNR00yiK2i9Pt=r7Z-mRxjj7X=bR6JiDgKvCEVA@mail.gmail.com>

hello Lucas,

On Wed, 2018-06-27 at 14:50 -0400, Lucas Bates wrote:
> On Tue, Jun 26, 2018 at 10:51 AM, Davide Caratti <dcaratti@redhat.com> wrote:
> > On Tue, 2018-06-26 at 09:17 -0400, Keara Leibovitz wrote:
> > > Create unittests for the tc tunnel_key action.
> > > 
> > > 
> > > Signed-off-by: Keara Leibovitz <kleib@mojatatu.com>
> > > ---
> > >  .../tc-testing/tc-tests/actions/tunnel_key.json    | 676 +++++++++++++++++++++
> > >  1 file changed, 676 insertions(+)
> > >  create mode 100644 tools/testing/selftests/tc-testing/tc-tests/actions/tunnel_key.json
> > > 
> > > diff --git a/tools/testing/selftests/tc-testing/tc-tests/actions/tunnel_key.json b/tools/testing/selftests/tc-testing/tc-tests/actions/tunnel_key.json
> > > new file mode 100644
> > > index 000000000000..bfe522ac8177
> > 
> > hello Keara!
> > 
> > I think the 'teardown' stage in some of these tests should be reviewed.
> > Those that are meant to test invalid configurations (like dc6b) should
> > allow non-zero exit codes in the teardown stage, if the wrong
> > configuration is caught by the userspace TC tool, before talking to the
> > kernel.
> > 
> > Otherwise, those tests will fail when they are invoked one by one with the
> > act_tunnel_key module unloaded.
> > 
> 
> Hi Davide, I thought I'd weigh in here.

glad to hear your feedback!

> In the short term, I think this is reasonable, but it's not a feasible
> long-term solution.  Here's why:
> 
> Allowing non-zero exit codes on setup and teardown was a precaution
> that needed to be implemented as flushing actions in a freshly-booted
> kernel returned errors - certain actions would only allow you to flush
> after that action had been added.

I guess this is a desired behavior, and it's common to all TC actions:

# grep bpf /proc/modules
# tc actions flush action bpf
RTNETLINK answers: Invalid argument
We have an error flushing
# modprobe act_bpf
# tc actions flush action bpf
# echo $?
0

> But, doing this on so many test cases means that we can lose control
> of the test environment, especially since a lot of commands get copied
> between test cases.  One test's command under test becomes the next
> test case's setup command, etc.  This can cause false results and
> potentially waste a lot of time for someone trying to track down a
> bug... Or cause bugs to be missed.

I understand, you want to ensure that 'teardown' leaves the scenario in the
same state it was in before the 'setup' phase. Whether or not this
happened successfully, it's sane not to ignore the error code: otherwise,
test X will perturb test X+1.

> So, how to fix: we've had some discussions about it already.  Jiri had
> requested the addition of a config file (like the one at
> tools/testing/selftests/net/forwarding/config), and maybe an addition
> to the README for tdc for explanation.  People would then possibly be
> restricted to running one test case file at a time based on what
> options they had loaded...  This is still not ideal.

All this depends on where the error condition is caught. Some parameters
(like the invalid 'index' in act_bpf) are rejected within userspace TC,
some others (like the invalid bytecode for test f84a) in the kernel.

> I think the best possible fix is to add a new plugin for tdc to
> exclude tests based on the kernel config.  This would require the
> addition of a new optional field to the test case format, where any
> and all included modules required for the test to work would be
> listed.  The plugin would look at this information, do its best to
> determine if the currently running kernel supports it, and allow the
> test to run or be skipped as a result.
> 
> Let me show an example of the new field:
> > > --- /dev/null
> > > +++ b/tools/testing/selftests/tc-testing/tc-tests/actions/tunnel_key.json
> > > @@ -0,0 +1,676 @@
> > > 
> > 
> > ...
> > 
> > > +    {
> > > +        "id": "dc6b",
> > > +        "name": "Add tunnel_key set action with missing mandatory src_ip parameter",
> > > +        "category": [
> > > +            "actions",
> > > +            "tunnel_key"
> > > +        ],
> 
>                "reqModules": [
>                    "CONFIG_NET_ACT_TUNNEL_KEY"
>                ],
> > > +        "setup": [
> > > +            [
> > > +                "$TC actions flush action tunnel_key",
> > > +                0,
> > > +                1,
> > > +                255
> > > +            ]
> > > +        ],
> > > +        "cmdUnderTest": "$TC actions add action tunnel_key set dst_ip 20.20.20.2 id 100",
> > > +        "expExitCode": "255",
> > > +        "verifyCmd": "$TC actions list action tunnel_key",
> > > +        "matchPattern": "action order [0-9]+: tunnel_key set.*dst_ip 20.20.20.2.*key_id 100",
> > > +        "matchCount": "0",
> > > +        "teardown": [
> > > +            "$TC actions flush action tunnel_key"
> > > +        ]
> > > +    },
> 
> As we venture into more and more complicated tests, where different
> modules would start getting mixed together, this might be the most
> effective route.
> 
> This plugin will require some changes I've made to our local version
> of tdc that I've been testing out - they change the way tdc handles
> its test results, and also give it the ability to skip tests without
> affecting the rest of the test run.

LGTM.  To maintain the possibility to test automatic module loading based
on the action, we only need to add another test per module (typically the
first one) where the 'reqModules' line is not present.

> Until I'm able to submit everything, I'd be OK with having Keara add
> the non-zero exit codes to the teardown on her tests.  In the meantime
> we'll get the README updated and config file added as well.
> 
> How does this sound?

it sounds good to me, but at this point we can also leave the code of
tunnel_key as-is. there are many other items failing in this script:

for act in $ACT; do
while IFS=':' read -r id _ ; do modprobe -r act_${act} ; sleep 1 ; [ -n "$id" ] && ./tdc.py -p /home/davide/iproute2/tc/tc -e $id ; done <<EOF
`./tdc.py -l | grep ${act}`
EOF
done

So, it's ok for me if they are fixed all together in a series, and I
volunteer for testing it when they land on netdev list.

regards,
-- 
davide


* Re: [PATCH v1 net-next 12/14] igb: Only call skb_tx_timestamp after descriptors are ready
From: Jesus Sanchez-Palencia @ 2018-06-28 17:12 UTC
  To: Eric Dumazet, netdev
  Cc: tglx, jan.altenberg, vinicius.gomes, kurt.kanzenbach, henrik,
	richardcochran, levi.pearson, ilias.apalodimas, ivan.khoronzhuk,
	mlichvar, willemb, jhs, xiyou.wangcong, jiri
In-Reply-To: <44770d6b-503c-279f-807f-0b7f11be56cf@gmail.com>



On 06/27/2018 04:56 PM, Eric Dumazet wrote:
> 
> 
> On 06/27/2018 02:59 PM, Jesus Sanchez-Palencia wrote:
>> Currently, skb_tx_timestamp() is being called before the DMA
>> descriptors are prepared in igb_xmit_frame_ring(), which happens
>> during either the igb_tso() or igb_tx_csum() calls.
>>
>> Given that now the skb->tstamp might be used to carry the timestamp
>> for SO_TXTIME, we must only call skb_tx_timestamp() after the
>> information has been copied into the DMA tx_ring.
> 
> 
> Since when has skb->tstamp been used this way?
> 
> If this is in patch 11/14 (igb: Add support for ETF offload), then you should either :
> 
> 1) Squash this into 11/14
> 
> 2) swap 11 and 12 patch, so that this change is done before "igb: Add support for ETF offload"  
> 
> Otherwise a bisection could fail badly.


OK. Fixed for v2 by swapping patches 11 and 12.

Thanks,
Jesus


* Re: [PATCH v1 net-next 13/14] net/sched: Enforce usage of CLOCK_TAI for sch_etf
From: Jesus Sanchez-Palencia @ 2018-06-28 17:11 UTC
  To: Willem de Bruijn
  Cc: Network Development, Thomas Gleixner, jan.altenberg,
	Vinicius Gomes, kurt.kanzenbach, Henrik Austad, Richard Cochran,
	ilias.apalodimas, ivan.khoronzhuk, Miroslav Lichvar,
	Willem de Bruijn, Jamal Hadi Salim, Cong Wang,
	Jiří Pírko
In-Reply-To: <CAF=yD-KnZJgqAspgOvtD82n4x0tB-neUF8THTgRKLo+OR5oE=A@mail.gmail.com>



On 06/28/2018 07:26 AM, Willem de Bruijn wrote:
> On Wed, Jun 27, 2018 at 8:45 PM Jesus Sanchez-Palencia
> <jesus.sanchez-palencia@intel.com> wrote:
>>
>> The qdisc and the SO_TXTIME ABIs allow for a clockid to be configured,
>> but it's been decided that usage of CLOCK_TAI should be enforced until
>> we decide to allow for other clockids to be used. The rationale here is
>> that PTP times are usually in the TAI scale, thus no other clocks should
>> be necessary.
>>
>> For now, the qdisc will return EINVAL if any clocks other than
>> CLOCK_TAI are used.
>>
>> Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
>> ---
>>  net/sched/sch_etf.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/sched/sch_etf.c b/net/sched/sch_etf.c
>> index cd6cb5b69228..5514a8aa3bd5 100644
>> --- a/net/sched/sch_etf.c
>> +++ b/net/sched/sch_etf.c
>> @@ -56,8 +56,8 @@ static inline int validate_input_params(struct tc_etf_qopt *qopt,
>>                 return -ENOTSUPP;
>>         }
>>
>> -       if (qopt->clockid >= MAX_CLOCKS) {
>> -               NL_SET_ERR_MSG(extack, "Invalid clockid");
>> +       if (qopt->clockid != CLOCK_TAI) {
>> +               NL_SET_ERR_MSG(extack, "Invalid clockid. CLOCK_TAI must be used");
> 
> Similar to the comment in patch 12, this should be squashed (into
> patch 6) to avoid incorrect behavior in a range of SHA1s.


Ok. Fixed for v2.

Thanks,
Jesus
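
For readers following the hunk quoted above, here is a minimal user-space
sketch of the stricter check, assuming nothing beyond what the diff shows.
The function name validate_clockid() is hypothetical, and the fallback value
11 for CLOCK_TAI is the Linux UAPI value, used only if <time.h> does not
provide it. It is not the kernel code, just the same comparison in isolation.

```c
#include <errno.h>
#include <time.h>

/* Assumption: on Linux CLOCK_TAI is 11; only used if <time.h> lacks it. */
#ifndef CLOCK_TAI
#define CLOCK_TAI 11
#endif

/*
 * Hypothetical stand-in for the clockid test in validate_input_params():
 * every clock other than CLOCK_TAI is rejected with -EINVAL.
 */
static int validate_clockid(int clockid)
{
	if (clockid != CLOCK_TAI)
		return -EINVAL;

	return 0;
}
```

Compared with the old "clockid >= MAX_CLOCKS" range test, this rejects
otherwise valid clocks as well, which matches the commit message: only
CLOCK_TAI is accepted for now.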


* Re: [PATCH bpf 1/4] xsk: fix potential lost completion message in SKB path
From: Song Liu @ 2018-06-28 17:10 UTC
  To: Magnus Karlsson
  Cc: bjorn.topel, ast, Daniel Borkmann, Networking, qi.z.zhang, pavel
In-Reply-To: <1530108136-4984-2-git-send-email-magnus.karlsson@intel.com>

On Wed, Jun 27, 2018 at 7:02 AM, Magnus Karlsson
<magnus.karlsson@intel.com> wrote:
> The code in xskq_produce_addr erroneously checked if there
> was up to LAZY_UPDATE_THRESHOLD amount of space in the completion
> queue. It only needs to check if there is one slot left in the
> queue. This bug could under some circumstances lead to a WARN_ON_ONCE
> being triggered and the completion message to user space being lost.
>
> Fixes: 35fcde7f8deb ("xsk: support for Tx")
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> Reported-by: Pavel Odintsov <pavel@fastnetmon.com>

Acked-by: Song Liu <songliubraving@fb.com>

> ---
>  net/xdp/xsk_queue.h | 9 ++-------
>  1 file changed, 2 insertions(+), 7 deletions(-)
>
> diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
> index ef6a6f0ec949..52ecaf770642 100644
> --- a/net/xdp/xsk_queue.h
> +++ b/net/xdp/xsk_queue.h
> @@ -62,14 +62,9 @@ static inline u32 xskq_nb_avail(struct xsk_queue *q, u32 dcnt)
>         return (entries > dcnt) ? dcnt : entries;
>  }
>
> -static inline u32 xskq_nb_free_lazy(struct xsk_queue *q, u32 producer)
> -{
> -       return q->nentries - (producer - q->cons_tail);
> -}
> -
>  static inline u32 xskq_nb_free(struct xsk_queue *q, u32 producer, u32 dcnt)
>  {
> -       u32 free_entries = xskq_nb_free_lazy(q, producer);
> +       u32 free_entries = q->nentries - (producer - q->cons_tail);
>
>         if (free_entries >= dcnt)
>                 return free_entries;
> @@ -129,7 +124,7 @@ static inline int xskq_produce_addr(struct xsk_queue *q, u64 addr)
>  {
>         struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
>
> -       if (xskq_nb_free(q, q->prod_tail, LAZY_UPDATE_THRESHOLD) == 0)
> +       if (xskq_nb_free(q, q->prod_tail, 1) == 0)
>                 return -ENOSPC;
>
>         ring->desc[q->prod_tail++ & q->ring_mask] = addr;
> --
> 2.7.4
>
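
The free-slot arithmetic that the hunk above inlines into xskq_nb_free() can
be sketched on its own. This is a user-space illustration with a hypothetical
helper name, ring_free_slots(), and plain uint32_t counters; it leaves out
the kernel's lazy refresh of the cached consumer pointer and only shows why
"nentries - (producer - cons_tail)" stays correct when the free-running
counters wrap around.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the expression from the patch:
 *     q->nentries - (producer - q->cons_tail)
 * producer and cons_tail are free-running 32-bit counters, so their
 * difference is the number of used slots even after either one wraps.
 */
static uint32_t ring_free_slots(uint32_t nentries, uint32_t producer,
				uint32_t cons_tail)
{
	uint32_t used = (uint32_t)(producer - cons_tail);

	/* A consistent ring never holds more entries than it has slots. */
	assert(used <= nentries);

	return nentries - used;
}
```

With an 8-entry ring, producer = 3 and cons_tail = 0xFFFFFFFE give 5 used
slots and 3 free ones, although the producer counter has already wrapped.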


* Re: [PATCH net-next v2 3/4] net: check tunnel option type in tunnel flags
From: Jakub Kicinski @ 2018-06-28 17:05 UTC
  To: Jiri Benc
  Cc: Daniel Borkmann, davem, Roopa Prabhu, jiri, jhs, xiyou.wangcong,
	oss-drivers, netdev, Pieter Jansen van Vuuren
In-Reply-To: <20180628190152.539bfc67@redhat.com>

On Thu, 28 Jun 2018 19:01:52 +0200, Jiri Benc wrote:
> On Thu, 28 Jun 2018 09:54:52 -0700, Jakub Kicinski wrote:
> > Hmm... in practice we could steal top bits of the size parameter for
> > some flags, since it seems to be limited to values < 256 today?  Is it
> > worth it?
> > 
> > It would look something along the lines of:  
> 
> Something like that, yes. I'll leave to Daniel to review how much sense
> it makes from the BPF side.

Can we take this as a follow up through the bpf-next tree or do you
want us to respin as part of this set?


* [PATCH net-next 10/10] s390/ism: add device driver for internal shared memory
From: Ursula Braun @ 2018-06-28 17:05 UTC
  To: davem
  Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, hwippel,
	sebott, ubraun
In-Reply-To: <20180628170513.5089-1-ubraun@linux.ibm.com>

From: Sebastian Ott <sebott@linux.ibm.com>

Add support for the Internal Shared Memory vPCI Adapter.
This driver implements the interfaces of the SMC-D protocol.

Signed-off-by: Sebastian Ott <sebott@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
---
 drivers/s390/net/Kconfig   |  10 +
 drivers/s390/net/Makefile  |   3 +
 drivers/s390/net/ism.h     | 221 ++++++++++++++++
 drivers/s390/net/ism_drv.c | 623 +++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 857 insertions(+)
 create mode 100644 drivers/s390/net/ism.h
 create mode 100644 drivers/s390/net/ism_drv.c

diff --git a/drivers/s390/net/Kconfig b/drivers/s390/net/Kconfig
index c7e484f70654..7c5a25ddf832 100644
--- a/drivers/s390/net/Kconfig
+++ b/drivers/s390/net/Kconfig
@@ -95,4 +95,14 @@ config CCWGROUP
 	tristate
 	default (LCS || CTCM || QETH)
 
+config ISM
+	tristate "Support for ISM vPCI Adapter"
+	depends on PCI && SMC
+	default n
+	help
+	  Select this option if you want to use the Internal Shared Memory
+	  vPCI Adapter.
+
+	  To compile as a module choose M. The module name is ism.
+	  If unsure, choose N.
 endmenu
diff --git a/drivers/s390/net/Makefile b/drivers/s390/net/Makefile
index 513b7ae64980..f2d6bbe57a6f 100644
--- a/drivers/s390/net/Makefile
+++ b/drivers/s390/net/Makefile
@@ -15,3 +15,6 @@ qeth_l2-y += qeth_l2_main.o qeth_l2_sys.o
 obj-$(CONFIG_QETH_L2) += qeth_l2.o
 qeth_l3-y += qeth_l3_main.o qeth_l3_sys.o
 obj-$(CONFIG_QETH_L3) += qeth_l3.o
+
+ism-y := ism_drv.o
+obj-$(CONFIG_ISM) += ism.o
diff --git a/drivers/s390/net/ism.h b/drivers/s390/net/ism.h
new file mode 100644
index 000000000000..0aab90817326
--- /dev/null
+++ b/drivers/s390/net/ism.h
@@ -0,0 +1,221 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef S390_ISM_H
+#define S390_ISM_H
+
+#include <linux/spinlock.h>
+#include <linux/types.h>
+#include <linux/pci.h>
+#include <net/smc.h>
+
+#define UTIL_STR_LEN	16
+
+/*
+ * Do not use the first word of the DMB bits to ensure 8 byte aligned access.
+ */
+#define ISM_DMB_WORD_OFFSET	1
+#define ISM_DMB_BIT_OFFSET	(ISM_DMB_WORD_OFFSET * 32)
+#define ISM_NR_DMBS		1920
+
+#define ISM_REG_SBA	0x1
+#define ISM_REG_IEQ	0x2
+#define ISM_READ_GID	0x3
+#define ISM_ADD_VLAN_ID	0x4
+#define ISM_DEL_VLAN_ID	0x5
+#define ISM_SET_VLAN	0x6
+#define ISM_RESET_VLAN	0x7
+#define ISM_QUERY_INFO	0x8
+#define ISM_QUERY_RGID	0x9
+#define ISM_REG_DMB	0xA
+#define ISM_UNREG_DMB	0xB
+#define ISM_SIGNAL_IEQ	0xE
+#define ISM_UNREG_SBA	0x11
+#define ISM_UNREG_IEQ	0x12
+
+#define ISM_ERROR	0xFFFF
+
+struct ism_req_hdr {
+	u32 cmd;
+	u16 : 16;
+	u16 len;
+};
+
+struct ism_resp_hdr {
+	u32 cmd;
+	u16 ret;
+	u16 len;
+};
+
+union ism_reg_sba {
+	struct {
+		struct ism_req_hdr hdr;
+		u64 sba;
+	} request;
+	struct {
+		struct ism_resp_hdr hdr;
+	} response;
+} __aligned(16);
+
+union ism_reg_ieq {
+	struct {
+		struct ism_req_hdr hdr;
+		u64 ieq;
+		u64 len;
+	} request;
+	struct {
+		struct ism_resp_hdr hdr;
+	} response;
+} __aligned(16);
+
+union ism_read_gid {
+	struct {
+		struct ism_req_hdr hdr;
+	} request;
+	struct {
+		struct ism_resp_hdr hdr;
+		u64 gid;
+	} response;
+} __aligned(16);
+
+union ism_qi {
+	struct {
+		struct ism_req_hdr hdr;
+	} request;
+	struct {
+		struct ism_resp_hdr hdr;
+		u32 version;
+		u32 max_len;
+		u64 ism_state;
+		u64 my_gid;
+		u64 sba;
+		u64 ieq;
+		u32 ieq_len;
+		u32 : 32;
+		u32 dmbs_owned;
+		u32 dmbs_used;
+		u32 vlan_required;
+		u32 vlan_nr_ids;
+		u16 vlan_id[64];
+	} response;
+} __aligned(64);
+
+union ism_query_rgid {
+	struct {
+		struct ism_req_hdr hdr;
+		u64 rgid;
+		u32 vlan_valid;
+		u32 vlan_id;
+	} request;
+	struct {
+		struct ism_resp_hdr hdr;
+	} response;
+} __aligned(16);
+
+union ism_reg_dmb {
+	struct {
+		struct ism_req_hdr hdr;
+		u64 dmb;
+		u32 dmb_len;
+		u32 sba_idx;
+		u32 vlan_valid;
+		u32 vlan_id;
+		u64 rgid;
+	} request;
+	struct {
+		struct ism_resp_hdr hdr;
+		u64 dmb_tok;
+	} response;
+} __aligned(32);
+
+union ism_sig_ieq {
+	struct {
+		struct ism_req_hdr hdr;
+		u64 rgid;
+		u32 trigger_irq;
+		u32 event_code;
+		u64 info;
+	} request;
+	struct {
+		struct ism_resp_hdr hdr;
+	} response;
+} __aligned(32);
+
+union ism_unreg_dmb {
+	struct {
+		struct ism_req_hdr hdr;
+		u64 dmb_tok;
+	} request;
+	struct {
+		struct ism_resp_hdr hdr;
+	} response;
+} __aligned(16);
+
+union ism_cmd_simple {
+	struct {
+		struct ism_req_hdr hdr;
+	} request;
+	struct {
+		struct ism_resp_hdr hdr;
+	} response;
+} __aligned(8);
+
+union ism_set_vlan_id {
+	struct {
+		struct ism_req_hdr hdr;
+		u64 vlan_id;
+	} request;
+	struct {
+		struct ism_resp_hdr hdr;
+	} response;
+} __aligned(16);
+
+struct ism_eq_header {
+	u64 idx;
+	u64 ieq_len;
+	u64 entry_len;
+	u64 : 64;
+};
+
+struct ism_eq {
+	struct ism_eq_header header;
+	struct smcd_event entry[15];
+};
+
+struct ism_sba {
+	u32 s : 1;	/* summary bit */
+	u32 e : 1;	/* event bit */
+	u32 : 30;
+	u32 dmb_bits[ISM_NR_DMBS / 32];
+	u32 reserved[3];
+	u16 dmbe_mask[ISM_NR_DMBS];
+};
+
+struct ism_dev {
+	spinlock_t lock;
+	struct pci_dev *pdev;
+	struct smcd_dev *smcd;
+
+	void __iomem *ctl;
+
+	struct ism_sba *sba;
+	dma_addr_t sba_dma_addr;
+	DECLARE_BITMAP(sba_bitmap, ISM_NR_DMBS);
+
+	struct ism_eq *ieq;
+	dma_addr_t ieq_dma_addr;
+
+	int ieq_idx;
+};
+
+#define ISM_CREATE_REQ(dmb, idx, sf, offset)		\
+	((dmb) | (idx) << 24 | (sf) << 23 | (offset))
+
+static inline int __ism_move(struct ism_dev *ism, u64 dmb_req, void *data,
+			     unsigned int size)
+{
+	struct zpci_dev *zdev = to_zpci(ism->pdev);
+	u64 req = ZPCI_CREATE_REQ(zdev->fh, 0, size);
+
+	return zpci_write_block(req, data, dmb_req);
+}
+
+#endif /* S390_ISM_H */
diff --git a/drivers/s390/net/ism_drv.c b/drivers/s390/net/ism_drv.c
new file mode 100644
index 000000000000..c0631895154e
--- /dev/null
+++ b/drivers/s390/net/ism_drv.c
@@ -0,0 +1,623 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * ISM driver for s390.
+ *
+ * Copyright IBM Corp. 2018
+ */
+#define KMSG_COMPONENT "ism"
+#define pr_fmt(fmt) KMSG_COMPONENT ": " fmt
+
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/interrupt.h>
+#include <linux/device.h>
+#include <linux/pci.h>
+#include <linux/err.h>
+#include <net/smc.h>
+
+#include <asm/debug.h>
+
+#include "ism.h"
+
+MODULE_DESCRIPTION("ISM driver for s390");
+MODULE_LICENSE("GPL");
+
+#define PCI_DEVICE_ID_IBM_ISM 0x04ED
+#define DRV_NAME "ism"
+
+static const struct pci_device_id ism_device_table[] = {
+	{ PCI_VDEVICE(IBM, PCI_DEVICE_ID_IBM_ISM), 0 },
+	{ 0, }
+};
+MODULE_DEVICE_TABLE(pci, ism_device_table);
+
+static debug_info_t *ism_debug_info;
+
+static int ism_cmd(struct ism_dev *ism, void *cmd)
+{
+	struct ism_req_hdr *req = cmd;
+	struct ism_resp_hdr *resp = cmd;
+
+	memcpy_toio(ism->ctl + sizeof(*req), req + 1, req->len - sizeof(*req));
+	memcpy_toio(ism->ctl, req, sizeof(*req));
+
+	WRITE_ONCE(resp->ret, ISM_ERROR);
+
+	memcpy_fromio(resp, ism->ctl, sizeof(*resp));
+	if (resp->ret) {
+		debug_text_event(ism_debug_info, 0, "cmd failure");
+		debug_event(ism_debug_info, 0, resp, sizeof(*resp));
+		goto out;
+	}
+	memcpy_fromio(resp + 1, ism->ctl + sizeof(*resp),
+		      resp->len - sizeof(*resp));
+out:
+	return resp->ret;
+}
+
+static int ism_cmd_simple(struct ism_dev *ism, u32 cmd_code)
+{
+	union ism_cmd_simple cmd;
+
+	memset(&cmd, 0, sizeof(cmd));
+	cmd.request.hdr.cmd = cmd_code;
+	cmd.request.hdr.len = sizeof(cmd.request);
+
+	return ism_cmd(ism, &cmd);
+}
+
+static int query_info(struct ism_dev *ism)
+{
+	union ism_qi cmd;
+
+	memset(&cmd, 0, sizeof(cmd));
+	cmd.request.hdr.cmd = ISM_QUERY_INFO;
+	cmd.request.hdr.len = sizeof(cmd.request);
+
+	if (ism_cmd(ism, &cmd))
+		goto out;
+
+	debug_text_event(ism_debug_info, 3, "query info");
+	debug_event(ism_debug_info, 3, &cmd.response, sizeof(cmd.response));
+out:
+	return 0;
+}
+
+static int register_sba(struct ism_dev *ism)
+{
+	union ism_reg_sba cmd;
+	dma_addr_t dma_handle;
+	struct ism_sba *sba;
+
+	sba = dma_zalloc_coherent(&ism->pdev->dev, PAGE_SIZE,
+				  &dma_handle, GFP_KERNEL);
+	if (!sba)
+		return -ENOMEM;
+
+	memset(&cmd, 0, sizeof(cmd));
+	cmd.request.hdr.cmd = ISM_REG_SBA;
+	cmd.request.hdr.len = sizeof(cmd.request);
+	cmd.request.sba = dma_handle;
+
+	if (ism_cmd(ism, &cmd)) {
+		dma_free_coherent(&ism->pdev->dev, PAGE_SIZE, sba, dma_handle);
+		return -EIO;
+	}
+
+	ism->sba = sba;
+	ism->sba_dma_addr = dma_handle;
+
+	return 0;
+}
+
+static int register_ieq(struct ism_dev *ism)
+{
+	union ism_reg_ieq cmd;
+	dma_addr_t dma_handle;
+	struct ism_eq *ieq;
+
+	ieq = dma_zalloc_coherent(&ism->pdev->dev, PAGE_SIZE,
+				  &dma_handle, GFP_KERNEL);
+	if (!ieq)
+		return -ENOMEM;
+
+	memset(&cmd, 0, sizeof(cmd));
+	cmd.request.hdr.cmd = ISM_REG_IEQ;
+	cmd.request.hdr.len = sizeof(cmd.request);
+	cmd.request.ieq = dma_handle;
+	cmd.request.len = sizeof(*ieq);
+
+	if (ism_cmd(ism, &cmd)) {
+		dma_free_coherent(&ism->pdev->dev, PAGE_SIZE, ieq, dma_handle);
+		return -EIO;
+	}
+
+	ism->ieq = ieq;
+	ism->ieq_idx = -1;
+	ism->ieq_dma_addr = dma_handle;
+
+	return 0;
+}
+
+static int unregister_sba(struct ism_dev *ism)
+{
+	if (!ism->sba)
+		return 0;
+
+	if (ism_cmd_simple(ism, ISM_UNREG_SBA))
+		return -EIO;
+
+	dma_free_coherent(&ism->pdev->dev, PAGE_SIZE,
+			  ism->sba, ism->sba_dma_addr);
+
+	ism->sba = NULL;
+	ism->sba_dma_addr = 0;
+
+	return 0;
+}
+
+static int unregister_ieq(struct ism_dev *ism)
+{
+	if (!ism->ieq)
+		return 0;
+
+	if (ism_cmd_simple(ism, ISM_UNREG_IEQ))
+		return -EIO;
+
+	dma_free_coherent(&ism->pdev->dev, PAGE_SIZE,
+			  ism->ieq, ism->ieq_dma_addr);
+
+	ism->ieq = NULL;
+	ism->ieq_dma_addr = 0;
+
+	return 0;
+}
+
+static int ism_read_local_gid(struct ism_dev *ism)
+{
+	union ism_read_gid cmd;
+	int ret;
+
+	memset(&cmd, 0, sizeof(cmd));
+	cmd.request.hdr.cmd = ISM_READ_GID;
+	cmd.request.hdr.len = sizeof(cmd.request);
+
+	ret = ism_cmd(ism, &cmd);
+	if (ret)
+		goto out;
+
+	ism->smcd->local_gid = cmd.response.gid;
+out:
+	return ret;
+}
+
+static int ism_query_rgid(struct smcd_dev *smcd, u64 rgid, u32 vid_valid,
+			  u32 vid)
+{
+	struct ism_dev *ism = smcd->priv;
+	union ism_query_rgid cmd;
+
+	memset(&cmd, 0, sizeof(cmd));
+	cmd.request.hdr.cmd = ISM_QUERY_RGID;
+	cmd.request.hdr.len = sizeof(cmd.request);
+
+	cmd.request.rgid = rgid;
+	cmd.request.vlan_valid = vid_valid;
+	cmd.request.vlan_id = vid;
+
+	return ism_cmd(ism, &cmd);
+}
+
+static void ism_free_dmb(struct ism_dev *ism, struct smcd_dmb *dmb)
+{
+	clear_bit(dmb->sba_idx, ism->sba_bitmap);
+	dma_free_coherent(&ism->pdev->dev, dmb->dmb_len,
+			  dmb->cpu_addr, dmb->dma_addr);
+}
+
+static int ism_alloc_dmb(struct ism_dev *ism, struct smcd_dmb *dmb)
+{
+	unsigned long bit;
+
+	if (PAGE_ALIGN(dmb->dmb_len) > dma_get_max_seg_size(&ism->pdev->dev))
+		return -EINVAL;
+
+	if (!dmb->sba_idx) {
+		bit = find_next_zero_bit(ism->sba_bitmap, ISM_NR_DMBS,
+					 ISM_DMB_BIT_OFFSET);
+		if (bit == ISM_NR_DMBS)
+			return -ENOMEM;
+
+		dmb->sba_idx = bit;
+	}
+	if (dmb->sba_idx < ISM_DMB_BIT_OFFSET ||
+	    test_and_set_bit(dmb->sba_idx, ism->sba_bitmap))
+		return -EINVAL;
+
+	dmb->cpu_addr = dma_zalloc_coherent(&ism->pdev->dev, dmb->dmb_len,
+					    &dmb->dma_addr, GFP_KERNEL |
+					    __GFP_NOWARN | __GFP_NOMEMALLOC |
+					    __GFP_COMP | __GFP_NORETRY);
+	if (!dmb->cpu_addr)
+		clear_bit(dmb->sba_idx, ism->sba_bitmap);
+
+	return dmb->cpu_addr ? 0 : -ENOMEM;
+}
+
+static int ism_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
+{
+	struct ism_dev *ism = smcd->priv;
+	union ism_reg_dmb cmd;
+	int ret;
+
+	ret = ism_alloc_dmb(ism, dmb);
+	if (ret)
+		goto out;
+
+	memset(&cmd, 0, sizeof(cmd));
+	cmd.request.hdr.cmd = ISM_REG_DMB;
+	cmd.request.hdr.len = sizeof(cmd.request);
+
+	cmd.request.dmb = dmb->dma_addr;
+	cmd.request.dmb_len = dmb->dmb_len;
+	cmd.request.sba_idx = dmb->sba_idx;
+	cmd.request.vlan_valid = dmb->vlan_valid;
+	cmd.request.vlan_id = dmb->vlan_id;
+	cmd.request.rgid = dmb->rgid;
+
+	ret = ism_cmd(ism, &cmd);
+	if (ret) {
+		ism_free_dmb(ism, dmb);
+		goto out;
+	}
+	dmb->dmb_tok = cmd.response.dmb_tok;
+out:
+	return ret;
+}
+
+static int ism_unregister_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
+{
+	struct ism_dev *ism = smcd->priv;
+	union ism_unreg_dmb cmd;
+	int ret;
+
+	memset(&cmd, 0, sizeof(cmd));
+	cmd.request.hdr.cmd = ISM_UNREG_DMB;
+	cmd.request.hdr.len = sizeof(cmd.request);
+
+	cmd.request.dmb_tok = dmb->dmb_tok;
+
+	ret = ism_cmd(ism, &cmd);
+	if (ret)
+		goto out;
+
+	ism_free_dmb(ism, dmb);
+out:
+	return ret;
+}
+
+static int ism_add_vlan_id(struct smcd_dev *smcd, u64 vlan_id)
+{
+	struct ism_dev *ism = smcd->priv;
+	union ism_set_vlan_id cmd;
+
+	memset(&cmd, 0, sizeof(cmd));
+	cmd.request.hdr.cmd = ISM_ADD_VLAN_ID;
+	cmd.request.hdr.len = sizeof(cmd.request);
+
+	cmd.request.vlan_id = vlan_id;
+
+	return ism_cmd(ism, &cmd);
+}
+
+static int ism_del_vlan_id(struct smcd_dev *smcd, u64 vlan_id)
+{
+	struct ism_dev *ism = smcd->priv;
+	union ism_set_vlan_id cmd;
+
+	memset(&cmd, 0, sizeof(cmd));
+	cmd.request.hdr.cmd = ISM_DEL_VLAN_ID;
+	cmd.request.hdr.len = sizeof(cmd.request);
+
+	cmd.request.vlan_id = vlan_id;
+
+	return ism_cmd(ism, &cmd);
+}
+
+static int ism_set_vlan_required(struct smcd_dev *smcd)
+{
+	return ism_cmd_simple(smcd->priv, ISM_SET_VLAN);
+}
+
+static int ism_reset_vlan_required(struct smcd_dev *smcd)
+{
+	return ism_cmd_simple(smcd->priv, ISM_RESET_VLAN);
+}
+
+static int ism_signal_ieq(struct smcd_dev *smcd, u64 rgid, u32 trigger_irq,
+			  u32 event_code, u64 info)
+{
+	struct ism_dev *ism = smcd->priv;
+	union ism_sig_ieq cmd;
+
+	memset(&cmd, 0, sizeof(cmd));
+	cmd.request.hdr.cmd = ISM_SIGNAL_IEQ;
+	cmd.request.hdr.len = sizeof(cmd.request);
+
+	cmd.request.rgid = rgid;
+	cmd.request.trigger_irq = trigger_irq;
+	cmd.request.event_code = event_code;
+	cmd.request.info = info;
+
+	return ism_cmd(ism, &cmd);
+}
+
+static unsigned int max_bytes(unsigned int start, unsigned int len,
+			      unsigned int boundary)
+{
+	return min(boundary - (start & (boundary - 1)), len);
+}
+
+static int ism_move(struct smcd_dev *smcd, u64 dmb_tok, unsigned int idx,
+		    bool sf, unsigned int offset, void *data, unsigned int size)
+{
+	struct ism_dev *ism = smcd->priv;
+	unsigned int bytes;
+	u64 dmb_req;
+	int ret;
+
+	while (size) {
+		bytes = max_bytes(offset, size, PAGE_SIZE);
+		dmb_req = ISM_CREATE_REQ(dmb_tok, idx, size == bytes ? sf : 0,
+					 offset);
+
+		ret = __ism_move(ism, dmb_req, data, bytes);
+		if (ret)
+			return ret;
+
+		size -= bytes;
+		data += bytes;
+		offset += bytes;
+	}
+
+	return 0;
+}
+
+static void ism_handle_event(struct ism_dev *ism)
+{
+	struct smcd_event *entry;
+
+	while ((ism->ieq_idx + 1) != READ_ONCE(ism->ieq->header.idx)) {
+		if (++(ism->ieq_idx) == ARRAY_SIZE(ism->ieq->entry))
+			ism->ieq_idx = 0;
+
+		entry = &ism->ieq->entry[ism->ieq_idx];
+		debug_event(ism_debug_info, 2, entry, sizeof(*entry));
+		smcd_handle_event(ism->smcd, entry);
+	}
+}
+
+static irqreturn_t ism_handle_irq(int irq, void *data)
+{
+	struct ism_dev *ism = data;
+	unsigned long bit, end;
+	unsigned long *bv;
+
+	bv = (void *) &ism->sba->dmb_bits[ISM_DMB_WORD_OFFSET];
+	end = sizeof(ism->sba->dmb_bits) * BITS_PER_BYTE - ISM_DMB_BIT_OFFSET;
+
+	spin_lock(&ism->lock);
+	ism->sba->s = 0;
+	barrier();
+	for (bit = 0;;) {
+		bit = find_next_bit_inv(bv, end, bit);
+		if (bit >= end)
+			break;
+
+		clear_bit_inv(bit, bv);
+		barrier();
+		smcd_handle_irq(ism->smcd, bit + ISM_DMB_BIT_OFFSET);
+		ism->sba->dmbe_mask[bit + ISM_DMB_BIT_OFFSET] = 0;
+	}
+
+	if (ism->sba->e) {
+		ism->sba->e = 0;
+		barrier();
+		ism_handle_event(ism);
+	}
+	spin_unlock(&ism->lock);
+	return IRQ_HANDLED;
+}
+
+static const struct smcd_ops ism_ops = {
+	.query_remote_gid = ism_query_rgid,
+	.register_dmb = ism_register_dmb,
+	.unregister_dmb = ism_unregister_dmb,
+	.add_vlan_id = ism_add_vlan_id,
+	.del_vlan_id = ism_del_vlan_id,
+	.set_vlan_required = ism_set_vlan_required,
+	.reset_vlan_required = ism_reset_vlan_required,
+	.signal_event = ism_signal_ieq,
+	.move_data = ism_move,
+};
+
+static int ism_dev_init(struct ism_dev *ism)
+{
+	struct pci_dev *pdev = ism->pdev;
+	int ret;
+
+	ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_MSI);
+	if (ret <= 0)
+		goto out;
+
+	ret = request_irq(pci_irq_vector(pdev, 0), ism_handle_irq, 0,
+			  pci_name(pdev), ism);
+	if (ret)
+		goto free_vectors;
+
+	ret = register_sba(ism);
+	if (ret)
+		goto free_irq;
+
+	ret = register_ieq(ism);
+	if (ret)
+		goto unreg_sba;
+
+	ret = ism_read_local_gid(ism);
+	if (ret)
+		goto unreg_ieq;
+
+	ret = smcd_register_dev(ism->smcd);
+	if (ret)
+		goto unreg_ieq;
+
+	query_info(ism);
+	return 0;
+
+unreg_ieq:
+	unregister_ieq(ism);
+unreg_sba:
+	unregister_sba(ism);
+free_irq:
+	free_irq(pci_irq_vector(pdev, 0), ism);
+free_vectors:
+	pci_free_irq_vectors(pdev);
+out:
+	return ret;
+}
+
+static int ism_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+{
+	struct ism_dev *ism;
+	int ret;
+
+	ism = kzalloc(sizeof(*ism), GFP_KERNEL);
+	if (!ism)
+		return -ENOMEM;
+
+	spin_lock_init(&ism->lock);
+	dev_set_drvdata(&pdev->dev, ism);
+	ism->pdev = pdev;
+
+	ret = pci_enable_device_mem(pdev);
+	if (ret)
+		goto err;
+
+	ret = pci_request_mem_regions(pdev, DRV_NAME);
+	if (ret)
+		goto err_disable;
+
+	ism->ctl = pci_iomap(pdev, 2, 0);
+	if (!ism->ctl)
+		goto err_resource;
+
+	ret = pci_set_dma_mask(pdev, DMA_BIT_MASK(64));
+	if (ret)
+		goto err_unmap;
+
+	pci_set_dma_seg_boundary(pdev, SZ_1M - 1);
+	pci_set_dma_max_seg_size(pdev, SZ_1M);
+	pci_set_master(pdev);
+
+	ism->smcd = smcd_alloc_dev(&pdev->dev, dev_name(&pdev->dev), &ism_ops,
+				   ISM_NR_DMBS);
+	if (!ism->smcd)
+		goto err_unmap;
+
+	ism->smcd->priv = ism;
+	ret = ism_dev_init(ism);
+	if (ret)
+		goto err_free;
+
+	return 0;
+
+err_free:
+	smcd_free_dev(ism->smcd);
+err_unmap:
+	pci_iounmap(pdev, ism->ctl);
+err_resource:
+	pci_release_mem_regions(pdev);
+err_disable:
+	pci_disable_device(pdev);
+err:
+	kfree(ism);
+	dev_set_drvdata(&pdev->dev, NULL);
+	return ret;
+}
+
+static void ism_dev_exit(struct ism_dev *ism)
+{
+	struct pci_dev *pdev = ism->pdev;
+
+	smcd_unregister_dev(ism->smcd);
+	unregister_ieq(ism);
+	unregister_sba(ism);
+	free_irq(pci_irq_vector(pdev, 0), ism);
+	pci_free_irq_vectors(pdev);
+}
+
+static void ism_remove(struct pci_dev *pdev)
+{
+	struct ism_dev *ism = dev_get_drvdata(&pdev->dev);
+
+	ism_dev_exit(ism);
+
+	smcd_free_dev(ism->smcd);
+	pci_iounmap(pdev, ism->ctl);
+	pci_release_mem_regions(pdev);
+	pci_disable_device(pdev);
+	dev_set_drvdata(&pdev->dev, NULL);
+	kfree(ism);
+}
+
+static int ism_suspend(struct device *dev)
+{
+	struct ism_dev *ism = dev_get_drvdata(dev);
+
+	ism_dev_exit(ism);
+	return 0;
+}
+
+static int ism_resume(struct device *dev)
+{
+	struct ism_dev *ism = dev_get_drvdata(dev);
+
+	return ism_dev_init(ism);
+}
+
+static SIMPLE_DEV_PM_OPS(ism_pm_ops, ism_suspend, ism_resume);
+
+static struct pci_driver ism_driver = {
+	.name	  = DRV_NAME,
+	.id_table = ism_device_table,
+	.probe	  = ism_probe,
+	.remove	  = ism_remove,
+	.driver	  = {
+		.pm = &ism_pm_ops,
+	},
+};
+
+static int __init ism_init(void)
+{
+	int ret;
+
+	ism_debug_info = debug_register("ism", 2, 1, 16);
+	if (!ism_debug_info)
+		return -ENODEV;
+
+	debug_register_view(ism_debug_info, &debug_hex_ascii_view);
+	ret = pci_register_driver(&ism_driver);
+	if (ret)
+		debug_unregister(ism_debug_info);
+
+	return ret;
+}
+
+static void __exit ism_exit(void)
+{
+	pci_unregister_driver(&ism_driver);
+	debug_unregister(ism_debug_info);
+}
+
+module_init(ism_init);
+module_exit(ism_exit);
-- 
2.16.4

* [PATCH net-next 09/10] net/smc: add SMC-D diag support
From: Ursula Braun @ 2018-06-28 17:05 UTC (permalink / raw)
  To: davem
  Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, hwippel,
	sebott, ubraun
In-Reply-To: <20180628170513.5089-1-ubraun@linux.ibm.com>

From: Hans Wippel <hwippel@linux.ibm.com>

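This patch adds diag support for SMC-D. A new SMC_DIAG_DMBINFO attribute
reports the link identifier, the local and peer GIDs, and the tokens of
the local DMB and the remote DMBE.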

Signed-off-by: Hans Wippel <hwippel@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Suggested-by: Thomas Richter <tmricht@linux.ibm.com>
---
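As a rough illustration of how the new attribute is requested: the dump
code in this patch emits SMC_DIAG_DMBINFO only when the matching bit is
set in the request's diag_ext field, computed as 1 << (attribute - 1).
The sketch below reproduces that arithmetic in plain userspace C. All
sketch_ names are hypothetical, diag_ext is modelled as an 8-bit field
(an assumption), and the enum assumes a zero-valued "none" entry ahead of
CONNINFO, which the hunk does not show.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the diag attribute enum. Assumption: a zero-valued
 * "none" entry precedes CONNINFO; the hunk only shows the tail of the enum.
 */
enum sketch_diag_attr {
	SKETCH_DIAG_NONE,
	SKETCH_DIAG_CONNINFO,
	SKETCH_DIAG_LGRINFO,
	SKETCH_DIAG_SHUTDOWN,
	SKETCH_DIAG_DMBINFO,
};

/* Bit a requester sets in diag_ext to ask for one attribute. */
static uint8_t sketch_diag_ext_bit(enum sketch_diag_attr attr)
{
	assert(attr != SKETCH_DIAG_NONE);	/* no bit exists for "none" */
	return (uint8_t)(1u << (attr - 1));
}

/* Same style of test the dump code applies before emitting an attribute. */
static bool sketch_diag_wanted(uint8_t diag_ext, enum sketch_diag_attr attr)
{
	return (diag_ext & sketch_diag_ext_bit(attr)) != 0;
}
```

Under these assumptions a requester that wants the DMB information would
set the bit returned by sketch_diag_ext_bit(SKETCH_DIAG_DMBINFO) in
diag_ext; requests without that bit get the same reply as before.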
 include/uapi/linux/smc_diag.h | 10 ++++++++++
 net/smc/smc_diag.c            | 15 +++++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/include/uapi/linux/smc_diag.h b/include/uapi/linux/smc_diag.h
index 0ae5d4685ba3..92be255e534c 100644
--- a/include/uapi/linux/smc_diag.h
+++ b/include/uapi/linux/smc_diag.h
@@ -35,6 +35,7 @@ enum {
 	SMC_DIAG_CONNINFO,
 	SMC_DIAG_LGRINFO,
 	SMC_DIAG_SHUTDOWN,
+	SMC_DIAG_DMBINFO,
 	__SMC_DIAG_MAX,
 };
 
@@ -83,4 +84,13 @@ struct smc_diag_lgrinfo {
 	struct smc_diag_linkinfo	lnk[1];
 	__u8				role;
 };
+
+struct smcd_diag_dmbinfo {		/* SMC-D Socket internals */
+	__u32 linkid;			/* Link identifier */
+	__u64 peer_gid;			/* Peer GID */
+	__u64 my_gid;			/* My GID */
+	__u64 token;			/* Token of DMB */
+	__u64 peer_token;		/* Token of remote DMBE */
+};
+
 #endif /* _UAPI_SMC_DIAG_H_ */
diff --git a/net/smc/smc_diag.c b/net/smc/smc_diag.c
index 64ce107c24d9..6d83eef1b743 100644
--- a/net/smc/smc_diag.c
+++ b/net/smc/smc_diag.c
@@ -156,6 +156,21 @@ static int __smc_diag_dump(struct sock *sk, struct sk_buff *skb,
 		if (nla_put(skb, SMC_DIAG_LGRINFO, sizeof(linfo), &linfo) < 0)
 			goto errout;
 	}
+	if (smc->conn.lgr && smc->conn.lgr->is_smcd &&
+	    (req->diag_ext & (1 << (SMC_DIAG_DMBINFO - 1))) &&
+	    !list_empty(&smc->conn.lgr->list)) {
+		struct smc_connection *conn = &smc->conn;
+		struct smcd_diag_dmbinfo dinfo = {
+			.linkid = *((u32 *)conn->lgr->id),
+			.peer_gid = conn->lgr->peer_gid,
+			.my_gid = conn->lgr->smcd->local_gid,
+			.token = conn->rmb_desc->token,
+			.peer_token = conn->peer_token
+		};
+
+		if (nla_put(skb, SMC_DIAG_DMBINFO, sizeof(dinfo), &dinfo) < 0)
+			goto errout;
+	}
 
 	nlmsg_end(skb, nlh);
 	return 0;
-- 
2.16.4

* [PATCH net-next 08/10] net/smc: add SMC-D support in af_smc
From: Ursula Braun @ 2018-06-28 17:05 UTC (permalink / raw)
  To: davem
  Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, hwippel,
	sebott, ubraun
In-Reply-To: <20180628170513.5089-1-ubraun@linux.ibm.com>

From: Hans Wippel <hwippel@linux.ibm.com>

This patch ties together the previous SMC-D patches. It adds support for
SMC-D to the listen and connect functions and, thus, enables SMC-D
support in the SMC code. If a connection supports both SMC-R and SMC-D,
SMC-D is preferred.

Signed-off-by: Hans Wippel <hwippel@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Suggested-by: Thomas Richter <tmricht@linux.ibm.com>
---
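The preference described in the commit message can be sketched as a pair
of small decision functions. This is only a userspace model of the
control flow in __smc_connect() and smc_listen_work() below, not the
kernel code itself: the sketch_ names and the enum values are
hypothetical, and the two bool arguments stand in for the outcome of the
ISM and RDMA checks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for SMC_TYPE_R/D/B; numeric values are arbitrary. */
enum sketch_path {
	SKETCH_PATH_NONE,	/* neither usable: fall back to TCP */
	SKETCH_PATH_R,		/* SMC-R only */
	SKETCH_PATH_D,		/* SMC-D only */
	SKETCH_PATH_B,		/* both offered */
};

/* Client side: which path type goes into the CLC proposal. */
static enum sketch_path sketch_client_proposal(bool ism_ok, bool rdma_ok)
{
	if (ism_ok && rdma_ok)
		return SKETCH_PATH_B;
	if (ism_ok)
		return SKETCH_PATH_D;
	if (rdma_ok)
		return SKETCH_PATH_R;
	return SKETCH_PATH_NONE;
}

/* Server side: ISM is tried first; RDMA is considered only if ISM is out. */
static enum sketch_path sketch_server_choice(enum sketch_path proposed,
					     bool ism_ok, bool rdma_ok)
{
	/* a proposal is only sent when at least one path exists */
	assert(proposed != SKETCH_PATH_NONE);

	if ((proposed == SKETCH_PATH_D || proposed == SKETCH_PATH_B) && ism_ok)
		return SKETCH_PATH_D;
	if ((proposed == SKETCH_PATH_R || proposed == SKETCH_PATH_B) && rdma_ok)
		return SKETCH_PATH_R;
	return SKETCH_PATH_NONE;
}
```

With a proposal of type "both" and a usable ISM device on the server, the
model picks SMC-D even though RDMA would also work, which is the
"SMC-D is preferred" rule stated above.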
 net/smc/af_smc.c   | 216 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 net/smc/smc_core.c |   2 +-
 net/smc/smc_core.h |   1 +
 3 files changed, 200 insertions(+), 19 deletions(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 20afa94be8bb..cbbb947dbfcf 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -23,6 +23,7 @@
 #include <linux/workqueue.h>
 #include <linux/in.h>
 #include <linux/sched/signal.h>
+#include <linux/if_vlan.h>
 
 #include <net/sock.h>
 #include <net/tcp.h>
@@ -35,6 +36,7 @@
 #include "smc_cdc.h"
 #include "smc_core.h"
 #include "smc_ib.h"
+#include "smc_ism.h"
 #include "smc_pnet.h"
 #include "smc_tx.h"
 #include "smc_rx.h"
@@ -372,8 +374,8 @@ static int smc_clnt_conf_first_link(struct smc_sock *smc)
 	return 0;
 }
 
-static void smc_conn_save_peer_info(struct smc_sock *smc,
-				    struct smc_clc_msg_accept_confirm *clc)
+static void smcr_conn_save_peer_info(struct smc_sock *smc,
+				     struct smc_clc_msg_accept_confirm *clc)
 {
 	int bufsize = smc_uncompress_bufsize(clc->rmbe_size);
 
@@ -384,6 +386,28 @@ static void smc_conn_save_peer_info(struct smc_sock *smc,
 	smc->conn.tx_off = bufsize * (smc->conn.peer_rmbe_idx - 1);
 }
 
+static void smcd_conn_save_peer_info(struct smc_sock *smc,
+				     struct smc_clc_msg_accept_confirm *clc)
+{
+	int bufsize = smc_uncompress_bufsize(clc->dmbe_size);
+
+	smc->conn.peer_rmbe_idx = clc->dmbe_idx;
+	smc->conn.peer_token = clc->token;
+	/* msg header takes up space in the buffer */
+	smc->conn.peer_rmbe_size = bufsize - sizeof(struct smcd_cdc_msg);
+	atomic_set(&smc->conn.peer_rmbe_space, smc->conn.peer_rmbe_size);
+	smc->conn.tx_off = bufsize * smc->conn.peer_rmbe_idx;
+}
+
+static void smc_conn_save_peer_info(struct smc_sock *smc,
+				    struct smc_clc_msg_accept_confirm *clc)
+{
+	if (smc->conn.lgr->is_smcd)
+		smcd_conn_save_peer_info(smc, clc);
+	else
+		smcr_conn_save_peer_info(smc, clc);
+}
+
 static void smc_link_save_peer_info(struct smc_link *link,
 				    struct smc_clc_msg_accept_confirm *clc)
 {
@@ -450,15 +474,51 @@ static int smc_check_rdma(struct smc_sock *smc, struct smc_ib_device **ibdev,
 	return reason_code;
 }
 
+/* check if there is an ISM device available for this connection. */
+/* called for connect and listen */
+static int smc_check_ism(struct smc_sock *smc, struct smcd_dev **ismdev)
+{
+	/* Find ISM device with same PNETID as connecting interface  */
+	smc_pnet_find_ism_resource(smc->clcsock->sk, ismdev);
+	if (!(*ismdev))
+		return SMC_CLC_DECL_CNFERR; /* configuration error */
+	return 0;
+}
+
+/* Check for VLAN ID and register it on ISM device just for CLC handshake */
+static int smc_connect_ism_vlan_setup(struct smc_sock *smc,
+				      struct smcd_dev *ismdev,
+				      unsigned short vlan_id)
+{
+	if (vlan_id && smc_ism_get_vlan(ismdev, vlan_id))
+		return SMC_CLC_DECL_CNFERR;
+	return 0;
+}
+
+/* cleanup temporary VLAN ID registration used for CLC handshake. If ISM is
+ * used, the VLAN ID will be registered again during the connection setup.
+ */
+static int smc_connect_ism_vlan_cleanup(struct smc_sock *smc, bool is_smcd,
+					struct smcd_dev *ismdev,
+					unsigned short vlan_id)
+{
+	if (!is_smcd)
+		return 0;
+	if (vlan_id && smc_ism_put_vlan(ismdev, vlan_id))
+		return SMC_CLC_DECL_CNFERR;
+	return 0;
+}
+
 /* CLC handshake during connect */
 static int smc_connect_clc(struct smc_sock *smc, int smc_type,
 			   struct smc_clc_msg_accept_confirm *aclc,
-			   struct smc_ib_device *ibdev, u8 ibport)
+			   struct smc_ib_device *ibdev, u8 ibport,
+			   struct smcd_dev *ismdev)
 {
 	int rc = 0;
 
 	/* do inband token exchange */
-	rc = smc_clc_send_proposal(smc, smc_type, ibdev, ibport, NULL);
+	rc = smc_clc_send_proposal(smc, smc_type, ibdev, ibport, ismdev);
 	if (rc)
 		return rc;
 	/* receive SMC Accept CLC message */
@@ -538,11 +598,50 @@ static int smc_connect_rdma(struct smc_sock *smc,
 	return 0;
 }
 
+/* setup for ISM connection of client */
+static int smc_connect_ism(struct smc_sock *smc,
+			   struct smc_clc_msg_accept_confirm *aclc,
+			   struct smcd_dev *ismdev)
+{
+	int local_contact = SMC_FIRST_CONTACT;
+	int rc = 0;
+
+	mutex_lock(&smc_create_lgr_pending);
+	local_contact = smc_conn_create(smc, true, aclc->hdr.flag, NULL, 0,
+					NULL, ismdev, aclc->gid);
+	if (local_contact < 0)
+		return smc_connect_abort(smc, SMC_CLC_DECL_MEM, 0);
+
+	/* Create send and receive buffers */
+	if (smc_buf_create(smc, true))
+		return smc_connect_abort(smc, SMC_CLC_DECL_MEM, local_contact);
+
+	smc_conn_save_peer_info(smc, aclc);
+	smc_close_init(smc);
+	smc_rx_init(smc);
+	smc_tx_init(smc);
+
+	rc = smc_clc_send_confirm(smc);
+	if (rc)
+		return smc_connect_abort(smc, rc, local_contact);
+	mutex_unlock(&smc_create_lgr_pending);
+
+	smc_copy_sock_settings_to_clc(smc);
+	if (smc->sk.sk_state == SMC_INIT)
+		smc->sk.sk_state = SMC_ACTIVE;
+
+	return 0;
+}
+
 /* perform steps before actually connecting */
 static int __smc_connect(struct smc_sock *smc)
 {
+	bool ism_supported = false, rdma_supported = false;
 	struct smc_clc_msg_accept_confirm aclc;
 	struct smc_ib_device *ibdev;
+	struct smcd_dev *ismdev;
+	unsigned short vlan;
+	int smc_type;
 	int rc = 0;
 	u8 ibport;
 
@@ -559,20 +658,52 @@ static int __smc_connect(struct smc_sock *smc)
 	if (using_ipsec(smc))
 		return smc_connect_decline_fallback(smc, SMC_CLC_DECL_IPSEC);
 
-	/* check if a RDMA device is available; if not, fall back */
-	if (smc_check_rdma(smc, &ibdev, &ibport))
+	/* check for VLAN ID */
+	if (smc_vlan_by_tcpsk(smc->clcsock, &vlan))
+		return smc_connect_decline_fallback(smc, SMC_CLC_DECL_CNFERR);
+
+	/* check if there is an ism device available */
+	if (!smc_check_ism(smc, &ismdev) &&
+	    !smc_connect_ism_vlan_setup(smc, ismdev, vlan)) {
+		/* ISM is supported for this connection */
+		ism_supported = true;
+		smc_type = SMC_TYPE_D;
+	}
+
+	/* check if there is a rdma device available */
+	if (!smc_check_rdma(smc, &ibdev, &ibport)) {
+		/* RDMA is supported for this connection */
+		rdma_supported = true;
+		if (ism_supported)
+			smc_type = SMC_TYPE_B; /* both */
+		else
+			smc_type = SMC_TYPE_R; /* only RDMA */
+	}
+
+	/* if neither ISM nor RDMA is supported, fall back */
+	if (!rdma_supported && !ism_supported)
 		return smc_connect_decline_fallback(smc, SMC_CLC_DECL_CNFERR);
 
 	/* perform CLC handshake */
-	rc = smc_connect_clc(smc, SMC_TYPE_R, &aclc, ibdev, ibport);
-	if (rc)
+	rc = smc_connect_clc(smc, smc_type, &aclc, ibdev, ibport, ismdev);
+	if (rc) {
+		smc_connect_ism_vlan_cleanup(smc, ism_supported, ismdev, vlan);
 		return smc_connect_decline_fallback(smc, rc);
+	}
 
-	/* connect using rdma */
-	rc = smc_connect_rdma(smc, &aclc, ibdev, ibport);
-	if (rc)
+	/* depending on previous steps, connect using rdma or ism */
+	if (rdma_supported && aclc.hdr.path == SMC_TYPE_R)
+		rc = smc_connect_rdma(smc, &aclc, ibdev, ibport);
+	else if (ism_supported && aclc.hdr.path == SMC_TYPE_D)
+		rc = smc_connect_ism(smc, &aclc, ismdev);
+	else
+		rc = SMC_CLC_DECL_CNFERR;
+	if (rc) {
+		smc_connect_ism_vlan_cleanup(smc, ism_supported, ismdev, vlan);
 		return smc_connect_decline_fallback(smc, rc);
+	}
 
+	smc_connect_ism_vlan_cleanup(smc, ism_supported, ismdev, vlan);
 	return 0;
 }
 
@@ -909,6 +1040,44 @@ static int smc_listen_rdma_init(struct smc_sock *new_smc,
 	return 0;
 }
 
+/* listen worker: initialize connection and buffers for SMC-D */
+static int smc_listen_ism_init(struct smc_sock *new_smc,
+			       struct smc_clc_msg_proposal *pclc,
+			       struct smcd_dev *ismdev,
+			       int *local_contact)
+{
+	struct smc_clc_msg_smcd *pclc_smcd;
+
+	pclc_smcd = smc_get_clc_msg_smcd(pclc);
+	*local_contact = smc_conn_create(new_smc, true, 0, NULL, 0, NULL,
+					 ismdev, pclc_smcd->gid);
+	if (*local_contact < 0) {
+		if (*local_contact == -ENOMEM)
+			return SMC_CLC_DECL_MEM; /* insufficient memory */
+		return SMC_CLC_DECL_INTERR; /* other error */
+	}
+
+	/* Check if peer can be reached via ISM device */
+	if (smc_ism_cantalk(new_smc->conn.lgr->peer_gid,
+			    new_smc->conn.lgr->vlan_id,
+			    new_smc->conn.lgr->smcd)) {
+		if (*local_contact == SMC_FIRST_CONTACT)
+			smc_lgr_forget(new_smc->conn.lgr);
+		smc_conn_free(&new_smc->conn);
+		return SMC_CLC_DECL_CNFERR;
+	}
+
+	/* Create send and receive buffers */
+	if (smc_buf_create(new_smc, true)) {
+		if (*local_contact == SMC_FIRST_CONTACT)
+			smc_lgr_forget(new_smc->conn.lgr);
+		smc_conn_free(&new_smc->conn);
+		return SMC_CLC_DECL_MEM;
+	}
+
+	return 0;
+}
+
 /* listen worker: register buffers */
 static int smc_listen_rdma_reg(struct smc_sock *new_smc, int local_contact)
 {
@@ -967,6 +1136,8 @@ static void smc_listen_work(struct work_struct *work)
 	struct smc_clc_msg_accept_confirm cclc;
 	struct smc_clc_msg_proposal *pclc;
 	struct smc_ib_device *ibdev;
+	bool ism_supported = false;
+	struct smcd_dev *ismdev;
 	u8 buf[SMC_CLC_MAX_LEN];
 	int local_contact = 0;
 	int reason_code = 0;
@@ -1007,13 +1178,21 @@ static void smc_listen_work(struct work_struct *work)
 	smc_rx_init(new_smc);
 	smc_tx_init(new_smc);
 
+	/* check if ISM is available */
+	if ((pclc->hdr.path == SMC_TYPE_D || pclc->hdr.path == SMC_TYPE_B) &&
+	    !smc_check_ism(new_smc, &ismdev) &&
+	    !smc_listen_ism_init(new_smc, pclc, ismdev, &local_contact)) {
+		ism_supported = true;
+	}
+
 	/* check if RDMA is available */
-	if ((pclc->hdr.path != SMC_TYPE_R && pclc->hdr.path != SMC_TYPE_B) ||
-	    smc_check_rdma(new_smc, &ibdev, &ibport) ||
-	    smc_listen_rdma_check(new_smc, pclc) ||
-	    smc_listen_rdma_init(new_smc, pclc, ibdev, ibport,
-				 &local_contact) ||
-	    smc_listen_rdma_reg(new_smc, local_contact)) {
+	if (!ism_supported &&
+	    ((pclc->hdr.path != SMC_TYPE_R && pclc->hdr.path != SMC_TYPE_B) ||
+	     smc_check_rdma(new_smc, &ibdev, &ibport) ||
+	     smc_listen_rdma_check(new_smc, pclc) ||
+	     smc_listen_rdma_init(new_smc, pclc, ibdev, ibport,
+				  &local_contact) ||
+	     smc_listen_rdma_reg(new_smc, local_contact))) {
 		/* SMC not supported, decline */
 		mutex_unlock(&smc_create_lgr_pending);
 		smc_listen_decline(new_smc, SMC_CLC_DECL_CNFERR, local_contact);
@@ -1038,7 +1217,8 @@ static void smc_listen_work(struct work_struct *work)
 	}
 
 	/* finish worker */
-	smc_listen_rdma_finish(new_smc, &cclc, local_contact);
+	if (!ism_supported)
+		smc_listen_rdma_finish(new_smc, &cclc, local_contact);
 	smc_conn_save_peer_info(new_smc, &cclc);
 	mutex_unlock(&smc_create_lgr_pending);
 	smc_listen_out_connected(new_smc);
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index 434c028162a4..66741e61a3b0 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -478,7 +478,7 @@ void smc_smcd_terminate(struct smcd_dev *dev, u64 peer_gid)
 /* Determine vlan of internal TCP socket.
  * @vlan_id: address to store the determined vlan id into
  */
-static int smc_vlan_by_tcpsk(struct socket *clcsock, unsigned short *vlan_id)
+int smc_vlan_by_tcpsk(struct socket *clcsock, unsigned short *vlan_id)
 {
 	struct dst_entry *dst = sk_dst_get(clcsock->sk);
 	struct net_device *ndev;
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index cd9268a9570e..8b47e0168fc3 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -257,6 +257,7 @@ void smc_sndbuf_sync_sg_for_cpu(struct smc_connection *conn);
 void smc_sndbuf_sync_sg_for_device(struct smc_connection *conn);
 void smc_rmb_sync_sg_for_cpu(struct smc_connection *conn);
 void smc_rmb_sync_sg_for_device(struct smc_connection *conn);
+int smc_vlan_by_tcpsk(struct socket *clcsock, unsigned short *vlan_id);
 
 void smc_conn_free(struct smc_connection *conn);
 int smc_conn_create(struct smc_sock *smc, bool is_smcd, int srv_first_contact,
-- 
2.16.4

* [PATCH net-next 07/10] net/smc: add SMC-D support in data transfer
From: Ursula Braun @ 2018-06-28 17:05 UTC (permalink / raw)
  To: davem
  Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, hwippel,
	sebott, ubraun
In-Reply-To: <20180628170513.5089-1-ubraun@linux.ibm.com>

From: Hans Wippel <hwippel@linux.ibm.com>

The data transfer and CDC message headers differ in SMC-R and SMC-D.
This patch adds support for the SMC-D data transfer to the existing SMC
code. It consists of the following:

* SMC-D CDC support
* SMC-D tx support
* SMC-D rx support

The CDC header is stored at the beginning of the receive buffer. Thus, a
rx_offset variable is added for the CDC header offset within the buffer
(0 for SMC-R).

Signed-off-by: Hans Wippel <hwippel@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Suggested-by: Thomas Richter <tmricht@linux.ibm.com>
---
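To make the rx_offset idea from the commit message concrete, here is a
small userspace sketch of the buffer layout. It assumes the 32-byte
SMC-D CDC header size given in the rx_off comment of this patch; the
sketch_ names and the struct are hypothetical and only model the
arithmetic, not the kernel data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Header size taken from the rx_off comment in this patch (32 for SMC-D). */
#define SKETCH_SMCD_CDC_HDR_LEN 32u

/* Hypothetical, simplified view of a receive buffer. */
struct sketch_rx_buf {
	uint8_t *cpu_addr;	/* start of the whole buffer */
	size_t   bufsize;	/* size of the whole buffer */
	uint8_t  rx_off;	/* 0 for SMC-R, header length for SMC-D */
};

static void sketch_rx_buf_init(struct sketch_rx_buf *b, uint8_t *mem,
			       size_t bufsize, int is_smcd)
{
	/* an SMC-D buffer must have room for the header plus some payload */
	assert(!is_smcd || bufsize > SKETCH_SMCD_CDC_HDR_LEN);
	b->cpu_addr = mem;
	b->bufsize = bufsize;
	b->rx_off = is_smcd ? SKETCH_SMCD_CDC_HDR_LEN : 0;
}

/* First payload byte: skips the CDC header when one is stored in the buffer. */
static uint8_t *sketch_rx_payload(const struct sketch_rx_buf *b)
{
	return b->cpu_addr + b->rx_off;
}

/* Space left for payload once the header has been accounted for. */
static size_t sketch_rx_usable_len(const struct sketch_rx_buf *b)
{
	return b->bufsize - b->rx_off;
}
```

For SMC-R the offset is zero, so both helpers degenerate to the plain
buffer start and size. For SMC-D the payload starts after the header and
the usable length shrinks by the header size, which matches how the patch
reduces buf_desc->len when creating a DMB.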
 net/smc/smc.h      |   5 ++
 net/smc/smc_cdc.c  |  86 +++++++++++++++++++++++-
 net/smc/smc_cdc.h  |  43 +++++++++++-
 net/smc/smc_core.c |  25 +++++--
 net/smc/smc_ism.c  |   8 +++
 net/smc/smc_rx.c   |   2 +-
 net/smc/smc_tx.c   | 193 +++++++++++++++++++++++++++++++++++++++++------------
 net/smc/smc_tx.h   |   2 +
 8 files changed, 308 insertions(+), 56 deletions(-)

diff --git a/net/smc/smc.h b/net/smc/smc.h
index 7c86f716a92e..8c6231011779 100644
--- a/net/smc/smc.h
+++ b/net/smc/smc.h
@@ -183,6 +183,11 @@ struct smc_connection {
 	spinlock_t		acurs_lock;	/* protect cursors */
 #endif
 	struct work_struct	close_work;	/* peer sent some closing */
+	struct tasklet_struct	rx_tsklet;	/* Receiver tasklet for SMC-D */
+	u8			rx_off;		/* receive offset:
+						 * 0 for SMC-R, 32 for SMC-D
+						 */
+	u64			peer_token;	/* SMC-D token of peer */
 };
 
 struct smc_sock {				/* smc sock container */
diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
index a7e8d63fc8ae..621d8cca570b 100644
--- a/net/smc/smc_cdc.c
+++ b/net/smc/smc_cdc.c
@@ -117,7 +117,7 @@ int smc_cdc_msg_send(struct smc_connection *conn,
 	return rc;
 }
 
-int smc_cdc_get_slot_and_msg_send(struct smc_connection *conn)
+static int smcr_cdc_get_slot_and_msg_send(struct smc_connection *conn)
 {
 	struct smc_cdc_tx_pend *pend;
 	struct smc_wr_buf *wr_buf;
@@ -130,6 +130,21 @@ int smc_cdc_get_slot_and_msg_send(struct smc_connection *conn)
 	return smc_cdc_msg_send(conn, wr_buf, pend);
 }
 
+int smc_cdc_get_slot_and_msg_send(struct smc_connection *conn)
+{
+	int rc;
+
+	if (conn->lgr->is_smcd) {
+		spin_lock_bh(&conn->send_lock);
+		rc = smcd_cdc_msg_send(conn);
+		spin_unlock_bh(&conn->send_lock);
+	} else {
+		rc = smcr_cdc_get_slot_and_msg_send(conn);
+	}
+
+	return rc;
+}
+
 static bool smc_cdc_tx_filter(struct smc_wr_tx_pend_priv *tx_pend,
 			      unsigned long data)
 {
@@ -157,6 +172,45 @@ void smc_cdc_tx_dismiss_slots(struct smc_connection *conn)
 				(unsigned long)conn);
 }
 
+/* Send an SMC-D CDC header.
+ * This increments the free space available in our send buffer.
+ * Also update the confirmed receive buffer with what was sent to the peer.
+ */
+int smcd_cdc_msg_send(struct smc_connection *conn)
+{
+	struct smc_sock *smc = container_of(conn, struct smc_sock, conn);
+	struct smcd_cdc_msg cdc;
+	int rc, diff;
+
+	memset(&cdc, 0, sizeof(cdc));
+	cdc.common.type = SMC_CDC_MSG_TYPE;
+	cdc.prod_wrap = conn->local_tx_ctrl.prod.wrap;
+	cdc.prod_count = conn->local_tx_ctrl.prod.count;
+
+	cdc.cons_wrap = conn->local_tx_ctrl.cons.wrap;
+	cdc.cons_count = conn->local_tx_ctrl.cons.count;
+	cdc.prod_flags = conn->local_tx_ctrl.prod_flags;
+	cdc.conn_state_flags = conn->local_tx_ctrl.conn_state_flags;
+	rc = smcd_tx_ism_write(conn, &cdc, sizeof(cdc), 0, 1);
+	if (rc)
+		return rc;
+	smc_curs_write(&conn->rx_curs_confirmed,
+		       smc_curs_read(&conn->local_tx_ctrl.cons, conn), conn);
+	/* Calculate transmitted data and increment free send buffer space */
+	diff = smc_curs_diff(conn->sndbuf_desc->len, &conn->tx_curs_fin,
+			     &conn->tx_curs_sent);
+	/* increased by confirmed number of bytes */
+	smp_mb__before_atomic();
+	atomic_add(diff, &conn->sndbuf_space);
+	/* guarantee 0 <= sndbuf_space <= sndbuf_desc->len */
+	smp_mb__after_atomic();
+	smc_curs_write(&conn->tx_curs_fin,
+		       smc_curs_read(&conn->tx_curs_sent, conn), conn);
+
+	smc_tx_sndbuf_nonfull(smc);
+	return rc;
+}
+
 /********************************* receive ***********************************/
 
 static inline bool smc_cdc_before(u16 seq1, u16 seq2)
@@ -178,7 +232,7 @@ static void smc_cdc_handle_urg_data_arrival(struct smc_sock *smc,
 	if (!sock_flag(&smc->sk, SOCK_URGINLINE))
 		/* we'll skip the urgent byte, so don't account for it */
 		(*diff_prod)--;
-	base = (char *)conn->rmb_desc->cpu_addr;
+	base = (char *)conn->rmb_desc->cpu_addr + conn->rx_off;
 	if (conn->urg_curs.count)
 		conn->urg_rx_byte = *(base + conn->urg_curs.count - 1);
 	else
@@ -276,6 +330,34 @@ static void smc_cdc_msg_recv(struct smc_sock *smc, struct smc_cdc_msg *cdc)
 	sock_put(&smc->sk); /* no free sk in softirq-context */
 }
 
+/* Receive tasklet for this connection. Scheduled from the ISM device IRQ
+ * handler to indicate an update in the DMBE.
+ *
+ * Context:
+ * - tasklet context
+ */
+static void smcd_cdc_rx_tsklet(unsigned long data)
+{
+	struct smc_connection *conn = (struct smc_connection *)data;
+	struct smcd_cdc_msg cdc;
+	struct smc_sock *smc;
+
+	if (!conn)
+		return;
+
+	memcpy(&cdc, conn->rmb_desc->cpu_addr, sizeof(cdc));
+	smc = container_of(conn, struct smc_sock, conn);
+	smc_cdc_msg_recv(smc, (struct smc_cdc_msg *)&cdc);
+}
+
+/* Initialize receive tasklet. Called at connection creation; the tasklet is
+ * scheduled from the ISM device IRQ handler to start the receiver side.
+ */
+void smcd_cdc_rx_init(struct smc_connection *conn)
+{
+	tasklet_init(&conn->rx_tsklet, smcd_cdc_rx_tsklet, (unsigned long)conn);
+}
+
 /***************************** init, exit, misc ******************************/
 
 static void smc_cdc_rx_handler(struct ib_wc *wc, void *buf)
diff --git a/net/smc/smc_cdc.h b/net/smc/smc_cdc.h
index f60082fee5b8..8fbce4fee3e4 100644
--- a/net/smc/smc_cdc.h
+++ b/net/smc/smc_cdc.h
@@ -50,6 +50,20 @@ struct smc_cdc_msg {
 	u8				reserved[18];
 } __packed;					/* format defined in RFC7609 */
 
+/* CDC message for SMC-D */
+struct smcd_cdc_msg {
+	struct smc_wr_rx_hdr common;	/* Type = 0xFE */
+	u8 res1[7];
+	u16 prod_wrap;
+	u32 prod_count;
+	u8 res2[2];
+	u16 cons_wrap;
+	u32 cons_count;
+	struct smc_cdc_producer_flags	prod_flags;
+	struct smc_cdc_conn_state_flags conn_state_flags;
+	u8 res3[8];
+} __packed;
+
 static inline bool smc_cdc_rxed_any_close(struct smc_connection *conn)
 {
 	return conn->local_rx_ctrl.conn_state_flags.peer_conn_abort ||
@@ -204,9 +218,9 @@ static inline void smc_cdc_cursor_to_host(union smc_host_cursor *local,
 	smc_curs_write(local, smc_curs_read(&temp, conn), conn);
 }
 
-static inline void smc_cdc_msg_to_host(struct smc_host_cdc_msg *local,
-				       struct smc_cdc_msg *peer,
-				       struct smc_connection *conn)
+static inline void smcr_cdc_msg_to_host(struct smc_host_cdc_msg *local,
+					struct smc_cdc_msg *peer,
+					struct smc_connection *conn)
 {
 	local->common.type = peer->common.type;
 	local->len = peer->len;
@@ -218,6 +232,27 @@ static inline void smc_cdc_msg_to_host(struct smc_host_cdc_msg *local,
 	local->conn_state_flags = peer->conn_state_flags;
 }
 
+static inline void smcd_cdc_msg_to_host(struct smc_host_cdc_msg *local,
+					struct smcd_cdc_msg *peer)
+{
+	local->prod.wrap = peer->prod_wrap;
+	local->prod.count = peer->prod_count;
+	local->cons.wrap = peer->cons_wrap;
+	local->cons.count = peer->cons_count;
+	local->prod_flags = peer->prod_flags;
+	local->conn_state_flags = peer->conn_state_flags;
+}
+
+static inline void smc_cdc_msg_to_host(struct smc_host_cdc_msg *local,
+				       struct smc_cdc_msg *peer,
+				       struct smc_connection *conn)
+{
+	if (conn->lgr->is_smcd)
+		smcd_cdc_msg_to_host(local, (struct smcd_cdc_msg *)peer);
+	else
+		smcr_cdc_msg_to_host(local, peer, conn);
+}
+
 struct smc_cdc_tx_pend;
 
 int smc_cdc_get_free_slot(struct smc_connection *conn,
@@ -227,6 +262,8 @@ void smc_cdc_tx_dismiss_slots(struct smc_connection *conn);
 int smc_cdc_msg_send(struct smc_connection *conn, struct smc_wr_buf *wr_buf,
 		     struct smc_cdc_tx_pend *pend);
 int smc_cdc_get_slot_and_msg_send(struct smc_connection *conn);
+int smcd_cdc_msg_send(struct smc_connection *conn);
 int smc_cdc_init(void) __init;
+void smcd_cdc_rx_init(struct smc_connection *conn);
 
 #endif /* SMC_CDC_H */
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index daa88db1841a..434c028162a4 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -281,10 +281,12 @@ void smc_conn_free(struct smc_connection *conn)
 {
 	if (!conn->lgr)
 		return;
-	if (conn->lgr->is_smcd)
+	if (conn->lgr->is_smcd) {
 		smc_ism_unset_conn(conn);
-	else
+		tasklet_kill(&conn->rx_tsklet);
+	} else {
 		smc_cdc_tx_dismiss_slots(conn);
+	}
 	smc_lgr_unregister_conn(conn);
 	smc_buf_unuse(conn);
 }
@@ -324,10 +326,13 @@ static void smcr_buf_free(struct smc_link_group *lgr, bool is_rmb,
 static void smcd_buf_free(struct smc_link_group *lgr, bool is_dmb,
 			  struct smc_buf_desc *buf_desc)
 {
-	if (is_dmb)
+	if (is_dmb) {
+		/* restore original buf len */
+		buf_desc->len += sizeof(struct smcd_cdc_msg);
 		smc_ism_unregister_dmb(lgr->smcd, buf_desc);
-	else
+	} else {
 		kfree(buf_desc->cpu_addr);
+	}
 	kfree(buf_desc);
 }
 
@@ -632,6 +637,10 @@ int smc_conn_create(struct smc_sock *smc, bool is_smcd, int srv_first_contact,
 	conn->local_tx_ctrl.common.type = SMC_CDC_MSG_TYPE;
 	conn->local_tx_ctrl.len = SMC_WR_TX_SIZE;
 	conn->urg_state = SMC_URG_READ;
+	if (is_smcd) {
+		conn->rx_off = sizeof(struct smcd_cdc_msg);
+		smcd_cdc_rx_init(conn); /* init tasklet for this conn */
+	}
 #ifndef KERNEL_HAS_ATOMIC64
 	spin_lock_init(&conn->acurs_lock);
 #endif
@@ -776,8 +785,9 @@ static struct smc_buf_desc *smcd_new_buf_create(struct smc_link_group *lgr,
 			kfree(buf_desc);
 			return ERR_PTR(-EAGAIN);
 		}
-		memset(buf_desc->cpu_addr, 0, bufsize);
-		buf_desc->len = bufsize;
+		buf_desc->pages = virt_to_page(buf_desc->cpu_addr);
+		/* CDC header stored in buf. So, pretend it was smaller */
+		buf_desc->len = bufsize - sizeof(struct smcd_cdc_msg);
 	} else {
 		buf_desc->cpu_addr = kzalloc(bufsize, GFP_KERNEL |
 					     __GFP_NOWARN | __GFP_NORETRY |
@@ -854,7 +864,8 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb)
 		conn->rmbe_size_short = bufsize_short;
 		smc->sk.sk_rcvbuf = bufsize * 2;
 		atomic_set(&conn->bytes_to_rcv, 0);
-		conn->rmbe_update_limit = smc_rmb_wnd_update_limit(bufsize);
+		conn->rmbe_update_limit =
+			smc_rmb_wnd_update_limit(buf_desc->len);
 		if (is_smcd)
 			smc_ism_set_conn(conn); /* map RMB/smcd_dev to conn */
 	} else {
diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
index f44e4dff244a..cfade7fdcc6d 100644
--- a/net/smc/smc_ism.c
+++ b/net/smc/smc_ism.c
@@ -302,5 +302,13 @@ EXPORT_SYMBOL_GPL(smcd_handle_event);
  */
 void smcd_handle_irq(struct smcd_dev *smcd, unsigned int dmbno)
 {
+	struct smc_connection *conn = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&smcd->lock, flags);
+	conn = smcd->conn[dmbno];
+	if (conn)
+		tasklet_schedule(&conn->rx_tsklet);
+	spin_unlock_irqrestore(&smcd->lock, flags);
 }
 EXPORT_SYMBOL_GPL(smcd_handle_irq);
diff --git a/net/smc/smc_rx.c b/net/smc/smc_rx.c
index 3d77b383cccd..b329803c8339 100644
--- a/net/smc/smc_rx.c
+++ b/net/smc/smc_rx.c
@@ -305,7 +305,7 @@ int smc_rx_recvmsg(struct smc_sock *smc, struct msghdr *msg,
 	target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);
 
 	/* we currently use 1 RMBE per RMB, so RMBE == RMB base addr */
-	rcvbuf_base = conn->rmb_desc->cpu_addr;
+	rcvbuf_base = conn->rx_off + conn->rmb_desc->cpu_addr;
 
 	do { /* while (read_remaining) */
 		if (read_done >= target || (pipe && read_done))
diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
index f82886b7d1d8..142bcb134dd6 100644
--- a/net/smc/smc_tx.c
+++ b/net/smc/smc_tx.c
@@ -24,6 +24,7 @@
 #include "smc.h"
 #include "smc_wr.h"
 #include "smc_cdc.h"
+#include "smc_ism.h"
 #include "smc_tx.h"
 
 #define SMC_TX_WORK_DELAY	HZ
@@ -250,6 +251,24 @@ int smc_tx_sendmsg(struct smc_sock *smc, struct msghdr *msg, size_t len)
 
 /***************************** sndbuf consumer *******************************/
 
+/* sndbuf consumer: actual data transfer of one target chunk with ISM write */
+int smcd_tx_ism_write(struct smc_connection *conn, void *data, size_t len,
+		      u32 offset, int signal)
+{
+	struct smc_ism_position pos;
+	int rc;
+
+	memset(&pos, 0, sizeof(pos));
+	pos.token = conn->peer_token;
+	pos.index = conn->peer_rmbe_idx;
+	pos.offset = conn->tx_off + offset;
+	pos.signal = signal;
+	rc = smc_ism_write(conn->lgr->smcd, &pos, data, len);
+	if (rc)
+		conn->local_tx_ctrl.conn_state_flags.peer_conn_abort = 1;
+	return rc;
+}
+
 /* sndbuf consumer: actual data transfer of one target chunk with RDMA write */
 static int smc_tx_rdma_write(struct smc_connection *conn, int peer_rmbe_offset,
 			     int num_sges, struct ib_sge sges[])
@@ -297,21 +316,104 @@ static inline void smc_tx_advance_cursors(struct smc_connection *conn,
 	smc_curs_add(conn->sndbuf_desc->len, sent, len);
 }
 
+/* SMC-R helper for smc_tx_rdma_writes() */
+static int smcr_tx_rdma_writes(struct smc_connection *conn, size_t len,
+			       size_t src_off, size_t src_len,
+			       size_t dst_off, size_t dst_len)
+{
+	dma_addr_t dma_addr =
+		sg_dma_address(conn->sndbuf_desc->sgt[SMC_SINGLE_LINK].sgl);
+	struct smc_link *link = &conn->lgr->lnk[SMC_SINGLE_LINK];
+	int src_len_sum = src_len, dst_len_sum = dst_len;
+	struct ib_sge sges[SMC_IB_MAX_SEND_SGE];
+	int sent_count = src_off;
+	int srcchunk, dstchunk;
+	int num_sges;
+	int rc;
+
+	for (dstchunk = 0; dstchunk < 2; dstchunk++) {
+		num_sges = 0;
+		for (srcchunk = 0; srcchunk < 2; srcchunk++) {
+			sges[srcchunk].addr = dma_addr + src_off;
+			sges[srcchunk].length = src_len;
+			sges[srcchunk].lkey = link->roce_pd->local_dma_lkey;
+			num_sges++;
+
+			src_off += src_len;
+			if (src_off >= conn->sndbuf_desc->len)
+				src_off -= conn->sndbuf_desc->len;
+						/* modulo in send ring */
+			if (src_len_sum == dst_len)
+				break; /* either on 1st or 2nd iteration */
+			/* prepare next (== 2nd) iteration */
+			src_len = dst_len - src_len; /* remainder */
+			src_len_sum += src_len;
+		}
+		rc = smc_tx_rdma_write(conn, dst_off, num_sges, sges);
+		if (rc)
+			return rc;
+		if (dst_len_sum == len)
+			break; /* either on 1st or 2nd iteration */
+		/* prepare next (== 2nd) iteration */
+		dst_off = 0; /* modulo offset in RMBE ring buffer */
+		dst_len = len - dst_len; /* remainder */
+		dst_len_sum += dst_len;
+		src_len = min_t(int, dst_len, conn->sndbuf_desc->len -
+				sent_count);
+		src_len_sum = src_len;
+	}
+	return 0;
+}
+
+/* SMC-D helper for smc_tx_rdma_writes() */
+static int smcd_tx_rdma_writes(struct smc_connection *conn, size_t len,
+			       size_t src_off, size_t src_len,
+			       size_t dst_off, size_t dst_len)
+{
+	int src_len_sum = src_len, dst_len_sum = dst_len;
+	int srcchunk, dstchunk;
+	int rc;
+
+	for (dstchunk = 0; dstchunk < 2; dstchunk++) {
+		for (srcchunk = 0; srcchunk < 2; srcchunk++) {
+			void *data = conn->sndbuf_desc->cpu_addr + src_off;
+
+			rc = smcd_tx_ism_write(conn, data, src_len, dst_off +
+					       sizeof(struct smcd_cdc_msg), 0);
+			if (rc)
+				return rc;
+			dst_off += src_len;
+			src_off += src_len;
+			if (src_off >= conn->sndbuf_desc->len)
+				src_off -= conn->sndbuf_desc->len;
+						/* modulo in send ring */
+			if (src_len_sum == dst_len)
+				break; /* either on 1st or 2nd iteration */
+			/* prepare next (== 2nd) iteration */
+			src_len = dst_len - src_len; /* remainder */
+			src_len_sum += src_len;
+		}
+		if (dst_len_sum == len)
+			break; /* either on 1st or 2nd iteration */
+		/* prepare next (== 2nd) iteration */
+		dst_off = 0; /* modulo offset in RMBE ring buffer */
+		dst_len = len - dst_len; /* remainder */
+		dst_len_sum += dst_len;
+		src_len = min_t(int, dst_len, conn->sndbuf_desc->len - src_off);
+		src_len_sum = src_len;
+	}
+	return 0;
+}
+
 /* sndbuf consumer: prepare all necessary (src&dst) chunks of data transmit;
  * usable snd_wnd as max transmit
  */
 static int smc_tx_rdma_writes(struct smc_connection *conn)
 {
-	size_t src_off, src_len, dst_off, dst_len; /* current chunk values */
-	size_t len, dst_len_sum, src_len_sum, dstchunk, srcchunk;
+	size_t len, src_len, dst_off, dst_len; /* current chunk values */
 	union smc_host_cursor sent, prep, prod, cons;
-	struct ib_sge sges[SMC_IB_MAX_SEND_SGE];
-	struct smc_link_group *lgr = conn->lgr;
 	struct smc_cdc_producer_flags *pflags;
 	int to_send, rmbespace;
-	struct smc_link *link;
-	dma_addr_t dma_addr;
-	int num_sges;
 	int rc;
 
 	/* source: sndbuf */
@@ -341,7 +443,6 @@ static int smc_tx_rdma_writes(struct smc_connection *conn)
 	len = min(to_send, rmbespace);
 
 	/* initialize variables for first iteration of subsequent nested loop */
-	link = &lgr->lnk[SMC_SINGLE_LINK];
 	dst_off = prod.count;
 	if (prod.wrap == cons.wrap) {
 		/* the filled destination area is unwrapped,
@@ -358,8 +459,6 @@ static int smc_tx_rdma_writes(struct smc_connection *conn)
 		 */
 		dst_len = len;
 	}
-	dst_len_sum = dst_len;
-	src_off = sent.count;
 	/* dst_len determines the maximum src_len */
 	if (sent.count + dst_len <= conn->sndbuf_desc->len) {
 		/* unwrapped src case: single chunk of entire dst_len */
@@ -368,38 +467,15 @@ static int smc_tx_rdma_writes(struct smc_connection *conn)
 		/* wrapped src case: 2 chunks of sum dst_len; start with 1st: */
 		src_len = conn->sndbuf_desc->len - sent.count;
 	}
-	src_len_sum = src_len;
-	dma_addr = sg_dma_address(conn->sndbuf_desc->sgt[SMC_SINGLE_LINK].sgl);
-	for (dstchunk = 0; dstchunk < 2; dstchunk++) {
-		num_sges = 0;
-		for (srcchunk = 0; srcchunk < 2; srcchunk++) {
-			sges[srcchunk].addr = dma_addr + src_off;
-			sges[srcchunk].length = src_len;
-			sges[srcchunk].lkey = link->roce_pd->local_dma_lkey;
-			num_sges++;
-			src_off += src_len;
-			if (src_off >= conn->sndbuf_desc->len)
-				src_off -= conn->sndbuf_desc->len;
-						/* modulo in send ring */
-			if (src_len_sum == dst_len)
-				break; /* either on 1st or 2nd iteration */
-			/* prepare next (== 2nd) iteration */
-			src_len = dst_len - src_len; /* remainder */
-			src_len_sum += src_len;
-		}
-		rc = smc_tx_rdma_write(conn, dst_off, num_sges, sges);
-		if (rc)
-			return rc;
-		if (dst_len_sum == len)
-			break; /* either on 1st or 2nd iteration */
-		/* prepare next (== 2nd) iteration */
-		dst_off = 0; /* modulo offset in RMBE ring buffer */
-		dst_len = len - dst_len; /* remainder */
-		dst_len_sum += dst_len;
-		src_len = min_t(int,
-				dst_len, conn->sndbuf_desc->len - sent.count);
-		src_len_sum = src_len;
-	}
+
+	if (conn->lgr->is_smcd)
+		rc = smcd_tx_rdma_writes(conn, len, sent.count, src_len,
+					 dst_off, dst_len);
+	else
+		rc = smcr_tx_rdma_writes(conn, len, sent.count, src_len,
+					 dst_off, dst_len);
+	if (rc)
+		return rc;
 
 	if (conn->urg_tx_pend && len == to_send)
 		pflags->urg_data_present = 1;
@@ -420,7 +496,7 @@ static int smc_tx_rdma_writes(struct smc_connection *conn)
 /* Wakeup sndbuf consumers from any context (IRQ or process)
  * since there is more data to transmit; usable snd_wnd as max transmit
  */
-int smc_tx_sndbuf_nonempty(struct smc_connection *conn)
+static int smcr_tx_sndbuf_nonempty(struct smc_connection *conn)
 {
 	struct smc_cdc_producer_flags *pflags;
 	struct smc_cdc_tx_pend *pend;
@@ -467,6 +543,37 @@ int smc_tx_sndbuf_nonempty(struct smc_connection *conn)
 	return rc;
 }
 
+static int smcd_tx_sndbuf_nonempty(struct smc_connection *conn)
+{
+	struct smc_cdc_producer_flags *pflags = &conn->local_tx_ctrl.prod_flags;
+	int rc = 0;
+
+	spin_lock_bh(&conn->send_lock);
+	if (!pflags->urg_data_present)
+		rc = smc_tx_rdma_writes(conn);
+	if (!rc)
+		rc = smcd_cdc_msg_send(conn);
+
+	if (!rc && pflags->urg_data_present) {
+		pflags->urg_data_pending = 0;
+		pflags->urg_data_present = 0;
+	}
+	spin_unlock_bh(&conn->send_lock);
+	return rc;
+}
+
+int smc_tx_sndbuf_nonempty(struct smc_connection *conn)
+{
+	int rc;
+
+	if (conn->lgr->is_smcd)
+		rc = smcd_tx_sndbuf_nonempty(conn);
+	else
+		rc = smcr_tx_sndbuf_nonempty(conn);
+
+	return rc;
+}
+
 /* Wakeup sndbuf consumers from process context
  * since there is more data to transmit
  */
diff --git a/net/smc/smc_tx.h b/net/smc/smc_tx.h
index 9d2238909fa0..b22bdc5694c4 100644
--- a/net/smc/smc_tx.h
+++ b/net/smc/smc_tx.h
@@ -33,5 +33,7 @@ int smc_tx_sendmsg(struct smc_sock *smc, struct msghdr *msg, size_t len);
 int smc_tx_sndbuf_nonempty(struct smc_connection *conn);
 void smc_tx_sndbuf_nonfull(struct smc_sock *smc);
 void smc_tx_consumer_update(struct smc_connection *conn, bool force);
+int smcd_tx_ism_write(struct smc_connection *conn, void *data, size_t len,
+		      u32 offset, int signal);
 
 #endif /* SMC_TX_H */
-- 
2.16.4


* [PATCH net-next 05/10] net/smc: add pnetid support for SMC-D and ISM
From: Ursula Braun @ 2018-06-28 17:05 UTC (permalink / raw)
  To: davem
  Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, hwippel,
	sebott, ubraun
In-Reply-To: <20180628170513.5089-1-ubraun@linux.ibm.com>

From: Hans Wippel <hwippel@linux.ibm.com>

SMC-D relies on PNETIDs to find usable SMC-D/ISM devices for an SMC
connection. This patch adds SMC-D/ISM support to the current PNETID
implementation.
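
As a rough illustration of the lookup described above, the following
userspace sketch compares a net device's PNETID against each known ISM
device and returns the first match. It is a simplified model, not the
kernel code: struct ism_dev_model, find_ism_by_pnetid() and the
PNETID_LEN value are assumptions made up for this example, and the list
locking used in the kernel is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PNETID_LEN 16 /* assumed length, stands in for SMC_MAX_PNETID_LEN */

/* Hypothetical stand-in for struct smcd_dev; only the PNETID is modelled. */
struct ism_dev_model {
	unsigned char pnetid[PNETID_LEN];
};

/* Return the first device whose PNETID equals ndev_pnetid, or NULL. */
struct ism_dev_model *find_ism_by_pnetid(struct ism_dev_model *devs,
					 size_t ndevs,
					 const unsigned char *ndev_pnetid)
{
	size_t i;

	assert(ndev_pnetid != NULL);
	for (i = 0; i < ndevs; i++) {
		/* full-length byte compare, as memcmp() in the patch does */
		if (!memcmp(devs[i].pnetid, ndev_pnetid, PNETID_LEN))
			return &devs[i];
	}
	return NULL; /* no ISM device shares this PNETID */
}
```

A NULL result corresponds to the case where no usable SMC-D/ISM device
is found for the connection.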

Signed-off-by: Hans Wippel <hwippel@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Suggested-by: Thomas Richter <tmricht@linux.ibm.com>
---
 include/net/smc.h  |  1 +
 net/smc/smc_ism.c  |  2 ++
 net/smc/smc_pnet.c | 41 +++++++++++++++++++++++++++++++++++++++++
 net/smc/smc_pnet.h |  2 ++
 4 files changed, 46 insertions(+)

diff --git a/include/net/smc.h b/include/net/smc.h
index 824a7af8d654..9ef49f8b1002 100644
--- a/include/net/smc.h
+++ b/include/net/smc.h
@@ -73,6 +73,7 @@ struct smcd_dev {
 	struct smc_connection **conn;
 	struct list_head vlan;
 	struct workqueue_struct *event_wq;
+	u8 pnetid[SMC_MAX_PNETID_LEN];
 };
 
 struct smcd_dev *smcd_alloc_dev(struct device *parent, const char *name,
diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
index ca1ce42fd49f..f44e4dff244a 100644
--- a/net/smc/smc_ism.c
+++ b/net/smc/smc_ism.c
@@ -13,6 +13,7 @@
 #include "smc.h"
 #include "smc_core.h"
 #include "smc_ism.h"
+#include "smc_pnet.h"
 
 struct smcd_dev_list smcd_dev_list = {
 	.list = LIST_HEAD_INIT(smcd_dev_list.list),
@@ -227,6 +228,7 @@ struct smcd_dev *smcd_alloc_dev(struct device *parent, const char *name,
 	device_initialize(&smcd->dev);
 	dev_set_name(&smcd->dev, name);
 	smcd->ops = ops;
+	smc_pnetid_by_dev_port(parent, 0, smcd->pnetid);
 
 	spin_lock_init(&smcd->lock);
 	INIT_LIST_HEAD(&smcd->vlan);
diff --git a/net/smc/smc_pnet.c b/net/smc/smc_pnet.c
index cdc6e23b6ce1..1b6c066d3495 100644
--- a/net/smc/smc_pnet.c
+++ b/net/smc/smc_pnet.c
@@ -22,6 +22,7 @@
 
 #include "smc_pnet.h"
 #include "smc_ib.h"
+#include "smc_ism.h"
 
 static struct nla_policy smc_pnet_policy[SMC_PNETID_MAX + 1] = {
 	[SMC_PNETID_NAME] = {
@@ -564,6 +565,27 @@ static void smc_pnet_find_roce_by_pnetid(struct net_device *ndev,
 	spin_unlock(&smc_ib_devices.lock);
 }
 
+static void smc_pnet_find_ism_by_pnetid(struct net_device *ndev,
+					struct smcd_dev **smcismdev)
+{
+	u8 ndev_pnetid[SMC_MAX_PNETID_LEN];
+	struct smcd_dev *ismdev;
+
+	ndev = pnet_find_base_ndev(ndev);
+	if (smc_pnetid_by_dev_port(ndev->dev.parent, ndev->dev_port,
+				   ndev_pnetid))
+		return; /* pnetid could not be determined */
+
+	spin_lock(&smcd_dev_list.lock);
+	list_for_each_entry(ismdev, &smcd_dev_list.list, list) {
+		if (!memcmp(ismdev->pnetid, ndev_pnetid, SMC_MAX_PNETID_LEN)) {
+			*smcismdev = ismdev;
+			break;
+		}
+	}
+	spin_unlock(&smcd_dev_list.lock);
+}
+
 /* Lookup of coupled ib_device via SMC pnet table */
 static void smc_pnet_find_roce_by_table(struct net_device *netdev,
 					struct smc_ib_device **smcibdev,
@@ -615,3 +637,22 @@ void smc_pnet_find_roce_resource(struct sock *sk,
 out:
 	return;
 }
+
+void smc_pnet_find_ism_resource(struct sock *sk, struct smcd_dev **smcismdev)
+{
+	struct dst_entry *dst = sk_dst_get(sk);
+
+	*smcismdev = NULL;
+	if (!dst)
+		goto out;
+	if (!dst->dev)
+		goto out_rel;
+
+	/* if possible, lookup via hardware-defined pnetid */
+	smc_pnet_find_ism_by_pnetid(dst->dev, smcismdev);
+
+out_rel:
+	dst_release(dst);
+out:
+	return;
+}
diff --git a/net/smc/smc_pnet.h b/net/smc/smc_pnet.h
index ad4455cde9e7..1e94fd4df7bc 100644
--- a/net/smc/smc_pnet.h
+++ b/net/smc/smc_pnet.h
@@ -17,6 +17,7 @@
 #endif
 
 struct smc_ib_device;
+struct smcd_dev;
 
 static inline int smc_pnetid_by_dev_port(struct device *dev,
 					 unsigned short port, u8 *pnetid)
@@ -33,5 +34,6 @@ void smc_pnet_exit(void);
 int smc_pnet_remove_by_ibdev(struct smc_ib_device *ibdev);
 void smc_pnet_find_roce_resource(struct sock *sk,
 				 struct smc_ib_device **smcibdev, u8 *ibport);
+void smc_pnet_find_ism_resource(struct sock *sk, struct smcd_dev **smcismdev);
 
 #endif
-- 
2.16.4


* [PATCH net-next 04/10] net/smc: add base infrastructure for SMC-D and ISM
From: Ursula Braun @ 2018-06-28 17:05 UTC (permalink / raw)
  To: davem
  Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, hwippel,
	sebott, ubraun
In-Reply-To: <20180628170513.5089-1-ubraun@linux.ibm.com>

From: Hans Wippel <hwippel@linux.ibm.com>

SMC supports two variants: SMC-R and SMC-D. For data transport, SMC-R
uses RDMA devices, SMC-D uses so-called Internal Shared Memory (ISM)
devices. An ISM device only allows shared memory communication between
SMC instances on the same machine. For example, this allows virtual
machines on the same host to communicate via SMC without RDMA devices.

This patch adds the base infrastructure for SMC-D and ISM devices to
the existing SMC code. It contains the following:

* ISM driver interface:
  This interface allows an ISM driver to register ISM devices in SMC. In
  the process, the driver provides a set of device ops for each device.
  SMC uses these ops to execute SMC-specific operations on the device
  or to transfer data over it.

* Core SMC-D link group, connection, and buffer support:
  Link groups, SMC connections and SMC buffers (in smc_core) are
  extended to support SMC-D.

* SMC type checks:
  Some type checks are added to prevent using SMC-R specific code for
  SMC-D and vice versa.

To actually use SMC-D, additional changes to pnetid, CLC, CDC, etc. are
required. These are added in follow-up patches.
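
To make the ops-table and type-check ideas above more concrete, here is
a small userspace sketch. It is only a model under stated assumptions:
the toy_* names are invented for this example, the ops table is cut
down to a single hook, and the -1 return for the SMC-R case is a
placeholder rather than what the kernel does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_dev;

/* Hypothetical, trimmed-down ops table; struct smcd_ops has more hooks. */
struct toy_ops {
	int (*move_data)(struct toy_dev *dev, unsigned int offset,
			 const void *data, unsigned int size);
};

struct toy_dev {
	const struct toy_ops *ops;	/* supplied by the driver */
	unsigned int bytes_moved;	/* bookkeeping for this sketch only */
};

struct toy_lgr {
	bool is_smcd;			/* SMC-R or SMC-D */
	struct toy_dev *smcd;		/* only meaningful when is_smcd is set */
};

/* Example driver callback: just counts the bytes it was asked to move. */
int toy_move_data(struct toy_dev *dev, unsigned int offset,
		  const void *data, unsigned int size)
{
	(void)offset;
	(void)data;
	dev->bytes_moved += size;
	return 0;
}

/* Type check first, then call through the driver-provided ops table. */
int toy_lgr_write(struct toy_lgr *lgr, const void *data, unsigned int size)
{
	assert(lgr != NULL);
	if (!lgr->is_smcd)
		return -1;	/* placeholder: SMC-R takes a different path */
	return lgr->smcd->ops->move_data(lgr->smcd, 0, data, size);
}
```

The point of the sketch is the ordering: the is_smcd check guards every
access to the SMC-D-only members, which matters once those members
share a union with the SMC-R ones.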

Signed-off-by: Hans Wippel <hwippel@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Suggested-by: Thomas Richter <tmricht@linux.ibm.com>
---
 include/net/smc.h  |  62 +++++++++++
 net/smc/Makefile   |   2 +-
 net/smc/af_smc.c   |  11 +-
 net/smc/smc_core.c | 270 +++++++++++++++++++++++++++++++++++------------
 net/smc/smc_core.h |  71 +++++++++----
 net/smc/smc_diag.c |   3 +-
 net/smc/smc_ism.c  | 304 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 net/smc/smc_ism.h  |  48 +++++++++
 8 files changed, 679 insertions(+), 92 deletions(-)
 create mode 100644 net/smc/smc_ism.c
 create mode 100644 net/smc/smc_ism.h

diff --git a/include/net/smc.h b/include/net/smc.h
index 2173932fab9d..824a7af8d654 100644
--- a/include/net/smc.h
+++ b/include/net/smc.h
@@ -20,4 +20,66 @@ struct smc_hashinfo {
 
 int smc_hash_sk(struct sock *sk);
 void smc_unhash_sk(struct sock *sk);
+
+/* SMCD/ISM device driver interface */
+struct smcd_dmb {
+	u64 dmb_tok;
+	u64 rgid;
+	u32 dmb_len;
+	u32 sba_idx;
+	u32 vlan_valid;
+	u32 vlan_id;
+	void *cpu_addr;
+	dma_addr_t dma_addr;
+};
+
+#define ISM_EVENT_DMB	0
+#define ISM_EVENT_GID	1
+#define ISM_EVENT_SWR	2
+
+struct smcd_event {
+	u32 type;
+	u32 code;
+	u64 tok;
+	u64 time;
+	u64 info;
+};
+
+struct smcd_dev;
+
+struct smcd_ops {
+	int (*query_remote_gid)(struct smcd_dev *dev, u64 rgid, u32 vid_valid,
+				u32 vid);
+	int (*register_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
+	int (*unregister_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
+	int (*add_vlan_id)(struct smcd_dev *dev, u64 vlan_id);
+	int (*del_vlan_id)(struct smcd_dev *dev, u64 vlan_id);
+	int (*set_vlan_required)(struct smcd_dev *dev);
+	int (*reset_vlan_required)(struct smcd_dev *dev);
+	int (*signal_event)(struct smcd_dev *dev, u64 rgid, u32 trigger_irq,
+			    u32 event_code, u64 info);
+	int (*move_data)(struct smcd_dev *dev, u64 dmb_tok, unsigned int idx,
+			 bool sf, unsigned int offset, void *data,
+			 unsigned int size);
+};
+
+struct smcd_dev {
+	const struct smcd_ops *ops;
+	struct device dev;
+	void *priv;
+	u64 local_gid;
+	struct list_head list;
+	spinlock_t lock;
+	struct smc_connection **conn;
+	struct list_head vlan;
+	struct workqueue_struct *event_wq;
+};
+
+struct smcd_dev *smcd_alloc_dev(struct device *parent, const char *name,
+				const struct smcd_ops *ops, int max_dmbs);
+int smcd_register_dev(struct smcd_dev *smcd);
+void smcd_unregister_dev(struct smcd_dev *smcd);
+void smcd_free_dev(struct smcd_dev *smcd);
+void smcd_handle_event(struct smcd_dev *dev, struct smcd_event *event);
+void smcd_handle_irq(struct smcd_dev *dev, unsigned int bit);
 #endif	/* _SMC_H */
diff --git a/net/smc/Makefile b/net/smc/Makefile
index 188104654b54..4df96b4b8130 100644
--- a/net/smc/Makefile
+++ b/net/smc/Makefile
@@ -1,4 +1,4 @@
 obj-$(CONFIG_SMC)	+= smc.o
 obj-$(CONFIG_SMC_DIAG)	+= smc_diag.o
 smc-y := af_smc.o smc_pnet.o smc_ib.o smc_clc.o smc_core.o smc_wr.o smc_llc.o
-smc-y += smc_cdc.o smc_tx.o smc_rx.o smc_close.o
+smc-y += smc_cdc.o smc_tx.o smc_rx.o smc_close.o smc_ism.o
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index da7f02edcd37..8ce48799cf68 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -475,8 +475,8 @@ static int smc_connect_rdma(struct smc_sock *smc,
 	int reason_code = 0;
 
 	mutex_lock(&smc_create_lgr_pending);
-	local_contact = smc_conn_create(smc, ibdev, ibport, &aclc->lcl,
-					aclc->hdr.flag);
+	local_contact = smc_conn_create(smc, false, aclc->hdr.flag, ibdev,
+					ibport, &aclc->lcl, NULL, 0);
 	if (local_contact < 0) {
 		if (local_contact == -ENOMEM)
 			reason_code = SMC_CLC_DECL_MEM;/* insufficient memory*/
@@ -491,7 +491,7 @@ static int smc_connect_rdma(struct smc_sock *smc,
 	smc_conn_save_peer_info(smc, aclc);
 
 	/* create send buffer and rmb */
-	if (smc_buf_create(smc))
+	if (smc_buf_create(smc, false))
 		return smc_connect_abort(smc, SMC_CLC_DECL_MEM, local_contact);
 
 	if (local_contact == SMC_FIRST_CONTACT)
@@ -894,7 +894,8 @@ static int smc_listen_rdma_init(struct smc_sock *new_smc,
 				int *local_contact)
 {
 	/* allocate connection / link group */
-	*local_contact = smc_conn_create(new_smc, ibdev, ibport, &pclc->lcl, 0);
+	*local_contact = smc_conn_create(new_smc, false, 0, ibdev, ibport,
+					 &pclc->lcl, NULL, 0);
 	if (*local_contact < 0) {
 		if (*local_contact == -ENOMEM)
 			return SMC_CLC_DECL_MEM;/* insufficient memory*/
@@ -902,7 +903,7 @@ static int smc_listen_rdma_init(struct smc_sock *new_smc,
 	}
 
 	/* create send buffer and rmb */
-	if (smc_buf_create(new_smc))
+	if (smc_buf_create(new_smc, false))
 		return SMC_CLC_DECL_MEM;
 
 	return 0;
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index add82b0266f3..daa88db1841a 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -25,6 +25,7 @@
 #include "smc_llc.h"
 #include "smc_cdc.h"
 #include "smc_close.h"
+#include "smc_ism.h"
 
 #define SMC_LGR_NUM_INCR		256
 #define SMC_LGR_FREE_DELAY_SERV		(600 * HZ)
@@ -46,8 +47,8 @@ static void smc_lgr_schedule_free_work(struct smc_link_group *lgr)
 	 * otherwise there is a risk of out-of-sync link groups.
 	 */
 	mod_delayed_work(system_wq, &lgr->free_work,
-			 lgr->role == SMC_CLNT ? SMC_LGR_FREE_DELAY_CLNT :
-						 SMC_LGR_FREE_DELAY_SERV);
+			 (!lgr->is_smcd && lgr->role == SMC_CLNT) ?
+			 SMC_LGR_FREE_DELAY_CLNT : SMC_LGR_FREE_DELAY_SERV);
 }
 
 /* Register connection's alert token in our lookup structure.
@@ -153,16 +154,18 @@ static void smc_lgr_free_work(struct work_struct *work)
 free:
 	spin_unlock_bh(&smc_lgr_list.lock);
 	if (!delayed_work_pending(&lgr->free_work)) {
-		if (lgr->lnk[SMC_SINGLE_LINK].state != SMC_LNK_INACTIVE)
+		if (!lgr->is_smcd &&
+		    lgr->lnk[SMC_SINGLE_LINK].state != SMC_LNK_INACTIVE)
 			smc_llc_link_inactive(&lgr->lnk[SMC_SINGLE_LINK]);
 		smc_lgr_free(lgr);
 	}
 }
 
 /* create a new SMC link group */
-static int smc_lgr_create(struct smc_sock *smc,
+static int smc_lgr_create(struct smc_sock *smc, bool is_smcd,
 			  struct smc_ib_device *smcibdev, u8 ibport,
-			  char *peer_systemid, unsigned short vlan_id)
+			  char *peer_systemid, unsigned short vlan_id,
+			  struct smcd_dev *smcismdev, u64 peer_gid)
 {
 	struct smc_link_group *lgr;
 	struct smc_link *lnk;
@@ -170,17 +173,23 @@ static int smc_lgr_create(struct smc_sock *smc,
 	int rc = 0;
 	int i;
 
+	if (is_smcd && vlan_id) {
+		rc = smc_ism_get_vlan(smcismdev, vlan_id);
+		if (rc)
+			goto out;
+	}
+
 	lgr = kzalloc(sizeof(*lgr), GFP_KERNEL);
 	if (!lgr) {
 		rc = -ENOMEM;
 		goto out;
 	}
-	lgr->role = smc->listen_smc ? SMC_SERV : SMC_CLNT;
+	lgr->is_smcd = is_smcd;
 	lgr->sync_err = 0;
-	memcpy(lgr->peer_systemid, peer_systemid, SMC_SYSTEMID_LEN);
 	lgr->vlan_id = vlan_id;
 	rwlock_init(&lgr->sndbufs_lock);
 	rwlock_init(&lgr->rmbs_lock);
+	rwlock_init(&lgr->conns_lock);
 	for (i = 0; i < SMC_RMBE_SIZES; i++) {
 		INIT_LIST_HEAD(&lgr->sndbufs[i]);
 		INIT_LIST_HEAD(&lgr->rmbs[i]);
@@ -189,36 +198,44 @@ static int smc_lgr_create(struct smc_sock *smc,
 	memcpy(&lgr->id, (u8 *)&smc_lgr_list.num, SMC_LGR_ID_SIZE);
 	INIT_DELAYED_WORK(&lgr->free_work, smc_lgr_free_work);
 	lgr->conns_all = RB_ROOT;
-
-	lnk = &lgr->lnk[SMC_SINGLE_LINK];
-	/* initialize link */
-	lnk->state = SMC_LNK_ACTIVATING;
-	lnk->link_id = SMC_SINGLE_LINK;
-	lnk->smcibdev = smcibdev;
-	lnk->ibport = ibport;
-	lnk->path_mtu = smcibdev->pattr[ibport - 1].active_mtu;
-	if (!smcibdev->initialized)
-		smc_ib_setup_per_ibdev(smcibdev);
-	get_random_bytes(rndvec, sizeof(rndvec));
-	lnk->psn_initial = rndvec[0] + (rndvec[1] << 8) + (rndvec[2] << 16);
-	rc = smc_llc_link_init(lnk);
-	if (rc)
-		goto free_lgr;
-	rc = smc_wr_alloc_link_mem(lnk);
-	if (rc)
-		goto clear_llc_lnk;
-	rc = smc_ib_create_protection_domain(lnk);
-	if (rc)
-		goto free_link_mem;
-	rc = smc_ib_create_queue_pair(lnk);
-	if (rc)
-		goto dealloc_pd;
-	rc = smc_wr_create_link(lnk);
-	if (rc)
-		goto destroy_qp;
-
+	if (is_smcd) {
+		/* SMC-D specific settings */
+		lgr->peer_gid = peer_gid;
+		lgr->smcd = smcismdev;
+	} else {
+		/* SMC-R specific settings */
+		lgr->role = smc->listen_smc ? SMC_SERV : SMC_CLNT;
+		memcpy(lgr->peer_systemid, peer_systemid, SMC_SYSTEMID_LEN);
+
+		lnk = &lgr->lnk[SMC_SINGLE_LINK];
+		/* initialize link */
+		lnk->state = SMC_LNK_ACTIVATING;
+		lnk->link_id = SMC_SINGLE_LINK;
+		lnk->smcibdev = smcibdev;
+		lnk->ibport = ibport;
+		lnk->path_mtu = smcibdev->pattr[ibport - 1].active_mtu;
+		if (!smcibdev->initialized)
+			smc_ib_setup_per_ibdev(smcibdev);
+		get_random_bytes(rndvec, sizeof(rndvec));
+		lnk->psn_initial = rndvec[0] + (rndvec[1] << 8) +
+			(rndvec[2] << 16);
+		rc = smc_llc_link_init(lnk);
+		if (rc)
+			goto free_lgr;
+		rc = smc_wr_alloc_link_mem(lnk);
+		if (rc)
+			goto clear_llc_lnk;
+		rc = smc_ib_create_protection_domain(lnk);
+		if (rc)
+			goto free_link_mem;
+		rc = smc_ib_create_queue_pair(lnk);
+		if (rc)
+			goto dealloc_pd;
+		rc = smc_wr_create_link(lnk);
+		if (rc)
+			goto destroy_qp;
+	}
 	smc->conn.lgr = lgr;
-	rwlock_init(&lgr->conns_lock);
 	spin_lock_bh(&smc_lgr_list.lock);
 	list_add(&lgr->list, &smc_lgr_list.list);
 	spin_unlock_bh(&smc_lgr_list.lock);
@@ -264,7 +281,10 @@ void smc_conn_free(struct smc_connection *conn)
 {
 	if (!conn->lgr)
 		return;
-	smc_cdc_tx_dismiss_slots(conn);
+	if (conn->lgr->is_smcd)
+		smc_ism_unset_conn(conn);
+	else
+		smc_cdc_tx_dismiss_slots(conn);
 	smc_lgr_unregister_conn(conn);
 	smc_buf_unuse(conn);
 }
@@ -280,8 +300,8 @@ static void smc_link_clear(struct smc_link *lnk)
 	smc_wr_free_link_mem(lnk);
 }
 
-static void smc_buf_free(struct smc_link_group *lgr, bool is_rmb,
-			 struct smc_buf_desc *buf_desc)
+static void smcr_buf_free(struct smc_link_group *lgr, bool is_rmb,
+			  struct smc_buf_desc *buf_desc)
 {
 	struct smc_link *lnk = &lgr->lnk[SMC_SINGLE_LINK];
 
@@ -301,6 +321,25 @@ static void smc_buf_free(struct smc_link_group *lgr, bool is_rmb,
 	kfree(buf_desc);
 }
 
+static void smcd_buf_free(struct smc_link_group *lgr, bool is_dmb,
+			  struct smc_buf_desc *buf_desc)
+{
+	if (is_dmb)
+		smc_ism_unregister_dmb(lgr->smcd, buf_desc);
+	else
+		kfree(buf_desc->cpu_addr);
+	kfree(buf_desc);
+}
+
+static void smc_buf_free(struct smc_link_group *lgr, bool is_rmb,
+			 struct smc_buf_desc *buf_desc)
+{
+	if (lgr->is_smcd)
+		smcd_buf_free(lgr, is_rmb, buf_desc);
+	else
+		smcr_buf_free(lgr, is_rmb, buf_desc);
+}
+
 static void __smc_lgr_free_bufs(struct smc_link_group *lgr, bool is_rmb)
 {
 	struct smc_buf_desc *buf_desc, *bf_desc;
@@ -332,7 +371,10 @@ static void smc_lgr_free_bufs(struct smc_link_group *lgr)
 void smc_lgr_free(struct smc_link_group *lgr)
 {
 	smc_lgr_free_bufs(lgr);
-	smc_link_clear(&lgr->lnk[SMC_SINGLE_LINK]);
+	if (lgr->is_smcd)
+		smc_ism_put_vlan(lgr->smcd, lgr->vlan_id);
+	else
+		smc_link_clear(&lgr->lnk[SMC_SINGLE_LINK]);
 	kfree(lgr);
 }
 
@@ -357,7 +399,8 @@ static void __smc_lgr_terminate(struct smc_link_group *lgr)
 	lgr->terminating = 1;
 	if (!list_empty(&lgr->list)) /* forget lgr */
 		list_del_init(&lgr->list);
-	smc_llc_link_inactive(&lgr->lnk[SMC_SINGLE_LINK]);
+	if (!lgr->is_smcd)
+		smc_llc_link_inactive(&lgr->lnk[SMC_SINGLE_LINK]);
 
 	write_lock_bh(&lgr->conns_lock);
 	node = rb_first(&lgr->conns_all);
@@ -374,7 +417,8 @@ static void __smc_lgr_terminate(struct smc_link_group *lgr)
 		node = rb_first(&lgr->conns_all);
 	}
 	write_unlock_bh(&lgr->conns_lock);
-	wake_up(&lgr->lnk[SMC_SINGLE_LINK].wr_reg_wait);
+	if (!lgr->is_smcd)
+		wake_up(&lgr->lnk[SMC_SINGLE_LINK].wr_reg_wait);
 	smc_lgr_schedule_free_work(lgr);
 }
 
@@ -392,13 +436,40 @@ void smc_port_terminate(struct smc_ib_device *smcibdev, u8 ibport)
 
 	spin_lock_bh(&smc_lgr_list.lock);
 	list_for_each_entry_safe(lgr, l, &smc_lgr_list.list, list) {
-		if (lgr->lnk[SMC_SINGLE_LINK].smcibdev == smcibdev &&
+		if (!lgr->is_smcd &&
+		    lgr->lnk[SMC_SINGLE_LINK].smcibdev == smcibdev &&
 		    lgr->lnk[SMC_SINGLE_LINK].ibport == ibport)
 			__smc_lgr_terminate(lgr);
 	}
 	spin_unlock_bh(&smc_lgr_list.lock);
 }
 
+/* Called when SMC-D device is terminated or peer is lost */
+void smc_smcd_terminate(struct smcd_dev *dev, u64 peer_gid)
+{
+	struct smc_link_group *lgr, *l;
+	LIST_HEAD(lgr_free_list);
+
+	/* run common cleanup function and build free list */
+	spin_lock_bh(&smc_lgr_list.lock);
+	list_for_each_entry_safe(lgr, l, &smc_lgr_list.list, list) {
+		if (lgr->is_smcd && lgr->smcd == dev &&
+		    (!peer_gid || lgr->peer_gid == peer_gid) &&
+		    !list_empty(&lgr->list)) {
+			__smc_lgr_terminate(lgr);
+			list_move(&lgr->list, &lgr_free_list);
+		}
+	}
+	spin_unlock_bh(&smc_lgr_list.lock);
+
+	/* cancel the regular free workers and actually free lgrs */
+	list_for_each_entry_safe(lgr, l, &lgr_free_list, list) {
+		list_del_init(&lgr->list);
+		cancel_delayed_work_sync(&lgr->free_work);
+		smc_lgr_free(lgr);
+	}
+}
+
 /* Determine vlan of internal TCP socket.
  * @vlan_id: address to store the determined vlan id into
  */
@@ -477,10 +548,30 @@ static int smc_link_determine_gid(struct smc_link_group *lgr)
 	return -ENODEV;
 }
 
+static bool smcr_lgr_match(struct smc_link_group *lgr,
+			   struct smc_clc_msg_local *lcl,
+			   enum smc_lgr_role role)
+{
+	return !memcmp(lgr->peer_systemid, lcl->id_for_peer,
+		       SMC_SYSTEMID_LEN) &&
+		!memcmp(lgr->lnk[SMC_SINGLE_LINK].peer_gid, &lcl->gid,
+			SMC_GID_SIZE) &&
+		!memcmp(lgr->lnk[SMC_SINGLE_LINK].peer_mac, lcl->mac,
+			sizeof(lcl->mac)) &&
+		lgr->role == role;
+}
+
+static bool smcd_lgr_match(struct smc_link_group *lgr,
+			   struct smcd_dev *smcismdev, u64 peer_gid)
+{
+	return lgr->peer_gid == peer_gid && lgr->smcd == smcismdev;
+}
+
 /* create a new SMC connection (and a new link group if necessary) */
-int smc_conn_create(struct smc_sock *smc,
+int smc_conn_create(struct smc_sock *smc, bool is_smcd, int srv_first_contact,
 		    struct smc_ib_device *smcibdev, u8 ibport,
-		    struct smc_clc_msg_local *lcl, int srv_first_contact)
+		    struct smc_clc_msg_local *lcl, struct smcd_dev *smcd,
+		    u64 peer_gid)
 {
 	struct smc_connection *conn = &smc->conn;
 	int local_contact = SMC_FIRST_CONTACT;
@@ -502,17 +593,12 @@ int smc_conn_create(struct smc_sock *smc,
 	spin_lock_bh(&smc_lgr_list.lock);
 	list_for_each_entry(lgr, &smc_lgr_list.list, list) {
 		write_lock_bh(&lgr->conns_lock);
-		if (!memcmp(lgr->peer_systemid, lcl->id_for_peer,
-			    SMC_SYSTEMID_LEN) &&
-		    !memcmp(lgr->lnk[SMC_SINGLE_LINK].peer_gid, &lcl->gid,
-			    SMC_GID_SIZE) &&
-		    !memcmp(lgr->lnk[SMC_SINGLE_LINK].peer_mac, lcl->mac,
-			    sizeof(lcl->mac)) &&
+		if ((is_smcd ? smcd_lgr_match(lgr, smcd, peer_gid) :
+		     smcr_lgr_match(lgr, lcl, role)) &&
 		    !lgr->sync_err &&
-		    (lgr->role == role) &&
-		    (lgr->vlan_id == vlan_id) &&
-		    ((role == SMC_CLNT) ||
-		     (lgr->conns_num < SMC_RMBS_PER_LGR_MAX))) {
+		    lgr->vlan_id == vlan_id &&
+		    (role == SMC_CLNT ||
+		     lgr->conns_num < SMC_RMBS_PER_LGR_MAX)) {
 			/* link group found */
 			local_contact = SMC_REUSE_CONTACT;
 			conn->lgr = lgr;
@@ -535,12 +621,13 @@ int smc_conn_create(struct smc_sock *smc,
 
 create:
 	if (local_contact == SMC_FIRST_CONTACT) {
-		rc = smc_lgr_create(smc, smcibdev, ibport,
-				    lcl->id_for_peer, vlan_id);
+		rc = smc_lgr_create(smc, is_smcd, smcibdev, ibport,
+				    lcl->id_for_peer, vlan_id, smcd, peer_gid);
 		if (rc)
 			goto out;
 		smc_lgr_register_conn(conn); /* add smc conn to lgr */
-		rc = smc_link_determine_gid(conn->lgr);
+		if (!is_smcd)
+			rc = smc_link_determine_gid(conn->lgr);
 	}
 	conn->local_tx_ctrl.common.type = SMC_CDC_MSG_TYPE;
 	conn->local_tx_ctrl.len = SMC_WR_TX_SIZE;
@@ -609,8 +696,8 @@ static inline int smc_rmb_wnd_update_limit(int rmbe_size)
 	return min_t(int, rmbe_size / 10, SOCK_MIN_SNDBUF / 2);
 }
 
-static struct smc_buf_desc *smc_new_buf_create(struct smc_link_group *lgr,
-					       bool is_rmb, int bufsize)
+static struct smc_buf_desc *smcr_new_buf_create(struct smc_link_group *lgr,
+						bool is_rmb, int bufsize)
 {
 	struct smc_buf_desc *buf_desc;
 	struct smc_link *lnk;
@@ -668,7 +755,43 @@ static struct smc_buf_desc *smc_new_buf_create(struct smc_link_group *lgr,
 	return buf_desc;
 }
 
-static int __smc_buf_create(struct smc_sock *smc, bool is_rmb)
+#define SMCD_DMBE_SIZES		7 /* 0 -> 16KB, 1 -> 32KB, .. 6 -> 1MB */
+
+static struct smc_buf_desc *smcd_new_buf_create(struct smc_link_group *lgr,
+						bool is_dmb, int bufsize)
+{
+	struct smc_buf_desc *buf_desc;
+	int rc;
+
+	if (smc_compress_bufsize(bufsize) > SMCD_DMBE_SIZES)
+		return ERR_PTR(-EAGAIN);
+
+	/* try to alloc a new DMB */
+	buf_desc = kzalloc(sizeof(*buf_desc), GFP_KERNEL);
+	if (!buf_desc)
+		return ERR_PTR(-ENOMEM);
+	if (is_dmb) {
+		rc = smc_ism_register_dmb(lgr, bufsize, buf_desc);
+		if (rc) {
+			kfree(buf_desc);
+			return ERR_PTR(-EAGAIN);
+		}
+		memset(buf_desc->cpu_addr, 0, bufsize);
+		buf_desc->len = bufsize;
+	} else {
+		buf_desc->cpu_addr = kzalloc(bufsize, GFP_KERNEL |
+					     __GFP_NOWARN | __GFP_NORETRY |
+					     __GFP_NOMEMALLOC);
+		if (!buf_desc->cpu_addr) {
+			kfree(buf_desc);
+			return ERR_PTR(-EAGAIN);
+		}
+		buf_desc->len = bufsize;
+	}
+	return buf_desc;
+}
+
+static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb)
 {
 	struct smc_buf_desc *buf_desc = ERR_PTR(-ENOMEM);
 	struct smc_connection *conn = &smc->conn;
@@ -706,7 +829,11 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_rmb)
 			break; /* found reusable slot */
 		}
 
-		buf_desc = smc_new_buf_create(lgr, is_rmb, bufsize);
+		if (is_smcd)
+			buf_desc = smcd_new_buf_create(lgr, is_rmb, bufsize);
+		else
+			buf_desc = smcr_new_buf_create(lgr, is_rmb, bufsize);
+
 		if (PTR_ERR(buf_desc) == -ENOMEM)
 			break;
 		if (IS_ERR(buf_desc))
@@ -728,6 +855,8 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_rmb)
 		smc->sk.sk_rcvbuf = bufsize * 2;
 		atomic_set(&conn->bytes_to_rcv, 0);
 		conn->rmbe_update_limit = smc_rmb_wnd_update_limit(bufsize);
+		if (is_smcd)
+			smc_ism_set_conn(conn); /* map RMB/smcd_dev to conn */
 	} else {
 		conn->sndbuf_desc = buf_desc;
 		smc->sk.sk_sndbuf = bufsize * 2;
@@ -740,6 +869,8 @@ void smc_sndbuf_sync_sg_for_cpu(struct smc_connection *conn)
 {
 	struct smc_link_group *lgr = conn->lgr;
 
+	if (!conn->lgr || conn->lgr->is_smcd)
+		return;
 	smc_ib_sync_sg_for_cpu(lgr->lnk[SMC_SINGLE_LINK].smcibdev,
 			       conn->sndbuf_desc, DMA_TO_DEVICE);
 }
@@ -748,6 +879,8 @@ void smc_sndbuf_sync_sg_for_device(struct smc_connection *conn)
 {
 	struct smc_link_group *lgr = conn->lgr;
 
+	if (!conn->lgr || conn->lgr->is_smcd)
+		return;
 	smc_ib_sync_sg_for_device(lgr->lnk[SMC_SINGLE_LINK].smcibdev,
 				  conn->sndbuf_desc, DMA_TO_DEVICE);
 }
@@ -756,6 +889,8 @@ void smc_rmb_sync_sg_for_cpu(struct smc_connection *conn)
 {
 	struct smc_link_group *lgr = conn->lgr;
 
+	if (!conn->lgr || conn->lgr->is_smcd)
+		return;
 	smc_ib_sync_sg_for_cpu(lgr->lnk[SMC_SINGLE_LINK].smcibdev,
 			       conn->rmb_desc, DMA_FROM_DEVICE);
 }
@@ -764,6 +899,8 @@ void smc_rmb_sync_sg_for_device(struct smc_connection *conn)
 {
 	struct smc_link_group *lgr = conn->lgr;
 
+	if (!conn->lgr || conn->lgr->is_smcd)
+		return;
 	smc_ib_sync_sg_for_device(lgr->lnk[SMC_SINGLE_LINK].smcibdev,
 				  conn->rmb_desc, DMA_FROM_DEVICE);
 }
@@ -774,16 +911,16 @@ void smc_rmb_sync_sg_for_device(struct smc_connection *conn)
  * the Linux implementation uses just one RMB-element per RMB, i.e. uses an
  * extra RMB for every connection in a link group
  */
-int smc_buf_create(struct smc_sock *smc)
+int smc_buf_create(struct smc_sock *smc, bool is_smcd)
 {
 	int rc;
 
 	/* create send buffer */
-	rc = __smc_buf_create(smc, false);
+	rc = __smc_buf_create(smc, is_smcd, false);
 	if (rc)
 		return rc;
 	/* create rmb */
-	rc = __smc_buf_create(smc, true);
+	rc = __smc_buf_create(smc, is_smcd, true);
 	if (rc)
 		smc_buf_free(smc->conn.lgr, false, smc->conn.sndbuf_desc);
 	return rc;
@@ -865,7 +1002,8 @@ void smc_core_exit(void)
 	spin_unlock_bh(&smc_lgr_list.lock);
 	list_for_each_entry_safe(lgr, lg, &lgr_freeing_list, list) {
 		list_del_init(&lgr->list);
-		smc_llc_link_inactive(&lgr->lnk[SMC_SINGLE_LINK]);
+		if (!lgr->is_smcd)
+			smc_llc_link_inactive(&lgr->lnk[SMC_SINGLE_LINK]);
 		cancel_delayed_work_sync(&lgr->free_work);
 		smc_lgr_free(lgr); /* free link group */
 	}
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index 93cb3523bf50..cd9268a9570e 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -124,15 +124,28 @@ struct smc_buf_desc {
 	void			*cpu_addr;	/* virtual address of buffer */
 	struct page		*pages;
 	int			len;		/* length of buffer */
-	struct sg_table		sgt[SMC_LINKS_PER_LGR_MAX];/* virtual buffer */
-	struct ib_mr		*mr_rx[SMC_LINKS_PER_LGR_MAX];
-						/* for rmb only: memory region
-						 * incl. rkey provided to peer
-						 */
-	u32			order;		/* allocation order */
 	u32			used;		/* currently used / unused */
 	u8			reused	: 1;	/* new created / reused */
 	u8			regerr	: 1;	/* err during registration */
+	union {
+		struct { /* SMC-R */
+			struct sg_table		sgt[SMC_LINKS_PER_LGR_MAX];
+						/* virtual buffer */
+			struct ib_mr		*mr_rx[SMC_LINKS_PER_LGR_MAX];
+						/* for rmb only: memory region
+						 * incl. rkey provided to peer
+						 */
+			u32			order;	/* allocation order */
+		};
+		struct { /* SMC-D */
+			unsigned short		sba_idx;
+						/* SBA index number */
+			u64			token;
+						/* DMB token number */
+			dma_addr_t		dma_addr;
+						/* DMA address */
+		};
+	};
 };
 
 struct smc_rtoken {				/* address/key of remote RMB */
@@ -148,12 +161,10 @@ struct smc_rtoken {				/* address/key of remote RMB */
  * struct smc_clc_msg_accept_confirm.rmbe_size being a 4 bit value (0..15)
  */
 
+struct smcd_dev;
+
 struct smc_link_group {
 	struct list_head	list;
-	enum smc_lgr_role	role;		/* client or server */
-	struct smc_link		lnk[SMC_LINKS_PER_LGR_MAX];	/* smc link */
-	char			peer_systemid[SMC_SYSTEMID_LEN];
-						/* unique system_id of peer */
 	struct rb_root		conns_all;	/* connection tree */
 	rwlock_t		conns_lock;	/* protects conns_all */
 	unsigned int		conns_num;	/* current # of connections */
@@ -163,17 +174,35 @@ struct smc_link_group {
 	rwlock_t		sndbufs_lock;	/* protects tx buffers */
 	struct list_head	rmbs[SMC_RMBE_SIZES];	/* rx buffers */
 	rwlock_t		rmbs_lock;	/* protects rx buffers */
-	struct smc_rtoken	rtokens[SMC_RMBS_PER_LGR_MAX]
-				       [SMC_LINKS_PER_LGR_MAX];
-						/* remote addr/key pairs */
-	unsigned long		rtokens_used_mask[BITS_TO_LONGS(
-							SMC_RMBS_PER_LGR_MAX)];
-						/* used rtoken elements */
 
 	u8			id[SMC_LGR_ID_SIZE];	/* unique lgr id */
 	struct delayed_work	free_work;	/* delayed freeing of an lgr */
 	u8			sync_err : 1;	/* lgr no longer fits to peer */
 	u8			terminating : 1;/* lgr is terminating */
+
+	bool			is_smcd;	/* SMC-R or SMC-D */
+	union {
+		struct { /* SMC-R */
+			enum smc_lgr_role	role;
+						/* client or server */
+			struct smc_link		lnk[SMC_LINKS_PER_LGR_MAX];
+						/* smc link */
+			char			peer_systemid[SMC_SYSTEMID_LEN];
+						/* unique system_id of peer */
+			struct smc_rtoken	rtokens[SMC_RMBS_PER_LGR_MAX]
+						[SMC_LINKS_PER_LGR_MAX];
+						/* remote addr/key pairs */
+			unsigned long		rtokens_used_mask[BITS_TO_LONGS
+							(SMC_RMBS_PER_LGR_MAX)];
+						/* used rtoken elements */
+		};
+		struct { /* SMC-D */
+			u64			peer_gid;
+						/* Peer GID (remote) */
+			struct smcd_dev		*smcd;
+						/* ISM device for VLAN reg. */
+		};
+	};
 };
 
 /* Find the connection associated with the given alert token in the link group.
@@ -217,7 +246,8 @@ void smc_lgr_free(struct smc_link_group *lgr);
 void smc_lgr_forget(struct smc_link_group *lgr);
 void smc_lgr_terminate(struct smc_link_group *lgr);
 void smc_port_terminate(struct smc_ib_device *smcibdev, u8 ibport);
-int smc_buf_create(struct smc_sock *smc);
+void smc_smcd_terminate(struct smcd_dev *dev, u64 peer_gid);
+int smc_buf_create(struct smc_sock *smc, bool is_smcd);
 int smc_uncompress_bufsize(u8 compressed);
 int smc_rmb_rtoken_handling(struct smc_connection *conn,
 			    struct smc_clc_msg_accept_confirm *clc);
@@ -227,9 +257,12 @@ void smc_sndbuf_sync_sg_for_cpu(struct smc_connection *conn);
 void smc_sndbuf_sync_sg_for_device(struct smc_connection *conn);
 void smc_rmb_sync_sg_for_cpu(struct smc_connection *conn);
 void smc_rmb_sync_sg_for_device(struct smc_connection *conn);
+
 void smc_conn_free(struct smc_connection *conn);
-int smc_conn_create(struct smc_sock *smc,
+int smc_conn_create(struct smc_sock *smc, bool is_smcd, int srv_first_contact,
 		    struct smc_ib_device *smcibdev, u8 ibport,
-		    struct smc_clc_msg_local *lcl, int srv_first_contact);
+		    struct smc_clc_msg_local *lcl, struct smcd_dev *smcd,
+		    u64 peer_gid);
+void smcd_conn_free(struct smc_connection *conn);
 void smc_core_exit(void);
 #endif
diff --git a/net/smc/smc_diag.c b/net/smc/smc_diag.c
index 839354402215..64ce107c24d9 100644
--- a/net/smc/smc_diag.c
+++ b/net/smc/smc_diag.c
@@ -136,7 +136,8 @@ static int __smc_diag_dump(struct sock *sk, struct sk_buff *skb,
 			goto errout;
 	}
 
-	if ((req->diag_ext & (1 << (SMC_DIAG_LGRINFO - 1))) && smc->conn.lgr &&
+	if (smc->conn.lgr && !smc->conn.lgr->is_smcd &&
+	    (req->diag_ext & (1 << (SMC_DIAG_LGRINFO - 1))) &&
 	    !list_empty(&smc->conn.lgr->list)) {
 		struct smc_diag_lgrinfo linfo = {
 			.role = smc->conn.lgr->role,
diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
new file mode 100644
index 000000000000..ca1ce42fd49f
--- /dev/null
+++ b/net/smc/smc_ism.c
@@ -0,0 +1,304 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Shared Memory Communications Direct over ISM devices (SMC-D)
+ *
+ * Functions for ISM device.
+ *
+ * Copyright IBM Corp. 2018
+ */
+
+#include <linux/spinlock.h>
+#include <linux/slab.h>
+#include <asm/page.h>
+
+#include "smc.h"
+#include "smc_core.h"
+#include "smc_ism.h"
+
+struct smcd_dev_list smcd_dev_list = {
+	.list = LIST_HEAD_INIT(smcd_dev_list.list),
+	.lock = __SPIN_LOCK_UNLOCKED(smcd_dev_list.lock)
+};
+
+/* Test if an ISM communication is possible. */
+int smc_ism_cantalk(u64 peer_gid, unsigned short vlan_id, struct smcd_dev *smcd)
+{
+	return smcd->ops->query_remote_gid(smcd, peer_gid, vlan_id ? 1 : 0,
+					   vlan_id);
+}
+
+int smc_ism_write(struct smcd_dev *smcd, const struct smc_ism_position *pos,
+		  void *data, size_t len)
+{
+	int rc;
+
+	rc = smcd->ops->move_data(smcd, pos->token, pos->index, pos->signal,
+				  pos->offset, data, len);
+
+	return rc < 0 ? rc : 0;
+}
+
+/* Set a connection using this DMBE. */
+void smc_ism_set_conn(struct smc_connection *conn)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&conn->lgr->smcd->lock, flags);
+	conn->lgr->smcd->conn[conn->rmb_desc->sba_idx] = conn;
+	spin_unlock_irqrestore(&conn->lgr->smcd->lock, flags);
+}
+
+/* Unset a connection using this DMBE. */
+void smc_ism_unset_conn(struct smc_connection *conn)
+{
+	unsigned long flags;
+
+	if (!conn->rmb_desc)
+		return;
+
+	spin_lock_irqsave(&conn->lgr->smcd->lock, flags);
+	conn->lgr->smcd->conn[conn->rmb_desc->sba_idx] = NULL;
+	spin_unlock_irqrestore(&conn->lgr->smcd->lock, flags);
+}
+
+/* Register a VLAN identifier with the ISM device. Use a reference count
+ * and add a VLAN identifier only when the first DMB using this VLAN is
+ * registered.
+ */
+int smc_ism_get_vlan(struct smcd_dev *smcd, unsigned short vlanid)
+{
+	struct smc_ism_vlanid *new_vlan, *vlan;
+	unsigned long flags;
+	int rc = 0;
+
+	if (!vlanid)			/* No valid vlan id */
+		return -EINVAL;
+
+	/* create new vlan entry, in case we need it */
+	new_vlan = kzalloc(sizeof(*new_vlan), GFP_KERNEL);
+	if (!new_vlan)
+		return -ENOMEM;
+	new_vlan->vlanid = vlanid;
+	refcount_set(&new_vlan->refcnt, 1);
+
+	/* if there is an existing entry, increase count and return */
+	spin_lock_irqsave(&smcd->lock, flags);
+	list_for_each_entry(vlan, &smcd->vlan, list) {
+		if (vlan->vlanid == vlanid) {
+			refcount_inc(&vlan->refcnt);
+			kfree(new_vlan);
+			goto out;
+		}
+	}
+
+	/* no existing entry found.
+	 * add new entry to device; might fail, e.g., if HW limit reached
+	 */
+	if (smcd->ops->add_vlan_id(smcd, vlanid)) {
+		kfree(new_vlan);
+		rc = -EIO;
+		goto out;
+	}
+	list_add_tail(&new_vlan->list, &smcd->vlan);
+out:
+	spin_unlock_irqrestore(&smcd->lock, flags);
+	return rc;
+}
+
+/* Unregister a VLAN identifier with the ISM device. Use a reference count
+ * and remove a VLAN identifier only when the last DMB using this VLAN is
+ * unregistered.
+ */
+int smc_ism_put_vlan(struct smcd_dev *smcd, unsigned short vlanid)
+{
+	struct smc_ism_vlanid *vlan;
+	unsigned long flags;
+	bool found = false;
+	int rc = 0;
+
+	if (!vlanid)			/* No valid vlan id */
+		return -EINVAL;
+
+	spin_lock_irqsave(&smcd->lock, flags);
+	list_for_each_entry(vlan, &smcd->vlan, list) {
+		if (vlan->vlanid == vlanid) {
+			if (!refcount_dec_and_test(&vlan->refcnt))
+				goto out;
+			found = true;
+			break;
+		}
+	}
+	if (!found) {
+		rc = -ENOENT;
+		goto out;		/* VLAN id not in table */
+	}
+
+	/* Found, and the last reference has just been dropped */
+	if (smcd->ops->del_vlan_id(smcd, vlanid))
+		rc = -EIO;
+	list_del(&vlan->list);
+	kfree(vlan);
+out:
+	spin_unlock_irqrestore(&smcd->lock, flags);
+	return rc;
+}
+
+int smc_ism_unregister_dmb(struct smcd_dev *smcd, struct smc_buf_desc *dmb_desc)
+{
+	struct smcd_dmb dmb;
+
+	memset(&dmb, 0, sizeof(dmb));
+	dmb.dmb_tok = dmb_desc->token;
+	dmb.sba_idx = dmb_desc->sba_idx;
+	dmb.cpu_addr = dmb_desc->cpu_addr;
+	dmb.dma_addr = dmb_desc->dma_addr;
+	dmb.dmb_len = dmb_desc->len;
+	return smcd->ops->unregister_dmb(smcd, &dmb);
+}
+
+int smc_ism_register_dmb(struct smc_link_group *lgr, int dmb_len,
+			 struct smc_buf_desc *dmb_desc)
+{
+	struct smcd_dmb dmb;
+	int rc;
+
+	memset(&dmb, 0, sizeof(dmb));
+	dmb.dmb_len = dmb_len;
+	dmb.sba_idx = dmb_desc->sba_idx;
+	dmb.vlan_id = lgr->vlan_id;
+	dmb.rgid = lgr->peer_gid;
+	rc = lgr->smcd->ops->register_dmb(lgr->smcd, &dmb);
+	if (!rc) {
+		dmb_desc->sba_idx = dmb.sba_idx;
+		dmb_desc->token = dmb.dmb_tok;
+		dmb_desc->cpu_addr = dmb.cpu_addr;
+		dmb_desc->dma_addr = dmb.dma_addr;
+		dmb_desc->len = dmb.dmb_len;
+	}
+	return rc;
+}
+
+struct smc_ism_event_work {
+	struct work_struct work;
+	struct smcd_dev *smcd;
+	struct smcd_event event;
+};
+
+/* worker for SMC-D events */
+static void smc_ism_event_work(struct work_struct *work)
+{
+	struct smc_ism_event_work *wrk =
+		container_of(work, struct smc_ism_event_work, work);
+
+	switch (wrk->event.type) {
+	case ISM_EVENT_GID:	/* GID event, token is peer GID */
+		smc_smcd_terminate(wrk->smcd, wrk->event.tok);
+		break;
+	case ISM_EVENT_DMB:
+		break;
+	}
+	kfree(wrk);
+}
+
+static void smcd_release(struct device *dev)
+{
+	struct smcd_dev *smcd = container_of(dev, struct smcd_dev, dev);
+
+	kfree(smcd->conn);
+	kfree(smcd);
+}
+
+struct smcd_dev *smcd_alloc_dev(struct device *parent, const char *name,
+				const struct smcd_ops *ops, int max_dmbs)
+{
+	struct smcd_dev *smcd;
+
+	smcd = kzalloc(sizeof(*smcd), GFP_KERNEL);
+	if (!smcd)
+		return NULL;
+	smcd->conn = kcalloc(max_dmbs, sizeof(struct smc_connection *),
+			     GFP_KERNEL);
+	if (!smcd->conn) {
+		kfree(smcd);
+		return NULL;
+	}
+
+	smcd->dev.parent = parent;
+	smcd->dev.release = smcd_release;
+	device_initialize(&smcd->dev);
+	dev_set_name(&smcd->dev, name);
+	smcd->ops = ops;
+
+	spin_lock_init(&smcd->lock);
+	INIT_LIST_HEAD(&smcd->vlan);
+	smcd->event_wq = alloc_ordered_workqueue("ism_evt_wq-%s",
+						 WQ_MEM_RECLAIM, name);
+	return smcd;
+}
+EXPORT_SYMBOL_GPL(smcd_alloc_dev);
+
+int smcd_register_dev(struct smcd_dev *smcd)
+{
+	spin_lock(&smcd_dev_list.lock);
+	list_add_tail(&smcd->list, &smcd_dev_list.list);
+	spin_unlock(&smcd_dev_list.lock);
+
+	return device_add(&smcd->dev);
+}
+EXPORT_SYMBOL_GPL(smcd_register_dev);
+
+void smcd_unregister_dev(struct smcd_dev *smcd)
+{
+	spin_lock(&smcd_dev_list.lock);
+	list_del(&smcd->list);
+	spin_unlock(&smcd_dev_list.lock);
+	flush_workqueue(smcd->event_wq);
+	destroy_workqueue(smcd->event_wq);
+	smc_smcd_terminate(smcd, 0);
+
+	device_del(&smcd->dev);
+}
+EXPORT_SYMBOL_GPL(smcd_unregister_dev);
+
+void smcd_free_dev(struct smcd_dev *smcd)
+{
+	put_device(&smcd->dev);
+}
+EXPORT_SYMBOL_GPL(smcd_free_dev);
+
+/* SMCD Device event handler. Called from ISM device interrupt handler.
+ * Parameters are smcd device pointer,
+ * - event->type (0 --> DMB, 1 --> GID),
+ * - event->code (event code),
+ * - event->tok (either DMB token when event type 0, or GID when event type 1)
+ * - event->time (time of day)
+ * - event->info (debug info).
+ *
+ * Context:
+ * - Function called in IRQ context from ISM device driver event handler.
+ */
+void smcd_handle_event(struct smcd_dev *smcd, struct smcd_event *event)
+{
+	struct smc_ism_event_work *wrk;
+
+	/* copy event to event work queue, and let it be handled there */
+	wrk = kmalloc(sizeof(*wrk), GFP_ATOMIC);
+	if (!wrk)
+		return;
+	INIT_WORK(&wrk->work, smc_ism_event_work);
+	wrk->smcd = smcd;
+	wrk->event = *event;
+	queue_work(smcd->event_wq, &wrk->work);
+}
+EXPORT_SYMBOL_GPL(smcd_handle_event);
+
+/* SMCD Device interrupt handler. Called from ISM device interrupt handler.
+ * Parameters are smcd device pointer and DMB number. Find the connection and
+ * schedule the tasklet for this connection.
+ *
+ * Context:
+ * - Function called in IRQ context from ISM device driver IRQ handler.
+ */
+void smcd_handle_irq(struct smcd_dev *smcd, unsigned int dmbno)
+{
+}
+EXPORT_SYMBOL_GPL(smcd_handle_irq);
diff --git a/net/smc/smc_ism.h b/net/smc/smc_ism.h
new file mode 100644
index 000000000000..aee45b860b79
--- /dev/null
+++ b/net/smc/smc_ism.h
@@ -0,0 +1,48 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Shared Memory Communications Direct over ISM devices (SMC-D)
+ *
+ * SMC-D ISM device structure definitions.
+ *
+ * Copyright IBM Corp. 2018
+ */
+
+#ifndef SMCD_ISM_H
+#define SMCD_ISM_H
+
+#include <linux/uio.h>
+
+#include "smc.h"
+
+struct smcd_dev_list {	/* List of SMCD devices */
+	struct list_head list;
+	spinlock_t lock;	/* Protects list of devices */
+};
+
+extern struct smcd_dev_list	smcd_dev_list; /* list of smcd devices */
+
+struct smc_ism_vlanid {			/* VLAN id set on ISM device */
+	struct list_head list;
+	unsigned short vlanid;		/* Vlan id */
+	refcount_t refcnt;		/* Reference count */
+};
+
+struct smc_ism_position {	/* ISM device position to write to */
+	u64 token;		/* Token of DMB */
+	u32 offset;		/* Offset into DMBE */
+	u8 index;		/* Index of DMBE */
+	u8 signal;		/* Generate interrupt on owner side */
+};
+
+struct smcd_dev;
+
+int smc_ism_cantalk(u64 peer_gid, unsigned short vlan_id, struct smcd_dev *dev);
+void smc_ism_set_conn(struct smc_connection *conn);
+void smc_ism_unset_conn(struct smc_connection *conn);
+int smc_ism_get_vlan(struct smcd_dev *dev, unsigned short vlan_id);
+int smc_ism_put_vlan(struct smcd_dev *dev, unsigned short vlan_id);
+int smc_ism_register_dmb(struct smc_link_group *lgr, int buf_size,
+			 struct smc_buf_desc *dmb_desc);
+int smc_ism_unregister_dmb(struct smcd_dev *dev, struct smc_buf_desc *dmb_desc);
+int smc_ism_write(struct smcd_dev *dev, const struct smc_ism_position *pos,
+		  void *data, size_t len);
+#endif
-- 
2.16.4


* [PATCH net-next 06/10] net/smc: add SMC-D support in CLC messages
From: Ursula Braun @ 2018-06-28 17:05 UTC (permalink / raw)
  To: davem
  Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, hwippel,
	sebott, ubraun
In-Reply-To: <20180628170513.5089-1-ubraun@linux.ibm.com>

From: Hans Wippel <hwippel@linux.ibm.com>

There are two types of SMC: SMC-R and SMC-D. These types are signaled
within the CLC messages during the CLC handshake. This patch adds
support for and checks of the SMC type.

Also, SMC-R and SMC-D need to exchange different information during the
CLC handshake. So, this patch extends the current message formats to
support the SMC-D header fields. The Proposal message can contain both
SMC-R and SMC-D information. The Accept and Confirm messages contain
either SMC-R or SMC-D information.
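
For illustration only, the type rules described above can be sketched
as a small standalone C program. This is not the code from the patch:
the enum and function names are hypothetical, and only the type values
(0, 1, 3) and the Accept/Confirm lengths (68 and 48 bytes) are taken
from the diff below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Type values as defined by this patch (SMC_TYPE_R/D/B). */
enum sketch_smc_type { TYPE_R = 0, TYPE_D = 1, TYPE_B = 3 };

/* Hypothetical message kinds, used only by this sketch. */
enum sketch_msg { MSG_PROPOSAL, MSG_ACCEPT, MSG_CONFIRM };

/* A Proposal may announce SMC-R, SMC-D, or both; Accept and Confirm
 * must settle on exactly one of SMC-R or SMC-D.
 */
static bool clc_path_valid(enum sketch_msg msg, unsigned int path)
{
	assert(path <= 3);	/* the path field is 2 bits wide */

	if (msg == MSG_PROPOSAL)
		return path == TYPE_R || path == TYPE_D || path == TYPE_B;
	return path == TYPE_R || path == TYPE_D;
}

/* Expected on-wire length of an Accept/Confirm message for a given
 * type, or 0 if the type is not allowed in these messages.
 */
static size_t accept_confirm_len(unsigned int path)
{
	switch (path) {
	case TYPE_R:
		return 68;	/* SMCR_CLC_ACCEPT_CONFIRM_LEN */
	case TYPE_D:
		return 48;	/* SMCD_CLC_ACCEPT_CONFIRM_LEN */
	default:
		return 0;
	}
}
```

Because the two Accept/Confirm layouts differ in size, a receiver has
to locate the trailing eye catcher from the length in the header rather
than from a fixed structure offset.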

Signed-off-by: Hans Wippel <hwippel@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Suggested-by: Thomas Richter <tmricht@linux.ibm.com>
---
 net/smc/af_smc.c  |   9 +--
 net/smc/smc_clc.c | 193 ++++++++++++++++++++++++++++++++++++++----------------
 net/smc/smc_clc.h |  81 ++++++++++++++++++-----
 3 files changed, 205 insertions(+), 78 deletions(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 8ce48799cf68..20afa94be8bb 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -451,14 +451,14 @@ static int smc_check_rdma(struct smc_sock *smc, struct smc_ib_device **ibdev,
 }
 
 /* CLC handshake during connect */
-static int smc_connect_clc(struct smc_sock *smc,
+static int smc_connect_clc(struct smc_sock *smc, int smc_type,
 			   struct smc_clc_msg_accept_confirm *aclc,
 			   struct smc_ib_device *ibdev, u8 ibport)
 {
 	int rc = 0;
 
 	/* do inband token exchange */
-	rc = smc_clc_send_proposal(smc, ibdev, ibport);
+	rc = smc_clc_send_proposal(smc, smc_type, ibdev, ibport, NULL);
 	if (rc)
 		return rc;
 	/* receive SMC Accept CLC message */
@@ -564,7 +564,7 @@ static int __smc_connect(struct smc_sock *smc)
 		return smc_connect_decline_fallback(smc, SMC_CLC_DECL_CNFERR);
 
 	/* perform CLC handshake */
-	rc = smc_connect_clc(smc, &aclc, ibdev, ibport);
+	rc = smc_connect_clc(smc, SMC_TYPE_R, &aclc, ibdev, ibport);
 	if (rc)
 		return smc_connect_decline_fallback(smc, rc);
 
@@ -1008,7 +1008,8 @@ static void smc_listen_work(struct work_struct *work)
 	smc_tx_init(new_smc);
 
 	/* check if RDMA is available */
-	if (smc_check_rdma(new_smc, &ibdev, &ibport) ||
+	if ((pclc->hdr.path != SMC_TYPE_R && pclc->hdr.path != SMC_TYPE_B) ||
+	    smc_check_rdma(new_smc, &ibdev, &ibport) ||
 	    smc_listen_rdma_check(new_smc, pclc) ||
 	    smc_listen_rdma_init(new_smc, pclc, ibdev, ibport,
 				 &local_contact) ||
diff --git a/net/smc/smc_clc.c b/net/smc/smc_clc.c
index 717449b1da0b..038d70ef7892 100644
--- a/net/smc/smc_clc.c
+++ b/net/smc/smc_clc.c
@@ -23,9 +23,15 @@
 #include "smc_core.h"
 #include "smc_clc.h"
 #include "smc_ib.h"
+#include "smc_ism.h"
+
+#define SMCR_CLC_ACCEPT_CONFIRM_LEN 68
+#define SMCD_CLC_ACCEPT_CONFIRM_LEN 48
 
 /* eye catcher "SMCR" EBCDIC for CLC messages */
 static const char SMC_EYECATCHER[4] = {'\xe2', '\xd4', '\xc3', '\xd9'};
+/* eye catcher "SMCD" EBCDIC for CLC messages */
+static const char SMCD_EYECATCHER[4] = {'\xe2', '\xd4', '\xc3', '\xc4'};
 
 /* check if received message has a correct header length and contains valid
  * heading and trailing eyecatchers
@@ -38,10 +44,14 @@ static bool smc_clc_msg_hdr_valid(struct smc_clc_msg_hdr *clcm)
 	struct smc_clc_msg_decline *dclc;
 	struct smc_clc_msg_trail *trl;
 
-	if (memcmp(clcm->eyecatcher, SMC_EYECATCHER, sizeof(SMC_EYECATCHER)))
+	if (memcmp(clcm->eyecatcher, SMC_EYECATCHER, sizeof(SMC_EYECATCHER)) &&
+	    memcmp(clcm->eyecatcher, SMCD_EYECATCHER, sizeof(SMCD_EYECATCHER)))
 		return false;
 	switch (clcm->type) {
 	case SMC_CLC_PROPOSAL:
+		if (clcm->path != SMC_TYPE_R && clcm->path != SMC_TYPE_D &&
+		    clcm->path != SMC_TYPE_B)
+			return false;
 		pclc = (struct smc_clc_msg_proposal *)clcm;
 		pclc_prfx = smc_clc_proposal_get_prefix(pclc);
 		if (ntohs(pclc->hdr.length) !=
@@ -56,10 +66,16 @@ static bool smc_clc_msg_hdr_valid(struct smc_clc_msg_hdr *clcm)
 		break;
 	case SMC_CLC_ACCEPT:
 	case SMC_CLC_CONFIRM:
+		if (clcm->path != SMC_TYPE_R && clcm->path != SMC_TYPE_D)
+			return false;
 		clc = (struct smc_clc_msg_accept_confirm *)clcm;
-		if (ntohs(clc->hdr.length) != sizeof(*clc))
+		if ((clcm->path == SMC_TYPE_R &&
+		     ntohs(clc->hdr.length) != SMCR_CLC_ACCEPT_CONFIRM_LEN) ||
+		    (clcm->path == SMC_TYPE_D &&
+		     ntohs(clc->hdr.length) != SMCD_CLC_ACCEPT_CONFIRM_LEN))
 			return false;
-		trl = &clc->trl;
+		trl = (struct smc_clc_msg_trail *)
+			((u8 *)clc + ntohs(clc->hdr.length) - sizeof(*trl));
 		break;
 	case SMC_CLC_DECLINE:
 		dclc = (struct smc_clc_msg_decline *)clcm;
@@ -70,7 +86,8 @@ static bool smc_clc_msg_hdr_valid(struct smc_clc_msg_hdr *clcm)
 	default:
 		return false;
 	}
-	if (memcmp(trl->eyecatcher, SMC_EYECATCHER, sizeof(SMC_EYECATCHER)))
+	if (memcmp(trl->eyecatcher, SMC_EYECATCHER, sizeof(SMC_EYECATCHER)) &&
+	    memcmp(trl->eyecatcher, SMCD_EYECATCHER, sizeof(SMCD_EYECATCHER)))
 		return false;
 	return true;
 }
@@ -295,6 +312,9 @@ int smc_clc_wait_msg(struct smc_sock *smc, void *buf, int buflen,
 	datlen = ntohs(clcm->length);
 	if ((len < sizeof(struct smc_clc_msg_hdr)) ||
 	    (datlen > buflen) ||
+	    (clcm->version != SMC_CLC_V1) ||
+	    (clcm->path != SMC_TYPE_R && clcm->path != SMC_TYPE_D &&
+	     clcm->path != SMC_TYPE_B) ||
 	    ((clcm->type != SMC_CLC_DECLINE) &&
 	     (clcm->type != expected_type))) {
 		smc->sk.sk_err = EPROTO;
@@ -356,17 +376,18 @@ int smc_clc_send_decline(struct smc_sock *smc, u32 peer_diag_info)
 }
 
 /* send CLC PROPOSAL message across internal TCP socket */
-int smc_clc_send_proposal(struct smc_sock *smc,
-			  struct smc_ib_device *smcibdev,
-			  u8 ibport)
+int smc_clc_send_proposal(struct smc_sock *smc, int smc_type,
+			  struct smc_ib_device *ibdev, u8 ibport,
+			  struct smcd_dev *ismdev)
 {
 	struct smc_clc_ipv6_prefix ipv6_prfx[SMC_CLC_MAX_V6_PREFIX];
 	struct smc_clc_msg_proposal_prefix pclc_prfx;
+	struct smc_clc_msg_smcd pclc_smcd;
 	struct smc_clc_msg_proposal pclc;
 	struct smc_clc_msg_trail trl;
 	int len, i, plen, rc;
 	int reason_code = 0;
-	struct kvec vec[4];
+	struct kvec vec[5];
 	struct msghdr msg;
 
 	/* retrieve ip prefixes for CLC proposal msg */
@@ -381,18 +402,34 @@ int smc_clc_send_proposal(struct smc_sock *smc,
 	memset(&pclc, 0, sizeof(pclc));
 	memcpy(pclc.hdr.eyecatcher, SMC_EYECATCHER, sizeof(SMC_EYECATCHER));
 	pclc.hdr.type = SMC_CLC_PROPOSAL;
-	pclc.hdr.length = htons(plen);
 	pclc.hdr.version = SMC_CLC_V1;		/* SMC version */
-	memcpy(pclc.lcl.id_for_peer, local_systemid, sizeof(local_systemid));
-	memcpy(&pclc.lcl.gid, &smcibdev->gid[ibport - 1], SMC_GID_SIZE);
-	memcpy(&pclc.lcl.mac, &smcibdev->mac[ibport - 1], ETH_ALEN);
-	pclc.iparea_offset = htons(0);
+	pclc.hdr.path = smc_type;
+	if (smc_type == SMC_TYPE_R || smc_type == SMC_TYPE_B) {
+		/* add SMC-R specifics */
+		memcpy(pclc.lcl.id_for_peer, local_systemid,
+		       sizeof(local_systemid));
+		memcpy(&pclc.lcl.gid, &ibdev->gid[ibport - 1], SMC_GID_SIZE);
+		memcpy(&pclc.lcl.mac, &ibdev->mac[ibport - 1], ETH_ALEN);
+		pclc.iparea_offset = htons(0);
+	}
+	if (smc_type == SMC_TYPE_D || smc_type == SMC_TYPE_B) {
+		/* add SMC-D specifics */
+		memset(&pclc_smcd, 0, sizeof(pclc_smcd));
+		plen += sizeof(pclc_smcd);
+		pclc.iparea_offset = htons(SMC_CLC_PROPOSAL_MAX_OFFSET);
+		pclc_smcd.gid = ismdev->local_gid;
+	}
+	pclc.hdr.length = htons(plen);
 
 	memcpy(trl.eyecatcher, SMC_EYECATCHER, sizeof(SMC_EYECATCHER));
 	memset(&msg, 0, sizeof(msg));
 	i = 0;
 	vec[i].iov_base = &pclc;
 	vec[i++].iov_len = sizeof(pclc);
+	if (smc_type == SMC_TYPE_D || smc_type == SMC_TYPE_B) {
+		vec[i].iov_base = &pclc_smcd;
+		vec[i++].iov_len = sizeof(pclc_smcd);
+	}
 	vec[i].iov_base = &pclc_prfx;
 	vec[i++].iov_len = sizeof(pclc_prfx);
 	if (pclc_prfx.ipv6_prefixes_cnt > 0) {
@@ -428,35 +465,56 @@ int smc_clc_send_confirm(struct smc_sock *smc)
 	struct kvec vec;
 	int len;
 
-	link = &conn->lgr->lnk[SMC_SINGLE_LINK];
 	/* send SMC Confirm CLC msg */
 	memset(&cclc, 0, sizeof(cclc));
-	memcpy(cclc.hdr.eyecatcher, SMC_EYECATCHER, sizeof(SMC_EYECATCHER));
 	cclc.hdr.type = SMC_CLC_CONFIRM;
-	cclc.hdr.length = htons(sizeof(cclc));
 	cclc.hdr.version = SMC_CLC_V1;		/* SMC version */
-	memcpy(cclc.lcl.id_for_peer, local_systemid, sizeof(local_systemid));
-	memcpy(&cclc.lcl.gid, &link->smcibdev->gid[link->ibport - 1],
-	       SMC_GID_SIZE);
-	memcpy(&cclc.lcl.mac, &link->smcibdev->mac[link->ibport - 1], ETH_ALEN);
-	hton24(cclc.qpn, link->roce_qp->qp_num);
-	cclc.rmb_rkey =
-		htonl(conn->rmb_desc->mr_rx[SMC_SINGLE_LINK]->rkey);
-	cclc.rmbe_idx = 1; /* for now: 1 RMB = 1 RMBE */
-	cclc.rmbe_alert_token = htonl(conn->alert_token_local);
-	cclc.qp_mtu = min(link->path_mtu, link->peer_mtu);
-	cclc.rmbe_size = conn->rmbe_size_short;
-	cclc.rmb_dma_addr = cpu_to_be64(
-		(u64)sg_dma_address(conn->rmb_desc->sgt[SMC_SINGLE_LINK].sgl));
-	hton24(cclc.psn, link->psn_initial);
-
-	memcpy(cclc.trl.eyecatcher, SMC_EYECATCHER, sizeof(SMC_EYECATCHER));
+	if (smc->conn.lgr->is_smcd) {
+		/* SMC-D specific settings */
+		memcpy(cclc.hdr.eyecatcher, SMCD_EYECATCHER,
+		       sizeof(SMCD_EYECATCHER));
+		cclc.hdr.path = SMC_TYPE_D;
+		cclc.hdr.length = htons(SMCD_CLC_ACCEPT_CONFIRM_LEN);
+		cclc.gid = conn->lgr->smcd->local_gid;
+		cclc.token = conn->rmb_desc->token;
+		cclc.dmbe_size = conn->rmbe_size_short;
+		cclc.dmbe_idx = 0;
+		memcpy(&cclc.linkid, conn->lgr->id, SMC_LGR_ID_SIZE);
+		memcpy(cclc.smcd_trl.eyecatcher, SMCD_EYECATCHER,
+		       sizeof(SMCD_EYECATCHER));
+	} else {
+		/* SMC-R specific settings */
+		link = &conn->lgr->lnk[SMC_SINGLE_LINK];
+		memcpy(cclc.hdr.eyecatcher, SMC_EYECATCHER,
+		       sizeof(SMC_EYECATCHER));
+		cclc.hdr.path = SMC_TYPE_R;
+		cclc.hdr.length = htons(SMCR_CLC_ACCEPT_CONFIRM_LEN);
+		memcpy(cclc.lcl.id_for_peer, local_systemid,
+		       sizeof(local_systemid));
+		memcpy(&cclc.lcl.gid, &link->smcibdev->gid[link->ibport - 1],
+		       SMC_GID_SIZE);
+		memcpy(&cclc.lcl.mac, &link->smcibdev->mac[link->ibport - 1],
+		       ETH_ALEN);
+		hton24(cclc.qpn, link->roce_qp->qp_num);
+		cclc.rmb_rkey =
+			htonl(conn->rmb_desc->mr_rx[SMC_SINGLE_LINK]->rkey);
+		cclc.rmbe_idx = 1; /* for now: 1 RMB = 1 RMBE */
+		cclc.rmbe_alert_token = htonl(conn->alert_token_local);
+		cclc.qp_mtu = min(link->path_mtu, link->peer_mtu);
+		cclc.rmbe_size = conn->rmbe_size_short;
+		cclc.rmb_dma_addr = cpu_to_be64((u64)sg_dma_address
+				(conn->rmb_desc->sgt[SMC_SINGLE_LINK].sgl));
+		hton24(cclc.psn, link->psn_initial);
+		memcpy(cclc.smcr_trl.eyecatcher, SMC_EYECATCHER,
+		       sizeof(SMC_EYECATCHER));
+	}
 
 	memset(&msg, 0, sizeof(msg));
 	vec.iov_base = &cclc;
-	vec.iov_len = sizeof(cclc);
-	len = kernel_sendmsg(smc->clcsock, &msg, &vec, 1, sizeof(cclc));
-	if (len < sizeof(cclc)) {
+	vec.iov_len = ntohs(cclc.hdr.length);
+	len = kernel_sendmsg(smc->clcsock, &msg, &vec, 1,
+			     ntohs(cclc.hdr.length));
+	if (len < ntohs(cclc.hdr.length)) {
 		if (len >= 0) {
 			reason_code = -ENETUNREACH;
 			smc->sk.sk_err = -reason_code;
@@ -479,35 +537,58 @@ int smc_clc_send_accept(struct smc_sock *new_smc, int srv_first_contact)
 	int rc = 0;
 	int len;
 
-	link = &conn->lgr->lnk[SMC_SINGLE_LINK];
 	memset(&aclc, 0, sizeof(aclc));
-	memcpy(aclc.hdr.eyecatcher, SMC_EYECATCHER, sizeof(SMC_EYECATCHER));
 	aclc.hdr.type = SMC_CLC_ACCEPT;
-	aclc.hdr.length = htons(sizeof(aclc));
 	aclc.hdr.version = SMC_CLC_V1;		/* SMC version */
 	if (srv_first_contact)
 		aclc.hdr.flag = 1;
-	memcpy(aclc.lcl.id_for_peer, local_systemid, sizeof(local_systemid));
-	memcpy(&aclc.lcl.gid, &link->smcibdev->gid[link->ibport - 1],
-	       SMC_GID_SIZE);
-	memcpy(&aclc.lcl.mac, link->smcibdev->mac[link->ibport - 1], ETH_ALEN);
-	hton24(aclc.qpn, link->roce_qp->qp_num);
-	aclc.rmb_rkey =
-		htonl(conn->rmb_desc->mr_rx[SMC_SINGLE_LINK]->rkey);
-	aclc.rmbe_idx = 1;			/* as long as 1 RMB = 1 RMBE */
-	aclc.rmbe_alert_token = htonl(conn->alert_token_local);
-	aclc.qp_mtu = link->path_mtu;
-	aclc.rmbe_size = conn->rmbe_size_short,
-	aclc.rmb_dma_addr = cpu_to_be64(
-		(u64)sg_dma_address(conn->rmb_desc->sgt[SMC_SINGLE_LINK].sgl));
-	hton24(aclc.psn, link->psn_initial);
-	memcpy(aclc.trl.eyecatcher, SMC_EYECATCHER, sizeof(SMC_EYECATCHER));
+
+	if (new_smc->conn.lgr->is_smcd) {
+		/* SMC-D specific settings */
+		aclc.hdr.length = htons(SMCD_CLC_ACCEPT_CONFIRM_LEN);
+		memcpy(aclc.hdr.eyecatcher, SMCD_EYECATCHER,
+		       sizeof(SMCD_EYECATCHER));
+		aclc.hdr.path = SMC_TYPE_D;
+		aclc.gid = conn->lgr->smcd->local_gid;
+		aclc.token = conn->rmb_desc->token;
+		aclc.dmbe_size = conn->rmbe_size_short;
+		aclc.dmbe_idx = 0;
+		memcpy(&aclc.linkid, conn->lgr->id, SMC_LGR_ID_SIZE);
+		memcpy(aclc.smcd_trl.eyecatcher, SMCD_EYECATCHER,
+		       sizeof(SMCD_EYECATCHER));
+	} else {
+		/* SMC-R specific settings */
+		aclc.hdr.length = htons(SMCR_CLC_ACCEPT_CONFIRM_LEN);
+		memcpy(aclc.hdr.eyecatcher, SMC_EYECATCHER,
+		       sizeof(SMC_EYECATCHER));
+		aclc.hdr.path = SMC_TYPE_R;
+		link = &conn->lgr->lnk[SMC_SINGLE_LINK];
+		memcpy(aclc.lcl.id_for_peer, local_systemid,
+		       sizeof(local_systemid));
+		memcpy(&aclc.lcl.gid, &link->smcibdev->gid[link->ibport - 1],
+		       SMC_GID_SIZE);
+		memcpy(&aclc.lcl.mac, link->smcibdev->mac[link->ibport - 1],
+		       ETH_ALEN);
+		hton24(aclc.qpn, link->roce_qp->qp_num);
+		aclc.rmb_rkey =
+			htonl(conn->rmb_desc->mr_rx[SMC_SINGLE_LINK]->rkey);
+		aclc.rmbe_idx = 1;		/* as long as 1 RMB = 1 RMBE */
+		aclc.rmbe_alert_token = htonl(conn->alert_token_local);
+		aclc.qp_mtu = link->path_mtu;
+		aclc.rmbe_size = conn->rmbe_size_short;
+		aclc.rmb_dma_addr = cpu_to_be64((u64)sg_dma_address
+				(conn->rmb_desc->sgt[SMC_SINGLE_LINK].sgl));
+		hton24(aclc.psn, link->psn_initial);
+		memcpy(aclc.smcr_trl.eyecatcher, SMC_EYECATCHER,
+		       sizeof(SMC_EYECATCHER));
+	}
 
 	memset(&msg, 0, sizeof(msg));
 	vec.iov_base = &aclc;
-	vec.iov_len = sizeof(aclc);
-	len = kernel_sendmsg(new_smc->clcsock, &msg, &vec, 1, sizeof(aclc));
-	if (len < sizeof(aclc)) {
+	vec.iov_len = ntohs(aclc.hdr.length);
+	len = kernel_sendmsg(new_smc->clcsock, &msg, &vec, 1,
+			     ntohs(aclc.hdr.length));
+	if (len < ntohs(aclc.hdr.length)) {
 		if (len >= 0)
 			new_smc->sk.sk_err = EPROTO;
 		else
diff --git a/net/smc/smc_clc.h b/net/smc/smc_clc.h
index 41ff9ea96139..100e988ad1a8 100644
--- a/net/smc/smc_clc.h
+++ b/net/smc/smc_clc.h
@@ -23,6 +23,9 @@
 #define SMC_CLC_DECLINE		0x04
 
 #define SMC_CLC_V1		0x1		/* SMC version                */
+#define SMC_TYPE_R		0		/* SMC-R only		      */
+#define SMC_TYPE_D		1		/* SMC-D only		      */
+#define SMC_TYPE_B		3		/* SMC-R and SMC-D	      */
 #define CLC_WAIT_TIME		(6 * HZ)	/* max. wait time on clcsock  */
 #define SMC_CLC_DECL_MEM	0x01010000  /* insufficient memory resources  */
 #define SMC_CLC_DECL_TIMEOUT	0x02000000  /* timeout                        */
@@ -42,9 +45,11 @@ struct smc_clc_msg_hdr {	/* header1 of clc messages */
 #if defined(__BIG_ENDIAN_BITFIELD)
 	u8 version : 4,
 	   flag    : 1,
-	   rsvd    : 3;
+	   rsvd	   : 1,
+	   path	   : 2;
 #elif defined(__LITTLE_ENDIAN_BITFIELD)
-	u8 rsvd    : 3,
+	u8 path    : 2,
+	   rsvd    : 1,
 	   flag    : 1,
 	   version : 4;
 #endif
@@ -77,6 +82,11 @@ struct smc_clc_msg_proposal_prefix {	/* prefix part of clc proposal message*/
 	u8 ipv6_prefixes_cnt;	/* number of IPv6 prefixes in prefix array */
 } __aligned(4);
 
+struct smc_clc_msg_smcd {	/* SMC-D GID information */
+	u64 gid;		/* ISM GID of requestor */
+	u8 res[32];
+};
+
 struct smc_clc_msg_proposal {	/* clc proposal message sent by Linux */
 	struct smc_clc_msg_hdr hdr;
 	struct smc_clc_msg_local lcl;
@@ -94,23 +104,45 @@ struct smc_clc_msg_proposal {	/* clc proposal message sent by Linux */
 
 struct smc_clc_msg_accept_confirm {	/* clc accept / confirm message */
 	struct smc_clc_msg_hdr hdr;
-	struct smc_clc_msg_local lcl;
-	u8 qpn[3];		/* QP number */
-	__be32 rmb_rkey;	/* RMB rkey */
-	u8 rmbe_idx;		/* Index of RMBE in RMB */
-	__be32 rmbe_alert_token;/* unique connection id */
+	union {
+		struct { /* SMC-R */
+			struct smc_clc_msg_local lcl;
+			u8 qpn[3];		/* QP number */
+			__be32 rmb_rkey;	/* RMB rkey */
+			u8 rmbe_idx;		/* Index of RMBE in RMB */
+			__be32 rmbe_alert_token;/* unique connection id */
 #if defined(__BIG_ENDIAN_BITFIELD)
-	u8 rmbe_size : 4,	/* RMBE buf size (compressed notation) */
-	   qp_mtu   : 4;	/* QP mtu */
+			u8 rmbe_size : 4,	/* buf size (compressed) */
+			   qp_mtu   : 4;	/* QP mtu */
 #elif defined(__LITTLE_ENDIAN_BITFIELD)
-	u8 qp_mtu   : 4,
-	   rmbe_size : 4;
+			u8 qp_mtu   : 4,
+			   rmbe_size : 4;
 #endif
-	u8 reserved;
-	__be64 rmb_dma_addr;	/* RMB virtual address */
-	u8 reserved2;
-	u8 psn[3];		/* initial packet sequence number */
-	struct smc_clc_msg_trail trl; /* eye catcher "SMCR" EBCDIC */
+			u8 reserved;
+			__be64 rmb_dma_addr;	/* RMB virtual address */
+			u8 reserved2;
+			u8 psn[3];		/* packet sequence number */
+			struct smc_clc_msg_trail smcr_trl;
+						/* eye catcher "SMCR" EBCDIC */
+		} __packed;
+		struct { /* SMC-D */
+			u64 gid;		/* Sender GID */
+			u64 token;		/* DMB token */
+			u8 dmbe_idx;		/* DMBE index */
+#if defined(__BIG_ENDIAN_BITFIELD)
+			u8 dmbe_size : 4,	/* buf size (compressed) */
+			   reserved3 : 4;
+#elif defined(__LITTLE_ENDIAN_BITFIELD)
+			u8 reserved3 : 4,
+			   dmbe_size : 4;
+#endif
+			u16 reserved4;
+			u32 linkid;		/* Link identifier */
+			u32 reserved5[3];
+			struct smc_clc_msg_trail smcd_trl;
+						/* eye catcher "SMCD" EBCDIC */
+		} __packed;
+	};
 } __packed;			/* format defined in RFC7609 */
 
 struct smc_clc_msg_decline {	/* clc decline message */
@@ -129,13 +161,26 @@ smc_clc_proposal_get_prefix(struct smc_clc_msg_proposal *pclc)
 	       ((u8 *)pclc + sizeof(*pclc) + ntohs(pclc->iparea_offset));
 }
 
+/* get SMC-D info from proposal message */
+static inline struct smc_clc_msg_smcd *
+smc_get_clc_msg_smcd(struct smc_clc_msg_proposal *prop)
+{
+	if (ntohs(prop->iparea_offset) != sizeof(struct smc_clc_msg_smcd))
+		return NULL;
+
+	return (struct smc_clc_msg_smcd *)(prop + 1);
+}
+
+struct smcd_dev;
+
 int smc_clc_prfx_match(struct socket *clcsock,
 		       struct smc_clc_msg_proposal_prefix *prop);
 int smc_clc_wait_msg(struct smc_sock *smc, void *buf, int buflen,
 		     u8 expected_type);
 int smc_clc_send_decline(struct smc_sock *smc, u32 peer_diag_info);
-int smc_clc_send_proposal(struct smc_sock *smc, struct smc_ib_device *smcibdev,
-			  u8 ibport);
+int smc_clc_send_proposal(struct smc_sock *smc, int smc_type,
+			  struct smc_ib_device *smcibdev, u8 ibport,
+			  struct smcd_dev *ismdev);
 int smc_clc_send_confirm(struct smc_sock *smc);
 int smc_clc_send_accept(struct smc_sock *smc, int srv_first_contact);
 
-- 
2.16.4

^ permalink raw reply related

* [PATCH net-next 03/10] net/smc: optimize consumer cursor updates
From: Ursula Braun @ 2018-06-28 17:05 UTC (permalink / raw)
  To: davem
  Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, hwippel,
	sebott, ubraun
In-Reply-To: <20180628170513.5089-1-ubraun@linux.ibm.com>

From: Ursula Braun <ursula.braun@de.ibm.com>

The SMC protocol requires sending a separate consumer cursor update
if it cannot be piggybacked onto an update of the producer cursor.
Currently the decision to send a separate consumer cursor update
considers only the amount of data already received by the socket
program. It does not consider the amount of data that has already
arrived but has not yet been consumed by the receiver. Basing the
decision on the difference between already confirmed and already
arrived data (instead of the difference between already confirmed and
already consumed data) may lead to a somewhat earlier consumer cursor
update in fast unidirectional traffic scenarios, and thus to better
throughput.
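
The difference between the two rules can be sketched in a small
standalone model. This is only an illustration under simplifying
assumptions: struct rx_model and both function names are hypothetical,
and cursors are modelled as monotonically increasing byte counters
rather than as SMC host cursors with wrap counts.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified receiver state. Cursors are modelled as
 * monotonically increasing byte counters, not as SMC host cursors.
 */
struct rx_model {
	size_t rmb_len;		/* receive buffer (RMB) size */
	size_t update_limit;	/* stands in for rmbe_update_limit */
	size_t confirmed;	/* bytes already confirmed to the peer */
	size_t consumed;	/* bytes read by the socket program */
	size_t arrived;		/* bytes written by the peer so far */
	bool update_requested;	/* peer asked for a cursor update */
	bool write_blocked;	/* peer reported that it is blocked */
};

/* Previous rule: only consumed-but-unconfirmed data is considered. */
static bool cons_update_old(const struct rx_model *m, bool force)
{
	size_t to_confirm = m->consumed - m->confirmed;

	return m->update_requested || force ||
	       (to_confirm > m->update_limit &&
		(to_confirm > m->rmb_len / 2 || m->write_blocked));
}

/* New rule: look at the space the sender believes is still free. */
static bool cons_update_new(const struct rx_model *m, bool force)
{
	size_t to_confirm = m->consumed - m->confirmed;
	size_t sender_free = m->rmb_len;

	if (to_confirm > m->update_limit)
		sender_free = m->rmb_len - (m->arrived - m->confirmed);

	return m->update_requested || force ||
	       (to_confirm > m->update_limit &&
		(sender_free <= m->rmb_len / 2 || m->write_blocked));
}
```

With a 1000 byte buffer, an update limit of 100, 200 bytes consumed and
600 bytes arrived since the last confirmation, the old rule stays
silent (200 is not more than half the buffer) while the new rule sends
an update (the sender sees only 400 bytes free).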

Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Suggested-by: Thomas Richter <tmricht@linux.ibm.com>
---
 net/smc/smc_tx.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
index cee666400752..f82886b7d1d8 100644
--- a/net/smc/smc_tx.c
+++ b/net/smc/smc_tx.c
@@ -495,7 +495,8 @@ void smc_tx_work(struct work_struct *work)
 
 void smc_tx_consumer_update(struct smc_connection *conn, bool force)
 {
-	union smc_host_cursor cfed, cons;
+	union smc_host_cursor cfed, cons, prod;
+	int sender_free = conn->rmb_desc->len;
 	int to_confirm;
 
 	smc_curs_write(&cons,
@@ -505,11 +506,18 @@ void smc_tx_consumer_update(struct smc_connection *conn, bool force)
 		       smc_curs_read(&conn->rx_curs_confirmed, conn),
 		       conn);
 	to_confirm = smc_curs_diff(conn->rmb_desc->len, &cfed, &cons);
+	if (to_confirm > conn->rmbe_update_limit) {
+		smc_curs_write(&prod,
+			       smc_curs_read(&conn->local_rx_ctrl.prod, conn),
+			       conn);
+		sender_free = conn->rmb_desc->len -
+			      smc_curs_diff(conn->rmb_desc->len, &prod, &cfed);
+	}
 
 	if (conn->local_rx_ctrl.prod_flags.cons_curs_upd_req ||
 	    force ||
 	    ((to_confirm > conn->rmbe_update_limit) &&
-	     ((to_confirm > (conn->rmb_desc->len / 2)) ||
+	     ((sender_free <= (conn->rmb_desc->len / 2)) ||
 	      conn->local_rx_ctrl.prod_flags.write_blocked))) {
 		if ((smc_cdc_get_slot_and_msg_send(conn) < 0) &&
 		    conn->alert_token_local) { /* connection healthy */
-- 
2.16.4

^ permalink raw reply related

* [PATCH net-next 02/10] net/smc: add pnetid support
From: Ursula Braun @ 2018-06-28 17:05 UTC (permalink / raw)
  To: davem
  Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, hwippel,
	sebott, ubraun
In-Reply-To: <20180628170513.5089-1-ubraun@linux.ibm.com>

s390 hardware supports the definition of a so-called Physical NETwork
IDentifier (PNETID for short) per network device port. These PNETIDs
can be used to identify network devices that are attached to the same
physical network (broadcast domain).

On s390, try to use the PNETID of the ethernet device port used for
the initial connection to derive the IB device port used for SMC RDMA
traffic.

On platforms without PNETID support, fall back to the existing
solution of a configured pnet table.
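
The lookup order described above can be sketched as a standalone
model. All type and function names below are hypothetical stand-ins
(they are not the kernel structures), and a NULL hw_pnetid is assumed
to represent a platform that cannot report a PNETID.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PNETID_LEN	16	/* mirrors SMC_MAX_PNETID_LEN */
#define MAX_PORTS	2	/* mirrors SMC_MAX_PORTS */

/* Hypothetical stand-in for struct smc_ib_device. */
struct ibdev_model {
	unsigned char pnetid[MAX_PORTS][PNETID_LEN];
	bool active[MAX_PORTS];
};

/* Hypothetical pnet table entry: Ethernet ifindex -> IB device port. */
struct pnet_entry_model {
	int ifindex;
	struct ibdev_model *dev;
	int port;			/* 1-based, as in the patch */
};

/* Return the first active port with a matching PNETID, 0 if none. */
static int find_by_pnetid(struct ibdev_model *devs, size_t ndevs,
			  const unsigned char *pnetid,
			  struct ibdev_model **found)
{
	for (size_t d = 0; d < ndevs; d++)
		for (int p = 1; p <= MAX_PORTS; p++)
			if (devs[d].active[p - 1] &&
			    !memcmp(devs[d].pnetid[p - 1], pnetid,
				    PNETID_LEN)) {
				*found = &devs[d];
				return p;
			}
	return 0;
}

/* Return the active port configured for ifindex, 0 if none. */
static int find_by_table(const struct pnet_entry_model *tbl, size_t n,
			 int ifindex, struct ibdev_model **found)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].ifindex == ifindex &&
		    tbl[i].dev->active[tbl[i].port - 1]) {
			*found = tbl[i].dev;
			return tbl[i].port;
		}
	return 0;
}

/* hw_pnetid is NULL when the platform cannot report a PNETID. */
static int find_roce_port(struct ibdev_model *devs, size_t ndevs,
			  const unsigned char *hw_pnetid,
			  const struct pnet_entry_model *tbl, size_t n,
			  int ifindex, struct ibdev_model **found)
{
	int port = 0;

	*found = NULL;
	if (hw_pnetid)
		port = find_by_pnetid(devs, ndevs, hw_pnetid, found);
	if (!port)
		port = find_by_table(tbl, n, ifindex, found);
	return port;
}
```

The sketch returns as soon as a matching active port is found, and it
consults the table only when the hardware lookup yields nothing.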

Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
---
 include/net/smc.h  |   2 +
 net/smc/smc_ib.c   |   6 ++-
 net/smc/smc_ib.h   |   3 ++
 net/smc/smc_pnet.c | 109 +++++++++++++++++++++++++++++++++++++++++++----------
 net/smc/smc_pnet.h |  14 +++++++
 5 files changed, 114 insertions(+), 20 deletions(-)

diff --git a/include/net/smc.h b/include/net/smc.h
index 8381d163fefa..2173932fab9d 100644
--- a/include/net/smc.h
+++ b/include/net/smc.h
@@ -11,6 +11,8 @@
 #ifndef _SMC_H
 #define _SMC_H
 
+#define SMC_MAX_PNETID_LEN	16	/* Max. length of PNET id */
+
 struct smc_hashinfo {
 	rwlock_t lock;
 	struct hlist_head ht;
diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
index f8b159ced032..36de2fd76170 100644
--- a/net/smc/smc_ib.c
+++ b/net/smc/smc_ib.c
@@ -504,8 +504,12 @@ static void smc_ib_add_dev(struct ib_device *ibdev)
 	port_cnt = smcibdev->ibdev->phys_port_cnt;
 	for (i = 0;
 	     i < min_t(size_t, port_cnt, SMC_MAX_PORTS);
-	     i++)
+	     i++) {
 		set_bit(i, &smcibdev->port_event_mask);
+		/* determine pnetids of the port */
+		smc_pnetid_by_dev_port(ibdev->dev.parent, i,
+				       smcibdev->pnetid[i]);
+	}
 	schedule_work(&smcibdev->port_event_work);
 }
 
diff --git a/net/smc/smc_ib.h b/net/smc/smc_ib.h
index 2c480b352928..7c1223c91229 100644
--- a/net/smc/smc_ib.h
+++ b/net/smc/smc_ib.h
@@ -15,6 +15,7 @@
 #include <linux/interrupt.h>
 #include <linux/if_ether.h>
 #include <rdma/ib_verbs.h>
+#include <net/smc.h>
 
 #define SMC_MAX_PORTS			2	/* Max # of ports */
 #define SMC_GID_SIZE			sizeof(union ib_gid)
@@ -40,6 +41,8 @@ struct smc_ib_device {				/* ib-device infos for smc */
 	char			mac[SMC_MAX_PORTS][ETH_ALEN];
 						/* mac address per port*/
 	union ib_gid		gid[SMC_MAX_PORTS]; /* gid per port */
+	u8			pnetid[SMC_MAX_PORTS][SMC_MAX_PNETID_LEN];
+						/* pnetid per port */
 	u8			initialized : 1; /* ib dev CQ, evthdl done */
 	struct work_struct	port_event_work;
 	unsigned long		port_event_mask;
diff --git a/net/smc/smc_pnet.c b/net/smc/smc_pnet.c
index a82a5cad0282..cdc6e23b6ce1 100644
--- a/net/smc/smc_pnet.c
+++ b/net/smc/smc_pnet.c
@@ -23,12 +23,10 @@
 #include "smc_pnet.h"
 #include "smc_ib.h"
 
-#define SMC_MAX_PNET_ID_LEN	16	/* Max. length of PNET id */
-
 static struct nla_policy smc_pnet_policy[SMC_PNETID_MAX + 1] = {
 	[SMC_PNETID_NAME] = {
 		.type = NLA_NUL_STRING,
-		.len = SMC_MAX_PNET_ID_LEN - 1
+		.len = SMC_MAX_PNETID_LEN - 1
 	},
 	[SMC_PNETID_ETHNAME] = {
 		.type = NLA_NUL_STRING,
@@ -65,7 +63,7 @@ static struct smc_pnettable {
  */
 struct smc_pnetentry {
 	struct list_head list;
-	char pnet_name[SMC_MAX_PNET_ID_LEN + 1];
+	char pnet_name[SMC_MAX_PNETID_LEN + 1];
 	struct net_device *ndev;
 	struct smc_ib_device *smcibdev;
 	u8 ib_port;
@@ -209,7 +207,7 @@ static bool smc_pnetid_valid(const char *pnet_name, char *pnetid)
 		return false;
 	while (--end >= bf && isspace(*end))
 		;
-	if (end - bf >= SMC_MAX_PNET_ID_LEN)
+	if (end - bf >= SMC_MAX_PNETID_LEN)
 		return false;
 	while (bf <= end) {
 		if (!isalnum(*bf))
@@ -512,26 +510,70 @@ void smc_pnet_exit(void)
 	genl_unregister_family(&smc_pnet_nl_family);
 }
 
-/* PNET table analysis for a given sock:
- * determine ib_device and port belonging to used internal TCP socket
- * ethernet interface.
+/* Determine one base device for stacked net devices.
+ * If the lower device level contains more than one device
+ * (for instance with bonding slaves), just the first device
+ * is used to reach a base device.
  */
-void smc_pnet_find_roce_resource(struct sock *sk,
-				 struct smc_ib_device **smcibdev, u8 *ibport)
+static struct net_device *pnet_find_base_ndev(struct net_device *ndev)
 {
-	struct dst_entry *dst = sk_dst_get(sk);
-	struct smc_pnetentry *pnetelem;
+	int i, nest_lvl;
 
-	*smcibdev = NULL;
-	*ibport = 0;
+	rtnl_lock();
+	nest_lvl = dev_get_nest_level(ndev);
+	for (i = 0; i < nest_lvl; i++) {
+		struct list_head *lower = &ndev->adj_list.lower;
+
+		if (list_empty(lower))
+			break;
+		lower = lower->next;
+		ndev = netdev_lower_get_next(ndev, &lower);
+	}
+	rtnl_unlock();
+	return ndev;
+}
+
+/* Determine the corresponding IB device port based on the hardware PNETID.
+ * Searching stops at the first matching active IB device port.
+ */
+static void smc_pnet_find_roce_by_pnetid(struct net_device *ndev,
+					 struct smc_ib_device **smcibdev,
+					 u8 *ibport)
+{
+	u8 ndev_pnetid[SMC_MAX_PNETID_LEN];
+	struct smc_ib_device *ibdev;
+	int i;
+
+	ndev = pnet_find_base_ndev(ndev);
+	if (smc_pnetid_by_dev_port(ndev->dev.parent, ndev->dev_port,
+				   ndev_pnetid))
+		return; /* pnetid could not be determined */
+
+	spin_lock(&smc_ib_devices.lock);
+	list_for_each_entry(ibdev, &smc_ib_devices.list, list) {
+		for (i = 1; i <= SMC_MAX_PORTS; i++) {
+			if (!memcmp(ibdev->pnetid[i - 1], ndev_pnetid,
+				    SMC_MAX_PNETID_LEN) &&
+			    smc_ib_port_active(ibdev, i)) {
+				*smcibdev = ibdev;
+				*ibport = i;
+				break;
+			}
+		}
+	}
+	spin_unlock(&smc_ib_devices.lock);
+}
+
+/* Lookup of coupled ib_device via SMC pnet table */
+static void smc_pnet_find_roce_by_table(struct net_device *netdev,
+					struct smc_ib_device **smcibdev,
+					u8 *ibport)
+{
+	struct smc_pnetentry *pnetelem;
 
-	if (!dst)
-		return;
-	if (!dst->dev)
-		goto out_rel;
 	read_lock(&smc_pnettable.lock);
 	list_for_each_entry(pnetelem, &smc_pnettable.pnetlist, list) {
-		if (dst->dev == pnetelem->ndev) {
+		if (netdev == pnetelem->ndev) {
 			if (smc_ib_port_active(pnetelem->smcibdev,
 					       pnetelem->ib_port)) {
 				*smcibdev = pnetelem->smcibdev;
@@ -541,6 +583,35 @@ void smc_pnet_find_roce_resource(struct sock *sk,
 		}
 	}
 	read_unlock(&smc_pnettable.lock);
+}
+
+/* PNET table analysis for a given sock:
+ * determine ib_device and port belonging to used internal TCP socket
+ * ethernet interface.
+ */
+void smc_pnet_find_roce_resource(struct sock *sk,
+				 struct smc_ib_device **smcibdev, u8 *ibport)
+{
+	struct dst_entry *dst = sk_dst_get(sk);
+
+	*smcibdev = NULL;
+	*ibport = 0;
+
+	if (!dst)
+		goto out;
+	if (!dst->dev)
+		goto out_rel;
+
+	/* if possible, lookup via hardware-defined pnetid */
+	smc_pnet_find_roce_by_pnetid(dst->dev, smcibdev, ibport);
+	if (*smcibdev)
+		goto out_rel;
+
+	/* lookup via SMC PNET table */
+	smc_pnet_find_roce_by_table(dst->dev, smcibdev, ibport);
+
 out_rel:
 	dst_release(dst);
+out:
+	return;
 }
diff --git a/net/smc/smc_pnet.h b/net/smc/smc_pnet.h
index 5a29519db976..ad4455cde9e7 100644
--- a/net/smc/smc_pnet.h
+++ b/net/smc/smc_pnet.h
@@ -12,8 +12,22 @@
 #ifndef _SMC_PNET_H
 #define _SMC_PNET_H
 
+#if IS_ENABLED(CONFIG_HAVE_PNETID)
+#include <asm/pnet.h>
+#endif
+
 struct smc_ib_device;
 
+static inline int smc_pnetid_by_dev_port(struct device *dev,
+					 unsigned short port, u8 *pnetid)
+{
+#if IS_ENABLED(CONFIG_HAVE_PNETID)
+	return pnet_id_by_dev_port(dev, port, pnetid);
+#else
+	return -ENOENT;
+#endif
+}
+
 int smc_pnet_init(void) __init;
 void smc_pnet_exit(void);
 int smc_pnet_remove_by_ibdev(struct smc_ib_device *ibdev);
-- 
2.16.4

^ permalink raw reply related

* [PATCH net-next 01/10] net/smc: determine port attributes independent from pnet table
From: Ursula Braun @ 2018-06-28 17:05 UTC (permalink / raw)
  To: davem
  Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, hwippel,
	sebott, ubraun
In-Reply-To: <20180628170513.5089-1-ubraun@linux.ibm.com>

For SMC it is important to know the current port state of RoCE devices.
Until now, monitoring of port states was triggered when a RoCE device
was added to the pnet table. To support future alternatives to the pnet
table, the monitoring of ports is made independent of the existence of a
pnet table. It starts once the smc_ib_device is established.

Due to this change smc_ib_remember_port_attr() is now a local function,
and moving it together with the functions it uses makes any forward
references unnecessary.

The duplicate SMC_MAX_PORTS definition is removed as well.
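
The startup trigger can be pictured with a small standalone model: all
usable ports are marked in an event mask, and a worker later handles
and clears each marked port. Function names are hypothetical, and the
worker side is an assumption about how the mask is drained, not a copy
of smc_ib_port_event_work().

```c
#define MAX_PORTS	2	/* mirrors SMC_MAX_PORTS */

/* Mark every usable port (at most MAX_PORTS) as needing a refresh. */
static unsigned long mark_all_ports(unsigned int port_cnt)
{
	unsigned int n = port_cnt < MAX_PORTS ? port_cnt : MAX_PORTS;
	unsigned long mask = 0;

	for (unsigned int i = 0; i < n; i++)
		mask |= 1UL << i;
	return mask;
}

/* Worker side (assumed): record and clear each pending port. Port
 * numbers stored in ports[] are 1-based, matching the ibport
 * convention. Returns the number of ports handled.
 */
static int drain_port_events(unsigned long *mask, int ports[MAX_PORTS])
{
	int handled = 0;

	for (int i = 0; i < MAX_PORTS; i++) {
		if (!(*mask & (1UL << i)))
			continue;
		ports[handled++] = i + 1;
		*mask &= ~(1UL << i);
	}
	return handled;
}
```

A device reporting four physical ports still yields a mask of 0x3,
since only the first MAX_PORTS ports are tracked.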

Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
---
 net/smc/smc.h      |   2 -
 net/smc/smc_ib.c   | 130 ++++++++++++++++++++++++++++-------------------------
 net/smc/smc_ib.h   |   1 -
 net/smc/smc_pnet.c |   7 +--
 4 files changed, 72 insertions(+), 68 deletions(-)

diff --git a/net/smc/smc.h b/net/smc/smc.h
index 51ae1f10d81a..7c86f716a92e 100644
--- a/net/smc/smc.h
+++ b/net/smc/smc.h
@@ -21,8 +21,6 @@
 #define SMCPROTO_SMC		0	/* SMC protocol, IPv4 */
 #define SMCPROTO_SMC6		1	/* SMC protocol, IPv6 */
 
-#define SMC_MAX_PORTS		2	/* Max # of ports */
-
 extern struct proto smc_proto;
 extern struct proto smc_proto6;
 
diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
index 0eed7ab9f28b..f8b159ced032 100644
--- a/net/smc/smc_ib.c
+++ b/net/smc/smc_ib.c
@@ -143,6 +143,62 @@ int smc_ib_ready_link(struct smc_link *lnk)
 	return rc;
 }
 
+static int smc_ib_fill_gid_and_mac(struct smc_ib_device *smcibdev, u8 ibport)
+{
+	struct ib_gid_attr gattr;
+	int rc;
+
+	rc = ib_query_gid(smcibdev->ibdev, ibport, 0,
+			  &smcibdev->gid[ibport - 1], &gattr);
+	if (rc || !gattr.ndev)
+		return -ENODEV;
+
+	memcpy(smcibdev->mac[ibport - 1], gattr.ndev->dev_addr, ETH_ALEN);
+	dev_put(gattr.ndev);
+	return 0;
+}
+
+/* Create an identifier unique for this instance of SMC-R.
+ * The MAC-address of the first active registered IB device
+ * plus a random 2-byte number is used to create this identifier.
+ * This name is delivered to the peer during connection initialization.
+ */
+static inline void smc_ib_define_local_systemid(struct smc_ib_device *smcibdev,
+						u8 ibport)
+{
+	memcpy(&local_systemid[2], &smcibdev->mac[ibport - 1],
+	       sizeof(smcibdev->mac[ibport - 1]));
+	get_random_bytes(&local_systemid[0], 2);
+}
+
+bool smc_ib_port_active(struct smc_ib_device *smcibdev, u8 ibport)
+{
+	return smcibdev->pattr[ibport - 1].state == IB_PORT_ACTIVE;
+}
+
+static int smc_ib_remember_port_attr(struct smc_ib_device *smcibdev, u8 ibport)
+{
+	int rc;
+
+	memset(&smcibdev->pattr[ibport - 1], 0,
+	       sizeof(smcibdev->pattr[ibport - 1]));
+	rc = ib_query_port(smcibdev->ibdev, ibport,
+			   &smcibdev->pattr[ibport - 1]);
+	if (rc)
+		goto out;
+	/* the SMC protocol requires specification of the RoCE MAC address */
+	rc = smc_ib_fill_gid_and_mac(smcibdev, ibport);
+	if (rc)
+		goto out;
+	if (!strncmp(local_systemid, SMC_LOCAL_SYSTEMID_RESET,
+		     sizeof(local_systemid)) &&
+	    smc_ib_port_active(smcibdev, ibport))
+		/* create unique system identifier */
+		smc_ib_define_local_systemid(smcibdev, ibport);
+out:
+	return rc;
+}
+
 /* process context wrapper for might_sleep smc_ib_remember_port_attr */
 static void smc_ib_port_event_work(struct work_struct *work)
 {
@@ -370,62 +426,6 @@ void smc_ib_buf_unmap_sg(struct smc_ib_device *smcibdev,
 	buf_slot->sgt[SMC_SINGLE_LINK].sgl->dma_address = 0;
 }
 
-static int smc_ib_fill_gid_and_mac(struct smc_ib_device *smcibdev, u8 ibport)
-{
-	struct ib_gid_attr gattr;
-	int rc;
-
-	rc = ib_query_gid(smcibdev->ibdev, ibport, 0,
-			  &smcibdev->gid[ibport - 1], &gattr);
-	if (rc || !gattr.ndev)
-		return -ENODEV;
-
-	memcpy(smcibdev->mac[ibport - 1], gattr.ndev->dev_addr, ETH_ALEN);
-	dev_put(gattr.ndev);
-	return 0;
-}
-
-/* Create an identifier unique for this instance of SMC-R.
- * The MAC-address of the first active registered IB device
- * plus a random 2-byte number is used to create this identifier.
- * This name is delivered to the peer during connection initialization.
- */
-static inline void smc_ib_define_local_systemid(struct smc_ib_device *smcibdev,
-						u8 ibport)
-{
-	memcpy(&local_systemid[2], &smcibdev->mac[ibport - 1],
-	       sizeof(smcibdev->mac[ibport - 1]));
-	get_random_bytes(&local_systemid[0], 2);
-}
-
-bool smc_ib_port_active(struct smc_ib_device *smcibdev, u8 ibport)
-{
-	return smcibdev->pattr[ibport - 1].state == IB_PORT_ACTIVE;
-}
-
-int smc_ib_remember_port_attr(struct smc_ib_device *smcibdev, u8 ibport)
-{
-	int rc;
-
-	memset(&smcibdev->pattr[ibport - 1], 0,
-	       sizeof(smcibdev->pattr[ibport - 1]));
-	rc = ib_query_port(smcibdev->ibdev, ibport,
-			   &smcibdev->pattr[ibport - 1]);
-	if (rc)
-		goto out;
-	/* the SMC protocol requires specification of the RoCE MAC address */
-	rc = smc_ib_fill_gid_and_mac(smcibdev, ibport);
-	if (rc)
-		goto out;
-	if (!strncmp(local_systemid, SMC_LOCAL_SYSTEMID_RESET,
-		     sizeof(local_systemid)) &&
-	    smc_ib_port_active(smcibdev, ibport))
-		/* create unique system identifier */
-		smc_ib_define_local_systemid(smcibdev, ibport);
-out:
-	return rc;
-}
-
 long smc_ib_setup_per_ibdev(struct smc_ib_device *smcibdev)
 {
 	struct ib_cq_init_attr cqattr =	{
@@ -454,9 +454,6 @@ long smc_ib_setup_per_ibdev(struct smc_ib_device *smcibdev)
 		smcibdev->roce_cq_recv = NULL;
 		goto err;
 	}
-	INIT_IB_EVENT_HANDLER(&smcibdev->event_handler, smcibdev->ibdev,
-			      smc_ib_global_event_handler);
-	ib_register_event_handler(&smcibdev->event_handler);
 	smc_wr_add_dev(smcibdev);
 	smcibdev->initialized = 1;
 	return rc;
@@ -472,7 +469,6 @@ static void smc_ib_cleanup_per_ibdev(struct smc_ib_device *smcibdev)
 		return;
 	smcibdev->initialized = 0;
 	smc_wr_remove_dev(smcibdev);
-	ib_unregister_event_handler(&smcibdev->event_handler);
 	ib_destroy_cq(smcibdev->roce_cq_recv);
 	ib_destroy_cq(smcibdev->roce_cq_send);
 }
@@ -483,6 +479,8 @@ static struct ib_client smc_ib_client;
 static void smc_ib_add_dev(struct ib_device *ibdev)
 {
 	struct smc_ib_device *smcibdev;
+	u8 port_cnt;
+	int i;
 
 	if (ibdev->node_type != RDMA_NODE_IB_CA)
 		return;
@@ -498,6 +496,17 @@ static void smc_ib_add_dev(struct ib_device *ibdev)
 	list_add_tail(&smcibdev->list, &smc_ib_devices.list);
 	spin_unlock(&smc_ib_devices.lock);
 	ib_set_client_data(ibdev, &smc_ib_client, smcibdev);
+	INIT_IB_EVENT_HANDLER(&smcibdev->event_handler, smcibdev->ibdev,
+			      smc_ib_global_event_handler);
+	ib_register_event_handler(&smcibdev->event_handler);
+
+	/* trigger reading of the port attributes */
+	port_cnt = smcibdev->ibdev->phys_port_cnt;
+	for (i = 0;
+	     i < min_t(size_t, port_cnt, SMC_MAX_PORTS);
+	     i++)
+		set_bit(i, &smcibdev->port_event_mask);
+	schedule_work(&smcibdev->port_event_work);
 }
 
 /* callback function for ib_register_client() */
@@ -512,6 +521,7 @@ static void smc_ib_remove_dev(struct ib_device *ibdev, void *client_data)
 	spin_unlock(&smc_ib_devices.lock);
 	smc_pnet_remove_by_ibdev(smcibdev);
 	smc_ib_cleanup_per_ibdev(smcibdev);
+	ib_unregister_event_handler(&smcibdev->event_handler);
 	kfree(smcibdev);
 }
 
diff --git a/net/smc/smc_ib.h b/net/smc/smc_ib.h
index e90630dadf8e..2c480b352928 100644
--- a/net/smc/smc_ib.h
+++ b/net/smc/smc_ib.h
@@ -51,7 +51,6 @@ struct smc_link;
 int smc_ib_register_client(void) __init;
 void smc_ib_unregister_client(void);
 bool smc_ib_port_active(struct smc_ib_device *smcibdev, u8 ibport);
-int smc_ib_remember_port_attr(struct smc_ib_device *smcibdev, u8 ibport);
 int smc_ib_buf_map_sg(struct smc_ib_device *smcibdev,
 		      struct smc_buf_desc *buf_slot,
 		      enum dma_data_direction data_direction);
diff --git a/net/smc/smc_pnet.c b/net/smc/smc_pnet.c
index d7b88b2d1b22..a82a5cad0282 100644
--- a/net/smc/smc_pnet.c
+++ b/net/smc/smc_pnet.c
@@ -358,9 +358,6 @@ static int smc_pnet_add(struct sk_buff *skb, struct genl_info *info)
 		kfree(pnetelem);
 		return rc;
 	}
-	rc = smc_ib_remember_port_attr(pnetelem->smcibdev, pnetelem->ib_port);
-	if (rc)
-		smc_pnet_remove_by_pnetid(pnetelem->pnet_name);
 	return rc;
 }
 
@@ -485,10 +482,10 @@ static int smc_pnet_netdev_event(struct notifier_block *this,
 	case NETDEV_REBOOT:
 	case NETDEV_UNREGISTER:
 		smc_pnet_remove_by_ndev(event_dev);
+		return NOTIFY_OK;
 	default:
-		break;
+		return NOTIFY_DONE;
 	}
-	return NOTIFY_DONE;
 }
 
 static struct notifier_block smc_netdev_notifier = {
-- 
2.16.4

^ permalink raw reply related

* [PATCH net-next 00/10] pnetid and SMC-D support
From: Ursula Braun @ 2018-06-28 17:05 UTC (permalink / raw)
  To: davem
  Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, hwippel,
	sebott, ubraun

Dave,

SMC requires a configured pnet table to map Ethernet interfaces to
RoCE adapter ports. For s390 there exists hardware support to group
such devices. The first three patches cover the s390 pnetid support,
enabling SMC-R usage on s390 without configuring an extra pnet table.

SMC currently requires RoCE adapters and uses RDMA techniques
implemented with IB verbs. But s390 offers another method for
intra-CEC Shared Memory communication. The following seven patches
implement a solution to run SMC traffic based on intra-CEC DMA,
called SMC-D.

Thanks, Ursula

Hans Wippel (6):
  net/smc: add base infrastructure for SMC-D and ISM
  net/smc: add pnetid support for SMC-D and ISM
  net/smc: add SMC-D support in CLC messages
  net/smc: add SMC-D support in data transfer
  net/smc: add SMC-D support in af_smc
  net/smc: add SMC-D diag support

Sebastian Ott (1):
  s390/ism: add device driver for internal shared memory

Ursula Braun (3):
  net/smc: determine port attributes independent from pnet table
  net/smc: add pnetid support
  net/smc: optimize consumer cursor updates

 drivers/s390/net/Kconfig      |  10 +
 drivers/s390/net/Makefile     |   3 +
 drivers/s390/net/ism.h        | 221 +++++++++++++++
 drivers/s390/net/ism_drv.c    | 623 ++++++++++++++++++++++++++++++++++++++++++
 include/net/smc.h             |  65 +++++
 include/uapi/linux/smc_diag.h |  10 +
 net/smc/Makefile              |   2 +-
 net/smc/af_smc.c              | 228 ++++++++++++++--
 net/smc/smc.h                 |   7 +-
 net/smc/smc_cdc.c             |  86 +++++-
 net/smc/smc_cdc.h             |  43 ++-
 net/smc/smc_clc.c             | 193 +++++++++----
 net/smc/smc_clc.h             |  81 ++++--
 net/smc/smc_core.c            | 285 ++++++++++++++-----
 net/smc/smc_core.h            |  72 +++--
 net/smc/smc_diag.c            |  18 +-
 net/smc/smc_ib.c              | 134 +++++----
 net/smc/smc_ib.h              |   4 +-
 net/smc/smc_ism.c             | 314 +++++++++++++++++++++
 net/smc/smc_ism.h             |  48 ++++
 net/smc/smc_pnet.c            | 157 +++++++++--
 net/smc/smc_pnet.h            |  16 ++
 net/smc/smc_rx.c              |   2 +-
 net/smc/smc_tx.c              | 205 +++++++++++---
 net/smc/smc_tx.h              |   2 +
 25 files changed, 2505 insertions(+), 324 deletions(-)
 create mode 100644 drivers/s390/net/ism.h
 create mode 100644 drivers/s390/net/ism_drv.c
 create mode 100644 net/smc/smc_ism.c
 create mode 100644 net/smc/smc_ism.h

-- 
2.16.4

^ permalink raw reply

* Re: [PATCH net-next v2 3/4] net: check tunnel option type in tunnel flags
From: Jiri Benc @ 2018-06-28 17:01 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Daniel Borkmann, davem, Roopa Prabhu, jiri, jhs, xiyou.wangcong,
	oss-drivers, netdev, Pieter Jansen van Vuuren
In-Reply-To: <20180628095452.6f23fdf4@cakuba.netronome.com>

On Thu, 28 Jun 2018 09:54:52 -0700, Jakub Kicinski wrote:
> Hmm... in practice we could steal top bits of the size parameter for
> some flags, since it seems to be limited to values < 256 today?  Is it
> worth it?
> 
> It would look something along the lines of:

Something like that, yes. I'll leave it to Daniel to review how much sense
it makes from the BPF side.

Thanks!

 Jiri

^ permalink raw reply

* [PATCH net-next 4/4] selftests: forwarding: mirror_gre_changes: Fix waiting for neighbor
From: Petr Machata @ 2018-06-28 16:56 UTC (permalink / raw)
  To: netdev, linux-kselftest; +Cc: davem, shuah
In-Reply-To: <cover.1530204784.git.petrm@mellanox.com>

When running the test on soft devices, there's no mechanism to
gratuitously start resolving the neighbor for remote tunnel endpoint.
So instead of passively waiting, wait for the device to be up, and then
probe the neighbor with a ping.

Signed-off-by: Petr Machata <petrm@mellanox.com>
---
 tools/testing/selftests/net/forwarding/mirror_gre_changes.sh | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/tools/testing/selftests/net/forwarding/mirror_gre_changes.sh b/tools/testing/selftests/net/forwarding/mirror_gre_changes.sh
index aa29d46186a8..135902aa8b11 100755
--- a/tools/testing/selftests/net/forwarding/mirror_gre_changes.sh
+++ b/tools/testing/selftests/net/forwarding/mirror_gre_changes.sh
@@ -122,15 +122,8 @@ test_span_gre_egress_up()
 	# After setting the device up, wait for neighbor to get resolved so that
 	# we can expect mirroring to work.
 	ip link set dev $swp3 up
-	while true; do
-		ip neigh sh dev $swp3 $remote_ip nud reachable |
-		    grep -q ^
-		if [[ $? -ne 0 ]]; then
-			sleep 1
-		else
-			break
-		fi
-	done
+	setup_wait_dev $swp3
+	ping -c 1 -I $swp3 $remote_ip &>/dev/null
 
 	quick_test_span_gre_dir $tundev ingress
 	mirror_uninstall $swp1 ingress
-- 
2.4.11

^ permalink raw reply related

* [PATCH net-next 3/4] selftests: forwarding: Tweak tc filters for mirror-to-gretap tests
From: Petr Machata @ 2018-06-28 16:56 UTC (permalink / raw)
  To: netdev, linux-kselftest; +Cc: davem, shuah
In-Reply-To: <cover.1530204784.git.petrm@mellanox.com>

When running mirror_gre_bridge_1d_vlan tests on veth, several issues
cause spurious failures:

- vlan_ethtype should be ip, not ipv6 even in mirror-to-ip6gretap case,
  because the overlay packet is still IPv4.
- Similarly, ip_proto matches the innermost IP protocol, so it can't be
  used to filter out GRE packets. Drop the corresponding condition.
- Because the above fixes the filters to match in slow path as well,
  they need to be made skip_hw so as not to double-count packets.

Signed-off-by: Petr Machata <petrm@mellanox.com>
---
 tools/testing/selftests/net/forwarding/mirror_gre_bridge_1d_vlan.sh | 6 ++++--
 tools/testing/selftests/net/forwarding/mirror_gre_lib.sh            | 2 +-
 tools/testing/selftests/net/forwarding/mirror_gre_vlan_bridge_1q.sh | 6 ++++--
 3 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/net/forwarding/mirror_gre_bridge_1d_vlan.sh b/tools/testing/selftests/net/forwarding/mirror_gre_bridge_1d_vlan.sh
index 3bb4c2ba7b14..197e769c2ed1 100755
--- a/tools/testing/selftests/net/forwarding/mirror_gre_bridge_1d_vlan.sh
+++ b/tools/testing/selftests/net/forwarding/mirror_gre_bridge_1d_vlan.sh
@@ -74,12 +74,14 @@ test_vlan_match()
 
 test_gretap()
 {
-	test_vlan_match gt4 'vlan_id 555 vlan_ethtype ip' "mirror to gretap"
+	test_vlan_match gt4 'skip_hw vlan_id 555 vlan_ethtype ip' \
+			"mirror to gretap"
 }
 
 test_ip6gretap()
 {
-	test_vlan_match gt6 'vlan_id 555 vlan_ethtype ipv6' "mirror to ip6gretap"
+	test_vlan_match gt6 'skip_hw vlan_id 555 vlan_ethtype ip' \
+			"mirror to ip6gretap"
 }
 
 test_gretap_stp()
diff --git a/tools/testing/selftests/net/forwarding/mirror_gre_lib.sh b/tools/testing/selftests/net/forwarding/mirror_gre_lib.sh
index 619b469365be..1c18e332cd4f 100644
--- a/tools/testing/selftests/net/forwarding/mirror_gre_lib.sh
+++ b/tools/testing/selftests/net/forwarding/mirror_gre_lib.sh
@@ -62,7 +62,7 @@ full_test_span_gre_dir_vlan_ips()
 			  "$backward_type" "$ip1" "$ip2"
 
 	tc filter add dev $h3 ingress pref 77 prot 802.1q \
-		flower $vlan_match ip_proto 0x2f \
+		flower $vlan_match \
 		action pass
 	mirror_test v$h1 $ip1 $ip2 $h3 77 10
 	tc filter del dev $h3 ingress pref 77
diff --git a/tools/testing/selftests/net/forwarding/mirror_gre_vlan_bridge_1q.sh b/tools/testing/selftests/net/forwarding/mirror_gre_vlan_bridge_1q.sh
index 1ac5038ae256..d3e75bb6a2d8 100755
--- a/tools/testing/selftests/net/forwarding/mirror_gre_vlan_bridge_1q.sh
+++ b/tools/testing/selftests/net/forwarding/mirror_gre_vlan_bridge_1q.sh
@@ -88,12 +88,14 @@ test_vlan_match()
 
 test_gretap()
 {
-	test_vlan_match gt4 'vlan_id 555 vlan_ethtype ip' "mirror to gretap"
+	test_vlan_match gt4 'skip_hw vlan_id 555 vlan_ethtype ip' \
+			"mirror to gretap"
 }
 
 test_ip6gretap()
 {
-	test_vlan_match gt6 'vlan_id 555 vlan_ethtype ipv6' "mirror to ip6gretap"
+	test_vlan_match gt6 'skip_hw vlan_id 555 vlan_ethtype ip' \
+			"mirror to ip6gretap"
 }
 
 test_span_gre_forbidden_cpu()
-- 
2.4.11

^ permalink raw reply related

* [PATCH net-next 2/4] selftests: forwarding: lib: Avoid trapping soft devices
From: Petr Machata @ 2018-06-28 16:56 UTC (permalink / raw)
  To: netdev, linux-kselftest; +Cc: davem, shuah
In-Reply-To: <cover.1530204784.git.petrm@mellanox.com>

There are several cases where traffic that would normally be forwarded
in silicon needs to be observed in slow path. That's achieved by
trapping such traffic, and the functions trap_install() and
trap_uninstall() realize that. However, such treatment is obviously
wrong if the device in question is actually a soft device not backed by
an ASIC.

Therefore try to trap if possible, but fall back to inserting a continue
if not.

Signed-off-by: Petr Machata <petrm@mellanox.com>
---
 tools/testing/selftests/net/forwarding/lib.sh | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index ac1df4860fbe..d1f14f83979e 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -479,9 +479,15 @@ trap_install()
 	local dev=$1; shift
 	local direction=$1; shift
 
-	# For slow-path testing, we need to install a trap to get to
-	# slow path the packets that would otherwise be switched in HW.
-	tc filter add dev $dev $direction pref 1 flower skip_sw action trap
+	# Some devices may not support or need in-hardware trapping of traffic
+	# (e.g. the veth pairs that this library creates for non-existent
+	# loopbacks). Use continue instead, so that there is a filter in there
+	# (some tests check counters), and so that other filters are still
+	# processed.
+	tc filter add dev $dev $direction pref 1 \
+		flower skip_sw action trap 2>/dev/null \
+	    || tc filter add dev $dev $direction pref 1 \
+		       flower action continue
 }
 
 trap_uninstall()
@@ -489,11 +495,13 @@ trap_uninstall()
 	local dev=$1; shift
 	local direction=$1; shift
 
-	tc filter del dev $dev $direction pref 1 flower skip_sw
+	tc filter del dev $dev $direction pref 1 flower
 }
 
 slow_path_trap_install()
 {
+	# For slow-path testing, we need to install a trap to get to
+	# slow path the packets that would otherwise be switched in HW.
 	if [ "${tcflags/skip_hw}" != "$tcflags" ]; then
 		trap_install "$@"
 	fi
-- 
2.4.11

^ permalink raw reply related

* [PATCH net-next 1/4] selftests: forwarding: lib: Split out setup_wait_dev()
From: Petr Machata @ 2018-06-28 16:56 UTC (permalink / raw)
  To: netdev, linux-kselftest; +Cc: davem, shuah
In-Reply-To: <cover.1530204784.git.petrm@mellanox.com>

Split out of setup_wait() a function setup_wait_dev() that waits for a
single device. This gives tests the opportunity to wait for a selected
device after they tinkered with its upness.

Signed-off-by: Petr Machata <petrm@mellanox.com>
---
 tools/testing/selftests/net/forwarding/lib.sh | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index 1dfdf14894e2..ac1df4860fbe 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -185,18 +185,25 @@ log_info()
 	echo "INFO: $msg"
 }
 
+setup_wait_dev()
+{
+	local dev=$1; shift
+
+	while true; do
+		ip link show dev $dev up \
+			| grep 'state UP' &> /dev/null
+		if [[ $? -ne 0 ]]; then
+			sleep 1
+		else
+			break
+		fi
+	done
+}
+
 setup_wait()
 {
 	for i in $(eval echo {1..$NUM_NETIFS}); do
-		while true; do
-			ip link show dev ${NETIFS[p$i]} up \
-				| grep 'state UP' &> /dev/null
-			if [[ $? -ne 0 ]]; then
-				sleep 1
-			else
-				break
-			fi
-		done
+		setup_wait_dev ${NETIFS[p$i]}
 	done
 
 	# Make sure links are ready.
-- 
2.4.11

^ permalink raw reply related

* [PATCH net-next 0/4] Fixes for running mirror-to-gretap tests on veth
From: Petr Machata @ 2018-06-28 16:56 UTC (permalink / raw)
  To: netdev, linux-kselftest; +Cc: davem, shuah

The forwarding selftests infrastructure makes it possible to run the
individual tests on purely software netdevices. Names of interfaces to
run the test with can be passed as command line arguments to a test.
lib.sh then creates veth pairs backing the interfaces if none exist in
the system.

However, the tests need to recognize that they might be run on a soft
device. Many mirror-to-gretap tests are buggy in this regard. This patch
set aims to fix the problems in running mirror-to-gretap tests on veth
devices.

In patch #1, a service function is split out of setup_wait().
In patch #2, installing a trap is made optional.
In patch #3, tc filters in several tests are tweaked to work with veth.
In patch #4, the logic for waiting for neighbor is fixed for veth.

Petr Machata (4):
  selftests: forwarding: lib: Split out setup_wait_dev()
  selftests: forwarding: lib: Avoid trapping soft devices
  selftests: forwarding: Tweak tc filters for mirror-to-gretap tests
  selftests: forwarding: mirror_gre_changes: Fix waiting for neighbor

 tools/testing/selftests/net/forwarding/lib.sh      | 41 +++++++++++++++-------
 .../net/forwarding/mirror_gre_bridge_1d_vlan.sh    |  6 ++--
 .../selftests/net/forwarding/mirror_gre_changes.sh | 11 ++----
 .../selftests/net/forwarding/mirror_gre_lib.sh     |  2 +-
 .../net/forwarding/mirror_gre_vlan_bridge_1q.sh    |  6 ++--
 5 files changed, 39 insertions(+), 27 deletions(-)

-- 
2.4.11


* Re: [PATCH net-next v2 3/4] net: check tunnel option type in tunnel flags
From: Jakub Kicinski @ 2018-06-28 16:54 UTC (permalink / raw)
  To: Jiri Benc
  Cc: Daniel Borkmann, davem, Roopa Prabhu, jiri, jhs, xiyou.wangcong,
	oss-drivers, netdev, Pieter Jansen van Vuuren
In-Reply-To: <20180628094206.62b6d8e2@redhat.com>

On Thu, 28 Jun 2018 09:42:06 +0200, Jiri Benc wrote:
> On Wed, 27 Jun 2018 11:49:49 +0200, Daniel Borkmann wrote:
> > Looks good to me, and yes in BPF case a mask like TUNNEL_OPTIONS_PRESENT is
> > right approach since this is opaque info and solely defined by the BPF prog
> > that is using the generic helper.  
> 
> Wouldn't it make sense to introduce some safeguards here (in a backward
> compatible way, of course)? It's easy to mistakenly set data for a
> different tunnel type in a BPF program and then be surprised by the
> result. It might help users if such usage was detected by the kernel,
> one way or another.

Well, that's how it works today ;)

> I'm thinking about something like the BPF program voluntarily
> specifying the type of the data; if not specified, the wildcard would be
> used as it is now.

Hmm... in practice we could steal top bits of the size parameter for
some flags, since it seems to be limited to values < 256 today?  Is it
worth it?

It would look something along the lines of:

---

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 59b19b6a40d7..194b40efa8e8 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2213,6 +2213,13 @@ enum bpf_func_id {
 /* BPF_FUNC_perf_event_output for sk_buff input context. */
 #define BPF_F_CTXLEN_MASK              (0xfffffULL << 32)
 
+#define BPF_F_TUN_VXLAN                        (1U << 31)
+#define BPF_F_TUN_GENEVE               (1U << 30)
+#define BPF_F_TUN_ERSPAN               (1U << 29)
+#define BPF_F_TUN_FLAGS_ALL            (BPF_F_TUN_VXLAN | \
+                                        BPF_F_TUN_GENEVE | \
+                                        BPF_F_TUN_ERSPAN)
+
 /* Mode for BPF_FUNC_skb_adjust_room helper. */
 enum bpf_adj_room_mode {
        BPF_ADJ_ROOM_NET,
diff --git a/net/core/filter.c b/net/core/filter.c
index dade922678f6..cc592a1e8945 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3576,6 +3576,22 @@ BPF_CALL_3(bpf_skb_set_tunnel_opt, struct sk_buff *, skb,
 {
        struct ip_tunnel_info *info = skb_tunnel_info(skb);
        const struct metadata_dst *md = this_cpu_ptr(md_dst);
+       __be16 tun_flags = 0;
+       u32 flags;
+
+       BUILD_BUG_ON(BPF_F_TUN_FLAGS_ALL & IP_TUNNEL_OPTS_MAX);
+
+       flags = size & BPF_F_TUN_FLAGS_ALL;
+       size &= ~flags;
+       if (flags & BPF_F_TUN_VXLAN)
+               tun_flags |= TUNNEL_VXLAN_OPT;
+       if (flags & BPF_F_TUN_GENEVE)
+               tun_flags |= TUNNEL_GENEVE_OPT;
+       if (flags & BPF_F_TUN_ERSPAN)
+               tun_flags |= TUNNEL_ERSPAN_OPT;
+       /* User didn't specify the tunnel type, for backward compat set all */
+       if (!(tun_flags & TUNNEL_OPTIONS_PRESENT))
+               tun_flags |= TUNNEL_OPTIONS_PRESENT;
 
        if (unlikely(info != &md->u.tun_info || (size & (sizeof(u32) - 1))))
                return -EINVAL;


* [PATCH bpf-next 03/14] bpf: pass a pointer to a cgroup storage using pcpu variable
From: Roman Gushchin @ 2018-06-28 16:47 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, kernel-team, tj, Roman Gushchin, Alexei Starovoitov,
	Daniel Borkmann
In-Reply-To: <20180628164719.28215-1-guro@fb.com>

This commit introduces the bpf_cgroup_storage_set() helper,
which will be used to pass a pointer to a cgroup storage
to the bpf helper.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/bpf-cgroup.h | 14 ++++++++++++++
 kernel/bpf/local_storage.c |  2 ++
 2 files changed, 16 insertions(+)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index b4e2e42c1d2a..128fb0e39b4d 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -20,6 +20,8 @@ struct bpf_cgroup_storage;
 extern struct static_key_false cgroup_bpf_enabled_key;
 #define cgroup_bpf_enabled static_branch_unlikely(&cgroup_bpf_enabled_key)
 
+DECLARE_PER_CPU(void*, bpf_cgroup_storage);
+
 struct bpf_cgroup_storage_map;
 
 struct bpf_storage_buffer {
@@ -96,6 +98,17 @@ int __cgroup_bpf_run_filter_sock_ops(struct sock *sk,
 int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor,
 				      short access, enum bpf_attach_type type);
 
+static inline void bpf_cgroup_storage_set(struct bpf_cgroup_storage *storage)
+{
+	struct bpf_storage_buffer *buf;
+
+	if (!storage)
+		return;
+
+	buf = rcu_dereference(storage->buf);
+	this_cpu_write(bpf_cgroup_storage, &buf->data[0]);
+}
+
 struct bpf_cgroup_storage *bpf_cgroup_storage_alloc(struct bpf_prog *prog);
 void bpf_cgroup_storage_free(struct bpf_cgroup_storage *storage);
 void bpf_cgroup_storage_link(struct bpf_cgroup_storage *storage,
@@ -223,6 +236,7 @@ struct cgroup_bpf {};
 static inline void cgroup_bpf_put(struct cgroup *cgrp) {}
 static inline int cgroup_bpf_inherit(struct cgroup *cgrp) { return 0; }
 
+static inline void bpf_cgroup_storage_set(struct bpf_cgroup_storage *storage) {}
 static inline int bpf_cgroup_storage_assign(struct bpf_prog *prog,
 					    struct bpf_map *map) { return 0; }
 static inline void bpf_cgroup_storage_release(struct bpf_prog *prog,
diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c
index 940889eda2c7..38810a712971 100644
--- a/kernel/bpf/local_storage.c
+++ b/kernel/bpf/local_storage.c
@@ -7,6 +7,8 @@
 #include <linux/rbtree.h>
 #include <linux/slab.h>
 
+DEFINE_PER_CPU(void*, bpf_cgroup_storage);
+
 #ifdef CONFIG_CGROUP_BPF
 
 struct bpf_cgroup_storage_map {
-- 
2.14.4


* [PATCH bpf-next 14/14] samples/bpf: extend test_cgrp2_attach2 test to use cgroup storage
From: Roman Gushchin @ 2018-06-28 16:47 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, kernel-team, tj, Roman Gushchin, Alexei Starovoitov,
	Daniel Borkmann
In-Reply-To: <20180628164719.28215-1-guro@fb.com>

The test_cgrp2_attach2 test covers bpf cgroup attachment code well,
so let's re-use it for testing allocation/releasing of cgroup storage.

The extension is pretty straightforward: the bpf program will use
the cgroup storage to save the number of transmitted bytes.

Expected output:
  $ ./test_cgrp2_attach2
  Attached DROP prog. This ping in cgroup /foo should fail...
  ping: sendmsg: Operation not permitted
  Attached DROP prog. This ping in cgroup /foo/bar should fail...
  ping: sendmsg: Operation not permitted
  Attached PASS prog. This ping in cgroup /foo/bar should pass...
  Detached PASS from /foo/bar while DROP is attached to /foo.
  This ping in cgroup /foo/bar should fail...
  ping: sendmsg: Operation not permitted
  Attached PASS from /foo/bar and detached DROP from /foo.
  This ping in cgroup /foo/bar should pass...
  ### override:PASS
  ### multi:PASS

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
---
 samples/bpf/test_cgrp2_attach2.c | 27 ++++++++++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/samples/bpf/test_cgrp2_attach2.c b/samples/bpf/test_cgrp2_attach2.c
index b453e6a161be..f682e0b8aa83 100644
--- a/samples/bpf/test_cgrp2_attach2.c
+++ b/samples/bpf/test_cgrp2_attach2.c
@@ -8,7 +8,8 @@
  *   information. The number of invocations of the program, which maps
  *   to the number of packets received, is stored to key 0. Key 1 is
  *   incremented on each iteration by the number of bytes stored in
- *   the skb.
+ *   the skb. The program also stores the number of received bytes
+ *   in the cgroup storage.
  *
  * - Attaches the new program to a cgroup using BPF_PROG_ATTACH
  *
@@ -21,12 +22,15 @@
 #include <stdio.h>
 #include <stdlib.h>
 #include <assert.h>
+#include <sys/resource.h>
+#include <sys/time.h>
 #include <unistd.h>
 
 #include <linux/bpf.h>
 #include <bpf/bpf.h>
 
 #include "bpf_insn.h"
+#include "bpf_rlimit.h"
 #include "cgroup_helpers.h"
 
 #define FOO		"/foo"
@@ -205,6 +209,8 @@ static int map_fd = -1;
 
 static int prog_load_cnt(int verdict, int val)
 {
+	int cgroup_storage_fd;
+
 	if (map_fd < 0)
 		map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, 4, 8, 1, 0);
 	if (map_fd < 0) {
@@ -212,6 +218,13 @@ static int prog_load_cnt(int verdict, int val)
 		return -1;
 	}
 
+	cgroup_storage_fd = bpf_create_map(BPF_MAP_TYPE_CGROUP_STORAGE,
+				sizeof(struct bpf_cgroup_storage_key), 8, 0, 0);
+	if (cgroup_storage_fd < 0) {
+		printf("failed to create map '%s'\n", strerror(errno));
+		return -1;
+	}
+
 	struct bpf_insn prog[] = {
 		BPF_MOV32_IMM(BPF_REG_0, 0),
 		BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
@@ -222,6 +235,11 @@ static int prog_load_cnt(int verdict, int val)
 		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
 		BPF_MOV64_IMM(BPF_REG_1, val), /* r1 = 1 */
 		BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
+		BPF_LD_MAP_FD(BPF_REG_1, cgroup_storage_fd),
+		BPF_MOV64_IMM(BPF_REG_2, 0),
+		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_get_local_storage),
+		BPF_MOV64_IMM(BPF_REG_1, val),
+		BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_W, BPF_REG_0, BPF_REG_1, 0, 0),
 		BPF_MOV64_IMM(BPF_REG_0, verdict), /* r0 = verdict */
 		BPF_EXIT_INSN(),
 	};
@@ -237,6 +255,7 @@ static int prog_load_cnt(int verdict, int val)
 		printf("Output from verifier:\n%s\n-------\n", bpf_log_buf);
 		return 0;
 	}
+	close(cgroup_storage_fd);
 	return ret;
 }
 
@@ -414,6 +433,12 @@ static int test_multiprog(void)
 int main(int argc, char **argv)
 {
 	int rc = 0;
+	struct rlimit r = {1024*1024, RLIM_INFINITY};
+
+	if (setrlimit(RLIMIT_MEMLOCK, &r)) {
+		log_err("Setrlimit(RLIMIT_MEMLOCK) failed");
+		return 1;
+	}
 
 	rc = test_foo_bar();
 	if (rc)
-- 
2.14.4
