* [RFC 1/7] IB/core: Introduce peer client interface
[not found] ` <1455207177-11949-1-git-send-email-artemyko-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2016-02-11 16:12 ` Artemy Kovalyov
2016-02-11 16:12 ` [RFC 2/7] IB/core: Get/put peer memory client Artemy Kovalyov
` (6 subsequent siblings)
7 siblings, 0 replies; 33+ messages in thread
From: Artemy Kovalyov @ 2016-02-11 16:12 UTC
To: dledford-H+wXaHxf7aLQT0dZR+AlfA
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
linux-mm-u79uwXL29TY76Z2rM5mHXA, leon-Dj0/KMMU01E,
haggaie-VPRAkNaXOzVWk0Htik3J/w, sagig-VPRAkNaXOzVWk0Htik3J/w,
Artemy Kovalyov
Introduces an API between IB core and peer memory clients (e.g. GPU cards)
to give the HCA read/write access to GPU memory.
As a result, RDMA-based applications can use GPU computing power and the
RDMA interconnect at the same time without copying data between the
P2P devices.
Each peer memory client should register with IB core. In the registration
request, it should supply callbacks for its basic memory functionality, such
as get/put pages, get_page_size, and dma map/unmap.
The client can optionally require the ability to invalidate memory it
provided, by requesting an invalidation callback.
Upon successful registration, IB core provides the client with a unique
registration handle and, if the peer requested it, an invalidate callback
function.
The handle should be used when unregistering the client. The callback
function is used in later patches, when the client asks IB core to
immediately release pinned pages.
Each peer must be able to recognize whether it is the owner of a specific
virtual address range. If it is, further calls for memory
functionality will be tunneled to that peer.
The recognition is done via the 'acquire' call. The call arguments provide
the address and size of the memory requested. Upon recognition, the
acquire call returns a peer direct client specific context. The peer
direct controller passes this context to the peer direct client
callbacks when referring to the specific address range.
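The registration and ownership check described above can be sketched in plain user-space C. This is only a model under stated assumptions: every demo_* name is hypothetical, the fixed-size table stands in for the mutex-protected list in the patch, and only the acquire callback is modelled (1 = owner, 0 = not the owner, negative = error).

```c
#include <stddef.h>

/* User-space model only; the demo_* names are hypothetical and are not
 * part of the kernel API proposed in this patch. */
struct demo_peer_client {
	/* Same contract as acquire: 1 = owner, 0 = not owner, <0 = error. */
	int (*acquire)(unsigned long addr, size_t size, void **client_context);
};

#define DEMO_MAX_CLIENTS 8
static struct demo_peer_client *demo_clients[DEMO_MAX_CLIENTS];

/* Dummy object that the example peer hands back as its private context. */
static int demo_gpu_ctx;

/* Example peer that claims the address range [0x1000, 0x2000). */
static int demo_gpu_acquire(unsigned long addr, size_t size,
			    void **client_context)
{
	if (addr < 0x1000UL || addr > 0x2000UL || size > 0x2000UL - addr)
		return 0;
	*client_context = &demo_gpu_ctx;
	return 1;
}

/* Returns an opaque registration handle, or NULL if the table is full. */
static void *demo_register(struct demo_peer_client *client)
{
	size_t i;

	for (i = 0; i < DEMO_MAX_CLIENTS; i++) {
		if (!demo_clients[i]) {
			demo_clients[i] = client;
			return &demo_clients[i];
		}
	}
	return NULL;
}

/* The handle returned at registration is all that unregistering needs. */
static void demo_unregister(void *reg_handle)
{
	struct demo_peer_client **slot = reg_handle;

	*slot = NULL;
}

/* Ask each registered peer in turn whether it owns the range. */
static struct demo_peer_client *demo_find_owner(unsigned long addr,
						size_t size,
						void **client_context)
{
	size_t i;

	for (i = 0; i < DEMO_MAX_CLIENTS; i++) {
		if (demo_clients[i] &&
		    demo_clients[i]->acquire(addr, size, client_context) > 0)
			return demo_clients[i];
	}
	return NULL;
}
```

In this model the context written by the acquire callback is what the core would later hand back to the same peer's other callbacks for that range.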
Signed-off-by: Artemy Kovalyov <artemyko-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
drivers/infiniband/Kconfig | 10 ++
drivers/infiniband/core/Makefile | 1 +
drivers/infiniband/core/peer_mem.c | 82 +++++++++++++
include/rdma/ib_peer_mem.h | 44 +++++++
include/rdma/peer_mem.h | 238 +++++++++++++++++++++++++++++++++++++
5 files changed, 375 insertions(+)
create mode 100644 drivers/infiniband/core/peer_mem.c
create mode 100644 include/rdma/ib_peer_mem.h
create mode 100644 include/rdma/peer_mem.h
diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index 8a8440c..2837d66 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -64,6 +64,16 @@ config INFINIBAND_ADDR_TRANS_CONFIGFS
This allows the user to config the default GID type that the CM
uses for each device, when initiaing new connections.
+config INFINIBAND_PEER_MEM
+ bool "InfiniBand Peer memory access"
+ depends on INFINIBAND_USER_MEM
+ depends on MMU_NOTIFIER
+ default y
+ ---help---
+ Peer memory access feature allows RDMA operations to directly target
+ memory in external hardware devices, such as GPU cards, SSD based
+ storage, dedicated ASIC accelerators, etc.
+
source "drivers/infiniband/hw/mthca/Kconfig"
source "drivers/infiniband/hw/qib/Kconfig"
source "drivers/infiniband/hw/cxgb3/Kconfig"
diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index f818538..9882d00 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -13,6 +13,7 @@ ib_core-y := packer.o ud_header.o verbs.o cq.o sysfs.o \
roce_gid_mgmt.o
ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o umem_rbtree.o
+ib_core-$(CONFIG_INFINIBAND_PEER_MEM) += peer_mem.o
ib_mad-y := mad.o smi.o agent.o mad_rmpp.o
diff --git a/drivers/infiniband/core/peer_mem.c b/drivers/infiniband/core/peer_mem.c
new file mode 100644
index 0000000..2c26a39
--- /dev/null
+++ b/drivers/infiniband/core/peer_mem.c
@@ -0,0 +1,82 @@
+/*
+ * Copyright (c) 2016, Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <rdma/ib_peer_mem.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_umem.h>
+
+static DEFINE_MUTEX(peer_memory_mutex);
+static LIST_HEAD(peer_memory_list);
+
+static int ib_invalidate_peer_memory(void *reg_handle, u64 core_context)
+{
+ return -ENOSYS;
+}
+
+void *ib_register_peer_memory_client(struct peer_memory_client *peer_client,
+ int (**invalidate_callback)
+ (void *reg_handle, u64 core_context))
+{
+ struct ib_peer_memory_client *ib_peer_client;
+
+ ib_peer_client = kzalloc(sizeof(*ib_peer_client), GFP_KERNEL);
+ if (!ib_peer_client)
+ return NULL;
+
+ ib_peer_client->peer_mem = peer_client;
+ /* A non-NULL invalidate_callback pointer from the peer indicates that
+ * it requires invalidation support for any memory it owns.
+ */
+ if (invalidate_callback) {
+ *invalidate_callback = ib_invalidate_peer_memory;
+ ib_peer_client->invalidation_required = 1;
+ }
+
+ mutex_lock(&peer_memory_mutex);
+ list_add_tail(&ib_peer_client->core_peer_list, &peer_memory_list);
+ mutex_unlock(&peer_memory_mutex);
+
+ return ib_peer_client;
+}
+EXPORT_SYMBOL(ib_register_peer_memory_client);
+
+void ib_unregister_peer_memory_client(void *reg_handle)
+{
+ struct ib_peer_memory_client *ib_peer_client = reg_handle;
+
+ mutex_lock(&peer_memory_mutex);
+ list_del(&ib_peer_client->core_peer_list);
+ mutex_unlock(&peer_memory_mutex);
+
+ kfree(ib_peer_client);
+}
+EXPORT_SYMBOL(ib_unregister_peer_memory_client);
diff --git a/include/rdma/ib_peer_mem.h b/include/rdma/ib_peer_mem.h
new file mode 100644
index 0000000..cbe928e
--- /dev/null
+++ b/include/rdma/ib_peer_mem.h
@@ -0,0 +1,44 @@
+/*
+ * Copyright (c) 2016, Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#if !defined(IB_PEER_MEM_H)
+#define IB_PEER_MEM_H
+
+#include <rdma/peer_mem.h>
+
+struct ib_peer_memory_client {
+ const struct peer_memory_client *peer_mem;
+ struct list_head core_peer_list;
+ int invalidation_required;
+};
+
+#endif
diff --git a/include/rdma/peer_mem.h b/include/rdma/peer_mem.h
new file mode 100644
index 0000000..1ec96ea
--- /dev/null
+++ b/include/rdma/peer_mem.h
@@ -0,0 +1,238 @@
+/*
+ * Copyright (c) 2016, Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#if !defined(PEER_MEM_H)
+#define PEER_MEM_H
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/slab.h>
+#include <linux/errno.h>
+#include <linux/export.h>
+#include <linux/scatterlist.h>
+
+/**
+ * struct peer_memory_client - registration information for peer client.
+ * @acquire: callback function to be used by IB core to detect whether a
+ * virtual address is under the responsibility of a specific
+ * peer client.
+ * @get_pages: callback function to be used by IB core asking the peer client
+ * to pin the physical pages of the given address range and return
+ * that information. It is equivalent to the kernel API
+ * get_user_pages(), but targets peer memory.
+ * @dma_map: callback function to be used by IB core asking the peer client
+ * to fill the dma address mapping for a given address range.
+ * @dma_unmap: callback function to be used by IB core asking the peer client
+ * to take relevant actions to unmap the memory.
+ * @put_pages: callback function to be used by IB core asking the peer client
+ * to remove the pinning from the given memory.
+ * It's the peer-direct equivalent of the kernel API put_page.
+ * @get_page_size: callback function to be used by IB core to query the peer
+ * client for the page size for the given allocation.
+ * @release: callback function to be used by IB core asking peer client to
+ * release all resources associated with previous acquire call.
+ * The call will be performed only for contexts that have been
+ * successfully acquired (i.e. acquire returned a positive value).
+ * Additionally, IB core guarantees that there will be no pages
+ * pinned through this context when the callback is called.
+ *
+ * The subsections in this description contain detailed description
+ * of the callback arguments and expected return values for the
+ * callbacks defined in this struct.
+ *
+ * acquire:
+ *
+ * Callback function to be used by IB core to detect
+ * whether a virtual address is under the responsibility
+ * of a specific peer client.
+ *
+ * addr [IN] - virtual address to check for ownership by the
+ * peer.
+ *
+ * size [IN] - size of memory area starting at addr.
+ *
+ * client_context [OUT] - peer opaque data which holds
+ * a peer context for the acquired
+ * address range, will be provided
+ * back to the peer memory in
+ * subsequent calls for that given
+ * memory.
+ *
+ * If peer takes responsibility for the given address range, further
+ * calls for memory management will be directed to the callbacks
+ * of this peer client.
+ *
+ * Return - 1 if the peer client takes responsibility for that
+ * range, a negative value if an error
+ * happens during processing, 0 otherwise.
+ *
+ * get_pages:
+ *
+ * Callback function to be used by IB core asking the
+ * peer client to pin the physical pages of the given
+ * address range and return that information. It is
+ * equivalent to the kernel API get_user_pages(), but
+ * targets peer memory.
+ *
+ * addr [IN] - start virtual address of that given
+ * allocation.
+ *
+ * size [IN] - size of memory area starting at addr.
+ *
+ * write [IN] - indicates whether the pages will be
+ * written to by the caller. Same meaning
+ * as of kernel API get_user_pages, can be
+ * ignored if not relevant.
+ *
+ * force [IN] - indicates whether to force write access
+ * even if user mapping is read only. Same
+ * meaning as of kernel API get_user_pages,
+ * can be ignored if not relevant.
+ *
+ * sg_head [IN/OUT] - pointer to head of struct sg_table.
+ * The peer client should allocate a
+ * table big enough to store all of the
+ * required entries. This function should
+ * fill the table with physical addresses
+ * and sizes of the memory segments
+ * composing this memory mapping. The
+ * table allocation can be done using
+ * sg_alloc_table. Filling in the
+ * physical memory addresses and size can
+ * be done using sg_set_page.
+ *
+ * client_context [IN] - peer context for the given allocation, as
+ * received from the acquire call.
+ *
+ * core_context [IN] - IB core context. If the peer client wishes
+ * to invalidate any of the pages pinned
+ * through this API, it must provide this
+ * context as an argument to the invalidate
+ * callback.
+ *
+ * Return - 0 success, otherwise errno error code.
+ *
+ * dma_map:
+ *
+ * Callback function to be used by IB core asking the peer client
+ * to fill the dma address mapping for a given address range.
+ *
+ * sg_head [IN/OUT] - pointer to head of struct sg_table.
+ * The peer memory should fill the
+ * dma_address and dma_length for each
+ * scatter gather entry in the table.
+ *
+ * client_context [IN] - peer context for the allocation mapped.
+ *
+ * dma_device [IN] - the RDMA capable device which
+ * requires access to the peer memory.
+ *
+ * dmasync [IN] - flush in-flight DMA when the memory region
+ * is written. Same meaning as with host
+ * memory mapping, can be ignored
+ * if not relevant.
+ *
+ * nmap [OUT] - number of mapped/set entries.
+ *
+ * Return - 0 success, otherwise errno error code.
+ *
+ * dma_unmap:
+ *
+ * Callback function to be used by IB core asking the peer client
+ * to take relevant actions to unmap the memory.
+ *
+ * sg_head [IN] - pointer to head of struct sg_table.
+ * The peer memory should release the
+ * dma_address and dma_length for each
+ * scatter gather entry in the table.
+ *
+ * client_context [IN] - peer context for the allocation mapped.
+ *
+ * dma_device [IN] - the RDMA capable device which requires
+ * access to the peer memory.
+ *
+ * Return - 0 success, otherwise errno error code.
+ *
+ * put_pages:
+ *
+ * Callback function to be used by IB core asking the peer client
+ * to remove the pinning from the given memory.
+ * It's the peer-direct equivalent of the kernel API put_page.
+ *
+ * sg_head [IN] - pointer to head of struct sg_table.
+ *
+ * client_context [IN] - peer context for that given allocation.
+ *
+ * get_page_size:
+ *
+ * Callback function to be used by IB core to query the
+ * peer client for the page size for the given
+ * allocation.
+ *
+ * client_context [IN] - peer context for that given allocation.
+ *
+ * Return - Page size in bytes
+ *
+ * release:
+ *
+ * Callback function to be used by IB core asking peer
+ * client to release all resources associated with
+ * previous acquire call. The call will be performed only
+ * for contexts that have been successfully acquired
+ * (i.e. acquire returned a positive value).
+ * Additionally, IB core guarantees that there will be no
+ * pages pinned through this context when the callback is
+ * called.
+ *
+ * client_context [IN] - peer context for the given allocation.
+ *
+ **/
+struct peer_memory_client {
+ int (*acquire)(unsigned long addr, size_t size, void **client_context);
+ int (*get_pages)(unsigned long addr, size_t size, int write, int force,
+ struct sg_table *sg_head,
+ void *client_context, u64 core_context);
+ int (*dma_map)(struct sg_table *sg_head, void *client_context,
+ struct device *dma_device, int dmasync, int *nmap);
+ int (*dma_unmap)(struct sg_table *sg_head, void *client_context,
+ struct device *dma_device);
+ void (*put_pages)(struct sg_table *sg_head, void *client_context);
+ unsigned long (*get_page_size)(void *client_context);
+ void (*release)(void *client_context);
+};
+
+void *ib_register_peer_memory_client(struct peer_memory_client *peer_client,
+ int (**invalidate_callback)
+ (void *reg_handle, u64 core_context));
+void ib_unregister_peer_memory_client(void *reg_handle);
+
+#endif
--
1.8.4.3
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* [RFC 2/7] IB/core: Get/put peer memory client
[not found] ` <1455207177-11949-1-git-send-email-artemyko-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-02-11 16:12 ` [RFC 1/7] IB/core: Introduce peer client interface Artemy Kovalyov
@ 2016-02-11 16:12 ` Artemy Kovalyov
2016-02-11 16:12 ` [RFC 3/7] IB/core: Umem tunneling peer memory APIs Artemy Kovalyov
` (5 subsequent siblings)
7 siblings, 0 replies; 33+ messages in thread
From: Artemy Kovalyov @ 2016-02-11 16:12 UTC
To: dledford-H+wXaHxf7aLQT0dZR+AlfA
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
linux-mm-u79uwXL29TY76Z2rM5mHXA, leon-Dj0/KMMU01E,
haggaie-VPRAkNaXOzVWk0Htik3J/w, sagig-VPRAkNaXOzVWk0Htik3J/w,
Artemy Kovalyov
Supplies an API to get/put a peer client.
It hides the details of how to acquire/release a peer client from
its callers and lets them get the required peer client if one exists.
The 'get' call iterates over the registered peer clients, looking for the
owner of a given address range by calling each peer's 'acquire' callback.
The loop stops as soon as an owner is found.
The 'put' call does the opposite: it lets the peer release its resources
for that address range.
A reference counting/completion mechanism prevents a peer memory client
from going away while there are active users of its memory.
In addition:
- an extra device capability named IB_DEVICE_PEER_MEMORY was introduced,
to be used by low level drivers to mark that they support this
functionality.
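The reference counting/completion mechanism described above can be modelled in single-threaded user-space C. This is a sketch under stated assumptions rather than the kernel code: the demo_* names are hypothetical, a plain integer stands in for struct kref, and a boolean stands in for struct completion (in the patch, unregister sleeps in wait_for_completion() where this model merely reports whether it would have to wait).

```c
#include <stdbool.h>

/* Hypothetical user-space model of the lifetime fields that the patch
 * adds to struct ib_peer_memory_client. */
struct demo_client {
	int refs;		/* models struct kref ref */
	bool unload_done;	/* models struct completion unload_comp */
};

/* Registration holds the initial reference, as kref_init() does. */
static void demo_client_init(struct demo_client *c)
{
	c->refs = 1;
	c->unload_done = false;
}

/* Models the kref_get() taken for each successful 'get'. */
static void demo_get(struct demo_client *c)
{
	c->refs++;
}

/* Models kref_put(): the last put signals the completion. */
static void demo_put(struct demo_client *c)
{
	if (--c->refs == 0)
		c->unload_done = true;
}

/* Models the start of unregister: drop the registration reference and
 * report whether the client could be freed right away. A false return
 * corresponds to the kernel code blocking until the last user calls put. */
static bool demo_begin_unregister(struct demo_client *c)
{
	demo_put(c);
	return c->unload_done;
}
```

The design point is that unregister never frees the client directly; it only drops its own reference, so whichever side drops the last reference is the one that lets the unload proceed.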
Signed-off-by: Artemy Kovalyov <artemyko-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
drivers/infiniband/core/peer_mem.c | 55 ++++++++++++++++++++++++++++++++++++++
include/rdma/ib_peer_mem.h | 12 +++++++++
include/rdma/ib_verbs.h | 5 ++++
3 files changed, 72 insertions(+)
diff --git a/drivers/infiniband/core/peer_mem.c b/drivers/infiniband/core/peer_mem.c
index 2c26a39..74e4caa 100644
--- a/drivers/infiniband/core/peer_mem.c
+++ b/drivers/infiniband/core/peer_mem.c
@@ -42,6 +42,14 @@ static int ib_invalidate_peer_memory(void *reg_handle, u64 core_context)
return -ENOSYS;
}
+static void complete_peer(struct kref *kref)
+{
+ struct ib_peer_memory_client *ib_peer_client =
+ container_of(kref, struct ib_peer_memory_client, ref);
+
+ complete(&ib_peer_client->unload_comp);
+}
+
void *ib_register_peer_memory_client(struct peer_memory_client *peer_client,
int (**invalidate_callback)
(void *reg_handle, u64 core_context))
@@ -52,6 +60,8 @@ void *ib_register_peer_memory_client(struct peer_memory_client *peer_client,
if (!ib_peer_client)
return NULL;
+ init_completion(&ib_peer_client->unload_comp);
+ kref_init(&ib_peer_client->ref);
ib_peer_client->peer_mem = peer_client;
/* Once peer supplied a non NULL callback it's an indication that
* invalidation support is required for any memory owning.
@@ -77,6 +87,51 @@ void ib_unregister_peer_memory_client(void *reg_handle)
list_del(&ib_peer_client->core_peer_list);
mutex_unlock(&peer_memory_mutex);
+ kref_put(&ib_peer_client->ref, complete_peer);
+ wait_for_completion(&ib_peer_client->unload_comp);
kfree(ib_peer_client);
}
EXPORT_SYMBOL(ib_unregister_peer_memory_client);
+
+struct ib_peer_memory_client *ib_get_peer_client(struct ib_ucontext *context,
+ unsigned long addr,
+ size_t size,
+ void **peer_client_context)
+{
+ struct ib_peer_memory_client *ib_peer_client;
+ int ret;
+
+ mutex_lock(&peer_memory_mutex);
+ list_for_each_entry(ib_peer_client, &peer_memory_list, core_peer_list) {
+ ret = ib_peer_client->peer_mem->acquire(addr, size,
+ peer_client_context);
+ if (ret > 0)
+ goto found;
+
+ if (ret < 0) {
+ mutex_unlock(&peer_memory_mutex);
+ return ERR_PTR(ret);
+ }
+ }
+
+ ib_peer_client = NULL;
+
+found:
+ if (ib_peer_client)
+ kref_get(&ib_peer_client->ref);
+
+ mutex_unlock(&peer_memory_mutex);
+
+ return ib_peer_client;
+}
+EXPORT_SYMBOL(ib_get_peer_client);
+
+void ib_put_peer_client(struct ib_peer_memory_client *ib_peer_client,
+ void *peer_client_context)
+{
+ if (ib_peer_client->peer_mem->release)
+ ib_peer_client->peer_mem->release(peer_client_context);
+
+ kref_put(&ib_peer_client->ref, complete_peer);
+}
+EXPORT_SYMBOL(ib_put_peer_client);
diff --git a/include/rdma/ib_peer_mem.h b/include/rdma/ib_peer_mem.h
index cbe928e..7f41ce5 100644
--- a/include/rdma/ib_peer_mem.h
+++ b/include/rdma/ib_peer_mem.h
@@ -35,10 +35,22 @@
#include <rdma/peer_mem.h>
+struct ib_ucontext;
+
struct ib_peer_memory_client {
const struct peer_memory_client *peer_mem;
struct list_head core_peer_list;
int invalidation_required;
+ struct kref ref;
+ struct completion unload_comp;
};
+struct ib_peer_memory_client *ib_get_peer_client(struct ib_ucontext *context,
+ unsigned long addr,
+ size_t size,
+ void **peer_client_context);
+
+void ib_put_peer_client(struct ib_peer_memory_client *ib_peer_client,
+ void *peer_client_context);
+
#endif
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 284b00c..3917230 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -209,6 +209,11 @@ enum ib_device_cap_flags {
* by hardware.
*/
IB_DEVICE_CROSS_CHANNEL = (1 << 27),
+ /*
+ * Device supports RDMA access to memory registered by
+ * other locally connected devices (e.g. GPU).
+ */
+ IB_DEVICE_PEER_MEMORY = (1 << 28),
IB_DEVICE_MANAGED_FLOW_STEERING = (1 << 29),
IB_DEVICE_SIGNATURE_HANDOVER = (1 << 30),
IB_DEVICE_ON_DEMAND_PAGING = (1 << 31),
--
1.8.4.3
* [RFC 3/7] IB/core: Umem tunneling peer memory APIs
[not found] ` <1455207177-11949-1-git-send-email-artemyko-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-02-11 16:12 ` [RFC 1/7] IB/core: Introduce peer client interface Artemy Kovalyov
2016-02-11 16:12 ` [RFC 2/7] IB/core: Get/put peer memory client Artemy Kovalyov
@ 2016-02-11 16:12 ` Artemy Kovalyov
2016-02-11 16:12 ` [RFC 4/7] IB/core: Infrastructure to manage peer core context Artemy Kovalyov
` (4 subsequent siblings)
7 siblings, 0 replies; 33+ messages in thread
From: Artemy Kovalyov @ 2016-02-11 16:12 UTC
To: dledford-H+wXaHxf7aLQT0dZR+AlfA
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
linux-mm-u79uwXL29TY76Z2rM5mHXA, leon-Dj0/KMMU01E,
haggaie-VPRAkNaXOzVWk0Htik3J/w, sagig-VPRAkNaXOzVWk0Htik3J/w,
Artemy Kovalyov
Builds umem on top of the peer memory client functionality.
It tries to get a peer client for a given address range.
If one is found, further memory calls are tunneled to that peer client.
ib_umem_get_flags was added as the successor of ib_umem_get to carry
additional flags, for instance an indication of whether this umem
may be backed by a peer client.
The deprecated ib_umem_get is kept for backward compatibility.
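The relationship between the old entry point and the new flags-based one can be illustrated with a small user-space model. It is a sketch under stated assumptions: the demo_* names and DEMO_* constants are hypothetical stand-ins for ib_umem_get, ib_umem_get_flags, IB_UMEM_DMA_SYNC and IB_UMEM_PEER_ALLOW, and the return value only reports which path would be taken instead of building a umem.

```c
/* Hypothetical mirrors of IB_UMEM_DMA_SYNC and IB_UMEM_PEER_ALLOW. */
enum demo_umem_flags {
	DEMO_UMEM_DMA_SYNC   = 1 << 0,
	DEMO_UMEM_PEER_ALLOW = 1 << 1,
};

/* Which registration path the model would take. */
enum demo_path {
	DEMO_PATH_HOST,	/* regular host memory pinning */
	DEMO_PATH_PEER,	/* calls tunneled to a peer client */
};

/* Models ib_umem_get_flags(): the peer path is tried only when the
 * caller opted in with PEER_ALLOW and some peer owns the range. */
static enum demo_path demo_umem_get_flags(unsigned long flags,
					  int peer_owns_range)
{
	if ((flags & DEMO_UMEM_PEER_ALLOW) && peer_owns_range)
		return DEMO_PATH_PEER;
	return DEMO_PATH_HOST;
}

/* Models the deprecated ib_umem_get(): dmasync is translated into a
 * flag and PEER_ALLOW is never set, so existing callers keep the host
 * path even when a peer owns the range. */
static enum demo_path demo_umem_get(int dmasync, int peer_owns_range)
{
	return demo_umem_get_flags(dmasync ? DEMO_UMEM_DMA_SYNC : 0,
				   peer_owns_range);
}
```

Under this model a low-level driver has to switch to the flags-based call and pass the peer-allow flag explicitly before any of its registrations can be routed to a peer.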
Signed-off-by: Artemy Kovalyov <artemyko-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
drivers/infiniband/core/umem.c | 100 ++++++++++++++++++++++++++++++++++++++---
include/rdma/ib_umem.h | 34 +++++++++++---
2 files changed, 123 insertions(+), 11 deletions(-)
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 38acb3c..2eab34e 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -43,6 +43,63 @@
#include "uverbs.h"
+#ifdef CONFIG_INFINIBAND_PEER_MEM
+static struct ib_umem *peer_umem_get(struct ib_peer_memory_client *ib_peer_mem,
+ struct ib_umem *umem, unsigned long addr,
+ int dmasync)
+{
+ int ret;
+ const struct peer_memory_client *peer_mem = ib_peer_mem->peer_mem;
+
+ umem->ib_peer_mem = ib_peer_mem;
+ /*
+ * We always request write permissions to the pages, to force breaking
+ * of any CoW during the registration of the MR. For read-only MRs we
+ * use the "force" flag to indicate that CoW breaking is required but
+ * the registration should not fail if referencing read-only areas.
+ */
+ ret = peer_mem->get_pages(addr, umem->length,
+ 1, !umem->writable,
+ &umem->sg_head,
+ umem->peer_mem_client_context,
+ 0);
+ if (ret)
+ goto out;
+
+ umem->page_size = peer_mem->get_page_size
+ (umem->peer_mem_client_context);
+ ret = peer_mem->dma_map(&umem->sg_head,
+ umem->peer_mem_client_context,
+ umem->context->device->dma_device,
+ dmasync,
+ &umem->nmap);
+ if (ret)
+ goto put_pages;
+
+ return umem;
+
+put_pages:
+ peer_mem->put_pages(&umem->sg_head,
+ umem->peer_mem_client_context);
+out:
+ ib_put_peer_client(ib_peer_mem, umem->peer_mem_client_context);
+ return ERR_PTR(ret);
+}
+
+static void peer_umem_release(struct ib_umem *umem)
+{
+ const struct peer_memory_client *peer_mem =
+ umem->ib_peer_mem->peer_mem;
+
+ peer_mem->dma_unmap(&umem->sg_head,
+ umem->peer_mem_client_context,
+ umem->context->device->dma_device);
+ peer_mem->put_pages(&umem->sg_head,
+ umem->peer_mem_client_context);
+ ib_put_peer_client(umem->ib_peer_mem, umem->peer_mem_client_context);
+ kfree(umem);
+}
+#endif
static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
{
@@ -69,7 +126,7 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
}
/**
- * ib_umem_get - Pin and DMA map userspace memory.
+ * ib_umem_get_flags - Pin and DMA map userspace memory.
*
* If access flags indicate ODP memory, avoid pinning. Instead, stores
* the mm for future page fault handling in conjunction with MMU notifiers.
@@ -78,10 +135,12 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
* @addr: userspace virtual address to start at
* @size: length of region to pin
* @access: IB_ACCESS_xxx flags for memory being pinned
- * @dmasync: flush in-flight DMA when the memory region is written
+ * @flags: IB_UMEM_xxx flags for memory being used
*/
-struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
- size_t size, int access, int dmasync)
+struct ib_umem *ib_umem_get_flags(struct ib_ucontext *context,
+ unsigned long addr,
+ size_t size, int access,
+ unsigned long flags)
{
struct ib_umem *umem;
struct page **page_list;
@@ -96,7 +155,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
struct scatterlist *sg, *sg_list_start;
int need_release = 0;
- if (dmasync)
+ if (flags & IB_UMEM_DMA_SYNC)
dma_set_attr(DMA_ATTR_WRITE_BARRIER, &attrs);
if (!size)
@@ -144,6 +203,28 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
umem->odp_data = NULL;
+#ifdef CONFIG_INFINIBAND_PEER_MEM
+ if (flags & IB_UMEM_PEER_ALLOW) {
+ struct ib_peer_memory_client *peer_mem_client;
+ struct ib_umem *peer_umem;
+
+ peer_mem_client =
+ ib_get_peer_client(context, addr, size,
+ &umem->peer_mem_client_context);
+ if (IS_ERR(peer_mem_client)) {
+ kfree(umem);
+ return ERR_CAST(peer_mem_client);
+
+ } else if (peer_mem_client) {
+ peer_umem = peer_umem_get(peer_mem_client, umem, addr,
+ flags & IB_UMEM_DMA_SYNC);
+ if (IS_ERR(peer_umem))
+ kfree(umem);
+ return peer_umem;
+ }
+ }
+#endif
+
/* We assume the memory is from hugetlb until proved otherwise */
umem->hugetlb = 1;
@@ -240,7 +321,7 @@ out:
return ret < 0 ? ERR_PTR(ret) : umem;
}
-EXPORT_SYMBOL(ib_umem_get);
+EXPORT_SYMBOL(ib_umem_get_flags);
static void ib_umem_account(struct work_struct *work)
{
@@ -264,6 +345,13 @@ void ib_umem_release(struct ib_umem *umem)
struct task_struct *task;
unsigned long diff;
+#ifdef CONFIG_INFINIBAND_PEER_MEM
+ if (umem->ib_peer_mem) {
+ peer_umem_release(umem);
+ return;
+ }
+#endif
+
if (umem->odp_data) {
ib_umem_odp_release(umem);
return;
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 2d83cfd..bb760f9 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -36,6 +36,9 @@
#include <linux/list.h>
#include <linux/scatterlist.h>
#include <linux/workqueue.h>
+#ifdef CONFIG_INFINIBAND_PEER_MEM
+#include <rdma/ib_peer_mem.h>
+#endif
struct ib_ucontext;
struct ib_umem_odp;
@@ -55,6 +58,12 @@ struct ib_umem {
struct sg_table sg_head;
int nmap;
int npages;
+#ifdef CONFIG_INFINIBAND_PEER_MEM
+ /* peer memory that manages this umem */
+ struct ib_peer_memory_client *ib_peer_mem;
+ /* peer memory private context */
+ void *peer_mem_client_context;
+#endif
};
/* Returns the offset of the umem start relative to the first page. */
@@ -80,10 +89,16 @@ static inline size_t ib_umem_num_pages(struct ib_umem *umem)
return (ib_umem_end(umem) - ib_umem_start(umem)) >> PAGE_SHIFT;
}
+enum ib_peer_mem_flags {
+ IB_UMEM_DMA_SYNC = (1 << 0),
+ IB_UMEM_PEER_ALLOW = (1 << 1),
+};
+
#ifdef CONFIG_INFINIBAND_USER_MEM
+struct ib_umem *ib_umem_get_flags(struct ib_ucontext *context,
+ unsigned long addr, size_t size,
+ int access, unsigned long flags);
-struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
- size_t size, int access, int dmasync);
void ib_umem_release(struct ib_umem *umem);
int ib_umem_page_count(struct ib_umem *umem);
int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offset,
@@ -93,11 +108,13 @@ int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offset,
#include <linux/err.h>
-static inline struct ib_umem *ib_umem_get(struct ib_ucontext *context,
- unsigned long addr, size_t size,
- int access, int dmasync) {
+static inline struct ib_umem *ib_umem_get_flags(struct ib_ucontext *context,
+ unsigned long addr, size_t size,
+ int access,
+ unsigned long flags) {
return ERR_PTR(-EINVAL);
}
+
static inline void ib_umem_release(struct ib_umem *umem) { }
static inline int ib_umem_page_count(struct ib_umem *umem) { return 0; }
static inline int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offset,
@@ -106,4 +123,11 @@ static inline int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offs
}
#endif /* CONFIG_INFINIBAND_USER_MEM */
+static inline struct ib_umem *ib_umem_get(struct ib_ucontext *context,
+ unsigned long addr, size_t size,
+ int access, int dmasync) {
+ return ib_umem_get_flags(context, addr, size, access,
+ dmasync ? IB_UMEM_DMA_SYNC : 0);
+}
+
#endif /* IB_UMEM_H */
--
1.8.4.3
* [RFC 4/7] IB/core: Infrastructure to manage peer core context
[not found] ` <1455207177-11949-1-git-send-email-artemyko-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
` (2 preceding siblings ...)
2016-02-11 16:12 ` [RFC 3/7] IB/core: Umem tunneling peer memory APIs Artemy Kovalyov
@ 2016-02-11 16:12 ` Artemy Kovalyov
2016-02-11 16:12 ` [RFC 5/7] IB/core: Invalidation support for peer memory Artemy Kovalyov
` (3 subsequent siblings)
7 siblings, 0 replies; 33+ messages in thread
From: Artemy Kovalyov @ 2016-02-11 16:12 UTC (permalink / raw)
To: dledford-H+wXaHxf7aLQT0dZR+AlfA
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
linux-mm-u79uwXL29TY76Z2rM5mHXA, leon-Dj0/KMMU01E,
haggaie-VPRAkNaXOzVWk0Htik3J/w, sagig-VPRAkNaXOzVWk0Htik3J/w,
Artemy Kovalyov
Adds infrastructure to manage the core context for a given umem;
it is needed for the invalidation flow.
The core context is supplied to peer clients as opaque data for the
memory pages represented by a umem.
If the peer client needs to invalidate memory it provided through the
peer memory callbacks, it should call the invalidation callback, supplying
the relevant core context. IB core will use this context to invalidate
the relevant memory.
Invalidation calls may be in flight in parallel with the release of this
memory (e.g. by dereg_mr), so we must ensure that the context is valid
before accessing it; this is why we cannot use the core context pointer
directly. For that reason we added a lookup table that maps a ticket id
to a core context. The peer client gets and supplies the ticket id, and
core checks whether it exists before accessing its corresponding
context.
The ticket id is provided to the peer memory client as part of the
get_pages API. The only "remote" party using it is the peer memory
client. It is used in the invalidation flow to specify which memory
registration should be invalidated. This flow might be called
asynchronously, in parallel with an ongoing dereg_mr operation. As such,
the invalidation flow might be called after the memory registration
has been completely released. Relying on a pointer-based or IDR-based
ticket value can result in spurious invalidation of unrelated memory
regions. Internally, we carefully lock the data structures and
synchronize as needed when extracting the context from the
ticket. This ensures a proper, synchronized release of the memory
mapping. The ticket mechanism allows us to safely ignore inflight
invalidation calls that arrived too late.
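For readers who want to see the ticket idea in isolation, here is a
minimal userspace sketch. It is not the kernel code from this patch: the
names ticket_table, ticket_insert, ticket_lookup and ticket_remove are
hypothetical, and the locking that the patch does with a mutex is left
out. It only illustrates that a monotonically increasing key, unlike a
raw pointer, cannot be mistaken for a newer registration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace model of the ticket table (illustrative names). */
struct ticket {
	uint64_t key;
	void *context;
	struct ticket *next;
};

struct ticket_table {
	struct ticket *head;
	uint64_t last_ticket;	/* only ever incremented, so keys are not reused */
};

/* Allocate a new ticket for context and report its key to the caller. */
int ticket_insert(struct ticket_table *t, void *context, uint64_t *key)
{
	struct ticket *n;

	assert(key != NULL);
	n = calloc(1, sizeof(*n));
	if (!n)
		return -1;

	n->key = t->last_ticket++;
	n->context = context;
	n->next = t->head;
	t->head = n;
	*key = n->key;
	return 0;
}

/* Return the context for key, or NULL if the ticket no longer exists. */
void *ticket_lookup(struct ticket_table *t, uint64_t key)
{
	struct ticket *n;

	for (n = t->head; n; n = n->next)
		if (n->key == key)
			return n->context;
	return NULL;
}

/* Remove the ticket for key; returns 0 if removed, 1 if not found. */
int ticket_remove(struct ticket_table *t, uint64_t key)
{
	struct ticket **pp;

	for (pp = &t->head; *pp; pp = &(*pp)->next) {
		if ((*pp)->key == key) {
			struct ticket *dead = *pp;

			*pp = dead->next;
			free(dead);
			return 0;
		}
	}
	return 1;
}
```

In this model, a caller that inserts a context, removes it (as dereg_mr
would) and then inserts another context gets a different key the second
time, so a late lookup with the first key returns NULL and can simply be
ignored.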
Signed-off-by: Artemy Kovalyov <artemyko-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
drivers/infiniband/core/peer_mem.c | 90 ++++++++++++++++++++++++++++++++++++++
include/rdma/ib_peer_mem.h | 19 ++++++++
include/rdma/ib_umem.h | 8 ++++
3 files changed, 117 insertions(+)
diff --git a/drivers/infiniband/core/peer_mem.c b/drivers/infiniband/core/peer_mem.c
index 74e4caa..57afb76 100644
--- a/drivers/infiniband/core/peer_mem.c
+++ b/drivers/infiniband/core/peer_mem.c
@@ -42,6 +42,93 @@ static int ib_invalidate_peer_memory(void *reg_handle, u64 core_context)
return -ENOSYS;
}
+static int ib_peer_insert_context(struct ib_peer_memory_client *ib_peer_client,
+ void *context,
+ u64 *context_ticket)
+{
+ struct core_ticket *core_ticket;
+
+ core_ticket = kzalloc(sizeof(*core_ticket), GFP_KERNEL);
+ if (!core_ticket)
+ return -ENOMEM;
+
+ mutex_lock(&ib_peer_client->lock);
+ core_ticket->key = ib_peer_client->last_ticket++;
+ core_ticket->context = context;
+ list_add_tail(&core_ticket->ticket_list,
+ &ib_peer_client->core_ticket_list);
+ *context_ticket = core_ticket->key;
+ mutex_unlock(&ib_peer_client->lock);
+
+ return 0;
+}
+
+/* Caller should be holding the peer client lock, specifically,
+ * the caller should hold ib_peer_client->lock
+ */
+static int ib_peer_remove_context(struct ib_peer_memory_client *ib_peer_client,
+ u64 key)
+{
+ struct core_ticket *core_ticket, *safe;
+
+ list_for_each_entry_safe(core_ticket, safe,
+ &ib_peer_client->core_ticket_list,
+ ticket_list) {
+ if (core_ticket->key == key) {
+ list_del(&core_ticket->ticket_list);
+ kfree(core_ticket);
+ return 0;
+ }
+ }
+
+ return 1;
+}
+
+/**
+ * ib_peer_create_invalidation_ctx - creates invalidation context for given umem
+ * @ib_peer_mem: peer client to be used
+ * @umem: umem struct belongs to that context
+ * @invalidation_ctx: output context
+ */
+int ib_peer_create_invalidation_ctx(struct ib_peer_memory_client *ib_peer_mem,
+ struct ib_umem *umem,
+ struct invalidation_ctx **invalidation_ctx)
+{
+ int ret;
+ struct invalidation_ctx *ctx;
+
+ ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+ if (!ctx)
+ return -ENOMEM;
+
+ ret = ib_peer_insert_context(ib_peer_mem, ctx,
+ &ctx->context_ticket);
+ if (ret) {
+ kfree(ctx);
+ return ret;
+ }
+
+ ctx->umem = umem;
+ umem->invalidation_ctx = ctx;
+ *invalidation_ctx = ctx;
+ return 0;
+}
+
+/**
+ * ib_peer_destroy_invalidation_ctx - destroy a given invalidation context
+ * @ib_peer_mem: peer client to be used
+ * @invalidation_ctx: context to be destroyed
+ */
+void ib_peer_destroy_invalidation_ctx(struct ib_peer_memory_client *ib_peer_mem,
+ struct invalidation_ctx *invalidation_ctx)
+{
+ mutex_lock(&ib_peer_mem->lock);
+ ib_peer_remove_context(ib_peer_mem, invalidation_ctx->context_ticket);
+ mutex_unlock(&ib_peer_mem->lock);
+
+ kfree(invalidation_ctx);
+}
+
static void complete_peer(struct kref *kref)
{
struct ib_peer_memory_client *ib_peer_client =
@@ -60,9 +147,12 @@ void *ib_register_peer_memory_client(struct peer_memory_client *peer_client,
if (!ib_peer_client)
return NULL;
+ INIT_LIST_HEAD(&ib_peer_client->core_ticket_list);
+ mutex_init(&ib_peer_client->lock);
init_completion(&ib_peer_client->unload_comp);
kref_init(&ib_peer_client->ref);
ib_peer_client->peer_mem = peer_client;
+ ib_peer_client->last_ticket = 1;
/* Once peer supplied a non NULL callback it's an indication that
* invalidation support is required for any memory owning.
*/
diff --git a/include/rdma/ib_peer_mem.h b/include/rdma/ib_peer_mem.h
index 7f41ce5..6f3dc84 100644
--- a/include/rdma/ib_peer_mem.h
+++ b/include/rdma/ib_peer_mem.h
@@ -36,6 +36,8 @@
#include <rdma/peer_mem.h>
struct ib_ucontext;
+struct ib_umem;
+struct invalidation_ctx;
struct ib_peer_memory_client {
const struct peer_memory_client *peer_mem;
@@ -43,6 +45,16 @@ struct ib_peer_memory_client {
int invalidation_required;
struct kref ref;
struct completion unload_comp;
+ /* lock is used via the invalidation flow */
+ struct mutex lock;
+ struct list_head core_ticket_list;
+ u64 last_ticket;
+};
+
+struct core_ticket {
+ unsigned long key;
+ void *context;
+ struct list_head ticket_list;
};
struct ib_peer_memory_client *ib_get_peer_client(struct ib_ucontext *context,
@@ -53,4 +65,11 @@ struct ib_peer_memory_client *ib_get_peer_client(struct ib_ucontext *context,
void ib_put_peer_client(struct ib_peer_memory_client *ib_peer_client,
void *peer_client_context);
+int ib_peer_create_invalidation_ctx(struct ib_peer_memory_client *ib_peer_mem,
+ struct ib_umem *umem,
+ struct invalidation_ctx **invalidation_ctx);
+
+void ib_peer_destroy_invalidation_ctx(struct ib_peer_memory_client *ib_peer_mem,
+ struct invalidation_ctx *ctx);
+
#endif
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index bb760f9..5d0fb41 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -43,6 +43,13 @@
struct ib_ucontext;
struct ib_umem_odp;
+#ifdef CONFIG_INFINIBAND_PEER_MEM
+struct invalidation_ctx {
+ struct ib_umem *umem;
+ u64 context_ticket;
+};
+#endif
+
struct ib_umem {
struct ib_ucontext *context;
size_t length;
@@ -61,6 +68,7 @@ struct ib_umem {
#ifdef CONFIG_INFINIBAND_PEER_MEM
/* peer memory that manages this umem */
struct ib_peer_memory_client *ib_peer_mem;
+ struct invalidation_ctx *invalidation_ctx;
/* peer memory private context */
void *peer_mem_client_context;
#endif
--
1.8.4.3
* [RFC 5/7] IB/core: Invalidation support for peer memory
[not found] ` <1455207177-11949-1-git-send-email-artemyko-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
` (3 preceding siblings ...)
2016-02-11 16:12 ` [RFC 4/7] IB/core: Infrastructure to manage peer core context Artemy Kovalyov
@ 2016-02-11 16:12 ` Artemy Kovalyov
2016-02-11 16:12 ` [RFC 6/7] IB/core: Peer memory client for IO memory Artemy Kovalyov
` (2 subsequent siblings)
7 siblings, 0 replies; 33+ messages in thread
From: Artemy Kovalyov @ 2016-02-11 16:12 UTC (permalink / raw)
To: dledford-H+wXaHxf7aLQT0dZR+AlfA
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
linux-mm-u79uwXL29TY76Z2rM5mHXA, leon-Dj0/KMMU01E,
haggaie-VPRAkNaXOzVWk0Htik3J/w, sagig-VPRAkNaXOzVWk0Htik3J/w,
Artemy Kovalyov
Adds the required functionality to invalidate a given peer
memory represented by some core context.
Each umem that was built over peer memory and supports invalidation has
an invalidation context assigned to it, holding the data required to
manage it. Once the peer calls the invalidation callback, the following
actions are taken:
1) The peer client lock is taken to sync with an inflight dereg_mr on
that memory.
2) Once the lock is taken, the ticket id is looked up to find the
matching core context.
3) If it is found, the umem invalidation function is called; otherwise
the call returns.
Some notes:
1) As the peer invalidate callback is defined to be blocking, it must
return only once the pages are no longer going to be accessed. For that
reason ib_invalidate_peer_memory waits for a completion event in
case another inflight call is coming as part of dereg_mr.
2) The peer memory API assumes that a lock might be taken by a peer
client to protect its memory operations. Specifically, its invalidate
callback might be called under that lock, which may lead to an AB/BA
deadlock if IB core calls the get/put pages APIs with the IB core
peer's lock taken. For that reason
ib_umem_activate_invalidation_notifier takes the lock and then checks
for an inflight invalidation state before activating the notifier.
3) Once a peer client declares as part of its registration that it may
require invalidation support, it can't be the owner of a memory range
which doesn't support it.
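The interaction between an early invalidation and the later activation
of the notifier can be sketched in plain userspace C. This is a
simplified model under stated assumptions, not the code in this patch:
model_invalidate, model_activate and count_call are hypothetical names,
the peer client lock is only indicated in comments, and the completion
handling for an inflight dereg_mr is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced version of the invalidation context. */
struct inval_ctx {
	uint64_t ticket;
	void (*func)(void *cookie);	/* NULL until the notifier is activated */
	void *cookie;
	int peer_invalidated;		/* set if the peer invalidated too early */
};

/*
 * Model of the invalidate callback. ctx stands for the result of the
 * ticket lookup: NULL means the ticket was not found.
 */
int model_invalidate(struct inval_ctx *ctx)
{
	/* the peer client lock would be held from here on */
	if (!ctx)
		return 0;	/* registration already released: ignore */

	if (!ctx->func) {
		/* context not ready yet: remember the invalidation */
		ctx->peer_invalidated = 1;
		return 0;
	}

	ctx->func(ctx->cookie);
	return 0;
}

/* Model of activating the notifier once the MR is fully set up. */
int model_activate(struct inval_ctx *ctx, void (*func)(void *), void *cookie)
{
	/* the peer client lock would be held here as well */
	if (ctx->peer_invalidated)
		return -EINVAL;	/* pages were already invalidated by the peer */

	ctx->func = func;
	ctx->cookie = cookie;
	return 0;
}

/* Example notifier: counts how many times it was called. */
void count_call(void *cookie)
{
	++*(int *)cookie;
}
```

In this model, invalidating a context before activation only sets
peer_invalidated, and the later activation fails with -EINVAL; a context
that is activated first has its notifier called exactly once per
invalidation.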
Signed-off-by: Artemy Kovalyov <artemyko-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
drivers/infiniband/core/peer_mem.c | 85 ++++++++++++++++++++++++++++++++++++--
drivers/infiniband/core/umem.c | 56 +++++++++++++++++++++----
include/rdma/ib_peer_mem.h | 1 +
include/rdma/ib_umem.h | 19 +++++++++
4 files changed, 148 insertions(+), 13 deletions(-)
diff --git a/drivers/infiniband/core/peer_mem.c b/drivers/infiniband/core/peer_mem.c
index 57afb76..f9aaef2 100644
--- a/drivers/infiniband/core/peer_mem.c
+++ b/drivers/infiniband/core/peer_mem.c
@@ -37,9 +37,56 @@
static DEFINE_MUTEX(peer_memory_mutex);
static LIST_HEAD(peer_memory_list);
+/* Caller should be holding the peer client lock, ib_peer_client->lock */
+static struct core_ticket *ib_peer_search_context(
+ struct ib_peer_memory_client *ib_peer_client,
+ u64 key)
+{
+ struct core_ticket *core_ticket;
+
+ list_for_each_entry(core_ticket, &ib_peer_client->core_ticket_list,
+ ticket_list) {
+ if (core_ticket->key == key)
+ return core_ticket;
+ }
+
+ return NULL;
+}
+
static int ib_invalidate_peer_memory(void *reg_handle, u64 core_context)
{
- return -ENOSYS;
+ struct ib_peer_memory_client *ib_peer_client = reg_handle;
+ struct invalidation_ctx *invalidation_ctx;
+ struct core_ticket *core_ticket;
+
+ mutex_lock(&ib_peer_client->lock);
+ core_ticket = ib_peer_search_context(ib_peer_client, core_context);
+ if (!core_ticket) {
+ mutex_unlock(&ib_peer_client->lock);
+ return 0;
+ }
+
+ invalidation_ctx = (struct invalidation_ctx *)core_ticket->context;
+ /* If context is not ready yet, mark it to be invalidated */
+ if (!invalidation_ctx->func) {
+ invalidation_ctx->peer_invalidated = 1;
+ mutex_unlock(&ib_peer_client->lock);
+ return 0;
+ }
+ invalidation_ctx->func(invalidation_ctx->cookie,
+ invalidation_ctx->umem, 0, 0);
+ if (invalidation_ctx->inflight_invalidation) {
+ /* init the completion to wait on
+ * before letting other thread to run
+ */
+ init_completion(&invalidation_ctx->comp);
+ mutex_unlock(&ib_peer_client->lock);
+ wait_for_completion(&invalidation_ctx->comp);
+ }
+
+ kfree(invalidation_ctx);
+
+ return 0;
}
static int ib_peer_insert_context(struct ib_peer_memory_client *ib_peer_client,
@@ -122,11 +169,33 @@ int ib_peer_create_invalidation_ctx(struct ib_peer_memory_client *ib_peer_mem,
void ib_peer_destroy_invalidation_ctx(struct ib_peer_memory_client *ib_peer_mem,
struct invalidation_ctx *invalidation_ctx)
{
- mutex_lock(&ib_peer_mem->lock);
+ int peer_callback;
+ int inflight_invalidation;
+
+ /* If we are under peer callback lock was already taken.*/
+ if (!invalidation_ctx->peer_callback)
+ mutex_lock(&ib_peer_mem->lock);
ib_peer_remove_context(ib_peer_mem, invalidation_ctx->context_ticket);
- mutex_unlock(&ib_peer_mem->lock);
+ /* Make sure to check inflight flag after took the lock and remove
+ * from tree. In addition, from that point using local variables for
+ * peer_callback and inflight_invalidation as after the complete
+ * invalidation_ctx can't be accessed any more as it may be freed
+ * by the callback.
+ */
+ peer_callback = invalidation_ctx->peer_callback;
+ inflight_invalidation = invalidation_ctx->inflight_invalidation;
+ if (inflight_invalidation)
+ complete(&invalidation_ctx->comp);
- kfree(invalidation_ctx);
+ /* On peer callback lock is handled externally */
+ if (!peer_callback)
+ mutex_unlock(&ib_peer_mem->lock);
+
+ /* In case under callback context or callback is pending
+ * let it free the invalidation context
+ */
+ if (!peer_callback && !inflight_invalidation)
+ kfree(invalidation_ctx);
}
static void complete_peer(struct kref *kref)
@@ -186,6 +255,7 @@ EXPORT_SYMBOL(ib_unregister_peer_memory_client);
struct ib_peer_memory_client *ib_get_peer_client(struct ib_ucontext *context,
unsigned long addr,
size_t size,
+ unsigned long flags,
void **peer_client_context)
{
struct ib_peer_memory_client *ib_peer_client;
@@ -193,6 +263,13 @@ struct ib_peer_memory_client *ib_get_peer_client(struct ib_ucontext *context,
mutex_lock(&peer_memory_mutex);
list_for_each_entry(ib_peer_client, &peer_memory_list, core_peer_list) {
+ /* In case peer requires invalidation it can't own memory
+ * which doesn't support it
+ */
+ if (ib_peer_client->invalidation_required &&
+ (!(flags & IB_UMEM_PEER_INVAL_SUPP)))
+ continue;
+
ret = ib_peer_client->peer_mem->acquire(addr, size,
peer_client_context);
if (ret > 0)
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 2eab34e..f478f63 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -46,12 +46,19 @@
#ifdef CONFIG_INFINIBAND_PEER_MEM
static struct ib_umem *peer_umem_get(struct ib_peer_memory_client *ib_peer_mem,
struct ib_umem *umem, unsigned long addr,
- int dmasync)
+ unsigned long flags)
{
int ret;
const struct peer_memory_client *peer_mem = ib_peer_mem->peer_mem;
+ struct invalidation_ctx *ictx = NULL;
umem->ib_peer_mem = ib_peer_mem;
+ if (flags & IB_UMEM_PEER_INVAL_SUPP) {
+ ret = ib_peer_create_invalidation_ctx(ib_peer_mem, umem, &ictx);
+ if (ret)
+ goto end;
+ }
+
/*
* We always request write permissions to the pages, to force breaking
* of any CoW during the registration of the MR. For read-only MRs we
@@ -62,7 +69,7 @@ static struct ib_umem *peer_umem_get(struct ib_peer_memory_client *ib_peer_mem,
1, !umem->writable,
&umem->sg_head,
umem->peer_mem_client_context,
- 0);
+ ictx ? ictx->context_ticket : 0);
if (ret)
goto out;
@@ -71,7 +78,7 @@ static struct ib_umem *peer_umem_get(struct ib_peer_memory_client *ib_peer_mem,
ret = peer_mem->dma_map(&umem->sg_head,
umem->peer_mem_client_context,
umem->context->device->dma_device,
- dmasync,
+ flags & IB_UMEM_DMA_SYNC,
&umem->nmap);
if (ret)
goto put_pages;
@@ -82,23 +89,54 @@ put_pages:
peer_mem->put_pages(&umem->sg_head,
umem->peer_mem_client_context);
out:
+ if (ictx)
+ ib_peer_destroy_invalidation_ctx(ib_peer_mem, ictx);
+end:
ib_put_peer_client(ib_peer_mem, umem->peer_mem_client_context);
return ERR_PTR(ret);
}
static void peer_umem_release(struct ib_umem *umem)
{
- const struct peer_memory_client *peer_mem =
- umem->ib_peer_mem->peer_mem;
+ struct ib_peer_memory_client *ib_peer_mem = umem->ib_peer_mem;
+ const struct peer_memory_client *peer_mem = ib_peer_mem->peer_mem;
+ struct invalidation_ctx *ictx = umem->invalidation_ctx;
+
+ if (ictx)
+ ib_peer_destroy_invalidation_ctx(ib_peer_mem, ictx);
peer_mem->dma_unmap(&umem->sg_head,
umem->peer_mem_client_context,
umem->context->device->dma_device);
peer_mem->put_pages(&umem->sg_head,
umem->peer_mem_client_context);
- ib_put_peer_client(umem->ib_peer_mem, umem->peer_mem_client_context);
+ ib_put_peer_client(ib_peer_mem, umem->peer_mem_client_context);
kfree(umem);
}
+
+int ib_umem_activate_invalidation_notifier(struct ib_umem *umem,
+ void (*func)(void *cookie,
+ struct ib_umem *umem,
+ unsigned long addr, size_t size),
+ void *cookie)
+{
+ struct invalidation_ctx *ictx = umem->invalidation_ctx;
+ int ret = 0;
+
+ mutex_lock(&umem->ib_peer_mem->lock);
+ if (ictx->peer_invalidated) {
+ pr_err("ib_umem_activate_invalidation_notifier: pages were invalidated by peer\n");
+ ret = -EINVAL;
+ goto end;
+ }
+ ictx->func = func;
+ ictx->cookie = cookie;
+ /* from that point any pending invalidations can be called */
+end:
+ mutex_unlock(&umem->ib_peer_mem->lock);
+ return ret;
+}
+EXPORT_SYMBOL(ib_umem_activate_invalidation_notifier);
#endif
static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
@@ -209,15 +247,15 @@ struct ib_umem *ib_umem_get_flags(struct ib_ucontext *context,
struct ib_umem *peer_umem;
peer_mem_client =
- ib_get_peer_client(context, addr, size,
+ ib_get_peer_client(context, addr, size, flags,
&umem->peer_mem_client_context);
if (IS_ERR(peer_mem_client)) {
kfree(umem);
return ERR_CAST(peer_mem_client);
} else if (peer_mem_client) {
- peer_umem = peer_umem_get(peer_mem_client, umem, addr,
- flags & IB_UMEM_DMA_SYNC);
+ peer_umem = peer_umem_get(peer_mem_client, umem,
+ addr, flags);
if (IS_ERR(peer_umem))
kfree(umem);
return peer_umem;
diff --git a/include/rdma/ib_peer_mem.h b/include/rdma/ib_peer_mem.h
index 6f3dc84..d2b2d5f 100644
--- a/include/rdma/ib_peer_mem.h
+++ b/include/rdma/ib_peer_mem.h
@@ -60,6 +60,7 @@ struct core_ticket {
struct ib_peer_memory_client *ib_get_peer_client(struct ib_ucontext *context,
unsigned long addr,
size_t size,
+ unsigned long flags,
void **peer_client_context);
void ib_put_peer_client(struct ib_peer_memory_client *ib_peer_client,
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 5d0fb41..002da1e 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -42,11 +42,20 @@
struct ib_ucontext;
struct ib_umem_odp;
+struct ib_umem;
#ifdef CONFIG_INFINIBAND_PEER_MEM
struct invalidation_ctx {
struct ib_umem *umem;
u64 context_ticket;
+ void (*func)(void *invalidation_cookie,
+ struct ib_umem *umem,
+ unsigned long addr, size_t size);
+ void *cookie;
+ int peer_callback;
+ int inflight_invalidation;
+ int peer_invalidated;
+ struct completion comp;
};
#endif
@@ -100,6 +109,7 @@ static inline size_t ib_umem_num_pages(struct ib_umem *umem)
enum ib_peer_mem_flags {
IB_UMEM_DMA_SYNC = (1 << 0),
IB_UMEM_PEER_ALLOW = (1 << 1),
+ IB_UMEM_PEER_INVAL_SUPP = (1 << 2),
};
#ifdef CONFIG_INFINIBAND_USER_MEM
@@ -112,6 +122,14 @@ int ib_umem_page_count(struct ib_umem *umem);
int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offset,
size_t length);
+#ifdef CONFIG_INFINIBAND_PEER_MEM
+int ib_umem_activate_invalidation_notifier(struct ib_umem *umem,
+ void (*func)(void *cookie,
+ struct ib_umem *umem,
+ unsigned long addr, size_t size),
+ void *cookie);
+#endif
+
#else /* CONFIG_INFINIBAND_USER_MEM */
#include <linux/err.h>
@@ -129,6 +147,7 @@ static inline int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offs
size_t length) {
return -EINVAL;
}
+
#endif /* CONFIG_INFINIBAND_USER_MEM */
static inline struct ib_umem *ib_umem_get(struct ib_ucontext *context,
--
1.8.4.3
* [RFC 6/7] IB/core: Peer memory client for IO memory
[not found] ` <1455207177-11949-1-git-send-email-artemyko-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
` (4 preceding siblings ...)
2016-02-11 16:12 ` [RFC 5/7] IB/core: Invalidation support for peer memory Artemy Kovalyov
@ 2016-02-11 16:12 ` Artemy Kovalyov
2016-02-11 16:12 ` [RFC 7/7] IB/mlx5: Invalidation support for MR over peer memory Artemy Kovalyov
2016-02-11 19:18 ` [RFC 0/7] Peer-direct memory Jason Gunthorpe
7 siblings, 0 replies; 33+ messages in thread
From: Artemy Kovalyov @ 2016-02-11 16:12 UTC (permalink / raw)
To: dledford-H+wXaHxf7aLQT0dZR+AlfA
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
linux-mm-u79uwXL29TY76Z2rM5mHXA, leon-Dj0/KMMU01E,
haggaie-VPRAkNaXOzVWk0Htik3J/w, sagig-VPRAkNaXOzVWk0Htik3J/w,
Artemy Kovalyov
Adds a kernel module that allows RDMA transfers with various types of
memory, such as mmapped devices (that create PFN mappings) and mmapped
files from DAX filesystems, such as those that reside on NVRAM devices.
The original source of the module is at [1].
[1] https://github.com/sbates130272/io_peer_mem
Signed-off-by: Artemy Kovalyov <artemyko-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
drivers/infiniband/Kconfig | 9 +
drivers/infiniband/core/Makefile | 4 +
drivers/infiniband/core/io_peer_mem.c | 332 ++++++++++++++++++++++++++++++++++
3 files changed, 345 insertions(+)
create mode 100644 drivers/infiniband/core/io_peer_mem.c
diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index 2837d66..90811b2 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -74,6 +74,15 @@ config INFINIBAND_PEER_MEM
memory in external hardware devices, such as GPU cards, SSD based
storage, dedicated ASIC accelerators, etc.
+config INFINIBAND_IO_PEER_MEM
+ tristate "InfiniBand Peer memory client for IO memory"
+ depends on INFINIBAND_PEER_MEM
+ default y
+ ---help---
+ Adds kernel module allowing RDMA transfers with various types of memory
+ like mmaped devices (that create PFN mappings) and mmaped files from DAX
+ filesystems, such as those that reside on NVRAM devices.
+
source "drivers/infiniband/hw/mthca/Kconfig"
source "drivers/infiniband/hw/qib/Kconfig"
source "drivers/infiniband/hw/cxgb3/Kconfig"
diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index 9882d00..529f1e9 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -7,6 +7,7 @@ obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o \
obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o
obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o \
$(user_access-y)
+obj-$(CONFIG_INFINIBAND_IO_PEER_MEM) += ib_io_mr.o
ib_core-y := packer.o ud_header.o verbs.o cq.o sysfs.o \
device.o fmr_pool.o cache.o netlink.o \
@@ -36,3 +37,6 @@ ib_umad-y := user_mad.o
ib_ucm-y := ucm.o
ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_marshall.o
+
+ib_io_mr-y := io_peer_mem.o
+
diff --git a/drivers/infiniband/core/io_peer_mem.c b/drivers/infiniband/core/io_peer_mem.c
new file mode 100644
index 0000000..df7161c
--- /dev/null
+++ b/drivers/infiniband/core/io_peer_mem.c
@@ -0,0 +1,332 @@
+/*
+ * Copyright (c) 2015 PMC-Sierra Inc.
+ * Copyright (c) 2016 Mellanox Technologies, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <rdma/peer_mem.h>
+
+#include <linux/mm.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mmu_notifier.h>
+#include <linux/workqueue.h>
+#include <linux/mutex.h>
+
+MODULE_AUTHOR("Logan Gunthorpe");
+MODULE_DESCRIPTION("MMAP'd IO memory plug-in");
+MODULE_LICENSE("Dual BSD/GPL");
+
+static void *reg_handle;
+static int (*mem_invalidate_callback)(void *reg_handle, u64 core_context);
+
+struct context {
+ unsigned long addr;
+ size_t size;
+ u64 core_context;
+ struct mmu_notifier mn;
+ struct pid *pid;
+ int active;
+ struct work_struct cleanup_work;
+ struct mutex mmu_mutex;
+};
+
+static void do_invalidate(struct context *ctx)
+{
+ mutex_lock(&ctx->mmu_mutex);
+
+ if (!ctx->active)
+ goto unlock_and_return;
+
+ ctx->active = 0;
+ pr_debug("invalidated addr %lx size %zx\n", ctx->addr, ctx->size);
+ mem_invalidate_callback(reg_handle, ctx->core_context);
+
+unlock_and_return:
+ mutex_unlock(&ctx->mmu_mutex);
+}
+
+static void mmu_release(struct mmu_notifier *mn,
+ struct mm_struct *mm)
+{
+ struct context *ctx = container_of(mn, struct context, mn);
+
+ do_invalidate(ctx);
+}
+
+static void mmu_invalidate_range(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ struct context *ctx = container_of(mn, struct context, mn);
+
+ if (start >= (ctx->addr + ctx->size) || ctx->addr >= end)
+ return;
+
+ pr_debug("mmu_invalidate_range %lx-%lx\n", start, end);
+ do_invalidate(ctx);
+}
+
+static void mmu_invalidate_page(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long address)
+{
+ struct context *ctx = container_of(mn, struct context, mn);
+
+ if (address < ctx->addr || address >= (ctx->addr + ctx->size))
+ return;
+
+ pr_debug("mmu_invalidate_page %lx\n", address);
+ do_invalidate(ctx);
+}
+
+static struct mmu_notifier_ops mmu_notifier_ops = {
+ .release = mmu_release,
+ .invalidate_range = mmu_invalidate_range,
+ .invalidate_page = mmu_invalidate_page,
+};
+
+static void fault_missing_pages(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end)
+{
+ unsigned long pfn;
+
+ if (!(vma->vm_flags & VM_MIXEDMAP))
+ return;
+
+ for (; start < end; start += PAGE_SIZE) {
+ if (!follow_pfn(vma, start, &pfn))
+ continue;
+
+ handle_mm_fault(current->mm, vma, start, FAULT_FLAG_WRITE);
+ }
+}
+
+static int acquire(unsigned long addr, size_t size, void **context)
+{
+ struct vm_area_struct *vma;
+ struct context *ctx;
+ unsigned long pfn, end;
+
+ ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+ if (!ctx)
+ return 0;
+
+ ctx->addr = addr;
+ ctx->size = size;
+ ctx->active = 0;
+
+ end = addr + size;
+ vma = find_vma(current->mm, addr);
+
+ if (!vma || vma->vm_end < end)
+ goto err;
+
+ pr_debug("vma: %lx %lx %lx %zx\n", addr, vma->vm_end - vma->vm_start,
+ vma->vm_flags, size);
+
+ if (!(vma->vm_flags & VM_WRITE))
+ goto err;
+
+ fault_missing_pages(vma, addr & PAGE_MASK, end);
+
+ if (follow_pfn(vma, addr, &pfn))
+ goto err;
+
+ pr_debug("pfn: %lx\n", pfn << PAGE_SHIFT);
+
+ mutex_init(&ctx->mmu_mutex);
+
+ ctx->mn.ops = &mmu_notifier_ops;
+
+ if (mmu_notifier_register(&ctx->mn, current->mm)) {
+ pr_err("Failed to register mmu_notifier\n");
+ goto err;
+ }
+
+ ctx->pid = get_task_pid(current->group_leader, PIDTYPE_PID);
+
+ if (!ctx->pid)
+ goto err;
+
+ pr_debug("acquire %p\n", ctx);
+ *context = ctx;
+ return 1;
+
+err:
+ kfree(ctx);
+ return 0;
+}
+
+static void deferred_cleanup(struct work_struct *work)
+{
+ struct context *ctx = container_of(work, struct context, cleanup_work);
+ struct task_struct *owning_process;
+ struct mm_struct *owning_mm;
+
+ pr_debug("cleanup %p\n", ctx);
+
+ owning_process = get_pid_task(ctx->pid, PIDTYPE_PID);
+ if (owning_process) {
+ owning_mm = get_task_mm(owning_process);
+ if (owning_mm) {
+ mmu_notifier_unregister(&ctx->mn, owning_mm);
+ mmput(owning_mm);
+ }
+ put_task_struct(owning_process);
+ }
+ put_pid(ctx->pid);
+ kfree(ctx);
+}
+
+static void release(void *context)
+{
+ struct context *ctx = context;
+
+ pr_debug("release %p\n", ctx);
+
+ INIT_WORK(&ctx->cleanup_work, deferred_cleanup);
+ schedule_work(&ctx->cleanup_work);
+}
+
+static int get_pages(unsigned long addr, size_t size, int write, int force,
+ struct sg_table *sg_head, void *context,
+ u64 core_context)
+{
+ struct context *ctx = context;
+ int ret;
+
+ ctx->core_context = core_context;
+ ctx->active = 1;
+
+ ret = sg_alloc_table(sg_head, (ctx->size + PAGE_SIZE - 1) / PAGE_SIZE,
+ GFP_KERNEL);
+ return ret;
+}
+
+static void put_pages(struct sg_table *sg_head, void *context)
+{
+ struct context *ctx = context;
+
+ ctx->active = 0;
+ sg_free_table(sg_head);
+}
+
+static int dma_map(struct sg_table *sg_head, void *context,
+ struct device *dma_device, int dmasync,
+ int *nmap)
+{
+ struct scatterlist *sg;
+ struct context *ctx = context;
+ unsigned long pfn;
+ unsigned long addr = ctx->addr;
+ unsigned long size = ctx->size;
+ struct task_struct *owning_process;
+ struct mm_struct *owning_mm;
+ struct vm_area_struct *vma = NULL;
+ int i, ret = 1;
+
+ *nmap = (ctx->size + PAGE_SIZE - 1) / PAGE_SIZE;
+
+ owning_process = get_pid_task(ctx->pid, PIDTYPE_PID);
+ if (!owning_process)
+ return ret;
+
+ owning_mm = get_task_mm(owning_process);
+ if (!owning_mm)
+ goto out;
+
+ vma = find_vma(owning_mm, ctx->addr);
+ if (!vma)
+ goto out;
+
+ for_each_sg(sg_head->sgl, sg, *nmap, i) {
+ sg_set_page(sg, NULL, PAGE_SIZE, 0);
+ ret = follow_pfn(vma, addr, &pfn);
+ if (ret)
+ goto out;
+
+ sg->dma_address = (pfn << PAGE_SHIFT);
+ sg->dma_length = PAGE_SIZE;
+ sg->offset = addr & ~PAGE_MASK;
+
+ pr_debug("sg[%d] %lx %x %d\n", i,
+ (unsigned long)sg->dma_address,
+ sg->dma_length, sg->offset);
+
+ addr += sg->dma_length - sg->offset;
+ size -= sg->dma_length - sg->offset;
+ }
+out:
+ put_task_struct(owning_process);
+
+ return 0;
+}
+
+static int dma_unmap(struct sg_table *sg_head, void *context,
+ struct device *dma_device)
+{
+ return 0;
+}
+
+static unsigned long get_page_size(void *context)
+{
+ return PAGE_SIZE;
+}
+
+static struct peer_memory_client io_mem_client = {
+ .acquire = acquire,
+ .get_pages = get_pages,
+ .dma_map = dma_map,
+ .dma_unmap = dma_unmap,
+ .put_pages = put_pages,
+ .get_page_size = get_page_size,
+ .release = release,
+};
+
+static int __init io_mem_init(void)
+{
+ reg_handle = ib_register_peer_memory_client(&io_mem_client,
+ &mem_invalidate_callback);
+
+ if (!reg_handle)
+ return -EINVAL;
+
+ return 0;
+}
+
+static void __exit io_mem_cleanup(void)
+{
+ ib_unregister_peer_memory_client(reg_handle);
+}
+
+module_init(io_mem_init);
+module_exit(io_mem_cleanup);
--
1.8.4.3
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* [RFC 7/7] IB/mlx5: Invalidation support for MR over peer memory
[not found] ` <1455207177-11949-1-git-send-email-artemyko-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
` (5 preceding siblings ...)
2016-02-11 16:12 ` [RFC 6/7] IB/core: Peer memory client for IO memory Artemy Kovalyov
@ 2016-02-11 16:12 ` Artemy Kovalyov
2016-02-11 19:18 ` [RFC 0/7] Peer-direct memory Jason Gunthorpe
7 siblings, 0 replies; 33+ messages in thread
From: Artemy Kovalyov @ 2016-02-11 16:12 UTC (permalink / raw)
To: dledford-H+wXaHxf7aLQT0dZR+AlfA
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
linux-mm-u79uwXL29TY76Z2rM5mHXA, leon-Dj0/KMMU01E,
haggaie-VPRAkNaXOzVWk0Htik3J/w, sagig-VPRAkNaXOzVWk0Htik3J/w,
Artemy Kovalyov
Adds the functionality required to work with peer memory
clients that need invalidation support.
It includes:
- The module now uses ib_umem_get_flags.
- A umem invalidation callback: once called, it frees any HW
resources assigned to that umem, then frees the peer resources
corresponding to that umem.
- The MR object related to that umem stays alive until dereg_mr is
called.
- Synchronization between dereg_mr and the invalidation callback.
- Advertising the P2P device capability.
Signed-off-by: Artemy Kovalyov <artemyko-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
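Note: the synchronization between dereg_mr and the invalidation callback
described in the commit message can be illustrated outside the kernel. The
following is only a sketch under stated assumptions, not code from this patch:
fake_mr, claim_teardown, fake_invalidate, fake_dereg and teardowns_for_order
are hypothetical names, C11 atomic_fetch_add stands in for the kernel's
atomic_inc_return, and a plain flag stands in for mr->invalidation_comp. The
two paths are called one after the other here, whereas in the driver they may
run concurrently and the losing dereg path blocks in wait_for_completion.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the fields added to struct mlx5_ib_mr. */
struct fake_mr {
	atomic_int invalidated;   /* mirrors mr->invalidated */
	bool invalidation_done;   /* stands in for mr->invalidation_comp */
	int teardowns;            /* how many times resources were released */
};

static void fake_mr_init(struct fake_mr *mr)
{
	atomic_init(&mr->invalidated, 0);
	mr->invalidation_done = false;
	mr->teardowns = 0;
}

/*
 * Only the first caller sees the counter at zero. This is equivalent to
 * the patch's test "atomic_inc_return(&mr->invalidated) > 1" being false.
 */
static bool claim_teardown(struct fake_mr *mr)
{
	return atomic_fetch_add(&mr->invalidated, 1) == 0;
}

/* Invalidation path: releases resources only if it got there first. */
static void fake_invalidate(struct fake_mr *mr)
{
	if (!claim_teardown(mr))
		return;                   /* dereg already owns the teardown */
	mr->teardowns++;
	mr->invalidation_done = true; /* kernel: complete(&mr->invalidation_comp) */
}

/*
 * Dereg path: releases resources itself, or relies on the invalidation
 * having finished. Returns false if it would have had to block.
 */
static bool fake_dereg(struct fake_mr *mr)
{
	if (claim_teardown(mr)) {
		mr->teardowns++;
		return true;
	}
	/* kernel: wait_for_completion(&mr->invalidation_comp) */
	return mr->invalidation_done;
}

/* Runs both paths in the given order and reports the teardown count. */
static int teardowns_for_order(bool invalidate_first)
{
	struct fake_mr mr;

	fake_mr_init(&mr);
	if (invalidate_first) {
		fake_invalidate(&mr);
		fake_dereg(&mr);
	} else {
		fake_dereg(&mr);
		fake_invalidate(&mr);
	}
	return mr.teardowns;
}
```

Whichever order is chosen, teardowns_for_order() reports exactly one teardown,
which is the property the invalidated counter is meant to guarantee.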
drivers/infiniband/hw/mlx5/cq.c | 15 ++-
drivers/infiniband/hw/mlx5/doorbell.c | 6 +-
drivers/infiniband/hw/mlx5/main.c | 3 +
drivers/infiniband/hw/mlx5/mlx5_ib.h | 12 +++
drivers/infiniband/hw/mlx5/mr.c | 166 ++++++++++++++++++++++++++--------
drivers/infiniband/hw/mlx5/qp.c | 3 +-
drivers/infiniband/hw/mlx5/srq.c | 4 +-
7 files changed, 163 insertions(+), 46 deletions(-)
diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
index 7ddc790..c41913d 100644
--- a/drivers/infiniband/hw/mlx5/cq.c
+++ b/drivers/infiniband/hw/mlx5/cq.c
@@ -651,9 +651,11 @@ static int create_cq_user(struct mlx5_ib_dev *dev, struct ib_udata *udata,
*cqe_size = ucmd.cqe_size;
- cq->buf.umem = ib_umem_get(context, ucmd.buf_addr,
- entries * ucmd.cqe_size,
- IB_ACCESS_LOCAL_WRITE, 1);
+ cq->buf.umem = ib_umem_get_flags(context, ucmd.buf_addr,
+ entries * ucmd.cqe_size,
+ IB_ACCESS_LOCAL_WRITE,
+ IB_UMEM_DMA_SYNC |
+ IB_UMEM_PEER_ALLOW);
if (IS_ERR(cq->buf.umem)) {
err = PTR_ERR(cq->buf.umem);
return err;
@@ -991,8 +993,11 @@ static int resize_user(struct mlx5_ib_dev *dev, struct mlx5_ib_cq *cq,
if (ucmd.reserved0 || ucmd.reserved1)
return -EINVAL;
- umem = ib_umem_get(context, ucmd.buf_addr, entries * ucmd.cqe_size,
- IB_ACCESS_LOCAL_WRITE, 1);
+ umem = ib_umem_get_flags(context, ucmd.buf_addr,
+ entries * ucmd.cqe_size,
+ IB_ACCESS_LOCAL_WRITE,
+ IB_UMEM_DMA_SYNC |
+ IB_UMEM_PEER_ALLOW);
if (IS_ERR(umem)) {
err = PTR_ERR(umem);
return err;
diff --git a/drivers/infiniband/hw/mlx5/doorbell.c b/drivers/infiniband/hw/mlx5/doorbell.c
index a0e4e6d..1d76094 100644
--- a/drivers/infiniband/hw/mlx5/doorbell.c
+++ b/drivers/infiniband/hw/mlx5/doorbell.c
@@ -63,8 +63,10 @@ int mlx5_ib_db_map_user(struct mlx5_ib_ucontext *context, unsigned long virt,
page->user_virt = (virt & PAGE_MASK);
page->refcnt = 0;
- page->umem = ib_umem_get(&context->ibucontext, virt & PAGE_MASK,
- PAGE_SIZE, 0, 0);
+ page->umem = ib_umem_get_flags(&context->ibucontext,
+ virt & PAGE_MASK,
+ PAGE_SIZE, 0,
+ IB_UMEM_PEER_ALLOW);
if (IS_ERR(page->umem)) {
err = PTR_ERR(page->umem);
kfree(page);
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index a55bf05..f198208 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -475,6 +475,9 @@ static int mlx5_ib_query_device(struct ib_device *ibdev,
IB_DEVICE_PORT_ACTIVE_EVENT |
IB_DEVICE_SYS_IMAGE_GUID |
IB_DEVICE_RC_RNR_NAK_GEN;
+#ifdef CONFIG_INFINIBAND_PEER_MEM
+ props->device_cap_flags |= IB_DEVICE_PEER_MEMORY;
+#endif
if (MLX5_CAP_GEN(mdev, pkv))
props->device_cap_flags |= IB_DEVICE_BAD_PKEY_CNTR;
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index d475f83..12571e0 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -376,6 +376,13 @@ enum mlx5_ib_mtt_access_flags {
#define MLX5_IB_MTT_PRESENT (MLX5_IB_MTT_READ | MLX5_IB_MTT_WRITE)
+#ifdef CONFIG_INFINIBAND_PEER_MEM
+struct mlx5_ib_peer_id {
+ struct completion comp;
+ struct mlx5_ib_mr *mr;
+};
+#endif
+
struct mlx5_ib_mr {
struct ib_mr ibmr;
void *descs;
@@ -395,6 +402,11 @@ struct mlx5_ib_mr {
struct mlx5_core_sig_ctx *sig;
int live;
void *descs_alloc;
+#ifdef CONFIG_INFINIBAND_PEER_MEM
+ struct mlx5_ib_peer_id *peer_id;
+ atomic_t invalidated;
+ struct completion invalidation_comp;
+#endif
};
struct mlx5_ib_umr_context {
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 6000f7a..d34199c 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -1037,6 +1037,73 @@ err_1:
return ERR_PTR(err);
}
+static int mlx5_ib_invalidate_mr(struct ib_mr *ibmr)
+{
+ struct mlx5_ib_dev *dev = to_mdev(ibmr->device);
+ struct mlx5_ib_mr *mr = to_mmr(ibmr);
+ int npages = mr->npages;
+ struct ib_umem *umem = mr->umem;
+
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+ if (umem && umem->odp_data) {
+ /* Prevent new page faults from succeeding */
+ mr->live = 0;
+ /* Wait for all running page-fault handlers to finish. */
+ synchronize_srcu(&dev->mr_srcu);
+ /* Destroy all page mappings */
+ mlx5_ib_invalidate_range(umem, ib_umem_start(umem),
+ ib_umem_end(umem));
+ /*
+ * We kill the umem before the MR for ODP,
+ * so that there will not be any invalidations in
+ * flight, looking at the *mr struct.
+ */
+ ib_umem_release(umem);
+ atomic_sub(npages, &dev->mdev->priv.reg_pages);
+
+ /* Avoid double-freeing the umem. */
+ umem = NULL;
+ }
+#endif
+
+ clean_mr(mr);
+
+ if (umem) {
+ ib_umem_release(umem);
+ atomic_sub(npages, &dev->mdev->priv.reg_pages);
+ }
+ return 0;
+}
+
+#ifdef CONFIG_INFINIBAND_PEER_MEM
+static void mlx5_invalidate_umem(void *invalidation_cookie,
+ struct ib_umem *umem,
+ unsigned long addr, size_t size)
+{
+ struct mlx5_ib_mr *mr;
+ struct mlx5_ib_peer_id *peer_id = invalidation_cookie;
+
+ wait_for_completion(&peer_id->comp);
+ if (peer_id->mr == NULL)
+ return;
+
+ mr = peer_id->mr;
+ /* This function is called under client peer lock
+ * so its resources are race protected
+ */
+ if (atomic_inc_return(&mr->invalidated) > 1) {
+ umem->invalidation_ctx->inflight_invalidation = 1;
+ return;
+ }
+
+ umem->invalidation_ctx->peer_callback = 1;
+ mlx5_ib_invalidate_mr(&mr->ibmr);
+ complete(&mr->invalidation_comp);
+}
+#endif
+
+
+
struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
u64 virt_addr, int access_flags,
struct ib_udata *udata)
@@ -1049,16 +1116,38 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
int ncont;
int order;
int err;
-
- mlx5_ib_dbg(dev, "start 0x%llx, virt_addr 0x%llx, length 0x%llx, access_flags 0x%x\n",
+#ifdef CONFIG_INFINIBAND_PEER_MEM
+ struct ib_peer_memory_client *ib_peer_mem;
+ struct mlx5_ib_peer_id *mlx5_ib_peer_id = NULL;
+#endif
+ mlx5_ib_dbg(dev, "%llx virt %llx len %llx access_flags %x\n",
start, virt_addr, length, access_flags);
- umem = ib_umem_get(pd->uobject->context, start, length, access_flags,
- 0);
+ umem = ib_umem_get_flags(pd->uobject->context, start, length,
+ access_flags, IB_UMEM_PEER_ALLOW |
+ IB_UMEM_PEER_INVAL_SUPP);
if (IS_ERR(umem)) {
mlx5_ib_dbg(dev, "umem get failed (%ld)\n", PTR_ERR(umem));
return (void *)umem;
}
+#ifdef CONFIG_INFINIBAND_PEER_MEM
+ ib_peer_mem = umem->ib_peer_mem;
+ if (ib_peer_mem) {
+ mlx5_ib_peer_id = kzalloc(sizeof(*mlx5_ib_peer_id), GFP_KERNEL);
+ if (!mlx5_ib_peer_id) {
+ err = -ENOMEM;
+ goto error;
+ }
+ init_completion(&mlx5_ib_peer_id->comp);
+ err = ib_umem_activate_invalidation_notifier
+ (umem,
+ mlx5_invalidate_umem,
+ mlx5_ib_peer_id);
+ if (err)
+ goto error;
+ }
+#endif
+
mlx5_ib_cont_pages(umem, start, &npages, &page_shift, &ncont, &order);
if (!npages) {
mlx5_ib_warn(dev, "avoid zero region\n");
@@ -1098,6 +1187,15 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
atomic_add(npages, &dev->mdev->priv.reg_pages);
mr->ibmr.lkey = mr->mmr.key;
mr->ibmr.rkey = mr->mmr.key;
+#ifdef CONFIG_INFINIBAND_PEER_MEM
+ atomic_set(&mr->invalidated, 0);
+ if (ib_peer_mem) {
+ init_completion(&mr->invalidation_comp);
+ mlx5_ib_peer_id->mr = mr;
+ mr->peer_id = mlx5_ib_peer_id;
+ complete(&mlx5_ib_peer_id->comp);
+ }
+#endif
#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
if (umem->odp_data) {
@@ -1127,6 +1225,12 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
return &mr->ibmr;
error:
+#ifdef CONFIG_INFINIBAND_PEER_MEM
+ if (mlx5_ib_peer_id) {
+ complete(&mlx5_ib_peer_id->comp);
+ kfree(mlx5_ib_peer_id);
+ }
+#endif
ib_umem_release(umem);
return ERR_PTR(err);
}
@@ -1245,54 +1349,44 @@ static int clean_mr(struct mlx5_ib_mr *mr)
mlx5_ib_warn(dev, "failed unregister\n");
return err;
}
- free_cached_mr(dev, mr);
}
- if (!umred)
- kfree(mr);
-
return 0;
}
int mlx5_ib_dereg_mr(struct ib_mr *ibmr)
{
+#ifdef CONFIG_INFINIBAND_PEER_MEM
struct mlx5_ib_dev *dev = to_mdev(ibmr->device);
struct mlx5_ib_mr *mr = to_mmr(ibmr);
- int npages = mr->npages;
- struct ib_umem *umem = mr->umem;
+ int ret = 0;
+ int umred = mr->umred;
-#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
- if (umem && umem->odp_data) {
- /* Prevent new page faults from succeeding */
- mr->live = 0;
- /* Wait for all running page-fault handlers to finish. */
- synchronize_srcu(&dev->mr_srcu);
- /* Destroy all page mappings */
- mlx5_ib_invalidate_range(umem, ib_umem_start(umem),
- ib_umem_end(umem));
- /*
- * We kill the umem before the MR for ODP,
- * so that there will not be any invalidations in
- * flight, looking at the *mr struct.
+ if (atomic_inc_return(&mr->invalidated) > 1) {
+ /* In case there is inflight invalidation
+ * call pending for its termination
*/
- ib_umem_release(umem);
- atomic_sub(npages, &dev->mdev->priv.reg_pages);
-
- /* Avoid double-freeing the umem. */
- umem = NULL;
+ wait_for_completion(&mr->invalidation_comp);
+ } else {
+ ret = mlx5_ib_invalidate_mr(ibmr);
+ if (ret)
+ return ret;
}
-#endif
-
- clean_mr(mr);
-
- if (umem) {
- ib_umem_release(umem);
- atomic_sub(npages, &dev->mdev->priv.reg_pages);
+ kfree(mr->peer_id);
+ mr->peer_id = NULL;
+ if (umred) {
+ atomic_set(&mr->invalidated, 0);
+ free_cached_mr(dev, mr);
+ } else {
+ kfree(mr);
}
-
return 0;
+#else
+ return mlx5_ib_invalidate_mr(ibmr);
+#endif
}
+
struct ib_mr *mlx5_ib_alloc_mr(struct ib_pd *pd,
enum ib_mr_type mr_type,
u32 max_num_sg)
diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index 8fb9c27..5e07909 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -611,7 +611,8 @@ static int mlx5_ib_umem_get(struct mlx5_ib_dev *dev,
{
int err;
- *umem = ib_umem_get(pd->uobject->context, addr, size, 0, 0);
+ *umem = ib_umem_get_flags(pd->uobject->context, addr, size,
+ 0, IB_UMEM_PEER_ALLOW);
if (IS_ERR(*umem)) {
mlx5_ib_dbg(dev, "umem_get failed\n");
return PTR_ERR(*umem);
diff --git a/drivers/infiniband/hw/mlx5/srq.c b/drivers/infiniband/hw/mlx5/srq.c
index 4659256..23b8ab4 100644
--- a/drivers/infiniband/hw/mlx5/srq.c
+++ b/drivers/infiniband/hw/mlx5/srq.c
@@ -115,8 +115,8 @@ static int create_srq_user(struct ib_pd *pd, struct mlx5_ib_srq *srq,
srq->wq_sig = !!(ucmd.flags & MLX5_SRQ_FLAG_SIGNATURE);
- srq->umem = ib_umem_get(pd->uobject->context, ucmd.buf_addr, buf_size,
- 0, 0);
+ srq->umem = ib_umem_get_flags(pd->uobject->context, ucmd.buf_addr,
+ buf_size, 0, IB_UMEM_PEER_ALLOW);
if (IS_ERR(srq->umem)) {
mlx5_ib_dbg(dev, "failed umem get, size %d\n", buf_size);
err = PTR_ERR(srq->umem);
--
1.8.4.3
* Re: [RFC 0/7] Peer-direct memory
[not found] ` <1455207177-11949-1-git-send-email-artemyko-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
` (6 preceding siblings ...)
2016-02-11 16:12 ` [RFC 7/7] IB/mlx5: Invalidation support for MR over peer memory Artemy Kovalyov
@ 2016-02-11 19:18 ` Jason Gunthorpe
[not found] ` <20160211191838.GA23675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2016-02-14 14:27 ` Haggai Eran
7 siblings, 2 replies; 33+ messages in thread
From: Jason Gunthorpe @ 2016-02-11 19:18 UTC (permalink / raw)
To: Artemy Kovalyov
Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
linux-mm-u79uwXL29TY76Z2rM5mHXA, leon-Dj0/KMMU01E,
haggaie-VPRAkNaXOzVWk0Htik3J/w, sagig-VPRAkNaXOzVWk0Htik3J/w
On Thu, Feb 11, 2016 at 06:12:50PM +0200, Artemy Kovalyov wrote:
> Recently introduced ZONE_DEVICE patch [1] allows devices to register as
> providers of "device memory" regions, making RDMA operations with them
> transparently available. This patch is intended for scenarios which do not
> fit into the ZONE_DEVICE infrastructure, but where the device still wants to
> expose its IO regions to RDMA access.
Most of this stuff (eg the whole peer_memory_client thing) has no
business being part of the RDMA stack. We can't ask other drivers to
call IB functions just because their devices have ZONE_DEVICE memory.
Resubmit those parts under the mm subsystem, or another more
appropriate place.
This is the same comment as last time this was submitted.
If you want to make some incremental progress then implement the
existing ZONE_DEVICE API for the IB core and add the invalidate stuff
later, once you've negotiated a common API for that with linux-mm.
Jason
[parent not found: <20160211191838.GA23675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>]
* Re: [RFC 0/7] Peer-direct memory
[not found] ` <20160211191838.GA23675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2016-02-12 20:13 ` Christoph Hellwig
[not found] ` <20160212201328.GA14122-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2016-02-14 14:05 ` Haggai Eran
1 sibling, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2016-02-12 20:13 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Artemy Kovalyov, dledford-H+wXaHxf7aLQT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
linux-mm-u79uwXL29TY76Z2rM5mHXA, leon-Dj0/KMMU01E,
haggaie-VPRAkNaXOzVWk0Htik3J/w, sagig-VPRAkNaXOzVWk0Htik3J/w
On Thu, Feb 11, 2016 at 12:18:38PM -0700, Jason Gunthorpe wrote:
> Most of this stuff (eg the whole peer_memory_client thing) has no
> business being part of the RDMA stack. We can't ask other drivers to
> call IB functions just because their devices have ZONE_DEVICE memory.
>
> Resubmit those parts under the mm subsystem, or another more
> appropriate place.
>
> This is the same comment as last time this was submitted.
>
> If you want to make some incremental progress then implement the
> existing ZONE_DEVICE API for the IB core and add the invalidate stuff
> later, once you've negotiated a common API for that with linux-mm.
Agreed on all points. Never mind that the series seems to be missing a user
of all the newly added infrastructure, which should come with it.
* Re: [RFC 0/7] Peer-direct memory
[not found] ` <20160211191838.GA23675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2016-02-12 20:13 ` Christoph Hellwig
@ 2016-02-14 14:05 ` Haggai Eran
1 sibling, 0 replies; 33+ messages in thread
From: Haggai Eran @ 2016-02-14 14:05 UTC (permalink / raw)
To: Jason Gunthorpe, Kovalyov Artemy
Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-mm-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
leon-Dj0/KMMU01E@public.gmane.org, Sagi Grimberg
On 11/02/2016 21:18, Jason Gunthorpe wrote:
> Resubmit those parts under the mm subsystem, or another more
> appropriate place.
We want the feedback from linux-mm, and they are Cced.
> If you want to make some incremental progress then implement the
> existing ZONE_DEVICE API for the IB core and add the invalidate stuff
> later, once you've negotiated a common API for that with linux-mm.
So there are a couple of issues we currently have with ZONE_DEVICE.
Perhaps they can be solved and then we could use it directly.
First, I'm not sure it is intended to be used for our purpose.
memremap() has this comment [1]:
> memremap() is "ioremap" for cases where it is known that the resource
> being mapped does not have i/o side effects and the __iomem
> annotation is not applicable.
Does this apply also to devm_memremap_pages()? Because the HCA BAR
clearly doesn't fall under this definition.
Second, there's a requirement that ZONE_DEVICE ranges are aligned to
section-boundary, right? We have devices that have 8MB or 32MB BARs,
so they won't work with 128MB sections on x86_64.
Third, I understand there was a desire to place ZONE_DEVICE page structs
in the device itself. This can work for pmem, but obviously won't work
for an I/O device BAR like an HCA.
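The arithmetic behind the second point can be sketched as follows. This is
only an illustration, assuming the requirement is that both the start and the
size of a range are multiples of the section size; fits_whole_sections and
the BAR address used in the usage note are hypothetical, and the kernel's
actual check is not reproduced here.

```c
#include <stdbool.h>
#include <stdint.h>

/* 128MB sections on x86_64, as stated in the message (2^27 bytes). */
#define SECTION_SIZE_BITS_X86_64 27

/*
 * Returns true if [start, start + size) covers whole sections only.
 * Hypothetical helper for illustration; not a kernel function.
 */
static bool fits_whole_sections(uint64_t start, uint64_t size)
{
	const uint64_t section = UINT64_C(1) << SECTION_SIZE_BITS_X86_64;

	return size != 0 && (start % section) == 0 && (size % section) == 0;
}
```

For a BAR placed at the made-up address 0xf0000000, which is itself
section-aligned, fits_whole_sections() returns false for an 8MB or 32MB BAR
and true only once the BAR size reaches a multiple of 128MB.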
Regards,
Haggai
[1] http://lxr.free-electrons.com/source/kernel/memremap.c?v=4.4#L38
* Re: [RFC 0/7] Peer-direct memory
2016-02-11 19:18 ` [RFC 0/7] Peer-direct memory Jason Gunthorpe
[not found] ` <20160211191838.GA23675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2016-02-14 14:27 ` Haggai Eran
2016-02-16 18:22 ` Jason Gunthorpe
1 sibling, 1 reply; 33+ messages in thread
From: Haggai Eran @ 2016-02-14 14:27 UTC (permalink / raw)
To: Jason Gunthorpe, Kovalyov Artemy
Cc: dledford@redhat.com, linux-rdma@vger.kernel.org,
linux-mm@kvack.org, leon@leon.ro, Sagi Grimberg
[apologies: sending again because linux-mm address was wrong]
On 11/02/2016 21:18, Jason Gunthorpe wrote:
> Resubmit those parts under the mm subsystem, or another more
> appropriate place.
We want the feedback from linux-mm, and they are now Cced.
> If you want to make some incremental progress then implement the
> existing ZONE_DEVICE API for the IB core and add the invalidate stuff
> later, once you've negotiated a common API for that with linux-mm.
So there are a couple of issues we currently have with ZONE_DEVICE.
Perhaps they can be solved and then we could use it directly.
First, I'm not sure it is intended to be used for our purpose.
memremap() has this comment [1]:
> memremap() is "ioremap" for cases where it is known that the resource
> being mapped does not have i/o side effects and the __iomem
> annotation is not applicable.
Does this apply also to devm_memremap_pages()? Because the HCA BAR
clearly doesn't fall under this definition.
Second, there's a requirement that ZONE_DEVICE ranges are aligned to
section-boundary, right? We have devices that have 8MB or 32MB BARs,
so they won't work with 128MB sections on x86_64.
Third, I understand there was a desire to place ZONE_DEVICE page structs
in the device itself. This can work for pmem, but obviously won't work
for an I/O device BAR like an HCA.
Regards,
Haggai
[1] http://lxr.free-electrons.com/source/kernel/memremap.c?v=4.4#L38
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
* Re: [RFC 0/7] Peer-direct memory
2016-02-14 14:27 ` Haggai Eran
@ 2016-02-16 18:22 ` Jason Gunthorpe
2016-02-17 4:03 ` davide rossetti
0 siblings, 1 reply; 33+ messages in thread
From: Jason Gunthorpe @ 2016-02-16 18:22 UTC (permalink / raw)
To: Haggai Eran
Cc: Kovalyov Artemy, dledford@redhat.com, linux-rdma@vger.kernel.org,
linux-mm@kvack.org, leon@leon.ro, Sagi Grimberg
On Sun, Feb 14, 2016 at 04:27:20PM +0200, Haggai Eran wrote:
> [apologies: sending again because linux-mm address was wrong]
>
> On 11/02/2016 21:18, Jason Gunthorpe wrote:
> > Resubmit those parts under the mm subsystem, or another more
> > appropriate place.
>
> We want the feedback from linux-mm, and they are now Cced.
Resubmit to mm means put this stuff someplace outside
drivers/infiniband in the tree and don't try and inappropriately send
memory management stuff through Doug's tree.
Jason
* Re: [RFC 0/7] Peer-direct memory
2016-02-16 18:22 ` Jason Gunthorpe
@ 2016-02-17 4:03 ` davide rossetti
[not found] ` <CAPSaadxbFCOcKV=c3yX7eGw9Wqzn3jvPRZe2LMWYmiQcijT4nw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 33+ messages in thread
From: davide rossetti @ 2016-02-17 4:03 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Haggai Eran, Kovalyov Artemy, dledford@redhat.com,
linux-rdma@vger.kernel.org, linux-mm@kvack.org, leon@leon.ro,
Sagi Grimberg
On Tue, Feb 16, 2016 at 10:22 AM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:
> On Sun, Feb 14, 2016 at 04:27:20PM +0200, Haggai Eran wrote:
> > [apologies: sending again because linux-mm address was wrong]
> >
> > On 11/02/2016 21:18, Jason Gunthorpe wrote:
> > > Resubmit those parts under the mm subsystem, or another more
> > > appropriate place.
> >
> > We want the feedback from linux-mm, and they are now Cced.
>
> Resubmit to mm means put this stuff someplace outside
> drivers/infiniband in the tree and don't try and inappropriately send
> memory management stuff through Doug's tree.
>
>
Jason,
I beg to differ.
1) I see mm as appropriate for real memory, i.e. something that user-space
apps can pass around.
This is not entirely true for BAR memory, for instance as long as
CPU-initiated atomic ops are not supported on the BAR space of PCIe devices.
OTOH, CPU reads from a BAR are awful (BW being abysmal, ~10MB/s), while
high-BW writes require the use of vector instructions (at least on x86_64).
2) Instead, I see it as appropriate that two sophisticated devices, like an IB
NIC and a storage/accelerator device, can freely target each other for I/O,
i.e. exchange peer-to-peer PCIe transactions. And as long as the existing
sophisticated initiators are confined to the RDMA subsystem, that is where
this support belongs.
On a different note, this reminds me that the current patch set may be
missing a way to disable the use of platform PCIe atomics when the target
is the BAR of a peer device.
--
sincerely,
d.
email: davide DOT rossetti AT gmail DOT com
work: drossetti AT nvidia DOT com
facebook: http://www.facebook.com/dado.rossetti
twitter: @dado_rossetti
skype: d.rossetti